# **Titanic:** Predicting a Continuous Target with Regression

### **Author:** Evan Dobler
### **Date:** 4/15/2025
### **Purpose:** Predict fare, the amount of money paid for the journey, using features in the Titanic dataset

## Imports

In [40]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, ElasticNet
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

## Section 1. Import and Inspect the Data

In [41]:
# Load Titanic dataset from seaborn and verify
titanic = sns.load_dataset("titanic")
titanic.head()

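| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |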


## Section 2. Data Exploration and Preparation

In [42]:
# Fill missing ages with the median age.
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which triggers a pandas FutureWarning about chained assignment.
titanic['age'] = titanic['age'].fillna(titanic['age'].median())

# Drop rows with a missing fare, since fare is the prediction target
titanic = titanic.dropna(subset=['fare'])

# Family size = siblings/spouses + parents/children + the passenger
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1
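To confirm that median imputation behaves as expected, it helps to count missing values before and after the fill. The sketch below uses a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the Titanic data); the same checks could be run on `titanic` directly.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (values are made up for illustration).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

missing_before = int(toy["age"].isna().sum())
age_median = toy["age"].median()  # median skips NaN by default

# Assign the filled column back rather than using inplace=True.
toy["age"] = toy["age"].fillna(age_median)
missing_after = int(toy["age"].isna().sum())

print(f"median age: {age_median}")
print(f"missing before: {missing_before}, missing after: {missing_after}")
```

Because the median is computed from the non-missing ages only, the filled values do not shift the center of the age distribution, although they do reduce its spread.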


## Section 3. Feature Selection and Justification

Case 1. age only

In [43]:
# Case 1. age
X1 = titanic[['age']]
y1 = titanic['fare']

Case 2. family_size only

In [44]:
# Case 2. family_size
X2 = titanic[['family_size']]
y2 = titanic['fare']

Case 3. age and family size

In [45]:
# Case 3. age, family_size
X3 = titanic[['age', 'family_size']]
y3 = titanic['fare']

Case 4. age and pclass

In [46]:
# Case 4. age, pclass
X4 = titanic[['age', 'pclass']]
y4 = titanic['fare']

Why might these features affect a passenger’s fare: Age might affect fare because, much like amusement parks, zoos, and museums today, ticket prices are often tiered for children, adults, and seniors.

List all available features: survived, pclass, sex, age, sibsp, parch, fare, embarked, class, who, adult_male, deck, embark_town, alive, alone, family_size

Which other features could improve predictions and why: Ethnicity or occupation could improve predictions because both were often tied to social status, which mattered more at the time.

How many variables are in your Case 4: 2 variables.
Which variable(s) did you choose for Case 4 and why do you feel those could make good inputs: I chose age and pclass because ticket class is tied directly to what a passenger paid, so first-class passengers should have higher fares than third-class passengers.
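One quick way to compare candidate inputs before modeling is to rank them by their correlation with the target. The sketch below is only an illustration: the `toy` frame is randomly generated under the assumption that fare depends mostly on class, so its numbers say nothing about the real data. On the notebook's frame the equivalent call would be `titanic[['age', 'pclass', 'family_size', 'fare']].corr()['fare']`.

```python
import numpy as np
import pandas as pd

# Synthetic data: fare is constructed to depend strongly on pclass (an assumption
# made for illustration), weakly on family_size, and not at all on age.
rng = np.random.default_rng(0)
n = 200
pclass = rng.integers(1, 4, size=n)          # classes 1, 2, 3
age = rng.normal(30, 12, size=n).clip(1, 80)
family_size = rng.integers(1, 6, size=n)
fare = 100 - 30 * pclass + 2 * family_size + rng.normal(0, 10, size=n)

toy = pd.DataFrame(
    {"age": age, "pclass": pclass, "family_size": family_size, "fare": fare}
)

# Pearson correlation of each feature with fare, strongest first.
corr_with_fare = (
    toy.corr()["fare"].drop("fare").sort_values(key=abs, ascending=False)
)
print(corr_with_fare)
```

A negative correlation for `pclass` is expected in this setup because class 1 is the most expensive and class 3 the cheapest.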

## Section 4. Train a Regression Model (Linear Regression)

#### 4.1 Split the Data

In [47]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2, random_state=123)

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=123)

X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, test_size=0.2, random_state=123)

X4_train, X4_test, y4_train, y4_test = train_test_split(X4, y4, test_size=0.2, random_state=123)
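Because every case is split with the same `test_size` and `random_state` on the same number of rows, each case should be trained and tested on the same passengers, which keeps the comparison fair. A minimal sketch of that check on a made-up frame (the `toy` data and the `XA`/`XB` names are illustrative assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with 20 rows; only the row indices matter for this check.
toy = pd.DataFrame({
    "age": range(20, 40),
    "family_size": [1, 2, 3, 4] * 5,
    "fare": [float(v) for v in range(10, 30)],
})
XA = toy[["age"]]
XB = toy[["age", "family_size"]]
y = toy["fare"]

XA_train, XA_test, yA_train, yA_test = train_test_split(
    XA, y, test_size=0.2, random_state=123
)
XB_train, XB_test, yB_train, yB_test = train_test_split(
    XB, y, test_size=0.2, random_state=123
)

# Identical indices mean both feature sets are evaluated on the same rows.
same_rows = XA_test.index.equals(XB_test.index)
print("Same test rows in both splits:", same_rows)
```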

#### 4.2 Train and Evaluate Linear Regression Models (all 4 cases)

In [48]:
# Train linear regression models for all 4 cases
lr_model1 = LinearRegression().fit(X1_train, y1_train)
lr_model2 = LinearRegression().fit(X2_train, y2_train)
lr_model3 = LinearRegression().fit(X3_train, y3_train)
lr_model4 = LinearRegression().fit(X4_train, y4_train)

# Predictions for each case
y_pred_train1 = lr_model1.predict(X1_train)
y_pred_test1 = lr_model1.predict(X1_test)

y_pred_train2 = lr_model2.predict(X2_train)
y_pred_test2 = lr_model2.predict(X2_test)

y_pred_train3 = lr_model3.predict(X3_train)
y_pred_test3 = lr_model3.predict(X3_test)

y_pred_train4 = lr_model4.predict(X4_train)
y_pred_test4 = lr_model4.predict(X4_test)

print("Case 1 (age):")
print(f"Training R²: {r2_score(y1_train, y_pred_train1)}")
print(f"Test R²: {r2_score(y1_test, y_pred_test1)}")

print("\nCase 2 (family_size):")
print(f"Training R²: {r2_score(y2_train, y_pred_train2)}")
print(f"Test R²: {r2_score(y2_test, y_pred_test2)}")

print("\nCase 3 (age, family_size):")
print(f"Training R²: {r2_score(y3_train, y_pred_train3)}")
print(f"Test R²: {r2_score(y3_test, y_pred_test3)}")

print("\nCase 4 (pclass, age):")
print(f"Training R²: {r2_score(y4_train, y_pred_train4)}")
print(f"Test R²: {r2_score(y4_test, y_pred_test4)}")

Case 1 (age):
Training R²: 0.009950688019452203
Test R²: 0.0034163395508416405

Case 2 (family_size):
Training R²: 0.04991579236476085
Test R²: 0.02223118611013175

Case 3 (age, family_size):
Training R²: 0.07347466201590014
Test R²: 0.04978483276307333

Case 4 (pclass, age):
Training R²: 0.30893458196174806
Test R²: 0.3166169173431005
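For reference, R² compares the model's squared error with the squared error of always predicting the mean: R² = 1 - SS_res / SS_tot. A value near 0 therefore means the model does little better than guessing the average fare. The sketch below recomputes it by hand on a few made-up values (`y_true` and `y_pred` are illustrative, not notebook outputs) and checks the result against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted fares for illustration.
y_true = np.array([7.25, 71.28, 7.93, 53.10, 8.05])
y_pred = np.array([15.0, 60.0, 12.0, 45.0, 20.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # error left over after the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
r2_manual = 1 - ss_res / ss_tot

# Predicting the mean for every row gives an R² of zero by construction.
mean_only = np.full_like(y_true, y_true.mean())
r2_mean_only = r2_score(y_true, mean_only)

print("manual R²:", r2_manual)
print("sklearn R²:", r2_score(y_true, y_pred))
print("R² of mean-only predictions:", r2_mean_only)
```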


#### 4.3 Report Performance

In [49]:
# Function to evaluate performance
def evaluate_performance(X, y, case_name):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

    # Train Linear Regression Model
    lr_model = LinearRegression().fit(X_train, y_train)

    # Predictions
    y_pred_train = lr_model.predict(X_train)
    y_pred_test = lr_model.predict(X_test)

    # Evaluate Performance
    print(f"--- {case_name} ---")
    print("Training R²:", r2_score(y_train, y_pred_train))
    print("Test R²:", r2_score(y_test, y_pred_test))
    print("Test RMSE:", mean_squared_error(y_test, y_pred_test) ** 0.5)
    print("Test MAE:", mean_absolute_error(y_test, y_pred_test))
    print("-" * 30)

# Run evaluations for all cases
evaluate_performance(X1, y1, "Case 1: Age Only")
evaluate_performance(X2, y2, "Case 2: Family Size Only")
evaluate_performance(X3, y3, "Case 3: Age & Family Size")
evaluate_performance(X4, y4, "Case 4: Age & Pclass")

--- Case 1: Age Only ---
Training R²: 0.009950688019452203
Test R²: 0.0034163395508416405
Test RMSE: 37.971641801729376
Test MAE: 25.286372931623628
------------------------------
--- Case 2: Family Size Only ---
Training R²: 0.04991579236476085
Test R²: 0.02223118611013175
Test RMSE: 37.61149400419671
Test MAE: 25.02534815941642
------------------------------
--- Case 3: Age & Family Size ---
Training R²: 0.07347466201590014
Test R²: 0.04978483276307333
Test RMSE: 37.077758664655896
Test MAE: 24.28493503047068
------------------------------
--- Case 4: Age & Pclass ---
Training R²: 0.30893458196174806
Test R²: 0.3166169173431005
Test RMSE: 31.443769640988414
Test MAE: 20.703744560366548
------------------------------
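Printed blocks are easy to read one at a time but harder to compare across cases. One option is to return the metrics from a helper and collect them into a single table. This is a sketch only: `summarize_case` is a hypothetical helper, not part of the notebook, and the `toy` data is synthetic with fare constructed to depend on `pclass`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def summarize_case(X, y, case_name):
    """Hypothetical helper: fit a linear model and return its metrics as a dict."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=123
    )
    model = LinearRegression().fit(X_train, y_train)
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    return {
        "case": case_name,
        "train_r2": r2_score(y_train, pred_train),
        "test_r2": r2_score(y_test, pred_test),
        "test_rmse": mean_squared_error(y_test, pred_test) ** 0.5,
        "test_mae": mean_absolute_error(y_test, pred_test),
    }


# Synthetic stand-in data (assumption: fare driven by pclass, unrelated to age).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "pclass": rng.integers(1, 4, size=n),
    "age": rng.uniform(1, 80, size=n),
})
toy["fare"] = 90 - 25 * toy["pclass"] + rng.normal(0, 10, size=n)

results = pd.DataFrame([
    summarize_case(toy[["age"]], toy["fare"], "age only"),
    summarize_case(toy[["pclass"]], toy["fare"], "pclass only"),
]).set_index("case")
print(results.round(3))
```

With one row per case, sorting by `test_r2` or `test_rmse` makes the best and worst cases immediately visible.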


Compare the train vs test results for each.

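Did Case 1 overfit or underfit? Explain: Underfit. Training and test R² are both close to zero (0.010 and 0.003), so age alone explains almost none of the variation in fare.
Did Case 2 overfit or underfit? Explain: Underfit. Training R² is 0.050 and test R² is 0.022, so family size alone captures very little of the variation in fare.
Did Case 3 overfit or underfit? Explain: Slightly underfit. Combining the two features raises R² to 0.073 (train) and 0.050 (test), but the model still explains well under 10% of the variation.
Did Case 4 overfit or underfit? Explain: Balanced. Training and test R² are nearly equal (0.309 and 0.317), so there is no sign of overfitting, although about 68% of the variation in the test fares remains unexplained.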
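The overfit/underfit judgment comes down to two numbers per case: how high the scores are, and how far the training score sits above the test score. The sketch below does that arithmetic on the R² values reported in the output above (the `reported` dictionary simply copies them, rounded to six decimals).

```python
# (train R², test R²) copied from the reported output, rounded to six decimals.
reported = {
    "Case 1": (0.009951, 0.003416),
    "Case 2": (0.049916, 0.022231),
    "Case 3": (0.073475, 0.049785),
    "Case 4": (0.308935, 0.316617),
}

# A large positive gap (train much higher than test) would suggest overfitting.
gaps = {case: train - test for case, (train, test) in reported.items()}

for case, gap in gaps.items():
    train, test = reported[case]
    print(f"{case}: train={train:.3f} test={test:.3f} gap={gap:+.3f}")
```

Every gap is under 0.03 in absolute value, so none of the cases shows meaningful overfitting; what separates them is how low the scores are overall.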

Adding Age

Did adding age improve the model: Yes, but only slightly. Adding age to family_size (Case 2 to Case 3) raised test R² from 0.022 to 0.050 and lowered test RMSE from 37.61 to 37.08.
Propose a possible explanation (consider how age might affect ticket price, and whether the data supports that): Age alone has only a weak linear relationship with fare (Case 1 test R² of 0.003), probably because ticket price was driven by class, destination, and purchasing power rather than by age.
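For a linear regression with a single feature and an intercept, the training R² equals the square of the Pearson correlation between that feature and the target. Case 1's training R² of about 0.00995 therefore implies a correlation between age and fare of roughly 0.1 in magnitude on the training rows. The sketch below checks the identity on synthetic data (the generated `age` and `fare` arrays are assumptions for illustration) and then does the square-root arithmetic on the reported value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a deliberately weak age effect buried in noise.
rng = np.random.default_rng(42)
age = rng.uniform(1, 80, size=300)
fare = 30 + 0.1 * age + rng.normal(0, 40, size=300)

r = np.corrcoef(age, fare)[0, 1]               # Pearson correlation
X = age.reshape(-1, 1)
r2_train = LinearRegression().fit(X, fare).score(X, fare)

print("r squared:  ", r ** 2)
print("training R²:", r2_train)

# Magnitude of the correlation implied by Case 1's reported training R².
implied_r = 0.009950688019452203 ** 0.5
print("implied |r| for Case 1:", round(implied_r, 4))
```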

Worst

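Which case performed the worst: Case 1 (age only)
How do you know: It has the lowest test R² (0.003) and the highest test RMSE (37.97) and MAE (25.29) of the four cases.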
Do you think adding more training data would improve it (and why/why not): 

Best

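Which case performed the best: Case 4 (age and pclass)
How do you know: It has the highest test R² (0.317) and the lowest test RMSE (31.44) and MAE (20.70) of the four cases.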
Do you think adding more training data would improve it (and why/why not): 