# Titanic Data Analysis
**Author:** Derek Fintel

**Date:** April 4, 2025 

**Objective:** Predicting a Continuous Target with Regression.


## Introduction
In this project we use the Titanic dataset to run several analyses, exercise modeling functions, and predict a continuous target (`fare`).

This project is organized into the following Sections:
- Section 0: Imports
- Section 1: Load and Inspect the Data
- Section 2: Data Exploration and Preparation
- Section 3: Feature Selection and Justification
- Section 4: Train a Regression Model (Linear Regression)
- Section 5: Compare Alternative Models
- Section 6: Final Thoughts & Insights

## Imports  
The modules used in this notebook are imported below.

In [None]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

## Section 1. Load and Inspect the Data

### 1.1 Load the dataset and display its info
- Load the dataset and display the first 10 rows.
- Check for missing values and display summary statistics.

In [313]:
# Load the 'titanic' dataset via sns.load_dataset
titanic = sns.load_dataset('titanic')

# Retrieve its summary info via '.info()'
titanic.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
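
The `info()` output above shows gaps in `age` (714 of 891 non-null), `embarked` and `embark_town` (889 each), and `deck` (203). A minimal sketch of the missing-value and summary-statistics check follows. It runs on a small stand-in frame copied from the first six rows printed in Section 1.2; the name `sample` is an assumption made for illustration, and in the notebook the same calls would run on `titanic`.

```python
import numpy as np
import pandas as pd

# Stand-in frame (hypothetical name `sample`) built from the first six rows
# of titanic.head(10); row 5 has a missing age.
sample = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    "fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
    "embarked": ["S", "C", "S", "S", "S", "Q"],
})

# Count missing values per column, largest first
missing_counts = sample.isnull().sum().sort_values(ascending=False)
print(missing_counts)

# Summary statistics for the numeric columns (NaN values are skipped)
summary = sample.describe()
print(summary)
```

Sorting the counts puts the columns that need imputation or dropping at the top of the printout.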


### 1.2 Display the first 10 rows.  

In [314]:
# Here we 'print' the first 10 rows via '.head(10)'
print(titanic.head(10))

   survived  pclass     sex   age  sibsp  parch     fare embarked   class  \
0         0       3    male  22.0      1      0   7.2500        S   Third   
1         1       1  female  38.0      1      0  71.2833        C   First   
2         1       3  female  26.0      0      0   7.9250        S   Third   
3         1       1  female  35.0      1      0  53.1000        S   First   
4         0       3    male  35.0      0      0   8.0500        S   Third   
5         0       3    male   NaN      0      0   8.4583        Q   Third   
6         0       1    male  54.0      0      0  51.8625        S   First   
7         0       3    male   2.0      3      1  21.0750        S   Third   
8         1       3  female  27.0      0      2  11.1333        S   Third   
9         1       2  female  14.0      1      0  30.0708        C  Second   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes

### Reflection 1: What do you notice about the dataset? Are there any data issues?

## Section 2. Data Exploration and Preparation

### 2.1 Explore Data Patterns and Distributions
Prepare the Titanic data for regression modeling. See the previous work.

- Create histograms, boxplots, and count plots for categorical variables (as applicable).
- Identify patterns, outliers, and anomalies in feature distributions.
- Check for class imbalance in the target variable (as applicable).

In [315]:
# 1. Impute missing 'age' values with the median.
#    Assign the result back; calling fillna(..., inplace=True) on a column selection triggers a pandas FutureWarning.
titanic['age'] = titanic['age'].fillna(titanic['age'].median())

# 2. Drop rows with missing 'fare' (a no-op here: 'fare' has 891 non-null values)
titanic.dropna(subset=['fare'], inplace=True)

# Alternatively, you can impute fare instead of dropping:
# titanic['fare'] = titanic['fare'].fillna(titanic['fare'].median())

# 3. Create 'family_size' feature
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1

# 4. Optional: Convert categorical features to numeric

# Convert 'sex' to binary
titanic['sex'] = titanic['sex'].map({'male': 0, 'female': 1})

# One-hot encode 'embarked' (drop_first avoids multicollinearity)
titanic = pd.get_dummies(titanic, columns=['embarked'], drop_first=True)

# Preview the cleaned data
print(titanic.head())


   survived  pclass  sex   age  sibsp  parch     fare  class    who  \
0         0       3    0  22.0      1      0   7.2500  Third    man   
1         1       1    1  38.0      1      0  71.2833  First  woman   
2         1       3    1  26.0      0      0   7.9250  Third  woman   
3         1       1    1  35.0      1      0  53.1000  First  woman   
4         0       3    0  35.0      0      0   8.0500  Third    man   

   adult_male deck  embark_town alive  alone  family_size  embarked_Q  \
0        True  NaN  Southampton    no  False            2       False   
1       False    C    Cherbourg   yes  False            2       False   
2       False  NaN  Southampton   yes   True            1       False   
3       False    C  Southampton   yes  False            2       False   
4        True  NaN  Southampton    no   True            1       False   

   embarked_S  
0        True  
1       False  
2        True  
3        True  
4        True  




### 2.2 Handle missing values and clean data
- Impute or drop missing values (as applicable).
- Remove or transform outliers (as applicable).
- Convert categorical data to numerical format using encoding (as applicable).
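
The outlier step is not shown in the cell above. One common option is the 1.5 × IQR rule, sketched here under stated assumptions: the fares are the ten values printed in Section 1.2, plus one hypothetical large fare (500.0) added only so the rule has something to flag. This is an illustration, not the method used in this notebook.

```python
import pandas as pd

# Fares from the first ten rows shown earlier, plus one hypothetical
# large value (500.0) that is NOT taken from the notebook output.
fares = pd.Series(
    [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583,
     51.8625, 21.075, 11.1333, 30.0708, 500.0],
    name="fare",
)

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences
is_outlier = (fares < lower) | (fares > upper)
print(f"IQR fences: [{lower:.2f}, {upper:.2f}]")
print(fares[is_outlier])

# One alternative to dropping rows: clip extreme values to the fences
fares_clipped = fares.clip(lower=lower, upper=upper)
```

Clipping keeps every row, which matters on a dataset of only 891 passengers. Dropping the flagged rows is the other common choice.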

### 2.3 Feature selection and engineering
- Create new features (as applicable).
- Transform or combine existing features to improve model performance (as applicable).
- Scale or normalize data (as applicable).

### Reflection 2: 
What patterns or anomalies do you see? Do any features stand out? What preprocessing steps were necessary to clean and improve the data? Did you create or modify any features to improve performance?

## Section 3. Feature Selection and Justification
### 3.1 Choose features and target
- Select two or more input features (numerical for regression, numerical and/or categorical for classification).
- Select a target variable (as applicable):
  - Regression: continuous target variable (e.g., price, temperature).
  - Classification: categorical target variable (e.g., gender, species).
  - Clustering: no target variable.
- Justify your selection with reasoning.
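
One way to support the justification step is to check how strongly each candidate feature correlates with the target. The sketch below does this on a stand-in frame built from the ten rows printed in Section 1.2. The names `sample`, `candidates`, and `corr_with_fare` are assumptions for illustration, and ten rows are far too few to draw conclusions; in the notebook the same call would run on `titanic`.

```python
import numpy as np
import pandas as pd

# Stand-in frame copied from titanic.head(10); row 5 has a missing age.
sample = pd.DataFrame({
    "age":   [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0],
    "sibsp": [1, 1, 0, 1, 0, 0, 0, 3, 0, 1],
    "parch": [0, 0, 0, 0, 0, 0, 0, 1, 2, 0],
    "fare":  [7.25, 71.2833, 7.925, 53.1, 8.05,
              8.4583, 51.8625, 21.075, 11.1333, 30.0708],
})
# Same engineered feature as in Section 2
sample["family_size"] = sample["sibsp"] + sample["parch"] + 1

# Pearson correlation of each candidate feature with the target 'fare'
candidates = ["age", "family_size", "parch"]
corr_with_fare = sample[candidates + ["fare"]].corr()["fare"].drop("fare")

# Show the strongest relationships (by absolute value) first
print(corr_with_fare.sort_values(key=abs, ascending=False))
```

Pearson correlation only captures linear relationships, so a weak value here does not rule out a nonlinear effect.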


In [316]:
# Case 1. age
X1 = titanic[['age']]
y1 = titanic['fare']
# Case 2. family_size
X2 = titanic[['family_size']]
y2 = titanic['fare']
# Case 3. age, family_size
X3 = titanic[['age', 'family_size']]
y3 = titanic['fare']
# Case 4. parch
X4 = titanic[['parch']]
y4 = titanic['fare']

### 3.2 Define X and y
Assign input features to X
Assign target variable to y (as applicable)

### Reflection 3: Why did you choose these features? How might they impact predictions or accuracy?

## Section 4. Train a Regression Model (Linear Regression)

### 4.1 Split the Data
Split the data into training and test sets using train_test_split (or StratifiedShuffleSplit if class imbalance is an issue).

In [317]:
# Split Case 1
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2, random_state=123)
# Split Case 2
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=123)
# Split Case 3
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, test_size=0.2, random_state=123)
# Split Case 4
X4_train, X4_test, y4_train, y4_test = train_test_split(X4, y4, test_size=0.2, random_state=123)


### 4.2 Train model using Scikit-Learn model.fit() method

In [318]:
# Linear Regression for Case 1
lr_model1 = LinearRegression().fit(X1_train, y1_train)
# Linear Regression for Case 2
lr_model2 = LinearRegression().fit(X2_train, y2_train)
# Linear Regression for Case 3
lr_model3 = LinearRegression().fit(X3_train, y3_train)
# Linear Regression for Case 4
lr_model4 = LinearRegression().fit(X4_train, y4_train)

# Predictions
# Case 1
y_pred_train1 = lr_model1.predict(X1_train)
y_pred_test1 = lr_model1.predict(X1_test)
# Case 2
y_pred_train2 = lr_model2.predict(X2_train)
y_pred_test2 = lr_model2.predict(X2_test)
# Predictions for Case 3
y_pred_train3 = lr_model3.predict(X3_train)
y_pred_test3 = lr_model3.predict(X3_test)
# Predictions for Case 4
y_pred_train4 = lr_model4.predict(X4_train)
y_pred_test4 = lr_model4.predict(X4_test)

### 4.3 Evaluate performance, for example:
Regression: R^2, MAE, RMSE (RMSE has been recently updated)
Classification: Accuracy, Precision, Recall, F1-score, Confusion Matrix
Clustering: Inertia, Silhouette Score

In [319]:
# Evaluation for Case 1
print("Case 1: Training R²:", r2_score(y1_train, y_pred_train1))
print("Case 1: Test R²:", r2_score(y1_test, y_pred_test1))
print("Case 1: Test RMSE:", mean_squared_error(y1_test, y_pred_test1) ** 0.5)
print("Case 1: Test MAE:", mean_absolute_error(y1_test, y_pred_test1))
# Evaluation for Case 2
print("\nCase 2: Training R²:", r2_score(y2_train, y_pred_train2))
print("Case 2: Test R²:", r2_score(y2_test, y_pred_test2))
print("Case 2: Test RMSE:", mean_squared_error(y2_test, y_pred_test2) ** 0.5)
print("Case 2: Test MAE:", mean_absolute_error(y2_test, y_pred_test2))
# Evaluation for Case 3
print("\nCase 3: Training R²:", r2_score(y3_train, y_pred_train3))
print("Case 3: Test R²:", r2_score(y3_test, y_pred_test3))
print("Case 3: Test RMSE:", mean_squared_error(y3_test, y_pred_test3) ** 0.5)
print("Case 3: Test MAE:", mean_absolute_error(y3_test, y_pred_test3))
# Evaluation for Case 4
print("\nCase 4: Training R²:", r2_score(y4_train, y_pred_train4))
print("Case 4: Test R²:", r2_score(y4_test, y_pred_test4))
print("Case 4: Test RMSE:", mean_squared_error(y4_test, y_pred_test4) ** 0.5)
print("Case 4: Test MAE:", mean_absolute_error(y4_test, y_pred_test4))

Case 1: Training R²: 0.009950688019452314
Case 1: Test R²: 0.0034163395508415295
Case 1: Test RMSE: 37.97164180172938
Case 1: Test MAE: 25.28637293162364

Case 2: Training R²: 0.049915792364760736
Case 2: Test R²: 0.022231186110131973
Case 2: Test RMSE: 37.6114940041967
Case 2: Test MAE: 25.02534815941641

Case 3: Training R²: 0.07347466201590014
Case 3: Test R²: 0.049784832763073106
Case 3: Test RMSE: 37.0777586646559
Case 3: Test MAE: 24.284935030470688

Case 4: Training R²: 0.051165530832692374
Case 4: Test R²: 0.0030800228833659515
Case 4: Test RMSE: 37.97804839823722
Case 4: Test MAE: 25.156643237188245
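
The four evaluation blocks above repeat the same three metrics. A small helper can compute them in one place. This is a sketch: the function name `regression_report` and the toy arrays are assumptions, not part of the notebook. It takes the square root of the MSE, which works across scikit-learn versions; newer releases also ship a dedicated `root_mean_squared_error` function.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return R², RMSE, and MAE for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "R2": r2_score(y_true, y_pred),
        "RMSE": float(np.sqrt(mse)),  # RMSE is the square root of the MSE
        "MAE": mean_absolute_error(y_true, y_pred),
    }


# Tiny hypothetical example (values chosen only for illustration)
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

report = regression_report(y_true, y_pred)
print(report)
```

In the notebook, a call such as `regression_report(y1_test, y_pred_test1)` would return the Case 1 test metrics as a dictionary.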


### Reflection 4: How well did the model perform? Any surprises in the results?


## Section 5. Improve the Model or Try Alternates (Implement Pipelines)
5.1 Implement Pipeline 1: Imputer → StandardScaler → Linear Regression.

In [320]:
# Use Case 1 (age only) as the input; note that Case 3 had the best test R² in Section 4
X = titanic[['age']]
y = titanic['fare']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Dictionary to store models and results
models = {
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5)
}

results = {}

# Train and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results[name] = {
        "MSE": mse,
        "R²": r2,
        "Coefficients": model.coef_
    }
    print(f"{name} Regression - MSE: {mse:.3f}, R²: {r2:.3f}")

# You can check the results for each model
for name, result in results.items():
    print(f"\n{name} Results:")
    print(f"MSE: {result['MSE']:.3f}")
    print(f"R²: {result['R²']:.3f}")
    print(f"Coefficients: {result['Coefficients']}")


Ridge Regression - MSE: 1524.637, R²: 0.015
Lasso Regression - MSE: 1524.643, R²: 0.015
ElasticNet Regression - MSE: 1524.641, R²: 0.015

Ridge Results:
MSE: 1524.637
R²: 0.015
Coefficients: [0.36204943]

Lasso Results:
MSE: 1524.643
R²: 0.015
Coefficients: [0.36146061]

ElasticNet Results:
MSE: 1524.641
R²: 0.015
Coefficients: [0.36164951]
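
The cell above fits the regularized models directly rather than through a pipeline. A minimal sketch of Pipeline 1 (Imputer → StandardScaler → Linear Regression) is given below. It assumes synthetic stand-in data with some missing values; the data, the step names, and the variable `pipe1` are hypothetical and not taken from the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): one noisy linear feature
rng = np.random.default_rng(0)
X = rng.normal(30, 12, size=(200, 1))
y = 0.4 * X[:, 0] + rng.normal(0, 5, size=200)
# Blank out 10% of the feature values to simulate missing data
X[rng.choice(200, size=20, replace=False), 0] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pipeline 1: median imputation -> standardization -> linear regression
pipe1 = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
pipe1.fit(X_train, y_train)

r2_pipe1 = r2_score(y_test, pipe1.predict(X_test))
print(f"Pipeline 1 test R²: {r2_pipe1:.3f}")
```

Because the imputer and scaler sit inside the pipeline, their statistics are learned from the training split only, which avoids leaking test data into preprocessing.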


### 5.2 Elastic Net (L1 + L2 combined)
5.2 Implement Pipeline 2: Imputer → Polynomial Features (degree=3) → StandardScaler → Linear Regression.

In [321]:
# Use 'age', 'family_size', and 'parch' as inputs for the ElasticNet model
X = titanic[['age', 'family_size', 'parch']]
y = titanic['fare']

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train ElasticNet with scaled features
elastic_model = ElasticNet(alpha=0.3, l1_ratio=0.5, random_state=42)
elastic_model.fit(X_scaled, y)

# Predict and evaluate on the same data used for fitting (in-sample, no held-out test set)
y_pred_elastic = elastic_model.predict(X_scaled)
mse_elastic = mean_squared_error(y, y_pred_elastic)
r2_elastic = r2_score(y, y_pred_elastic)

print(f"Elastic Net - MSE: {mse_elastic:.3f}, R²: {r2_elastic:.3f}")
print(f"Elastic Net Coefficients: {elastic_model.coef_}")


Elastic Net - MSE: 2282.922, R²: 0.074
Elastic Net Coefficients: [6.33426796 6.86793796 5.48097229]
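
The Elastic Net metrics above are computed on the same rows the model was fit on, so they are not directly comparable to the test-set scores in Section 4. A sketch of a held-out evaluation follows, with the scaler fit on the training split only. It assumes synthetic three-feature data; the data and the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): three features on different scales
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3)) * np.array([12.0, 1.5, 0.8])
y = 30 + 0.5 * X[:, 0] + 4.0 * X[:, 1] + rng.normal(0, 20, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fit inside the pipeline, on the training split only
enet = make_pipeline(StandardScaler(), ElasticNet(alpha=0.3, l1_ratio=0.5))
enet.fit(X_train, y_train)

# Compare in-sample and held-out scores
r2_train = r2_score(y_train, enet.predict(X_train))
r2_test = r2_score(y_test, enet.predict(X_test))
mse_test = mean_squared_error(y_test, enet.predict(X_test))

print(f"Elastic Net - train R²: {r2_train:.3f}, test R²: {r2_test:.3f}")
print(f"Elastic Net - test MSE: {mse_test:.3f}")
```

Reporting both scores makes any gap between training and test performance visible.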


In [322]:
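# Refit ElasticNet on the Case 1 training split and predict on its test split (predictions are not scored here)
elastic_model = ElasticNet(alpha=0.3, l1_ratio=0.5)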
elastic_model.fit(X1_train, y1_train)
y_pred_elastic = elastic_model.predict(X1_test)

### 5.3 Compare performance of all models across the same performance metrics

In [323]:
# Set up the poly inputs
poly = PolynomialFeatures(degree=3)
X_train_poly = poly.fit_transform(X1_train)
X_test_poly = poly.transform(X1_test)

# Use the poly inputs in the LR model
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y1_train)
y_pred_poly = poly_model.predict(X_test_poly)

# Evaluate the Polynomial Regression model
mse_poly = mean_squared_error(y1_test, y_pred_poly)
r2_poly = r2_score(y1_test, y_pred_poly)

print(f"Polynomial Regression - MSE: {mse_poly:.3f}, R²: {r2_poly:.3f}")

Polynomial Regression - MSE: 1451.569, R²: -0.003
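
The polynomial cell above applies `PolynomialFeatures` by hand. A sketch of Pipeline 2 (Imputer → Polynomial Features (degree=3) → StandardScaler → Linear Regression) is given below. It assumes synthetic stand-in data with a curved relationship; the data, step names, and the variable `pipe2` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (hypothetical): a curved relationship plus noise
rng = np.random.default_rng(3)
X = rng.uniform(1, 70, size=(200, 1))
y = 20 + 0.01 * (X[:, 0] - 35) ** 2 + rng.normal(0, 5, size=200)
# Blank out 10% of the feature values to simulate missing data
X[rng.choice(200, size=20, replace=False), 0] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Pipeline 2: impute -> cubic polynomial terms -> standardize -> linear regression
pipe2 = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("poly", PolynomialFeatures(degree=3, include_bias=False)),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
pipe2.fit(X_train, y_train)

y_pred = pipe2.predict(X_test)
mse_pipe2 = mean_squared_error(y_test, y_pred)
r2_pipe2 = r2_score(y_test, y_pred)
print(f"Pipeline 2 - MSE: {mse_pipe2:.3f}, R²: {r2_pipe2:.3f}")
```

Scaling after the polynomial expansion keeps the squared and cubed terms on a comparable scale. `include_bias=False` is used here because `LinearRegression` already fits an intercept.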


### Reflection 5: 
Which models performed better? How does scaling impact results?
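
On the scaling question, a small experiment can separate the two cases. Ordinary least squares with an intercept gives the same fit whether or not the features are standardized, while penalized models such as ElasticNet generally do not, because the penalty depends on coefficient size. The sketch below assumes synthetic data and scores in-sample purely to isolate the effect of scaling; all names and data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): two features on different scales
rng = np.random.default_rng(2)
X = np.column_stack([rng.normal(30, 12, 200), rng.integers(1, 8, 200)])
y = 0.4 * X[:, 0] + 5.0 * X[:, 1] + rng.normal(0, 10, 200)

X_scaled = StandardScaler().fit_transform(X)


def fit_r2(model, features):
    """Fit the model and return its in-sample R² on the same features."""
    return model.fit(features, y).score(features, y)


ols_raw = fit_r2(LinearRegression(), X)
ols_scaled = fit_r2(LinearRegression(), X_scaled)
enet_raw = fit_r2(ElasticNet(alpha=0.3, l1_ratio=0.5), X)
enet_scaled = fit_r2(ElasticNet(alpha=0.3, l1_ratio=0.5), X_scaled)

print(f"OLS        raw: {ols_raw:.6f}  scaled: {ols_scaled:.6f}")
print(f"ElasticNet raw: {enet_raw:.6f}  scaled: {enet_scaled:.6f}")
```

The two OLS scores should agree to floating-point precision. Any difference between the two ElasticNet scores comes from the penalty acting on differently sized coefficients.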

## Section 6. Final Thoughts & Insights
Your notebook should tell a data story. Use this section to demonstrate your thinking and value as an analyst.

### 6.1 Summarize Findings
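1) What features were most useful? 
   1) Ans: Age and family_size together (Case 3) gave the best test R² (about 0.05). 'sex' was encoded in Section 2 but was not used as a model input.
2) What regression model performed best? 
   1) Ans: The Case 3 linear regression (age and family_size) had the best test R² (about 0.05). The degree-3 polynomial model on age scored a test R² of -0.003.
3) How did model complexity or regularization affect results?
   1) Ans: Regularization changed little: Ridge, Lasso, and ElasticNet returned nearly identical coefficients (about 0.36) and the same test R² (0.015). Adding degree-3 polynomial terms for age did not improve the test R².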

### 6.2 Discuss Challenges
1) Was fare hard to predict? Why?
   1) Ans: Yes. The selected inputs correlate only weakly with 'fare'; the best test R² was about 0.05.
2) Did skew or outliers impact the models?
   1) Ans: Likely yes. In every case the test RMSE (about 37 to 38) is well above the test MAE (about 24 to 25), which is consistent with a few very large fares dominating the squared error.

### 6.3 If you had more time, what would you try next?

Reflection 6: What did you learn from this project?