# 🛳️ Titanic Survival Classification – Supervised Learning Practice

This notebook explores a classification task using the Titanic dataset.  
We apply decision trees and random forests to predict passenger survival, with emphasis on model evaluation, tuning, and generalization.

## 1. Loading Data

We begin by loading the raw Titanic dataset and performing basic checks.

In [None]:
import pandas as pd

# Define file path and load Titanic dataset
file_path = 'data/titanic.csv'  # Adjust path if needed
titanic_data = pd.read_csv(file_path)

# Preview dataset shape and column names
print("Shape:", titanic_data.shape)
print("Columns:", titanic_data.columns.tolist())

# Preview the first few rows
titanic_data.head()

## 2. Exploratory Data Analysis

We explore the structure and quality of the dataset, identify missing values, and prepare for feature selection.

In [None]:
# Display summary statistics, including non-numeric columns
# (printed explicitly: a notebook only renders the last expression of a cell)
print(titanic_data.describe(include='all'))

# Count missing values per column
missing = titanic_data.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

print("Missing values per column:")
print(missing)

Initial inspection reveals:

- `Cabin` is missing for most rows — we will drop it.
- `Age` has moderate missingness — we will fill with the median.
- `Embarked` has two missing values — we will drop those rows.
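
The three rules above can be tried on a toy frame first. The sketch below is illustrative only: the `toy` DataFrame and its values are made up, and it simply mirrors the plan (drop a mostly empty column, drop rows missing a rarely absent field, median-fill a numeric column).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file
toy = pd.DataFrame({
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Age": [22.0, np.nan, 26.0, 35.0, 54.0],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Fraction of missing entries per column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

cleaned = toy.drop(columns=["Cabin"])                 # mostly missing: drop the column
cleaned = cleaned.dropna(subset=["Embarked"]).copy()  # rarely missing: drop those rows
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())  # fill with the median

print(cleaned)
```

Looking at the share of missing values, rather than the raw count, makes the choice between dropping a column and filling it easier to justify.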

## 3. Feature Engineering

We'll remove irrelevant or highly incomplete columns, handle missing data, and encode categorical variables for modeling.

In [None]:
# Drop columns unlikely to help with classification or too incomplete
titanic_data = titanic_data.drop(columns=['Cabin', 'Ticket', 'Name'])

# Drop rows with missing target ('Survived') or critical features
titanic_data = titanic_data.dropna(subset=['Survived', 'Embarked'])

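# Fill missing Age with the median
# (computed over all rows here; for a stricter evaluation, compute it on the training split only)
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].median())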

# Encode 'Sex' and 'Embarked'
titanic_data['Sex'] = titanic_data['Sex'].map({'male': 0, 'female': 1})
titanic_data['Embarked'] = titanic_data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

In [None]:
# Define target and features
y = titanic_data['Survived']
feature_names = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = titanic_data[feature_names]

# Preview the final feature matrix and target
X.head()

### Notes

- Removed columns `Cabin`, `Ticket`, and `Name` due to missingness or low predictive value.
- Filled missing `Age` values with the median.
- Encoded categorical features `Sex` and `Embarked`.
- Selected 7 numeric features for model training.
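
One thing worth checking with dictionary-based encoding: `Series.map` returns `NaN` for any value that is not a key in the dictionary, so an unexpected category or a different capitalization silently becomes a missing value. A minimal sketch on made-up values follows; the `'Male'` entry is a hypothetical dirty value, not something observed in the dataset.

```python
import pandas as pd

# Hypothetical column with one inconsistently capitalized entry
sex = pd.Series(["male", "female", "Male", "female"])

encoded = sex.map({"male": 0, "female": 1})
print(encoded.tolist())  # the unmatched 'Male' becomes NaN

# Normalize before mapping, then confirm nothing was lost
encoded_clean = sex.str.lower().map({"male": 0, "female": 1})
assert encoded_clean.notna().all(), "unmapped category found"
print(encoded_clean.tolist())
```

A check like the `assert` above, run right after encoding, would catch such cases before they reach the model.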

In [None]:
# Save cleaned data
titanic_data.to_csv('data/titanic_cleaned.csv', index=False)

## 4. Model Training

We start by training a baseline `DecisionTreeClassifier` to predict passenger survival.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split the dataset into training and validation sets
# (train_test_split defaults to a 75/25 split when test_size is not given)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Train a Decision Tree classifier with default parameters
model = DecisionTreeClassifier(random_state=1)
model.fit(train_X, train_y)
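
When neither `test_size` nor `train_size` is passed, scikit-learn's `train_test_split` holds out 25% of the rows. The sketch below, on made-up arrays (`X_demo` and `y_demo` are hypothetical), contrasts that default with an explicit, stratified 80/20 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples, 2 features, balanced labels
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

# Default behaviour: 25% of the rows go to the validation set
a_train, a_val, _, _ = train_test_split(X_demo, y_demo, random_state=1)

# Explicit 80/20 split, stratified so both parts keep the class ratio
b_train, b_val, b_ytrain, b_yval = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1
)

print(len(a_train), len(a_val))  # 75 25
print(len(b_train), len(b_val))  # 80 20
```

Stratifying on the target is optional, but it keeps the survivor ratio similar in both parts, which matters more on small datasets.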

## 5. Evaluation

### 5.1 Accuracy and Metrics
We evaluate model performance using accuracy, precision, and recall.

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Predict on validation data
val_predictions = model.predict(val_X)

# Compute performance metrics
accuracy = accuracy_score(val_y, val_predictions)
precision = precision_score(val_y, val_predictions)
recall = recall_score(val_y, val_predictions)
cm = confusion_matrix(val_y, val_predictions)

print(f"Validation Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print("Confusion Matrix:\n", cm)

### 5.2 Confusion Matrix

| Actual \\ Predicted | 0 (Did Not Survive) | 1 (Survived) |
|---------------------|---------------------|--------------|
| 0                   | 113                 | 25           |
| 1                   | 19                  | 66           |

- **Precision:** 72.5% – Of those predicted to survive, 72.5% actually did.
- **Recall:** 77.6% – The model correctly identified 77.6% of true survivors.
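
These figures follow directly from the four cells of the confusion matrix. As a quick arithmetic check (the variable names are illustrative):

```python
# Counts from the confusion matrix above (rows: actual, columns: predicted)
tn, fp = 113, 25
fn, tp = 19, 66

accuracy_cm = (tp + tn) / (tp + tn + fp + fn)
precision_cm = tp / (tp + fp)  # share of predicted survivors who actually survived
recall_cm = tp / (tp + fn)     # share of actual survivors the model found

print(f"Accuracy:  {accuracy_cm:.3f}")   # 0.803
print(f"Precision: {precision_cm:.3f}")  # 0.725
print(f"Recall:    {recall_cm:.3f}")     # 0.776
```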

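The baseline reaches about 80% validation accuracy. Because an unconstrained tree keeps splitting until its leaves are pure, it is prone to overfitting, so limiting its complexity is a natural next step.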

## 6. Model Refinement and Evaluation

We now experiment with ways to improve the model:
- Limit decision tree complexity using `max_depth`
- Try a Random Forest ensemble model

### 6.1 Tree Depth Tuning

To diagnose underfitting and overfitting, we train several decision trees with increasing `max_depth` values and compare their validation metrics.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

depths = [2, 3, 4, 5, 6, 7, 10, None]
results = []

# Evaluate each depth
for d in depths:
    model = DecisionTreeClassifier(max_depth=d, random_state=1)
    model.fit(train_X, train_y)
    preds = model.predict(val_X)

    acc = accuracy_score(val_y, preds)
    prec = precision_score(val_y, preds)
    rec = recall_score(val_y, preds)

    results.append({
        'Depth': d if d is not None else 'Full',
        'Accuracy': acc,
        'Precision': prec,
        'Recall': rec
    })

In [None]:
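# Plot validation accuracy vs depth
import os
import matplotlib.pyplot as plt

os.makedirs("plots", exist_ok=True)  # savefig raises an error if the directory is missing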

depth_labels = [str(r['Depth']) for r in results]
accs = [r['Accuracy'] for r in results]

plt.figure(figsize=(7, 5))
plt.plot(depth_labels, accs, marker='o', linestyle='-')
plt.title("Validation Accuracy vs Tree Depth")
plt.xlabel("Max Depth")
plt.ylabel("Accuracy")
plt.grid(True)
plt.tight_layout()
plt.savefig("plots/titanic_depth_accuracy.png")
plt.show()

- Accuracy improved up to depth 5–6, peaking around 85.7%.
- Beyond depth 6, the model began to overfit (lower validation accuracy).
- A tree of depth 5 or 6 balances bias and variance effectively.
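
Validation accuracy alone does not show where overfitting starts. Comparing it with training accuracy at each depth makes the gap visible. The following is a minimal sketch on synthetic data from `make_classification` (not the Titanic set); the sample size, feature counts, and depths are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, used only to illustrate the diagnostic
X_syn, y_syn = make_classification(
    n_samples=600, n_features=7, n_informative=4, random_state=1
)
Xtr, Xva, ytr, yva = train_test_split(X_syn, y_syn, random_state=1)

gap_by_depth = {}
for d in [2, 4, 6, 10, None]:
    tree = DecisionTreeClassifier(max_depth=d, random_state=1).fit(Xtr, ytr)
    train_acc = tree.score(Xtr, ytr)  # accuracy on data the tree has seen
    val_acc = tree.score(Xva, yva)    # accuracy on held-out data
    gap_by_depth[d] = (train_acc, val_acc)
    print(f"depth={d}: train={train_acc:.3f}, val={val_acc:.3f}, "
          f"gap={train_acc - val_acc:.3f}")
```

A training score near 1.0 alongside a flat or falling validation score is the usual sign that extra depth is memorizing noise rather than learning structure.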

### 6.2 Random Forest Classifier

We compare the tuned decision tree to a Random Forest ensemble.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Train and evaluate Random Forest
rf_model = RandomForestClassifier(random_state=1)
rf_model.fit(train_X, train_y)
rf_preds = rf_model.predict(val_X)

rf_acc = accuracy_score(val_y, rf_preds)
rf_prec = precision_score(val_y, rf_preds)
rf_rec = recall_score(val_y, rf_preds)
rf_cm = confusion_matrix(val_y, rf_preds)

In [None]:
# Evaluate best Decision Tree with max_depth=5
dt_model = DecisionTreeClassifier(max_depth=5, random_state=1)
dt_model.fit(train_X, train_y)
dt_preds = dt_model.predict(val_X)

dt_acc = accuracy_score(val_y, dt_preds)
dt_prec = precision_score(val_y, dt_preds)
dt_rec = recall_score(val_y, dt_preds)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Generate and plot confusion matrix
cm = confusion_matrix(val_y, rf_preds)

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Did Not Survive', 'Survived'],
            yticklabels=['Did Not Survive', 'Survived'])

plt.title("Confusion Matrix – Random Forest")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.tight_layout()
plt.savefig("plots/titanic_confusion_matrix.png")  # Save to plots/
plt.show()

In [None]:
print(f"Decision Tree (d=5) → Accuracy: {dt_acc:.3f}, Precision: {dt_prec:.3f}, Recall: {dt_rec:.3f}")
print(f"Random Forest       → Accuracy: {rf_acc:.3f}, Precision: {rf_prec:.3f}, Recall: {rf_rec:.3f}")

In [None]:
# Visualize Random Forest feature importances
importances = rf_model.feature_importances_
feature_names = train_X.columns

pd.Series(importances, index=feature_names).sort_values().plot(
    kind='barh', title="Random Forest Feature Importances"
)

plt.xlabel("Importance")
plt.tight_layout()
plt.savefig("plots/titanic_feature_importance.png")  # Save to plots/
plt.show()
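
The bar chart shows impurity-based importances, which are derived from the training data and normalized to sum to 1. A common cross-check, sketched here on synthetic data rather than the Titanic set, is permutation importance measured on the validation split. All variable names below are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in the notebook this would be train_X / val_X
X_imp, y_imp = make_classification(
    n_samples=400, n_features=7, n_informative=4, random_state=1
)
Xtr_imp, Xva_imp, ytr_imp, yva_imp = train_test_split(X_imp, y_imp, random_state=1)

forest = RandomForestClassifier(n_estimators=50, random_state=1).fit(Xtr_imp, ytr_imp)

# Impurity-based importances are normalized, so they sum to 1
impurity_imp = forest.feature_importances_
print("Sum of impurity importances:", round(float(impurity_imp.sum()), 6))

# Permutation importance: drop in validation accuracy when one feature is shuffled
perm = permutation_importance(forest, Xva_imp, yva_imp, n_repeats=5, random_state=1)
for i in perm.importances_mean.argsort()[::-1]:
    print(f"feature {i}: impurity={impurity_imp[i]:.3f}, "
          f"permutation={perm.importances_mean[i]:.3f}")
```

If the two rankings disagree strongly, the impurity-based chart is worth reading with some caution.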

#### Results

| Model                | Accuracy | Precision | Recall |
|----------------------|----------|-----------|--------|
| Decision Tree (d=5)  | 0.857    | 0.884     | 0.718  |
| Random Forest        | 0.825    | 0.780     | 0.753  |
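
For a single number that balances the two error types, the F1 score (the harmonic mean of precision and recall) can be derived from the table. The values below are computed from the rounded table entries, so the third decimal is approximate.

```python
# Precision and recall copied from the results table (already rounded)
table = {
    "Decision Tree (d=5)": (0.884, 0.718),
    "Random Forest": (0.780, 0.753),
}

f1_scores = {}
for name, (p, r) in table.items():
    f1_scores[name] = 2 * p * r / (p + r)  # harmonic mean of precision and recall
    print(f"{name}: F1 = {f1_scores[name]:.3f}")
```

By this measure the depth-limited tree stays slightly ahead, at roughly 0.79 versus 0.77.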

#### Interpretation

- The Decision Tree had stronger accuracy and precision but slightly lower recall.
- The Random Forest improved recall, detecting more true positives at the cost of more false alarms.
- This reflects a classic **precision–recall tradeoff**.
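
The tradeoff can also be tuned within a single model by moving the decision threshold applied to predicted probabilities (for scikit-learn classifiers these come from `predict_proba`). The sketch below uses made-up scores and labels, so the numbers are purely illustrative.

```python
# Hypothetical predicted survival probabilities and true labels
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

def precision_recall_at(threshold):
    """Return (precision, recall) when predicting 1 for scores >= threshold.

    Assumes at least one positive prediction and one actual positive,
    which holds for the thresholds used below.
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    return tp / (tp + fp), tp / (tp + fn)

for t in (0.75, 0.50, 0.25):
    p, r = precision_recall_at(t)
    print(f"threshold={t:.2f}: precision={p:.2f}, recall={r:.2f}")
```

Lowering the threshold in this toy example raises recall from 0.50 to 1.00 while precision falls from 1.00 to about 0.67.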

The next section summarizes the findings.

## 7. Conclusion

This notebook demonstrated a full supervised learning workflow on the Titanic dataset, applying both decision trees and random forests to predict passenger survival.

### Summary of Findings

- **Baseline Decision Tree** (no tuning) achieved 80.3% accuracy.
- **Tree depth tuning** showed that max depths of 5–6 provided the best tradeoff between underfitting and overfitting.
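- **Random Forest** (default settings) scored slightly lower on accuracy (82.5% vs. 85.7%) and precision than the depth-5 tree, but achieved higher recall (75.3% vs. 71.8%) on this validation split.
- **Feature engineering** (encoding `Sex` and `Embarked`, imputing `Age`) produced a fully numeric feature matrix with no missing values, which the models could be trained on directly.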
- Visualizations (accuracy vs. depth, confusion matrix, feature importances) clarified model behavior.

### Key Learnings

- Precision and recall offer different insights; the ideal balance depends on application context.
- Limiting model complexity is a simple but powerful way to reduce overfitting.
- Ensembles like Random Forest are often more robust out of the box, but that is not guaranteed: here the untuned forest trailed the depth-limited tree on accuracy, so tuning and interpretability remain important.

---

Next steps could include:
- Hyperparameter tuning of `RandomForestClassifier` (e.g., `n_estimators`, `max_depth`)
- Exploring additional features (e.g., family size, title from name)
- Testing logistic regression or gradient boosting for comparison
- Performing cross-validation for more reliable evaluation
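
As a starting point for the last item, k-fold cross-validation averages performance over several train/validation splits instead of relying on a single one. This is a minimal sketch on synthetic data; in the notebook, `X_cv` and `y_cv` would be replaced by `X` and `y`, and the fold count of 5 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Titanic set
X_cv, y_cv = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=1
)

tree_cv = DecisionTreeClassifier(max_depth=5, random_state=1)

# 5-fold cross-validation; for classifiers an integer cv uses stratified folds
cv_scores = cross_val_score(tree_cv, X_cv, y_cv, cv=5, scoring="accuracy")

print("Fold accuracies:", cv_scores.round(3))
print(f"Mean: {cv_scores.mean():.3f}, Std: {cv_scores.std():.3f}")
```

The spread across folds gives a rough sense of how much a single-split result, such as the accuracy differences between depths, could be due to chance.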