# Titanic Survival Prediction

**Author:** Anik Tahabilder  
**Project:** 4 of 22 - Kaggle ML Portfolio  
**Dataset:** Titanic  
**Difficulty:** 3/10 | **Learning Value:** 7/10

---

## Why This Project?

The Titanic dataset is the **"Hello World" of machine learning classification**. In Project 1, we explored and understood the data. Now we'll build models to predict survival!

### What You'll Learn:

| Topic | Description |
|-------|-------------|
| **Data Preprocessing** | Handle missing values, encode categories |
| **Feature Engineering** | Create meaningful features from raw data |
| **Model Building** | Train multiple classification algorithms |
| **Model Evaluation** | Use proper metrics beyond accuracy |
| **Model Selection** | Choose the best model for deployment |

### The ML Pipeline:

```
Raw Data → Clean → Engineer Features → Split → Scale → Train → Evaluate → Select Best
```

---

## Table of Contents

1. [Part 1: Setup and Data Loading](#part1)
2. [Part 2: Data Preprocessing](#part2)
3. [Part 3: Feature Engineering](#part3)
4. [Part 4: Prepare for Modeling](#part4)
5. [Part 5: Model Training](#part5)
6. [Part 6: Model Evaluation](#part6)
7. [Part 7: Feature Importance](#part7)
8. [Part 8: Summary and Conclusions](#part8)

---

<a id='part1'></a>
# Part 1: Setup and Data Loading

---

## 1.1 Import Libraries

| Library | Purpose |
|---------|--------|
| **pandas/numpy** | Data manipulation |
| **matplotlib/seaborn** | Visualization |
| **sklearn** | Machine learning algorithms and tools |

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Classification models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Evaluation metrics
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report,
                             roc_auc_score, roc_curve, precision_recall_curve)

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Display settings
plt.style.use('seaborn-v0_8-whitegrid')
pd.set_option('display.precision', 3)

print("Libraries loaded successfully!")
import sklearn
print(f"scikit-learn version: {sklearn.__version__}")

## 1.2 Load the Data

In [None]:
# Load the Titanic dataset
df = sns.load_dataset('titanic')

print("=" * 60)
print("TITANIC DATASET LOADED")
print("=" * 60)
print(f"\nShape: {df.shape[0]} passengers, {df.shape[1]} features")
print(f"\nColumns: {list(df.columns)}")
df.head()

In [None]:
# Quick data overview
print("=" * 60)
print("DATA OVERVIEW")
print("=" * 60)
df.info()

In [None]:
# Check missing values
print("=" * 60)
print("MISSING VALUES")
print("=" * 60)

missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Missing': missing, 'Percent': missing_pct})
print(missing_df[missing_df['Missing'] > 0].sort_values('Missing', ascending=False))

print("\nKey issues to handle:")
print("  - age: 19.87% missing")
print("  - deck: 77.22% missing (too much - will drop)")
print("  - embarked: 0.22% missing (fill with mode)")
print("  - embark_town: 0.22% missing (fill with mode)")

---

<a id='part2'></a>
# Part 2: Data Preprocessing

---

## Why Preprocessing Matters

Raw data is messy. Machine learning algorithms need clean, structured input.

| Issue | Strategy |
|-------|----------|
| **Missing values** | Impute or drop |
| **Categorical variables** | Encode to numbers |
| **Redundant columns** | Remove |
| **Different scales** | Standardize |

## 2.1 Handle Missing Values

In [None]:
# Create a clean copy
data = df.copy()

# Strategy 1: Impute Age with median by Pclass and Sex
# WHY: Age likely varies by class and gender
print("Imputing Age...")
data['age'] = data.groupby(['pclass', 'sex'])['age'].transform(
    lambda x: x.fillna(x.median())
)
print(f"  Age missing after imputation: {data['age'].isnull().sum()}")

# Strategy 2: Fill embarked with mode
print("\nFilling Embarked...")
data['embarked'] = data['embarked'].fillna(data['embarked'].mode()[0])
data['embark_town'] = data['embark_town'].fillna(data['embark_town'].mode()[0])
print(f"  Embarked missing: {data['embarked'].isnull().sum()}")

# Strategy 3: Drop columns with too many missing values or redundant info
print("\nDropping columns...")
columns_to_drop = ['deck', 'alive', 'who', 'adult_male', 'class', 'embark_town']
data = data.drop(columns=columns_to_drop)
print(f"  Dropped: {columns_to_drop}")
print(f"  Remaining columns: {list(data.columns)}")

In [None]:
# Verify no missing values remain
print("=" * 60)
print("MISSING VALUES CHECK")
print("=" * 60)
print(data.isnull().sum())
print(f"\nTotal missing: {data.isnull().sum().sum()}")
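
The group-wise imputation above can be checked on a tiny frame. This is a minimal sketch with made-up values (the `toy` frame and its ages are assumptions for illustration, not rows from the Titanic data); it only shows that each missing age receives the median of its own `(pclass, sex)` group.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the Titanic dataset)
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "sex": ["female", "female", "female", "male", "male", "male"],
    "age": [30.0, np.nan, 40.0, 20.0, 22.0, np.nan],
})

# Each missing age is replaced by the median of its own (pclass, sex) group
toy["age_filled"] = toy.groupby(["pclass", "sex"])["age"].transform(
    lambda s: s.fillna(s.median())
)
print(toy)
```

One caveat of this approach: if every age in a group were missing, the group median would itself be NaN and the gap would remain, so a fallback to the overall median may be worth adding in that case.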

## 2.2 Encode Categorical Variables

ML algorithms need numbers, not strings!

| Encoding Method | When to Use |
|-----------------|-------------|
| **Label (Ordinal) Encoding** | Ordinal categories (Low/Medium/High), with the order given explicitly |
| **One-Hot Encoding** | Nominal categories (no order) |
| **Binary Encoding** | Two categories |

In [None]:
# Check categorical columns
print("Categorical columns:")
for col in data.select_dtypes(include=['object', 'category']).columns:
    print(f"  {col}: {data[col].unique()}")

In [None]:
# Encode categorical variables
print("=" * 60)
print("ENCODING CATEGORICAL VARIABLES")
print("=" * 60)

# Binary encoding for sex
data['sex'] = data['sex'].map({'male': 0, 'female': 1})
print("sex: male=0, female=1")

# One-hot encoding for embarked
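# dtype=int keeps the dummy columns as 0/1 (recent pandas versions default to True/False)
data = pd.get_dummies(data, columns=['embarked'], prefix='embarked', drop_first=True, dtype=int)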
print("embarked: One-hot encoded (dropped first category)")

# Convert boolean to int
data['alone'] = data['alone'].astype(int)
print("alone: Converted to 0/1")

print(f"\nFinal columns: {list(data.columns)}")
print(f"Shape: {data.shape}")

In [None]:
# View the cleaned data
print("Cleaned Data Sample:")
data.head(10)
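
To make the two encodings used above concrete, here is a small sketch on a three-row frame. The `toy` frame is an assumption for illustration; the calls mirror the ones in the cell above.

```python
import pandas as pd

# Toy data (hypothetical rows)
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "Q"],
})

# Two categories: a single 0/1 column is enough
toy["sex"] = toy["sex"].map({"male": 0, "female": 1})

# Three unordered categories: one indicator column per category, minus the first
encoded = pd.get_dummies(toy, columns=["embarked"], prefix="embarked",
                         drop_first=True, dtype=int)
print(encoded)
```

With `drop_first=True` the first category in sorted order (`C`) has no column of its own; a row with `embarked_Q = 0` and `embarked_S = 0` therefore means the passenger embarked at `C`.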

---

<a id='part3'></a>
# Part 3: Feature Engineering

---

## What is Feature Engineering?

Creating new features from existing data to help the model learn better patterns.

> "Feature engineering is the art of extracting more information from existing data."

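### Ideas for Titanic:
- **Family size** = `sibsp` + `parch` + 1
- **Is alone** = family size == 1 (the seaborn dataset already provides this as `alone`)
- **Title** extracted from the passenger name (needs the Kaggle version of the data; the seaborn dataset has no name column)
- **Age groups** (bins)
- **Fare per person**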

In [None]:
# Feature Engineering
print("=" * 60)
print("FEATURE ENGINEERING")
print("=" * 60)

# 1. Family Size
data['family_size'] = data['sibsp'] + data['parch'] + 1
print("\n1. Created 'family_size' = sibsp + parch + 1")
print(f"   Range: {data['family_size'].min()} - {data['family_size'].max()}")

# 2. Is Small Family (optimal for survival based on EDA)
data['is_small_family'] = ((data['family_size'] >= 2) & (data['family_size'] <= 4)).astype(int)
print("\n2. Created 'is_small_family' (family size 2-4)")

# 3. Age Groups
data['age_group'] = pd.cut(data['age'], 
                           bins=[0, 12, 18, 35, 60, 100],
                           labels=[0, 1, 2, 3, 4])  # Child, Teen, Adult, Middle, Senior
data['age_group'] = data['age_group'].astype(int)
print("\n3. Created 'age_group' (0=Child, 1=Teen, 2=Adult, 3=Middle, 4=Senior)")

# 4. Fare per person
data['fare_per_person'] = data['fare'] / data['family_size']
print("\n4. Created 'fare_per_person' = fare / family_size")

# 5. Is Child
data['is_child'] = (data['age'] < 12).astype(int)
print("\n5. Created 'is_child' (age < 12)")

print(f"\nFinal feature count: {data.shape[1]} columns")

In [None]:
# View engineered features
print("Data with Engineered Features:")
data.head(10)
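
One detail worth checking is where the bin edges fall. The sketch below uses a handful of made-up ages (an assumption for illustration) and the same bins as above. `pd.cut` closes intervals on the right by default, so the first bin is (0, 12].

```python
import pandas as pd

# Hypothetical ages chosen to sit on or near the bin edges
ages = pd.Series([0.42, 12.0, 12.5, 18.0, 35.0, 80.0])

# Default right=True gives (0, 12], (12, 18], (18, 35], (35, 60], (60, 100]
groups = pd.cut(ages, bins=[0, 12, 18, 35, 60, 100],
                labels=[0, 1, 2, 3, 4]).astype(int)
is_child = (ages < 12).astype(int)

print(pd.DataFrame({"age": ages, "age_group": groups, "is_child": is_child}))
```

A passenger aged exactly 12 lands in `age_group` 0 (Child) but gets `is_child = 0`, because the flag uses a strict `<`. If the two features are meant to agree, one of the two boundaries would need to change.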

In [None]:
# Visualize engineered features vs survival
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Family Size
survival_by_family = data.groupby('family_size')['survived'].mean()
axes[0, 0].bar(survival_by_family.index, survival_by_family.values, color='steelblue', edgecolor='black')
axes[0, 0].set_xlabel('Family Size')
axes[0, 0].set_ylabel('Survival Rate')
axes[0, 0].set_title('Survival Rate by Family Size', fontweight='bold')
axes[0, 0].axhline(y=data['survived'].mean(), color='red', linestyle='--', label='Overall Rate')
axes[0, 0].legend()

# 2. Age Group
survival_by_age = data.groupby('age_group')['survived'].mean()
labels = ['Child', 'Teen', 'Adult', 'Middle', 'Senior']
axes[0, 1].bar(range(len(survival_by_age)), survival_by_age.values, color='darkorange', edgecolor='black')
axes[0, 1].set_xticks(range(len(labels)))
axes[0, 1].set_xticklabels(labels)
axes[0, 1].set_xlabel('Age Group')
axes[0, 1].set_ylabel('Survival Rate')
axes[0, 1].set_title('Survival Rate by Age Group', fontweight='bold')
axes[0, 1].axhline(y=data['survived'].mean(), color='red', linestyle='--')

# 3. Is Small Family
survival_by_small = data.groupby('is_small_family')['survived'].mean()
axes[1, 0].bar(['Large/Alone', 'Small (2-4)'], survival_by_small.values, 
               color=['crimson', 'forestgreen'], edgecolor='black')
axes[1, 0].set_ylabel('Survival Rate')
axes[1, 0].set_title('Survival Rate: Small Family Advantage', fontweight='bold')

# 4. Is Child
survival_by_child = data.groupby('is_child')['survived'].mean()
axes[1, 1].bar(['Adult', 'Child'], survival_by_child.values,
               color=['steelblue', 'gold'], edgecolor='black')
axes[1, 1].set_ylabel('Survival Rate')
axes[1, 1].set_title('Survival Rate: Children vs Adults', fontweight='bold')

plt.tight_layout()
plt.show()

print("\nKey Insights:")
print("- Small families (2-4) had the highest survival rates")
print("- Children had better survival rates than adults")
print("- Being alone or in very large family reduced survival chances")

---

<a id='part4'></a>
# Part 4: Prepare for Modeling

---

## 4.1 Define Features and Target

In [None]:
# Define features (X) and target (y)
print("=" * 60)
print("FEATURE SELECTION")
print("=" * 60)

# All columns except 'survived'
feature_cols = [col for col in data.columns if col != 'survived']

X = data[feature_cols]
y = data['survived']

print(f"\nFeatures ({len(feature_cols)}): {feature_cols}")
print(f"\nX shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"\nTarget distribution:")
print(y.value_counts())
print(f"\nSurvival rate: {y.mean()*100:.2f}%")

## 4.2 Train-Test Split

In [None]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # Maintain class proportions
)

print("=" * 60)
print("TRAIN-TEST SPLIT")
print("=" * 60)
print(f"\nTraining set: {len(X_train)} samples ({len(X_train)/len(X)*100:.0f}%)")
print(f"Testing set: {len(X_test)} samples ({len(X_test)/len(X)*100:.0f}%)")
print(f"\nTraining survival rate: {y_train.mean()*100:.2f}%")
print(f"Testing survival rate: {y_test.mean()*100:.2f}%")
print("\nStratification maintained!")
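
The effect of `stratify` is easiest to see on synthetic labels. This sketch assumes a made-up target with exactly 40% positives (`y_toy` and `X_toy` are hypothetical) and compares a stratified split with a plain random one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 40 positives, 60 negatives (not the Titanic target)
y_toy = np.array([1] * 40 + [0] * 60)
X_toy = np.arange(100).reshape(-1, 1)

# Stratified: the test set keeps the 40/60 class ratio
_, _, _, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Plain random split: the ratio in the test set can drift
_, _, _, y_te_plain = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

print(f"Stratified test positive rate: {y_te_strat.mean():.2f}")
print(f"Unstratified test positive rate: {y_te_plain.mean():.2f}")
```

On a dataset as small as this one, a few percentage points of drift in the class ratio can visibly move test accuracy, which is the main reason to stratify.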

## 4.3 Feature Scaling

Some algorithms are sensitive to feature scales:

| Needs Scaling | Doesn't Need Scaling |
|---------------|---------------------|
| Logistic Regression | Decision Tree |
| KNN | Random Forest |
| SVM | Gradient Boosting |
| Neural Networks | Naive Bayes |

In [None]:
# Scale features
scaler = StandardScaler()

# Fit on training data only, transform both
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("=" * 60)
print("FEATURE SCALING")
print("=" * 60)
print("\nBefore scaling (first 3 features):")
print(f"  Mean: {X_train.iloc[:, :3].mean().values.round(2)}")
print(f"  Std:  {X_train.iloc[:, :3].std().values.round(2)}")

print("\nAfter scaling (first 3 features):")
print(f"  Mean: {X_train_scaled[:, :3].mean(axis=0).round(4)}")
print(f"  Std:  {X_train_scaled[:, :3].std(axis=0).round(4)}")
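
The "fit on training data only" rule can be verified numerically. The sketch below uses random data (the arrays and the name `scaler_demo` are assumptions for illustration) and confirms that the test rows are transformed with the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=50, scale=10, size=(200, 2))
X_te = rng.normal(loc=50, scale=10, size=(50, 2))

# Statistics come from the training rows only
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)  # reuses the training mean and std

# Same result computed by hand with the training statistics
X_te_manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)

print("train mean:", X_tr_s.mean(axis=0).round(4))
print("test mean: ", X_te_s.mean(axis=0).round(4))
```

The scaled test set will generally not have a mean of exactly 0, and that is expected. Note also that `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `.std()` defaults to `ddof=1`, so the "before" and "after" standard deviations printed above are computed with slightly different formulas.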

---

<a id='part5'></a>
# Part 5: Model Training

---

## Classification Algorithms

| Algorithm | Type | Strength |
|-----------|------|----------|
| **Logistic Regression** | Linear | Simple, interpretable |
| **KNN** | Instance-based | Non-parametric, no assumptions about the data distribution |
| **Decision Tree** | Tree-based | Easy to understand |
| **Random Forest** | Ensemble | Robust, less prone to overfitting than a single tree |
| **Gradient Boosting** | Ensemble | High accuracy |
| **SVM** | Kernel-based | Effective in high dimensions |
| **Naive Bayes** | Probabilistic | Fast, works with small data |

In [None]:
# Define models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(random_state=42, max_depth=5),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='rbf', random_state=42, probability=True),
    'Naive Bayes': GaussianNB()
}

print("Models to train:")
for i, name in enumerate(models.keys(), 1):
    print(f"  {i}. {name}")

In [None]:
# Train all models and collect results
results = {}

print("=" * 70)
print("TRAINING MODELS")
print("=" * 70)

for name, model in models.items():
    # Use scaled data for distance/gradient-based algorithms
    if name in ['Logistic Regression', 'K-Nearest Neighbors', 'SVM']:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
        y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    # Store results
    results[name] = {
        'model': model,
        'predictions': y_pred,
        'probabilities': y_pred_proba,
        'accuracy': accuracy,
        'roc_auc': roc_auc
    }
    
    print(f"\n{name}:")
    print(f"  Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
    print(f"  ROC-AUC:  {roc_auc:.4f}")

print("\n" + "=" * 70)
print("All models trained successfully!")

---

<a id='part6'></a>
# Part 6: Model Evaluation

---

## 6.1 Accuracy Comparison

In [None]:
# Create comparison table
comparison = pd.DataFrame({
    'Model': list(results.keys()),
    'Accuracy': [r['accuracy'] for r in results.values()],
    'ROC-AUC': [r['roc_auc'] for r in results.values()]
}).sort_values('Accuracy', ascending=False).reset_index(drop=True)

comparison['Rank'] = range(1, len(comparison) + 1)
comparison = comparison[['Rank', 'Model', 'Accuracy', 'ROC-AUC']]

print("=" * 60)
print("MODEL COMPARISON")
print("=" * 60)
print(comparison.to_string(index=False))

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Sort by accuracy
sorted_results = dict(sorted(results.items(), key=lambda x: x[1]['accuracy'], reverse=True))
names = list(sorted_results.keys())
accuracies = [sorted_results[n]['accuracy'] * 100 for n in names]
roc_aucs = [sorted_results[n]['roc_auc'] for n in names]

colors = plt.cm.RdYlGn(np.linspace(0.3, 0.9, len(names)))

# Accuracy
bars1 = axes[0].barh(names[::-1], accuracies[::-1], color=colors[::-1], edgecolor='black')
axes[0].set_xlim(70, 90)
axes[0].set_xlabel('Accuracy (%)')
axes[0].set_title('Model Accuracy Comparison', fontweight='bold')
for bar, acc in zip(bars1, accuracies[::-1]):
    axes[0].text(bar.get_width() + 0.3, bar.get_y() + bar.get_height()/2,
                 f'{acc:.1f}%', va='center', fontweight='bold')

# ROC-AUC
bars2 = axes[1].barh(names[::-1], roc_aucs[::-1], color=colors[::-1], edgecolor='black')
axes[1].set_xlim(0.7, 0.95)
axes[1].set_xlabel('ROC-AUC Score')
axes[1].set_title('Model ROC-AUC Comparison', fontweight='bold')
for bar, auc in zip(bars2, roc_aucs[::-1]):
    axes[1].text(bar.get_width() + 0.005, bar.get_y() + bar.get_height()/2,
                 f'{auc:.3f}', va='center', fontweight='bold')

plt.tight_layout()
plt.show()

best_model = names[0]
print(f"\nBest Model: {best_model}")
print(f"  Accuracy: {sorted_results[best_model]['accuracy']*100:.2f}%")
print(f"  ROC-AUC: {sorted_results[best_model]['roc_auc']:.4f}")

## 6.2 Cross-Validation

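An accuracy estimate from a single train-test split depends on which rows happen to land in the test set (here only about 179 passengers). Cross-validation averages over several splits and gives a more robust estimate.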

In [None]:
# 5-Fold Cross-Validation
print("=" * 70)
print("5-FOLD CROSS-VALIDATION")
print("=" * 70)

cv_results = {}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

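from sklearn.pipeline import make_pipeline

for name, model in models.items():
    # Wrap scale-sensitive models in a pipeline so the scaler is re-fit on each
    # training fold only (fitting it on all of X would leak validation-fold statistics)
    if name in ['Logistic Regression', 'K-Nearest Neighbors', 'SVM']:
        estimator = make_pipeline(StandardScaler(), model)
    else:
        estimator = model
    
    scores = cross_val_score(estimator, X, y, cv=cv, scoring='accuracy')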
    cv_results[name] = {
        'mean': scores.mean(),
        'std': scores.std(),
        'scores': scores
    }
    
    print(f"\n{name}:")
    print(f"  Scores: {scores.round(4)}")
    print(f"  Mean: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
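
For readers who want to see what happens inside each fold, here is a minimal sketch on synthetic data. `make_classification` and the `*_demo` names are assumptions for illustration. It runs the folds by hand, fitting the scaler on the training part of each fold, and then compares the result with `cross_val_score` on a scaler-plus-model pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_idx, val_idx in cv_demo.split(X_demo, y_demo):
    # Scaler statistics come from this fold's training rows only
    fold_scaler = StandardScaler().fit(X_demo[train_idx])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(fold_scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    fold_scores.append(
        clf.score(fold_scaler.transform(X_demo[val_idx]), y_demo[val_idx])
    )
fold_scores = np.array(fold_scores)

# The same procedure expressed as a pipeline
pipe_scores = cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    X_demo, y_demo, cv=cv_demo, scoring="accuracy",
)

print(f"Manual folds:   {fold_scores.round(4)}")
print(f"Pipeline folds: {pipe_scores.round(4)}")
```

The pipeline form is shorter and harder to get wrong, since the scaler can never see the validation rows of a fold.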

In [None]:
# Visualize CV results
cv_sorted = dict(sorted(cv_results.items(), key=lambda x: x[1]['mean'], reverse=True))

fig, ax = plt.subplots(figsize=(12, 6))

names_cv = list(cv_sorted.keys())
means = [cv_sorted[n]['mean'] * 100 for n in names_cv]
stds = [cv_sorted[n]['std'] * 100 for n in names_cv]

colors = plt.cm.RdYlGn(np.linspace(0.3, 0.9, len(names_cv)))
bars = ax.barh(names_cv[::-1], means[::-1], xerr=stds[::-1],
               color=colors[::-1], edgecolor='black', capsize=5)

ax.set_xlim(70, 90)
ax.set_xlabel('Cross-Validation Accuracy (%)')
ax.set_title('5-Fold Cross-Validation Results', fontweight='bold')

for bar, mean, std in zip(bars, means[::-1], stds[::-1]):
    ax.text(bar.get_width() + std + 0.5, bar.get_y() + bar.get_height()/2,
            f'{mean:.1f}%', va='center', fontweight='bold')

plt.tight_layout()
plt.show()

best_cv = names_cv[0]
print(f"\nBest Model (CV): {best_cv}")
print(f"  Mean Accuracy: {cv_sorted[best_cv]['mean']*100:.2f}%")
print(f"  Spread (+/- 2 std): {cv_sorted[best_cv]['std']*200:.2f}%")

## 6.3 Confusion Matrices

In [None]:
# Plot confusion matrices for top 4 models
top_models = list(sorted_results.keys())[:4]

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()

for i, name in enumerate(top_models):
    cm = confusion_matrix(y_test, results[name]['predictions'])
    
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[i],
                xticklabels=['Died', 'Survived'], yticklabels=['Died', 'Survived'],
                annot_kws={'size': 14, 'weight': 'bold'})
    axes[i].set_title(f'{name}\nAccuracy: {results[name]["accuracy"]*100:.1f}%', fontweight='bold')
    axes[i].set_xlabel('Predicted')
    axes[i].set_ylabel('Actual')

plt.suptitle('Confusion Matrices - Top 4 Models', fontweight='bold', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

print("Reading Confusion Matrix:")
print("  - Top-left (TN): Correctly predicted death")
print("  - Top-right (FP): Predicted survival but died")
print("  - Bottom-left (FN): Predicted death but survived")
print("  - Bottom-right (TP): Correctly predicted survival")
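
The four cells map directly onto precision, recall and F1. The sketch below uses ten made-up labels (an assumption for illustration, with 1 = survived and 0 = died) to show the cell order returned by `confusion_matrix` and the metrics derived from it.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = survived, 0 = died
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels [0, 1], ravel() yields TN, FP, FN, TP (row-major order)
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of predicted survivors, how many survived
recall = tp / (tp + fn)     # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```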

## 6.4 ROC Curves

In [None]:
# Plot ROC curves
fig, ax = plt.subplots(figsize=(10, 8))

colors = plt.cm.tab10(np.linspace(0, 1, len(results)))

for (name, res), color in zip(results.items(), colors):
    fpr, tpr, _ = roc_curve(y_test, res['probabilities'])
    ax.plot(fpr, tpr, color=color, linewidth=2,
            label=f"{name} (AUC = {res['roc_auc']:.3f})")

# Diagonal line (random classifier)
ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')

ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curves - All Models', fontweight='bold', fontsize=14)
ax.legend(loc='lower right')
ax.set_xlim([0, 1])
ax.set_ylim([0, 1.05])

plt.tight_layout()
plt.show()

print("\nROC Curve Interpretation:")
print("  - Curve closer to top-left = better model")
print("  - AUC (Area Under Curve) closer to 1.0 = better discrimination")
print("  - AUC = 0.5 means random guessing")
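
AUC also has a ranking interpretation: it equals the probability that a randomly chosen survivor receives a higher score than a randomly chosen non-survivor. The sketch below checks this on six made-up scores (an assumption for illustration) against `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Fraction of (positive, negative) pairs where the positive scores higher;
# ties count as half a win
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(f"Pairwise AUC:  {pairwise_auc:.4f}")
print(f"roc_auc_score: {roc_auc_score(y_true, scores):.4f}")
```

This view explains why AUC ignores the 0.5 decision threshold: only the ordering of the scores matters, not their absolute values.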

## 6.5 Classification Report for Best Model

In [None]:
# Detailed classification report
best_model_name = list(sorted_results.keys())[0]
best_predictions = results[best_model_name]['predictions']

print("=" * 70)
print(f"CLASSIFICATION REPORT: {best_model_name}")
print("=" * 70)
print(classification_report(y_test, best_predictions, target_names=['Died', 'Survived']))

print("\nMetric Definitions:")
print("  - Precision: Of predicted survivors, how many actually survived?")
print("  - Recall: Of actual survivors, how many did we identify?")
print("  - F1-Score: Balance between precision and recall")

---

<a id='part7'></a>
# Part 7: Feature Importance

---

Which features matter most for predicting survival?

In [None]:
# Feature importance from Random Forest
rf_model = results['Random Forest']['model']

feature_importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("=" * 60)
print("FEATURE IMPORTANCE (Random Forest)")
print("=" * 60)
print(feature_importance.to_string(index=False))

In [None]:
# Visualize feature importance
fig, ax = plt.subplots(figsize=(10, 8))

colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(feature_importance)))
bars = ax.barh(feature_importance['Feature'][::-1], 
               feature_importance['Importance'][::-1] * 100,
               color=colors[::-1], edgecolor='black')

ax.set_xlabel('Importance (%)', fontsize=12)
ax.set_title('Feature Importance for Survival Prediction', fontweight='bold', fontsize=14)

for bar, imp in zip(bars, feature_importance['Importance'][::-1]):
    ax.text(bar.get_width() + 0.3, bar.get_y() + bar.get_height()/2,
            f'{imp*100:.1f}%', va='center', fontweight='bold')

plt.tight_layout()
plt.show()

print("\nTop 5 Most Important Features:")
for i, row in feature_importance.head(5).iterrows():
    print(f"  {row['Feature']}: {row['Importance']*100:.1f}%")
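
The importances above are impurity-based and are computed from the training data. As a cross-check, scikit-learn also offers permutation importance, which measures how much a held-out score drops when one feature is shuffled. The following is a minimal sketch on synthetic data; the dataset and the names `rf_demo` and `perm` are assumptions for illustration, not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 6 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and record the drop in accuracy
perm = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=0)

for i in range(X_demo.shape[1]):
    print(f"feature {i}: impurity={rf_demo.feature_importances_[i]:.3f}  "
          f"permutation={perm.importances_mean[i]:.3f}")
```

When the two rankings broadly agree, the impurity-based ranking is more trustworthy. A large disagreement would be a reason to look more closely at the feature in question.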

In [None]:
# Compare feature importance across models
# Logistic Regression coefficients (absolute value)
lr_model = results['Logistic Regression']['model']
lr_importance = pd.DataFrame({
    'Feature': feature_cols,
    'Coefficient': np.abs(lr_model.coef_[0])
}).sort_values('Coefficient', ascending=False)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Random Forest
axes[0].barh(feature_importance['Feature'][::-1], 
             feature_importance['Importance'][::-1] * 100,
             color='steelblue', edgecolor='black')
axes[0].set_xlabel('Importance (%)')
axes[0].set_title('Random Forest Feature Importance', fontweight='bold')

# Logistic Regression
axes[1].barh(lr_importance['Feature'][::-1], 
             lr_importance['Coefficient'][::-1],
             color='darkorange', edgecolor='black')
axes[1].set_xlabel('|Coefficient|')
axes[1].set_title('Logistic Regression Coefficients', fontweight='bold')

plt.tight_layout()
plt.show()

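# Report the overlap between the two top-5 lists instead of assuming it
shared = set(feature_importance['Feature'].head(5)) & set(lr_importance['Feature'].head(5))
print(f"Features in the top 5 of both models: {sorted(shared)}")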

---

<a id='part8'></a>
# Part 8: Summary and Conclusions

---

## Summary Dashboard

In [None]:
# Create summary visualization
fig = plt.figure(figsize=(16, 10))
gs = fig.add_gridspec(2, 3, hspace=0.3, wspace=0.3)

# 1. Model Comparison (accuracy)
ax1 = fig.add_subplot(gs[0, :])
models_sorted = list(sorted_results.keys())
acc_sorted = [sorted_results[m]['accuracy']*100 for m in models_sorted]
colors = plt.cm.RdYlGn(np.linspace(0.3, 0.9, len(models_sorted)))
bars = ax1.bar(models_sorted, acc_sorted, color=colors, edgecolor='black')
ax1.set_ylim(70, 90)
ax1.set_ylabel('Accuracy (%)')
ax1.set_title('Model Accuracy Comparison', fontweight='bold', fontsize=14)
for bar, acc in zip(bars, acc_sorted):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.3,
             f'{acc:.1f}%', ha='center', fontweight='bold')
ax1.tick_params(axis='x', rotation=15)

# 2. Feature Importance
ax2 = fig.add_subplot(gs[1, 0])
top_features = feature_importance.head(6)
ax2.barh(top_features['Feature'][::-1], 
         top_features['Importance'][::-1] * 100,
         color=plt.cm.viridis(np.linspace(0.3, 0.9, 6))[::-1], edgecolor='black')
ax2.set_xlabel('Importance (%)')
ax2.set_title('Top Features', fontweight='bold')

# 3. Best Model Confusion Matrix
ax3 = fig.add_subplot(gs[1, 1])
best_cm = confusion_matrix(y_test, results[best_model_name]['predictions'])
sns.heatmap(best_cm, annot=True, fmt='d', cmap='Blues', ax=ax3,
            xticklabels=['Died', 'Survived'], yticklabels=['Died', 'Survived'],
            annot_kws={'size': 14, 'weight': 'bold'})
ax3.set_title(f'Best Model: {best_model_name}', fontweight='bold')
ax3.set_xlabel('Predicted')
ax3.set_ylabel('Actual')

# 4. Survival Distribution
ax4 = fig.add_subplot(gs[1, 2])
survival_counts = y.value_counts()
ax4.pie(survival_counts.values, labels=['Died', 'Survived'], autopct='%1.1f%%',
        colors=['#ff6b6b', '#4ecdc4'], explode=(0.02, 0.02), shadow=True,
        textprops={'fontsize': 12, 'fontweight': 'bold'})
ax4.set_title('Survival Distribution', fontweight='bold')

plt.suptitle('TITANIC SURVIVAL PREDICTION - SUMMARY DASHBOARD', 
             fontweight='bold', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()

---

## Key Findings

### 1. Model Performance
- All models achieved **78-83% accuracy**
- **Best models**: Gradient Boosting, Random Forest, Logistic Regression
- Cross-validation confirms stable performance

### 2. Most Important Features
| Feature | Importance |
|---------|------------|
| **Sex** | Highest - Women had much higher survival rates |
| **Fare** | Higher fares = higher class = better survival |
| **Age** | Children prioritized in evacuation |
| **Pclass** | 1st class had better access to lifeboats |

### 3. What We Learned

| Topic | Key Insight |
|-------|-------------|
| **Preprocessing** | Handling missing values is crucial |
| **Feature Engineering** | Domain knowledge improves predictions |
| **Model Selection** | Try multiple algorithms, compare fairly |
| **Evaluation** | Use cross-validation and several metrics, not just accuracy on a single split |

---

## Classification Checklist

- [x] Load and explore data
- [x] Handle missing values
- [x] Encode categorical variables
- [x] Engineer new features
- [x] Split into train/test
- [x] Scale features
- [x] Train multiple models
- [x] Evaluate with multiple metrics
- [x] Cross-validate
- [x] Analyze feature importance

---

**End of Titanic Survival Prediction Tutorial**

In [None]:
# Final summary
print("="*70)
print("TITANIC SURVIVAL PREDICTION - FINAL SUMMARY")
print("="*70)

print(f"\nDATASET")
print(f"   Samples: {len(data)}")
print(f"   Features: {len(feature_cols)}")
print(f"   Survival rate: {y.mean()*100:.1f}%")

print(f"\nBEST MODEL")
print(f"   Name: {best_model_name}")
print(f"   Test Accuracy: {results[best_model_name]['accuracy']*100:.2f}%")
print(f"   ROC-AUC: {results[best_model_name]['roc_auc']:.4f}")

print(f"\nALL MODEL ACCURACIES")
for name in sorted_results:
    print(f"   {name}: {sorted_results[name]['accuracy']*100:.1f}%")

print(f"\nTOP FEATURES")
for _, row in feature_importance.head(5).iterrows():
    print(f"   {row['Feature']}: {row['Importance']*100:.1f}%")

print("\n" + "="*70)
print("PREDICTION COMPLETE!")
print("="*70)