# 🚢 Binary Classification Masterclass: Predicting Titanic Survival

## A Complete Guide for Beginners

**Welcome!** This notebook will take you from zero to hero in binary classification. We'll use the famous Titanic dataset to predict whether a passenger survived or not.

### 📚 What You'll Learn:
1. **Data Exploration** - Understanding your dataset
2. **Statistical Testing** - Verifying ML assumptions
3. **Data Preprocessing** - Cleaning and preparing data
4. **Multiple Models** - Logistic Regression, Random Forest, SVM, XGBoost
5. **Model Evaluation** - Accuracy, Precision, Recall, F1, ROC-AUC
6. **Model Improvement** - Hyperparameter tuning and feature engineering

---

In [None]:
# 📦 Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import chi2_contingency, ttest_ind, shapiro, levene
import warnings
warnings.filterwarnings('ignore')

# Machine Learning libraries
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (classification_report, confusion_matrix, 
                             accuracy_score, precision_score, recall_score, 
                             f1_score, roc_auc_score, roc_curve, 
                             precision_recall_curve)

# For advanced models (install if needed: pip install xgboost)
try:
    import xgboost as xgb
    XGBOOST_AVAILABLE = True
except ImportError:
    XGBOOST_AVAILABLE = False
    print("XGBoost not installed. Install with: pip install xgboost")

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ All libraries imported successfully!")

## 1️⃣ Data Loading and Initial Exploration

The Titanic dataset is built into Seaborn. It contains information about passengers including:
- **survived**: 0 = No, 1 = Yes (our target variable)
- **pclass**: Ticket class (1st, 2nd, 3rd)
- **sex**: Male/Female
- **age**: Age in years
- **sibsp**: Number of siblings/spouses aboard
- **parch**: Number of parents/children aboard
- **fare**: Passenger fare
- **embarked**: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
- **class**: Same as pclass but categorical
- **who**: man/woman/child
- **adult_male**: True/False
- **deck**: Deck level (many missing values)
- **embark_town**: Full name of embarkation port
- **alive**: yes/no (same as survived)
- **alone**: True/False

In [None]:
# 🚢 Load the Titanic dataset from Seaborn
df = sns.load_dataset('titanic')

print("Dataset Shape:", df.shape)
print("\n" + "="*50)
print("FIRST 5 ROWS:")
print("="*50)
display(df.head())

print("\n" + "="*50)
print("DATASET INFO:")
print("="*50)
df.info()

In [None]:
# 📊 Basic statistics
print("="*50)
print("STATISTICAL SUMMARY:")
print("="*50)
display(df.describe())

print("\n" + "="*50)
print("MISSING VALUES:")
print("="*50)
missing = df.isnull().sum()
missing_percent = (missing / len(df)) * 100
missing_df = pd.DataFrame({'Missing Count': missing, 'Percentage': missing_percent})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Percentage', ascending=False)
display(missing_df)

print("\n" + "="*50)
print("TARGET VARIABLE DISTRIBUTION:")
print("="*50)
survival_rate = df['survived'].value_counts(normalize=True) * 100
print(f"Did not survive (0): {survival_rate[0]:.2f}%")
print(f"Survived (1): {survival_rate[1]:.2f}%")

## 2️⃣ Exploratory Data Analysis (EDA)

Let's visualize the data to understand patterns before building models.

In [None]:
# 📊 Create comprehensive visualizations
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# 1. Survival distribution
ax1 = axes[0, 0]
survival_counts = df['survived'].value_counts()
colors = ['#ff6b6b', '#4ecdc4']
ax1.pie(survival_counts, labels=['Did not survive', 'Survived'], 
        autopct='%1.1f%%', colors=colors, startangle=90)
ax1.set_title('Survival Distribution', fontsize=14, fontweight='bold')

# 2. Survival by Gender
ax2 = axes[0, 1]
sns.countplot(data=df, x='sex', hue='survived', ax=ax2, palette=colors)
ax2.set_title('Survival by Gender', fontsize=14, fontweight='bold')
ax2.legend(['Did not survive', 'Survived'])

# 3. Survival by Class
ax3 = axes[0, 2]
sns.countplot(data=df, x='pclass', hue='survived', ax=ax3, palette=colors)
ax3.set_title('Survival by Passenger Class', fontsize=14, fontweight='bold')
ax3.legend(['Did not survive', 'Survived'])

# 4. Age distribution by survival
ax4 = axes[1, 0]
sns.histplot(data=df, x='age', hue='survived', bins=30, kde=True, ax=ax4, palette=colors)
ax4.set_title('Age Distribution by Survival', fontsize=14, fontweight='bold')

# 5. Fare distribution by survival
ax5 = axes[1, 1]
sns.boxplot(data=df, x='survived', y='fare', ax=ax5, palette=colors)
ax5.set_title('Fare Distribution by Survival', fontsize=14, fontweight='bold')
ax5.set_xticklabels(['Did not survive', 'Survived'])

# 6. Correlation heatmap
ax6 = axes[1, 2]
# Select numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
corr_matrix = df[numeric_cols].corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0, ax=ax6)
ax6.set_title('Correlation Matrix', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

# 📈 Print key insights
print("\n" + "="*60)
print("🔍 KEY INSIGHTS FROM EDA:")
print("="*60)
print(f"1. Overall survival rate: {df['survived'].mean():.2%}")
print(f"2. Female survival rate: {df[df['sex']=='female']['survived'].mean():.2%}")
print(f"3. Male survival rate: {df[df['sex']=='male']['survived'].mean():.2%}")
print(f"4. 1st class survival rate: {df[df['pclass']==1]['survived'].mean():.2%}")
print(f"5. 3rd class survival rate: {df[df['pclass']==3]['survived'].mean():.2%}")
print(f"6. Average age of survivors: {df[df['survived']==1]['age'].mean():.1f}")
print(f"7. Average age of non-survivors: {df[df['survived']==0]['age'].mean():.1f}")

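## 3️⃣ Statistical Testing: Verifying ML Assumptions 🔬

Before applying machine learning algorithms, it helps to check a few statistical properties of the data. These tests show which features are associated with survival and whether the assumptions behind the tests themselves (normality, equal variances) hold. Treat them as diagnostic checks rather than hard prerequisites for classification.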

### Key Tests We'll Perform:

1. **Chi-Square Test** - Tests independence between categorical variables and survival
   - *Null Hypothesis (H0)*: The categorical variable is independent of survival
   - *Alternative Hypothesis (H1)*: The categorical variable is associated with survival

2. **T-Test** - Compares means of continuous variables between survived/did not survive groups
   - *Null Hypothesis (H0)*: There is no difference in means between groups
   - *Alternative Hypothesis (H1)*: There is a significant difference in means

3. **Shapiro-Wilk Test** - Tests for normality (important for parametric tests)
   - *Null Hypothesis (H0)*: Data is normally distributed
   - *Alternative Hypothesis (H1)*: Data is not normally distributed

4. **Levene's Test** - Tests for equal variances (homoscedasticity)
   - *Null Hypothesis (H0)*: Groups have equal variances
   - *Alternative Hypothesis (H1)*: Groups have unequal variances

**Significance Level (α)**: 0.05 (5%)
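
To make the chi-square test concrete, here is a minimal sketch that computes the statistic by hand for sex versus survival. The counts are an assumption: they are the figures commonly reported for the 891-row Titanic data, so verify them with `pd.crosstab` on your own copy. Note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so its statistic should come out slightly lower than this uncorrected one.

```python
# Observed counts: rows = sex, columns = (did not survive, survived).
# Assumption: commonly reported counts for the 891-row dataset.
observed = {
    "female": (81, 233),
    "male": (468, 109),
}

row_totals = {sex: sum(counts) for sex, counts in observed.items()}
col_totals = [sum(counts[j] for counts in observed.values()) for j in range(2)]
grand_total = sum(row_totals.values())

# chi2 = sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total (independence under H0)
chi2_by_hand = 0.0
for sex, counts in observed.items():
    for j, obs in enumerate(counts):
        expected = row_totals[sex] * col_totals[j] / grand_total
        chi2_by_hand += (obs - expected) ** 2 / expected

# Critical value of the chi-square distribution for alpha = 0.05, 1 degree of freedom
CRITICAL_VALUE_DF1 = 3.841
reject_h0 = chi2_by_hand > CRITICAL_VALUE_DF1

print(f"chi2 = {chi2_by_hand:.2f}, reject H0 at alpha=0.05: {reject_h0}")
```

A 2x2 table has (2 - 1) x (2 - 1) = 1 degree of freedom, which is why the critical value for one degree of freedom is used here.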

In [None]:
# 🔬 STATISTICAL TESTING SECTION

print("="*70)
print("STATISTICAL TESTS FOR CLASSIFICATION ASSUMPTIONS")
print("="*70)

# Prepare data for testing
df_test = df.copy()

# =============================================================================
# TEST 1: CHI-SQUARE TESTS (Categorical vs Survival)
# =============================================================================
print("\n" + "="*70)
print("TEST 1: CHI-SQUARE TESTS (Categorical Variables vs Survival)")
print("="*70)

categorical_vars = ['sex', 'pclass', 'embarked', 'who', 'adult_male', 'alone']

chi2_results = []

for var in categorical_vars:
    if var in df_test.columns:
        # Create contingency table
        contingency_table = pd.crosstab(df_test[var], df_test['survived'])
        
        # Perform Chi-square test
        chi2, p_value, dof, expected = chi2_contingency(contingency_table)
        
        # Determine significance
        significant = "YES ✅" if p_value < 0.05 else "NO ❌"
        
        chi2_results.append({
            'Variable': var,
            'Chi2': chi2,
            'p-value': p_value,
            'Significant (Œ±=0.05)': significant
        })
        
        print(f"\n{var.upper()}:")
        print(f"  Chi-square statistic: {chi2:.4f}")
        print(f"  p-value: {p_value:.2e}")
        print(f"  Degrees of freedom: {dof}")
        print(f"  Significant association with survival: {significant}")

chi2_df = pd.DataFrame(chi2_results)
print("\n" + "-"*70)
print("CHI-SQUARE SUMMARY TABLE:")
print("-"*70)
display(chi2_df)

print("\n💡 INTERPRETATION:")
print("   Variables with p < 0.05 are significantly associated with survival")
print("   and are candidate predictors for our classification model.")

In [None]:
# =============================================================================
# TEST 2: T-TESTS (Continuous Variables vs Survival Groups)
# =============================================================================
print("\n" + "="*70)
print("TEST 2: INDEPENDENT T-TESTS (Continuous Variables by Survival)")
print("="*70)

# Separate groups
survived = df_test[df_test['survived'] == 1]
not_survived = df_test[df_test['survived'] == 0]

continuous_vars = ['age', 'fare']

ttest_results = []

for var in continuous_vars:
    if var in df_test.columns:
        # Remove NaN values
        group1 = survived[var].dropna()
        group2 = not_survived[var].dropna()
        
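        # Perform independent t-test. SciPy defaults to equal_var=True (Student's
        # t-test); pass equal_var=False for Welch's test if variances differ.
        t_stat, p_value = ttest_ind(group1, group2)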
        
        # Calculate means
        mean_survived = group1.mean()
        mean_not_survived = group2.mean()
        
        significant = "YES ✅" if p_value < 0.05 else "NO ❌"
        
        ttest_results.append({
            'Variable': var,
            'Mean (Survived)': mean_survived,
            'Mean (Not Survived)': mean_not_survived,
            't-statistic': t_stat,
            'p-value': p_value,
            'Significant': significant
        })
        
        print(f"\n{var.upper()}:")
        print(f"  Mean (Survived): {mean_survived:.2f}")
        print(f"  Mean (Not Survived): {mean_not_survived:.2f}")
        print(f"  t-statistic: {t_stat:.4f}")
        print(f"  p-value: {p_value:.4f}")
        print(f"  Significant difference: {significant}")

ttest_df = pd.DataFrame(ttest_results)
print("\n" + "-"*70)
print("T-TEST SUMMARY TABLE:")
print("-"*70)
display(ttest_df)

print("\n💡 INTERPRETATION:")
print("   Significant p-values indicate the variable differs between")
print("   survivors and non-survivors, making it useful for prediction.")

In [None]:
# =============================================================================
# TEST 3: NORMALITY TESTS (Shapiro-Wilk)
# =============================================================================
print("\n" + "="*70)
print("TEST 3: SHAPIRO-WILK NORMALITY TESTS")
print("="*70)
print("H0: Data is normally distributed")
print("H1: Data is NOT normally distributed\n")

# Sample size for Shapiro-Wilk (max 5000 recommended)
sample_size = min(5000, len(df_test))

normality_results = []

for var in continuous_vars:
    if var in df_test.columns:
        data = df_test[var].dropna()
        
        # Sample if too large
        if len(data) > 5000:
            data = data.sample(5000, random_state=42)
        
        # Shapiro-Wilk test
        stat, p_value = shapiro(data)
        
        normal = "YES ✅" if p_value > 0.05 else "NO ❌"
        
        normality_results.append({
            'Variable': var,
            'W-statistic': stat,
            'p-value': p_value,
            'Normal Distribution': normal
        })
        
        print(f"{var.upper()}:")
        print(f"  W-statistic: {stat:.4f}")
        print(f"  p-value: {p_value:.2e}")
        print(f"  Normally distributed: {normal}\n")

normality_df = pd.DataFrame(normality_results)
print("-"*70)
print("NORMALITY TEST SUMMARY:")
print("-"*70)
display(normality_df)

print("\n💡 INTERPRETATION:")
print("   If p < 0.05, data is NOT normal. Logistic Regression and tree-based")
print("   methods do not assume normal features; Gaussian Naive Bayes does.")

# =============================================================================
# TEST 4: LEVENE'S TEST (Equal Variances)
# =============================================================================
print("\n" + "="*70)
print("TEST 4: LEVENE'S TEST FOR EQUAL VARIANCES")
print("="*70)
print("H0: Groups have equal variances (homoscedasticity)")
print("H1: Groups have unequal variances (heteroscedasticity)\n")

levene_results = []

for var in continuous_vars:
    if var in df_test.columns:
        group1 = survived[var].dropna()
        group2 = not_survived[var].dropna()
        
        # Levene's test
        stat, p_value = levene(group1, group2)
        
        equal_var = "YES ✅" if p_value > 0.05 else "NO ❌"
        
        levene_results.append({
            'Variable': var,
            'Statistic': stat,
            'p-value': p_value,
            'Equal Variances': equal_var
        })
        
        print(f"{var.upper()}:")
        print(f"  Statistic: {stat:.4f}")
        print(f"  p-value: {p_value:.4f}")
        print(f"  Equal variances: {equal_var}\n")

levene_df = pd.DataFrame(levene_results)
print("-"*70)
print("LEVENE'S TEST SUMMARY:")
print("-"*70)
display(levene_df)

print("\n💡 INTERPRETATION:")
print("   Equal variances are assumed by many statistical tests.")
print("   Violations may require data transformation or robust methods.")
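
One robust option when Levene's test rejects equal variances is Welch's t-test, which SciPy runs via `ttest_ind(group1, group2, equal_var=False)`. The sketch below spells out the arithmetic behind it using only the standard library. The fare values are invented purely for illustration, and `welch_t` is a hypothetical helper, not part of the notebook.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Per-group squared standard errors; variance() uses the n - 1 denominator
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Degrees of freedom shrink when the two variances are very different
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof

# Hypothetical fares, invented for this example only
fares_survived = [80.0, 26.0, 120.5, 13.0, 55.9, 211.3]
fares_not_survived = [7.25, 8.05, 13.0, 7.9, 21.1, 9.5]

t_stat_demo, dof_demo = welch_t(fares_survived, fares_not_survived)
print(f"t = {t_stat_demo:.3f}, degrees of freedom = {dof_demo:.2f}")
```

Unlike Student's t-test, the variances are not pooled, so a group with a much larger spread does not distort the standard error of the other group.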

In [None]:
# 📊 Visualize Statistical Test Results
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Chi-square results
ax1 = axes[0, 0]
# Colour by p-value directly rather than by the text of the significance label
colors_chi = ['#2ecc71' if p < 0.05 else '#e74c3c' for p in chi2_df['p-value']]
bars1 = ax1.bar(chi2_df['Variable'], chi2_df['Chi2'], color=colors_chi)
ax1.set_title('Chi-Square Statistics\n(Green = Significant Association)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Chi-Square Value')
ax1.tick_params(axis='x', rotation=45)

# 2. T-test results
ax2 = axes[0, 1]
x_pos = np.arange(len(ttest_df))
width = 0.35
bars2 = ax2.bar(x_pos - width/2, ttest_df['Mean (Survived)'], width, label='Survived', color='#2ecc71')
bars3 = ax2.bar(x_pos + width/2, ttest_df['Mean (Not Survived)'], width, label='Not Survived', color='#e74c3c')
ax2.set_title('Mean Comparison by Survival Status', fontsize=12, fontweight='bold')
ax2.set_ylabel('Mean Value')
ax2.set_xticks(x_pos)
ax2.set_xticklabels(ttest_df['Variable'])
ax2.legend()

# 3. Distribution plots for normality check
ax3 = axes[1, 0]
df_test['age'].dropna().hist(bins=30, ax=ax3, alpha=0.7, color='skyblue', edgecolor='black')
ax3.set_title('Age Distribution\n(Checking Normality)', fontsize=12, fontweight='bold')
ax3.set_xlabel('Age')
ax3.set_ylabel('Frequency')
ax3.axvline(df_test['age'].mean(), color='red', linestyle='--', label=f"Mean: {df_test['age'].mean():.1f}")
ax3.legend()

# 4. Q-Q plot for normality
ax4 = axes[1, 1]
from scipy.stats import probplot
probplot(df_test['fare'].dropna(), dist="norm", plot=ax4)
ax4.set_title('Q-Q Plot: Fare vs Normal Distribution', fontsize=12, fontweight='bold')
ax4.get_lines()[0].set_markerfacecolor('skyblue')
ax4.get_lines()[0].set_markersize(5)

plt.tight_layout()
plt.show()

print("\n" + "="*70)
print("✅ STATISTICAL TESTING COMPLETE")
print("="*70)
print("\nCONCLUSIONS:")
print("1. Categorical variables (sex, pclass, etc.) show significant association with survival")
print("2. Continuous variables (fare) show significant mean differences")
print("3. Data may not be perfectly normal - consider this for model selection")
print("4. Proceed with classification modeling with confidence!")

## 4️⃣ Data Preprocessing 🧹

Now we prepare the data for machine learning. This includes:
- Handling missing values
- Encoding categorical variables
- Feature engineering
- Scaling features
- Splitting into train/test sets
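
The cells below carry out these steps one at a time, which is easy to follow. An alternative worth knowing is to wrap imputation and scaling in a scikit-learn `Pipeline`, so that each cross-validation fold learns its median and scaling statistics from its own training portion only. The following is a minimal sketch on synthetic stand-in data (an assumption, not the Titanic features); the names `X_demo`, `y_demo`, and `cv_demo` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 3 numeric features, binary target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # blank out roughly 10% of values

# Each step is fitted on the training folds only, then applied to the held-out fold
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv_demo, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

On a dataset this small the difference from imputing up front is usually minor, but the pipeline form keeps the held-out data out of every fitted statistic.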

In [None]:
# 🧹 DATA PREPROCESSING

print("="*70)
print("STEP 1: HANDLING MISSING VALUES")
print("="*70)

# Create a copy for processing
df_processed = df.copy()

print("Missing values before processing:")
print(df_processed.isnull().sum()[df_processed.isnull().sum() > 0])

# 1. Drop 'deck' column (too many missing values - 77%)
df_processed = df_processed.drop('deck', axis=1)

# 2. Fill missing 'age' with median (robust to outliers)
# Assign the result back instead of using inplace=True on a selected column:
# recent pandas 2.x releases warn about that pattern, and it will not modify
# df_processed once copy-on-write is the default.
df_processed['age'] = df_processed['age'].fillna(df_processed['age'].median())

# 3. Fill missing 'embarked' and 'embark_town' with mode (most frequent)
df_processed['embarked'] = df_processed['embarked'].fillna(df_processed['embarked'].mode()[0])
df_processed['embark_town'] = df_processed['embark_town'].fillna(df_processed['embark_town'].mode()[0])

# 4. Fill missing 'fare' with median
df_processed['fare'] = df_processed['fare'].fillna(df_processed['fare'].median())

print("\nMissing values after processing:")
print(df_processed.isnull().sum()[df_processed.isnull().sum() > 0])
print("\n✅ Missing values handled!")

In [None]:
print("\n" + "="*70)
print("STEP 2: FEATURE ENGINEERING")
print("="*70)

# Create new features that might improve prediction

# 1. Family Size (sibsp + parch + 1)
df_processed['family_size'] = df_processed['sibsp'] + df_processed['parch'] + 1

# 2. Is Alone (1 if family_size == 1, else 0)
df_processed['is_alone'] = (df_processed['family_size'] == 1).astype(int)

# 3. Age Group (categorize age)
def categorize_age(age):
    if age < 13:
        return 'Child'
    elif age < 20:
        return 'Teenager'
    elif age < 60:
        return 'Adult'
    else:
        return 'Senior'

df_processed['age_group'] = df_processed['age'].apply(categorize_age)

# 4. Fare per person
df_processed['fare_per_person'] = df_processed['fare'] / df_processed['family_size']

# 5. Title extraction is not possible here: the Seaborn version of the dataset has
# no 'name' column, so the existing 'who' column (man/woman/child) stands in for it

print("New features created:")
print("  - family_size: Total family members on board")
print("  - is_alone: Binary flag for solo travelers")
print("  - age_group: Categorical age groups")
print("  - fare_per_person: Fare divided by family size")

# Check survival rates by new features
print("\nSurvival rate by family size:")
print(df_processed.groupby('family_size')['survived'].mean().round(3))

print("\nSurvival rate by age group:")
print(df_processed.groupby('age_group')['survived'].mean().round(3))
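
As a side note, the same age bands can be produced without a row-by-row `apply` by using `pd.cut`. The sketch below uses made-up ages and checks that the two approaches agree; `right=False` makes each bin include its left edge, matching the `<` comparisons in `categorize_age`. One difference to be aware of: `pd.cut` leaves missing ages as NaN, whereas `categorize_age` would label them 'Senior', so this assumes ages have already been imputed.

```python
import numpy as np
import pandas as pd

def categorize_age(age):
    # Same thresholds as the notebook's helper
    if age < 13:
        return 'Child'
    elif age < 20:
        return 'Teenager'
    elif age < 60:
        return 'Adult'
    else:
        return 'Senior'

# Made-up ages chosen to sit on and around the bin edges
ages = pd.Series([4.0, 12.9, 13.0, 19.0, 20.0, 59.9, 60.0, 80.0])

# Bins are [0, 13), [13, 20), [20, 60), [60, inf)
age_group_cut = pd.cut(
    ages,
    bins=[0, 13, 20, 60, np.inf],
    labels=['Child', 'Teenager', 'Adult', 'Senior'],
    right=False,
)

age_group_apply = ages.apply(categorize_age)
print(pd.DataFrame({'age': ages, 'cut': age_group_cut, 'apply': age_group_apply}))
```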

In [None]:
print("\n" + "="*70)
print("STEP 3: ENCODING CATEGORICAL VARIABLES")
print("="*70)

# Select features for modeling
# Drop redundant or non-predictive columns
columns_to_drop = ['alive', 'class', 'embark_town', 'adult_male']  # Redundant with other columns
df_model = df_processed.drop(columns_to_drop, axis=1)

# Identify categorical columns
categorical_columns = df_model.select_dtypes(include=['object', 'category']).columns.tolist()
print(f"Categorical columns to encode: {categorical_columns}")

# One-hot encoding for categorical variables
df_encoded = pd.get_dummies(df_model, columns=categorical_columns, drop_first=True)

print(f"\nDataset shape after encoding: {df_encoded.shape}")
print(f"Columns: {list(df_encoded.columns)}")

# Display first few rows of processed data
print("\nProcessed data sample:")
display(df_encoded.head())

In [None]:
print("\n" + "="*70)
print("STEP 4: TRAIN-TEST SPLIT & FEATURE SCALING")
print("="*70)

# Separate features and target
X = df_encoded.drop('survived', axis=1)
y = df_encoded['survived']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"\nTraining set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"\nTraining set survival rate: {y_train.mean():.2%}")
print(f"Test set survival rate: {y_test.mean():.2%}")

# Feature Scaling (important for Logistic Regression and SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns, index=X_test.index)

print("\n✅ Data split and scaled successfully!")
print("\nNote: We use StandardScaler to standardize features to mean=0, std=1")
print("This matters for scale-sensitive algorithms like Logistic Regression, SVM, and KNN.")

## 5️⃣ Model Training: Building Multiple Classifiers 🤖

We'll train and compare up to six different classification algorithms:

1. **Logistic Regression** - Linear model, good baseline, interpretable
2. **Random Forest** - Ensemble of decision trees, handles non-linearity well
3. **Support Vector Machine (SVM)** - Effective in high-dimensional spaces
4. **K-Nearest Neighbors (KNN)** - Instance-based learning
5. **Naive Bayes** - Probabilistic classifier, fast and simple
6. **XGBoost** (if available) - Gradient boosting, often top performer

Each model has different strengths and assumptions. Let's see which works best for Titanic!
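
Several of the models initialized below pass `class_weight='balanced'`. In scikit-learn this sets each class weight to `n_samples / (n_classes * class_count)`, so the rarer class counts for more during training. Here is a small worked example of that formula, assuming the full 891-row dataset with 342 survivors; the notebook itself fits on the training split, so the actual weights will differ slightly.

```python
from collections import Counter

# Assumption: 549 non-survivors (0) and 342 survivors (1), as in the full dataset
labels = [0] * 549 + [1] * 342

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Same formula scikit-learn documents for class_weight='balanced'
balanced_weights = {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

for cls, weight in sorted(balanced_weights.items()):
    print(f"class {cls}: count={counts[cls]}, weight={weight:.3f}")
```

With these weights, both classes contribute the same total weight (count times weight), which is the sense in which the classes are "balanced".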

In [None]:
# 🤖 INITIALIZE MODELS

print("="*70)
print("INITIALIZING CLASSIFICATION MODELS")
print("="*70)

# Dictionary to store models
models = {}

# 1. Logistic Regression
models['Logistic Regression'] = LogisticRegression(
    random_state=42, 
    max_iter=1000,
    class_weight='balanced'  # Handle class imbalance
)

# 2. Random Forest
models['Random Forest'] = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    class_weight='balanced'
)

# 3. Support Vector Machine
models['SVM'] = SVC(
    probability=True,  # Enable probability estimates
    random_state=42,
    class_weight='balanced'
)

# 4. K-Nearest Neighbors
models['KNN'] = KNeighborsClassifier(
    n_neighbors=5
)

# 5. Naive Bayes
models['Naive Bayes'] = GaussianNB()

# 6. XGBoost (if available)
if XGBOOST_AVAILABLE:
    models['XGBoost'] = xgb.XGBClassifier(
        random_state=42,
        eval_metric='logloss'
    )

print(f"Models initialized: {list(models.keys())}")
print("\nModel descriptions:")
for name in models.keys():
    print(f"  ✅ {name}")

In [None]:
# 🏋️ TRAIN MODELS AND EVALUATE WITH CROSS-VALIDATION

print("="*70)
print("TRAINING MODELS WITH 5-FOLD CROSS-VALIDATION")
print("="*70)

# Store results
cv_results = {}
trained_models = {}

# Use StratifiedKFold to maintain class distribution
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Use scaled data for models that need it
    if name in ['Logistic Regression', 'SVM', 'KNN']:
        X_train_use = X_train_scaled
        X_test_use = X_test_scaled
    else:
        X_train_use = X_train
        X_test_use = X_test
    
    # Cross-validation scores
    cv_scores = cross_val_score(model, X_train_use, y_train, cv=cv, scoring='accuracy')
    
    # Train on full training set
    model.fit(X_train_use, y_train)
    
    # Store trained model
    trained_models[name] = model
    
    # Store CV results
    cv_results[name] = {
        'CV Mean Accuracy': cv_scores.mean(),
        'CV Std Accuracy': cv_scores.std(),
        'CV Scores': cv_scores
    }
    
    print(f"  CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")

print("\n" + "="*70)
print("CROSS-VALIDATION SUMMARY")
print("="*70)

cv_summary = pd.DataFrame({
    name: {
        'Mean Accuracy': f"{res['CV Mean Accuracy']:.4f}",
        'Std Dev': f"{res['CV Std Accuracy']:.4f}"
    } for name, res in cv_results.items()
}).T

cv_summary = cv_summary.sort_values('Mean Accuracy', ascending=False)
display(cv_summary)

In [None]:
# 📊 DETAILED EVALUATION ON TEST SET

print("="*70)
print("DETAILED EVALUATION ON TEST SET")
print("="*70)

# Store all predictions and metrics
predictions = {}
probabilities = {}
all_metrics = []

for name, model in trained_models.items():
    # Use appropriate data (scaled or unscaled)
    if name in ['Logistic Regression', 'SVM', 'KNN']:
        X_test_use = X_test_scaled
    else:
        X_test_use = X_test
    
    # Predictions
    y_pred = model.predict(X_test_use)
    predictions[name] = y_pred
    
    # Probabilities (for ROC curve)
    if hasattr(model, 'predict_proba'):
        y_prob = model.predict_proba(X_test_use)[:, 1]
        probabilities[name] = y_prob
    else:
        y_prob = None
    
    # Calculate metrics
    metrics = {
        'Model': name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1-Score': f1_score(y_test, y_pred),
        'ROC-AUC': roc_auc_score(y_test, y_prob) if y_prob is not None else 'N/A'
    }
    
    all_metrics.append(metrics)

# Create metrics DataFrame
metrics_df = pd.DataFrame(all_metrics)
metrics_df = metrics_df.sort_values('Accuracy', ascending=False)

print("\nMODEL PERFORMANCE COMPARISON:")
print("-"*70)
display(metrics_df.round(4))

print("\n" + "="*70)
print("METRIC EXPLANATIONS:")
print("="*70)
print("• Accuracy: Overall correctness (TP + TN) / Total")
print("• Precision: Of predicted survivors, how many actually survived (TP / (TP + FP))")
print("• Recall: Of actual survivors, how many did we predict (TP / (TP + FN))")
print("• F1-Score: Harmonic mean of Precision and Recall")
print("• ROC-AUC: Area under ROC curve (1.0 = perfect, 0.5 = random)")
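
The formulas above are easy to check by hand. The sketch below plugs hypothetical confusion-matrix counts (invented for illustration, not results from this notebook) into each definition.

```python
# Hypothetical counts: true positives, false positives, false negatives, true negatives
tp, fp, fn, tn = 50, 10, 19, 100

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # of predicted survivors, the share who survived
recall = tp / (tp + fn)     # of actual survivors, the share we found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-Score:  {f1:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the smaller of precision and recall, so a model cannot score well on F1 by maximizing only one of them.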

In [None]:
# 📈 VISUALIZE CONFUSION MATRICES

n_models = len(trained_models)
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

for idx, (name, y_pred) in enumerate(predictions.items()):
    if idx < len(axes):
        cm = confusion_matrix(y_test, y_pred)
        
        # Create heatmap
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[idx],
                   xticklabels=['Did not survive', 'Survived'],
                   yticklabels=['Did not survive', 'Survived'])
        axes[idx].set_title(f'{name}\nConfusion Matrix', fontsize=12, fontweight='bold')
        axes[idx].set_ylabel('Actual')
        axes[idx].set_xlabel('Predicted')

# Remove any unused subplots (one when XGBoost is unavailable)
for ax in axes[n_models:]:
    fig.delaxes(ax)

plt.tight_layout()
plt.show()

print("\n" + "="*70)
print("CONFUSION MATRIX INTERPRETATION:")
print("="*70)
print("• True Negatives (Top-Left): Correctly predicted non-survivors")
print("• False Positives (Top-Right): Predicted survival, but didn't survive")
print("• False Negatives (Bottom-Left): Predicted death, but survived")
print("• True Positives (Bottom-Right): Correctly predicted survivors")

In [None]:
# 📉 PLOT ROC CURVES

plt.figure(figsize=(12, 8))

# Plot ROC curve for each model
for name, y_prob in probabilities.items():
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    auc_score = roc_auc_score(y_test, y_prob)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc_score:.3f})', linewidth=2)

# Plot random guess line
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess (AUC = 0.500)', linewidth=1)

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - Specificity)', fontsize=12)
plt.ylabel('True Positive Rate (Sensitivity/Recall)', fontsize=12)
plt.title('ROC Curves Comparison\n(Higher curve = Better model)', fontsize=14, fontweight='bold')
plt.legend(loc="lower right", fontsize=10)
plt.grid(True, alpha=0.3)
plt.show()

print("\n" + "="*70)
print("ROC CURVE INTERPRETATION:")
print("="*70)
print("• X-axis: False Positive Rate (FPR) - how many non-survivors we wrongly label as survivors")
print("• Y-axis: True Positive Rate (TPR) - how many actual survivors we correctly identify")
print("• Ideal model: Goes straight up to (0,1) then right to (1,1)")
print("• Random guess: Diagonal line from (0,0) to (1,1)")
print("• AUC = 1.0: Perfect classifier")
print("• AUC = 0.5: No better than random guessing")
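
Another way to read AUC: it is the probability that a randomly chosen survivor receives a higher predicted score than a randomly chosen non-survivor, with ties counting as half. The sketch below computes AUC that way on a tiny made-up example; `auc_by_ranking` is a hypothetical helper written for illustration, and `roc_auc_score` should give the same number on the same inputs.

```python
def auc_by_ranking(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0   # pair ranked correctly
            elif pos_score == neg_score:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

auc_demo = auc_by_ranking(y_true_demo, y_score_demo)
print(f"AUC = {auc_demo:.2f}")
```

Three of the four positive-negative pairs are ordered correctly here (only 0.35 versus 0.4 is not), which is why AUC depends on the ranking of scores and not on any particular decision threshold.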

In [None]:
# 🔍 FEATURE IMPORTANCE ANALYSIS

print("="*70)
print("FEATURE IMPORTANCE ANALYSIS")
print("="*70)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Random Forest Feature Importance
rf_model = trained_models['Random Forest']
rf_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False).head(10)

axes[0].barh(rf_importance['feature'], rf_importance['importance'], color='forestgreen')
axes[0].set_title('Random Forest\nTop 10 Feature Importances', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Importance')
axes[0].invert_yaxis()

# Logistic Regression Coefficients
lr_model = trained_models['Logistic Regression']
lr_coef = pd.DataFrame({
    'feature': X.columns,
    'coefficient': lr_model.coef_[0]
})
lr_coef['abs_coef'] = np.abs(lr_coef['coefficient'])
lr_coef = lr_coef.sort_values('abs_coef', ascending=False).head(10)

colors = ['red' if x < 0 else 'blue' for x in lr_coef['coefficient']]
axes[1].barh(lr_coef['feature'], lr_coef['coefficient'], color=colors)
axes[1].set_title('Logistic Regression\nTop 10 Coefficients (Blue=Positive, Red=Negative)', 
                  fontsize=12, fontweight='bold')
axes[1].set_xlabel('Coefficient Value')
axes[1].invert_yaxis()
axes[1].axvline(x=0, color='black', linestyle='--', alpha=0.3)

plt.tight_layout()
plt.show()

print("\nTop 5 Most Important Features (Random Forest):")
# Use enumerate for the rank: after sorting, the DataFrame index is the
# feature's original column position, not its place in the ranking
for rank, (_, row) in enumerate(rf_importance.head(5).iterrows(), start=1):
    print(f"  {rank}. {row['feature']}: {row['importance']:.4f}")

print("\nTop 5 Most Influential Features (Logistic Regression):")
for rank, (_, row) in enumerate(lr_coef.head(5).iterrows(), start=1):
    direction = "increases" if row['coefficient'] > 0 else "decreases"
    print(f"  {rank}. {row['feature']}: {row['coefficient']:.4f} ({direction} survival probability)")
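
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to explain. The sketch below uses hypothetical feature names and coefficient values, not output from this notebook. Because the model was fitted on standardized features, each odds ratio describes a one-standard-deviation increase in that feature, not a one-unit increase.

```python
import math

# Hypothetical coefficients on standardized features (illustration only)
demo_coefficients = {"sex_male": -1.2, "fare": 0.3}

# exp(coefficient) = multiplicative change in the odds of survival
odds_ratios = {name: math.exp(coef) for name, coef in demo_coefficients.items()}

for name, ratio in odds_ratios.items():
    effect = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.2f} ({effect} the odds of survival)")
```

An odds ratio below 1 corresponds to a negative coefficient and an odds ratio above 1 to a positive one, matching the red and blue bars in the plot.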

## 6️⃣ Model Improvement: Hyperparameter Tuning 🎯

Now let's improve our best models by tuning their hyperparameters. We'll use **GridSearchCV** to find the optimal combination of parameters.

### What is Hyperparameter Tuning?

Hyperparameters are settings that control the learning process. Unlike model parameters (which are learned from data), hyperparameters must be set before training. Examples include:
- **Random Forest**: Number of trees, max depth, min samples per leaf
- **Logistic Regression**: Regularization strength (C), penalty type
- **SVM**: Kernel type, C parameter, gamma

**Grid Search** systematically tries all combinations of specified parameters to find the best performing set.
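
To get a feel for the cost of "all combinations", the sketch below enumerates the Random Forest grid used in the next cell with `itertools.product`. The variable names are hypothetical; the parameter values are copied from that cell.

```python
from itertools import product

# Same values as the Random Forest grid in the tuning cell
rf_grid_demo = {
    "n_estimators": [100, 200, 300],
    "max_depth": [5, 10, 15, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Cartesian product of all value lists: one dict per candidate setting
combinations = [
    dict(zip(rf_grid_demo, values)) for values in product(*rf_grid_demo.values())
]

n_folds = 5
n_cv_fits = len(combinations) * n_folds

print(f"Candidate settings: {len(combinations)}")   # 3 * 4 * 3 * 3
print(f"Cross-validation fits: {n_cv_fits}")
print(f"First candidate: {combinations[0]}")
```

The count multiplies with every parameter added, which is why grid search is usually kept to a handful of values per parameter, or swapped for a randomized search when the grid gets large.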

In [None]:
# 🎯 HYPERPARAMETER TUNING WITH GRIDSEARCHCV

print("="*70)
print("HYPERPARAMETER TUNING")
print("="*70)
print("This may take 1-2 minutes...\n")

# Dictionary to store best models
best_models = {}

# =============================================================================
# 1. TUNE RANDOM FOREST
# =============================================================================
print("Tuning Random Forest...")

rf_params = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=42, class_weight='balanced'),
    rf_params,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=0
)

rf_grid.fit(X_train, y_train)
best_models['Random Forest'] = rf_grid.best_estimator_

print(f"  Best Parameters: {rf_grid.best_params_}")
print(f"  Best CV Score: {rf_grid.best_score_:.4f}")

# =============================================================================
# 2. TUNE LOGISTIC REGRESSION
# =============================================================================
print("\nTuning Logistic Regression...")

lr_params = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']  # liblinear supports both l1 and l2
}

lr_grid = GridSearchCV(
    LogisticRegression(random_state=42, class_weight='balanced', max_iter=1000),
    lr_params,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

lr_grid.fit(X_train_scaled, y_train)
best_models['Logistic Regression'] = lr_grid.best_estimator_

print(f"  Best Parameters: {lr_grid.best_params_}")
print(f"  Best CV Score: {lr_grid.best_score_:.4f}")

# =============================================================================
# 3. TUNE SVM
# =============================================================================
print("\nTuning SVM...")

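# gamma only affects the RBF kernel, so each kernel gets its own sub-grid;
# this avoids refitting identical linear models for every gamma value
svm_params = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': ['scale', 'auto', 0.001, 0.01]}
]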

svm_grid = GridSearchCV(
    SVC(probability=True, random_state=42, class_weight='balanced'),
    svm_params,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

svm_grid.fit(X_train_scaled, y_train)
best_models['SVM'] = svm_grid.best_estimator_

print(f"  Best Parameters: {svm_grid.best_params_}")
print(f"  Best CV Score: {svm_grid.best_score_:.4f}")

print("\n✅ Hyperparameter tuning complete!")

In [None]:
# 📊 COMPARE TUNED MODELS VS BASELINE

print("="*70)
print("COMPARISON: TUNED MODELS VS BASELINE")
print("="*70)

comparison_results = []

for name in ['Random Forest', 'Logistic Regression', 'SVM']:
    # Baseline model
    baseline = trained_models[name]
    
    # Tuned model
    tuned = best_models[name]
    
    # Use appropriate data
    if name in ['Logistic Regression', 'SVM']:
        X_test_use = X_test_scaled
    else:
        X_test_use = X_test
    
    # Baseline predictions
    baseline_pred = baseline.predict(X_test_use)
    baseline_acc = accuracy_score(y_test, baseline_pred)
    
    # Tuned predictions
    tuned_pred = tuned.predict(X_test_use)
    tuned_acc = accuracy_score(y_test, tuned_pred)
    
    # Improvement
    improvement = tuned_acc - baseline_acc
    
    comparison_results.append({
        'Model': name,
        'Baseline Accuracy': baseline_acc,
        'Tuned Accuracy': tuned_acc,
        'Improvement': improvement
    })

comparison_df = pd.DataFrame(comparison_results)
comparison_df = comparison_df.sort_values('Tuned Accuracy', ascending=False)

print("\n")
display(comparison_df.round(4))

# Visualize improvement
fig, ax = plt.subplots(figsize=(12, 6))

x = np.arange(len(comparison_df))
width = 0.35

bars1 = ax.bar(x - width/2, comparison_df['Baseline Accuracy'], width, 
               label='Baseline', color='lightcoral', alpha=0.8)
bars2 = ax.bar(x + width/2, comparison_df['Tuned Accuracy'], width, 
               label='Tuned', color='lightgreen', alpha=0.8)

ax.set_xlabel('Model', fontsize=12)
ax.set_ylabel('Accuracy', fontsize=12)
ax.set_title('Model Performance: Baseline vs Tuned', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(comparison_df['Model'])
ax.legend()
ax.set_ylim([0.7, 0.9])

# Add value labels on bars
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.3f}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

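# Identify best model
# Note: ranking by test accuracy makes the winner's reported test score
# slightly optimistic; the CV scores printed during tuning are a safer basis.
best_model_name = comparison_df.iloc[0]['Model']
best_accuracy = comparison_df.iloc[0]['Tuned Accuracy']

print(f"\n🏆 BEST MODEL: {best_model_name}")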
print(f"   Test Accuracy: {best_accuracy:.4f} ({best_accuracy:.2%})")
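
When reading a baseline-versus-tuned table, it helps to know how much of a gap could be sampling noise. The sketch below uses the normal approximation for the standard error of an accuracy estimate, sqrt(p(1-p)/n). The test-set size and both accuracies are hypothetical placeholders, not results from this notebook, and the helper name is an assumption.

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    """Normal-approximation standard error of an accuracy measured on n_samples."""
    return math.sqrt(accuracy * (1 - accuracy) / n_samples)

# Hypothetical values for illustration only
n_test = 179
baseline_acc = 0.80
tuned_acc = 0.82

se = accuracy_standard_error(tuned_acc, n_test)
improvement = tuned_acc - baseline_acc

print(f"Standard error of tuned accuracy: {se:.4f}")
print(f"Improvement: {improvement:.4f}")
print(f"Improvement exceeds one standard error: {improvement > se}")
```

As a rough rule of thumb, a gap smaller than one or two standard errors is weak evidence that tuning helped.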

In [None]:
# 🏆 FINAL EVALUATION OF BEST MODEL

print("="*70)
print("FINAL MODEL EVALUATION")
print("="*70)

# Get the best model
final_model = best_models[best_model_name]

# Use appropriate test data
if best_model_name in ['Logistic Regression', 'SVM']:
    X_test_final = X_test_scaled
else:
    X_test_final = X_test

# Final predictions
y_pred_final = final_model.predict(X_test_final)
y_prob_final = final_model.predict_proba(X_test_final)[:, 1]

# Comprehensive metrics
print(f"\nModel: {best_model_name}")
print("-"*70)
print(f"Accuracy:  {accuracy_score(y_test, y_pred_final):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_final):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred_final):.4f}")
print(f"F1-Score:  {f1_score(y_test, y_pred_final):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_prob_final):.4f}")

print("\n" + "="*70)
print("DETAILED CLASSIFICATION REPORT")
print("="*70)
print(classification_report(y_test, y_pred_final, 
                           target_names=['Did not survive', 'Survived']))

# Confusion Matrix for final model
plt.figure(figsize=(8, 6))
cm_final = confusion_matrix(y_test, y_pred_final)
sns.heatmap(cm_final, annot=True, fmt='d', cmap='Blues',
           xticklabels=['Did not survive', 'Survived'],
           yticklabels=['Did not survive', 'Survived'])
plt.title(f'Final Model: {best_model_name}\nConfusion Matrix', 
          fontsize=14, fontweight='bold')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

# Calculate specific metrics from confusion matrix
tn, fp, fn, tp = cm_final.ravel()
print(f"\nConfusion Matrix Breakdown:")
print(f"  True Negatives:  {tn} (correctly predicted non-survivors)")
print(f"  False Positives: {fp} (predicted survival, actually died)")
print(f"  False Negatives: {fn} (predicted death, actually survived)")
print(f"  True Positives:  {tp} (correctly predicted survivors)")
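
The four confusion-matrix counts are enough to recompute the headline metrics by hand, which is a useful sanity check on the scikit-learn output. The following is a minimal sketch; the function name and the counts are hypothetical and are not taken from this notebook's results.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Hypothetical counts for illustration only
metrics = metrics_from_counts(tn=95, fp=15, fn=20, tp=49)

for name, value in metrics.items():
    print(f"{name:>9}: {value:.4f}")
```

Precision answers "of the passengers predicted to survive, how many did?", while recall answers "of the passengers who survived, how many did the model find?".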

In [None]:
# 📈 LEARNING CURVES (Diagnosing Bias vs Variance)

from sklearn.model_selection import learning_curve

print("="*70)
print("LEARNING CURVES ANALYSIS")
print("="*70)
print("Learning curves show how model performance changes with training set size.")
print("This helps diagnose:")
print("  • High Bias (Underfitting): Both curves converge at low score")
print("  • High Variance (Overfitting): Large gap between curves")
print("  • Good Fit: Curves converge at high score\n")

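def plot_learning_curve(model, X, y, title, ax, cv=5):
    """Plot mean training and validation accuracy (with a ±1 std band across CV folds) against training set size on `ax`."""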
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=cv, n_jobs=-1, 
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='accuracy'
    )
    
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)
    
    ax.plot(train_sizes, train_mean, 'o-', color='blue', label='Training Score')
    ax.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')
    ax.plot(train_sizes, val_mean, 'o-', color='red', label='Validation Score')
    ax.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1, color='red')
    
    ax.set_xlabel('Training Set Size')
    ax.set_ylabel('Accuracy Score')
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.legend(loc='best')
    ax.grid(True, alpha=0.3)

# Plot learning curves for the three tuned models
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

models_to_plot = ['Random Forest', 'Logistic Regression', 'SVM']

for idx, name in enumerate(models_to_plot):
    model = best_models[name]
    if name in ['Logistic Regression', 'SVM']:
        X_plot = X_train_scaled
    else:
        X_plot = X_train
    
    plot_learning_curve(model, X_plot, y_train, f'{name}', axes[idx])

plt.tight_layout()
plt.show()

print("\n💡 INTERPRETATION:")
print("   Validation score much lower than training score -> Overfitting")
print("   Both scores low and close together -> Underfitting")
print("   Both scores high and close together -> Good fit!")
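
The interpretation rules above can be written as a small helper that takes the final training and validation scores from a learning curve. This is only a sketch: the function name and both thresholds are hypothetical and should be adjusted to the problem at hand.

```python
def diagnose_fit(train_score, val_score, gap_threshold=0.05, low_score_threshold=0.75):
    """Classify a model's fit from its final training and validation scores.

    The thresholds are illustrative assumptions, not standard values.
    """
    gap = train_score - val_score
    if gap > gap_threshold:
        return "overfitting (high variance)"
    if val_score < low_score_threshold:
        return "underfitting (high bias)"
    return "good fit"

# Hypothetical score pairs for illustration only
examples = [(0.98, 0.80), (0.72, 0.70), (0.84, 0.82)]
for train_score, val_score in examples:
    print(f"train={train_score:.2f}, val={val_score:.2f} -> {diagnose_fit(train_score, val_score)}")
```

In practice you would pass `train_mean[-1]` and `val_mean[-1]` from the learning-curve computation, and also look at whether the validation curve is still rising, which suggests more data would help.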

## 7️⃣ Summary and Key Takeaways 🎓

Congratulations! You've completed a comprehensive binary classification project. Here's what we accomplished:

### 📊 What We Did:

1. **Data Exploration**
   - Loaded and explored the Titanic dataset
   - Identified key patterns (women and 1st class passengers had higher survival rates)
   - Visualized distributions and correlations

2. **Statistical Testing**
   - ✅ Chi-Square tests confirmed categorical variables are associated with survival
   - ✅ T-tests showed significant mean differences in age and fare
   - ✅ Verified (or identified violations of) ML assumptions

3. **Data Preprocessing**
   - Handled missing values strategically
   - Created new features (family_size, is_alone, age_group)
   - Encoded categorical variables
   - Scaled features for algorithms that need it

4. **Model Building**
   - Trained 6 different classification algorithms
   - Used cross-validation for robust evaluation
   - Compared multiple performance metrics

5. **Model Improvement**
   - Performed hyperparameter tuning with GridSearchCV
   - Compared tuned models against their baselines on the test set
   - Analyzed learning curves to diagnose overfitting/underfitting

### 🏆 Best Practices Learned:

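- **Always split data** into train/test sets so performance is measured on data the model has never seen, and fit preprocessing (such as scaling) on the training set only to avoid data leakage
- **Use cross-validation** for more reliable performance estimates
- **Scale features** for distance-based algorithms (SVM, KNN) and for regularized linear models such as Logistic Regression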
- **Tune hyperparameters** to optimize model performance
- **Evaluate multiple metrics** (not just accuracy) - especially important for imbalanced data
- **Check statistical assumptions** before applying certain algorithms
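
To make the scaling and leakage points concrete, here is a standard-library sketch of standardization where the mean and standard deviation are learned from the training values only and then reused on the test values. The helper names and fare values are hypothetical; like scikit-learn's `StandardScaler`, it uses the population standard deviation.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Learn the mean and population standard deviation from training values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 for a constant column
    return mu, sigma

def standardize(values, mu, sigma):
    """Apply previously learned statistics to any set of values."""
    return [(v - mu) / sigma for v in values]

# Hypothetical fare values for illustration only
train_fares = [7.25, 8.05, 13.0, 26.55, 53.1, 71.28]
test_fares = [10.5, 151.55]

mu, sigma = fit_standardizer(train_fares)            # statistics come from train only
train_scaled = standardize(train_fares, mu, sigma)
test_scaled = standardize(test_fares, mu, sigma)     # test data reuses them unchanged

print(f"train mean after scaling: {mean(train_scaled):.4f}")
print(f"test values after scaling: {[round(v, 2) for v in test_scaled]}")
```

Computing `mu` and `sigma` from train and test together would let information about the test set influence training, which is the leakage the split is meant to prevent.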

### 🚀 Next Steps to Explore:

1. **Feature Engineering**: Try creating more complex features
2. **Ensemble Methods**: Combine multiple models (VotingClassifier, Stacking)
3. **Advanced Techniques**: Try Neural Networks or Gradient Boosting (XGBoost, LightGBM)
4. **Imbalanced Data**: Explore SMOTE, class weights, or threshold tuning
5. **Model Interpretation**: Use SHAP or LIME for explainable AI
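
As a taste of the threshold tuning mentioned above, the sketch below sweeps a few decision thresholds over predicted probabilities and keeps the one with the highest F1. The labels, probabilities, candidate thresholds, and function name are all hypothetical.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when predicting the positive class for probabilities >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical validation labels and predicted survival probabilities
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.35, 0.40, 0.45, 0.55, 0.70, 0.20, 0.90]

thresholds = [0.3, 0.4, 0.5, 0.6]
scores = {t: f1_at_threshold(y_true, y_prob, t) for t in thresholds}
best_threshold = max(scores, key=scores.get)

for t, score in scores.items():
    print(f"threshold={t:.1f}  F1={score:.3f}")
print(f"Best threshold: {best_threshold}")
```

The threshold should be chosen on validation data (or inside cross-validation), not on the test set, for the same reason hyperparameters are.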

---

**Remember**: Machine learning is an iterative process. Start simple, evaluate thoroughly, and gradually improve! 🎯

In [None]:
# 🎯 BONUS: MAKE PREDICTIONS ON NEW DATA

print("="*70)
print("MAKING PREDICTIONS WITH THE TRAINED MODEL")
print("="*70)

# Example: Create a passenger profile
new_passenger = pd.DataFrame({
    'pclass': [1],
    'age': [25],
    'sibsp': [0],
    'parch': [0],
    'fare': [100],
    'sex_male': [0],  # Female = 0, Male = 1
    'embarked_Q': [0],
    'embarked_S': [1],
    'who_man': [0],
    'who_woman': [1],
    'alone_True': [1],
    'family_size': [1],
    'is_alone': [1],
    'age_group_Child': [0],
    'age_group_Senior': [0],
    'age_group_Teenager': [0],
    'fare_per_person': [100]
})

# Ensure column order matches training data
new_passenger = new_passenger[X.columns]

# Scale if needed
if best_model_name in ['Logistic Regression', 'SVM']:
    new_passenger_scaled = scaler.transform(new_passenger)
    new_passenger_scaled = pd.DataFrame(new_passenger_scaled, columns=X.columns)
    prediction_input = new_passenger_scaled
else:
    prediction_input = new_passenger

# Make prediction
prediction = final_model.predict(prediction_input)[0]
probability = final_model.predict_proba(prediction_input)[0]

print("\nNew Passenger Profile:")
print("  Class: 1st, Age: 25, Female, Alone, Fare: $100")
print(f"\nPrediction: {'SURVIVED' if prediction == 1 else 'DID NOT SURVIVE'}")
print(f"Survival Probability: {probability[1]:.2%}")
print(f"Death Probability: {probability[0]:.2%}")

print("\n" + "="*70)
print("END OF NOTEBOOK - HAPPY LEARNING! 🎉")
print("="*70)