# Module 14: Machine Learning Classification - Student Lab

## Titanic Survival Prediction Challenge

Welcome to the hands-on lab for machine learning classification! In this lab, you'll apply what you've learned about classification algorithms to predict Titanic passenger survival.

### Learning Objectives:
- Practice data preprocessing for machine learning
- Implement and compare different classification algorithms
- Evaluate model performance using appropriate metrics
- Handle overfitting and perform cross-validation
- Interpret feature importance and model results

### Instructions:
1. Follow the TODO comments to complete each section
2. Run all cells in order
3. Answer the questions in the markdown cells
4. Experiment with different parameters and techniques

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import warnings
warnings.filterwarnings('ignore')

print("=== Module 14: Machine Learning Classification Lab ===")
print("Titanic Survival Prediction Challenge")

## Part 1: Data Loading and Exploration

**TODO 1.1:** Load the Titanic dataset and examine its structure

In [None]:
# TODO 1.1: Load the Titanic dataset
# Hint: Use pd.read_csv() with the URL provided
titanic = # YOUR CODE HERE

print(f"Dataset shape: {titanic.shape}")
print("\nFirst 5 rows:")
display(titanic.head())

# Basic statistics
print("\nBasic statistics:")
display(titanic.describe())

# Data types and missing values
print("\nData types and missing values:")
titanic.info()  # info() prints its report directly and returns None, so no print() is needed
print("\nMissing values count:")
print(titanic.isnull().sum())

**Question 1.1:** What are the key characteristics of this dataset? How many passengers? What types of features? What are the main missing value issues?

**TODO 1.2:** Create visualizations to understand the relationships between features and survival

In [None]:
# TODO 1.2: Create exploratory visualizations
# Hint: Use seaborn for bar plots, histograms, and correlation heatmaps

# Survival rate by different categorical features
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Survival by Sex
# YOUR CODE HERE

# Survival by Pclass
# YOUR CODE HERE

# Survival by Embarked
# YOUR CODE HERE

# Age distribution by survival
# YOUR CODE HERE

plt.tight_layout()
plt.show()

# Correlation heatmap (you'll need to preprocess first for this)
# YOUR CODE HERE

**Question 1.2:** Which features seem most correlated with survival? Why might this make sense?
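
As a starting point for this question: the height of each bar in a survival-by-category plot is the group mean of the 0/1 target. The sketch below computes that on a small made-up frame. The `toy` values are hypothetical and are not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical toy data, not the real Titanic passengers
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of 1s, i.e. the survival rate per group
rates = toy.groupby("Sex")["Survived"].mean()
print(rates)
```

The same `groupby(...).mean()` pattern should work for any categorical column, which makes it a quick numeric check on what the bar plots show.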

## Part 2: Data Preprocessing

**TODO 2.1:** Implement data preprocessing steps
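
The group-wise median hint in the cell below can be illustrated on a tiny made-up frame. This is only a sketch of the general pandas pattern, assuming columns named `Pclass`, `Sex`, and `Age`; the values are hypothetical, and you still need to adapt it inside `preprocess_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing ages
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [40.0, np.nan, 50.0, 20.0, 30.0, np.nan],
})

# transform("median") returns one value per row: the median age of that row's group
# (missing values are skipped when the median is computed)
group_median = toy.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Fill only the missing entries with their group's median
toy["Age"] = toy["Age"].fillna(group_median)
print(toy)
```

Using `transform` rather than `agg` keeps the result aligned with the original rows, so it can be passed straight to `fillna`.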

In [None]:
# TODO 2.1: Implement data preprocessing
def preprocess_data(df):
    # Create a copy
    df_clean = df.copy()
    
    # Handle missing Age values
    # Hint: Use median age by Pclass and Sex
    # YOUR CODE HERE
    
    # Handle missing Embarked values
    # Hint: Fill with mode
    # YOUR CODE HERE
    
    # Handle missing Fare values (if any)
    # Hint: Use median fare by Pclass
    # YOUR CODE HERE
    
    # Create Has_Cabin feature from Cabin column
    # YOUR CODE HERE
    
    # Extract Title from Name
    # YOUR CODE HERE
    
    # Create Family_Size and Is_Alone features
    # YOUR CODE HERE
    
    # Encode categorical variables
    # Hint: Use LabelEncoder for Sex, Embarked, Title
    # YOUR CODE HERE
    
    # Drop unnecessary columns
    # YOUR CODE HERE
    
    return df_clean

# Apply preprocessing
titanic_clean = preprocess_data(titanic)
print("Preprocessed data shape:", titanic_clean.shape)
print("\nRemaining missing values:")
print(titanic_clean.isnull().sum())

# Display first few rows of preprocessed data
print("\nFirst 5 rows of preprocessed data:")
display(titanic_clean.head())

**TODO 2.2:** Prepare features and target, then split the data

In [None]:
# TODO 2.2: Prepare features and split data
# Separate features and target
X = # YOUR CODE HERE
y = # YOUR CODE HERE

# Split into train and test sets
# Hint: Use train_test_split with stratify=y and random_state=42
# YOUR CODE HERE

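# Scale the features
# Hint: Fit StandardScaler on the training set only, then transform both sets.
#       Store the results as X_train_scaled and X_test_scaled (later cells use these names).
# YOUR CODE HERE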

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"Survival rate in training set: {y_train.mean():.2%}")
print(f"Survival rate in test set: {y_test.mean():.2%}")

**Question 2.2:** Why do we use stratified splitting? What does feature scaling do?
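
To help you reason about both parts of this question, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` are made up for illustration. It checks that a stratified split keeps roughly the same class ratio in both subsets, and that a scaler fitted on the training set gives each training column zero mean and unit variance.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 rows, 2 features on an arbitrary scale, 30% positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo asks the splitter to preserve the class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
print(f"Positive rate  train: {y_tr.mean():.2f}  test: {y_te.mean():.2f}")

# Fit the scaler on the training data only, then reuse its statistics for the test data
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)
print("Train column means after scaling:", X_tr_scaled.mean(axis=0).round(6))
print("Train column stds after scaling: ", X_tr_scaled.std(axis=0).round(6))
```

Fitting the scaler on the training set alone keeps information about the test set out of the preprocessing step.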

## Part 3: Model Training and Evaluation

**TODO 3.1:** Train and evaluate multiple classification models

In [None]:
# TODO 3.1: Train and evaluate models
# Define models dictionary
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42)
}

# Dictionary to store results
results = {}

# Train and evaluate each model
for name, model in models.items():
    print(f"\n=== {name} ===")
    
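    # Train the model
    # YOUR CODE HERE
    
    # Make predictions (store the class predictions in y_pred)
    # YOUR CODE HERE
    
    # Calculate metrics (store them in accuracy and roc_auc)
    # YOUR CODE HERE
    
    # Cross-validation score (store in cv_scores; use scoring='roc_auc' to match the printout below)
    # YOUR CODE HERE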
    
    # Store results
    results[name] = {
        'accuracy': accuracy,
        'roc_auc': roc_auc,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'model': model
    }
    
    # Print results
    print(f"Accuracy: {accuracy:.4f}")
    print(f"ROC-AUC: {roc_auc:.4f}")
    print(f"CV ROC-AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
    
    # Classification report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    
    # Confusion matrix
    # YOUR CODE HERE

**Question 3.1:** Which model performed best? Why do you think that is?
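
When comparing models, remember that a single test-set score can be noisy. The sketch below shows one way to get a cross-validated ROC-AUC with `cross_val_score`, using synthetic data from `make_classification`. The names `X_demo`, `y_demo`, and `demo_model` are illustrative assumptions and are not part of the lab's variables.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem (not the Titanic data)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

demo_model = LogisticRegression(max_iter=1000, random_state=0)

# 5-fold cross-validation; each fold yields one ROC-AUC value
cv_demo = cross_val_score(demo_model, X_demo, y_demo, cv=5, scoring="roc_auc")
print(f"CV ROC-AUC: {cv_demo.mean():.3f} +/- {cv_demo.std():.3f}")
```

A large standard deviation across folds suggests that small differences between models may not be meaningful.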

**TODO 3.2:** Compare models with ROC curves and performance metrics

In [None]:
# TODO 3.2: Compare models
# Create performance comparison table (store it in a DataFrame named model_comparison)
# YOUR CODE HERE

print("Model Performance Comparison:")
display(model_comparison)

# Plot ROC curves
# YOUR CODE HERE

plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves - Model Comparison')
plt.legend()
plt.grid(True)
plt.show()

**Question 3.2:** What does the ROC curve tell us about model performance? How do you interpret the AUC score?
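
One way to build intuition for AUC: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, assuming there are no tied scores. The sketch below checks this on four made-up predictions. The labels and scores are hypothetical.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, scores)

# Pairwise interpretation: compare every positive score with every negative score
pos = [s for s, y in zip(scores, y_true) if y == 1]
neg = [s for s, y in zip(scores, y_true) if y == 0]
pairs = list(product(pos, neg))
ranked_correctly = sum(p > n for p, n in pairs) / len(pairs)

print(auc, ranked_correctly)  # 0.75 0.75
```

Three of the four pairs are ordered correctly (only 0.35 versus 0.4 is wrong), which gives 0.75. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.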

## Part 4: Feature Importance and Overfitting Analysis

**TODO 4.1:** Analyze feature importance

In [None]:
# TODO 4.1: Feature importance analysis
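# Get the Random Forest model from the results dictionary
# (check your comparison table first: it may or may not be your best performer)
# YOUR CODE HERE

# Create feature importance DataFrame (name it feature_importance, sorted from most to least important)
# YOUR CODE HERE

# Plot feature importance
# YOUR CODE HERE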

print("\nTop 5 Most Important Features:")
display(feature_importance.head())

**Question 4.1:** Which features are most important for predicting survival? Does this match your intuition?
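
For reference, the sketch below shows the general shape of an importance table built from a fitted random forest's `feature_importances_` attribute. It uses synthetic data, and the feature names are hypothetical labels chosen for illustration: with `shuffle=False`, `make_classification` places the informative columns first.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the first 2 columns carry signal, the last 2 are noise
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = ["signal_a", "signal_b", "noise_a", "noise_b"]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1 across features
importance_demo = (
    pd.DataFrame({"feature": feature_names, "importance": forest.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importance_demo)
```

Impurity-based importances can favor features with many distinct values, so treat the ranking as a guide, not a definitive answer. `sklearn.inspection.permutation_importance` is one alternative worth comparing against.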

**TODO 4.2:** Analyze overfitting

In [None]:
# TODO 4.2: Overfitting analysis
# Compare training vs test performance for all models
# (the plotting code below expects model_names, a list of model names, and x, their bar positions)
# YOUR CODE HERE

# Plot training vs test performance
# YOUR CODE HERE

plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.title('Training vs Test Accuracy - Overfitting Analysis')
plt.xticks(x, model_names, rotation=45)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

**Question 4.2:** Which model shows signs of overfitting? How can you tell?
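
A common sign of overfitting is a large gap between training and test accuracy. The sketch below illustrates this on synthetic, deliberately noisy data by comparing an unrestricted decision tree with a depth-limited one. The dataset and the depth value of 3 are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y) so that memorizing the training set does not generalize
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

scores_demo = {}
for label, depth in [("unrestricted", None), ("max_depth=3", 3)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_acc = tree.score(X_tr, y_tr)
    test_acc = tree.score(X_te, y_te)
    scores_demo[label] = (train_acc, test_acc)
    print(f"{label:>13}: train={train_acc:.3f}  test={test_acc:.3f}  gap={train_acc - test_acc:.3f}")
```

The unrestricted tree can memorize the training set, so its training accuracy says little about how it will behave on unseen passengers. The size of the gap is the signal to look for in your own models.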

## Part 5: Hyperparameter Tuning (Optional Advanced Challenge)

**TODO 5.1:** Perform hyperparameter tuning on the best model
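
Before running a grid search, it helps to estimate how much work it involves. The sketch below counts the combinations in a grid with the same values as the one in the cell that follows, using `ParameterGrid`. The 5-fold setting is an assumption (it is also the `GridSearchCV` default when `cv` is not specified).

```python
from sklearn.model_selection import ParameterGrid

# Same values as the lab's param_grid, copied here so the example is self-contained
param_grid_demo = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

n_candidates = len(ParameterGrid(param_grid_demo))  # 3 * 4 * 3 * 3
n_folds = 5  # assumed number of cross-validation folds
print(n_candidates, n_candidates * n_folds)  # 108 540
```

If that many fits is too slow on your machine, passing `n_jobs=-1` to `GridSearchCV` or trimming the grid are two options to consider.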

In [None]:
# TODO 5.1: Hyperparameter tuning (Optional)
# Use GridSearchCV to tune Random Forest hyperparameters

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

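# Create GridSearchCV object (name it grid_search; use scoring='roc_auc' to stay consistent with Part 3)
# YOUR CODE HERE

# Fit the grid search on the scaled training data
# YOUR CODE HERE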

# Print best parameters and score
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

# Evaluate best model on test set
best_model = grid_search.best_estimator_
y_pred_tuned = best_model.predict(X_test_scaled)
y_pred_proba_tuned = best_model.predict_proba(X_test_scaled)[:, 1]

print(f"\nTuned Model Test Accuracy: {best_model.score(X_test_scaled, y_test):.4f}")
print(f"Tuned Model Test ROC-AUC: {roc_auc_score(y_test, y_pred_proba_tuned):.4f}")

# Compare with original model
original_rf = results['Random Forest']['model']
original_auc = results['Random Forest']['roc_auc']
print(f"\nOriginal Random Forest ROC-AUC: {original_auc:.4f}")
print(f"Improvement: {roc_auc_score(y_test, y_pred_proba_tuned) - original_auc:.4f}")

## Summary and Reflection

**TODO:** Complete the summary questions below

### Key Learnings:
1. What was the most surprising result from your analysis?
2. Which model would you recommend for this prediction task and why?
3. What are the most important features for predicting survival?
4. How did you handle overfitting in your models?
5. What would you do differently if you had more time/data?

### Answers:
1. 
2. 
3. 
4. 
5. 

### Next Steps:
- Try advanced techniques like stacking or feature selection
- Experiment with different preprocessing approaches
- Consider domain-specific features
- Explore model interpretability techniques
- Try RAPIDS GPU acceleration for larger datasets