# Business Understanding: Titanic Machine Learning from Disaster

## Overview

The sinking of the RMS Titanic on April 15, 1912, during her maiden voyage is one of the most infamous maritime disasters in history. After colliding with an iceberg, the "unsinkable" ship sank. There were not enough lifeboats for everyone on board, and 1,502 of the 2,224 passengers and crew died.

While survival had elements of luck, certain groups of people had higher survival rates than others. This Kaggle competition challenges us to build a predictive model to answer: **"What sorts of people were more likely to survive?"** using passenger data including name, age, gender, socio-economic class, and other factors.

This notebook covers the Business Understanding phase of the **CRISP-DM methodology**.

## Business Objectives

### Primary Goal
**Predict passenger survival** on the Titanic using machine learning techniques. Specifically:
- Build a binary classification model to predict the `Survived` target variable (0 = did not survive, 1 = survived)
- Use passenger characteristics from the training data to make predictions on the test set
- Submit predictions in the required format: `PassengerId`, `Survived` columns
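The submission layout described above can be sketched as follows. This is a minimal illustration, not the notebook's actual submission code: `make_submission` is a hypothetical helper, and the IDs and predictions are made-up values.

```python
import pandas as pd


def make_submission(passenger_ids, predictions):
    """Build a two-column frame in the PassengerId/Survived layout (hypothetical helper)."""
    submission = pd.DataFrame({
        "PassengerId": list(passenger_ids),
        "Survived": [int(p) for p in predictions],  # force integer labels
    })
    # Guard against anything other than the two valid labels.
    if not submission["Survived"].isin([0, 1]).all():
        raise ValueError("Survived must contain only 0 or 1")
    return submission


# Illustrative IDs and predictions, not real model output.
submission = make_submission([892, 893, 894], [0, 1, 0])
csv_text = submission.to_csv(index=False)  # no index column in the file
print(csv_text)
```

Writing with `index=False` matters here, because an extra index column would break the two-column format.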

### Competition Context
- **Platform**: Kaggle - "Titanic: Machine Learning from Disaster"
- **Type**: Binary classification problem
- **Target**: Predict survival (0 or 1) for passengers in the test set
- **Historical significance**: Understanding survival patterns from one of history's most famous maritime disasters

## Success Criteria

### Primary Success Metrics
**Kaggle Public Leaderboard Performance:**
- **Target**: Achieve accuracy ≥ 0.76555 (gender baseline)
- **Competitive**: Aim for top 50% of submissions
- **Stretch**: Reach top 25% performance

**Evaluation Metric:** Accuracy = (Correct Predictions) / (Total Predictions)
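As a quick illustration of the metric, the formula can be written as a small function. This is a sketch on toy labels, assuming plain Python lists as input; it is not Kaggle's scoring code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels for illustration only: 4 of 5 predictions match.
example_score = accuracy([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
print(example_score)  # 0.8
```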

### Secondary Success Metrics
**Model Reliability:**
- Cross-validation accuracy stable (standard deviation ≤ 0.02)
- Minimal gap between CV and leaderboard scores (≤ 0.03) as a check against overfitting
- Consistent performance across different data splits
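The two numeric thresholds above could be checked with a helper along these lines. It is a sketch only: `check_reliability` is a hypothetical name, the fold scores and leaderboard score are made up, and the use of the sample standard deviation is an assumption, since the criteria do not say which variant is meant.

```python
from statistics import mean, stdev

# Thresholds taken from the reliability criteria above.
MAX_CV_STD = 0.02
MAX_CV_LB_GAP = 0.03


def check_reliability(cv_scores, leaderboard_score):
    """Summarize CV scores and report whether both thresholds are met."""
    cv_mean = mean(cv_scores)
    cv_std = stdev(cv_scores)  # sample standard deviation across folds (assumption)
    gap = abs(cv_mean - leaderboard_score)
    return {
        "cv_mean": cv_mean,
        "cv_std": cv_std,
        "gap": gap,
        "stable": cv_std <= MAX_CV_STD,
        "generalizes": gap <= MAX_CV_LB_GAP,
    }


# Made-up fold scores and leaderboard score, for illustration only.
report = check_reliability([0.80, 0.82, 0.81, 0.79, 0.83], leaderboard_score=0.79)
print(report)
```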

**Technical Deliverables:**
- Reproducible analysis pipeline
- Well-documented feature engineering process
- Clear model interpretation and insights
- Professional submission format

## Business Questions & Hypotheses

### Key Questions to Answer
1. **Demographics**: What role did gender, age, and class play in survival?
2. **Social Status**: Did passenger class (1st, 2nd, 3rd) significantly impact survival rates?
3. **Family Structure**: How did traveling with family members affect survival chances?
4. **Economic Factors**: Did fare paid correlate with survival probability?
5. **Location**: Did cabin location or port of embarkation matter?

### Initial Hypotheses
Based on historical knowledge of the disaster:

**Strong Predictors (Expected):**
- **Gender**: Women had higher survival rates ("women and children first" protocol)
- **Class**: First-class passengers had better access to lifeboats
- **Age**: Children were prioritized for rescue

**Moderate Predictors (Possible):**
- **Family Size**: Small families might coordinate better than large groups
- **Fare**: Higher fares might indicate better cabin locations
- **Embarked**: Different ports might reflect different passenger demographics

**Weak Predictors (Uncertain):**
- **Name/Title**: Might indicate social status beyond class
- **Ticket**: Could reveal group bookings or special arrangements

## Constraints, Risks & Assumptions

### Data Constraints
**Limited Dataset Size:**
- Training set: 891 passengers (small by ML standards)
- Test set: 418 passengers
- High variance risk due to limited data
- **Mitigation**: Use regularized models, careful cross-validation

**Missing Data:**
- Age: Significant missing values expected
- Cabin: Likely extensive missing information
- Other features: Some missing values anticipated
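One way to quantify these gaps in the Data Understanding phase is a per-column missing-value summary. The sketch below assumes pandas; it runs on a toy frame with made-up values, not the real Titanic data, and `missing_summary` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values per column, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Toy frame with made-up values, for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})
summary = missing_summary(toy)
print(summary)
```

Running the same function on both `train_df` and `test_df` would also help test the assumption that missing-data patterns are consistent between the two sets.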

### Risk Assessment

**1. Overfitting Risk** 🔴 **HIGH**
- Small dataset increases variance
- Risk of memorizing training patterns
- **Mitigation**: Simple models, robust CV, regularization

**2. Data Leakage Risk** 🟡 **MEDIUM**
- Careful feature engineering required
- Avoid using future information or test set statistics
- **Mitigation**: Strict train/test separation, grouped CV for families
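To make "avoid test set statistics" concrete, the sketch below learns an imputation value from the training rows only and then applies it to both sets. The helper names and the toy ages are hypothetical, and median imputation is just one possible choice, not necessarily the one used later in this project.

```python
import numpy as np
import pandas as pd


def fit_age_median(train):
    """Learn the imputation value from the training rows only."""
    return train["Age"].median()


def impute_age(df, age_median):
    """Fill missing ages with a value learned elsewhere; returns a copy."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(age_median)
    return out


# Toy frames with made-up ages, for illustration only.
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 60.0]})

age_median = fit_age_median(toy_train)  # computed without looking at toy_test
train_filled = impute_age(toy_train, age_median)
test_filled = impute_age(toy_test, age_median)  # test rows never influence the median
print(test_filled)
```

The same rule applies inside cross-validation: statistics should be fitted on each training fold and only applied to the matching validation fold.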

**3. Historical Bias** 🟡 **MEDIUM**
- 1912 social structures different from modern context
- Survival may reflect societal biases of the era
- **Impact**: Accept as historical reality, focus on prediction accuracy

### Key Assumptions
- Training data is representative of the test data
- Missing data patterns are consistent between train/test
- Historical records are accurate where values are present (missing fields are handled separately)
- Survival was influenced by measurable passenger characteristics

## Solution Approach

### Methodology: CRISP-DM Framework
Following the **Cross-Industry Standard Process for Data Mining**:

1. **Business Understanding** ✅ (This notebook)
2. **Data Understanding** - Explore data patterns and quality
3. **Data Preparation** - Clean, transform, and engineer features
4. **Modeling** - Build and tune predictive models
5. **Evaluation** - Assess model performance and business value
6. **Deployment** - Submit predictions to Kaggle

### Modeling Strategy

**Baseline Models:**
- Majority class prediction
- Gender-based rules (known to achieve ~76.6% accuracy)
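Both baselines are simple enough to sketch directly. The example below assumes pandas and uses toy rows with made-up labels, so the printed accuracies say nothing about the real data; the function names are hypothetical.

```python
import pandas as pd


def majority_class_predictions(y_train, n_rows):
    """Predict the most frequent training label for every row."""
    majority = y_train.mode().iloc[0]
    return pd.Series([majority] * n_rows)


def gender_rule_predictions(df):
    """Predict survival for every female passenger and death for every male."""
    return df["Sex"].map({"female": 1, "male": 0})


# Toy rows with made-up labels, not real Titanic records.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Survived": [0, 1, 1, 0, 1],
})
majority_acc = (majority_class_predictions(toy["Survived"], len(toy)) == toy["Survived"]).mean()
gender_acc = (gender_rule_predictions(toy) == toy["Survived"]).mean()
print(majority_acc, gender_acc)
```

Any trained model should beat both numbers under cross-validation before it is worth submitting.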

**Primary Models:**
1. **Logistic Regression** - Interpretable linear model
2. **Random Forest** - Robust ensemble method
3. **Gradient Boosting** - High-performance ensemble

**Advanced Techniques (if time permits):**
- Feature engineering: Title extraction, family grouping
- Hyperparameter tuning with cross-validation
- Model stacking/ensembling

### Cross-Validation Strategy
- **Stratified K-Fold** (k=5) to maintain class balance
- Consider **GroupKFold** by family/ticket if needed to prevent leakage
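The difference between the two splitters can be seen on synthetic data. This sketch assumes scikit-learn is available; the labels, placeholder features, and group ids are invented for illustration, with each pair of rows standing in for a hypothetical family.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

# Synthetic labels and group ids, for illustration only.
y = np.array([0, 0, 0, 1, 1] * 10)   # 50 rows, 40% positive
X = np.zeros((len(y), 1))            # placeholder features
groups = np.arange(len(y)) // 2      # 25 hypothetical families of 2 rows each

# Stratified folds keep the positive rate of each validation fold near 40%.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_rates = [y[val_idx].mean() for _, val_idx in skf.split(X, y)]

# Grouped folds keep every row of a group on the same side of the split.
gkf = GroupKFold(n_splits=5)
group_overlaps = [
    len(set(groups[train_idx]) & set(groups[val_idx]))
    for train_idx, val_idx in gkf.split(X, y, groups)
]
print(fold_rates, group_overlaps)
```

Note that `GroupKFold` does not stratify by label, so choosing it trades class-ratio control for leakage protection.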

## Deliverables & Timeline

### Final Deliverables

**1. Kaggle Submission** 🎯
- `submission.csv` with PassengerId and Survived predictions
- Achieve target accuracy ≥ 0.76555
- Competition ranking and leaderboard score

**2. Technical Artifacts** 📊
- Complete Jupyter notebook pipeline (6 notebooks total)
- Trained model artifacts (pickled pipelines)
- Feature engineering functions (`src/features.py`)
- Model training utilities (`src/modeling.py`)

**3. Analysis Reports** 📝
- Executive summary of key findings
- Feature importance analysis
- Model performance comparison
- Insights on survival patterns

### Project Timeline

**Phase 1: Data Understanding** (Day 1)
- Load and explore train/test datasets
- Identify missing data patterns
- Initial survival rate analysis

**Phase 2: Data Preparation** (Day 2)
- Handle missing values (Age, Cabin, Embarked)
- Engineer features (Title, FamilySize, IsAlone)
- Create preprocessing pipeline

**Phase 3: Modeling** (Day 3)
- Build baseline models
- Train multiple algorithms
- Hyperparameter tuning

**Phase 4: Evaluation & Submission** (Day 4)
- Cross-validation analysis
- Final model selection
- Generate and submit predictions

## Initial Data Overview

Let's take a quick look at the available data to confirm our understanding:

In [1]:
import pandas as pd
import numpy as np
import os

# Load the datasets
train_df = pd.read_csv('../data/raw/train.csv')
test_df = pd.read_csv('../data/raw/test.csv')
sample_submission = pd.read_csv('../data/raw/gender_submission.csv')

print("📊 DATASET OVERVIEW")
print("=" * 50)
print(f"Training set shape: {train_df.shape}")
print(f"Test set shape: {test_df.shape}")
print(f"Sample submission shape: {sample_submission.shape}")

print("\n🎯 TARGET VARIABLE DISTRIBUTION")
print("=" * 50)
survival_counts = train_df['Survived'].value_counts()
survival_rate = train_df['Survived'].mean()
print(f"Survival rate: {survival_rate:.3f} ({survival_rate*100:.1f}%)")
print(f"Survivors: {survival_counts[1]} passengers")
print(f"Non-survivors: {survival_counts[0]} passengers")

📊 DATASET OVERVIEW
Training set shape: (891, 12)
Test set shape: (418, 11)
Sample submission shape: (418, 2)

🎯 TARGET VARIABLE DISTRIBUTION
Survival rate: 0.384 (38.4%)
Survivors: 342 passengers
Non-survivors: 549 passengers


In [2]:
print("\n📋 AVAILABLE FEATURES")
print("=" * 50)
print("Training set columns:")
for i, col in enumerate(train_df.columns, 1):
    print(f"{i:2d}. {col}")

print("\nTest set columns (missing 'Survived'):")
for i, col in enumerate(test_df.columns, 1):
    print(f"{i:2d}. {col}")

print("\n🔍 SAMPLE DATA")
print("=" * 50)
print("First 3 rows of training data:")
print(train_df.head(3))


📋 AVAILABLE FEATURES
Training set columns:
 1. PassengerId
 2. Survived
 3. Pclass
 4. Name
 5. Sex
 6. Age
 7. SibSp
 8. Parch
 9. Ticket
10. Fare
11. Cabin
12. Embarked

Test set columns (missing 'Survived'):
 1. PassengerId
 2. Pclass
 3. Name
 4. Sex
 5. Age
 6. SibSp
 7. Parch
 8. Ticket
 9. Fare
10. Cabin
11. Embarked

🔍 SAMPLE DATA
First 3 rows of training data:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON

In [3]:
print("\n🏆 BASELINE PERFORMANCE CHECK")
print("=" * 50)

# Gender baseline (predict all females survive, all males don't)
gender_baseline = train_df.groupby('Sex')['Survived'].mean()
print("Survival rate by gender:")
print(gender_baseline)

# Calculate accuracy if we predict based on gender
gender_predictions = train_df['Sex'].map({'female': 1, 'male': 0})
gender_accuracy = (gender_predictions == train_df['Survived']).mean()
print(f"\nGender-based prediction accuracy: {gender_accuracy:.4f} ({gender_accuracy*100:.2f}%)")

# Class baseline
class_baseline = train_df.groupby('Pclass')['Survived'].mean()
print("\nSurvival rate by passenger class:")
print(class_baseline)


🏆 BASELINE PERFORMANCE CHECK
Survival rate by gender:
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

Gender-based prediction accuracy: 0.7868 (78.68%)

Survival rate by passenger class:
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64


## 📊 Key Data Insights from Business Understanding

### Dataset Overview
- **Training data**: 891 passengers with 12 columns (including the `PassengerId` identifier and the `Survived` target)
- **Test data**: 418 passengers with 11 columns (no `Survived` target)
- **Class imbalance**: Only 38.4% survived (342 out of 891), so always predicting "did not survive" already scores 61.6% accuracy on the training set

### 🎯 Critical Baseline Findings

**1. Gender is the Strongest Predictor** ⭐
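- **Female survival rate**: 74.2% (roughly 3 out of 4 women survived)
- **Male survival rate**: 18.9% (fewer than 1 in 5 men survived)
- **Gender-only accuracy**: 78.68% on the training set. This is consistent with the 0.76555 gender-baseline target, but the two are not directly comparable because they are measured on different data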
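The training-set accuracy of the gender rule can be cross-checked from the printed figures alone. The sketch below recovers the group sizes from the overall and per-gender survival rates; those sizes are inferred by rounding, not read from the data, so treat them as a consistency check rather than a result.

```python
# Figures printed in the notebook output above.
n_total = 891
n_survived = 342
female_rate = 0.742038
male_rate = 0.188908

# Solve female_rate * f + male_rate * (n_total - f) = n_survived for f.
n_female = round((n_survived - male_rate * n_total) / (female_rate - male_rate))
n_male = n_total - n_female

female_survivors = round(female_rate * n_female)
male_survivors = round(male_rate * n_male)

# The gender rule is right for surviving women and for men who died.
correct = female_survivors + (n_male - male_survivors)
gender_accuracy = correct / n_total
majority_accuracy = (n_total - n_survived) / n_total  # always predict "did not survive"

print(n_female, n_male, round(gender_accuracy, 4), round(majority_accuracy, 4))
```

The result reproduces the 0.7868 printed by the notebook and shows that the gender rule improves on the majority-class baseline by about 17 percentage points on the training set.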

**2. Passenger Class Shows Clear Hierarchy** 📈
- **1st Class**: 63.0% survival rate
- **2nd Class**: 47.3% survival rate
- **3rd Class**: 24.2% survival rate
- Clear social stratification effect on survival

### 🔍 Available Features Analysis
The sample data reveals rich information:

- **PassengerId**: Unique identifier
- **Pclass**: 1, 2, 3 (clear ordinal relationship)
- **Name**: Contains titles (Mr., Mrs., Miss.) - potential feature engineering
- **Sex**: Strong binary predictor confirmed
- **Age**: Continuous variable (no NaN in the three sample rows, but missing values are expected)
- **SibSp/Parch**: Family relationship counts
- **Ticket**: Alphanumeric codes (may indicate groups)
- **Fare**: Continuous price variable (economic indicator)
- **Cabin**: Deck/room info (NaN already visible in the first sample row; extensive missing data expected)
- **Embarked**: Port codes (S, C, Q)

### 🚨 Business Intelligence Alerts

**✅ Baseline Target Validated**
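- The gender rule scores 78.68% on the training set, in line with the ~76.6% figure used as our target
- The two numbers are measured on different data, so the training figure does not show that the target has been exceeded; it only confirms the target is a sensible bar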

**⚠️ Class Imbalance Noted**
- 61.6% didn't survive vs 38.4% survived
- Need stratified sampling in cross-validation

**🔍 Feature Engineering Opportunities**
- **Name titles**: Mr./Mrs./Miss. visible in sample
- **Family size**: SibSp + Parch + 1
- **Economic status**: Fare + Pclass combination
- **Missing data patterns**: Cabin shows NaN in the sample; Age is expected to have gaps as well
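The title and family-size ideas can be sketched on the three names shown in the sample output. This assumes pandas; `add_basic_features` is a hypothetical helper, and the regular expression is one plausible way to pull out the title, not a rule validated against the full dataset.

```python
import pandas as pd

# Names and counts copied from the first rows of the sample output above;
# the second name is truncated there and is kept truncated here.
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
    ],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
})


def add_basic_features(df):
    """Add Title, FamilySize and IsAlone columns; returns a copy."""
    out = df.copy()
    # The title sits between the comma after the surname and the next period.
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1  # +1 counts the passenger
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)
    return out


features = add_basic_features(sample)
print(features[["Title", "FamilySize", "IsAlone"]])
```

On the full data, rare titles would probably need grouping before they are useful as a categorical feature.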

### 📋 Validated Business Understanding

**Problem Confirmed**: Binary classification with clear historical patterns  
**Success Metrics**: The gender baseline (78.68% on the training set, 0.76555 as the leaderboard target) sets a realistic bar  
**Risk Assessment**:
- Small dataset (891 samples) confirmed
- Missing data visible in the sample (Cabin) and expected for Age
- Moderate class imbalance present (38.4% survived vs. 61.6% did not)

**Next Phase Priorities**:
1. **Missing data analysis** - How much Age/Cabin data is missing?
2. **Feature distribution exploration** - Outliers in Age/Fare?
3. **Family dynamics investigation** - Do SibSp/Parch patterns matter?
4. **Text feature extraction** - Mine titles from Name field