# Titanic Survival Analysis and Prediction

This is a comprehensive Python solution for the Titanic competition.

# Introduction

In this notebook, we look at the Titanic disaster and use machine learning to predict passenger survival. We combine data cleaning, feature engineering, and a Random Forest classifier to build a predictive model. The analysis focuses on improving accuracy and on what the historical data can tell us about who survived.

# Required libraries

In [1]:
import pandas as pd  # For data manipulation
import numpy as np  # For numerical operations
from sklearn.model_selection import train_test_split  # For splitting data
from sklearn.preprocessing import StandardScaler  # For scaling features
from sklearn.ensemble import RandomForestClassifier  # The ML model
from sklearn.impute import SimpleImputer  # For handling missing values
from sklearn.metrics import accuracy_score, classification_report  # For evaluation

# 1. Load and Inspect the Data

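Download the competition data files (train.csv and test.csv). The paths below assume the Kaggle notebook environment, where the files are mounted under /kaggle/input/titanic/. If you run the code locally, change the paths to wherever you saved the files.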

In [2]:
# Load the data
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

**train.csv**: Contains data with known outcomes (in this case, whether passengers survived)

* Used to train the model
* Includes the target variable ('Survived')
* Helps the model learn patterns


**test.csv**: Contains data without the outcome

* Used to make predictions on new data
* Doesn't include the target variable
* Tests how well your model generalizes to new data
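
The difference between the two files can be checked directly. A minimal sketch, using two tiny hand-made DataFrames as stand-ins for `train_data` and `test_data` (the rows and the names `train_sample` and `test_sample` are illustrative, not the real files):

```python
import pandas as pd

# Tiny stand-ins for train_data and test_data (illustrative rows only)
train_sample = pd.DataFrame({
    'PassengerId': [1, 2],
    'Survived': [0, 1],
    'Pclass': [3, 1],
})
test_sample = pd.DataFrame({
    'PassengerId': [892, 893],
    'Pclass': [3, 3],
})

# Columns present in train but absent from test: this should be the target only
target_only = sorted(set(train_sample.columns) - set(test_sample.columns))
print(target_only)
```

Running the same set difference on the real DataFrames is a quick way to confirm that 'Survived' is the only column the test file lacks.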

In [3]:
print("\nDataset Shape - train_data:", train_data.shape)
print("\nDataset Shape - test_data:", test_data.shape)


Dataset Shape - train_data: (891, 12)

Dataset Shape - test_data: (418, 11)


In [4]:
train_data.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [5]:
test_data.head()

|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
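
The previews already show empty Cabin entries, so it is worth counting missing values before deciding how to fill them. A minimal sketch, using only the Age and Cabin values from the five `train_data.head()` rows shown earlier (the name `head_sample` is made up for this illustration):

```python
import numpy as np
import pandas as pd

# Age and Cabin values copied from the five train_data.head() rows
head_sample = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan],
})

# Count missing entries per column
missing_counts = head_sample.isna().sum()
print(missing_counts)
```

The same call on the full DataFrames, `train_data.isna().sum()`, would show which columns need the imputation steps described in the next section.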


# 2. Data Preprocessing Function

* This function takes raw Titanic passenger data and cleans it up for machine learning

* Key preprocessing steps include:
    * Extracting titles from passenger names (Mr, Mrs, Miss, etc.)
    * Creating family size features
    * Handling missing values (age, fare, embarked status)
    * Creating bins for fare and age ranges
    * Converting categorical variables into numerical format (one-hot encoding)
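
The first three steps in this list can be tried in isolation on a handful of rows. A minimal sketch, assuming four passengers taken from the training preview; the Age of the last passenger is blanked out here purely to demonstrate imputation (it is 35.0 in the preview), and the name `sample` is made up for this illustration:

```python
import numpy as np
import pandas as pd

# Four passengers from the training preview; the last Age is deliberately
# set to NaN for this demonstration only.
sample = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Heikkinen, Miss. Laina',
             'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
             'Allen, Mr. William Henry'],
    'Age': [22.0, 26.0, 35.0, np.nan],
    'SibSp': [1, 0, 1, 0],
    'Parch': [0, 0, 0, 0],
})

# Title: the word that follows a space and ends with a period
sample['Title'] = sample['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Family size counts the passenger plus siblings/spouses and parents/children
sample['FamilySize'] = sample['SibSp'] + sample['Parch'] + 1
sample['IsAlone'] = (sample['FamilySize'] == 1).astype(int)

# Fill each missing Age with the median Age of passengers sharing the Title
sample['Age'] = sample.groupby('Title')['Age'].transform(
    lambda ages: ages.fillna(ages.median())
)

print(sample[['Title', 'FamilySize', 'IsAlone', 'Age']])
```

In this toy sample the only other "Mr" is 22 years old, so the blanked Age is filled with 22.0. On the full dataset the group medians come from many more passengers.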

In [6]:
def preprocess_data(df):
    """
    Preprocess the Titanic dataset by handling missing values, creating new features,
    and encoding categorical variables.
    """
    # Create a copy to avoid modifying original data
    data = df.copy()

    # Extract titles from names
    data['Title'] = data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    title_mapping = {
        "Mr": "Mr",
        "Miss": "Miss",
        "Mrs": "Mrs",
        "Master": "Master",
        "Dr": "Rare",
        "Rev": "Rare",
        "Col": "Rare",
        "Major": "Rare",
        "Mlle": "Miss",
        "Countess": "Rare",
        "Ms": "Miss",
        "Lady": "Rare",
        "Jonkheer": "Rare",
        "Don": "Rare",
        "Dona": "Rare",
        "Mme": "Mrs",
        "Capt": "Rare",
        "Sir": "Rare"
    }
    # Map raw titles to groups; any title not listed above falls back to "Rare"
    data['Title'] = data['Title'].map(title_mapping).fillna('Rare')

    # Create family size feature where SibSp(siblings/spouses) and Parch(parents/children)
    data['FamilySize'] = data['SibSp'] + data['Parch'] + 1

    # Create is_alone feature for solo travelers
    data['IsAlone'] = (data['FamilySize'] == 1).astype(int)

    # Fill missing ages with median age by Title
    data['Age'] = data.groupby('Title')['Age'].transform(
        lambda x: x.fillna(x.median())
    )

    # Fill missing embarked with the most common port
    data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])

    # Fill missing Fare with median fare
    data['Fare'] = data['Fare'].fillna(data['Fare'].median())
    
    # Convert continuous variables into categorical ranges
    # Create fare bins (quartiles of the DataFrame passed in)
    data['FareBin'] = pd.qcut(data['Fare'], 4, labels=['Low', 'Mid', 'Mid-High', 'High'])

    # Create age bins
    data['AgeBin'] = pd.cut(data['Age'], 5, labels=['Child', 'Young', 'Adult', 'Middle', 'Senior'])

    # Drop columns that won't be used for prediction
    columns_to_drop = ['Name', 'Ticket', 'Cabin', 'PassengerId']
    data = data.drop(columns=columns_to_drop)

    # Convert categorical variables to dummy variables
    categorical_columns = ['Sex', 'Embarked', 'Title', 'FareBin', 'AgeBin']
    data = pd.get_dummies(data, columns=categorical_columns)

    return data
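
One thing to be aware of: because `pd.qcut` and `pd.cut` compute their bin edges from whichever DataFrame is passed in, the same label (for example 'Low') can cover different fare ranges in the training and test sets. A possible alternative, not used in this notebook, is to learn the edges on the training data and reuse them. The sketch below assumes made-up fare values and hypothetical variable names:

```python
import numpy as np
import pandas as pd

# Made-up fares standing in for the train and test 'Fare' columns
train_fare = pd.Series([5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0])
test_fare = pd.Series([7.0, 14.0, 30.0, 500.0])

labels = ['Low', 'Mid', 'Mid-High', 'High']

# Binning each set on its own yields different quartile edges
_, train_edges = pd.qcut(train_fare, 4, labels=labels, retbins=True)
_, test_edges = pd.qcut(test_fare, 4, labels=labels, retbins=True)
print("train edges:", train_edges)
print("test edges: ", test_edges)

# Reuse the training edges, opening the outer bins so that test values
# outside the training range still receive a label
shared_edges = np.concatenate(([-np.inf], train_edges[1:-1], [np.inf]))
test_bins = pd.cut(test_fare, bins=shared_edges, labels=labels)
print(test_bins.tolist())
```

With shared edges, a given fare always lands in the same bin regardless of which file it came from. Adopting this in `preprocess_data` would require passing the edges in as an argument, and it may change the validation numbers reported later.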

# 3. Model Training Function

* Sets up and trains a Random Forest Classifier
* Splits data into training and validation sets (80%/20%)
* Prints out model performance metrics
* Returns both the trained model and feature names

In [7]:
def train_model(train_data):
    """
    Train a Random Forest model on the preprocessed data.
    """
    # Separate features and target
    X = train_data.drop('Survived', axis=1) # Features
    y = train_data['Survived'] # Target variable
    
    # Split the data
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    # Initialize and train the model
    model = RandomForestClassifier(
        n_estimators=100,      # Number of trees
        max_depth=10,          # Maximum tree depth
        min_samples_split=5,   # Minimum samples needed to split a node
        min_samples_leaf=2,    # Minimum samples in a leaf node
        random_state=42        # For reproducibility
    )
    
    model.fit(X_train, y_train)
    
    # Make predictions on validation set
    val_predictions = model.predict(X_val)
    
    # Print validation metrics
    print("\nValidation Metrics:")
    print("Accuracy:", accuracy_score(y_val, val_predictions))
    print("\nClassification Report:")
    print(classification_report(y_val, val_predictions))
    
    return model, X.columns
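
A single 80%/20% split gives one accuracy number that depends on which rows happen to land in the validation set. As a complement, cross-validation averages over several splits. The sketch below is an illustration rather than part of the notebook's pipeline: it uses synthetic data from `make_classification` as a stand-in for the preprocessed Titanic features, with the same Random Forest settings as above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the preprocessed features (an assumption, not the
# real Titanic data): 891 rows, 20 numeric columns, binary target
X, y = make_classification(n_samples=891, n_features=20, random_state=42)

model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
)

# Five stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean(), "Std:", scores.std())
```

With the real data, `X` and `y` would be the feature matrix and 'Survived' column built inside `train_model`. The spread across folds gives a sense of how much the single-split accuracy could vary.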

# 4. Prediction Function

* Takes new data and makes survival predictions
* Ensures the test data matches the training data format
* Builds a submission DataFrame with PassengerId and predictions (the main block writes it to a CSV file)

In [8]:
def make_predictions(model, feature_names, test_data):
    """
    Make predictions on test data.
    """
    # Preprocess test data
    processed_test = preprocess_data(test_data)
    
    # Ensure test data has same columns as training data
    for col in feature_names:
        if col not in processed_test.columns:
            processed_test[col] = 0
    
    # Reorder columns to match training data
    processed_test = processed_test[feature_names]
    
    # Make predictions
    predictions = model.predict(processed_test)
    
    # Create submission dataframe
    submission = pd.DataFrame({
        'PassengerId': test_data['PassengerId'],
        'Survived': predictions
    })
    
    return submission

This function:
* Preprocesses the test data using the same steps as the training data
* Ensures all features match the training data (adds missing columns, filled with 0, if necessary)
* Makes predictions on the test data
* Returns a submission DataFrame with the predictions
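
The column-matching step can also be written with `DataFrame.reindex`, which adds missing columns, drops extra ones, and fixes the order in a single call. A minimal sketch, assuming a made-up feature list and a two-row frame (the column 'Title_Dona' is a hypothetical extra column used only for illustration):

```python
import pandas as pd

# Feature order the model was trained on (shortened, illustrative list)
feature_names = ['Age', 'Sex_female', 'Sex_male', 'Title_Mr', 'Title_Rare']

# Stand-in for the preprocessed test data: different column order, one
# training column missing ('Title_Rare'), one extra column ('Title_Dona')
processed_test = pd.DataFrame({
    'Sex_male': [1, 0],
    'Age': [34.5, 47.0],
    'Sex_female': [0, 1],
    'Title_Mr': [1, 0],
    'Title_Dona': [0, 0],
})

# Align to the training columns; columns absent from the test data become 0
aligned = processed_test.reindex(columns=feature_names, fill_value=0)
print(aligned)
```

The result has exactly the training columns in the training order, which is what `model.predict` expects.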

# 5. Main Execution Block

In [9]:
if __name__ == "__main__":
    
    # Process, train, and predict
    processed_train = preprocess_data(train_data)
    model, feature_names = train_model(processed_train)
    submission = make_predictions(model, feature_names, test_data)
    
    # Save predictions
    submission.to_csv('submission.csv', index=False)


Validation Metrics:
Accuracy: 0.8491620111731844

Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.88      0.87       105
           1       0.82      0.81      0.82        74

    accuracy                           0.85       179
   macro avg       0.84      0.84      0.84       179
weighted avg       0.85      0.85      0.85       179



# Titanic survival prediction model results


1. **Overall Accuracy: 0.849 (or 84.9%)**
- The model correctly predicted survival or non-survival for 152 of the 179 passengers in the validation set (152/179 ≈ 0.849)
- A more informative reference point than random chance (50%) is the majority-class baseline: always predicting "did not survive" would score 105/179 ≈ 58.7% on this split
- The model is well above that baseline, which is a good result for the Titanic dataset

2. **Classification Report Breakdown:**

The report shows metrics for two classes:
- Class 0: Did not survive
- Class 1: Survived

Let's analyze each metric:

a) **For Non-Survivors (Class 0):**
- Precision: 0.87
  - Of all passengers predicted to die, 87% actually did die
  - The remaining 13% of predicted deaths were actually survivors
- Recall: 0.88
  - Of all passengers who actually died, the model identified 88% of them
  - The model caught most of the actual deaths
- F1-score: 0.87
  - Harmonic mean of precision and recall
  - Shows balanced performance between precision and recall
- Support: 105
  - There were 105 non-survivors in the validation set

b) **For Survivors (Class 1):**
- Precision: 0.82
  - Of all passengers predicted to survive, 82% actually survived
  - The remaining 18% of predicted survivors actually died, a slightly higher error share than for death predictions
- Recall: 0.81
  - Of all passengers who actually survived, the model identified 81% of them
  - Slightly more missed survivors than missed deaths
- F1-score: 0.82
  - Shows balanced performance between precision and recall
- Support: 74
  - There were 74 survivors in the validation set

3. **Macro vs Weighted Averages:**
- Macro avg (0.84): Simple average of metrics across both classes
- Weighted avg (0.85): Average weighted by class size (more weight to non-survivors since there were more of them)
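
To make the arithmetic behind these numbers concrete, the report can be recomputed by hand. The notebook never prints a confusion matrix, so the four counts below are an inference: they are the integer counts consistent with the reported accuracy (152 correct out of 179), the class supports (105 and 74), and the rounded recalls. Treat them as a worked example, not as notebook output.

```python
# Confusion-matrix counts inferred from the rounded report (an assumption:
# the notebook does not print them). "Positive" means Survived = 1.
tn, fp, fn, tp = 92, 13, 14, 60

support = {0: tn + fp, 1: fn + tp}  # actual non-survivors, actual survivors
precision = {0: tn / (tn + fn), 1: tp / (tp + fp)}
recall = {0: tn / (tn + fp), 1: tp / (tp + fn)}
f1 = {c: 2 * precision[c] * recall[c] / (precision[c] + recall[c])
      for c in (0, 1)}

total = support[0] + support[1]
accuracy = (tn + tp) / total
majority_baseline = max(support.values()) / total


def macro(metric):
    """Unweighted mean over the two classes."""
    return (metric[0] + metric[1]) / 2


def weighted(metric):
    """Mean over the two classes, weighted by class support."""
    return (metric[0] * support[0] + metric[1] * support[1]) / total


for name, metric in [('precision', precision), ('recall', recall), ('f1', f1)]:
    print(f"{name:9s} class0={metric[0]:.2f} class1={metric[1]:.2f} "
          f"macro={macro(metric):.2f} weighted={weighted(metric):.2f}")
print(f"accuracy={accuracy:.3f} majority_baseline={majority_baseline:.3f}")
```

The macro average treats both classes equally, while the weighted average leans toward class 0 because it has 105 of the 179 rows. That is why the weighted figures sit slightly closer to the class 0 values.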

**Key Conclusions:**

1. **Model Performance:**
   - The model reaches about 85% accuracy on the validation set
   - That is well above the majority-class baseline of about 59% (predicting "did not survive" for everyone: 105/179)
   - Performance is similar for survivors and non-survivors (F1-scores of 0.82 and 0.87)

2. **Class Balance:**
   - The validation set has more non-survivors (105) than survivors (74)
   - This reflects the historical reality of the Titanic disaster

3. **Prediction Quality:**
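   - Slightly better at predicting deaths (precision 0.87, recall 0.88) than survivals (precision 0.82, recall 0.81)
   - Precision and recall are within 0.01 of each other for both classes
   - Because precision and recall are nearly equal within each class, the two kinds of error (predicting survival for a non-survivor and the reverse) occur about equally often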

4. **Model Reliability:**
   - The model is slightly more reliable when predicting deaths than survivals
   - The small difference between the macro (0.84) and weighted (0.85) averages indicates that the overall scores are not being carried by the larger class

**Potential Areas for Improvement:**
1. The gap between survivor and non-survivor prediction accuracy could potentially be narrowed
2. The recall for survivors (0.81) could potentially be improved to catch more actual survivors
3. Consider if the slightly better performance for non-survivors is acceptable given the use case
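
One common way to trade precision for recall on the survivor class is to lower the probability threshold used to predict "survived" (the default in `model.predict` corresponds to 0.5 for a binary problem). The sketch below uses made-up labels and probabilities as a stand-in for `y_val` and `model.predict_proba(X_val)[:, 1]`. The helper `recall_precision` is hypothetical, and the numbers are not results from this notebook.

```python
import numpy as np

# Made-up validation labels and predicted survival probabilities
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
proba = np.array([0.9, 0.7, 0.45, 0.2, 0.6, 0.42, 0.3, 0.1])


def recall_precision(y_true, proba, threshold):
    """Recall and precision for the survivor class at a given threshold."""
    pred = (proba >= threshold).astype(int)
    tp = int(((pred == 1) & (y_true == 1)).sum())
    fp = int(((pred == 1) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


for threshold in (0.5, 0.4):
    r, p = recall_precision(y_true, proba, threshold)
    print(f"threshold={threshold}: recall={r:.2f} precision={p:.2f}")
```

In this toy example, lowering the threshold catches more actual survivors but also mislabels more non-survivors as survivors. Whether that trade is worthwhile depends on the use case raised in point 3.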
