# Titanic Survival Analysis and Prediction

This is a comprehensive Python solution for the Titanic competition.

# Introduction

In this notebook, we use machine learning to predict which passengers survived the sinking of the Titanic. We combine data cleaning, feature engineering, and a random forest classifier, with the aim of building a reasonably accurate predictive model and learning which passenger attributes relate to survival.

# Required libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder

# 1. Load the Data

The competition data files (train.csv and test.csv) are read from the Kaggle input directory `/kaggle/input/titanic/`. When running outside Kaggle, download the files and adjust the paths below accordingly.

In [2]:
# Load the data
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

The **train.csv** dataset contains the details of a subset of the passengers on board and, importantly, reveals whether they survived or not, also known as the “ground truth”.

The **test.csv** dataset contains similar information but does not disclose the “ground truth” for each passenger.

# Inspect the data to understand the features available.

In [3]:
print("\nDataset Shape - train_data:", train_data.shape)
print("\nDataset Shape - test_data:", test_data.shape)


Dataset Shape - train_data: (891, 12)

Dataset Shape - test_data: (418, 11)


In [4]:
train_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [5]:
test_data.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


# 2. Preprocess the Data

Handle missing values, encode categorical variables, and engineer new features.

In [6]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [7]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


In [8]:
print("\nMissing Values train_data:")
print(train_data.isnull().sum())


Missing Values train_data:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


In [9]:
print("\nMissing Values test_data:")
print(test_data.isnull().sum())


Missing Values test_data:
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
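
Raw counts are hard to compare across datasets of different sizes, so it can help to express them as fractions. The following is a minimal sketch on a small made-up frame (`toy` and the helper name `missing_fraction` are illustrative, not part of the original notebook); with the real data one would call `missing_fraction(train_data)` before preprocessing.

```python
import numpy as np
import pandas as pd


def missing_fraction(df):
    """Return the share of missing values per column, largest first."""
    return df.isnull().mean().sort_values(ascending=False)


# Hypothetical toy frame standing in for train_data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

fractions = missing_fraction(toy)
print(fractions)
```

Applied to the counts shown above, this works out to roughly 77% missing for 'Cabin' (687 of 891) and 20% for 'Age' (177 of 891) in the training set.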


Handling these missing values, together with the rest of the preprocessing, is a crucial step that can significantly affect model performance.

In [10]:
def preprocess_data(df):
    # Handle missing values
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())

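    # Convert non-numeric columns to numeric.
    # Note: every object column here (Name, Sex, Ticket, Cabin, Embarked)
    # contains non-numeric strings, so each one ends up label-encoded.
    non_numeric_cols = df.select_dtypes(include=['object']).columns
    for col in non_numeric_cols:
        try:
            df[col] = df[col].astype(float)
        except ValueError:
            # If conversion to float fails, use label encoding
            label_encoder = LabelEncoder()
            df[col] = label_encoder.fit_transform(df[col])

    # Encode categorical variables
    # (both columns are already integer-coded by the loop above,
    # so this pass leaves their values unchanged)
    label_encoder = LabelEncoder()
    df['Sex'] = label_encoder.fit_transform(df['Sex'])
    df['Embarked'] = label_encoder.fit_transform(df['Embarked'])

    # One-hot encode the 'Name' column
    # (names were label-encoded above, hence columns Name_0, Name_1, ...)
    df = pd.get_dummies(df, columns=['Name'])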

    # Convert data types
    df['Pclass'] = df['Pclass'].astype(int)
    df['SibSp'] = df['SibSp'].astype(int)
    df['Parch'] = df['Parch'].astype(int)
    df['Fare'] = df['Fare'].astype(float)

    # Create new features
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

    return df

Let's go through the `preprocess_data()` function step by step:

1. **Handle missing values**:
   - `df['Age'] = df['Age'].fillna(df['Age'].median())`: Fills in any missing 'Age' values with the median age from the dataset.
   - `df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])`: Fills in any missing 'Embarked' values with the most common (mode) embarkation point.
   - `df['Fare'] = df['Fare'].fillna(df['Fare'].median())`: Fills in any missing 'Fare' values with the median fare from the dataset.

2. **Convert non-numeric columns to numeric**:
   - `non_numeric_cols = df.select_dtypes(include=['object']).columns`: Identifies any columns with non-numeric (object) data types.
   - `for col in non_numeric_cols:`: Iterates through each non-numeric column.
   - `try: df[col] = df[col].astype(float)`: Attempts to convert the column to a float data type.
   - `except ValueError:`: If the conversion to float fails, it enters this block.
   - `label_encoder = LabelEncoder()`: Creates a label encoder object to convert categorical variables to numerical values.
   - `df[col] = label_encoder.fit_transform(df[col])`: Encodes the non-numeric column using the label encoder.

   This step handles non-numeric columns in two ways:
   1. It first attempts to convert the column to float. This works only when every value in the column is a numeric-like string (e.g., '1.0', '2.5').
   2. If the conversion to float fails, it uses label encoding to assign an integer label to each unique value in the column.

   In this dataset every object column ('Name', 'Sex', 'Ticket', 'Cabin', 'Embarked') contains at least one non-numeric string, so all of them take the label-encoding path, and missing 'Cabin' values are treated as one more category. The resulting integers make the columns numeric, but their ordering is arbitrary and carries no meaning.

3. **Encode categorical variables**:
   - `label_encoder = LabelEncoder()`: Creates a label encoder object to convert categorical variables to numerical values.
   - `df['Sex'] = label_encoder.fit_transform(df['Sex'])`: Encodes the 'Sex' column using the label encoder.
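   - `df['Embarked'] = label_encoder.fit_transform(df['Embarked'])`: Encodes the 'Embarked' column using the label encoder.

   Because step 2 has already label-encoded every object column, 'Sex' and 'Embarked' are integers by this point and this step leaves their values unchanged.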

4. **One-hot encode the 'Name' column**:
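   - `df = pd.get_dummies(df, columns=['Name'])`: Creates one binary column per unique value in the 'Name' column. Because step 2 has already replaced each name with an integer label, and every name in the training data is unique, this produces one column per row (`Name_0` through `Name_890`). Such columns identify individual passengers and carry no information that generalizes to other passengers.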

5. **Convert data types**:
   - `df['Pclass'] = df['Pclass'].astype(int)`: Converts the 'Pclass' column to integer data type.
   - `df['SibSp'] = df['SibSp'].astype(int)`: Converts the 'SibSp' column to integer data type.
   - `df['Parch'] = df['Parch'].astype(int)`: Converts the 'Parch' column to integer data type.
   - `df['Fare'] = df['Fare'].astype(float)`: Converts the 'Fare' column to float data type.

6. **Create new features**:
   - `df['FamilySize'] = df['SibSp'] + df['Parch'] + 1`: Creates a new 'FamilySize' feature by summing the 'SibSp' and 'Parch' columns and adding 1 (for the passenger themselves).
   - `df['IsAlone'] = (df['FamilySize'] == 1).astype(int)`: Creates a new binary 'IsAlone' feature, where 1 indicates the passenger was traveling alone.

Finally, the function returns the preprocessed dataframe.
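
One caveat about steps 2 and 3: `preprocess_data()` is called separately on each dataset, so every `LabelEncoder` is fitted on whatever values that dataset happens to contain, and the same string can receive different integer codes in train and test. A minimal sketch of the problem and of one possible remedy (fitting a single encoder once and reusing it) follows; the toy frames and the `shared_encoder` name are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: the test frame happens to contain no "C"
train_toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test_toy = pd.DataFrame({"Embarked": ["Q", "S", "S"]})

# Fitting separately: codes depend on the values present in each frame
separate_train = LabelEncoder().fit_transform(train_toy["Embarked"])
separate_test = LabelEncoder().fit_transform(test_toy["Embarked"])

# Fitting once on the training values and reusing the same mapping
shared_encoder = LabelEncoder().fit(train_toy["Embarked"])
shared_train = shared_encoder.transform(train_toy["Embarked"])
shared_test = shared_encoder.transform(test_toy["Embarked"])

print("separate:", list(separate_train), list(separate_test))
print("shared:  ", list(shared_train), list(shared_test))
```

With separate fitting, "S" is coded differently in the two frames; with the shared encoder it keeps the same code. Note that `transform` raises a `ValueError` for a value the encoder never saw during `fit`, so the encoder would need to be fitted on all categories expected at prediction time.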

* Meaning of the 'IsAlone' column: The 'IsAlone' column is a new feature that we created in the preprocess_data function. It's derived from the 'FamilySize' feature.
* The 'SibSp' column represents the number of siblings/spouses the passenger had aboard the Titanic, and the 'Parch' column represents the number of parents/children the passenger had aboard.
* By adding 1 to the sum of 'SibSp' and 'Parch', we get the total family size, including the passenger themselves.
* The 'IsAlone' feature is a binary (0 or 1) indicator of whether the passenger was traveling alone (1) or not (0).
* The rationale behind including this feature is that traveling alone may have been a factor in a passenger's likelihood of survival. Passengers who were traveling with family members may have had a better chance of being assigned to a lifeboat, for example.
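
As a small worked example of these two features, the sketch below applies the same formulas to three `SibSp`/`Parch` pairs taken from the rows displayed earlier (Braund, Heikkinen, and Hirvonen); the `toy` frame itself is only illustrative.

```python
import pandas as pd

# SibSp/Parch values for three passengers shown in the head() outputs
toy = pd.DataFrame({
    "SibSp": [1, 0, 1],
    "Parch": [0, 0, 1],
})

# Same formulas as in preprocess_data()
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

print(toy)
```

Only the second passenger, with no siblings, spouse, parents, or children aboard, gets `FamilySize == 1` and therefore `IsAlone == 1`.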

* The 'Cabin' column contains many missing values (687 of 891 rows in the training data), and simply filling them in with a default value may not be the best approach. The fact that cabin information is missing can itself be a useful signal for the model.
* `preprocess_data()` does not impute 'Cabin'. The column is label-encoded in step 2 along with the other object columns, so the missing entries become a category of their own and the missing-data pattern remains visible to the model.
* The LabelEncoder from scikit-learn is used to convert the string categorical variables into numerical values that the Random Forest Classifier can work with.
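
If the goal is for the model to use the missing-data pattern in 'Cabin', one option is to make that pattern explicit as its own feature. A minimal sketch, assuming a hypothetical `HasCabin` column name and a toy frame; it is an alternative to consider, not what `preprocess_data()` does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same kind of gaps as the real 'Cabin' column
toy = pd.DataFrame({"Cabin": [np.nan, "C85", np.nan, "C123"]})

# 1 where a cabin is recorded, 0 where it is missing
toy["HasCabin"] = toy["Cabin"].notnull().astype(int)

print(toy)
```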

In [11]:
train_data = preprocess_data(train_data)
test_data = preprocess_data(test_data)

Both datasets are passed through `preprocess_data()`, which handles missing values, encodes categorical variables such as Sex and Embarked, and creates the new features FamilySize and IsAlone.

In [12]:
print("\nMissing Values train_data:")
print(train_data.isnull().sum())


Missing Values train_data:
PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
              ..
Name_888       0
Name_889       0
Name_890       0
FamilySize     0
IsAlone        0
Length: 904, dtype: int64


In [13]:
print(test_data.isnull().sum())

PassengerId    0
Pclass         0
Sex            0
Age            0
SibSp          0
              ..
Name_415       0
Name_416       0
Name_417       0
FamilySize     0
IsAlone        0
Length: 430, dtype: int64
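
The two outputs above report 904 columns for the training frame and 430 for the test frame. Besides the extra 'Survived' column, the difference comes from the one-hot 'Name' columns, whose number depends on how many rows each dataset has. A model fitted on one layout cannot predict on the other, so the columns would need to be aligned first. A minimal sketch with toy frames, assuming `reindex` with `fill_value=0` is an acceptable way to do that:

```python
import pandas as pd

# Toy frames: 'Name' is already integer-coded, as in the notebook
train_toy = pd.get_dummies(
    pd.DataFrame({"Pclass": [3, 1, 3], "Name": [0, 1, 2]}), columns=["Name"]
)
test_toy = pd.get_dummies(
    pd.DataFrame({"Pclass": [2, 3], "Name": [0, 1]}), columns=["Name"]
)

# train has Name_0..Name_2, test only Name_0..Name_1
print(train_toy.shape, test_toy.shape)

# Give the test frame the training layout; columns it lacks are filled with 0
test_aligned = test_toy.reindex(columns=train_toy.columns, fill_value=0)
print(list(test_aligned.columns))
```

Alignment only fixes the shapes. Per-passenger name columns still carry nothing that transfers to unseen passengers, so dropping 'Name' before modeling is likely the simpler option.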



# 3. Split the Data

Divide the training data into training and validation sets.

In [14]:
# 3. Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    train_data.drop('Survived', axis=1),
    train_data['Survived'],
    test_size=0.2,
    random_state=42
)

We split the training data into training and validation sets using `train_test_split` from scikit-learn, holding out 20% of the rows (`test_size=0.2`) for validation. This allows us to evaluate the model's performance during development.
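
A plain random split can leave the two sets with slightly different survival rates. If that matters, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. A minimal sketch on synthetic labels (the `X_toy` and `y_toy` data are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 synthetic rows with a 60/40 class balance
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([0] * 60 + [1] * 40)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both parts keep the 40% share of the positive class
print("train positive rate:", y_tr.mean())
print("validation positive rate:", y_va.mean())
```

In the notebook's own call, this would correspond to adding `stratify=train_data['Survived']`.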

# 4. Train a Machine Learning Model

* Choose an appropriate model, such as a decision tree, random forest, or logistic regression.
* Fit the model to the training data.

In [15]:
# 4. Train a machine learning model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

We create a Random Forest Classifier with 100 trees (`n_estimators=100`) and fit it to the training data.

# 5. Evaluate the Model

* Use the validation set to evaluate the model's accuracy.
* Iterate on the preprocessing and model selection as needed.

In [16]:
# 5. Evaluate the model
y_pred = model.predict(X_val)
print("Validation Accuracy:", accuracy_score(y_val, y_pred))
print("Classification Report:\n", classification_report(y_val, y_pred))

Validation Accuracy: 0.8100558659217877
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.89      0.85       105
           1       0.81      0.70      0.75        74

    accuracy                           0.81       179
   macro avg       0.81      0.79      0.80       179
weighted avg       0.81      0.81      0.81       179
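
A single 80/20 split yields one accuracy figure computed on 179 rows, which can shift with a different `random_state`. When iterating on preprocessing or model choice, k-fold cross-validation gives a steadier estimate. A minimal sketch using `cross_val_score` on synthetic data from `make_classification` (the data and the choice of 5 folds are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed features and the 'Survived' target
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# One accuracy score per fold; the mean summarizes them
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

With the notebook's data, the same call would take `train_data.drop('Survived', axis=1)` and `train_data['Survived']` in place of the toy arrays.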

