# Titanic Dataset

The goal is to predict whether a passenger survived the Titanic using binary classification

### Step 1: Data Exploration

In [17]:
# Import the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the training data
df_train = pd.read_csv('assets/train.csv')

# Load the test data
df_test = pd.read_csv('assets/test.csv')

# Display the first few rows of the dataframe
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [18]:
# drop the ticket column as it's not likely to contain any useful information
df_train.drop('Ticket', axis=1, inplace=True)
df_test.drop('Ticket', axis=1, inplace=True)

### We have the training data loaded. The train.csv file contains the details of a subset of the passengers on board the Titanic, along with a Survived column indicating whether each passenger survived the disaster.

### Now, let's explore the data and understand each column:

**PassengerId**: Unique ID assigned to each passenger.

**Survived**: This is our target variable which we're trying to predict. It's a binary variable where '1' indicates that the passenger survived and '0' indicates that they did not.

**Pclass**: This is the ticket class and can be seen as a proxy for socio-economic status. It's a categorical variable with '1' for 1st class, '2' for 2nd class, and '3' for 3rd class.

**Name**: The name of the passenger.

**Sex**: The gender of the passenger, either 'male' or 'female'.

**Age**: The age of the passenger.

**SibSp**: This indicates the number of siblings or spouses the passenger had aboard the Titanic.

**Parch**: This indicates the number of parents or children the passenger had aboard the Titanic.

**Ticket**: The ticket number of the passenger.

**Fare**: How much the passenger paid for the ticket.

**Cabin**: The cabin number where the passenger stayed.

**Embarked**: The port where the passenger embarked. It's a categorical variable with 'C' for Cherbourg, 'Q' for Queenstown, and 'S' for Southampton.

In [19]:
# Quick overview of the dataset
df_train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object 
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB


### This output gives us some important information about the structure of the dataset:

**Missing Values**: The Age, Cabin, and Embarked columns have missing values. For Age, we might want to fill in missing values using an approach that makes sense given the distribution and nature of the data, like using median age. Cabin has a large number of missing values (687 out of 891), and might not add much value to our predictions. It could be dropped or engineered into a simpler feature like "Cabin Known: Yes/No". For Embarked, as there are only two missing values, we could fill in with the most common embarkation point.

**Data Types**: There are three kinds of data types in our dataset: integers (int64), floats (float64), and objects. Object usually means that the data type is string, but it might also be used for other data types that pandas doesn't recognize. We will need to encode the categorical variables (Sex, Embarked) into numerical ones, as machine learning algorithms work with numerical data. The Name and Ticket columns may require special treatment or might be dropped, depending on whether we think they'll be useful.

**Number of Entries**: All columns have 891 entries except for the ones with missing values. This consistency is important, otherwise we would need to investigate why there are mismatches.

**Potential Feature Engineering**: SibSp and Parch represent the number of siblings/spouses and parents/children aboard. We might create a new feature called FamilySize by adding these two together + 1 (the passenger themself).

With this information, we can move on to the next steps of data cleaning and feature engineering.

In the data cleaning and feature engineering steps, we want to make sure our data is in a form that's amenable to the kind of analysis we want to perform. This involves handling missing data, dealing with outliers, encoding categorical variables, and potentially creating new features that might give us more predictive power.

Here are some steps we can follow:

### 1. **Dealing with Missing Data**

a. **Age**: This column has 177 missing values. We could fill the missing values with the median age. The median is often a better choice than the mean for data with outliers, which age might have (very young and very old passengers).

b. **Cabin**: This column has a lot of missing values (687 out of 891). Since there's so much missing, it might not be very useful in its current form. We could transform this column into a binary one: known (1) or unknown (0).

c. **Embarked**: There are only 2 missing values. We could fill in these values with the most common embarkation point.

### 2. **Encoding Categorical Variables**

a. **Sex**: This is a binary categorical variable. It could be encoded as 0 (male) and 1 (female).

b. **Embarked**: This is a multi-class categorical variable. One-hot encoding can be used here to convert each category value into a new column and assigns a 1 or 0 (True/False) value to the column.

### 3. **Feature Engineering**

a. **Name**: We can extract titles (Mr, Mrs, Miss, etc) from the name, which might give us additional information about the passenger's social status that could be informative.

b. **FamilySize**: We can create a new feature called FamilySize that is the sum of SibSp and Parch plus one (the passenger themself).

c. **IsAlone**: A binary feature indicating if the passenger is alone. It could be derived from FamilySize.

d. **FareBin and AgeBin**: It can be useful to transform continuous variables into categorical ones. We can create categorical bins for Fare and Age.

e. **Ticket**: The ticket column might be dropped, as it is unlikely to contain useful information. The exception would be if there are shared ticket numbers among passengers (which might indicate groups travelling together), but this would require additional exploration.

Remember, these are general suggestions and might need to be adapted based on the specifics of the dataset and the results of exploratory data analysis. All changes should be motivated by a solid understanding of the data and the problem we're trying to solve.

In [20]:
# Calculate the median of the 'Age' column
median_age = df_train['Age'].median()
median_age = df_test['Age'].median()

# Fill the missing values in the 'Age' column with the median value
df_train['Age'].fillna(median_age, inplace=True)
df_test['Age'].fillna(median_age, inplace=True)

# Transform 'Cabin' column to 'Known' (1) if a cabin is assigned, and 'Unknown' (0) otherwise
df_train['Cabin'] = df_train['Cabin'].apply(lambda x: 1 if pd.notnull(x) else 0)
df_test['Cabin'] = df_test['Cabin'].apply(lambda x: 1 if pd.notnull(x) else 0)

# Fill the missing values in 'Embarked' column with the most common port
most_common_port = df_train['Embarked'].mode()[0]
most_common_port = df_test['Embarked'].mode()[0]
df_train['Embarked'].fillna(most_common_port, inplace=True)
df_test['Embarked'].fillna(most_common_port, inplace=True)

# Convert Embarked to new columns Embarked_C, Embarked_Q and Embarked_S and assign a 1 or 0 to each column
df_train = pd.get_dummies(df_train, columns=['Embarked'])
df_test = pd.get_dummies(df_test, columns=['Embarked'])

# Encode 'Sex' column to 0-Male and 1-Female
df_train['Sex'] = df_train['Sex'].apply(lambda x: 1 if 'female' in x else 0)
df_test['Sex'] = df_test['Sex'].apply(lambda x: 1 if 'female' in x else 0)

# create a new feature called FamilySize that is the sum of SibSp and Parch plus one (the passenger themself).
df_train['FamilySize'] = df_train['SibSp'] + df_train['Parch'] + 1
df_test['FamilySize'] = df_test['SibSp'] + df_test['Parch'] + 1

# create a new feature called IsAlone that is 1 if the passenger is alone and 0 otherwise (derived from FamilySize)
df_train['IsAlone'] = df_train['FamilySize'].apply(lambda x: 1 if x == 1 else 0)
df_test['IsAlone'] = df_test['FamilySize'].apply(lambda x: 1 if x == 1 else 0)

# create a new feature called Title that extracts titles from names
df_train['Title'] = df_train['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
df_test['Title'] = df_test['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())

# create a new feature called FareBin that bins the fare into 4 equally sized bins
df_train['FareBin'] = pd.qcut(df_train['Fare'], 4)
df_test['FareBin'] = pd.qcut(df_test['Fare'], 4)

# create a new feature called FareBin_Code that maps the FareBin to a numerical value
df_train['FareBin_Code'] = df_train['FareBin'].astype('category').cat.codes
df_test['FareBin_Code'] = df_test['FareBin'].astype('category').cat.codes

# create a new feature called AgeBin that bins the age into 5 equally sized bins
df_train['AgeBin'] = pd.cut(df_train['Age'].astype(int), 5)
df_test['AgeBin'] = pd.cut(df_test['Age'].astype(int), 5)

# create a new feature called AgeBin_Code that maps the AgeBin to a numerical value
df_train['AgeBin_Code'] = df_train['AgeBin'].astype('category').cat.codes
df_test['AgeBin_Code'] = df_test['AgeBin'].astype('category').cat.codes

df_test['Fare'].fillna(df_test['Fare'].mean(), inplace=True)


In [21]:
# df.head(10).to_csv('assets/processed_data.csv', index=False)
df_train.head(10)
# df_test.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked_C,Embarked_Q,Embarked_S,FamilySize,IsAlone,Title,FareBin,FareBin_Code,AgeBin,AgeBin_Code
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,7.25,0,0,0,1,2,0,Mr,"(-0.001, 7.91]",0,"(16.0, 32.0]",1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,71.2833,1,1,0,0,2,0,Mrs,"(31.0, 512.329]",3,"(32.0, 48.0]",2
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,7.925,0,0,0,1,1,1,Miss,"(7.91, 14.454]",1,"(16.0, 32.0]",1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,53.1,1,0,0,1,2,0,Mrs,"(31.0, 512.329]",3,"(32.0, 48.0]",2
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,8.05,0,0,0,1,1,1,Mr,"(7.91, 14.454]",1,"(32.0, 48.0]",2
5,6,0,3,"Moran, Mr. James",0,27.0,0,0,8.4583,0,0,1,0,1,1,Mr,"(7.91, 14.454]",1,"(16.0, 32.0]",1
6,7,0,1,"McCarthy, Mr. Timothy J",0,54.0,0,0,51.8625,1,0,0,1,1,1,Mr,"(31.0, 512.329]",3,"(48.0, 64.0]",3
7,8,0,3,"Palsson, Master. Gosta Leonard",0,2.0,3,1,21.075,0,0,0,1,5,0,Master,"(14.454, 31.0]",2,"(-0.08, 16.0]",0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",1,27.0,0,2,11.1333,0,0,0,1,3,0,Mrs,"(7.91, 14.454]",1,"(16.0, 32.0]",1
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",1,14.0,1,0,30.0708,0,1,0,0,2,0,Mrs,"(14.454, 31.0]",2,"(-0.08, 16.0]",0


In [22]:
# Find the correlation between different numerical columns
df_train.corr()

  df_train.corr()


Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked_C,Embarked_Q,Embarked_S,FamilySize,IsAlone,FareBin_Code,AgeBin_Code
PassengerId,1.0,-0.005007,-0.035144,-0.042939,0.034759,-0.057527,-0.001652,0.012658,0.019919,-0.001205,-0.033606,0.022204,-0.040143,0.057462,-0.022998,0.026528
Survived,-0.005007,1.0,-0.338481,0.543351,-0.061956,-0.035322,0.081629,0.257307,0.316912,0.16824,0.00365,-0.149683,0.016639,-0.203367,0.299357,-0.044492
Pclass,-0.035144,-0.338481,1.0,-0.1319,-0.344489,0.083081,0.018443,-0.5495,-0.725541,-0.243292,0.221009,0.074053,0.065997,0.135207,-0.634271,-0.358005
Sex,-0.042939,0.543351,-0.1319,1.0,-0.079306,0.114631,0.245489,0.182333,0.140391,0.082853,0.074115,-0.119224,0.200988,-0.303646,0.243613,-0.071125
Age,0.034759,-0.061956,-0.344489,-0.079306,1.0,-0.233396,-0.168329,0.099571,0.244228,0.029167,-0.041675,0.000674,-0.243612,0.166664,0.089421,0.942625
SibSp,-0.057527,-0.035322,0.083081,0.114631,-0.233396,1.0,0.414838,0.159651,-0.04046,-0.059528,-0.026354,0.068734,0.890712,-0.584471,0.393025,-0.218846
Parch,-0.001652,0.081629,0.018443,0.245489,-0.168329,0.414838,1.0,0.216225,0.036987,-0.011069,-0.081228,0.060814,0.783111,-0.583398,0.393881,-0.134014
Fare,0.012658,0.257307,-0.5495,0.182333,0.099571,0.159651,0.216225,1.0,0.482075,0.269335,-0.117216,-0.162184,0.217138,-0.271832,0.579345,0.124322
Cabin,0.019919,0.316912,-0.725541,0.140391,0.244228,-0.04046,0.036987,0.482075,1.0,0.208528,-0.129572,-0.101139,-0.009175,-0.158029,0.500936,0.260538
Embarked_C,-0.001205,0.16824,-0.243292,0.082853,0.029167,-0.059528,-0.011069,0.269335,0.208528,1.0,-0.148258,-0.782742,-0.046215,-0.095298,0.186073,0.0302


Let's interpret the correlation matrix with focus on the 'Survived' variable:

1. **Sex**: This variable is highly positively correlated with Survived (0.543351), indicating that women were more likely to survive than men.
2. **Pclass**: This variable is negatively correlated with Survived (-0.338481), suggesting that people in higher classes (lower number) had a better chance of survival.
3. **Fare and FareBin_Code**: These variables have positive correlation with Survived (0.257307 and 0.299357 respectively), suggesting that passengers who paid more were more likely to survive. This makes sense as the fare is also linked with the class.
4. **Cabin**: This variable is positively correlated with Survived (0.316912), which suggests that those who had a cabin were more likely to survive. This might be related to the class of the passenger as well.
5. **Embarked_C**: This variable is positively correlated with Survived (0.168240), suggesting that passengers who embarked at Cherbourg (C) had a higher survival rate.
6. **Embarked_S**: This variable is negatively correlated with Survived (-0.149683), suggesting that passengers who embarked at Southampton (S) had a lower survival rate.
7. **Age and AgeBin_Code**: Age is slightly negatively correlated with survival (-0.064910 and -0.044492 respectively), implying that younger passengers were slightly more likely to survive.
8. **IsAlone**: This variable is negatively correlated with Survived (-0.203367), suggesting that those who travelled alone were less likely to survive.

The other variables ('PassengerId', 'SibSp', 'Parch', 'Embarked_Q', 'FamilySize') show a weak correlation with Survived and may not be as significant.

Please remember, while these interpretations give an idea about the relationships, correlation does not imply causation. The exact causal relationships can be more complex.

In [23]:
# let's keep only columns that have a correlation of at least 0.1 or -0.1 with the 'Survived' column
df_train = df_train[['PassengerId', 'Survived', 'Pclass', 'Sex', 'Fare', 'Cabin', 'Embarked_C', 'Embarked_S', 'IsAlone', 'FareBin_Code']]
df_test = df_test[['PassengerId', 'Pclass', 'Sex', 'Fare', 'Cabin', 'Embarked_C', 'Embarked_S', 'IsAlone', 'FareBin_Code']]
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Fare,Cabin,Embarked_C,Embarked_S,IsAlone,FareBin_Code
0,1,0,3,0,7.25,0,0,1,0,0
1,2,1,1,1,71.2833,1,1,0,0,3
2,3,1,3,1,7.925,0,0,1,1,1
3,4,1,1,1,53.1,1,0,1,0,3
4,5,0,3,0,8.05,0,0,1,1,1


To measure the accuracy of your predictions, you would need a ground truth - the actual outcomes that our model was trying to predict.

In a normal train/test split situation, you would have the ground truth for your test set, so you could compare our model's predictions directly to the real outcomes. Common metrics for this comparison include accuracy, precision, recall, F1 score, ROC AUC, etc.

However, in the case of the Titanic dataset from Kaggle, the ground truth (i.e., whether each passenger in the test set survived or not) is not provided. The test set predictions are meant to be submitted to Kaggle, which then computes the accuracy of your model and provides you with a score.

If we want to get a sense of our model's accuracy before submitting to Kaggle, we could split the original training data into a smaller training set and a validation set. We can train our model on this smaller training set and then test it on the validation set.

In [24]:
from sklearn.model_selection import train_test_split

# Separate the features (X) from the target variable (Y)
X = df_train.drop('Survived', axis=1)
Y = df_train['Survived']

# Split the data into a training set and a validation set
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=42)

print('Training data shape: ', X_train.shape)
print('Validation data shape: ', X_val.shape)


Training data shape:  (712, 9)
Validation data shape:  (179, 9)


In [28]:
# let's test a few more models to see which one gives the best score and then submit to kaggle
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

# Create a dictionary of the models
model_dict = {'Random Forest': RandomForestClassifier(),
              'Logistic Regression': LogisticRegression(),
              'Support Vector Machine': SVC(),
              'Gradient Boosting': GradientBoostingClassifier()}

# Initialize best score and best model name
best_score = 0
best_model_name = ''

# Iterate over each model in the dictionary
for model_name, model in model_dict.items():
    model.fit(X_train, Y_train)
    model_score = model.score(X_val, Y_val)
    print(f'{model_name} Score: ', model_score)

    # Update the best_score and best_model_name
    if model_score > best_score:
        best_score = model_score
        best_model_name = model_name

print(f'\nBest Model is {best_model_name} with score: {best_score}')



Random Forest Score:  0.7932960893854749
Logistic Regression Score:  0.776536312849162
Support Vector Machine Score:  0.5977653631284916
Gradient Boosting Score:  0.8156424581005587

Best Model is Gradient Boosting with score: 0.8156424581005587


In [29]:
from sklearn.ensemble import GradientBoostingClassifier

# Instantiate the GradientBoostingClassifier
gbc = GradientBoostingClassifier()

# Fit the model to the training data
gbc.fit(X_train, Y_train)

# Use the trained model to make predictions on the test data
predictions = gbc.predict(df_test)

# Prepare the predictions for Kaggle submission
submission = pd.DataFrame({
    'PassengerId': df_test['PassengerId'],
    'Survived': predictions
})

# Save the predictions to a CSV file
submission.to_csv('submission.csv', index=False)


So far this is the best score on Kaggle. We can tune the hyperparameters of the Gradient Boosting Classifer and use Grid Search to find the optimal parameters.

In [30]:
from sklearn.model_selection import GridSearchCV

# Define the parameters we want to test
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [1, 2, 3, 4, 5],
    'learning_rate': [0.01, 0.1, 1],
}

# Create a GradientBoostingClassifier object
gbc = GradientBoostingClassifier()

# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=gbc, param_grid=param_grid, cv=3, scoring='accuracy')

# Fit the GridSearchCV object to the data
grid_search.fit(X_train, Y_train)

# Get the optimal parameters
best_params = grid_search.best_params_

print("Best parameters: ", best_params)


Best parameters:  {'learning_rate': 0.01, 'max_depth': 2, 'n_estimators': 400}


In [31]:
# Create a new GradientBoostingClassifier with the best parameters
gbc_best = GradientBoostingClassifier(n_estimators=400, max_depth=2, learning_rate=0.01)

# Fit the model to the training data
gbc_best.fit(X_train, Y_train)

# Get the score on the validation data
print("Validation accuracy: ", gbc_best.score(X_val, Y_val))

# Predict the outcomes on the actual test data
predictions = gbc_best.predict(df_test)

# Create a new DataFrame for the submission
submission = pd.DataFrame({'PassengerId':df_test['PassengerId'], 'Survived':predictions})

# Save the submission to a csv file
submission.to_csv('submission.csv', index=False)


Validation accuracy:  0.7932960893854749
