# Titanic Death/Survivor Predictor Challenge

Hi there! This notebook is a step-by-step walkthrough of the thought process I followed during this challenge. It covers the techniques I studied and applied, not only to improve the accuracy of the models, but also to learn how these techniques work and how to use them.

The original challenge, together with the data and a description of each feature, can be found at https://www.kaggle.com/competitions/titanic

## Data Loading and libraries importing

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

In [2]:
train_data = pd.read_csv('training_data/train.csv')

## Data Understanding

At first glance, the dataset has 891 rows with 11 features and 1 binary target variable (Survived). Of these 11 features, only Age, Embarked and Cabin have missing values.
Additionally, some features may be redundant and/or offer no insight after examination; Ticket, for example, has 681 distinct values across the 891 rows. It's also worth mentioning that about 62% of the passengers died while only about 38% survived, which indicates an imbalance in the dataset that might have to be dealt with while preparing the data for the models.

In [13]:
for column in train_data.columns:
    print(column, train_data[column].isna().sum())

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2


In [11]:
train_data.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [14]:
train_data.shape

(891, 12)

In [15]:
train_data['Survived'].value_counts()

Survived
0    549
1    342
Name: count, dtype: int64
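
As a quick check of the imbalance figures, the class shares can be computed directly from the counts printed above. This is a minimal sketch: it rebuilds a stand-in Series from the 549/342 counts instead of reading train.csv, and the names `survived` and `proportions` are only illustrative.

```python
import pandas as pd

# Stand-in for train_data['Survived'], rebuilt from the counts shown above
survived = pd.Series([0] * 549 + [1] * 342, name='Survived')

# normalize=True returns the share of each class instead of the raw count
proportions = survived.value_counts(normalize=True).round(3)
print(proportions)
```

On the real DataFrame the same call would be `train_data['Survived'].value_counts(normalize=True)`.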

In [3]:
train_data['Ticket'].value_counts()

Ticket
347082      7
CA. 2343    7
1601        7
3101295     6
CA 2144     6
           ..
9234        1
19988       1
2693        1
PC 17612    1
370376      1
Name: count, Length: 681, dtype: int64

## Data cleaning and preprocessing

### Data Cleaning

First, PassengerId, Name, Ticket and Cabin are dropped, and the Sex variable is encoded as 0 (male) and 1 (female). Embarked has 2 missing values, so its non-NA values are first encoded as 0 (S), 1 (C) and 2 (any other port), and the NaNs are then replaced by the median of the encoded column:

In [3]:
train_data.drop(['PassengerId','Name','Ticket','Cabin'], axis=1, inplace=True)
# Sex has no NaN values here; if it did, this lambda would silently map them to 1.
train_data['Sex'] = train_data['Sex'].apply(lambda x: 0 if x=='male' else 1)

In [4]:
#Since Embarked has 2 missing values, it has to be dealt with first:
train_data.loc[train_data['Embarked'].notna(), 'Embarked'] = train_data.loc[train_data['Embarked'].notna(), 'Embarked'].apply(lambda x: 0 if x == 'S' else 1 if x == 'C' else 2)

In [6]:
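# Assign the result back instead of using inplace=True on a column selection, which pandas warns about
train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].median())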
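
The encode-then-fill steps above can also be sketched with `Series.map`, which leaves NaN untouched instead of sending it to the fallback branch of a lambda. This is only an illustration on a toy Series: the values are made up, and treating 'Q' as the third port is an assumption based on the Kaggle data description (the notebook's lambda simply maps everything that is not S or C to 2).

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data['Embarked'] with one missing value
embarked = pd.Series(['S', 'C', 'Q', np.nan, 'S', 'S'])

# map() with a dict encodes known ports and keeps NaN as NaN
codes = embarked.map({'S': 0, 'C': 1, 'Q': 2})

# Fill the remaining NaN with the median of the encoded values
filled = codes.fillna(codes.median()).astype(int)
print(filled.tolist())
```

Because the missing values survive the encoding step, there is no need to select the non-NA rows first.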

#### Dealing with the Age variable problem

Age has a total of 177 missing values. I want to try 2 different approaches and compare their RMSE to establish which one is the most accurate:

##### Filling the NA using a model

Linear Regression and Random Forest Regressor were used to predict the missing Age values from Sex, Pclass and Fare. (TO-DO: include a correlation study to decide which features the model should use to predict Age.) The data wasn't scaled in this case because scaling didn't change the RMSE.

On the held-out 20% of passengers with a known age, Linear Regression has the lowest error (RMSE of about 12.68 versus about 14.01 for the Random Forest). (Hyper-parameter tuning still needs to be included.)

In [None]:
###TO-DO: INCLUDE A CORRELATION STUDY FOR WHAT FEATURES WILL BE USED BY THE MODEL TO CALCULATE THE AGE

In [43]:
X = train_data[train_data['Age'].notna()][['Sex', 'Pclass', 'Fare' ,'Age']]
y = X['Age'].astype(np.int8)
X = X.drop('Age', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [48]:
linear_regression_age = LinearRegression().fit(X_train, y_train)
random_forest_age = RandomForestRegressor().fit(X_train, y_train)

In [50]:
LR_predictions = linear_regression_age.predict(X_test)
RF_predictions = random_forest_age.predict(X_test)

In [46]:
rmse_lr = np.sqrt(mean_squared_error(y_test, LR_predictions))
rmse_rf = np.sqrt(mean_squared_error(y_test, RF_predictions))

In [47]:
rmse_lr, rmse_rf

(12.682856570364436, 14.010247183953606)

##### Filling the NA using mean

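A NumPy array of the same length as y_test, with every value set to the mean of y_train (the known training ages), is used as the prediction to measure the RMSE of mean imputation. Its RMSE (about 13.71) is higher than that of the Linear Regression model (about 12.68), so the Linear Regression predictions are used to fill the missing ages.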

In [27]:
# Baseline: predict the mean training age for every passenger in the test split
mean_predicted = y_train.mean()
np.sqrt(mean_squared_error(y_test, np.full(y_test.shape[0], mean_predicted)))

13.708608790726196
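
To make the baseline concrete, here is a small worked example of the same computation, RMSE = sqrt(mean((y - y_hat)^2)), on made-up ages. The arrays are hypothetical and chosen so the arithmetic is easy to follow; they are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up ages, for illustration only
y_train_toy = np.array([20, 30, 40])   # mean is 30
y_test_toy = np.array([22, 38])        # errors against 30 are -8 and +8

# Constant prediction: the training mean repeated for every test row
baseline = np.full(y_test_toy.shape[0], y_train_toy.mean())

# Squared errors are 64 and 64, their mean is 64, and the square root is 8
rmse_baseline = np.sqrt(mean_squared_error(y_test_toy, baseline))
print(rmse_baseline)
```

Any model that is worth using for imputation should score below this constant-prediction RMSE on the same test split.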

##### Using the LR predicted values to fill the missing ages

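The missing ages are filled with the Linear Regression predictions:

  age_na_train['Age'] = linear_regression_age.predict(X).astype(int)

pandas raised a SettingWithCopyWarning for this assignment ("A value is trying to be set on a copy of a slice from a DataFrame"), because age_na_train is a slice of the original DataFrame. Using .loc[row_indexer, col_indexer] = value, or taking an explicit copy of the slice before assigning, avoids it. See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy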
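
A minimal sketch of how this fill could be written without triggering the warning is shown below. It is not the notebook's original cell: the toy DataFrame and the names `toy`, `age_model` and `clean_toy` are hypothetical, and only the column names and the LinearRegression-on-Sex/Pclass/Fare idea come from the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the cleaned training data (values are made up)
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0, 0, 1],
    'Sex':      [0, 1, 1, 0, 1, 0, 0, 1],
    'Pclass':   [3, 1, 2, 3, 1, 2, 3, 2],
    'Fare':     [7.25, 71.28, 13.0, 8.05, 53.1, 10.5, 7.9, 26.0],
    'Age':      [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0],
})
features = ['Sex', 'Pclass', 'Fare']
known = toy['Age'].notna()  # boolean mask of rows with a known age

# Fit the age model only on rows where Age is known
age_model = LinearRegression().fit(toy.loc[known, features], toy.loc[known, 'Age'])

# Work on an explicit copy and assign through .loc, so no slice is modified
clean_toy = toy.copy()
clean_toy.loc[~known, 'Age'] = age_model.predict(clean_toy.loc[~known, features]).astype(int)
print(clean_toy['Age'].isna().sum())
```

Filling in place through a boolean mask keeps every column and the original row order, so no concat or join is needed afterwards.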


### Addressing the imbalance of Survivors/Deceased

In [14]:
y = clean_train['Survived']
X = clean_train.drop('Survived', axis=1)
oversample = RandomOverSampler(sampling_strategy='minority')
# fit and apply the transform
X_over, y_over = oversample.fit_resample(X, y)
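
One thing worth keeping in mind with random oversampling: when it is applied before train_test_split, copies of the same passenger can land in both the training and the test split, which may make the test metrics look better than they are. The sketch below shows the alternative order (split first, then oversample only the training part). It deliberately uses `sklearn.utils.resample` on toy arrays rather than the notebook's RandomOverSampler, and all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy data: 16 majority-class rows and 8 minority-class rows, every row unique
X_toy = np.arange(48).reshape(24, 2)
y_toy = np.array([0] * 16 + [1] * 8)

# Split first, so the test rows are never seen by the oversampling step
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42)

# Draw minority rows with replacement until both classes have the same size
minority_rows = X_tr[y_tr == 1]
n_missing = int((y_tr == 0).sum() - (y_tr == 1).sum())
extra_rows = resample(minority_rows, replace=True, n_samples=n_missing, random_state=42)

X_bal = np.vstack([X_tr, extra_rows])
y_bal = np.concatenate([y_tr, np.ones(n_missing, dtype=int)])
print((y_bal == 0).sum(), (y_bal == 1).sum())
```

With this order the test split keeps the original class ratio, so recall and precision on it reflect the distribution the model will actually face.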

### Scaling the data

In [15]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_over)
scaled_features = scaler.transform(X_over)
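
MinMaxScaler rescales each feature with the minimum and maximum seen during fit, x' = (x - min) / (max - min), and reuses those same values in every later transform. The toy example below (made-up numbers, illustrative names) shows both effects: the fitted data lands in [0, 1], while new data outside the fitted range does not.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One made-up feature column; fit learns min=0 and max=20
fit_values = np.array([[0.0], [10.0], [20.0]])
new_values = np.array([[5.0], [30.0]])

toy_scaler = MinMaxScaler().fit(fit_values)

scaled_fit = toy_scaler.transform(fit_values)   # (x - 0) / 20
scaled_new = toy_scaler.transform(new_values)   # same min and max are reused
print(scaled_fit.ravel(), scaled_new.ravel())
```

This is why the scaler fitted here is the one applied to the test data later on, rather than a new scaler fitted on the test set.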

## Training the binary classification models

In [16]:
X_train, X_test, y_train, y_test = train_test_split(scaled_features, y_over, test_size=0.2, random_state=42)
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score

In [45]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter = 1000)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print('Accuracy:',accuracy_score(y_test, y_pred)*100,'%')
print('Recall:', recall_score(y_test, y_pred)*100,'%')
print('Precision:', precision_score(y_test, y_pred)*100,'%')
print('F1 Score:', f1_score(y_test, y_pred)*100,'%')

Accuracy: 77.61904761904762 %
Recall: 72.11538461538461 %
Precision: 80.64516129032258 %
F1 Score: 76.14213197969544 %
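
To see how the four scores relate, they can be recomputed from confusion-matrix counts. The counts below are an assumption: they are inferred from the percentages printed above (75/104 for recall, 75/93 for precision, 163/210 for accuracy) and are not printed by the notebook itself.

```python
# Inferred counts for the Logistic Regression test split (assumption, see lead-in)
tp, fn, fp, tn = 75, 29, 18, 88

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
recall = tp / (tp + fn)                      # share of actual survivors that were found
precision = tp / (tp + fp)                   # share of predicted survivors that are correct
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy * 100, recall * 100, precision * 100, f1 * 100)
```

Read this way, the model misses 29 of 104 survivors while wrongly flagging 18 passengers as survivors, which is why recall is lower than precision.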


In [47]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print('Accuracy:',accuracy_score(y_test, y_pred)*100,'%')
print('Recall:', recall_score(y_test, y_pred)*100,'%')
print('Precision:', precision_score(y_test, y_pred)*100,'%')
print('F1 Score:', f1_score(y_test, y_pred)*100,'%')

Accuracy: 86.19047619047619 %
Recall: 88.46153846153845 %
Precision: 84.40366972477065 %
F1 Score: 86.3849765258216 %


In [48]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=2) 
knn.fit(X_train, y_train)
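# Assign to y_pred so the metrics below score KNN and not the earlier Random Forest predictions
y_pred = knn.predict(X_test)
print('Accuracy:',accuracy_score(y_test, y_pred)*100,'%')
print('Recall:', recall_score(y_test, y_pred)*100,'%')
print('Precision:', precision_score(y_test, y_pred)*100,'%')
print('F1 Score:', f1_score(y_test, y_pred)*100,'%')

(The output recorded for this cell repeated the Random Forest metrics above, because the KNN predictions had been assigned to y_predict while y_pred still held the Random Forest predictions. The cell has to be re-run to obtain the KNN scores.)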


## Preparing the testing data

In [59]:
test_data = pd.read_csv('Titanic/test.csv')
columns_to_eliminate = ['Name','Ticket','Cabin']
test_data.drop(columns_to_eliminate, axis=1, inplace=True)
test_data['Sex'] = test_data['Sex'].apply(lambda x: 0 if x=='male' else 1)
# Encode Embarked from the test set itself: S=0, C=1, anything else (including NaN) = 2
test_data['Embarked'] = test_data['Embarked'].apply(lambda x: 0 if x == 'S' else 1 if x == 'C' else 2)
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())



In [60]:
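age_na_test = test_data[test_data['Age'].isna()]
age_na_test_passenger_id = age_na_test['PassengerId']
# .copy() so the Age assignment below does not write into a slice of test_data
age_na_test = age_na_test[['Sex', 'Pclass', 'Fare' ,'Age']].copy()
# Only Age still has missing values at this point, so dropna() keeps the rows with a known age
full_age_test = test_data.dropna()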


X = age_na_test.drop('Age', axis=1)
age_na_test['Age'] = linear_regression_age.predict(X).astype(int)
age_na_test['PassengerId'] = age_na_test_passenger_id
prepared_test = pd.concat([age_na_test, full_age_test])
prepared_test['Age'] = prepared_test['Age'].astype(int)


prepared_test = prepared_test.set_index('PassengerId').join(
    test_data[['PassengerId', 'SibSp', 'Parch', 'Embarked']].set_index('PassengerId'), 
    lsuffix='_l', 
    rsuffix='', 
    on='PassengerId')


prepared_test.drop(['SibSp_l', 'Parch_l', 'Embarked_l'], axis = 1, inplace=True)


scaled_test = scaler.transform(prepared_test[clean_train.columns[1:]])


final_dataset = pd.DataFrame()
final_dataset['PassengerId'] = prepared_test.index
final_dataset['Survived'] = rf.predict(scaled_test)


final_dataset.sort_values(by=['PassengerId'], inplace=True)


final_dataset.to_csv('Titanic/results.csv', index=False)