# Titanic - Machine Learning from Disaster (Kaggle)

This was my first attempt at the [Titanic Challenge](https://www.kaggle.com/competitions/titanic/overview) from Kaggle. I developed three models and submitted each one separately.

1. A random forest classifier trained on information about sex, age, class and (normalized) fare. It got a score of 0.7799.
2. A support vector machine classifier trained on class, sex, age, number of siblings/spouses, number of parents/children, total number of relatives, whether the passenger had a cabin, (normalized) fare and port of embarkation. It also got a score of 0.7799.
3. An MLP trained on the same data as the SVM. It got a score of 0.7703.

This notebook contains the code for these three models. They were originally coded separately, so some variable names coincide.

At the end, there is an ensemble of the three models that gets a score of 0.78708 in the challenge.

In [1]:
# Imports

import pandas as pd

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

import matplotlib.pyplot as plt

## Random forest

#### Preprocessing

In [2]:
# Loads the dataset with a variable name used only by this classifier
df_rf = pd.read_csv('train.csv')

In [3]:
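# Calculates the mean of the age column and fills NaN values with it
# (assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about)
mean_age = df_rf['Age'].mean()
df_rf['Age'] = df_rf['Age'].fillna(mean_age)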

In [4]:
# This creates a column with the total amount of relatives for each passenger
df_rf['Relatives'] = df_rf['SibSp'] + df_rf['Parch']

In [5]:
# This creates a column with a "normalized" fare: the ticket price divided by the number of relatives on board (unchanged for passengers with no relatives)
df_rf['Norm_fare'] = df_rf.apply(lambda row: row['Fare'] / row['Relatives'] if row['Relatives'] != 0 else row['Fare'], axis=1)
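
The row-wise `apply` above can also be written as a vectorized expression. The following is a minimal sketch on a made-up three-row frame (the `toy` frame and its values are illustrative assumptions, not rows from `train.csv`). `Series.where` keeps the number of relatives where it is nonzero and substitutes 1 otherwise, so passengers travelling alone keep their original fare.

```python
import pandas as pd

# Toy data for illustration only (not taken from train.csv)
toy = pd.DataFrame({'Fare': [10.0, 30.0, 8.0], 'Relatives': [0, 2, 1]})

# Use the number of relatives as divisor, falling back to 1 when there are none
divisor = toy['Relatives'].where(toy['Relatives'] != 0, 1)
toy['Norm_fare'] = toy['Fare'] / divisor

print(toy['Norm_fare'].tolist())  # [10.0, 15.0, 8.0]
```

On a dataset of this size the speed difference is negligible, so this is mainly a readability choice.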

In [6]:
# This encodes sex values as ones and zeros
le = LabelEncoder()
df_rf['Sex'] = le.fit_transform(df_rf['Sex'])
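
As a small illustration of what the encoder does, the sketch below fits a separate `LabelEncoder` (named `le_demo` here, a hypothetical name chosen to avoid clashing with `le`) on a handful of made-up values. The classes are sorted, so `female` maps to 0 and `male` maps to 1, and a fitted encoder can be reused on new data with `transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values for illustration only
le_demo = LabelEncoder()
codes = le_demo.fit_transform(['male', 'female', 'female', 'male'])

# Classes are stored in sorted order; the code is the index into this array
print(list(le_demo.classes_))  # ['female', 'male']
print(codes.tolist())          # [1, 0, 0, 1]

# A fitted encoder can be applied to new data without refitting
new_codes = le_demo.transform(['female', 'male'])
print(new_codes.tolist())      # [0, 1]
```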

In [7]:
# Now we split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df_rf[['Pclass', 'Sex', 'Age', 'Norm_fare']], df_rf['Survived'], test_size=0.2, random_state=42)

#### Training the model

In [8]:
# This instantiates the random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=50,  max_depth=7, random_state=42)

In [9]:
# This trains the model
rf_classifier.fit(X_train, y_train)

In [10]:
# This gets the accuracy for the training data
rf_x_train = rf_classifier.predict(X_train)
rf_accuracy_train = accuracy_score(y_train, rf_x_train)

rf_accuracy_train

0.8862359550561798

In [11]:
# This gets the accuracy for the test data
rf_x_test = rf_classifier.predict(X_test)
rf_accuracy_test = accuracy_score(y_test, rf_x_test)

rf_accuracy_test

0.8212290502793296
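
For reference, `accuracy_score` is simply the fraction of predictions that match the true labels. A minimal sketch with made-up labels (the arrays below are illustrative, not the notebook's data):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# Fraction of positions where prediction and truth agree (4 out of 5)
manual_accuracy = (y_true == y_pred).mean()
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sklearn_accuracy)
```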

#### Getting predictions

In [12]:
# This loads the dataset
df_rf_pred = pd.read_csv('test.csv')

In [13]:
# This fills the NaN values in the age column with the mean age of the training data
df_rf_pred['Age'] = df_rf_pred['Age'].fillna(mean_age)

In [14]:
# This fills the NaN values in the fare column with zeros
df_rf_pred['Fare'] = df_rf_pred['Fare'].fillna(0)

In [15]:
# This creates the relatives column
df_rf_pred['Relatives'] = df_rf_pred['SibSp'] + df_rf_pred['Parch']

In [16]:
# This creates the norm_fare column
df_rf_pred['Norm_fare'] = df_rf_pred.apply(lambda row: row['Fare'] / row['Relatives'] if row['Relatives'] != 0 else row['Fare'], axis=1)

In [17]:
# This encodes the sex column as ones and zeros, reusing the encoder fitted on the training data
df_rf_pred['Sex'] = le.transform(df_rf_pred['Sex'])

In [18]:
#This calculates the predictions and their probabilities
rf_predictions = rf_classifier.predict(df_rf_pred[['Pclass', 'Sex', 'Age', 'Norm_fare']])
rf_probab = rf_classifier.predict_proba(df_rf_pred[['Pclass', 'Sex', 'Age', 'Norm_fare']])

In [19]:
# This creates a df with the passengerid and its predicted fate
titanic_rf = pd.DataFrame()
titanic_rf['PassengerId'] = df_rf_pred['PassengerId']
titanic_rf['Survived'] = rf_predictions

In [20]:
# This saves the resulting df as a csv file
titanic_rf.to_csv('titanic_rf.csv', index=False)

## Support vector machine

#### Preprocessing

In [21]:
# Since the feature set for this model is slightly different, I will reload the dataset as df_x
df_x = pd.read_csv('train.csv')

In [22]:
# This turns the cabin column into ones and zeros depending on whether the passenger has a cabin
df_x['Cabin'] = df_x['Cabin'].notna().astype(int)

In [23]:
# This encodes the Sex column as zeros and ones and the Embarked column as integer codes
df_x['Sex'] = le.fit_transform(df_x['Sex'])
df_x['Embarked'] = le.fit_transform(df_x['Embarked'])

In [24]:
# This calculates the number of relatives on board
df_x['Relatives'] = df_x['SibSp'] + df_x['Parch']

In [25]:
# This gets the normalized fare by the number of relatives on board
df_x['Norm_fare'] = df_x.apply(lambda row: row['Fare'] / row['Relatives'] if row['Relatives'] != 0 else row['Fare'], axis=1)

In [26]:
# This fills NaNs in the Age column with the median age (not the mean as before)
median_age = df_x['Age'].median()
df_x['Age'] = df_x['Age'].fillna(median_age)

In [27]:
# Now we split the dataset
X_train, X_test, y_train, y_test = train_test_split(df_x[['Pclass',
                                                          'Sex',
                                                          'Age',
                                                          'SibSp',
                                                          'Parch',
                                                          'Relatives',
                                                          'Cabin',
                                                          'Norm_fare',
                                                          'Embarked']],
                                                    df_x['Survived'],
                                                    test_size=0.2,
                                                    random_state=42)

In [28]:
# I'll build the same features for the predictions dataset
df_y = pd.read_csv('test.csv')

# This introduces a dummy fare value for NaNs
df_y['Fare'] = df_y['Fare'].fillna(0)

# This turns the cabin column into ones and zeros depending on whether the passenger has a cabin
df_y['Cabin'] = df_y['Cabin'].notna().astype(int)

# This encodes the Sex column as zeros and ones and the Embarked column as integer codes
df_y['Sex'] = le.fit_transform(df_y['Sex'])
df_y['Embarked'] = le.fit_transform(df_y['Embarked'])

# This calculates the number of relatives on board
df_y['Relatives'] = df_y['SibSp'] + df_y['Parch']

# This gets the normalized fare by the number of relatives on board
df_y['Norm_fare'] = df_y.apply(lambda row: row['Fare'] / row['Relatives'] if row['Relatives'] != 0 else row['Fare'], axis=1)

# This fills NaNs in the Age column with the median age of the training data (not the mean as before)
df_y['Age'] = df_y['Age'].fillna(median_age)

In [29]:
# This combines both dfs to fit the scaler
both_dfs = pd.concat([df_x, df_y], ignore_index=True)

In [30]:
# This standardizes the features
scaler = StandardScaler()
scaler.fit(both_dfs[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Relatives', 'Cabin', 'Norm_fare', 'Embarked']])
X_train_st = scaler.transform(X_train)
X_test_st = scaler.transform(X_test)
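
To make the standardization step concrete, here is a minimal sketch on a made-up array (the `data` values and the `demo_scaler` name are illustrative assumptions). `StandardScaler` learns the per-column mean and standard deviation in `fit` and then applies `(x - mean) / std` in `transform`, so any data transformed later is scaled with the statistics learned during fitting.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up two-column data for illustration only
data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

demo_scaler = StandardScaler().fit(data)
scaled = demo_scaler.transform(data)

# The same result computed by hand (population standard deviation, ddof=0)
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(np.allclose(scaled, manual))  # True
```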

#### Training the model

In [31]:
# This creates an instance of a SVC
svm_model = SVC(kernel='rbf', C=1, probability=True)


In [32]:
# This trains the model on the standardized training data
svm_model.fit(X_train_st, y_train)

In [33]:
# This evaluates accuracy on the training set
svm_x_train = svm_model.predict(X_train_st)
svm_x_train_acc = accuracy_score(y_train, svm_x_train)

svm_x_train_acc

0.8398876404494382

In [34]:
# This evaluates accuracy on the test set
svm_x_test = svm_model.predict(X_test_st)
svm_x_test_acc = accuracy_score(y_test, svm_x_test)

svm_x_test_acc

0.8156424581005587

#### Get predictions

In [35]:
# This standardizes the prediction features with the scaler fitted above
df_y_st = scaler.transform(df_y[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Relatives', 'Cabin', 'Norm_fare', 'Embarked']])

In [36]:
# This gets the predictions and the probabilities
svm_predictions = svm_model.predict(df_y_st)
svm_probab = svm_model.predict_proba(df_y_st)
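
A short sketch of what `predict_proba` returns for an `SVC` created with `probability=True`, using synthetic data from `make_classification` as a stand-in (an assumption for illustration, not the Titanic features). The output has one column per class, ordered as in `classes_`, so column 1 holds the estimated probability of class 1. Note that scikit-learn obtains these probabilities through an internal cross-validated calibration, so they can occasionally disagree with the labels returned by `predict`.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (illustration only, not the Titanic features)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

demo_svm = SVC(kernel='rbf', C=1, probability=True, random_state=0)
demo_svm.fit(X_demo, y_demo)

proba = demo_svm.predict_proba(X_demo)

# One row per sample, one column per class in the order of demo_svm.classes_
print(demo_svm.classes_)
print(proba.shape)
```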

In [37]:
# This creates a df with the passengerid and its predicted fate
titanic_svm = pd.DataFrame()
titanic_svm['PassengerId'] = df_y['PassengerId']
titanic_svm['Survived'] = svm_predictions


In [38]:
titanic_svm.to_csv('titanic_svm.csv', index=False)

## Neural network (multilayer perceptron)

This section trains the neural network. It reuses the standardized data prepared for the SVM model.

In [39]:
# This instantiates a sequential model in Keras
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=X_train_st.shape[1]))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))


In [40]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 64)                640       
                                                                 
 dense_1 (Dense)             (None, 32)                2080      
                                                                 
 dense_2 (Dense)             (None, 1)                 33        
                                                                 
Total params: 2753 (10.75 KB)
Trainable params: 2753 (10.75 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
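
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. With the 9 input features used for the SVM, a quick sketch of the arithmetic (the helper `dense_params` is a hypothetical name introduced here):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 9 input features, then the three Dense layers of the model
layer_sizes = [9, 64, 32, 1]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts, total)  # [640, 2080, 33] 2753
```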


In [41]:
# This compiles the model with accuracy metrics
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [42]:
# Fitting the model
model.fit(X_train_st, y_train, epochs=10, validation_split=0.1, verbose=2)

Epoch 1/10
20/20 - 3s - loss: 0.6226 - accuracy: 0.6938 - val_loss: 0.5214 - val_accuracy: 0.8194 - 3s/epoch - 126ms/step
Epoch 2/10
20/20 - 0s - loss: 0.5306 - accuracy: 0.7734 - val_loss: 0.4243 - val_accuracy: 0.8333 - 177ms/epoch - 9ms/step
Epoch 3/10
20/20 - 0s - loss: 0.4848 - accuracy: 0.7953 - val_loss: 0.3684 - val_accuracy: 0.8472 - 254ms/epoch - 13ms/step
Epoch 4/10
20/20 - 0s - loss: 0.4616 - accuracy: 0.7969 - val_loss: 0.3358 - val_accuracy: 0.8472 - 180ms/epoch - 9ms/step
Epoch 5/10
20/20 - 0s - loss: 0.4475 - accuracy: 0.8062 - val_loss: 0.3230 - val_accuracy: 0.8611 - 130ms/epoch - 6ms/step
Epoch 6/10
20/20 - 0s - loss: 0.4381 - accuracy: 0.8141 - val_loss: 0.3155 - val_accuracy: 0.8750 - 358ms/epoch - 18ms/step
Epoch 7/10
20/20 - 0s - loss: 0.4290 - accuracy: 0.8234 - val_loss: 0.3082 - val_accuracy: 0.8889 - 161ms/epoch - 8ms/step
Epoch 8/10
20/20 - 0s - loss: 0.4236 - accuracy: 0.8219 - val_loss: 0.3006 - val_accuracy: 0.9028 - 228ms/epoch - 11ms/step
Epoch 9/10
20/

<keras.src.callbacks.History at 0x7c74581818a0>

In [43]:
# Accuracy for the test set
mlp_test_x = model.predict(X_test_st)
mlp_test_x = (mlp_test_x > 0.5).astype(int)
mlp_acc = accuracy_score(y_test, mlp_test_x)

mlp_acc



0.8268156424581006

In [44]:
# This gets the probabilities for the relevant data
mlp_probab = model.predict(df_y_st)
mlp_predictions = (mlp_probab > 0.5).astype(int)
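
The thresholding step can be illustrated on a made-up array with the same `(n, 1)` shape that a single sigmoid output unit produces (the values below are illustrative only). Probabilities strictly above 0.5 become 1 and everything else, including exactly 0.5, becomes 0; `ravel` flattens the result to one dimension when a plain vector of labels is needed.

```python
import numpy as np

# Made-up sigmoid outputs with shape (3, 1), for illustration only
probs = np.array([[0.2], [0.7], [0.5]])

# Strictly greater than 0.5 counts as class 1
labels = (probs > 0.5).astype(int)
flat_labels = labels.ravel()

print(labels.shape, flat_labels.tolist())  # (3, 1) [0, 1, 0]
```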



In [45]:
# This creates a df with the required data
titanic_mlp = pd.DataFrame()
titanic_mlp['PassengerId'] = df_y['PassengerId']
titanic_mlp['Survived'] = mlp_predictions.ravel()  # flatten the (n, 1) array to one value per passenger

In [46]:
# This saves the df to a csv file
titanic_mlp.to_csv('titanic_mlp.csv', index=False)

## Ensembling the results of the three models

In [47]:
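# This takes the survival probability (column 1) from the random forest as a column vector
rf_prob_reshaped = rf_probab[:, 1].reshape(-1, 1)

In [48]:
# This does the same for the SVM, so both match the (n, 1) shape of the MLP output
svm_prob_reshaped = svm_probab[:, 1].reshape(-1, 1)

In [49]:
# This averages the three survival probabilities with equal weights
combined_props = 1/3 * rf_prob_reshaped + 1/3 * svm_prob_reshaped + 1/3 * mlp_probab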

In [50]:
combined_predictions = (combined_props > 0.5).astype(int)
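
The averaging of probabilities used here is often called soft voting. The sketch below uses made-up probabilities for three passengers (the arrays `p_rf`, `p_svm` and `p_mlp` are hypothetical values, not outputs of the models above) to show how it can differ from a majority vote on hard labels: for the second passenger only one model is above 0.5, yet the averaged probability is 0.55, so the ensemble predicts survival.

```python
import numpy as np

# Made-up survival probabilities from three models, for illustration only
p_rf = np.array([0.9, 0.40, 0.2])
p_svm = np.array([0.6, 0.45, 0.3])
p_mlp = np.array([0.3, 0.80, 0.1])

# Equal-weight average of the three probabilities, one row per passenger
avg = np.column_stack([p_rf, p_svm, p_mlp]).mean(axis=1)
soft_labels = (avg > 0.5).astype(int)

# Majority vote on the individual hard labels, for comparison
hard_votes = (np.column_stack([p_rf, p_svm, p_mlp]) > 0.5).sum(axis=1)
majority_labels = (hard_votes >= 2).astype(int)

print(soft_labels.tolist(), majority_labels.tolist())  # [1, 1, 0] [1, 0, 0]
```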

In [51]:
# This creates a df with the required data
titanic_ensemble = pd.DataFrame()
titanic_ensemble['PassengerId'] = df_y['PassengerId']
titanic_ensemble['Survived'] = combined_predictions.ravel()  # flatten the (n, 1) array to one value per passenger

In [52]:
titanic_ensemble.to_csv('titanic_ensemble.csv', index=False)