## Welcome to my workbook for the Kaggle Titanic Competition using a neural network developed with PyTorch!

To begin, we need the training and test datasets, which we will download programmatically with the Kaggle API. Note: you need a valid Kaggle API token (a kaggle.json file in your '.kaggle' folder) for the code below to download the datasets.
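
Before running the download cell, it can help to confirm that the token file is where the Kaggle client will look for it. The snippet below is a minimal sketch, assuming the default location of `~/.kaggle/kaggle.json`; the helper name `find_kaggle_token` is hypothetical and not part of the kaggle package.

```python
from pathlib import Path


def find_kaggle_token(config_dir=None):
    """Return the path to kaggle.json if it exists, otherwise None.

    Hypothetical helper for illustration; it only checks the file system.
    """
    # Assumed default location: ~/.kaggle/kaggle.json
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    token_path = config_dir / "kaggle.json"
    return token_path if token_path.is_file() else None


token = find_kaggle_token()
if token is None:
    print("No kaggle.json found; create an API token on Kaggle and place it in ~/.kaggle")
else:
    print(f"Found Kaggle API token at {token}")
```

If the token is missing, the download cell will fail before any data is fetched, so this check is a cheap way to get a clearer message first.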

In [89]:
import torch
from torch import nn
import pandas as pd
import numpy as np
import sklearn
import zipfile
try:
    import kaggle
    print('kaggle imported!')
except ImportError:
    # Install the kaggle package if it is missing (the ! syntax only works in a notebook)
    !pip install kaggle
    import kaggle
    print('kaggle installed and imported!')
# Authenticate with the local kaggle.json token and download the competition archive
kaggle.api.authenticate()
kaggle.api.competition_download_files('titanic', quiet = False)
# Unpack the archive and load the two CSV files
with zipfile.ZipFile('titanic.zip', 'r') as zip_file:
    zip_file.extractall('extracted_files/')
train_file = pd.read_csv('extracted_files/train.csv')
test_file = pd.read_csv('extracted_files/test.csv')

kaggle imported!
titanic.zip: Skipping, found more recently modified local copy (use --force to force download)


Now that we have our data, we can begin preparing it. The EDA was done in another Jupyter notebook on my GitHub called 'Kaggle_Titanic_Competition', so we will skip that step here and apply its insights directly: we drop the unhelpful features and impute missing values in the ones we intend to keep. We will also change the data type of Pclass to object so that it is treated as a categorical feature.

In [90]:
from sklearn.impute import SimpleImputer
train_target = train_file[['Survived']]
train_data = train_file
train_data = train_data.drop(['PassengerId', 'Survived', 'Name', 'Ticket', 'Cabin'], axis = 1)
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
e_imputer = SimpleImputer(missing_values = np.nan, strategy = 'most_frequent')
train_data[['Age']] = imputer.fit_transform(train_data[['Age']])
train_data[['Embarked']] = e_imputer.fit_transform(train_data[['Embarked']])
train_data['Pclass'] = train_data['Pclass'].astype('object')

In [91]:
test_data = test_file
test_data = test_data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1)
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
e_imputer = SimpleImputer(missing_values = np.nan, strategy = 'most_frequent')
test_data[['Age']] = imputer.fit_transform(test_data[['Age']])
test_data[['Embarked']] = e_imputer.fit_transform(test_data[['Embarked']])
test_data[['Fare']] = imputer.fit_transform(test_data[['Fare']])
test_data['Pclass'] = test_data['Pclass'].astype('object')
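
To make the two imputation strategies concrete, here is a minimal sketch on toy data (the values are made up, not taken from the Titanic files). Unlike the cells above, which fit fresh imputers on the test file, this sketch fits on the toy training frame and only calls `transform` on the toy test frame, a common alternative that keeps the fill values identical for both sets.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames for illustration only
toy_train = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": pd.Series(["S", "S", np.nan], dtype=object),
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 10.0],
    "Embarked": pd.Series([np.nan, "C"], dtype=object),
})

age_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
port_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# Learn the fill values from the training frame...
toy_train["Age"] = age_imputer.fit_transform(toy_train[["Age"]]).ravel()
toy_train["Embarked"] = port_imputer.fit_transform(toy_train[["Embarked"]]).ravel()

# ...and reuse them on the test frame without refitting
toy_test["Age"] = age_imputer.transform(toy_test[["Age"]]).ravel()
toy_test["Embarked"] = port_imputer.transform(toy_test[["Embarked"]]).ravel()

print(toy_test)
```

Here the missing test age is filled with the training mean (30.0) and the missing port with the most frequent training value ('S').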

Next, we will collect the names of the object and numerical columns so that we can split them, apply a OneHotEncoder to the former and a StandardScaler to the latter, and then recombine them into a new DataFrame.

In [92]:
num_columns = train_data.select_dtypes(include = np.number).columns.to_list()
obj_columns = train_data.select_dtypes(include = 'object').columns.to_list()
num_train_data = train_data[num_columns]
obj_train_data = train_data[obj_columns]
print(num_columns)
print(obj_columns)

['Age', 'SibSp', 'Parch', 'Fare']
['Pclass', 'Sex', 'Embarked']


In [93]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Work on an explicit copy so the assignment below does not raise SettingWithCopyWarning
num_train_data = num_train_data.copy()
# StandardScaler standardizes each column independently, so Age and Fare can be scaled together
num_train_data[['Age', 'Fare']] = scaler.fit_transform(num_train_data[['Age', 'Fare']])


In [94]:
from sklearn.preprocessing import OneHotEncoder
# sparse_output=False returns a dense NumPy array that can be wrapped in a DataFrame
encoder = OneHotEncoder(sparse_output=False)
obj_train_data = encoder.fit_transform(obj_train_data)

In [95]:
names = encoder.get_feature_names_out(obj_columns)
obj_train_data = pd.DataFrame(obj_train_data, columns = names)
train_concat = pd.concat([num_train_data, obj_train_data], axis = 1)

With the training data prepared, we can split it into a training set and a held-out test set (20% of the rows) that we will use to evaluate the model.

In [96]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train_concat, train_file['Survived'], random_state = 42, test_size = 0.2)
X_train_t = torch.tensor(X_train.to_numpy()).to(torch.float32)
X_test_t = torch.tensor(X_test.to_numpy()).to(torch.float32)
y_train_t = torch.tensor(y_train.to_numpy()).to(torch.float32)
y_test_t = torch.tensor(y_test.to_numpy()).to(torch.float32)

Now that the training and test sets are prepared, we can start to develop our neural network. For this network, I used three linear layers separated by ReLU activations; I tested the performance of several variants empirically and found the architecture below to give the best results.

In [97]:
class TitanicRegressor(nn.Module):
  def __init__(self, in_features, out_features, hidden_units):
    super().__init__()
    self.layer_stack = nn.Sequential(
        nn.Linear(in_features = in_features, out_features = hidden_units),
        nn.ReLU(),
        nn.Linear(in_features = hidden_units, out_features = hidden_units),
        nn.ReLU(),
        nn.Linear(in_features = hidden_units, out_features = out_features)
    )
  def forward(self, x):
    return self.layer_stack(x)
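
As a quick sanity check of this architecture, the sketch below rebuilds the same layer stack with the sizes used later in this notebook (12 input features, 32 hidden units, 1 output), pushes a dummy batch through it, and counts the trainable parameters. The batch of zeros is purely illustrative.

```python
import torch
from torch import nn

# Same layout as the class above, with the sizes used when the model is instantiated
layer_stack = nn.Sequential(
    nn.Linear(in_features=12, out_features=32),
    nn.ReLU(),
    nn.Linear(in_features=32, out_features=32),
    nn.ReLU(),
    nn.Linear(in_features=32, out_features=1),
)

# A dummy batch of 5 passengers with 12 features each
dummy_batch = torch.zeros(5, 12)
logits = layer_stack(dummy_batch)

# Each Linear layer contributes (inputs * outputs) weights plus one bias per output:
# (12*32 + 32) + (32*32 + 32) + (32*1 + 1) = 1505
n_params = sum(p.numel() for p in layer_stack.parameters())

print(logits.shape)  # torch.Size([5, 1])
print(n_params)      # 1505
```

The output has shape `[batch, 1]`, which is why the training loop calls `.squeeze()` before comparing the logits with the one-dimensional target tensor.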

Below, we instantiate a model with the architecture we developed and set the hyperparameters for the network and the optimizer.

In [98]:
nn_model = TitanicRegressor(in_features= 12, out_features = 1, hidden_units = 32)
optimizer = torch.optim.Adam(params = nn_model.parameters(), lr = 0.001)
loss_fn = nn.BCEWithLogitsLoss()
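
A short note on the loss function: `nn.BCEWithLogitsLoss` expects raw logits and applies the sigmoid internally, which is why the model has no sigmoid layer at the end. The sketch below, using made-up logits and targets, illustrates that it matches `nn.BCELoss` applied to sigmoid outputs.

```python
import torch
from torch import nn

# Made-up logits and binary targets for illustration
logits = torch.tensor([-2.0, 0.0, 3.0])
targets = torch.tensor([0.0, 1.0, 1.0])

# Loss computed directly from logits (sigmoid applied internally)
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)

# Loss computed from explicit probabilities
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# The two values agree up to floating-point error
print(loss_from_logits.item(), loss_from_probs.item())
print(torch.allclose(loss_from_logits, loss_from_probs, atol=1e-6))
```

The logits version is generally preferred because it is more numerically stable for large positive or negative values.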

Below, we run the training and test loops for our neural network, printing the train and test loss every 10 epochs so we can see how effectively the network is reducing its error.

In [99]:
epochs = 200
for epoch in range(epochs):
  nn_model.train()
  optimizer.zero_grad()
  preds = nn_model(X_train_t).squeeze()
  loss = loss_fn(preds, y_train_t)
  loss.backward()
  optimizer.step()

  nn_model.eval()
  with torch.inference_mode():
    test_preds = nn_model(X_test_t).squeeze()
    test_loss = loss_fn(test_preds, y_test_t)
    if epoch % 10 == 0:
      print(f'train loss {loss} | test loss {test_loss}')

train loss 0.6690300107002258 | test loss 0.6732843518257141
train loss 0.6508817672729492 | test loss 0.6573548316955566
train loss 0.6295585632324219 | test loss 0.6378886699676514
train loss 0.6038028001785278 | test loss 0.6126531958580017
train loss 0.5719224810600281 | test loss 0.5800976753234863
train loss 0.5355101227760315 | test loss 0.5423466563224792
train loss 0.4994131326675415 | test loss 0.5036120414733887
train loss 0.4690190553665161 | test loss 0.4690033495426178
train loss 0.4473961889743805 | test loss 0.44419610500335693
train loss 0.43310123682022095 | test loss 0.43085962533950806
train loss 0.422170490026474 | test loss 0.4249804615974426
train loss 0.4134940207004547 | test loss 0.4234435558319092
train loss 0.40654322504997253 | test loss 0.4232333302497864
train loss 0.40030142664909363 | test loss 0.42333459854125977
train loss 0.3945460319519043 | test loss 0.42390334606170654
train loss 0.38949185609817505 | test loss 0.42578816413879395
train loss 0.385
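
In the printed losses, the test loss levels off near 0.423 and then creeps upward while the train loss keeps falling. One common response, which this notebook does not use, is to remember the checkpoint with the lowest test loss. The sketch below only analyses the numbers printed above (rounded to four decimals); the helper name `best_checkpoint` is hypothetical.

```python
# Test losses from the output above, one value every 10 epochs, rounded to 4 decimals
test_losses = [
    0.6733, 0.6574, 0.6379, 0.6127, 0.5801, 0.5423, 0.5036, 0.4690,
    0.4442, 0.4309, 0.4250, 0.4234, 0.4232, 0.4233, 0.4239, 0.4258,
]


def best_checkpoint(losses, step=10):
    """Return (epoch, loss) for the lowest loss, given one reading every `step` epochs."""
    best_index = min(range(len(losses)), key=losses.__getitem__)
    return best_index * step, losses[best_index]


best_epoch, best_loss = best_checkpoint(test_losses)
print(f"lowest printed test loss {best_loss} at epoch {best_epoch}")
```

In a real training loop the same idea would be applied every epoch, typically by saving a copy of the model's state whenever the test loss improves.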

Now that the training and testing loops have optimized the weights and biases of the neural network, we convert the test logits into predictions by applying a sigmoid function and rounding the result to 0 or 1. Finally, we display the accuracy, which is the fraction of predictions that match the true labels.

In [100]:
predictions = torch.round(torch.sigmoid(test_preds)) == y_test_t
accuracy = predictions.sum()/torch.numel(predictions)
accuracy

tensor(0.8268)
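
The sigmoid-then-round step can be illustrated on a handful of made-up logits. Because the sigmoid of 0 is 0.5, rounding the probability gives the same labels as checking whether the logit is positive; the sketch below shows both, along with the accuracy calculation.

```python
import torch

# Made-up logits and true labels for illustration
logits = torch.tensor([-1.5, 0.3, 2.0, -0.2])
labels = torch.tensor([0.0, 1.0, 0.0, 0.0])

# Approach used in this notebook: probabilities rounded to 0 or 1
pred_round = torch.round(torch.sigmoid(logits))

# Equivalent shortcut: a positive logit means a probability above 0.5
pred_sign = (logits > 0).float()

# Accuracy is the fraction of predictions that match the labels
correct = pred_round == labels
accuracy = correct.float().mean()

print(pred_round)  # tensor([0., 1., 1., 0.])
print(accuracy)    # tensor(0.7500)
```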

We have trained and evaluated our model. Now we will prepare the Kaggle test data the same way we prepared our training data.

In [101]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Take an explicit copy so the assignment below does not raise SettingWithCopyWarning
num_test_data = test_data[num_columns].copy()
obj_test_data = test_data[obj_columns]
# Note: the scaler and encoder are refit on the Kaggle test data here, mirroring the training preparation
scaler = StandardScaler()
num_test_data[['Age', 'Fare']] = scaler.fit_transform(num_test_data[['Age', 'Fare']])
encoder = OneHotEncoder(sparse_output=False)
obj_test_data = encoder.fit_transform(obj_test_data)
names = encoder.get_feature_names_out(obj_columns)
obj_test_data = pd.DataFrame(obj_test_data, columns = names)
test_concat = pd.concat([num_test_data, obj_test_data], axis = 1)
test_concat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Age         418 non-null    float64
 1   SibSp       418 non-null    int64  
 2   Parch       418 non-null    int64  
 3   Fare        418 non-null    float64
 4   Pclass_1    418 non-null    float64
 5   Pclass_2    418 non-null    float64
 6   Pclass_3    418 non-null    float64
 7   Sex_female  418 non-null    float64
 8   Sex_male    418 non-null    float64
 9   Embarked_C  418 non-null    float64
 10  Embarked_Q  418 non-null    float64
 11  Embarked_S  418 non-null    float64
dtypes: float64(10), int64(2)
memory usage: 39.3 KB


In [102]:
X_final = torch.tensor(test_concat.to_numpy()).to(torch.float32)
X_final.shape

torch.Size([418, 12])

Finally, we will make predictions on the blinded test data. The line that creates a CSV file with the predictions is commented out. Thank you for reviewing my notebook; I am early in my journey and have a lot to learn about data science. This was my first attempt at creating a neural network from scratch on data I had cleaned and prepared myself, so I found the process highly educational. Ultimately, this neural network performed just a bit worse than the XGBoost Classifier I trained for the same purpose in my companion notebook on my GitHub, though I suspect that with more data the neural network would eventually outperform the XGBoost Classifier.

In [103]:
nn_model.eval()
with torch.inference_mode():
    test_preds_final = nn_model(X_final).squeeze()
    # Convert logits to probabilities, then round to hard 0/1 predictions
    final_predictions = torch.round(torch.sigmoid(test_preds_final))
print(final_predictions.shape)
final_predictions = final_predictions.numpy()
pred_df = pd.DataFrame({'PassengerId' : test_file['PassengerId'], 'Survived':final_predictions})
# Cast the rounded floats to integers for the submission file
pred_df[['Survived']] = pred_df[['Survived']].astype('int64')
# pred_df.to_csv('akozikowski_titanic_preds_nn.csv', index = False)

torch.Size([418])
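
To show what the commented-out `to_csv` line would produce, here is a minimal sketch that writes a tiny submission to an in-memory buffer instead of a file. The passenger IDs and predictions are illustrative placeholders, not real model output.

```python
import io

import numpy as np
import pandas as pd

# Illustrative placeholders: three passenger IDs and three rounded predictions
passenger_ids = pd.Series([892, 893, 894], name="PassengerId")
rounded_predictions = np.array([0.0, 1.0, 0.0], dtype=np.float32)

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    # Integer labels, as in the cell above
    "Survived": rounded_predictions.astype("int64"),
})

# Write to an in-memory buffer so no file is created
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()

print(csv_lines[0])  # PassengerId,Survived
print(csv_lines[1])  # 892,0
```

Passing `index=False` matters here: without it, pandas would write the row index as an extra unnamed first column.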
