<a href="https://colab.research.google.com/github/ZiadAlrzooq/TitanicSurvivalPredicitonPyTorch/blob/main/Titanic_Survival_Prediciton.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is my first ML project using PyTorch. The goal is to predict whether a passenger survived the sinking of the Titanic (binary classification) based on factors such as age, gender, and ticket class.
The dataset comes from Kaggle's Titanic prediction competition: https://www.kaggle.com/c/titanic

In [1]:
# Download dataset from kaggle
!pip install -q kaggle
from google.colab import files
files.upload() # upload kaggle API Token (kaggle.json)

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"ziyadalrzouq","key":"<redacted>"}'}

In [2]:
!mkdir -p ~/.kaggle # -p avoids an error if the directory already exists
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets list # Test if it's working

ref                                                     title                                       size  lastUpdated          downloadCount  voteCount  usabilityRating  
------------------------------------------------------  -----------------------------------------  -----  -------------------  -------------  ---------  ---------------  
thedrcat/daigt-v2-train-dataset                         DAIGT V2 Train Dataset                      29MB  2023-11-16 01:38:36           3160        259  1.0              
muhammadbinimran/housing-price-prediction-data          Housing Price Prediction Data              763KB  2023-11-21 17:56:32          13212        231  1.0              
thedrcat/daigt-external-train-dataset                   DAIGT External Train Dataset               435MB  2023-11-06 17:10:37            490         55  1.0              
thedrcat/daigt-proper-train-dataset                     DAIGT Proper Train Dataset                 119MB  2023-11-05 14:03:25           2429     

In [3]:
# Download the Titanic prediction competition dataset
!kaggle competitions download -c titanic

Downloading titanic.zip to /content
  0% 0.00/34.1k [00:00<?, ?B/s]
100% 34.1k/34.1k [00:00<00:00, 1.23MB/s]


In [4]:
# unzip the dataset
!unzip titanic.zip

Archive:  titanic.zip
  inflating: gender_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


In [5]:
import torch
import torch.nn as nn
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

device = 'cuda' if torch.cuda.is_available() else 'cpu' # device agnostic code
device

'cuda'

In [6]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
# Preview the training data (the labels are separated later, after the split)
df_train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


# Data Preprocessing:
The following code cells preprocess the data: they drop unused columns, fill in NaN values, and encode categorical columns as numerical values.

In [7]:
# Check which columns contain NaN values (to see which ones need handling)
columns_with_nan = df_train.columns[df_train.isnull().any()].tolist()
print(columns_with_nan)

['Age', 'Cabin', 'Embarked']
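
Knowing how many values are missing in each column, not only which columns are affected, can help decide between imputing and dropping. A minimal sketch on a small hypothetical frame (`toy` and its values are illustrative stand-ins, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_train (values are illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing entries per column, then keep only the columns that have any
nan_counts = toy.isnull().sum()
nan_counts = nan_counts[nan_counts > 0]
print(nan_counts)
```

The same two lines should work on `df_train` directly.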


In [8]:
# Drop features that are not used by the model
df_train.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)
df_test.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)
# Note: Name can be a relevant feature (e.g. by extracting titles such as Mr, Mrs, Miss), but it is dropped here for convenience
df_train

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.2500,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.9250,S
3,1,1,female,35.0,1,0,53.1000,S
4,0,3,male,35.0,0,0,8.0500,S
...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S
887,1,1,female,19.0,0,0,30.0000,S
888,0,3,female,,1,2,23.4500,S
889,1,1,male,26.0,0,0,30.0000,C
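
The note in the cell above mentions that titles could be extracted from `Name`. This notebook does not do that, but one possible approach is sketched here on a few names taken from the table shown earlier. The regular expression is an assumption about the name format ("Surname, Title. Given names"):

```python
import pandas as pd

# A few names copied from the training data preview
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# Capture the text between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

The resulting column could then be one-hot encoded like `Sex` and `Embarked`.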


In [9]:
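# Handle NaN values in numeric columns (such as Age) using the column mean
# numeric_only=True skips string columns such as Sex and Embarked
df_train.fillna(df_train.mean(numeric_only=True), inplace=True)
df_test.fillna(df_test.mean(numeric_only=True), inplace=True)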
df_train


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.000000,1,0,7.2500,S
1,1,1,female,38.000000,1,0,71.2833,C
2,1,3,female,26.000000,0,0,7.9250,S
3,1,1,female,35.000000,1,0,53.1000,S
4,0,3,male,35.000000,0,0,8.0500,S
...,...,...,...,...,...,...,...,...
886,0,2,male,27.000000,0,0,13.0000,S
887,1,1,female,19.000000,0,0,30.0000,S
888,0,3,female,29.699118,1,2,23.4500,S
889,1,1,male,26.000000,0,0,30.0000,C
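
The cell above fills each frame with its own column means. A common alternative, shown here only as a sketch and not as what this notebook does, is to compute the means on the training data and reuse them for the test data, so that no statistics are taken from the test set. The frames `train` and `test` below are hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames (values are illustrative only)
train = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Fare": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"Age": [np.nan, 50.0], "Fare": [np.nan, 8.0]})

# Compute the imputation values once, from the training data only
train_means = train.mean(numeric_only=True)

# Apply the same values to both frames
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)
print(test_filled)
```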


In [10]:
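# Handle NaN values in categorical columns (such as Embarked) using the mode (most frequent value)
# Assigning the result back avoids chained in-place modification, which newer pandas versions warn about
df_train['Embarked'] = df_train['Embarked'].fillna(df_train['Embarked'].mode()[0])
df_test['Embarked'] = df_test['Embarked'].fillna(df_test['Embarked'].mode()[0])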
df_train

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.000000,1,0,7.2500,S
1,1,1,female,38.000000,1,0,71.2833,C
2,1,3,female,26.000000,0,0,7.9250,S
3,1,1,female,35.000000,1,0,53.1000,S
4,0,3,male,35.000000,0,0,8.0500,S
...,...,...,...,...,...,...,...,...
886,0,2,male,27.000000,0,0,13.0000,S
887,1,1,female,19.000000,0,0,30.0000,S
888,0,3,female,29.699118,1,2,23.4500,S
889,1,1,male,26.000000,0,0,30.0000,C


In [11]:
# One-hot encode categorical columns such as Sex and Embarked using pd.get_dummies()
# dtype=int keeps the encoded columns as 0/1 integers (newer pandas versions default to booleans)
sex = pd.get_dummies(df_train['Sex'], drop_first=True, dtype=int)
embarked = pd.get_dummies(df_train['Embarked'], drop_first=True, dtype=int)
df_train = pd.concat([df_train, sex, embarked], axis=1)

# Drop the original (pre-encoding) columns
df_train.drop(columns=['Embarked', 'Sex'], inplace=True)

# Do the same with the testing data
sex = pd.get_dummies(df_test['Sex'], drop_first=True, dtype=int)
embarked = pd.get_dummies(df_test['Embarked'], drop_first=True, dtype=int)
df_test = pd.concat([df_test, sex, embarked], axis=1)

df_test.drop(columns=['Embarked', 'Sex'], inplace=True)
df_train


Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,male,Q,S
0,0,3,22.000000,1,0,7.2500,1,0,1
1,1,1,38.000000,1,0,71.2833,0,0,0
2,1,3,26.000000,0,0,7.9250,0,0,1
3,1,1,35.000000,1,0,53.1000,0,0,1
4,0,3,35.000000,0,0,8.0500,1,0,1
...,...,...,...,...,...,...,...,...,...
886,0,2,27.000000,0,0,13.0000,1,0,1
887,1,1,19.000000,0,0,30.0000,0,0,1
888,0,3,29.699118,1,2,23.4500,0,0,1
889,1,1,26.000000,0,0,30.0000,1,0,0
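
Encoding the training and testing frames separately only yields matching columns when both contain the same categories. If a category were missing from one frame, the columns would differ. One way to guard against that, sketched on hypothetical toy data, is to reindex the test encoding to the training columns:

```python
import pandas as pd

# Hypothetical toy data: 'Q' never appears in the test frame
train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["S", "C"]})

train_enc = pd.get_dummies(train["Embarked"], drop_first=True, dtype=int)
test_enc = pd.get_dummies(test["Embarked"], drop_first=True, dtype=int)

# Add any columns the test encoding lacks (filled with 0) and match the column order
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```

This sketch assumes the dropped first category is the same in both frames, which holds here because both contain `C`.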


In [12]:
from sklearn.model_selection import StratifiedShuffleSplit
# Split df_train into training and testing sets with a stratified shuffle split,
# stratifying on the combination of "Survived", "Pclass", and "male" (the encoded Sex column)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
for train_indices, test_indices in sss.split(df_train, df_train[['Survived', 'Pclass', "male"]]):
  # split() yields positional indices, so select rows with iloc
  df_train_set = df_train.iloc[train_indices]
  df_test_set = df_train.iloc[test_indices]

# Pop the labels from the training & testing sets and save them
y_train = df_train_set.pop('Survived')
y_test = df_test_set.pop('Survived')
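
A stratified split is meant to keep class proportions roughly equal in both parts. The sketch below illustrates this on hypothetical toy labels (60 zeros and 40 ones, not the Titanic data), using a single label column rather than the three-column combination used above:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy labels: 60 negatives and 40 positives
y = np.array([0] * 60 + [1] * 40)
X = np.zeros((len(y), 1))  # features are irrelevant for this illustration

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train_idx, test_idx = next(sss.split(X, y))

# Both parts should keep the 40% positive rate
print(y[train_idx].mean(), y[test_idx].mean())
```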

In [13]:
# Turn the dataframes into tensors
X_train = torch.from_numpy(df_train_set.values).float().to(device)
X_test = torch.from_numpy(df_test_set.values).float().to(device)
y_train = torch.from_numpy(y_train.values).long().to(device)
y_test = torch.from_numpy(y_test.values).long().to(device)
kaggle_test = torch.from_numpy(df_test.values).float().to(device) # for kaggle submission

In [14]:
X_train.size(), y_train.size(), y_train.dtype, X_train[0]

(torch.Size([801, 8]),
 torch.Size([801]),
 torch.int64,
 tensor([ 2., 33.,  0.,  2., 26.,  0.,  0.,  1.], device='cuda:0'))

# Training The Model:
With the data now processed, we can start training a neural network.

In [15]:
from torch.utils.data import Dataset, DataLoader

# Create a CustomDataset class that inherits from Dataset, so our tensors can be loaded by DataLoaders
class CustomDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Create two instances of custom dataset. One for training and one for testing
train_dataset = CustomDataset(X_train, y_train)
test_dataset = CustomDataset(X_test, y_test)

# Create DataLoaders for both training and testing sets
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

In [16]:
len(train_loader), len(test_loader)

(26, 3)
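
The loader lengths above follow from the set sizes and the batch size: without `drop_last`, the final partial batch still counts. A small sketch of the arithmetic, where `num_batches` is a hypothetical helper and the held-out size of 90 is inferred (891 rows minus the 801 training rows) rather than printed in this notebook:

```python
import math

def num_batches(n_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Hypothetical helper: number of batches a DataLoader would yield."""
    if drop_last:
        return n_samples // batch_size      # the partial last batch is discarded
    return math.ceil(n_samples / batch_size)  # the partial last batch is kept

print(num_batches(801, 32), num_batches(90, 32))
```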

In [26]:
class TitanicModel(nn.Module):
  def __init__(self, input_features, output_features, hidden_units=8):
    super().__init__()
    self.linear_layer_stack = nn.Sequential(
        nn.Linear(in_features=input_features, out_features=hidden_units),
        nn.ReLU(),
        nn.Linear(in_features=hidden_units, out_features=output_features)
    )

  def forward(self, x):
    return self.linear_layer_stack(x)
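
To get a feel for how small this network is, the number of trainable parameters can be counted. The sketch below rebuilds the same layer stack with the sizes used later in this notebook (8 input features, 8 hidden units, 2 outputs); the name `net` is only for illustration:

```python
import torch.nn as nn

# Same layer stack as TitanicModel, with 8 inputs, 8 hidden units, and 2 outputs
net = nn.Sequential(
    nn.Linear(in_features=8, out_features=8),
    nn.ReLU(),
    nn.Linear(in_features=8, out_features=2),
)

# Each Linear layer has in*out weights plus out biases:
# (8*8 + 8) + (8*2 + 2)
n_params = sum(p.numel() for p in net.parameters())
print(n_params)
```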

In [27]:
# Set up hyperparameters
NUM_CLASSES = 2
NUM_FEATURES = 8
RANDOM_SEED = 42
LR = 0.001
EPOCHS = 100
BATCH_SIZE = 32 # same value as batch_size used for the DataLoaders above

torch.manual_seed(RANDOM_SEED)
torch.cuda.manual_seed(RANDOM_SEED)
model = TitanicModel(input_features=NUM_FEATURES, output_features=NUM_CLASSES, hidden_units=8).to(device)

# Choose a loss function and an optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params=model.parameters(), lr=LR)

# Adapted training loop from PyTorch for Deep Learning & Machine Learning – Full Course Video by mrdbourke
# Link: https://youtu.be/V_xro1bcAuA?si=wy4LyaJj5gJVBksl&t=58081
for epoch in range(EPOCHS):
  train_loss = 0.0
  model.train() # switch to training mode once per epoch
  for X, y in train_loader:
    # Forward pass (the model outputs raw logits)
    y_pred = model(X)

    # Calculate the loss
    loss = loss_fn(y_pred, y)
    train_loss += loss.item() # accumulate a Python float so the computation graph is not kept alive

    # Perform backpropagation
    loss.backward()

    # Optimizer step
    optimizer.step()

    # Reset the gradients for the next batch
    optimizer.zero_grad()

  train_loss /= len(train_loader) # Get the avg train loss per batch

  ### Testing
  test_loss, test_acc = 0, 0
  model.eval()
  with torch.inference_mode():
    for X, y in test_loader:
      test_pred = model(X)
      test_loss += loss_fn(test_pred, y)
      test_acc += (test_pred.argmax(dim=1) == y).float().mean()
    # Calculate average test loss and test acc
    test_loss /= len(test_loader)
    test_acc /= len(test_loader)
  # Print progress
  if epoch % 10 == 0:
    print(f"Epoch: {epoch} | Train Loss: {train_loss:.5f} | Test Loss: {test_loss:.5f} | Test Acc {test_acc*100:.2f}%")

Epoch: 0 | Train Loss: 0.73792 | Test Loss: 0.64687 | Test Acc 69.47%
Epoch: 10 | Train Loss: 0.62735 | Test Loss: 0.54149 | Test Acc 72.52%
Epoch: 20 | Train Loss: 0.58134 | Test Loss: 0.50423 | Test Acc 81.33%
Epoch: 30 | Train Loss: 0.53808 | Test Loss: 0.48849 | Test Acc 79.25%
Epoch: 40 | Train Loss: 0.50140 | Test Loss: 0.45753 | Test Acc 83.41%
Epoch: 50 | Train Loss: 0.48734 | Test Loss: 0.44652 | Test Acc 81.89%
Epoch: 60 | Train Loss: 0.51390 | Test Loss: 0.43251 | Test Acc 82.13%
Epoch: 70 | Train Loss: 0.46401 | Test Loss: 0.43994 | Test Acc 81.57%
Epoch: 80 | Train Loss: 0.44305 | Test Loss: 0.42954 | Test Acc 81.57%
Epoch: 90 | Train Loss: 0.43707 | Test Loss: 0.41668 | Test Acc 83.65%
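
The test accuracy printed above is the mean of per-batch accuracies. Because the last batch is smaller than the others, this can differ slightly from the accuracy over all samples. A small sketch with hypothetical per-batch counts (not taken from the run above):

```python
# Hypothetical per-batch results: (number of correct predictions, batch size)
batches = [(26, 32), (27, 32), (20, 26)]

# Average of per-batch accuracies (what the loop above computes)
mean_of_batch_means = sum(c / n for c, n in batches) / len(batches)

# Accuracy over all samples (each sample weighted equally)
overall = sum(c for c, _ in batches) / sum(n for _, n in batches)

print(f"{mean_of_batch_means:.4f} vs {overall:.4f}")
```

Counting correct predictions and dividing by the dataset size at the end would give the sample-weighted figure.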


# Submission to Kaggle:
The following code cells use the trained model to make predictions on the data from test.csv. The predictions are then saved, along with the passengers' IDs, to a new CSV file named submission.csv.


In [22]:
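model.eval()
with torch.inference_mode(): # no gradients are needed for prediction
  pred = model(kaggle_test).argmax(dim=1)
sub = pd.read_csv('test.csv').pop('PassengerId')
pred = pred.cpu().numpy() # move to the CPU and convert to NumPy for pandas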
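
The model outputs two raw logits per passenger, and `argmax` picks the larger one. If a survival probability is wanted instead of a hard 0/1 label, a softmax over the class dimension can be applied. A minimal sketch with hypothetical logits (`logits` is illustrative, not model output):

```python
import torch

# Hypothetical logits for two passengers: columns are [not survived, survived]
logits = torch.tensor([[2.0, 0.0], [0.0, 1.0]])

# Softmax turns each row into probabilities that sum to 1
probs = torch.softmax(logits, dim=1)
p_survived = probs[:, 1]

# The most probable class matches the argmax of the raw logits
print(p_survived, probs.argmax(dim=1))
```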

In [23]:
sur = pd.DataFrame({'Survived': pred})
sub = pd.concat([sub, sur], axis=1)
sub

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0


In [24]:
sub.to_csv("submission.csv", index=False)
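
Before uploading, it can be worth checking that the submission frame has the expected shape. The helper below, `check_submission`, is a hypothetical sketch and not part of any library; it assumes the file should contain exactly the columns `PassengerId` and `Survived`, one row per ID in test.csv, and only 0/1 predictions:

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Hypothetical helper: raise AssertionError if the frame looks malformed."""
    assert list(sub.columns) == ["PassengerId", "Survived"]
    assert len(sub) == len(expected_ids)
    assert sub["PassengerId"].tolist() == expected_ids.tolist()
    assert set(sub["Survived"].unique()) <= {0, 1}

# Toy example with three IDs taken from the preview above
ids = pd.Series([892, 893, 894], name="PassengerId")
toy_sub = pd.DataFrame({"PassengerId": ids, "Survived": [0, 1, 0]})
check_submission(toy_sub, ids)  # passes silently
```

In this notebook the call would be `check_submission(sub, pd.read_csv('test.csv')['PassengerId'])`.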