# Spaceship Titanic with Lightning

This notebook predicts which passengers were transported by the anomaly on the Spaceship Titanic. We build a neural network with PyTorch and PyTorch Lightning, train it on the training dataset, evaluate it on a held-out validation split, and generate predictions for the test dataset.

In [2]:
import torch 
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")

## Data Preparation

In [3]:
# Read data
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')
sample_submission = pd.read_csv('../data/sample_submission.csv')

In [4]:
# Display the first few rows of the train dataset
train.head(2)

,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True


In [5]:
# Generate descriptive statistics for the train dataset
train.describe()

,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
count,8514.0,8512.0,8510.0,8485.0,8510.0,8505.0
mean,28.82793,224.687617,458.077203,173.729169,311.138778,304.854791
std,14.489021,666.717663,1611.48924,604.696458,1136.705535,1145.717189
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,19.0,0.0,0.0,0.0,0.0,0.0
50%,27.0,0.0,0.0,0.0,0.0,0.0
75%,38.0,47.0,76.0,27.0,59.0,46.0
max,79.0,14327.0,29813.0,23492.0,22408.0,24133.0


In [6]:
# Display summary of the train dataset
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


## Data Cleaning and Preprocessing
This block drops unused columns, handles missing values, creates new features, and encodes categorical features.

In [7]:
# Preprocessing
train_labels = train.pop('Transported')
train.drop(['PassengerId', 'Name', 'Destination'], axis=1, inplace=True)
test.drop(['PassengerId', 'Name', 'Destination'], axis=1, inplace=True)

In [8]:
# Extract cabin parts
for df in [train, test]:
    cabin_parts = df['Cabin'].str.split('/', expand=True)
    df['Deck'] = cabin_parts[0]
    df['Num'] = cabin_parts[1].astype(float)
    df['Side'] = cabin_parts[2]
    df.drop(['Cabin'], axis=1, inplace=True)
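
As a rough illustration of the cabin split, the sketch below applies the same `str.split('/', expand=True)` call to a toy Series. The sample values are made up for the example; the point is only that a missing cabin propagates as missing into all three parts, which is why the number is cast to `float` rather than `int`.

```python
import pandas as pd

# Toy cabins in deck/num/side format; NaN stands in for a missing cabin.
cabins = pd.Series(["B/0/P", "F/12/S", float("nan")])

# expand=True returns one column per piece of the split string.
parts = cabins.str.split("/", expand=True)
deck = parts[0]
num = parts[1].astype(float)  # float keeps the missing entry as NaN
side = parts[2]
```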

In [9]:
# Fill missing values
for df in [train, test]:
    mode_values = df[['HomePlanet', 'CryoSleep', 'VIP', 'Side', 'Deck']].mode().iloc[0]
    df.fillna(mode_values, inplace=True)
    # Assign the result back; in-place fillna on a selected column may not update df in recent pandas
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    df.fillna(0, inplace=True)

In [10]:
# Feature engineering
for df in [train, test]:
    df['MoneySpent'] = df[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].sum(axis=1)
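    # include_lowest=True keeps passengers with Age == 0 in the first bin instead of leaving them unbinned
    df.insert(3, 'AgeCategories', pd.cut(df['Age'], bins=[0, 15, 25, 66, float('inf')], labels=['first', 'second', 'third', 'fourth'], include_lowest=True))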
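
Since the minimum age in the data is 0, it may help to see how `pd.cut` treats the lowest bin edge. The sketch below uses made-up ages with the same bins and labels, and compares the default behaviour (intervals open on the left) with `include_lowest=True`.

```python
import pandas as pd

ages = pd.Series([0.0, 15.0, 15.5, 80.0])
bins = [0, 15, 25, 66, float("inf")]
labels = ["first", "second", "third", "fourth"]

# Default: bins are (0, 15], (15, 25], ... so an age of exactly 0 gets no category.
default_cut = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True makes the first interval closed on the left as well.
inclusive_cut = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)
```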

In [11]:
# One-hot encoding
columns_to_encode = ['Deck', 'Side', 'HomePlanet', 'AgeCategories']
train = pd.get_dummies(train, columns=columns_to_encode)
test = pd.get_dummies(test, columns=columns_to_encode)
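
Because `train` and `test` are encoded separately, a category that appears in only one of them would produce mismatched columns. One possible guard, sketched here on toy frames (`toy_train` and `toy_test` are hypothetical names, not part of the notebook), is to reindex the encoded test frame to the training columns.

```python
import pandas as pd

# Toy frames: deck "T" occurs only in the training data.
toy_train = pd.DataFrame({"Deck": ["A", "B", "T"], "Age": [30.0, 20.0, 40.0]})
toy_test = pd.DataFrame({"Deck": ["A", "B"], "Age": [25.0, 35.0]})

train_enc = pd.get_dummies(toy_train, columns=["Deck"])
test_enc = pd.get_dummies(toy_test, columns=["Deck"])

# reindex adds any missing dummy column (filled with False), drops extras,
# and puts the columns in the same order as the training frame.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=False)
```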

## Data Preparation for Training 

This block prepares the data for training by scaling features, splitting the data into training and validation sets, and setting up PyTorch Dataset and DataLoader.

In [12]:
# Label encoding for target variable
label_encoder = LabelEncoder()
train_labels = label_encoder.fit_transform(train_labels)

# Scaling
scaler = StandardScaler()
float_columns = train.select_dtypes(include=['float64']).columns
train[float_columns] = scaler.fit_transform(train[float_columns])
test[float_columns] = scaler.transform(test[float_columns])

# Train-validation split
x_train, x_validation, y_train, y_validation = train_test_split(train, train_labels, test_size=0.2, random_state=19)
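
The scaling step fits on the training data and reuses those statistics for the test data. A minimal NumPy sketch of the same idea on made-up numbers follows; `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

train_col = np.array([10.0, 20.0, 30.0, 40.0])
test_col = np.array([25.0, 50.0])

# Statistics come from the training column only.
mean = train_col.mean()
std = train_col.std()  # ddof=0, matching StandardScaler

train_scaled = (train_col - mean) / std
# The test column is transformed with the training mean and std, not its own.
test_scaled = (test_col - mean) / std
```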


In [13]:
# Display the first few rows of the x_train dataset
x_train.head(2)

,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Num,MoneySpent,...,Deck_T,Side_P,Side_S,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,AgeCategories_first,AgeCategories_second,AgeCategories_third,AgeCategories_fourth
358,False,0.151488,True,-0.333105,6.614385,0.63954,6.976834,4.752599,-1.10854,8.548028,...,False,True,False,False,True,False,False,False,True,False
4850,False,-0.545948,False,0.548071,-0.281027,-0.283579,-0.139129,-0.216269,0.481416,-0.23471,...,False,True,False,True,False,False,False,True,False,False


In [14]:
# Display y_train dataset
y_train

array([0, 0, 1, ..., 0, 0, 1])

In [15]:
class TitanicDataset(Dataset):
    def __init__(self, x, y):
        # Convert input features and labels to PyTorch tensors
        self.data = torch.from_numpy(np.array(x, dtype=np.float32))
        self.labels = torch.from_numpy(np.array(y, dtype=np.int64))
        
    def __len__(self):
        # Return the length of the dataset
        return len(self.data)
    
    def __getitem__(self, index):
        # Retrieve data and corresponding label at the specified index
        return self.data[index], self.labels[index]
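
To show what a `DataLoader` yields from a dataset of this shape, here is a small self-contained sketch. `ToyDataset` is a hypothetical stand-in with the same structure, fed with made-up arrays; the shapes and batch sizes are the part worth noting.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical stand-in: float32 features, int64 labels."""
    def __init__(self, x, y):
        self.data = torch.from_numpy(np.asarray(x, dtype=np.float32))
        self.labels = torch.from_numpy(np.asarray(y, dtype=np.int64))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

features = np.zeros((10, 4))    # 10 samples, 4 features
targets = np.arange(10) % 2     # alternating 0/1 labels

loader = DataLoader(ToyDataset(features, targets), batch_size=4)
xb, yb = next(iter(loader))     # first batch
batch_sizes = [len(x) for x, _ in loader]  # last batch is smaller
```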


## Model Definition 

This block contains the definition of the neural network architecture, choice of loss function, and optimizer.

In [16]:
class TitanicModel(pl.LightningModule):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Define the neural network architecture
        self.model = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_size, 1)
        )
        # Define the loss function
        self.loss_fn = nn.BCEWithLogitsLoss()

    def forward(self, x):
        # Forward pass through the network
        return self.model(x)

    def training_step(self, batch, batch_idx):
        # Training step
        x, y = batch
        output = self(x)
        # Calculate loss
        loss = self.loss_fn(output, y.float().unsqueeze(1))
        preds = torch.round(torch.sigmoid(output))
        # Calculate accuracy
        acc = (preds == y.unsqueeze(1)).sum().float() / y.size(0)
        # Log metrics per step and aggregated per epoch
        self.log('train_loss', loss, on_step=True, on_epoch=True)
        self.log('train_acc', acc, on_step=True, on_epoch=True)
        return loss

    def validation_step(self, batch, batch_idx):
        # Validation step
        x, y = batch
        output = self(x)
        # Calculate loss
        loss = self.loss_fn(output, y.float().unsqueeze(1))
        preds = torch.round(torch.sigmoid(output))
        # Calculate accuracy
        acc = (preds == y.unsqueeze(1)).sum().float() / y.size(0)
        # Log metrics per step and aggregated per epoch
        self.log('val_loss', loss, on_step=True, on_epoch=True)
        self.log('val_acc', acc, on_step=True, on_epoch=True)

    def configure_optimizers(self):
        # Define optimizer and learning rate scheduler
        optimizer = torch.optim.Adam(self.parameters(), lr=0.005)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=3, verbose=True)
        return {
            'optimizer': optimizer,
            'lr_scheduler': {
                'scheduler': scheduler,
                'monitor': 'val_loss'
            }
        }

# Initialize model and dataloaders
input_size = x_train.shape[1]
hidden_size = input_size * 2
model = TitanicModel(input_size, hidden_size)
train_dataloader = DataLoader(TitanicDataset(x_train, y_train), batch_size=128, shuffle=True)
val_dataloader = DataLoader(TitanicDataset(x_validation, y_validation), batch_size=64)
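
The loss and accuracy computations in the two step methods rely on the logits and targets having the same `(batch, 1)` shape. The sketch below, using made-up logits and labels, shows the `unsqueeze(1)` reshaping and checks that `BCEWithLogitsLoss` agrees with applying a sigmoid followed by `BCELoss`.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0], [-1.0], [0.5]])  # model output, shape (3, 1)
labels = torch.tensor([1, 0, 0])               # integer labels, shape (3,)

# Match the logits: float dtype and an explicit trailing dimension.
targets = labels.float().unsqueeze(1)          # shape (3, 1)

loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)
loss_two_step = nn.BCELoss()(torch.sigmoid(logits), targets)

# Threshold the probabilities at 0.5 and compare with the labels.
preds = torch.round(torch.sigmoid(logits))
accuracy = (preds == labels.unsqueeze(1)).float().mean()
```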

## Model Training 
The model is trained with the PyTorch Lightning `Trainer`, with early stopping on the validation loss.

In [17]:
# Initialize trainer and start training
trainer = pl.Trainer(max_epochs=100, callbacks=[pl.callbacks.EarlyStopping(monitor='val_loss', patience=20, mode='min')])
trainer.fit(model, train_dataloader, val_dataloader)


GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name    | Type              | Params
----------------------------------------------
0 | model   | Sequential        | 4.5 K 
1 | loss_fn | BCEWithLogitsLoss | 0     
----------------------------------------------
4.5 K     Trainable params
0         Non-trainable params
4.5 K     Total params
0.018     Total estimated model params size (MB)


Epoch 36: 100%|██████████| 55/55 [00:00<00:00, 117.89it/s, v_num=109]      
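
The patience-based stopping rule can be sketched in plain Python. This is a simplified illustration, not Lightning's implementation: it ignores options such as `min_delta`, and `stopping_epoch` is a hypothetical helper name. Training stops once the monitored loss has failed to improve for `patience` consecutive checks.

```python
def stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best value: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    # Never ran out of patience: training ends on the last epoch.
    return len(val_losses) - 1

# Made-up validation losses: the best value is at index 2.
losses = [0.50, 0.45, 0.44, 0.46, 0.47, 0.45, 0.48]
stop = stopping_epoch(losses, patience=3)
```

With these made-up losses, the best value occurs at index 2, so three non-improving checks put the stop at index 5.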


## Model Evaluation 

This section prints out the final metrics of the model on the training and validation sets.

In [18]:
# Getting the final validation loss
final_val_loss = trainer.callback_metrics['val_loss']

# Getting the final validation accuracy
final_val_acc = trainer.callback_metrics['val_acc']

# Getting the final training loss
final_train_loss = trainer.callback_metrics['train_loss']

# Getting the final training accuracy
final_train_acc = trainer.callback_metrics['train_acc']

# Printing the final metrics
print("Final Metrics:")
print(f"Training Loss: {final_train_loss:.4f}")
print(f"Training Accuracy: {final_train_acc:.4f}")
print(f"Validation Loss: {final_val_loss:.4f}")
print(f"Validation Accuracy: {final_val_acc:.4f}")


Final Metrics:
Training Loss: 0.3568
Training Accuracy: 0.8237
Validation Loss: 0.3941
Validation Accuracy: 0.8143


## Generating Predictions for Test Set

In [19]:
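# Convert test data to tensor and make predictions
tensor_data = torch.tensor(np.array(test, dtype=np.float32), device='cpu')
# Switch to evaluation mode so dropout is disabled, and skip gradient tracking
model.eval()
with torch.no_grad():
    predictions = torch.sigmoid(model(tensor_data)).round()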

# Convert predictions to boolean values and create submission DataFrame
boolean_predictions = predictions.view(-1).bool().cpu().numpy()
submission_df = pd.DataFrame({
    "PassengerId": sample_submission["PassengerId"],
    "Transported": boolean_predictions
})

# Save submission DataFrame to a CSV file
submission_df.to_csv("submission.csv", index=False)
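
Dropout layers are active only in training mode, so inference is normally run after calling `eval()`. The sketch below uses a small stand-in network (not the trained model) to show that in evaluation mode, and under `torch.no_grad()`, repeated forward passes give identical outputs with no gradient tracking.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Small stand-in network with a dropout layer.
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Dropout(0.5), nn.Linear(8, 1))
x = torch.ones(16, 4)

net.eval()                 # dropout becomes a no-op
with torch.no_grad():      # no autograd graph is built
    eval_a = net(x)
    eval_b = net(x)

deterministic = torch.equal(eval_a, eval_b)
```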