In [None]:
import numpy as np
import pandas as pd
import tqdm
import time
import matplotlib.pyplot as plt


# 0: The Data

For basic classification we use Kaggle's Spaceship Titanic dataset. It's a simple tabular dataset with a few features and a binary label.
Download from: https://www.kaggle.com/competitions/spaceship-titanic/data

### Explore the data briefly, see summary, histograms, ranges, correlations, etc.

The goal here is to get a feel for the data, and to see if there are any obvious issues with it.
Also we prepare the data for learning by doing some basic preprocessing like assigning numerical labels to categorical columns, removing NaNs, imputing values if needed, and normalizing ranges.

In [None]:
train_df = pd.read_csv('../data/spaceship-titanic/train.csv')
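
A quick first look could go roughly like the sketch below. It runs on a tiny made-up frame (`toy_df`, with hypothetical values) rather than the real `train.csv`, so the numbers mean nothing; the same calls should work on `train_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; the values are made up for illustration.
toy_df = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", None, "Mars"],
    "CryoSleep": [False, True, True, None],
    "Age": [24.0, np.nan, 58.0, 33.0],
    "RoomService": [109.0, 0.0, 0.0, np.nan],
    "Transported": [False, True, True, False],
})

# Missing-value counts per column show what needs imputing.
missing_counts = toy_df.isna().sum()

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns.
numeric_summary = toy_df.describe()

# Fraction of positive labels, to check the class balance.
positive_rate = toy_df["Transported"].mean()

print(missing_counts)
print(numeric_summary)
print("positive rate:", positive_rate)
```

On the real data, `train_df.hist()` is one way to get per-column histograms.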

From a common sense POV, PassengerId shouldn't affect whether a passenger was transported. It could still be a proxy for some feature that isn't in the dataset, so we won't drop it just yet.

Our 0 / 1 classification label here is Transported. Our objective is to predict whether a passenger was transported or not given the other features.

In [None]:
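# convert categorical data to numerical integer codes
corr_df = train_df.copy()
for col in corr_df.columns:
    is_numeric = pd.api.types.is_numeric_dtype(corr_df[col])
    is_bool = pd.api.types.is_bool_dtype(corr_df[col])
    if is_bool or not is_numeric:
        # text and boolean columns become integer codes (missing values get -1)
        corr_df[col] = corr_df[col].astype('category').cat.codes
    else:
        # impute missing values with the column mean, then standardize
        # each numeric column to zero mean and unit variance
        corr_df[col] = corr_df[col].fillna(corr_df[col].mean())
        corr_df[col] = (corr_df[col] - corr_df[col].mean()) / corr_df[col].std()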

CryoSleep seems to be decently correlated with Transported, and RoomService seems to be negatively correlated with it, so we should expect both to be major features in our model.
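
One way to read off such correlations is to rank every feature by its Pearson correlation with the label. The sketch below uses a small made-up, already-encoded frame (`toy_encoded`, hypothetical values) in place of `corr_df`, so only the mechanics carry over, not the numbers.

```python
import pandas as pd

# Toy, already-encoded stand-in for corr_df; the values are made up.
toy_encoded = pd.DataFrame({
    "CryoSleep":   [0, 1, 1, 0, 1, 0],
    "RoomService": [2.1, -0.4, -0.4, 0.3, -0.4, 1.2],
    "Age":         [0.1, -1.0, 0.5, 0.9, -0.2, 0.3],
    "Transported": [0, 1, 1, 0, 1, 0],
})

# Pearson correlation of every feature with the label.
label_corr = toy_encoded.corr()["Transported"].drop("Transported")

# Rank by absolute value so strong negative correlations are not buried.
order = label_corr.abs().sort_values(ascending=False).index
ranked = label_corr.reindex(order)
print(ranked)
```

Keep in mind that correlation on integer category codes only makes sense for binary or ordered columns; for something like a name column the code order is arbitrary.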

Anyway, we're not hand engineering stuff here. The goal is to try out some basic classic ML approaches for classification and see how well they work. Less goo.


In [None]:
# train / val split 80:20 randomly
df = corr_df.copy()

# Set a seed for reproducibility
np.random.seed(42)

# Generate an array of random indices for shuffling
indices = np.arange(len(df))
np.random.shuffle(indices)

# Calculate the split index
split_index = int(0.8 * len(df))

# Split the DataFrame and reset the index of each part
train_df_split = df.iloc[indices[:split_index]].reset_index(drop=True)
val_df = df.iloc[indices[split_index:]].reset_index(drop=True)

# Get train df splits labels and val df's labels
train_labels = train_df_split['Transported']
val_labels = val_df['Transported']
train_df_split=train_df_split.drop(columns=['Transported'])
val_df=val_df.drop(columns=['Transported'])

# Print the shapes of the resulting DataFrames
print("Train set shape:", train_df_split.shape)
print("Validation set shape:", val_df.shape)
print("train_labels", train_labels.shape)
print("val_labels", val_labels.shape)

# Let's go to the Zoo!

# 1: Feed Forward Neural Network

Best reference to start: https://github.com/karpathy/nn-zero-to-hero
Ofc also fast ai: https://course.fast.ai/Lessons/lesson3.html
Andrew Ng's deep learning specialization is also a good resource and recommended for a more thorough understanding of the topic.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

seed = 1337
torch.manual_seed(seed)  # Set the seed for reproducibility

class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally,
        # so adding a Softmax layer here would apply softmax twice.
        x = self.fc2(x)
        return x

input_size = train_df_split.shape[1]
hidden_size = 60
num_classes = 2

model = NeuralNet(input_size, hidden_size, num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.005)

# Convert train_df_split and train_labels to tensors
train_df_split_tensor = torch.tensor(train_df_split.values, dtype=torch.float32)
train_labels_tensor = torch.tensor(train_labels.values, dtype=torch.long)

# Set the model to training mode
model.train()

# Iterate over the data for a specified number of epochs
num_epochs = 800
losses = []
for epoch in tqdm.tqdm(range(num_epochs), total=num_epochs, desc="Epochs"):
    # Forward pass
    outputs = model(train_df_split_tensor)
    loss = criterion(outputs, train_labels_tensor)

    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print the loss every 50th epoch
    if epoch % 50 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")
    losses.append(loss.item())

    # Divide the learning rate by 1.5 every 100 epochs (including epoch 0)
    if epoch % 100 == 0:
        for param_group in optimizer.param_groups:
            param_group['lr'] /= 1.5
    

In [None]:
# Plot losses vs epoch
plt.plot(range(1, num_epochs+1), losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Losses vs Epoch')
plt.show()


In [None]:
# Run validation on val_df
val_df_tensor = torch.tensor(val_df.values, dtype=torch.float32)

# Switch to eval mode and disable gradient tracking for inference
model.eval()
with torch.no_grad():
    predictions = model(val_df_tensor)
predicted_classes = torch.argmax(predictions, dim=1)


In [None]:
predictions = predicted_classes.numpy()


### Validation Accuracy:
Predictions and labels are both 0 or 1, so the absolute difference between them is 1 for every wrong prediction and 0 for every correct one. The mean of those differences is the error rate, and subtracting it from 1 gives a 0-1 accuracy metric, i.e. 1 implies 100%, 0 implies 0%.

In [None]:
accuracy = 1-abs(predictions - val_labels).mean()

print("accuracy", accuracy)
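
As a sanity check on the metric, here is a tiny worked example with made-up predictions and labels. It also shows the more common way of writing the same thing, the fraction of exact matches. The arrays are hypothetical, not output from the model above.

```python
import numpy as np

# Made-up binary predictions and labels; 3 of the 5 agree.
toy_preds = np.array([1, 0, 1, 1, 0])
toy_labels = np.array([1, 0, 0, 1, 1])

# 1 - mean absolute difference: each mismatch contributes 1 to the sum.
acc_abs = 1 - np.abs(toy_preds - toy_labels).mean()

# Equivalent for 0/1 labels: fraction of positions where both agree.
acc_eq = (toy_preds == toy_labels).mean()

print(acc_abs, acc_eq)
```

The subtraction trick only works for 0/1 labels stored as signed numbers; the equality version also works for more than two classes.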


A basic NN with 1 hidden layer does as well as an SVM, and with some hyperparam tuning can get up to 0.75!

### Confusion Matrix

The confusion matrix is a visualisation of how many true positives, false positives, true negatives and false negatives our model predicts.
It is useful for getting a sense of how wrong our model is, and where it might be going wrong.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Build and plot the confusion matrix (rows: true label, columns: predicted label)
def plot_confusion_matrix(pred, labels):
    pred = np.asarray(pred)
    labels = np.asarray(labels)
    conf_matrix = np.zeros((2, 2), dtype=int)
    for true_label, pred_label in zip(labels, pred):
        conf_matrix[true_label, pred_label] += 1

    # Plot the confusion matrix using seaborn
    sns.heatmap(conf_matrix, annot=True, fmt='g', cmap='Blues', xticklabels=['Predicted 0', 'Predicted 1'], yticklabels=['Actual 0', 'Actual 1'])
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.title('Confusion Matrix')
    plt.show()

In [None]:
plot_confusion_matrix(predictions, val_labels)
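
The four cells of the matrix also give precision and recall, which say more about where the model goes wrong than accuracy alone. A minimal sketch on made-up predictions and labels (`confusion_counts` is a hypothetical helper, not part of the notebook above):

```python
import numpy as np

def confusion_counts(pred, labels):
    """Return a 2x2 count matrix: rows are true labels, columns are predictions."""
    pred = np.asarray(pred)
    labels = np.asarray(labels)
    cm = np.zeros((2, 2), dtype=int)
    # np.add.at accumulates correctly even when index pairs repeat.
    np.add.at(cm, (labels, pred), 1)
    return cm

# Made-up predictions and labels for illustration.
toy_pred = [1, 0, 1, 1, 0, 0]
toy_true = [1, 0, 0, 1, 1, 0]

cm = confusion_counts(toy_pred, toy_true)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of everything predicted 1, how much was really 1
recall = tp / (tp + fn)      # of everything really 1, how much did we catch
accuracy = (tp + tn) / cm.sum()

print(cm)
print(precision, recall, accuracy)
```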

We can do deeper NNs, but let's stop here for the Spaceship Titanic problem. Larger networks need more data to git gud.

# Time for some more interesting data:

For image classification we'll use the Fashion MNIST dataset as a nod to Try-it-on ;)
Download from: https://www.kaggle.com/datasets/zalando-research/fashionmnist

# 2: Deep NNs: Let's go Deep

# 3: CNNs: Let's give NNs Sight

# 4: RNNs: Let's let NNs remember

# 4a: LSTMs

# 4b: GRUs