# Spaceship Titanic - TensorFlow Model

## Overview

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, 
the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets 
orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic
collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000
years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

In this competition your task is to predict whether a passenger was transported to an alternate dimension during the 
Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal
records recovered from the ship's damaged computer system.

## File and Data Field Descriptions

### train.csv 

Personal records for about two-thirds (~8700) of the passengers, to be used as training data.

Sure! Here's the code converted to a markdown table:

| Column Name | Description |
|------------- |-------------|
| `PassengerId` | A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always. |
| `HomePlanet` | The planet the passenger departed from, typically their planet of permanent residence. |
| `CryoSleep` | Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins. |
| `Cabin` | The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard. |
| `Destination` | The planet the passenger will be debarking to. |
| `Age` | The age of the passenger. |
| `VIP` | Whether the passenger has paid for special VIP service during the voyage. |
| `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck` | Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities. |
| `Name` | The first and last names of the passenger. |
| `Transported` | Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict. |

### test.csv

Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. 

Your task is to predict the value of Transported for the passengers in this set.
    
### sample_submission.csv

A sample submission file in the correct format.

| Column Name | Description |
|------------- |-------------|
| `PassengerId` | A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always. |
| `Transported` | Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict. |

## Imports and load data

In [37]:
# Imports
import tensorflow as tf
import tensorflow_decision_forests as tfdf
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [38]:
# Load the data - to pandas dataframes
test_df = pd.read_csv('./inputs/test.csv')
test_idx = test_df['PassengerId']
train_df = pd.read_csv('./inputs/train.csv')

In [39]:
# Display the first 5 examples
train_df.head(5)

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


# Handle empty Data

## Functions to help

In [40]:
def fill_with_median(df, features):
    for feature in features:
        df[feature] = df[feature].fillna(df[feature].median())
    return df

def fill_with_mode(df, features):
    for feature in features:
        df[feature] = df[feature].fillna(df[feature].mode()[0])
    return df

def fill_with_mean(df, features):
    for feature in features:
        df[feature] = df[feature].fillna(df[feature].mean())
    return df

# Here we provide a list of other_features - if a row has a nan in a feature we will fill it with the mean of the other_features in its row 
def fill_with_mean_of_other_features(df, features, other_features):
    for feature in features:
        df[feature] = df[feature].fillna(df[other_features].dropna().mean(axis=1))
    return df

def fill_with_constant(df, features, constant):
    for feature in features:
        df[feature] = df[feature].fillna(constant)
    return df

# Define a function to print missing data
def print_missing(train: pd.DataFrame, test: pd.DataFrame):
    missing_data = pd.DataFrame({
        'Train Missing': train.isnull().sum(),
        'Test Missing': test.isnull().sum()
    }).sort_values(by='Train Missing', ascending=False)

    print(missing_data)

In [41]:
def handle_empty_data(df: pd.DataFrame) -> pd.DataFrame:
   
    # fill in missing values with a constant
    df = fill_with_constant(df, ['VIP'], False)
    df = fill_with_constant(df, ['Name', 'Cabin', 'HomePlanet', 'Destination'], 'Unknown')

    # Fill numerics with median
    df = fill_with_median(df, ['Age'])
    
    # Fill with average of other spend features
    spend_features = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
    df = fill_with_mean_of_other_features(df, spend_features, spend_features)
    df = fill_with_constant(df, spend_features, 0)

    # Have a sensible default so leave this as is
    # for the provided dataframe set CryoSleep to False if TotalSpend is greater than 0
    UnknownCryoSpender = (df["CryoSleep"].isnull() | df["CryoSleep"].isna()) & (df[spend_features].dropna().sum(axis=1) > 0)
    df.loc[UnknownCryoSpender, 'CryoSleep'] = False
    df['CryoSleep'] = df['CryoSleep'].fillna('Unknown')

    return df

In [42]:
# Prin Missing data before and after handling
print_missing(train_df, test_df)
train_df = handle_empty_data(train_df)
test_df = handle_empty_data(test_df)
print_missing(train_df, test_df)

              Train Missing  Test Missing
CryoSleep               217          93.0
ShoppingMall            208          98.0
VIP                     203          93.0
HomePlanet              201          87.0
Name                    200          94.0
Cabin                   199         100.0
VRDeck                  188          80.0
FoodCourt               183         106.0
Spa                     183         101.0
Destination             182          92.0
RoomService             181          82.0
Age                     179          91.0
PassengerId               0           0.0
Transported               0           NaN
              Train Missing  Test Missing
Age                       0           0.0
Cabin                     0           0.0
CryoSleep                 0           0.0
Destination               0           0.0
FoodCourt                 0           0.0
HomePlanet                0           0.0
Name                      0           0.0
PassengerId               0       

## Handle categorical Data

In [43]:
# Print the shape of the data

print(f"Train shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")

Train shape: (8693, 14)
Test shape: (4277, 13)


In [44]:
# Drop columns that are not useful

In [45]:
y = train_df.pop('Transported')
id = train_df.pop('PassengerId')
X = pd.get_dummies(train_df, drop_first=True)

## Split the data into Train and Test

In [46]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.25, random_state = 0)

## Create a model and fit it

In [47]:
import tensorflow as tf
tf.random.set_seed(42)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    metrics=[
        tf.keras.metrics.BinaryAccuracy(name='accuracy')
    ]
)

## Assess model

In [48]:
from sklearn.metrics import confusion_matrix, accuracy_score

def assess_model(fit_model, X_val, y_val):
    predictions = fit_model.predict(X_val)
    y_pred = [1 if prob > 0.5 else 0 for prob in np.ravel(predictions)]
    print()
    print(confusion_matrix(y_val, y_pred))
    print()
    print(f'Accuracy: {accuracy_score(y_val, y_pred):.2f}')
    


In [49]:
# Fit model
model.fit(X_train, y_train, epochs=100) 

# Assess model
assess_model(model, X_val, y_val)

Epoch 1/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.7480 - loss: 0.7814
Epoch 2/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.7841 - loss: 0.4784
Epoch 3/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.8353 - loss: 0.3986
Epoch 4/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.8719 - loss: 0.3132
Epoch 5/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.8778 - loss: 0.2782
Epoch 6/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.8795 - loss: 0.2754
Epoch 7/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.8846 - loss: 0.2592
Epoch 8/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.8882 - loss: 0.2551
Epoch 9/100
[1m204/204[0m [32

## Next Model - L1 Regularization

L1 regularization adds a penalty term to the loss function that forces the model to learn sparse features by shrinking the less important features to zero. 

In [52]:
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping


model_l1 = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l1(0.001)),
    tf.keras.layers.Dense(128, activation='relu', kernel_regularizer=regularizers.l1(0.001)),
    tf.keras.layers.Dense(128, activation='relu', kernel_regularizer=regularizers.l1(0.001)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model_l1.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    metrics=[
        tf.keras.metrics.BinaryAccuracy(name='accuracy')
    ]
)
early_stop = EarlyStopping(
    monitor='val_loss',
    mode='min',
    verbose=1,
    patience=10,
    restore_best_weights=True
)


# In the above code, we have added an L1 regularization penalty term with a strength of 0.001 to each of the hidden layers using the kernel_regularizer argument in the Dense layer.

In [53]:
# Fit model
history_l1 = model_l1.fit(
    X_train, y_train, 
    epochs=100, 
    validation_data=(X_val, y_val), 
    callbacks=[early_stop]
)

# Assess model
assess_model(model_l1, X_val, y_val)

Epoch 1/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - accuracy: 0.7461 - loss: 4.1239 - val_accuracy: 0.7898 - val_loss: 1.8619
Epoch 2/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.7805 - loss: 1.7717 - val_accuracy: 0.7889 - val_loss: 1.5865
Epoch 3/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.7853 - loss: 1.5143 - val_accuracy: 0.7939 - val_loss: 1.3792
Epoch 4/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.7875 - loss: 1.3214 - val_accuracy: 0.7925 - val_loss: 1.2159
Epoch 5/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.7953 - loss: 1.1640 - val_accuracy: 0.7889 - val_loss: 1.0804
Epoch 6/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.7946 - loss: 1.0419 - val_accuracy: 0.7875 - val_loss: 0.9706
Epoch 7/100
[1m204/20

## Next model - Dropout regularization

Dropout layers randomly disable a percentage of neurons during training, forcing the remaining neurons to learn more robust features.

In [54]:
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l1(0.001)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(128, activation='relu', kernel_regularizer=regularizers.l1(0.001)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(128, activation='relu', kernel_regularizer=regularizers.l1(0.001)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    metrics=[
        tf.keras.metrics.BinaryAccuracy(name='accuracy')
    ]
)
early_stop = EarlyStopping(
    monitor='val_loss',
    mode='min',
    verbose=1,
    patience=10,
    restore_best_weights=True
)

In [55]:

history = model.fit(X_train, y_train, epochs=100, validation_data=( X_val, y_val), callbacks=[early_stop])
# Assess model
assess_model(model, X_val, y_val)

Epoch 1/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - accuracy: 0.6572 - loss: 6.0490 - val_accuracy: 0.7760 - val_loss: 2.4228
Epoch 2/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.7349 - loss: 2.5062 - val_accuracy: 0.7489 - val_loss: 2.2612
Epoch 3/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.7185 - loss: 2.2821 - val_accuracy: 0.7295 - val_loss: 2.1338
Epoch 4/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.7143 - loss: 2.1312 - val_accuracy: 0.7562 - val_loss: 1.9841
Epoch 5/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.7229 - loss: 1.9787 - val_accuracy: 0.7507 - val_loss: 1.8531
Epoch 6/100
[1m204/204[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.7324 - loss: 1.8401 - val_accuracy: 0.7631 - val_loss: 1.7114
Epoch 7/100
[1m204/20