# Spaceship Titanic

## Overview

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, 
the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets 
orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic
collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000
years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

In this competition your task is to predict whether a passenger was transported to an alternate dimension during the 
Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal
records recovered from the ship's damaged computer system.

## File and Data Field Descriptions

### train.csv 

Personal records for about two-thirds (~8700) of the passengers, to be used as training data.

| Column Name | Description |
|------------- |-------------|
| `PassengerId` | A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always. |
| `HomePlanet` | The planet the passenger departed from, typically their planet of permanent residence. |
| `CryoSleep` | Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins. |
| `Cabin` | The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard. |
| `Destination` | The planet the passenger will be debarking to. |
| `Age` | The age of the passenger. |
| `VIP` | Whether the passenger has paid for special VIP service during the voyage. |
| `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck` | Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities. |
| `Name` | The first and last names of the passenger. |
| `Transported` | Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict. |

### test.csv

Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. 

Your task is to predict the value of Transported for the passengers in this set.
    
### sample_submission.csv

A sample submission file in the correct format.

| Column Name | Description |
|------------- |-------------|
| `PassengerId` | A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always. |
| `Transported` | Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict. |

## Pre-requisites

In [147]:
# Library imports

# Data wrangling
import pandas as pd
import numpy as np
import missingno
from collections import Counter

# Data visualisation
import seaborn as sns
import matplotlib.pyplot as plt

# Preprocessing
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Machine learning models
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# Model evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# Remove warnings
import warnings
warnings.filterwarnings('ignore')

## Import data

In [148]:
# Load the data - to pandas dataframes
test_df = pd.read_csv('./inputs/test.csv')
test_idx = test_df['PassengerId']
train_df = pd.read_csv('./inputs/train.csv')

## Data Pre-processing

## Functions to define logic for empty values

One of the most complicated areas is figuring out how to impute empty values. I did figure some of this out but this post was helpful in filling some gaps https://www.kaggle.com/code/jacobsultan/how-to-impute-nearly-every-cabin-correctly

We will define that logic here:

1. **Cryosleep**
    - False if money is spent
2. **CabinSide** 
    - Group is always on the same side of the ship
3. **HomePlanet**
    - Group Members Share the Same Home Planet: If two passengers are in the same group, they originate from the same home planet. (Appendix A.2)
    - Shared Last Names Indicate Same Home Planet: Passengers sharing a last name are from the same home planet.
4. **Spending**
    - 0 if under 12 
5. **CabinRoom**
    - can only share a room if in same group
6. **CabinDeck**
    - Mars: Decks 'D', 'E', or 'F'
    - Earth: Decks 'E', 'F', or 'G'
    - Europa: Decks 'A', 'B', 'C', 'D', 'E', or 'T'

Now we can validate and define functions to impute

In [149]:
# Define spend features
spend_features = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

In [150]:
def empty_cryosleep(df: pd.DataFrame) -> pd.DataFrame:
    UnknownCryoSpender = (df["CryoSleep"].isnull() | df["CryoSleep"].isna()) & (df['TotalSpend'] > 0)
    df.loc[UnknownCryoSpender, 'CryoSleep'] = False
    return df

In [151]:
# This code does not work as when you use `fillna` with `inplace=True` on a DataFrame slice, it doesn't actually modify the original DataFrame. 
# This is because DataFrame slicing returns a copy, not a view, and `fillna` modifies this copy, not the original DataFrame.

# def asleep_or_young(df: pd.DataFrame) -> pd.DataFrame:
#     df[(df['ShoppingMall'].isna()) & (df['Age'] < 13)].fillna(0, inplace=True)
#     return df

# In this fixed code, `df.loc[condition, 'ShoppingMall'].fillna(0)` returns a new Series with the missing values in 'ShoppingMall' filled. 
# This Series is then assigned back to the 'ShoppingMall' column of the original DataFrame, effectively modifying the original DataFrame.
def asleep_or_young(df: pd.DataFrame) -> pd.DataFrame:
    for feature in spend_features:
        # young_condition = (df[feature].isna()) & (df['Age'] < 13)
        # df.loc[young_condition, feature] = df.loc[young_condition, feature].fillna(0)
        # asleep_condition = (df[feature].isna()) & (df['CryoSleep'] == True)
        # df.loc[asleep_condition, feature] = df.loc[asleep_condition, feature].fillna(0)
        condition = ((df[feature].isna()) & (df['Age'] < 13)) | ((df[feature].isna()) & (df['CryoSleep'] == True))
        df.loc[condition, feature] = df.loc[condition, feature].fillna(0)
    return df

In [152]:
## Here we loop through all the groups and if there is a CabinSide missing, we fill it with the mode of the group
def fill_missing_side(df: pd.DataFrame) -> pd.DataFrame:
    # Step 1: Find the mode of 'CabinSide' for each 'Group'
    mode_per_group = df.groupby('Group')['CabinSide'].transform(lambda x: x.mode()[0] if not x.mode().empty else np.nan)

    # Step 2: Replace NaN values in 'CabinSide' with the mode
    df['CabinSide'] = df['CabinSide'].fillna(mode_per_group)

    # Step 3: Return the modified DataFrame
    return df


In [153]:
# HomePlanet
#    - Group Members Share the Same Home Planet: If two passengers are in the same group, they originate from the same home planet. (Appendix A.2)
def fill_missing_home_planet_group(df: pd.DataFrame) -> pd.DataFrame:
    # Step 1: Find the mode of 'HomePlanet' for each 'Group'
    mode_per_group = df.groupby('Group')['HomePlanet'].transform(lambda x: x.mode()[0] if not x.mode().empty else np.nan)

    # Step 2: Replace NaN values in 'HomePlanet' with the mode
    df['HomePlanet'] = df['HomePlanet'].fillna(mode_per_group)

    # Step 3: Return the modified DataFrame
    return df

#    - Shared Last Names Indicate Same Home Planet: Passengers sharing a last name are from the same home planet.
def fill_missing_home_planet_name(df: pd.DataFrame) -> pd.DataFrame:
    # Step 1: Find the mode of 'HomePlanet' for each 'Group'
    mode_per_group = df.groupby('NameSecond')['HomePlanet'].transform(lambda x: x.mode()[0] if not x.mode().empty else np.nan)

    # Step 2: Replace NaN values in 'HomePlanet' with the mode
    df['HomePlanet'] = df['HomePlanet'].fillna(mode_per_group)

    # Step 3: Return the modified DataFrame
    return df

def fill_missing_home_planet(df: pd.DataFrame) -> pd.DataFrame:
    df = fill_missing_home_planet_group(df)
    df = fill_missing_home_planet_name(df)
    return df

In [154]:
def fill_in_missing_vip(df: pd.DataFrame) -> pd.DataFrame:
    # If HomePlanet is Earth set VIP to False
    df.loc[df['HomePlanet'] == 'Earth', 'VIP'] = False

    # if CabinDeck is T then VIP False
    df.loc[df['CabinDeck'] == 'T', 'VIP'] = False

    # Europa VIP have Age >= 25, so VIP FALSE if Europa and Age under 25
    df.loc[(df['HomePlanet'] == 'Europa') & (df['Age'] < 25), 'VIP'] = False

    # Mars VIP have Age >= 18 and no CryoSleep and never goes to "55 Cancri e"
    condition = (df['HomePlanet'] == 'Mars') & (df['Age'] < 18) & (df['CryoSleep'] == False) & (df['Destination'] != '55CancriE')
    df.loc[condition, 'VIP'] = False

    # extract a series of HomePlanet and NameFirst
    # fill missing HomePlanet based on NameFirst series
    planet_per_name = df.groupby('NameFirst')['HomePlanet'].transform(lambda x: x.mode()[0] if not x.mode().empty else np.nan)
    df['HomePlanet'] = df['HomePlanet'].fillna(planet_per_name)
    
    return df

### Functions to apply pre-processing 

In [155]:
def handle_empty_values(df: pd.DataFrame) -> pd.DataFrame:
    df = empty_cryosleep(df)
    df = asleep_or_young(df)
    df = fill_missing_side(df)
    df = fill_missing_home_planet(df)
    df = fill_in_missing_vip(df)
    return df

In [156]:
def scaling_and_encoding(df: pd.DataFrame) -> pd.DataFrame:
    return df

In [157]:
def feature_engineering(df: pd.DataFrame) -> pd.DataFrame:
    # Split out components of the Cabin column
    df[["CabinDeck", "CabinRoom", "CabinSide"]] = df["Cabin"].str.split("/", expand=True)
    df['CabinRoom'] = df['CabinRoom'].astype('Int64')

    # Calculate Total Spend
    df['TotalSpend'] = df[spend_features].sum(axis=1)
   
    # Extract the group and group Size
    df['Group'] = df['PassengerId'].map(lambda x: x.split('_')[0])
    df['GroupSize'] = df['Group'].map(df['Group'].value_counts())

    # Split the name into first name and last name
    df[["NameFirst", "NameSecond"]] = df["Name"].str.split(" ", expand=True)
    
    return df

In [158]:
## Pre-process the data - fill missing values, encode categorical variables, scale the data, feature engineering
def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    df = feature_engineering(df)
    df = handle_empty_values(df)
    df = scaling_and_encoding(df)
    df = df.drop(columns=['Name', 'Cabin'])
    return df

### Apply to the data 

In [159]:
## Split the data into features and target

# split off the label
y = train_df.pop('Transported')

# split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(train_df, y, test_size = 0.25, random_state = 0)

In [160]:
## Look at data before preprocessing
X_train.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name
5020,5363_01,Mars,True,F/1102/P,TRAPPIST-1e,37.0,False,0.0,0.0,0.0,0.0,0.0,Crowk Apeau
5967,6324_02,Earth,,G/1025/S,55 Cancri e,44.0,False,0.0,0.0,0.0,0.0,0.0,Murie Hinetthews
991,1053_01,Earth,False,F/199/S,PSO J318.5-22,27.0,False,182.0,0.0,0.0,0.0,376.0,Rald Colleruces
2894,3128_01,Earth,False,G/512/P,55 Cancri e,15.0,False,62.0,57.0,2646.0,1104.0,312.0,Heila Gordond
2228,2385_01,Mars,False,F/461/S,TRAPPIST-1e,23.0,False,1773.0,0.0,78.0,0.0,3.0,Jacats Pité


In [161]:
# extract the indexes of the rows with nan values in the training set to check later
nan_rows = X_train[X_train.isnull().any(axis=1)].index

## Print any missing data before processing
missing_data = pd.DataFrame({
    'Train Missing': X_train.isnull().sum().astype(int),
    'Test Missing': test_df.isnull().sum().astype(int),
}).sort_values(by='Train Missing', ascending=False)

print(missing_data)

              Train Missing  Test Missing
VIP                     167            93
ShoppingMall            164            98
CryoSleep               160            93
HomePlanet              150            87
Cabin                   145           100
Name                    145            94
Spa                     144           101
RoomService             141            82
VRDeck                  139            80
Destination             137            92
FoodCourt               137           106
Age                     135            91
PassengerId               0             0


In [162]:
## Preprocess the data
X_train = preprocess_data(X_train)
X_val = preprocess_data(X_val)
test_df = preprocess_data(test_df)

In [163]:
## DEBUGGING - Look at data after  preprocessing
X_train.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,CabinDeck,CabinRoom,CabinSide,TotalSpend,Group,GroupSize,NameFirst,NameSecond
5020,5363_01,Mars,True,TRAPPIST-1e,37.0,False,0.0,0.0,0.0,0.0,0.0,F,1102,P,0.0,5363,1,Crowk,Apeau
5967,6324_02,Earth,,55 Cancri e,44.0,False,0.0,0.0,0.0,0.0,0.0,G,1025,S,0.0,6324,3,Murie,Hinetthews
991,1053_01,Earth,False,PSO J318.5-22,27.0,False,182.0,0.0,0.0,0.0,376.0,F,199,S,558.0,1053,1,Rald,Colleruces
2894,3128_01,Earth,False,55 Cancri e,15.0,False,62.0,57.0,2646.0,1104.0,312.0,G,512,P,4181.0,3128,1,Heila,Gordond
2228,2385_01,Mars,False,TRAPPIST-1e,23.0,False,1773.0,0.0,78.0,0.0,3.0,F,461,S,1854.0,2385,1,Jacats,Pité


In [164]:
## DEBUGGING - Look at data with nan values
nan_rows_df = X_train.loc[nan_rows]
nan_rows_df.head() 

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,CabinDeck,CabinRoom,CabinSide,TotalSpend,Group,GroupSize,NameFirst,NameSecond
5967,6324_02,Earth,,55 Cancri e,44.0,False,0.0,0.0,0.0,0.0,0.0,G,1025.0,S,0.0,6324,3,Murie,Hinetthews
1839,1964_01,Europa,False,TRAPPIST-1e,63.0,False,,869.0,1.0,375.0,3377.0,C,74.0,S,4622.0,1964,1,Altara,Righturter
6558,6921_02,Europa,True,55 Cancri e,24.0,False,0.0,0.0,0.0,0.0,0.0,C,255.0,S,0.0,6921,5,Eleron,Fordulgaug
7333,7848_01,Mars,False,TRAPPIST-1e,29.0,False,1052.0,0.0,635.0,0.0,0.0,,,S,1687.0,7848,2,Neypus,Pine
756,0794_01,Mars,,55 Cancri e,19.0,False,0.0,0.0,0.0,0.0,0.0,F,144.0,S,0.0,794,2,Nakes,Dutte


In [165]:
## Print any missing data
missing_data = pd.DataFrame({
    'Train Missing': X_train.isnull().sum().astype(int),
    'Test Missing': test_df.isnull().sum().astype(int),
}).sort_values(by='Train Missing', ascending=False)

print(missing_data)

              Train Missing  Test Missing
NameSecond              145            94
CabinDeck               145           100
NameFirst               145            94
CabinRoom               145           100
Destination             137            92
Age                     135            91
Spa                      97            52
CabinSide                87            63
RoomService              84            55
ShoppingMall             83            60
FoodCourt                80            65
VRDeck                   76            43
CryoSleep                69            38
VIP                      67            47
HomePlanet                3             7
TotalSpend                0             0
Group                     0             0
GroupSize                 0             0
PassengerId               0             0


## Train the models

### Instantiate the models

In [166]:
classifiers = {
    'LogisticRegression': LogisticRegression(),
    'Perceptron': Perceptron(),
    'SVC': SVC(),
    'KNN5': KNeighborsClassifier(n_neighbors = 5),
    'KNN8': KNeighborsClassifier(n_neighbors = 8),
    'KNN10': KNeighborsClassifier(n_neighbors = 10),
    'KNN15': KNeighborsClassifier(n_neighbors = 15),
    'DecisionTree': DecisionTreeClassifier(),
    'Gaussian': GaussianNB(),
    'RandomForest': RandomForestClassifier(),
    'LinearSvc': LinearSVC(),
    'SGDClassifier': SGDClassifier(),
    'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
    'CatBoost': CatBoostClassifier(verbose=0),
    'GradientBoosting': GradientBoostingClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'LightGBM': LGBMClassifier(verbose=-0),
    'SVC_linear': SVC(kernel='linear'),
    'SVC_poly': SVC(kernel='poly'),
    'SVC_sigmoid': SVC(kernel='sigmoid'),
}

### Training

In [167]:
## Function to score the model
def score_model(classifier, X_train, X_val, y_train, y_val):
    # Fit the model
    classifier.fit(X_train, y_train)

    # Predict the test data
    y_pred = classifier.predict(X_val)

    # Create a confusion matrix
    cm = confusion_matrix(y_val, y_pred)

    return { 
            'confusion_matrix': cm,
            'TP': cm[0][0],
            'FP': cm[0][1],
            'FN': cm[1][0],
            'TN': cm[1][1],
            'accuracy': accuracy_score(y_val, y_pred),
            'kfold-cv': cross_val_score(classifier, X_train, y_train, cv = 10).mean(),
            'f1': f1_score(y_val, y_pred)
        }

In [168]:
## Loop through the classifiers and score them
results = {}
for name, classifier in classifiers.items():
    results[name] = score_model(classifier, X_train, X_val, y_train, y_val)

ValueError: could not convert string to float: 'Mars'

In [None]:
# convert results to a dataframe and sort by accuracy
results_df = pd.DataFrame(results).T
results_df.sort_values(by='accuracy', ascending=False)