# Spaceship Titanic - third Model

## Overview

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, 
the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets 
orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic
collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000
years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

In this competition your task is to predict whether a passenger was transported to an alternate dimension during the 
Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal
records recovered from the ship's damaged computer system.

## File and Data Field Descriptions

### train.csv 

Personal records for about two-thirds (~8700) of the passengers, to be used as training data.

| Column Name | Description |
|------------- |-------------|
| `PassengerId` | A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always. |
| `HomePlanet` | The planet the passenger departed from, typically their planet of permanent residence. |
| `CryoSleep` | Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins. |
| `Cabin` | The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard. |
| `Destination` | The planet the passenger will be debarking to. |
| `Age` | The age of the passenger. |
| `VIP` | Whether the passenger has paid for special VIP service during the voyage. |
| `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck` | Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities. |
| `Name` | The first and last names of the passenger. |
| `Transported` | Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict. |
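
The `gggg_pp` and `deck/num/side` formats described in the table can be unpacked with plain string splitting. The snippet below is a minimal sketch on made-up records; the column names `NumberInGroup` and `CabinNum` are illustrative and not part of the dataset.

```python
import pandas as pd

# Toy records in the documented formats (values are made up for illustration)
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02"],
    "Cabin": ["B/0/P", "F/1/S", None],
})

# gggg_pp -> group id and number within the group
toy[["Group", "NumberInGroup"]] = toy["PassengerId"].str.split("_", expand=True)

# deck/num/side -> three columns; a missing cabin stays missing in all three
toy[["Deck", "CabinNum", "Side"]] = toy["Cabin"].str.split("/", expand=True)

print(toy)
```

Passengers `0002_01` and `0002_02` end up in the same group, which is the basis for a group-size feature.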

### test.csv

Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. 

Your task is to predict the value of Transported for the passengers in this set.
    
### sample_submission.csv

A sample submission file in the correct format.

| Column Name | Description |
|------------- |-------------|
| `PassengerId` | A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always. |
| `Transported` | Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict. |

In [None]:
model='sixth'

In [None]:
# Library imports

# Data wrangling
import pandas as pd
import numpy as np
import missingno
from collections import Counter

# Data visualisation
import seaborn as sns
import matplotlib.pyplot as plt

# Preprocessing
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Machine learning models
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

# Model evaluation
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# Remove warnings
import warnings
warnings.filterwarnings('ignore')

# Data preprocessing

## Drop unrequired features & add engineered ones

In [None]:
# Load the data - to pandas dataframes

test_df = pd.read_csv('./inputs/test.csv')
test_idx = test_df['PassengerId']
train_df = pd.read_csv('./inputs/train.csv')

In [None]:
# Split a full name into (first, last)
def split_name(name: str):
    if pd.isnull(name):  # protect against empty values
        return (np.nan, np.nan)  # same length as the normal return value
    parts = name.split(' ')
    first = parts[0]
    last = parts[-1]
    return (first, last)

# Split a cabin string of the form deck/num/side into its components
def split_cabin(cabin: str):
    if pd.isnull(cabin):  # protect against empty values
        return (np.nan, np.nan, np.nan)
    parts = cabin.split('/')
    deck = parts[0]
    side = parts[-1]
    number = ' '.join(parts[1:-1])
    return (deck, number, side)

# This function adds the engineered features (unrequired columns are dropped later)
def feature_manipulation(d : pd.DataFrame) -> pd.DataFrame:
    # print the shape of the provided dataframe
    print("Before: ", d.shape)
    # Clone the provided dataframe
    df = d.copy()

    # Keep only the last name in the Name column
    _, df['Name'] = zip(*df['Name'].map(split_name))
    
    # Location
    df['Deck'], _, df['Side'] = zip(*df['Cabin'].map(split_cabin))
    df['DeckSide'] = df['Deck'] + df['Side'] # Combine Deck and Side

    # Group related columns
    df['Group'] = df['PassengerId'].map(lambda x: x.split('_')[0])
    df['GroupSize'] = df['Group'].map(df['Group'].value_counts())
    
    print("After: ", df.shape)
    return df

In [None]:
train_df = feature_manipulation(train_df)

In [None]:
test_df = feature_manipulation(test_df)

In [None]:
train_df.head()

In [None]:
test_df.head()

## Handle Missing Data

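In the first model we took a complicated approach. Here we use simple rules instead: the median for `Age`, a constant for the non-numeric features, and the row-wise mean of the other spend columns for missing spend values.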

In [None]:
print(f'Before: {len(train_df)} rows')
train_df.isnull().sum().sort_values(ascending = False)

In [None]:
print(f'Before: {len(test_df)} rows')
test_df.isnull().sum().sort_values(ascending = False)

In [None]:
# rows with no missing values
print(f'Before: {len(train_df)} rows')
train_nona = train_df.dropna()
print(f'After: {len(train_nona)} rows')

## Apply the empty data handling

### Define functions to replace data

In [None]:
def fill_with_median(df, features):
    for feature in features:
        df[feature] = df[feature].fillna(df[feature].median())
    return df

def fill_with_mode(df, features):
    for feature in features:
        df[feature] = df[feature].fillna(df[feature].mode()[0])
    return df

def fill_with_mean(df, features):
    for feature in features:
        df[feature] = df[feature].fillna(df[feature].mean())
    return df

# Here we provide a list of other_features - if a row has a NaN in a feature we fill it with the mean of the other_features in its row
def fill_with_mean_of_other_features(df, features, other_features):
    # mean(axis=1) skips NaN by default; calling dropna() first would discard
    # exactly the rows that need filling, so nothing would be filled
    row_mean = df[other_features].mean(axis=1)
    for feature in features:
        df[feature] = df[feature].fillna(row_mean)
    return df

def fill_with_constant(df, features, constant):
    for feature in features:
        df[feature] = df[feature].fillna(constant)
    return df
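
A row-wise mean fill of the kind described in the comment on `fill_with_mean_of_other_features` can be sketched on a toy frame as follows. The values are made up. The point of the sketch is that the row mean has to be computed on rows that still contain the gap; dropping incomplete rows first leaves no value to align against.

```python
import numpy as np
import pandas as pd

# Toy spend data with gaps (made-up values)
toy_spend = pd.DataFrame({
    "Spa":       [10.0, np.nan, 0.0],
    "VRDeck":    [30.0, 20.0, np.nan],
    "FoodCourt": [20.0, 40.0, 0.0],
})

# Row-wise mean over the values that are present (NaN is skipped by default)
row_mean = toy_spend.mean(axis=1)

# fillna with a Series aligns on the row index, so each gap gets its own row's mean
toy_filled = toy_spend.apply(lambda col: col.fillna(row_mean))
print(toy_filled)

# For comparison: after dropna() only the complete row 0 is left
print(toy_spend.dropna().index.tolist())
```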

### Apply simple rules for empty data

In [None]:
def handle_empty_data(df: pd.DataFrame) -> pd.DataFrame:
   
    # fill in missing values with a constant
    df = fill_with_constant(df, ['VIP'], False)
    df = fill_with_constant(df, ['Name', 'Cabin', 'Deck', 'Side', 'DeckSide', 'HomePlanet', 'Destination'], 'Unknown')

    # Fill numerics with median
    df = fill_with_median(df, ['Age'])
    
    # Fill with average of other spend features
    spend_features = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
    df = fill_with_mean_of_other_features(df, spend_features, spend_features)
    df = fill_with_constant(df, spend_features, 0)

    # CryoSleep: passengers in cryosleep are confined to their cabins, so a passenger
    # with an unknown CryoSleep value and a non-zero total spend is set to False
    UnknownCryoSpender = df["CryoSleep"].isnull() & (df[spend_features].sum(axis=1) > 0)
    df.loc[UnknownCryoSpender, 'CryoSleep'] = False
    # Anything still missing is marked as 'Unknown'
    df['CryoSleep'] = df['CryoSleep'].fillna('Unknown')

    return df


In [None]:
train_df = handle_empty_data(train_df)
test_df = handle_empty_data(test_df)

In [None]:
# Check the training data

train_df.isnull().sum().sort_values(ascending = False)


In [None]:
# Check the test data
test_df.isnull().sum().sort_values(ascending = False)

### And that's a wrap!

There are now no missing values, so we can proceed to pre-processing.

# Data Preprocessing

## Quick check of the data after the empty data is filled

In [None]:
# Look at the training data after the changes to remove empty data
train_df.head()

In [None]:
# Look at the test data
test_df.head()

In [None]:
# Check the shapes:

print(f'Train: {train_df.shape}')
print(f'Test: {test_df.shape}')

# Check the train and test column data types side by side
pd.concat([train_df.dtypes, test_df.dtypes], axis=1, keys=['Train', 'Test'])



## Drop Unrequired columns

We can now drop these columns:

- Cabin (already split into Deck and Side)
- Name
- DeckSide

PassengerId is removed later, just before preprocessing, so that we keep the key for now. Group and GroupSize are left in.

In [None]:
# Drop the columns that are not required
def drop_unrequired_columns(df: pd.DataFrame) -> pd.DataFrame:
    columns_to_drop = [
        'Cabin',
        'Name',
        'DeckSide'
    ]
    df.drop(columns_to_drop, axis = 1, inplace = True)
    return df

train_df = drop_unrequired_columns(train_df)
test_df = drop_unrequired_columns(test_df)

## Prepare the data for training

- Log/Bin the numeric data
- OneHotEncoding

### Bin and OHE features

I am going to try the `KBinsDiscretizer` from sklearn - see https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html#sklearn.preprocessing.KBinsDiscretizer
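
As a quick illustration of what `KBinsDiscretizer` does with `strategy='uniform'`, the sketch below bins a handful of made-up ages into three equal-width bins. The ages are illustrative only and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up ages spanning 0 to 90, as a single-column 2D array
toy_ages = np.array([[0.0], [10.0], [29.0], [30.0], [59.0], [60.0], [90.0]])

toy_binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
toy_bins = toy_binner.fit_transform(toy_ages).ravel()

# Uniform strategy: equal-width bins between the observed min and max
print(toy_binner.bin_edges_[0])
# Ordinal encoding: each age is replaced by the index of its bin
print(toy_bins)
```

With `encode='ordinal'` the result is a single column of bin indices, which is why a one-hot encoding step is needed afterwards if each bin should become its own feature.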

In [None]:
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import RobustScaler

# Define the pipelines: scale the spending features, bin then OHE age, OHE the categoricals

spending_pipeline = Pipeline([
    ('scale', RobustScaler()),
])

age_pipeline = Pipeline([
    ('binning', KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')),
    ('onehot', OneHotEncoder())
])

categorical_pipeline = Pipeline([
    ('onehot', OneHotEncoder())
])


In [None]:
# There are some cases where we have used 'Unknown' in boolean columns that have True and False otherwise that led to an error
# complaining about a mix of str and bool so we need to convert the columns to str

# print all columns where the value is 'Unknown'
for column in train_df.columns:
    if 'Unknown' in train_df[column].unique():
        print(column)
        train_df[column] = train_df[column].astype(str)



## Run the pipeline

In [None]:
# what data looks like before pipeline
train_df.head()

In [None]:
# Define the preprocessor

from sklearn.compose import ColumnTransformer

spending_features = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'] # scaled with RobustScaler
categorical_features = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Deck', 'Side'] # OHE

preprocessor = ColumnTransformer(
    transformers=[
        ('category', categorical_pipeline, categorical_features),
        ('spending', spending_pipeline, spending_features),
        ('age', age_pipeline, ['Age'])
    ],
    remainder='passthrough'  # This leaves the rest of the columns untouched
)

In [None]:
y = train_df.pop('Transported')

In [None]:
i = train_df.pop('PassengerId')

In [None]:
# Fit the preprocessor on the training data and transform it in one step
X = preprocessor.fit_transform(train_df)
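
The fit-on-train, transform-on-test pattern can be sketched with a toy `ColumnTransformer`. The frames and the `toy_` names are made up for illustration; `sparse_threshold=0` and `handle_unknown="ignore"` are choices made for this sketch, not settings taken from the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

toy_train = pd.DataFrame({
    "HomePlanet": ["Earth", "Mars", "Earth"],
    "Spa": [0.0, 10.0, 20.0],
    "GroupSize": [1, 2, 2],
})
# The test row contains a planet that was never seen during fitting
toy_test = pd.DataFrame({"HomePlanet": ["Europa"], "Spa": [10.0], "GroupSize": [1]})

toy_pre = ColumnTransformer(
    transformers=[
        ("category", OneHotEncoder(handle_unknown="ignore"), ["HomePlanet"]),
        ("spending", RobustScaler(), ["Spa"]),
    ],
    remainder="passthrough",  # GroupSize is appended untouched
    sparse_threshold=0,       # always return a dense array
)

X_toy_train = toy_pre.fit_transform(toy_train)  # learns categories, median and IQR
X_toy_test = toy_pre.transform(toy_test)        # reuses them, learns nothing new

print(X_toy_train.shape)
print(X_toy_test)
```

The unseen planet encodes as all zeros, and the test spend is scaled with the median and interquartile range learned from the training frame.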

# Time to attempt the modelling

## Split the data to create a validation set

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.25, random_state = 0)

## Instantiation of the models

In [None]:
# Create a TensorFlow classifier
def tensorflow_model():
    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout
    from tensorflow.keras.optimizers import Adam

    model = Sequential([
        Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
        Dropout(0.2),
        Dense(64, activation='relu'),
        Dropout(0.2),
        Dense(32, activation='relu'),
        Dropout(0.2),
        Dense(1, activation='sigmoid')
    ])

    model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

    return model

In [None]:
classifiers = {
    'LogisticRegression': LogisticRegression(),
    'Perceptron': Perceptron(),
    'SVC': SVC(),
    'KNN': KNeighborsClassifier(n_neighbors = 5),
    'DecisionTree': DecisionTreeClassifier(),
    'Gaussian': GaussianNB(),
    'RandomForest': RandomForestClassifier(),
    'LinearSvc': LinearSVC(),
    'SGDClassifier': SGDClassifier(),
    'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss')
}

## Train the models

### Evaluate model

In [None]:
def score_model(classifier, X_train, X_val, y_train, y_val):
    # Fit the model
    classifier.fit(X_train, y_train)

    # Predict the validation data
    y_pred = classifier.predict(X_val)

    # Create a confusion matrix
    cm = confusion_matrix(y_val, y_pred)

    # Calculate the accuracy
    acc = accuracy_score(y_val, y_pred)

    scores = cross_val_score(classifier, X_train, y_train, cv = 10)

    return { 
            'confusion_matrix': cm,
            'accuracy': acc,
            'kfold-cv': scores.mean()
        }

def run_classifiers(classifiers: dict, X_train, X_val, y_train, y_val) -> dict:
    results = {}
    for name, classifier in classifiers.items():
        results[name] = score_model(classifier, X_train, X_val, y_train, y_val)
    return results

results = run_classifiers(classifiers, X_train, X_val, y_train, y_val)

In [None]:
# convert results to a dataframe
results_df = pd.DataFrame(results).T
results_df.sort_values(by='kfold-cv', ascending=False)
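
The two scores in the results measure different things: `accuracy` comes from a single hold-out split, while `kfold-cv` is the mean over 10 folds of the training split. The sketch below reproduces both on synthetic data from `make_classification`, so the numbers it prints say nothing about the competition data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification problem (stand-in for the real features)
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

toy_clf = LogisticRegression(max_iter=1000)

# Hold-out accuracy: one fit, one evaluation on the validation split
holdout_acc = accuracy_score(y_va, toy_clf.fit(X_tr, y_tr).predict(X_va))

# 10-fold CV: ten fits on clones of the model, each scored on a different fold
fold_scores = cross_val_score(toy_clf, X_tr, y_tr, cv=10)

print(f"hold-out accuracy: {holdout_acc:.3f}")
print(f"10-fold mean: {fold_scores.mean():.3f} (std {fold_scores.std():.3f})")
```

The spread across folds gives a rough sense of how much a single hold-out score can move by chance.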

## Tuning the two best models

In [None]:
# Print the confusion matrix with labels for TP, FP, TN, FN
def print_confusion_matrix(cm: np.ndarray):
    # True Positives
    TP = cm[1, 1]
    # True Negatives
    TN = cm[0, 0]
    # False Positives
    FP = cm[0, 1]
    # False Negatives
    FN = cm[1, 0]

    # Now we have extracted the values print as a grid
    print(f'True Positives: {TP} False Positives: {FP}')
    print(f'False Negatives: {FN} True Negatives: {TN}')
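
scikit-learn's `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns, which is what the indexing in `print_confusion_matrix` relies on. A small check on made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = transported, 0 = not transported
toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]

toy_cm = confusion_matrix(toy_true, toy_pred)

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = toy_cm.ravel()
print(toy_cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```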
    

In [None]:
# Tune a model using GridSearchCV
def tune_model(classifier, param_grid):
    print(f'\n\nTuning {classifier.__class__.__name__}...')
    grid_search = GridSearchCV(classifier, param_grid, cv=10, n_jobs = -1, scoring='accuracy')
    grid_search.fit(X_train, y_train)
    best_estimator = grid_search.best_estimator_
    score = score_model(best_estimator, X_train, X_val, y_train, y_val)

    print(f'Best {classifier.__class__.__name__} Accuracy: { score["accuracy"] }')
    print(f'Best {classifier.__class__.__name__} KFold CV: { score["kfold-cv"] }')
    print('\nConfusion Matrix:')
    print_confusion_matrix(score['confusion_matrix'])
    
    f1 = f1_score(y_val, best_estimator.predict(X_val))
    # and finally the f1 score
    print(f'F1 Score: {f1}')

    return (grid_search.best_params_, best_estimator, f1)

In [None]:
tuning_results = [
    tune_model(
        XGBClassifier(),
        param_grid = {
            'n_estimators': [100, 200, 300],
            'max_depth': [3, 4, 5],
            'learning_rate': [0.1, 0.01, 0.001],
            'eval_metric': ['logloss'], 
            'use_label_encoder': [False]
        }, 
    ),
    tune_model(
        RandomForestClassifier(), 
        param_grid = {
            'n_estimators': [100, 200, 300],
            'max_depth': [3, 4, 5],
            'criterion': ['gini', 'entropy']
        }
    )
]
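
It helps to know how much work a grid search implies before launching it. The sketch below counts the candidate combinations in the two grids above with `ParameterGrid` and multiplies by the 10 folds used in `tune_model`; `GridSearchCV` also performs one final refit of the best candidate by default.

```python
from sklearn.model_selection import ParameterGrid

toy_grids = {
    "XGBClassifier": {
        "n_estimators": [100, 200, 300],
        "max_depth": [3, 4, 5],
        "learning_rate": [0.1, 0.01, 0.001],
        "eval_metric": ["logloss"],
        "use_label_encoder": [False],
    },
    "RandomForestClassifier": {
        "n_estimators": [100, 200, 300],
        "max_depth": [3, 4, 5],
        "criterion": ["gini", "entropy"],
    },
}

cv_folds = 10
fit_counts = {}
for name, grid in toy_grids.items():
    n_candidates = len(ParameterGrid(grid))  # product of the list lengths
    fit_counts[name] = n_candidates * cv_folds
    print(f"{name}: {n_candidates} candidates, {fit_counts[name]} cross-validation fits")
```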

In [None]:
best_result = max(tuning_results, key=lambda x: x[2])[1]

## Create submission

Now we can predict the test set and submit

In [None]:
# Convert any Boolean columns with Unknown to String type
for column in test_df.columns:
    if 'Unknown' in test_df[column].unique():
        print(column)
        test_df[column] = test_df[column].astype(str)

In [None]:
# Check the shape

print(f'Test: {test_df.shape}')

# Print the first 5 rows:
test_df.head()

In [None]:
# Capture the passenger IDs for submission and remove from the data as not useful predictors
idx_test = test_df.pop('PassengerId')

In [None]:
# Pre-process the test set with the preprocessor fitted on the training data
X_test = preprocessor.transform(test_df)

In [None]:
# Predict the test set
y_pred = best_result.predict(X_test).astype(bool)

In [None]:
# Write the submission DataFrame
submission = pd.DataFrame({'PassengerId': idx_test, 'Transported': y_pred})
submission.to_csv(f'submissions/{model}-submission.csv', index=False)
print("Your submission was successfully saved!")
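
Before uploading, it can be worth checking that the file has the two columns described for `sample_submission.csv`. The following is a minimal sketch that round-trips a made-up submission through an in-memory buffer rather than the real file; the passenger IDs are illustrative.

```python
import io
import pandas as pd

# Made-up submission with the two expected columns
toy_submission = pd.DataFrame({
    "PassengerId": ["0013_01", "0018_01", "0019_01"],
    "Transported": [True, False, True],
})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
buffer.seek(0)
toy_check = pd.read_csv(buffer)

# Basic format checks: column names, one row per passenger, boolean target
print(list(toy_check.columns))
print(toy_check["PassengerId"].is_unique)
print(toy_check["Transported"].dtype)
```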
