# Spaceship Titanic - fourth Model

## Overview

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, 
the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets 
orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic
collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000
years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

In this competition your task is to predict whether a passenger was transported to an alternate dimension during the 
Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal
records recovered from the ship's damaged computer system.

## File and Data Field Descriptions

### train.csv 

Personal records for about two-thirds (~8700) of the passengers, to be used as training data.

Sure! Here's the code converted to a markdown table:

| Column Name | Description |
|------------- |-------------|
| `PassengerId` | A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always. |
| `HomePlanet` | The planet the passenger departed from, typically their planet of permanent residence. |
| `CryoSleep` | Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins. |
| `Cabin` | The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard. |
| `Destination` | The planet the passenger will be debarking to. |
| `Age` | The age of the passenger. |
| `VIP` | Whether the passenger has paid for special VIP service during the voyage. |
| `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck` | Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities. |
| `Name` | The first and last names of the passenger. |
| `Transported` | Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict. |

### test.csv

Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. 

Your task is to predict the value of Transported for the passengers in this set.
    
### sample_submission.csv

A sample submission file in the correct format.

| Column Name | Description |
|------------- |-------------|
| `PassengerId` | A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always. |
| `Transported` | Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict. |

In [76]:
model='fourth'

In [77]:
# Library imports

# Data wrangling
import pandas as pd
import numpy as np
import missingno
from collections import Counter

# Data visualisation
import seaborn as sns
import matplotlib.pyplot as plt

# Preprocessing
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Machine learning models
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

# Model evaluation
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# Remove warnings
import warnings
warnings.filterwarnings('ignore')

# Data preprocessing

## Drop unrequired features & add engineered ones

In [78]:
# Load the data - to pandas dataframes

test_df = pd.read_csv('./inputs/test.csv')
test_idx = test_df['PassengerId']
train_df = pd.read_csv('./inputs/train.csv')

In [79]:
# split name into components
def split_name(name:str):
    if pd.isnull(name): # protect against empty values
        return (np.nan, np.nan, np.nan)
    parts = name.split(' ')
    first = parts[0]
    last = parts[-1]
    return (first, last)

# split cabin into components
def split_cabin(name:str):
    if pd.isnull(name): # protect against empty values
        return (np.nan, np.nan, np.nan)
    parts = name.split('/')
    deck = parts[0]
    side = parts[-1]
    number = ' '.join(parts[1:-1])
    return (deck, number, side)

# This function will drop and add required features
def feature_manipulation(d : pd.DataFrame) -> pd.DataFrame:
    # print the shape of the provided dataframe
    print("Before: ", d.shape)
    # Clone the provided dataframe
    df = d.copy()

    _, df['Name'] = zip(*df['Name'].map(split_name))
    
    # Location
    df['Deck'], df['Num'], df['Side'] = zip(*df['Cabin'].map(split_cabin))
    df['DeckSide'] = df['Deck'] + df['Side'] # Combine Deck and Side

    # Group related columns
    df['Group'] = df['PassengerId'].map(lambda x: x.split('_')[0])
    df['GroupSize'] = df['Group'].map(df['Group'].value_counts())
    
    print("After: ", df.shape)
    return df

In [80]:
train_df = feature_manipulation(train_df)

Before:  (8693, 14)
After:  (8693, 20)


In [81]:
train_df['Num'].value_counts()

Num
82      28
86      22
19      22
56      21
176     21
        ..
1644     1
1515     1
1639     1
1277     1
1894     1
Name: count, Length: 1817, dtype: int64

In [82]:
test_df = feature_manipulation(test_df)

Before:  (4277, 13)
After:  (4277, 19)


In [83]:
train_df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,Deck,Num,Side,DeckSide,Group,GroupSize
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Ofracculy,False,B,0,P,BP,1,1
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Vines,True,F,0,S,FS,2,1
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Susent,False,A,0,S,AS,3,2
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Susent,False,A,0,S,AS,3,2
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Santantines,True,F,1,S,FS,4,1


In [84]:
test_df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Deck,Num,Side,DeckSide,Group,GroupSize
0,0013_01,Earth,True,G/3/S,TRAPPIST-1e,27.0,False,0.0,0.0,0.0,0.0,0.0,Carsoning,G,3,S,GS,13,1
1,0018_01,Earth,False,F/4/S,TRAPPIST-1e,19.0,False,0.0,9.0,0.0,2823.0,0.0,Peckers,F,4,S,FS,18,1
2,0019_01,Europa,True,C/0/S,55 Cancri e,31.0,False,0.0,0.0,0.0,0.0,0.0,Unhearfus,C,0,S,CS,19,1
3,0021_01,Europa,False,C/1/S,TRAPPIST-1e,38.0,False,0.0,6652.0,0.0,181.0,585.0,Caltilter,C,1,S,CS,21,1
4,0023_01,Earth,False,F/5/S,TRAPPIST-1e,20.0,False,10.0,0.0,635.0,0.0,0.0,Harperez,F,5,S,FS,23,1


## Handle Missing Data

In the first model we had a complicated approach - here we will just use the median for numeric and mode for non-numeric

In [85]:
print(f'Before: {len(train_df)} rows')
train_df.isnull().sum().sort_values(ascending = False)

Before: 8693 rows


CryoSleep       217
ShoppingMall    208
VIP             203
HomePlanet      201
Name            200
Side            199
Cabin           199
Num             199
Deck            199
DeckSide        199
VRDeck          188
Spa             183
FoodCourt       183
Destination     182
RoomService     181
Age             179
Group             0
PassengerId       0
Transported       0
GroupSize         0
dtype: int64

In [86]:
print(f'Before: {len(test_df)} rows')
train_df.isnull().sum().sort_values(ascending = False)

Before: 4277 rows


CryoSleep       217
ShoppingMall    208
VIP             203
HomePlanet      201
Name            200
Side            199
Cabin           199
Num             199
Deck            199
DeckSide        199
VRDeck          188
Spa             183
FoodCourt       183
Destination     182
RoomService     181
Age             179
Group             0
PassengerId       0
Transported       0
GroupSize         0
dtype: int64

In [87]:
# rows with no missing values

print(f'Before: {len(train_df)} rows')
train_nona = train_df.dropna()
print(f'After: {len(train_nona)} rows')

Before: 8693 rows
After: 6606 rows


## Apply the empty data handling

### Define functions to replace data

In [88]:
def fill_with_median(df, features):
    for feature in features:
        df[feature] = df[feature].fillna(df[feature].median())
    return df

def fill_with_mode(df, features):
    for feature in features:
        df[feature] = df[feature].fillna(df[feature].mode()[0])
    return df

def fill_with_mean(df, features):
    for feature in features:
        df[feature] = df[feature].fillna(df[feature].mean())
    return df

# Here we provide a list of other_features - if a row has a nan in a feature we will fill it with the mean of the other_features in its row 
def fill_with_mean_of_other_features(df, features, other_features):
    for feature in features:
        df[feature] = df[feature].fillna(df[other_features].dropna().mean(axis=1))
    return df


def fill_with_constant(df, features, constant):
    for feature in features:
        df[feature] = df[feature].fillna(constant)
    return df

### Apply simple rules for empty data

In [89]:
def handle_empty_data(df: pd.DataFrame) -> pd.DataFrame:
   
    # fill in missing values with a constant
    df = fill_with_constant(df, ['VIP'], False)
    df = fill_with_constant(df, ['Name', 'Cabin', 'Deck', 'Side', 'DeckSide', 'HomePlanet', 'Destination'], 'Unknown')
    df = fill_with_constant(df, ['Num'], 0)

    # Fill numerics with median
    df = fill_with_median(df, ['Age'])
    
    # Fill with average of other spend features
    spend_features = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
    df = fill_with_mean_of_other_features(df, spend_features, spend_features)
    df = fill_with_constant(df, spend_features, 0)

    # Have a sensible default so leave this as is
    # for the provided dataframe set CryoSleep to False if TotalSpend is greater than 0
    UnknownCryoSpender = (df["CryoSleep"].isnull() | df["CryoSleep"].isna()) & (df[spend_features].dropna().sum(axis=1) > 0)
    df.loc[UnknownCryoSpender, 'CryoSleep'] = False
    df['CryoSleep'] = df['CryoSleep'].fillna('Unknown')

    return df


In [90]:
train_df = handle_empty_data(train_df)
test_df = handle_empty_data(test_df)

In [91]:
# Check the training data

train_df.isnull().sum().sort_values(ascending = False)


PassengerId     0
HomePlanet      0
Group           0
DeckSide        0
Side            0
Num             0
Deck            0
Transported     0
Name            0
VRDeck          0
Spa             0
ShoppingMall    0
FoodCourt       0
RoomService     0
VIP             0
Age             0
Destination     0
Cabin           0
CryoSleep       0
GroupSize       0
dtype: int64

In [92]:
# Check the test data
test_df.isnull().sum().sort_values(ascending = False)

PassengerId     0
Spa             0
Group           0
DeckSide        0
Side            0
Num             0
Deck            0
Name            0
VRDeck          0
ShoppingMall    0
HomePlanet      0
FoodCourt       0
RoomService     0
VIP             0
Age             0
Destination     0
Cabin           0
CryoSleep       0
GroupSize       0
dtype: int64

### And thats a wrap!

There are now no missing values so we can proceed to Pre-processing

# Data Preprocessing

## Quick check of the data after the empty data is filled

In [93]:
# Look at the training data after the changes to remove empty data
train_df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,Deck,Num,Side,DeckSide,Group,GroupSize
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Ofracculy,False,B,0,P,BP,1,1
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Vines,True,F,0,S,FS,2,1
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Susent,False,A,0,S,AS,3,2
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Susent,False,A,0,S,AS,3,2
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Santantines,True,F,1,S,FS,4,1


In [94]:
# Look at the test data
test_df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Deck,Num,Side,DeckSide,Group,GroupSize
0,0013_01,Earth,True,G/3/S,TRAPPIST-1e,27.0,False,0.0,0.0,0.0,0.0,0.0,Carsoning,G,3,S,GS,13,1
1,0018_01,Earth,False,F/4/S,TRAPPIST-1e,19.0,False,0.0,9.0,0.0,2823.0,0.0,Peckers,F,4,S,FS,18,1
2,0019_01,Europa,True,C/0/S,55 Cancri e,31.0,False,0.0,0.0,0.0,0.0,0.0,Unhearfus,C,0,S,CS,19,1
3,0021_01,Europa,False,C/1/S,TRAPPIST-1e,38.0,False,0.0,6652.0,0.0,181.0,585.0,Caltilter,C,1,S,CS,21,1
4,0023_01,Earth,False,F/5/S,TRAPPIST-1e,20.0,False,10.0,0.0,635.0,0.0,0.0,Harperez,F,5,S,FS,23,1


In [95]:
# Check the shapes:

print(f'Train: {train_df.shape}')
print(f'Test: {test_df.shape}')

# Check the train and test column data types side by side
pd.concat([train_df.dtypes, test_df.dtypes], axis=1, keys=['Train', 'Test'])



Train: (8693, 20)
Test: (4277, 19)


Unnamed: 0,Train,Test
PassengerId,object,object
HomePlanet,object,object
CryoSleep,object,object
Cabin,object,object
Destination,object,object
Age,float64,float64
VIP,bool,bool
RoomService,float64,float64
FoodCourt,float64,float64
ShoppingMall,float64,float64


## Drop Unrequired columns

We can now drop these columns:

- Cabin
- TotalSpend
- PassengerId - may do this one last so we have the key

And to simplify for now lets also remove
- Name
- Group

In [96]:
# Drop the columns that are not required
def drop_unrequired_columns(df: pd.DataFrame) -> pd.DataFrame:
    columns_to_drop = [
        'Cabin',
        'Name',
        'DeckSide'
    ]
    df.drop(columns_to_drop, axis = 1, inplace = True)
    return df

train_df = drop_unrequired_columns(train_df)
test_df = drop_unrequired_columns(test_df)

## Prepare the data for training

- Log/Bin the numeric data
- OneHotEncoding

### Bin and OHE features

I am going to try the `KBinsDiscretizer` from sklearn - see https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html#sklearn.preprocessing.KBinsDiscretizer

In [97]:
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import RobustScaler

# Define pipelines for the features where we wish to bin and OHE

spending_pipeline = Pipeline([
    ('scale', RobustScaler()),
])

age_pipeline = Pipeline([
    ('binning', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')),
    ('onehot', OneHotEncoder())
])

categorical_pipeline = Pipeline([
    ('onehot', OneHotEncoder())
])


In [98]:
# There are some cases where we have used 'Unknown' in boolean columns that have True and False otherwise that led to an error
# complaining about a mix of str and bool so we need to convert the columns to str

# print all columns where the value is 'Unknown'
for column in train_df.columns:
    if 'Unknown' in train_df[column].unique() or column == 'Num':
        print(column)
        train_df[column] = train_df[column].astype(str)

HomePlanet
CryoSleep
Destination
Deck
Num
Side


## Run the pipeline

In [99]:
# what data looks like before pipeline
train_df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Deck,Num,Side,Group,GroupSize
0,0001_01,Europa,False,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,B,0,P,1,1
1,0002_01,Earth,False,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,F,0,S,2,1
2,0003_01,Europa,False,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,A,0,S,3,2
3,0003_02,Europa,False,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,A,0,S,3,2
4,0004_01,Earth,False,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,F,1,S,4,1


In [100]:
# Define the preprocessor

from sklearn.compose import ColumnTransformer

spending_features = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'] # bin then OHE
categorical_features = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Deck', 'Side', 'Num'] # OHE

preprocessor = ColumnTransformer(
    transformers=[
        ('category', categorical_pipeline, categorical_features),
        ('spending', spending_pipeline, spending_features),
        ('age', age_pipeline, ['Age'])
    ],
    remainder='passthrough'  # This leaves the rest of the columns untouched
)

In [101]:
y = train_df.pop('Transported')

In [102]:
i = train_df.pop('PassengerId')

In [103]:
# Now you can fit and transform the training data with the preprocessor
X = preprocessor.fit_transform(train_df).toarray() # sparse to dense array

# Time to attempt the modelling

## Split the data to create a validation set

In [104]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.25, random_state = 0)

## Instantiation of the models

In [105]:
# Create a TensorFlow classifier
def tensorflow_model():
    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout
    from tensorflow.keras.optimizers import Adam

    model = Sequential([
        Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
        Dropout(0.2),
        Dense(64, activation='relu'),
        Dropout(0.2),
        Dense(32, activation='relu'),
        Dropout(0.2),
        Dense(1, activation='sigmoid')
    ])

    model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

    return model

In [106]:
classifiers = {
    'LogisticRegression': LogisticRegression(),
    'Perceptron': Perceptron(),
    'SVC': SVC(),
    'KNN': KNeighborsClassifier(n_neighbors = 5),
    'DecisionTree': DecisionTreeClassifier(),
    'Gaussian': GaussianNB(),
    'RandomForest': RandomForestClassifier(),
    'LinearSvc': LinearSVC(),
    'SGDClassifier': SGDClassifier(),
    'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss')
}

## Train the models

### Evaluate model

In [107]:
def score_model(classifier, X_train, X_val, y_train, y_val):
    # if X_train is a sparse matric convert to dense
    
    # Fit the model
    classifier.fit(X_train, y_train)

    # Predict the test data
    y_pred = classifier.predict(X_val)

    # Create a confusion matrix
    cm = confusion_matrix(y_val, y_pred)

    # Calculate the accuracy
    acc = accuracy_score(y_val, y_pred)

    scores = cross_val_score(classifier, X_train, y_train, cv = 10)

    return { 
            'confusion_matrix': cm,
            'accuracy': acc,
            'kfold-cv': scores.mean()
        }

def run_classifiers(classifiers: dict, X_train, X_test, y_train, y_val) -> dict:
    results = {}
    for name, classifier in classifiers.items():
        results[name] = score_model(classifier, X_train, X_test, y_train, y_val)
    return results

results = run_classifiers(classifiers, X_train, X_val, y_train, y_val)

In [108]:
# convert results to a dataframe
results_df = pd.DataFrame(results).T
results_df.sort_values(by='kfold-cv', ascending=False)

Unnamed: 0,confusion_matrix,accuracy,kfold-cv
XGBoost,"[[867, 210], [203, 894]]",0.810028,0.799663
RandomForest,"[[899, 178], [267, 830]]",0.795308,0.792299
LogisticRegression,"[[704, 373], [112, 985]]",0.776909,0.786162
DecisionTree,"[[828, 249], [272, 825]]",0.76035,0.763767
Gaussian,"[[672, 405], [190, 907]]",0.726311,0.732317
KNN,"[[533, 544], [223, 874]]",0.647194,0.655927
LinearSvc,"[[461, 616], [116, 981]]",0.663293,0.649027
SGDClassifier,"[[426, 651], [51, 1046]]",0.677093,0.606381
Perceptron,"[[395, 682], [48, 1049]]",0.664213,0.571102
SVC,"[[632, 445], [543, 554]]",0.545538,0.543792


## Tuning the two best models

In [109]:
# Print the confusion matrix with labels for TP, FP, TN, FN
def print_confusion_matrix(cm: np.ndarray):
    # True Positives
    TP = cm[1, 1]
    # True Negatives
    TN = cm[0, 0]
    # False Positives
    FP = cm[0, 1]
    # False Negatives
    FN = cm[1, 0]

    # Now we have extracted the values print as a grid
    print(f'True Positives: {TP} False Positives: {FP}')
    print(f'False Negatives: {FN} True Negatives: {TN}')
    

## Tuning the two best models

In [110]:
# XGBoost seems best so lets try that with hyperparameter tuning
search_grid = GridSearchCV(
    XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_depth': [3, 4, 5],
        'learning_rate': [0.1, 0.01, 0.001]
    },
    cv = 10,
    n_jobs = -1
)

In [111]:
# Fit the model
search_grid.fit(X_train, y_train)

# print best parameter after tuning 
print(search_grid.best_params_) 

# print how our model looks after hyper-parameter tuning 
print(search_grid.best_estimator_) 

{'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 300}
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.1, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=4,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=300,
              n_jobs=None, num_parallel_tree=None, random_state=None, ...)


In [112]:
# Print the confusion matrix with labels for TP, FP, TN, FN
def print_confusion_matrix(cm: np.ndarray):
    # True Positives
    TP = cm[1, 1]
    # True Negatives
    TN = cm[0, 0]
    # False Positives
    FP = cm[0, 1]
    # False Negatives
    FN = cm[1, 0]

    # Now we have extracted the values print as a grid
    print(f'True Positives: {TP} False Positives: {FP}')
    print(f'False Negatives: {FN} True Negatives: {TN}')

In [113]:
# Extract the best two from the grid search
xgb_optimal = search_grid.best_estimator_

In [114]:
# Score the optimised SVC
xgb_optimal_result = score_model(xgb_optimal, X_train, X_val, y_train, y_val)

In [115]:
# Print the results

print(f'Best SVC Accuracy: { xgb_optimal_result["accuracy"] }')
print(f'Best SVC KFold CV: { xgb_optimal_result["kfold-cv"] }')
print('\nConfusion Matrix:')
print_confusion_matrix(xgb_optimal_result['confusion_matrix'])

# and finally the f1 score
print(f'F1 Score: {f1_score(y_val, xgb_optimal.predict(X_val))}')

Best SVC Accuracy: 0.8095676172953082
Best SVC KFold CV: 0.8088664913818289

Confusion Matrix:
True Positives: 915 False Positives: 232
False Negatives: 182 True Negatives: 845
F1 Score: 0.8155080213903744


In [116]:
### RandomForestClassifier

In [117]:
# RandomForest grid search
search_grid = GridSearchCV(
    RandomForestClassifier(),
    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_depth': [3, 4, 5],
        'criterion': ['gini', 'entropy']
    },
    cv = 10,
    n_jobs = -1
)

In [118]:
# Fit the model
search_grid.fit(X_train, y_train)

# print best parameter after tuning 
print(search_grid.best_params_) 

# print how our model looks after hyper-parameter tuning 
print(search_grid.best_estimator_) 

{'criterion': 'entropy', 'max_depth': 4, 'n_estimators': 100}
RandomForestClassifier(criterion='entropy', max_depth=4)


In [119]:
# Extract the best two from the grid search
rf_optimal = search_grid.best_estimator_

In [120]:
# Score the optimised SVC
rf_optimal_result = score_model(xgb_optimal, X_train, X_val, y_train, y_val)

In [121]:
# Print the results

print(f'Best SVC Accuracy: { rf_optimal_result["accuracy"] }')
print(f'Best SVC KFold CV: { rf_optimal_result["kfold-cv"] }')
print('\nConfusion Matrix:')
print_confusion_matrix(rf_optimal_result['confusion_matrix'])

# and finally the f1 score
print(f'F1 Score: {f1_score(y_val, rf_optimal.predict(X_val))}')

Best SVC Accuracy: 0.8095676172953082
Best SVC KFold CV: 0.8088664913818289

Confusion Matrix:
True Positives: 915 False Positives: 232
False Negatives: 182 True Negatives: 845
F1 Score: 0.7087087087087087


## Create submission

Now we can predict the test set and submit

In [122]:
# Check the shape

print(f'Test: {test_df.shape}')

# Print the first 5 rows:
test_df.head()

Test: (4277, 16)


Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Deck,Num,Side,Group,GroupSize
0,0013_01,Earth,True,TRAPPIST-1e,27.0,False,0.0,0.0,0.0,0.0,0.0,G,3,S,13,1
1,0018_01,Earth,False,TRAPPIST-1e,19.0,False,0.0,9.0,0.0,2823.0,0.0,F,4,S,18,1
2,0019_01,Europa,True,55 Cancri e,31.0,False,0.0,0.0,0.0,0.0,0.0,C,0,S,19,1
3,0021_01,Europa,False,TRAPPIST-1e,38.0,False,0.0,6652.0,0.0,181.0,585.0,C,1,S,21,1
4,0023_01,Earth,False,TRAPPIST-1e,20.0,False,10.0,0.0,635.0,0.0,0.0,F,5,S,23,1


In [123]:
# Capture the passenger IDs for submission and remove from the data as not useful predictors
idx_test = test_df.pop('PassengerId')

In [124]:
# Convert any Boolean columns with Unknown to String type
for column in test_df.columns:
    if 'Unknown' in train_df[column].unique() or column == 'Num':
        print(column)
        test_df[column] = test_df[column].astype(str)

HomePlanet
CryoSleep
Destination
Deck
Num
Side


In [125]:
# Pre-process test set
X_test = preprocessor.transform(test_df) # sparse to dense array

ValueError: Found unknown categories ['909', '1759', '1623', '1885', '1589', '1443', '1811', '1573', '1874', '1698', '1809', '1845', '1556', '1827', '1662', '1861', '1563', '1762', '1568', '1803', '1890', '1836', '1401', '1649', '1801', '1530', '1859', '1835', '1157', '1565', '1849', '1795', '1850', '1887', '1641', '1581', '1516', '1757', '1760', '1883', '1728', '1881', '1524', '1542', '1843', '1042', '1732', '1869', '1650', '1852', '1833', '1631', '1714', '1882', '1590', '1796', '1854', '1806', '1656', '1692', '1222', '1860', '1785', '628', '1730', '1814', '1813', '1851', '1791', '1110', '1182', '1764', '1790', '1879', '1580', '1291', '1862'] in column 6 during transform

In [None]:
# Predict the test set
y_pred = xgb_optimal.predict(X_test).astype(bool)

In [None]:
# Write the submission DataFrame
submission = pd.DataFrame({'PassengerId': idx_test, 'Transported': y_pred})
submission.to_csv(f'submissions/{model}/submission.csv', index=False)
print("Your submission was successfully saved!")
