# Spaceship Titanic Submission

For details on how I arrived at the data pre-processing strategy and the choice of model, see the "Exploratory Data Analysis" and "Model Analysis" notebooks in my https://github.com/dandresky/ds-ml-projects/tree/main/kaggle-challenges/spaceship-titanic repository.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

Import and inspect the data

In [2]:
df_train = pd.read_csv('../data/train.csv')
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


In [3]:
df_test = pd.read_csv('../data/test.csv')
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4277 entries, 0 to 4276
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   4277 non-null   object 
 1   HomePlanet    4190 non-null   object 
 2   CryoSleep     4184 non-null   object 
 3   Cabin         4177 non-null   object 
 4   Destination   4185 non-null   object 
 5   Age           4186 non-null   float64
 6   VIP           4184 non-null   object 
 7   RoomService   4195 non-null   float64
 8   FoodCourt     4171 non-null   float64
 9   ShoppingMall  4179 non-null   float64
 10  Spa           4176 non-null   float64
 11  VRDeck        4197 non-null   float64
 12  Name          4183 non-null   object 
dtypes: float64(6), object(7)
memory usage: 434.5+ KB


In [4]:
df_sample = pd.read_csv('../data/sample_submission.csv')
df_sample.head()

  PassengerId  Transported
0     0013_01        False
1     0018_01        False
2     0019_01        False
3     0021_01        False
4     0023_01        False


In [5]:
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4277 entries, 0 to 4276
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   PassengerId  4277 non-null   object
 1   Transported  4277 non-null   bool  
dtypes: bool(1), object(1)
memory usage: 37.7+ KB


The submission must contain a prediction for every one of the 4277 passengers in the test set, so we cannot build a predictor that relies on dropping rows with missing values.

# Preprocessing

In [6]:
df_train.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [7]:
df_test.isnull().sum()

PassengerId       0
HomePlanet       87
CryoSleep        93
Cabin           100
Destination      92
Age              91
VIP              93
RoomService      82
FoodCourt       106
ShoppingMall     98
Spa             101
VRDeck           80
Name             94
dtype: int64

### Drop the Name Feature

A count of unique Name values shows only 20 duplicates among the 8493 non-null names. I don't expect the feature to carry any useful information, so I drop it.

In [8]:
df_train.drop('Name', axis=1, inplace=True)
df_test.drop('Name', axis=1, inplace=True)
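
The duplicate count mentioned above can be reproduced along these lines. This is a minimal sketch on a made-up Series (the names and the variable `names` are illustrative, not taken from the dataset), and it is only one of several ways such a count could have been obtained.

```python
import pandas as pd

# Toy stand-in for the Name column, including one missing value.
names = pd.Series(['Ann Lee', 'Bo Chan', 'Ann Lee', None, 'Cy Dunn'], name='Name')

non_null = names.dropna()
n_names = len(non_null)                      # names that are present
n_unique = non_null.nunique()                # distinct names
n_duplicates = non_null.duplicated().sum()   # repeats beyond the first occurrence

print(n_names, n_unique, n_duplicates)
```

When nearly every value is unique, the column behaves like an identifier and gives a tree-based model little to split on.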

### Fill in Missing CryoSleep Values

Fewer than 3% of awake passengers are old enough to have money yet spend nothing. That is low enough that I will set a missing CryoSleep value to True when the passenger spends nothing and to False otherwise.

First, create a temporary helper feature that totals the five spending columns.

In [9]:
df_train['TotalSpend'] =  df_train[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].sum(axis=1)
df_test['TotalSpend'] =  df_test[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].sum(axis=1)

Now set missing CryoSleep values based on spending.

In [10]:
df_train['CryoSleep'] = df_train.apply(lambda row: False if ((row['TotalSpend']>0) & np.isnan(row['CryoSleep'])) else row['CryoSleep'], axis=1)
df_train['CryoSleep'] = df_train.apply(lambda row: True if ((row['TotalSpend']==0) & np.isnan(row['CryoSleep'])) else row['CryoSleep'], axis=1)
df_test['CryoSleep'] = df_test.apply(lambda row: False if ((row['TotalSpend']>0) & np.isnan(row['CryoSleep'])) else row['CryoSleep'], axis=1)
df_test['CryoSleep'] = df_test.apply(lambda row: True if ((row['TotalSpend']==0) & np.isnan(row['CryoSleep'])) else row['CryoSleep'], axis=1)
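
The same rule can also be written without row-wise `apply`. The following is a minimal sketch on a made-up frame; `toy` and `spend_cols` are illustrative names, and only two spending columns are used to keep it short.

```python
import numpy as np
import pandas as pd

# Toy data: two passengers with a known CryoSleep value, two without.
toy = pd.DataFrame({
    'CryoSleep': [True, False, np.nan, np.nan],
    'RoomService': [0.0, 120.0, 0.0, 35.0],
    'Spa': [0.0, np.nan, 0.0, 10.0],
})
spend_cols = ['RoomService', 'Spa']

# sum(axis=1) skips NaN, so a missing amount counts as no spending.
total_spend = toy[spend_cols].sum(axis=1)
missing = toy['CryoSleep'].isna()

# Where CryoSleep is missing, infer it from spending; otherwise keep the value.
toy['CryoSleep'] = np.where(missing, total_spend == 0, toy['CryoSleep']).astype(bool)

print(toy['CryoSleep'].tolist())
```

Working on whole columns avoids four passes of `apply` and leaves the column with a plain boolean dtype.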

Now drop TotalSpend

In [11]:
df_train.drop('TotalSpend', axis=1, inplace=True)
df_test.drop('TotalSpend', axis=1, inplace=True)

### Set Missing Spending to Median if CryoSleep is False

Otherwise (CryoSleep is True), set it to 0. The cells below apply the same rule to missing Age values.

In [12]:
df_train['RoomService'] = df_train.apply(lambda row: 0 if ((row['CryoSleep']==True) & np.isnan(row['RoomService'])) else row['RoomService'], axis=1)
df_train['RoomService'] = df_train.apply(lambda row: df_train['RoomService'].median() if ((row['CryoSleep']==False) & np.isnan(row['RoomService'])) else row['RoomService'], axis=1)

df_train['FoodCourt'] = df_train.apply(lambda row: 0 if ((row['CryoSleep']==True) & np.isnan(row['FoodCourt'])) else row['FoodCourt'], axis=1)
df_train['FoodCourt'] = df_train.apply(lambda row: df_train['FoodCourt'].median() if ((row['CryoSleep']==False) & np.isnan(row['FoodCourt'])) else row['FoodCourt'], axis=1)

df_train['ShoppingMall'] = df_train.apply(lambda row: 0 if ((row['CryoSleep']==True) & np.isnan(row['ShoppingMall'])) else row['ShoppingMall'], axis=1)
df_train['ShoppingMall'] = df_train.apply(lambda row: df_train['ShoppingMall'].median() if ((row['CryoSleep']==False) & np.isnan(row['ShoppingMall'])) else row['ShoppingMall'], axis=1)

df_train['Spa'] = df_train.apply(lambda row: 0 if ((row['CryoSleep']==True) & np.isnan(row['Spa'])) else row['Spa'], axis=1)
df_train['Spa'] = df_train.apply(lambda row: df_train['Spa'].median() if ((row['CryoSleep']==False) & np.isnan(row['Spa'])) else row['Spa'], axis=1)

df_train['VRDeck'] = df_train.apply(lambda row: 0 if ((row['CryoSleep']==True) & np.isnan(row['VRDeck'])) else row['VRDeck'], axis=1)
df_train['VRDeck'] = df_train.apply(lambda row: df_train['VRDeck'].median() if ((row['CryoSleep']==False) & np.isnan(row['VRDeck'])) else row['VRDeck'], axis=1)

df_train['Age'] = df_train.apply(lambda row: 0 if ((row['CryoSleep']==True) & np.isnan(row['Age'])) else row['Age'], axis=1)
df_train['Age'] = df_train.apply(lambda row: df_train['Age'].median() if ((row['CryoSleep']==False) & np.isnan(row['Age'])) else row['Age'], axis=1)


In [13]:
df_test['RoomService'] = df_test.apply(lambda row: 0 if ((row['CryoSleep']==True) & np.isnan(row['RoomService'])) else row['RoomService'], axis=1)
df_test['RoomService'] = df_test.apply(lambda row: df_test['RoomService'].median() if ((row['CryoSleep']==False) & np.isnan(row['RoomService'])) else row['RoomService'], axis=1)

df_test['FoodCourt'] = df_test.apply(lambda row: 0 if ((row['CryoSleep']==True) & np.isnan(row['FoodCourt'])) else row['FoodCourt'], axis=1)
df_test['FoodCourt'] = df_test.apply(lambda row: df_test['FoodCourt'].median() if ((row['CryoSleep']==False) & np.isnan(row['FoodCourt'])) else row['FoodCourt'], axis=1)

df_test['ShoppingMall'] = df_test.apply(lambda row: 0 if ((row['CryoSleep']==True) & np.isnan(row['ShoppingMall'])) else row['ShoppingMall'], axis=1)
df_test['ShoppingMall'] = df_test.apply(lambda row: df_test['ShoppingMall'].median() if ((row['CryoSleep']==False) & np.isnan(row['ShoppingMall'])) else row['ShoppingMall'], axis=1)

df_test['Spa'] = df_test.apply(lambda row: 0 if ((row['CryoSleep']==True) & np.isnan(row['Spa'])) else row['Spa'], axis=1)
df_test['Spa'] = df_test.apply(lambda row: df_test['Spa'].median() if ((row['CryoSleep']==False) & np.isnan(row['Spa'])) else row['Spa'], axis=1)

df_test['VRDeck'] = df_test.apply(lambda row: 0 if ((row['CryoSleep']==True) & np.isnan(row['VRDeck'])) else row['VRDeck'], axis=1)
df_test['VRDeck'] = df_test.apply(lambda row: df_test['VRDeck'].median() if ((row['CryoSleep']==False) & np.isnan(row['VRDeck'])) else row['VRDeck'], axis=1)

df_test['Age'] = df_test.apply(lambda row: 0 if ((row['CryoSleep']==True) & np.isnan(row['Age'])) else row['Age'], axis=1)
df_test['Age'] = df_test.apply(lambda row: df_test['Age'].median() if ((row['CryoSleep']==False) & np.isnan(row['Age'])) else row['Age'], axis=1)
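
The per-column pattern above repeats the same two steps many times, so it can be folded into a small helper. This is a minimal sketch under the assumption that CryoSleep is already a boolean column; `fill_by_cryosleep` and `toy` are hypothetical names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def fill_by_cryosleep(df, columns):
    """Fill NaN with 0 for sleeping passengers, else with the column median."""
    out = df.copy()
    asleep = out['CryoSleep'].astype(bool)
    for col in columns:
        missing = out[col].isna()
        out.loc[missing & asleep, col] = 0
        # The median is taken after the zero fill, as in the cells above.
        out.loc[missing & ~asleep, col] = out[col].median()
    return out

toy = pd.DataFrame({
    'CryoSleep': [True, False, False, False],
    'RoomService': [np.nan, np.nan, 10.0, 30.0],
})
filled = fill_by_cryosleep(toy, ['RoomService'])
print(filled['RoomService'].tolist())
```

The helper returns a copy, so it could be called once for the training frame and once for the test frame with the same list of columns.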


### Set Missing VIP Values

Nearly 98% of passengers are not VIP, so I assume missing values are False.

In [14]:
df_train.fillna({'VIP':False}, inplace=True)
df_test.fillna({'VIP':False}, inplace=True)

### Add Dummy Values for Missing HomePlanet, Destination, and Cabin

In [15]:
df_train.fillna({'HomePlanet':'Mercury'}, inplace=True)
df_train.fillna({'Destination':'Planet-Z'}, inplace=True)
df_train.fillna({'Cabin':'0/0/0'}, inplace=True)

In [16]:
df_test.fillna({'HomePlanet':'Mercury'}, inplace=True)
df_test.fillna({'Destination':'Planet-Z'}, inplace=True)
df_test.fillna({'Cabin':'0/0/0'}, inplace=True)

### There should be no more null values

In [17]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8693 non-null   object 
 2   CryoSleep     8693 non-null   bool   
 3   Cabin         8693 non-null   object 
 4   Destination   8693 non-null   object 
 5   Age           8693 non-null   float64
 6   VIP           8693 non-null   bool   
 7   RoomService   8693 non-null   float64
 8   FoodCourt     8693 non-null   float64
 9   ShoppingMall  8693 non-null   float64
 10  Spa           8693 non-null   float64
 11  VRDeck        8693 non-null   float64
 12  Transported   8693 non-null   bool   
dtypes: bool(3), float64(6), object(4)
memory usage: 704.7+ KB


In [18]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4277 entries, 0 to 4276
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   4277 non-null   object 
 1   HomePlanet    4277 non-null   object 
 2   CryoSleep     4277 non-null   bool   
 3   Cabin         4277 non-null   object 
 4   Destination   4277 non-null   object 
 5   Age           4277 non-null   float64
 6   VIP           4277 non-null   bool   
 7   RoomService   4277 non-null   float64
 8   FoodCourt     4277 non-null   float64
 9   ShoppingMall  4277 non-null   float64
 10  Spa           4277 non-null   float64
 11  VRDeck        4277 non-null   float64
dtypes: bool(2), float64(6), object(4)
memory usage: 342.6+ KB


# Extract Features from Cabin

Each Cabin entry holds three values separated by '/' that indicate the class or location of the cabin. An examination of the unique values suggests that the first value may indicate a level, the second a room number, and the third the side of the ship (port or starboard).

In [19]:
df_train[['CabinLevel', 'CabinNumber', 'CabinSide']] = df_train['Cabin'].str.split('/', expand=True)
df_test[['CabinLevel', 'CabinNumber', 'CabinSide']] = df_test['Cabin'].str.split('/', expand=True)

Drop Cabin, which is now redundant, and CabinNumber, which I don't expect to carry any useful information.

In [20]:
df_train.drop(['Cabin', 'CabinNumber'], axis=1, inplace=True)
df_test.drop(['Cabin', 'CabinNumber'], axis=1, inplace=True)

### Convert Remaining Categorical Features to Numeric

In [21]:
df_train['HomePlanet']=df_train['HomePlanet'].astype('category').cat.codes
df_train['CryoSleep']=df_train['CryoSleep'].astype('category').cat.codes
df_train['Destination']=df_train['Destination'].astype('category').cat.codes
df_train['VIP']=df_train['VIP'].astype('category').cat.codes
df_train['CabinLevel']=df_train['CabinLevel'].astype('category').cat.codes
df_train['CabinSide']=df_train['CabinSide'].astype('category').cat.codes

In [22]:
df_test['HomePlanet']=df_test['HomePlanet'].astype('category').cat.codes
df_test['CryoSleep']=df_test['CryoSleep'].astype('category').cat.codes
df_test['Destination']=df_test['Destination'].astype('category').cat.codes
df_test['VIP']=df_test['VIP'].astype('category').cat.codes
df_test['CabinLevel']=df_test['CabinLevel'].astype('category').cat.codes
df_test['CabinSide']=df_test['CabinSide'].astype('category').cat.codes
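
Encoding each frame separately with `cat.codes` only produces matching codes when both frames contain the same set of values in a column. One way to guard against a mismatch is to build a shared `CategoricalDtype` first. The sketch below uses made-up frames (`train`, `test`) and is an illustration, not a change to the cells above.

```python
import pandas as pd

# Toy frames: the test frame lacks one category that the training frame has.
train = pd.DataFrame({'HomePlanet': ['Earth', 'Mars', 'Europa']})
test = pd.DataFrame({'HomePlanet': ['Mars', 'Earth']})

# Independent encoding: 'Mars' gets code 1 here but code 2 in the training frame.
test_independent = test['HomePlanet'].astype('category').cat.codes

# Shared encoding: one category list built from both frames.
categories = sorted(set(train['HomePlanet']) | set(test['HomePlanet']))
shared = pd.CategoricalDtype(categories=categories)
train_codes = train['HomePlanet'].astype(shared).cat.codes
test_codes = test['HomePlanet'].astype(shared).cat.codes

print(train_codes.tolist(), test_codes.tolist(), test_independent.tolist())
```

With the shared dtype, a given label maps to the same integer in both frames regardless of which labels happen to appear in each.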

# Train a Gradient Boosting Classifier

First, split the data into features (X) and labels (Y).

In [23]:
X = df_train[['HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'CabinLevel', 'CabinSide']]
Y = df_train['Transported']

Apply a scaler

In [24]:
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()
X_scaled = scaler.fit_transform(X)
X_scaled

array([[3.33333333e-01, 0.00000000e+00, 1.00000000e+00, ...,
        0.00000000e+00, 2.50000000e-01, 5.00000000e-01],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, ...,
        1.82322960e-03, 7.50000000e-01, 1.00000000e+00],
       [3.33333333e-01, 0.00000000e+00, 1.00000000e+00, ...,
        2.03041478e-03, 1.25000000e-01, 1.00000000e+00],
       ...,
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, ...,
        0.00000000e+00, 8.75000000e-01, 1.00000000e+00],
       [3.33333333e-01, 0.00000000e+00, 0.00000000e+00, ...,
        1.34048813e-01, 6.25000000e-01, 1.00000000e+00],
       [3.33333333e-01, 0.00000000e+00, 1.00000000e+00, ...,
        4.97244437e-04, 6.25000000e-01, 1.00000000e+00]])

Create and train the model

In [25]:
from sklearn.ensemble import GradientBoostingClassifier

gbst = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, subsample=1.0, min_samples_split=2, max_depth=4, random_state=67)

gbst.fit(X_scaled, Y)

GradientBoostingClassifier(max_depth=4, random_state=67)

# Predict Values

First extract features

In [26]:
X_test = df_test[['HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'CabinLevel', 'CabinSide']]

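Apply a scaler. Note that this cell fits a new MinMaxScaler on the test features rather than reusing the one fitted on the training data.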

In [27]:
scaler = preprocessing.MinMaxScaler()
X_test_scaled = scaler.fit_transform(X_test)
X_test_scaled

array([[0.        , 1.        , 1.        , ..., 0.        , 0.875     ,
        1.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.75      ,
        1.        ],
       [0.33333333, 1.        , 0.        , ..., 0.        , 0.375     ,
        1.        ],
       ...,
       [0.66666667, 1.        , 0.        , ..., 0.        , 0.5       ,
        0.5       ],
       [0.33333333, 0.        , 0.66666667, ..., 0.0234824 , 0.5       ,
        0.5       ],
       [0.        , 1.        , 0.33333333, ..., 0.        , 0.875     ,
        1.        ]])
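
Fitting the scaler once on the training features and calling `transform` on the test features keeps both sets on the same scale. The following is a minimal sketch on made-up arrays (`X_train_toy` and `X_test_toy` are illustrative); it shows how the two approaches can map the same raw value to different scaled values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data: the test range is narrower than the training range.
X_train_toy = np.array([[0.0], [50.0], [100.0]])
X_test_toy = np.array([[0.0], [25.0], [50.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(X_train_toy)   # learns min=0, max=100
test_scaled = scaler.transform(X_test_toy)         # reuses the training min/max

# Refitting on the test data rescales against the test min/max instead.
test_refit = MinMaxScaler().fit_transform(X_test_toy)

print(test_scaled.ravel(), test_refit.ravel())
```

In the toy data a raw value of 50 becomes 0.5 with the reused scaler but 1.0 with the refitted one, so a threshold the model learned on the training scale would land in a different place.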

In [28]:
y_pred = gbst.predict(X_test_scaled)

In [29]:
y_pred

array([ True, False,  True, ...,  True,  True,  True])

Create and save the submission

In [31]:
submission = df_test[['PassengerId']].copy()  # explicit copy avoids SettingWithCopyWarning
submission['Transported'] = y_pred.tolist()
submission.head()


  PassengerId  Transported
0     0013_01         True
1     0018_01        False
2     0019_01         True
3     0021_01         True
4     0023_01         True


In [32]:
submission.to_csv('dea-submission.csv', index=False)
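
Before uploading, the submission can be compared against the sample file loaded earlier. This is a minimal sketch with made-up frames; `submission_problems` is a hypothetical helper, and the checks are my assumption of what is worth verifying rather than an official list of requirements.

```python
import pandas as pd

def submission_problems(submission, sample):
    """Return a list of mismatches between a submission and the sample file."""
    if list(submission.columns) != list(sample.columns):
        return ['column names differ']
    problems = []
    if len(submission) != len(sample):
        problems.append('row count differs')
    elif set(submission['PassengerId']) != set(sample['PassengerId']):
        problems.append('passenger ids differ')
    if submission['Transported'].isna().any():
        problems.append('missing predictions')
    return problems

sample = pd.DataFrame({'PassengerId': ['0013_01', '0018_01'],
                       'Transported': [False, False]})
good = pd.DataFrame({'PassengerId': ['0013_01', '0018_01'],
                     'Transported': [True, False]})
bad = good.iloc[:1]  # one passenger dropped

print(submission_problems(good, sample))
print(submission_problems(bad, sample))
```

An empty list means the frame has the same columns, row count, and passenger ids as the sample.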