# Spaceship Titanic Kaggle Competition

This notebook explores the training data, fills in missing values, and prepares the features for the model notebook.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

Import and inspect the data

In [2]:
df = pd.read_csv('../data/train.csv')
df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


# Strategies for Dealing with Missing Data

1. Delete all rows with missing data
2. Replace missing data with the median, mean, or mode of the feature
3. Develop models to predict missing data

Since this is an introductory challenge meant for learning how to use the Kaggle platform, I will use one or both of the first two strategies.
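
The first two strategies can be sketched on a toy frame. The values and the names `toy`, `dropped`, and `filled` are made up for illustration and are not taken from `train.csv`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'Age': [39.0, np.nan, 58.0, 33.0],
    'HomePlanet': ['Europa', 'Earth', None, 'Earth'],
})

# Strategy 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Strategy 2: fill numeric gaps with the median and categorical gaps with the mode
filled = toy.copy()
filled['Age'] = filled['Age'].fillna(filled['Age'].median())
filled['HomePlanet'] = filled['HomePlanet'].fillna(filled['HomePlanet'].mode()[0])

print(len(dropped))
print(filled)
```

Dropping is simpler but discards whole rows; filling keeps every row at the cost of introducing guessed values.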

In [4]:
df.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

The highest percentage of missing data is CryoSleep at 2.5%. How much data would I discard if I dropped all missing values?

In [5]:
print(df.dropna().shape[0])

6606


I would lose 24% of the data. How many passengers have multiple missing values?

In [6]:
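# thresh=N keeps rows with at least N non-null values, so count the nulls in each row instead
print('Passengers with 2 or more missing values: ' + str((df.isnull().sum(axis=1) >= 2).sum()))

The per-column counts above add up to 2,324 missing cells, but only 2,087 rows (8693 - 6606) contain a missing value, so some passengers must have more than one. Let's see if I can reduce the number of dropped passengers with some pre-processing.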
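
As a side note on how `dropna(thresh=...)` counts: `thresh` is the minimum number of non-null values a row needs to be kept, not a number of missing values. A minimal sketch on made-up data (the frame `toy` and its columns are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy data: row 1 has one gap, row 2 has two gaps
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [1.0, 2.0, np.nan],
    'c': [1.0, 2.0, 3.0],
    'd': [1.0, 2.0, 3.0],
})

# thresh=2 keeps any row with at least 2 non-null values, so nothing is dropped here
kept_by_thresh = toy.dropna(thresh=2)

# Counting nulls per row answers "how many rows have 2 or more missing values?"
nulls_per_row = toy.isnull().sum(axis=1)
rows_with_two_or_more = int((nulls_per_row >= 2).sum())

print(len(toy) - len(kept_by_thresh), rows_with_two_or_more)
```

The two numbers differ on this toy frame, which is why the per-row null count is the safer check.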

# Drop the Name Feature

A count of unique Name values (8474, which includes NaN as one value) shows only 20 duplicates among the 8493 non-null names. I don't expect Name to carry useful information, so I'll drop it.

In [7]:
len(df.Name.unique())

8474
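
When reading this count, note that `unique()` includes the missing value as one entry, while `nunique()` excludes it by default. A small sketch with made-up names:

```python
import pandas as pd

# Hypothetical names: one duplicate and one missing value
names = pd.Series(['Ann Lee', 'Bo Chan', 'Ann Lee', None])

n_unique_with_nan = len(names.unique())             # counts the missing value as one entry
n_unique = names.nunique()                          # ignores missing values
n_duplicates = int(names.dropna().duplicated().sum())

print(n_unique_with_nan, n_unique, n_duplicates)
```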

In [8]:
df.drop('Name', axis=1, inplace=True)

# Fill in Missing CryoSleep Values

Create a new feature that sums all spending for each passenger and compare it with CryoSleep.

My hypothesis is that anyone not spending money is in CryoSleep. First, I'll test it.

In [9]:
df['TotalSpend'] =  df[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].sum(axis=1)
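
One detail worth keeping in mind: `sum(axis=1)` skips NaN by default, so a passenger with a missing spending value still gets a total, and a passenger with every spending value missing gets 0. A minimal sketch on made-up data (`toy`, `total`, and `total_strict` are hypothetical names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'RoomService': [0.0, 10.0, np.nan],
    'Spa': [0.0, np.nan, np.nan],
})

# Default behaviour: NaN is treated as 0, even when the whole row is NaN
total = toy.sum(axis=1)

# min_count=1 requires at least one real value; otherwise the total is NaN
total_strict = toy.sum(axis=1, min_count=1)

print(total.tolist())
print(total_strict.tolist())
```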

Is there anyone awake and not spending money?

In [16]:
len(df[(df.CryoSleep==False) & (df.TotalSpend==0)])

518

There are 518 people awake and not spending money. Are they old enough to have money to spend?

In [17]:
len(df[(df.CryoSleep==False) & (df.Age>10) & (df.TotalSpend==0)])

156

Confirm no one asleep is spending money.

In [19]:
len(df[(df.CryoSleep==True) & (df.TotalSpend>0)])

0
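
A note on masks like this one: in Python, `&` binds more tightly than comparison operators such as `>`, so each comparison needs its own parentheses. The sketch below uses plain NumPy arrays with made-up values; pandas may coerce mixed dtypes differently, so treat it only as an illustration of the precedence rule.

```python
import numpy as np

asleep = np.array([True, True])
spend = np.array([2, 0])

# Parsed as (asleep & spend) > 0: a bitwise AND of 1 and 2 is 0
unparenthesized = asleep & spend > 0

# The intended mask: asleep AND spending more than zero
parenthesized = asleep & (spend > 0)

print(unparenthesized.tolist(), parenthesized.tolist())
```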

In [20]:
df.CryoSleep.value_counts()

False    5439
True     3037
Name: CryoSleep, dtype: int64

156 of the 5439 awake passengers (about 2.9%) are older than 10 and still spent nothing. That is low enough that I will set missing CryoSleep values to True if the passenger spent nothing and to False otherwise.

In [21]:
# Missing CryoSleep: True if the passenger spent nothing, False otherwise
missing = df['CryoSleep'].isna()
df.loc[missing, 'CryoSleep'] = df.loc[missing, 'TotalSpend'] == 0
df['CryoSleep'] = df['CryoSleep'].astype(bool)
df.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep         0
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Transported       0
TotalSpend        0
dtype: int64

# Fill in Missing Values for Spending Features

If CryoSleep is True, missing spending is set to 0. If CryoSleep is False, it is set to the column median. Missing Age values are filled with the median age regardless of CryoSleep, since a sleeping passenger is not zero years old.

In [22]:
spend_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
asleep = df['CryoSleep'] == True

for col in spend_cols:
    # Passengers in CryoSleep cannot spend, so their missing values become 0
    df.loc[asleep & df[col].isna(), col] = 0
    # Remaining gaps belong to awake passengers: use the column median
    df[col] = df[col].fillna(df[col].median())

# Age is not a spending feature: use the median for everyone
df['Age'] = df['Age'].fillna(df['Age'].median())


In [23]:
df.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep         0
Cabin           199
Destination     182
Age               0
VIP             203
RoomService       0
FoodCourt         0
ShoppingMall      0
Spa               0
VRDeck            0
Transported       0
TotalSpend        0
dtype: int64

# How About Missing VIP Values?

In [24]:
df.VIP.value_counts()

False    8291
True      199
Name: VIP, dtype: int64

Only 2.3% of passengers are VIP. Is there a correlation with money spent?

In [27]:
len(df[(df['VIP']==True) & (df['TotalSpend']>750)])

170

171 out of 199 VIP passengers spent money. 170 of those spent more than $750.

How many non-VIP spent more than $750?

In [29]:
len(df[(df['VIP']==False) & (df['TotalSpend']>750)])

3901

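VIPs are more likely to spend over $750 (170 of 199, about 85%, versus 3901 of 8291 non-VIPs, about 47%), but spending does not identify them: 3901 of the 4071 passengers above $750 are not VIP. Given that 97.7% of passengers are not VIP, I will set the null values to False.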
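
The two counts above can also be read as rates. A minimal sketch, on made-up data, of how `pd.crosstab` with `normalize` gives both views; the labels `'VIP'`, `'non-VIP'`, `'high'`, and `'low'` are hypothetical names chosen for this example:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'VIP': [True, True, False, False, False, False],
    'TotalSpend': [900.0, 1200.0, 800.0, 0.0, 50.0, 1000.0],
})

vip_label = toy['VIP'].map({True: 'VIP', False: 'non-VIP'})
spend_label = np.where(toy['TotalSpend'] > 750, 'high', 'low')

# Share of high and low spenders within each VIP group (rows sum to 1)
rate_by_vip = pd.crosstab(vip_label, spend_label, normalize='index')

# Share of VIPs and non-VIPs within each spending group (columns sum to 1)
vip_share = pd.crosstab(vip_label, spend_label, normalize='columns')

print(rate_by_vip)
print(vip_share)
```

The first table shows how spending differs between groups; the second shows whether high spending is enough to predict VIP status.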

In [30]:
df.fillna({'VIP':False}, inplace=True)
df.VIP.value_counts()

False    8494
True      199
Name: VIP, dtype: int64

In [31]:
df.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep         0
Cabin           199
Destination     182
Age               0
VIP               0
RoomService       0
FoodCourt         0
ShoppingMall      0
Spa               0
VRDeck            0
Transported       0
TotalSpend        0
dtype: int64

Now that I have filled the missing spending values, I can drop TotalSpend.

In [32]:
df.drop('TotalSpend', axis=1, inplace=True)

# Add Dummy for Missing HomePlanet, Destination, and Cabin

In [34]:
df.fillna({'HomePlanet':'Mercury'}, inplace=True)
df.HomePlanet.value_counts()

Earth      4602
Europa     2131
Mars       1759
Mercury     201
Name: HomePlanet, dtype: int64

In [35]:
df.fillna({'Destination':'Planet-Z'}, inplace=True)
df.Destination.value_counts()

TRAPPIST-1e      5915
55 Cancri e      1800
PSO J318.5-22     796
Planet-Z          182
Name: Destination, dtype: int64

In [36]:
df.fillna({'Cabin':'0/0/0'}, inplace=True)
df.Cabin.value_counts()

0/0/0      199
G/734/S      8
C/137/S      7
B/201/P      7
G/109/P      7
          ... 
G/556/P      1
E/231/S      1
G/545/S      1
G/543/S      1
C/178/S      1
Name: Cabin, Length: 6561, dtype: int64

In [37]:
df.isnull().sum()

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Transported     0
dtype: int64
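
The placeholder approach used above treats "missing" as its own category. A minimal sketch on made-up data; the label `'Unknown'` and the extra indicator column `HomePlanetMissing` are assumptions for illustration and are not used in this notebook:

```python
import pandas as pd

toy = pd.DataFrame({'HomePlanet': ['Earth', None, 'Mars']})

# Optional: record which rows were missing before filling them
toy['HomePlanetMissing'] = toy['HomePlanet'].isna()

# Any label that does not collide with a real category works as the placeholder
toy['HomePlanet'] = toy['HomePlanet'].fillna('Unknown')

print(toy)
```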

# Extract Features from Cabin

Each Cabin value has three fields separated by '/' that indicate the class or location of the cabin. An examination of the unique values suggests that the first field may indicate a level, the second a room number, and the third the side of the ship (port/starboard).

In [38]:
df[['CabinLevel', 'CabinNumber', 'CabinSide']] = df['Cabin'].str.split('/', expand=True)
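
To show what the split produces, here is a small sketch on three made-up Cabin strings, including the `'0/0/0'` placeholder used for missing cabins. Note that every field comes back as a string, so the number would need an explicit conversion if it were kept.

```python
import pandas as pd

toy = pd.DataFrame({'Cabin': ['B/0/P', 'F/1/S', '0/0/0']})

# expand=True returns one column per field
toy[['CabinLevel', 'CabinNumber', 'CabinSide']] = toy['Cabin'].str.split('/', expand=True)

# The split yields strings; convert the number explicitly if it is needed
cabin_number = toy['CabinNumber'].astype(int)

print(toy)
```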

I'll drop Cabin and CabinNumber.

In [39]:
df.drop(['Cabin', 'CabinNumber'], axis=1, inplace=True)

In [41]:
df.CabinLevel.value_counts()

F    2794
G    2559
E     876
B     779
C     747
D     478
A     256
0     199
T       5
Name: CabinLevel, dtype: int64

In [42]:
df.CabinSide.value_counts()

S    4288
P    4206
0     199
Name: CabinSide, dtype: int64

# Create a New Feature for Age Range

There may be thresholds in Age that give more useful information than knowing the actual age. Create a feature for range and compare correlation with the Age feature.

In [None]:
bins = [0, 6, 18, 35, 65]
labels = ['<6', '6-17', '18-34', '35-64', '65+']

d = dict(enumerate(labels, 1))

df['AgeRange'] = np.vectorize(d.get)(np.digitize(df['Age'], bins))
df.AgeRange.value_counts()
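
The same binning can be written with `pd.cut`. The sketch below, on made-up ages, checks that it agrees with the `np.digitize` lookup; the extra `np.inf` edge is an assumption added here so that the last bin is open-ended.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0, 5, 6, 17, 18, 34, 35, 64, 65, 79])
bins = [0, 6, 18, 35, 65]
labels = ['<6', '6-17', '18-34', '35-64', '65+']

# np.digitize returns 1 for [0, 6), 2 for [6, 18), ... and 5 for 65 and above
via_digitize = [labels[i - 1] for i in np.digitize(ages, bins)]

# right=False makes each interval closed on the left: [low, high)
via_cut = pd.cut(ages, bins=bins + [np.inf], labels=labels, right=False)

print(via_cut.astype(str).tolist() == via_digitize)
```

`pd.cut` also returns an ordered categorical, which keeps the age order of the labels.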

In [None]:
df['Age'].corr(df['Transported'], method='pearson')
#df['Age'].corr(df['Transported'], method='spearman')
#df['Age'].corr(df['Transported'], method='kendall')

In [None]:
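# Encode in age order; a plain .astype('category') would sort the labels alphabetically
df['AgeRange'] = pd.Categorical(df['AgeRange'], categories=labels, ordered=True).codes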
df['AgeRange'].corr(df['Transported'], method='pearson')
#df['AgeRange'].corr(df['Transported'], method='spearman')
#df['AgeRange'].corr(df['Transported'], method='kendall')
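
One caveat with `.astype('category').cat.codes` on string labels: the categories are sorted alphabetically unless an order is given, so the codes need not follow age order. A minimal sketch of the difference:

```python
import pandas as pd

labels = ['<6', '6-17', '18-34', '35-64', '65+']
s = pd.Series(labels)

# Default: categories are sorted as strings
alphabetical = s.astype('category').cat.codes.tolist()

# Explicit order: codes follow the position in `labels`
ordered = pd.Categorical(s, categories=labels, ordered=True).codes.tolist()

print(alphabetical)
print(ordered)
```

A Pearson correlation only makes sense for the second encoding, where a larger code means an older group.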

Conclusion:

A slight improvement in correlation can be obtained by grouping ages, but it still seems negligible. I'll drop Age and keep AgeRange. I may drop AgeRange when testing models.

In [None]:
df.drop('Age', axis=1, inplace=True)
df.info()

# Convert Remaining Categorical Features to Numeric

In [43]:
df['HomePlanet']=df['HomePlanet'].astype('category').cat.codes
df['CryoSleep']=df['CryoSleep'].astype('category').cat.codes
df['Destination']=df['Destination'].astype('category').cat.codes
df['VIP']=df['VIP'].astype('category').cat.codes
df['CabinLevel']=df['CabinLevel'].astype('category').cat.codes
df['CabinSide']=df['CabinSide'].astype('category').cat.codes
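
To interpret the resulting integers later, it helps to keep the code-to-label mapping. A small sketch on made-up HomePlanet values (`planets` and `mapping` are hypothetical names):

```python
import pandas as pd

planets = pd.Series(['Europa', 'Earth', 'Mars', 'Mercury', 'Earth'])

cat = planets.astype('category')
codes = cat.cat.codes

# Categories are sorted alphabetically, so the code is the position in that sorted list
mapping = dict(enumerate(cat.cat.categories))

print(codes.tolist())
print(mapping)
```

This matches the saved output further down, where Earth is encoded as 0 and Europa as 1.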

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8693 non-null   int8   
 2   CryoSleep     8693 non-null   int8   
 3   Destination   8693 non-null   int8   
 4   Age           8693 non-null   float64
 5   VIP           8693 non-null   int8   
 6   RoomService   8693 non-null   float64
 7   FoodCourt     8693 non-null   float64
 8   ShoppingMall  8693 non-null   float64
 9   Spa           8693 non-null   float64
 10  VRDeck        8693 non-null   float64
 11  Transported   8693 non-null   bool   
 12  CabinLevel    8693 non-null   int8   
 13  CabinSide     8693 non-null   int8   
dtypes: bool(1), float64(6), int8(6), object(1)
memory usage: 534.9+ KB


# Save the DataFrame for Use in the Model Notebook

In [45]:
# drop=True discards the old index instead of adding it as an 'index' column
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,CabinLevel,CabinSide
0,0001_01,1,0,3,39.0,0,0.0,0.0,0.0,0.0,0.0,False,2,1
1,0002_01,0,0,3,24.0,0,109.0,9.0,25.0,549.0,44.0,True,6,2
2,0003_01,1,0,3,58.0,1,43.0,3576.0,0.0,6715.0,49.0,False,1,2
3,0003_02,1,0,3,33.0,0,0.0,1283.0,371.0,3329.0,193.0,False,1,2
4,0004_01,0,0,3,16.0,0,303.0,70.0,151.0,565.0,2.0,True,6,2


In [46]:
df.to_pickle('./dataframe.pkl')
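
The model notebook can load the file back with `pd.read_pickle`. A minimal round-trip sketch on a made-up frame, written to a temporary directory so it does not touch `./dataframe.pkl`:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    'CryoSleep': pd.Series([0, 1], dtype='int8'),
    'Transported': [False, True],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'dataframe.pkl')
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Unlike a CSV round trip, the dtypes (such as int8) come back unchanged
print(restored.dtypes)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.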