<a href="https://colab.research.google.com/github/adhang/learn-data-science/blob/main/Spaceship_Titanic_Data_Cleansing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spaceship Titanic

Author: Adhang Muntaha Muhammad

[![LinkedIn](https://img.shields.io/badge/linkedin-0077B5?style=for-the-badge&logo=linkedin&logoColor=white&link=https://www.linkedin.com/in/adhangmuntaha/)](https://www.linkedin.com/in/adhangmuntaha/)
[![GitHub](https://img.shields.io/badge/github-121011?style=for-the-badge&logo=github&logoColor=white&link=https://github.com/adhang)](https://github.com/adhang)
[![Kaggle](https://img.shields.io/badge/kaggle-20BEFF?style=for-the-badge&logo=kaggle&logoColor=white&link=https://www.kaggle.com/adhang)](https://www.kaggle.com/adhang)
[![Tableau](https://img.shields.io/badge/tableau-E97627?style=for-the-badge&logo=tableau&logoColor=white&link=https://public.tableau.com/app/profile/adhang)](https://public.tableau.com/app/profile/adhang)
___

**Contents**
1. Introduction
2. Importing Libraries
3. Dataset Overview
4. Exploratory Data Analysis
5. Data Preprocessing
6. Model Development & Evaluation
7. Conclusion
8. Explainable AI
9. Reference and Further Reading

# Introduction

**Features**
- `PassengerId` - A unique Id for each passenger. Each Id takes the form `gggg_pp` where `gggg` indicates a group the passenger is travelling with and `pp` is their number within the group. People in a group are often family members, but not always.
- `HomePlanet` - The planet the passenger departed from, typically their planet of permanent residence.
- `CryoSleep` - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- `Cabin` - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- `Destination` - The planet the passenger will be debarking to.
- `Age` - The age of the passenger.
- `VIP` - Whether the passenger has paid for special VIP service during the voyage.
- `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck` - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- `Name` - The first and last names of the passenger.
- `Transported` - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

# Importing Libraries

In [None]:
!pip install datawig
!pip install inflection

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.precision', 3)

import datawig
import inflection

from sklearn.metrics import f1_score, classification_report

# Dataset Overview

## Reading Dataset

In [3]:
path = 'https://raw.githubusercontent.com/adhang/datasets/main/spaceship-titanic-train.csv'

data = pd.read_csv(path)
data.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


## Duplicated Values

In [4]:
data.duplicated().sum()

0

## Rename Column Names

In [5]:
# renaming the column
column_list = list(data.columns)

for i, col in enumerate(column_list):
  column_list[i] = inflection.underscore(column_list[i]).replace(' ', '_')

data.columns = column_list
data.head()

Unnamed: 0,passenger_id,home_planet,cryo_sleep,cabin,destination,age,vip,room_service,food_court,shopping_mall,spa,vr_deck,name,transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


## Data Types

### Numeric

In [6]:
data.select_dtypes(include=np.number)

Unnamed: 0,age,room_service,food_court,shopping_mall,spa,vr_deck
0,39.0,0.0,0.0,0.0,0.0,0.0
1,24.0,109.0,9.0,25.0,549.0,44.0
2,58.0,43.0,3576.0,0.0,6715.0,49.0
3,33.0,0.0,1283.0,371.0,3329.0,193.0
4,16.0,303.0,70.0,151.0,565.0,2.0
...,...,...,...,...,...,...
8688,41.0,0.0,6819.0,0.0,1643.0,74.0
8689,18.0,0.0,0.0,0.0,0.0,0.0
8690,26.0,0.0,0.0,1872.0,1.0,0.0
8691,32.0,0.0,1049.0,0.0,353.0,3235.0


In [7]:
column_numerical = data.select_dtypes(include=np.number).columns.values.tolist()
column_numerical

['age', 'room_service', 'food_court', 'shopping_mall', 'spa', 'vr_deck']

### Categorical

In [8]:
data.select_dtypes(exclude=np.number)

Unnamed: 0,passenger_id,home_planet,cryo_sleep,cabin,destination,vip,name,transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,False,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,False,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,True,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,False,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,False,Willy Santantines,True
...,...,...,...,...,...,...,...,...
8688,9276_01,Europa,False,A/98/P,55 Cancri e,True,Gravior Noxnuther,False
8689,9278_01,Earth,True,G/1499/S,PSO J318.5-22,False,Kurta Mondalley,False
8690,9279_01,Earth,False,G/1500/S,TRAPPIST-1e,False,Fayey Connon,True
8691,9280_01,Europa,False,E/608/S,55 Cancri e,False,Celeon Hontichre,False


In [9]:
column_categorical = data.select_dtypes(exclude=np.number).columns.values.tolist()
column_categorical

['passenger_id',
 'home_planet',
 'cryo_sleep',
 'cabin',
 'destination',
 'vip',
 'name',
 'transported']

### Convert Bool to String

In [10]:
bool_col = ['cryo_sleep', 'vip', 'transported']
data.loc[:, bool_col] = data.loc[:, bool_col].replace({True:'TRUE', False:'FALSE'})

data.head()

Unnamed: 0,passenger_id,home_planet,cryo_sleep,cabin,destination,age,vip,room_service,food_court,shopping_mall,spa,vr_deck,name,transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


## Missing Values

In [11]:
# total null values
data_null_total = pd.DataFrame(data.isna().sum()).T.rename({0:'total null'})

# percentage of null values
data_null_percentage = pd.DataFrame(100*data.isna().sum()/data.shape[0]).T.rename({0:'percentage null'})

data_null = pd.concat([data_null_total, data_null_percentage], axis=0).T
data_null.style.background_gradient()


Unnamed: 0,total null,percentage null
passenger_id,0,0.0
home_planet,201,2.31
cryo_sleep,217,2.5
cabin,199,2.29
destination,182,2.09
age,179,2.06
vip,203,2.34
room_service,181,2.08
food_court,183,2.11
shopping_mall,208,2.39


# Feature Extraction

## Passenger ID

In [12]:
data_id = data['passenger_id'].str.split('_', expand=True)
data_id = data_id.rename(columns={0:'passenger_group', 1:'passenger_num'})

data_id

Unnamed: 0,passenger_group,passenger_num
0,0001,01
1,0002,01
2,0003,01
3,0003,02
4,0004,01
...,...,...
8688,9276,01
8689,9278,01
8690,9279,01
8691,9280,01


## Cabin

In [13]:
data_cabin = data['cabin'].str.split('/', expand=True)
data_cabin = data_cabin.rename(columns={0:'cabin_deck', 1:'cabin_num', 2:'cabin_side'})

data_cabin

Unnamed: 0,cabin_deck,cabin_num,cabin_side
0,B,0,P
1,F,0,S
2,A,0,S
3,A,0,S
4,F,1,S
...,...,...,...
8688,A,98,P
8689,G,1499,S
8690,G,1500,S
8691,E,608,S


## Total Luxury Amenities

In [199]:
data_bill = data['room_service'] + data['food_court'] + data['shopping_mall'] + data['spa'] + data['vr_deck']
data_bill = pd.DataFrame(data_bill)
data_bill = data_bill.rename(columns={0:'total_bill'})

data_bill

Unnamed: 0,total_bill
0,0.0
1,736.0
2,10383.0
3,5176.0
4,1091.0
...,...
8688,8536.0
8689,0.0
8690,1873.0
8691,4637.0
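
Because the total is built with `+`, a row's `total_bill` is missing whenever any one of its amounts is missing. That is why `total_bill` later shows more nulls than any single amenity column. The sketch below illustrates the behaviour on a small toy frame (the values and the reduced set of three columns are made up for illustration) and contrasts it with `sum(axis=1)`, which skips missing values and returns a partial total instead.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), with one missing amount in two of the rows
toy = pd.DataFrame({
    'room_service': [0.0, 109.0, np.nan],
    'food_court':   [0.0, np.nan, 5.0],
    'spa':          [0.0, 549.0, 10.0],
})

# `+` propagates NaN: any missing component makes the total missing
total_plus = toy['room_service'] + toy['food_court'] + toy['spa']

# sum(axis=1) skips NaN by default, so it returns a partial total
total_skipna = toy.sum(axis=1)

print(total_plus.isna().sum())   # 2
print(total_skipna.tolist())     # [0.0, 658.0, 15.0]
```

Neither behaviour is right in every case. Keeping the total missing, as this notebook does, avoids treating a partial sum as if it were the full bill.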


## Updated Dataframe

In [200]:
data_update = pd.concat([data, data_id, data_cabin, data_bill], axis=1)

data_update.drop(columns=['passenger_id', 'cabin', 'name'], inplace=True)
data_update.head()

Unnamed: 0,home_planet,cryo_sleep,destination,age,vip,room_service,food_court,shopping_mall,spa,vr_deck,transported,passenger_group,passenger_num,cabin_deck,cabin_num,cabin_side,total_bill
0,Europa,False,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,1,1,B,0,P,0.0
1,Earth,False,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,2,1,F,0,S,736.0
2,Europa,False,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,3,1,A,0,S,10383.0
3,Europa,False,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,3,2,A,0,S,5176.0
4,Earth,False,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,4,1,F,1,S,1091.0


# Impute by Knowledge
Some of these approaches are inspired by Vincent Debout's comment on the [Kaggle discussion page](https://www.kaggle.com/competitions/spaceship-titanic/discussion/315987).

## Luxury Amenities

### Cryo Sleep

In [201]:
mask = data_update['cryo_sleep'] == 'TRUE'

In [202]:
data_update.loc[mask, 'room_service'].value_counts()

0.0    2969
Name: room_service, dtype: int64

In [203]:
data_update.loc[mask, 'food_court'].value_counts()

0.0    2967
Name: food_court, dtype: int64

In [204]:
data_update.loc[mask, 'shopping_mall'].value_counts()

0.0    2941
Name: shopping_mall, dtype: int64

In [205]:
data_update.loc[mask, 'spa'].value_counts()

0.0    2972
Name: spa, dtype: int64

In [206]:
data_update.loc[mask, 'vr_deck'].value_counts()

0.0    2975
Name: vr_deck, dtype: int64

In [207]:
data_update.loc[mask, 'total_bill'].value_counts()

0.0    2690
Name: total_bill, dtype: int64

As we can see, passengers in cryosleep did not spend anything on the luxury amenities: every non-null value is `0`. So, I will impute the null values in these columns with `0` for all passengers in cryosleep.

In [208]:
mask = data_update['cryo_sleep'] == 'TRUE'
column_luxury = ['room_service', 'food_court', 'shopping_mall', 'spa', 'vr_deck', 'total_bill']

data_update.loc[mask, column_luxury] = data_update.loc[mask, column_luxury].fillna(0)

In [209]:
data_update.isna().sum()

home_planet        201
cryo_sleep         217
destination        182
age                179
vip                203
room_service       113
food_court         113
shopping_mall      112
spa                118
vr_deck            126
transported          0
passenger_group      0
passenger_num        0
cabin_deck         199
cabin_num          199
cabin_side         199
total_bill         561
dtype: int64

Now, we have reduced the number of missing values in the luxury amenities features.
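
The pattern used here (check that every observed value in a subgroup is the same, then fill the missing ones with it) can be sketched as a small helper. `fill_if_constant` is a hypothetical name, not part of the notebook or of pandas, and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def fill_if_constant(df, mask, column):
    """Fill NaN in `column` for rows under `mask`, but only if the
    observed (non-null) values in those rows are all identical."""
    observed = df.loc[mask, column].dropna().unique()
    if len(observed) == 1:
        df.loc[mask, column] = df.loc[mask, column].fillna(observed[0])
        return True
    return False  # the rule does not hold, so nothing is changed

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'cryo_sleep': ['TRUE', 'TRUE', 'FALSE', 'FALSE', 'TRUE'],
    'spa': [0.0, np.nan, 120.0, np.nan, 0.0],
})

mask = toy['cryo_sleep'] == 'TRUE'
applied = fill_if_constant(toy, mask, 'spa')

print(applied)                  # True
print(toy['spa'].isna().sum())  # 1 (the non-cryosleep row stays missing)
```

Putting the check and the fill in one function means a rule is never applied to a subgroup where the data contradicts it.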

### Children

In [210]:
mask = data_update['age'] <= 12

data_update.loc[mask, 'total_bill'].value_counts()

0.0    763
Name: total_bill, dtype: int64

As we can see, children (age 12 or under) don't have any bill. So, I will impute their missing amenity values with `0`.

In [211]:
mask = data_update['age'] <= 12
column_luxury = ['room_service', 'food_court', 'shopping_mall', 'spa', 'vr_deck', 'total_bill']

data_update.loc[mask, column_luxury] = data_update.loc[mask, column_luxury].fillna(0)

In [212]:
data_update.isna().sum()

home_planet        201
cryo_sleep         217
destination        182
age                179
vip                203
room_service       107
food_court         106
shopping_mall      103
spa                114
vr_deck            107
transported          0
passenger_group      0
passenger_num        0
cabin_deck         199
cabin_num          199
cabin_side         199
total_bill         518
dtype: int64

Some missing values remain, but there are fewer of them.

## Destination

In [213]:
mask = (data_update['age'] > 12) & (data_update['cryo_sleep'] == 'FALSE') & (data_update['total_bill'] == 0)

data_update.loc[mask, 'destination'].value_counts()

TRAPPIST-1e    92
Name: destination, dtype: int64

In [214]:
mask = (data_update['age'] > 12) & (data_update['cryo_sleep'] == 'FALSE') & (data_update['total_bill'] == 0)

data_update.loc[mask, 'destination'] = data_update.loc[mask, 'destination'].fillna('TRAPPIST-1e')

Among the non-null rows, passengers who are not children, are not in cryosleep, and have no bill all have `TRAPPIST-1e` as their destination, so I fill their missing destinations with `TRAPPIST-1e`.

## VIP

### Earth

In [215]:
mask = data_update['home_planet'] == 'Earth'

data_update.loc[mask, 'vip'].value_counts()

FALSE    4487
Name: vip, dtype: int64

In [216]:
mask = data_update['home_planet'] == 'Earth'

data_update.loc[mask, 'vip'] = data_update.loc[mask, 'vip'].fillna('FALSE')

In [217]:
data_update.isna().sum()

home_planet        201
cryo_sleep         217
destination        180
age                179
vip                 88
room_service       107
food_court         106
shopping_mall      103
spa                114
vr_deck            107
transported          0
passenger_group      0
passenger_num        0
cabin_deck         199
cabin_num          199
cabin_side         199
total_bill         518
dtype: int64

### Mars

In [218]:
mask = (data_update['home_planet'] == 'Mars') & (data_update['age'] >= 18) & (data_update['cryo_sleep'] == 'FALSE') & (data_update['destination'] == '55 Cancri e')

data_update.loc[mask, 'vip'].value_counts()

FALSE    88
Name: vip, dtype: int64

In [219]:
mask = (data_update['home_planet'] == 'Mars') & (data_update['age'] >= 18) & (data_update['cryo_sleep'] == 'FALSE') & (data_update['destination'] == '55 Cancri e')

data_update.loc[mask, 'vip'] = data_update.loc[mask, 'vip'].fillna('FALSE')

In [220]:
data_update.isna().sum()

home_planet        201
cryo_sleep         217
destination        180
age                179
vip                 87
room_service       107
food_court         106
shopping_mall      103
spa                114
vr_deck            107
transported          0
passenger_group      0
passenger_num        0
cabin_deck         199
cabin_num          199
cabin_side         199
total_bill         518
dtype: int64

## Home Planet
People in the same group are assumed to have the same home planet, so a missing value is filled from another member of the group.

In [221]:
tmp = data_update.groupby('passenger_group')

# use the first non-null home planet in each group (NaN if the whole group is null)
data_update['home_planet'] = tmp['home_planet'].transform(lambda s: np.nan if s.isna().all()
                                                          else s.loc[s.first_valid_index()])

In [222]:
data_update.isna().sum()

home_planet        111
cryo_sleep         217
destination        180
age                179
vip                 87
room_service       107
food_court         106
shopping_mall      103
spa                114
vr_deck            107
transported          0
passenger_group      0
passenger_num        0
cabin_deck         199
cabin_num          199
cabin_side         199
total_bill         518
dtype: int64
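
The same-group assumption can be checked before it is relied on. The sketch below, on a made-up toy frame, counts the distinct non-null planets per group and then fills only the missing entries. This differs from the transform in the cell above, which replaces every value in a group with the group's first valid one. The column `home_planet_filled` is a hypothetical name used only here.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); group 0002 deliberately breaks the assumption
toy = pd.DataFrame({
    'passenger_group': ['0001', '0001', '0002', '0002', '0003'],
    'home_planet': ['Europa', np.nan, 'Earth', 'Mars', np.nan],
})

grouped = toy.groupby('passenger_group')['home_planet']

# Distinct non-null planets per group; a count above 1 contradicts the assumption
planets_per_group = grouped.nunique()
conflicting_groups = planets_per_group[planets_per_group > 1].index.tolist()

# Fill only the missing entries with the group's first non-null value
first_known = grouped.transform('first')
toy['home_planet_filled'] = toy['home_planet'].fillna(first_known)

print(conflicting_groups)                  # ['0002']
print(toy['home_planet_filled'].tolist())  # ['Europa', 'Europa', 'Earth', 'Mars', nan]
```

Group `0003` has no known planet at all, so its row stays missing and is left for a later imputation step.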

## Cabin Deck

In [223]:
mask = data_update['cabin_deck'] == 'T'

data_update.loc[mask]

Unnamed: 0,home_planet,cryo_sleep,destination,age,vip,room_service,food_court,shopping_mall,spa,vr_deck,transported,passenger_group,passenger_num,cabin_deck,cabin_num,cabin_side,total_bill
1004,,False,TRAPPIST-1e,35.0,False,415.0,1328.0,0.0,14.0,60.0,False,1071,1,T,0,P,1817.0
2254,Europa,False,TRAPPIST-1e,42.0,False,0.0,1829.0,2.0,3133.0,2447.0,False,2414,1,T,1,P,7411.0
2734,Europa,False,TRAPPIST-1e,33.0,False,0.0,28.0,0.0,6841.0,543.0,False,2935,1,T,2,P,7412.0
2763,Europa,False,TRAPPIST-1e,38.0,False,0.0,3135.0,0.0,26.0,3.0,True,2971,1,T,3,P,3164.0
4565,Europa,,TRAPPIST-1e,37.0,False,1721.0,667.0,,28.0,1362.0,False,4863,1,T,2,S,


The `T` cabin deck has only 5 rows. Could these passengers be the ship's crew?
<br><br>
I think so, because most of them come from `Europa` and none of them with a known value is in cryosleep. The other cabin decks are also named in order, from `A` to `G`.

In [224]:
mask = data_update['cabin_deck'] == 'T'

data_update.loc[mask, 'home_planet'] = data_update.loc[mask, 'home_planet'].fillna('Europa')
data_update.loc[mask, 'cryo_sleep'] = data_update.loc[mask, 'cryo_sleep'].fillna('FALSE')

In [225]:
data_update.isna().sum()

home_planet        110
cryo_sleep         216
destination        180
age                179
vip                 87
room_service       107
food_court         106
shopping_mall      103
spa                114
vr_deck            107
transported          0
passenger_group      0
passenger_num        0
cabin_deck         199
cabin_num          199
cabin_side         199
total_bill         518
dtype: int64

## Cabin Side
People in the same group are assumed to have the same cabin side, so a missing value is filled from another member of the group.

In [239]:
tmp = data_update.groupby('passenger_group')

# use the first non-null cabin side in each group (NaN if the whole group is null)
data_update['cabin_side'] = tmp['cabin_side'].transform(lambda s: np.nan if s.isna().all()
                                                        else s.loc[s.first_valid_index()])

In [240]:
data_update.isna().sum()

home_planet        110
cryo_sleep         216
destination        180
age                179
vip                 87
room_service       107
food_court         106
shopping_mall      103
spa                114
vr_deck            107
transported          0
passenger_group      0
passenger_num        0
cabin_deck         199
cabin_num          199
cabin_side          99
total_bill         518
dtype: int64

# Impute Using DataWig

## Train - Test Selection
- Train - contains only the rows with no null values
- Test - contains every row with at least one null value

In [241]:
data_train = data_update.dropna().copy()
data_test = data_update[data_update.isna().any(axis=1)].copy()

## Train - Test Split
This split is used to train DataWig and evaluate its performance, so it is drawn from `data_train` (the rows with no null values).

In [242]:
df_train, df_test = datawig.utils.random_split(data_train, split_ratios=[0.8, 0.2])
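
A held-out split like this makes it possible to score an imputer on values whose true answer is known. The sketch below shows the idea on a made-up toy frame. So that it runs without DataWig, a most-frequent-value predictor stands in for `datawig.SimpleImputer`, and pandas `sample` stands in for `datawig.utils.random_split`. The names `fit_part` and `holdout` are hypothetical.

```python
import pandas as pd

# Toy complete rows (hypothetical values) standing in for `data_train`
complete = pd.DataFrame({
    'cabin_deck':  ['F', 'F', 'G', 'G', 'F', 'G', 'F', 'G', 'F', 'G'],
    'home_planet': ['Earth', 'Earth', 'Earth', 'Earth', 'Mars',
                    'Earth', 'Earth', 'Earth', 'Mars', 'Earth'],
})

# 80/20 split, mirroring the ratios used above
fit_part = complete.sample(frac=0.8, random_state=0)
holdout = complete.drop(fit_part.index)

# Stand-in imputer: always predict the most frequent value seen in the fit part
most_frequent = fit_part['home_planet'].mode()[0]
predicted = pd.Series(most_frequent, index=holdout.index)

# Share of held-out rows where the prediction matches the true value
accuracy = (predicted == holdout['home_planet']).mean()
print(f'held-out accuracy: {accuracy:.2f}')
```

A baseline of this kind is also a useful yardstick: a learned imputer should score above it on the held-out rows.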

## DataWig Imputation

In [243]:
data_update.isna().sum()

home_planet        110
cryo_sleep         216
destination        180
age                179
vip                 87
room_service       107
food_court         106
shopping_mall      103
spa                114
vr_deck            107
transported          0
passenger_group      0
passenger_num        0
cabin_deck         199
cabin_num          199
cabin_side          99
total_bill         518
dtype: int64

Three columns (`transported`, `passenger_group`, and `passenger_num`) have no null values, so they do not need to be imputed.

In [244]:
column_list = data_test.drop(columns=['transported', 'passenger_group', 'passenger_num']).columns.tolist()
column_list

['home_planet',
 'cryo_sleep',
 'destination',
 'age',
 'vip',
 'room_service',
 'food_court',
 'shopping_mall',
 'spa',
 'vr_deck',
 'cabin_deck',
 'cabin_num',
 'cabin_side',
 'total_bill']

In [245]:
for col in column_list:
  print('=========================')
  print('=========================')
  print('Imputation for', col)

  drive = '/content/drive/MyDrive/My Projects/2022/Titanic Spaceship/imputer/'
  path = drive + f'imputed_{col}'

  # create a single imputer for specific column
  imputer = datawig.SimpleImputer(
      input_columns = df_train.drop(columns=[col]).columns.tolist(),
      output_column = col,
      output_path = path
  )

  # fit the imputer model
  imputer.fit(data_train)

  # make predictions on missing values
  predictions = imputer.predict(data_test)

  # impute the missing values
  mask = data_test[col].isna()

  imputed_col = f'{col}_imputed'
  data_test.loc[mask, col] = predictions.loc[mask, imputed_col]

Imputation for home_planet


2022-05-03 12:46:44,781 [INFO]  
2022-05-03 12:46:48,753 [INFO]  Epoch[0] Batch [0-205]	Speed: 834.50 samples/sec	cross-entropy=0.667613	home_planet-accuracy=0.709648
2022-05-03 12:46:52,644 [INFO]  Epoch[0] Train-cross-entropy=0.582690
2022-05-03 12:46:52,647 [INFO]  Epoch[0] Train-home_planet-accuracy=0.756555
2022-05-03 12:46:52,653 [INFO]  Epoch[0] Time cost=7.852
2022-05-03 12:46:52,675 [INFO]  Saved checkpoint to "/content/drive/MyDrive/My Projects/2022/Titanic Spaceship/imputer/imputed_home_planet/model-0000.params"
2022-05-03 12:46:53,402 [INFO]  Epoch[0] Validation-cross-entropy=0.443722
2022-05-03 12:46:53,408 [INFO]  Epoch[0] Validation-home_planet-accuracy=0.809783
2022-05-03 12:46:57,365 [INFO]  Epoch[1] Batch [0-205]	Speed: 833.77 samples/sec	cross-entropy=0.410625	home_planet-accuracy=0.853155
2022-05-03 12:47:01,308 [INFO]  Epoch[1] Train-cross-entropy=0.386081
2022-05-03 12:47:01,314 [INFO]  Epoch[1] Train-home_planet-accuracy=0.867226
2022-05-03 12:47:01,319 [INFO]  E

Imputation for cryo_sleep
Imputation for destination
Imputation for age
Imputation for vip
Imputation for room_service
Imputation for food_court
Imputation for shopping_mall
Imputation for spa
Imputation for vr_deck
Imputation for cabin_deck
Imputation for cabin_num
Imputation for cabin_side
Imputation for total_bill

In [246]:
data_test.isna().sum()

home_planet        0
cryo_sleep         0
destination        0
age                0
vip                0
room_service       0
food_court         0
shopping_mall      0
spa                0
vr_deck            0
transported        0
passenger_group    0
passenger_num      0
cabin_deck         0
cabin_num          0
cabin_side         0
total_bill         0
dtype: int64

No null values!

## Fix Imputation Result

In [249]:
data_test

Unnamed: 0,home_planet,cryo_sleep,destination,age,vip,room_service,food_court,shopping_mall,spa,vr_deck,transported,passenger_group,passenger_num,cabin_deck,cabin_num,cabin_side,total_bill
15,Earth,FALSE,TRAPPIST-1e,31.0,FALSE,32.000,0.0,876.000,0.000,0.000,FALSE,0012,01,F,149,P,908.000
16,Mars,FALSE,55 Cancri e,27.0,FALSE,1286.000,122.0,49.899,0.000,0.000,FALSE,0014,01,F,3,P,1298.493
35,Mars,FALSE,TRAPPIST-1e,20.0,FALSE,-908.989,0.0,1750.000,990.000,0.000,TRUE,0031,03,F,9,P,1527.423
47,Mars,TRUE,TRAPPIST-1e,19.0,FALSE,0.000,0.0,0.000,0.000,0.000,TRUE,0045,02,F,10,P,0.000
48,Earth,FALSE,55 Cancri e,35.0,FALSE,790.000,0.0,0.000,497.623,0.000,FALSE,0050,01,E,1,S,1259.086
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8667,Europa,FALSE,TRAPPIST-1e,29.0,FALSE,0.000,2972.0,-1022.105,28.000,188.000,TRUE,9250,01,E,597,P,2305.546
8674,Earth,FALSE,TRAPPIST-1e,13.0,FALSE,39.000,0.0,1085.000,24.000,0.000,FALSE,9257,01,F,1892,P,1148.000
8675,Earth,FALSE,TRAPPIST-1e,44.0,FALSE,1030.000,1015.0,0.000,11.000,-843.748,TRUE,9259,01,F,1893,P,1159.149
8684,Earth,TRUE,TRAPPIST-1e,23.0,FALSE,0.000,0.0,0.000,0.000,0.000,TRUE,9274,01,G,1508,P,0.000


As we can see, some imputed values are 'wrong', especially in the luxury amenities columns: a bill cannot be negative. Therefore, I will replace the negative values with `0`.

In [256]:
data_test_update = data_test.copy()
column_luxury = ['room_service', 'food_court', 'shopping_mall', 'spa', 'vr_deck', 'total_bill']

for col in column_luxury:
  mask = data_test_update[col] < 0

  data_test_update.loc[mask, col] = 0

data_test_update

Unnamed: 0,home_planet,cryo_sleep,destination,age,vip,room_service,food_court,shopping_mall,spa,vr_deck,transported,passenger_group,passenger_num,cabin_deck,cabin_num,cabin_side,total_bill
15,Earth,FALSE,TRAPPIST-1e,31.0,FALSE,32.0,0.0,876.000,0.000,0.0,FALSE,0012,01,F,149,P,908.000
16,Mars,FALSE,55 Cancri e,27.0,FALSE,1286.0,122.0,49.899,0.000,0.0,FALSE,0014,01,F,3,P,1298.493
35,Mars,FALSE,TRAPPIST-1e,20.0,FALSE,0.0,0.0,1750.000,990.000,0.0,TRUE,0031,03,F,9,P,1527.423
47,Mars,TRUE,TRAPPIST-1e,19.0,FALSE,0.0,0.0,0.000,0.000,0.0,TRUE,0045,02,F,10,P,0.000
48,Earth,FALSE,55 Cancri e,35.0,FALSE,790.0,0.0,0.000,497.623,0.0,FALSE,0050,01,E,1,S,1259.086
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8667,Europa,FALSE,TRAPPIST-1e,29.0,FALSE,0.0,2972.0,0.000,28.000,188.0,TRUE,9250,01,E,597,P,2305.546
8674,Earth,FALSE,TRAPPIST-1e,13.0,FALSE,39.0,0.0,1085.000,24.000,0.0,FALSE,9257,01,F,1892,P,1148.000
8675,Earth,FALSE,TRAPPIST-1e,44.0,FALSE,1030.0,1015.0,0.000,11.000,0.0,TRUE,9259,01,F,1893,P,1159.149
8684,Earth,TRUE,TRAPPIST-1e,23.0,FALSE,0.0,0.0,0.000,0.000,0.0,TRUE,9274,01,G,1508,P,0.000
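
The negative-value fix above can also be written as a single vectorized `clip`. Since each column is imputed on its own, `total_bill` may no longer equal the sum of its parts (row 35 in the output above is one such case). The sketch below adds an optional step, not done in this notebook, that rebuilds the total after clipping. The two toy rows reuse the values of rows 35 and 15 shown above.

```python
import pandas as pd

# Two rows copied from the imputed output above (rows 35 and 15)
toy = pd.DataFrame({
    'room_service':  [-908.989, 32.0],
    'food_court':    [0.0, 0.0],
    'shopping_mall': [1750.0, 876.0],
    'spa':           [990.0, 0.0],
    'vr_deck':       [0.0, 0.0],
    'total_bill':    [1527.423, 908.0],
})
amenities = ['room_service', 'food_court', 'shopping_mall', 'spa', 'vr_deck']

# clip(lower=0) replaces every negative value with 0 in one step
toy[amenities] = toy[amenities].clip(lower=0)

# Optional (an assumption, not the notebook's method): rebuild the total from its parts
toy['total_bill'] = toy[amenities].sum(axis=1)

print(toy['total_bill'].tolist())  # [2740.0, 908.0]
```

Whether to rebuild the total is a judgment call: it keeps the columns consistent with each other, but it discards the value DataWig predicted for `total_bill`.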


## Imputed Dataframe

In [257]:
data_imputed = pd.concat([data_train, data_test_update], axis=0)

I will rebuild the cabin column in its original `deck/num/side` format.

In [258]:
data_imputed['cabin_update'] = data_imputed['cabin_deck'] + '/' + data_imputed['cabin_num'].astype(int).astype(str) + '/' + data_imputed['cabin_side']

data_imputed.head()

Unnamed: 0,home_planet,cryo_sleep,destination,age,vip,room_service,food_court,shopping_mall,spa,vr_deck,transported,passenger_group,passenger_num,cabin_deck,cabin_num,cabin_side,total_bill,cabin_update
0,Europa,False,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,1,1,B,0,P,0.0,B/0/P
1,Earth,False,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,2,1,F,0,S,736.0,F/0/S
2,Europa,False,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,3,1,A,0,S,10383.0,A/0/S
3,Europa,False,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,3,2,A,0,S,5176.0,A/0/S
4,Earth,False,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,4,1,F,1,S,1091.0,F/1/S


In [259]:
data_imputed['cabin_update'].value_counts()

G/1476/S    23
G/1476/P    18
F/260/S     11
G/1046/S     9
B/201/P      9
            ..
G/298/S      1
D/6/S        1
G/165/P      1
B/105/S      1
B/241/S      1
Name: cabin_update, Length: 6577, dtype: int64

Before taking this approach, I used simple imputation (most frequent value and median) for the cabin deck, number, and side. The result? A single cabin ended up with 300 passengers, which doesn't make sense.
<br><br>
With DataWig, the highest number of passengers in a cabin is more plausible (though it can still be improved).
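
The occupancy problem with simple imputation can be reproduced on a tiny example. The sketch below is a simplification under stated assumptions: it fills whole cabin strings with the most frequent one rather than imputing deck, number, and side separately, and the cabin values are made up.

```python
import numpy as np
import pandas as pd

# Toy cabin column (hypothetical values) with three missing entries
cabins = pd.Series(['G/1/S', 'G/1/S', 'F/2/P', 'E/3/S', np.nan, np.nan, np.nan])

# Largest number of passengers sharing one cabin before imputation
max_before = cabins.value_counts().max()

# Most-frequent imputation sends every missing passenger to the same cabin
filled = cabins.fillna(cabins.mode()[0])
max_after = filled.value_counts().max()

print(max_before, max_after)  # 2 5
```

Comparing the maximum occupancy before and after imputation is a quick sanity check that works for any imputation method.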

## Save Clean Dataset

In [261]:
drive = '/content/drive/MyDrive/My Projects/2022/Titanic Spaceship/imputer/'
path = drive + 'train_clean.csv'

data_imputed.sort_index().to_csv(path)