<a href="https://colab.research.google.com/github/Vishu-Gupta/MLProjects/blob/main/04%20Spaceship_Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook works through the Kaggle Spaceship Titanic competition:

https://www.kaggle.com/c/spaceship-titanic/overview


## Connecting with Kaggle and getting the dataset

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
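# -p avoids an error when ~/.kaggle already exists (e.g. when the cell is re-run)
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/
# restrict the API token to the current user
!chmod 600 ~/.kaggle/kaggle.json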
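The setup cell tends to be re-run often in Colab, and a plain `mkdir` fails once the directory exists. The same steps can be sketched in Python so that they are safe to repeat. The helper name `install_kaggle_token` is hypothetical, and the demonstration uses a throwaway directory with a dummy token rather than the real `~/.kaggle`.

```python
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token_path, config_dir):
    """Copy a Kaggle API token into config_dir and restrict its permissions."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    target = config_dir / "kaggle.json"
    shutil.copy(token_path, target)
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)  # owner read/write only (0o600)
    return target


# Demonstration with a dummy token in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    dummy = Path(tmp) / "source.json"
    dummy.write_text('{"username": "demo", "key": "demo"}')
    installed = install_kaggle_token(dummy, Path(tmp) / ".kaggle")
    # Calling it a second time succeeds instead of raising "File exists".
    installed_again = install_kaggle_token(dummy, Path(tmp) / ".kaggle")
    token_exists = installed_again.exists()
    token_name = installed_again.name
```

In the notebook itself, the call would presumably take `/content/drive/MyDrive/kaggle.json` and `Path.home() / ".kaggle"` as arguments.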


In [3]:
!kaggle competitions download spaceship-titanic

train.csv: Skipping, found more recently modified local copy (use --force to force download)
test.csv: Skipping, found more recently modified local copy (use --force to force download)
sample_submission.csv: Skipping, found more recently modified local copy (use --force to force download)


## Importing Libraries

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Loading train and test data

In [5]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [6]:
# Shapes of the dataset
df_train.shape

(8693, 14)

In [7]:
df_test.shape

(4277, 13)
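The train set has 14 columns and the test set has 13. A quick way to confirm which column accounts for the difference is to compare the column lists. The sketch below uses two small stand-in frames (`train` and `test` are illustrative, not the real data); on the real data the same expression would be applied to `df_train` and `df_test`.

```python
import pandas as pd

# Toy stand-ins for df_train / df_test (the real frames come from train.csv and test.csv).
train = pd.DataFrame({"PassengerId": ["0001_01"], "Age": [39.0], "Transported": [False]})
test = pd.DataFrame({"PassengerId": ["0013_01"], "Age": [27.0]})

# Columns present in train but absent from test.
train_only = [c for c in train.columns if c not in test.columns]
print(train_only)
```

If the only entry is the target column, the two files share the same feature set.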

In [8]:
df_train.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


**Data Description: File and Data Field Descriptions**

PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.

CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.

Destination - The planet the passenger will be debarking to.

Age - The age of the passenger.

VIP - Whether the passenger has paid for special VIP service during the voyage.

RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.

Name - The first and last names of the passenger.

Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
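Because `PassengerId` encodes the travel group in its `gggg` prefix, the group and its size can be derived from the ID alone. The following is a minimal sketch on the five IDs shown in the preview above; the column names `group`, `group_member` and `group_size` are illustrative choices, not part of the dataset.

```python
import pandas as pd

df = pd.DataFrame({"PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02", "0004_01"]})

# Split "gggg_pp" into the group id and the passenger's number within the group.
df[["group", "group_member"]] = df["PassengerId"].str.split("_", expand=True)

# Number of passengers sharing each group id.
df["group_size"] = df.groupby("group")["PassengerId"].transform("count")
print(df)
```

Here passengers `0003_01` and `0003_02` end up with a `group_size` of 2, while the others travel alone.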

## EDA

In [9]:
df_train.isnull().sum() # Missing Values in train

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64
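Raw counts are easier to judge as a share of the 8693 rows; for example, the 217 missing `CryoSleep` values are about 2.5% of the data. A small sketch of the percentage view, shown on a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; on the real data, use df_train instead.
df = pd.DataFrame({
    "HomePlanet": ["Europa", None, "Earth", "Earth"],
    "Age": [39.0, 24.0, np.nan, np.nan],
    "Transported": [False, True, False, True],
})

# isnull().mean() gives the fraction of missing values per column.
missing_pct = (df.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```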

In [10]:
df_train.nunique() # number of distinct values per feature

PassengerId     8693
HomePlanet         3
CryoSleep          2
Cabin           6560
Destination        3
Age               80
VIP                2
RoomService     1273
FoodCourt       1507
ShoppingMall    1115
Spa             1327
VRDeck          1306
Name            8473
Transported        2
dtype: int64

In [11]:
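# Cabin has the form deck/num/side; each part may be a useful feature on its own, so split it into three columns.
# Missing cabins are filled with '//' first, so all three parts come out as empty strings for those rows.
df_train[['deck','cabin_num','side']] = df_train['Cabin'].fillna('//').str.split('/',expand=True)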

In [12]:
df_train.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported | deck | cabin_num | side |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False | B | 0 | P |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True | F | 0 | S |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False | A | 0 | S |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False | A | 0 | S |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True | F | 1 | S |
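The test set will need the same `Cabin` split, so it can help to wrap the step in a function. The sketch below is a variant rather than the exact code used in this notebook: it skips the `'//'` placeholder and lets `str.split` propagate `NaN` for missing cabins directly. The function name `split_cabin` is hypothetical, and the sketch assumes at least one non-missing `Cabin` value so that the split yields three columns.

```python
import numpy as np
import pandas as pd


def split_cabin(df):
    """Return a copy of df with Cabin replaced by deck, cabin_num and side."""
    out = df.copy()
    # Rows with a missing Cabin come out as NaN in all three parts.
    parts = out["Cabin"].str.split("/", expand=True)
    parts.columns = ["deck", "cabin_num", "side"]
    return out.drop(columns="Cabin").join(parts)


# Toy frame with one missing cabin.
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0012_01", "0004_01"],
    "Cabin": ["B/0/P", np.nan, "F/1/S"],
})
result = split_cabin(toy)
print(result)
```

The same function could then be applied to both `df_train` and `df_test`, which keeps the two frames in step.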


In [13]:
df_train.dropna().shape[0]/df_train.shape[0] # fraction of rows kept if every row with a missing value is dropped (~76% kept, so ~24% lost)

0.7599217761417232

In [14]:
df_train.info() # data types of all features

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
 14  deck          8693 non-null   object 
 15  cabin_num     8693 non-null   object 
 16  side          8693 non-null   object 
dtypes: bool(1), float64(6), object(10)
memory usage: 1.1+ MB


In [15]:
df_train['deck'].unique()

array(['B', 'F', 'A', 'G', '', 'E', 'D', 'C', 'T'], dtype=object)

In [16]:
df_train['side'].unique()

array(['P', 'S', ''], dtype=object)

In [17]:
df_train['cabin_num'].unique()

array(['0', '1', '2', ..., '1892', '1893', '1894'], dtype=object)

In [22]:
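# Turn the empty-string placeholders back into proper missing values (np.NaN was removed in NumPy 2.0; np.nan works everywhere)
df_train[['deck','cabin_num','side']] = df_train[['deck','cabin_num','side']].replace('',np.nan)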

In [23]:
df_train[df_train['Cabin'].isnull()][['deck','cabin_num','side']].head()

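| | deck | cabin_num | side |
|---|---|---|---|
| 15 | NaN | NaN | NaN |
| 93 | NaN | NaN | NaN |
| 103 | NaN | NaN | NaN |
| 222 | NaN | NaN | NaN |
| 227 | NaN | NaN | NaN |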
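The `info()` output lists `cabin_num` as `object`, because the split produces strings such as `'0'` and `'1894'`. If the cabin number is to be used as a numeric feature, it would need converting. A minimal sketch on a toy Series, using `pd.to_numeric`:

```python
import numpy as np
import pandas as pd

# Toy values mirroring the string cabin numbers, with one missing entry.
cabin_num = pd.Series(["0", "1", np.nan, "1894"], dtype="object")

# errors="coerce" turns anything unparseable into NaN instead of raising.
cabin_num_numeric = pd.to_numeric(cabin_num, errors="coerce")
print(cabin_num_numeric.dtype)
```

The result is a float column because of the missing entry; an integer dtype would only be possible after imputing or with a nullable integer type.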


In [24]:
df_train.drop('Cabin',axis=1,inplace=True) # Cabin is now redundant with deck, cabin_num and side

In [26]:
df_train[df_train['Spa'].isnull()].head()

| | PassengerId | HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported | deck | cabin_num | side |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48 | 0050_01 | Earth | False | 55 Cancri e | 35.0 | False | 790.0 | 0.0 | 0.0 | NaN | 0.0 | Sony Lancis | False | E | 1 | S |
| 143 | 0164_01 | Earth | False | TRAPPIST-1e | 57.0 | False | 50.0 | 1688.0 | 0.0 | NaN | 135.0 | Fany Hutchinton | True | G | 28 | S |
| 245 | 0265_01 | Europa | True | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | NaN | 0.0 | Etair Herpumble | True | D | 8 | S |
| 269 | 0294_01 | Europa | True | TRAPPIST-1e | 50.0 | False | 0.0 | 0.0 | 0.0 | NaN | 0.0 | Phonons Roforhauge | True | B | 8 | S |
| 289 | 0320_01 | Earth | False | TRAPPIST-1e | 18.0 | False | 0.0 | 2.0 | 0.0 | NaN | 0.0 | Breney Bellarkerd | False | G | 44 | S |


In [27]:
df_train.describe()

| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|
| count | 8514.0 | 8512.0 | 8510.0 | 8485.0 | 8510.0 | 8505.0 |
| mean | 28.82793 | 224.687617 | 458.077203 | 173.729169 | 311.138778 | 304.854791 |
| std | 14.489021 | 666.717663 | 1611.48924 | 604.696458 | 1136.705535 | 1145.717189 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 19.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 27.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 38.0 | 47.0 | 76.0 | 27.0 | 59.0 | 46.0 |
| max | 79.0 | 14327.0 | 29813.0 | 23492.0 | 22408.0 | 24133.0 |


In [30]:
# For the luxury expenditure columns (Spa, RoomService, FoodCourt, ShoppingMall, VRDeck),
# replace missing values with 0, on the assumption that a missing bill means no expenditure.
spend_cols = ['Spa','RoomService','FoodCourt','ShoppingMall','VRDeck']
df_train[spend_cols] = df_train[spend_cols].fillna(0)
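Filling with 0 matches the median of every expenditure column, but it also hides which values were imputed. One option, sketched below on made-up rows, is to record a flag before filling and then derive a total. The column names `spend_was_missing` and `total_spend` are illustrative assumptions, not part of the dataset or of the original analysis.

```python
import numpy as np
import pandas as pd

spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Toy rows (made-up values, not taken from train.csv).
df = pd.DataFrame({
    "CryoSleep": [True, False, True, False],
    "RoomService": [0.0, 109.0, np.nan, np.nan],
    "FoodCourt": [0.0, 9.0, 0.0, 70.0],
    "ShoppingMall": [0.0, 25.0, 0.0, 151.0],
    "Spa": [np.nan, 549.0, 0.0, 565.0],
    "VRDeck": [0.0, 44.0, 0.0, 2.0],
})

# Flag rows where any spend value was missing, so the imputation stays visible.
df["spend_was_missing"] = df[spend_cols].isnull().any(axis=1)

# Same fill as in the cell above: treat a missing bill as no expenditure.
df[spend_cols] = df[spend_cols].fillna(0)

# Total billed across the five amenities.
df["total_spend"] = df[spend_cols].sum(axis=1)
print(df[["CryoSleep", "spend_was_missing", "total_spend"]])
```

For passengers in cryosleep the zero fill fits the data description, since they are confined to their cabins. For awake passengers a zero may understate the true bill, and the flag lets a later model treat those rows differently.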

In [31]:
df_train.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Destination     182
Age             179
VIP             203
RoomService       0
FoodCourt         0
ShoppingMall      0
Spa               0
VRDeck            0
Name            200
Transported       0
deck            199
cabin_num       199
side            199
dtype: int64