<a href="https://colab.research.google.com/github/Vishu-Gupta/MLProjects/blob/main/01%20Kaggle%20Projects/04%20Spaceship_Titanic/Main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook works through the Kaggle Spaceship Titanic competition:

https://www.kaggle.com/c/spaceship-titanic/overview


## Connecting with Kaggle and getting the dataset

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [3]:
!kaggle competitions download spaceship-titanic

Downloading test.csv to /content
  0% 0.00/364k [00:00<?, ?B/s]
100% 364k/364k [00:00<00:00, 53.0MB/s]
Downloading sample_submission.csv to /content
  0% 0.00/58.5k [00:00<?, ?B/s]
100% 58.5k/58.5k [00:00<00:00, 66.0MB/s]
Downloading train.csv to /content
  0% 0.00/787k [00:00<?, ?B/s]
100% 787k/787k [00:00<00:00, 52.1MB/s]


## Importing Libraries

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Loading train and test data.

In [5]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [6]:
# Shapes of the dataset
df_train.shape

(8693, 14)

In [8]:
df_train.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Data description: file and data field descriptions**

PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.

CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.

Destination - The planet the passenger will be debarking to.

Age - The age of the passenger.

VIP - Whether the passenger has paid for special VIP service during the voyage.

RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.

Name - The first and last names of the passenger.

Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

## EDA

In [9]:
df_train.isnull().sum() # Missing Values in train

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [10]:
df_train.nunique() # no of distinct values for each feature

PassengerId     8693
HomePlanet         3
CryoSleep          2
Cabin           6560
Destination        3
Age               80
VIP                2
RoomService     1273
FoodCourt       1507
ShoppingMall    1115
Spa             1327
VRDeck          1306
Name            8473
Transported        2
dtype: int64

In [11]:
# Cabin combines deck/num/side; each part could be a useful feature on its own, so split it into three columns
df_train[['deck','cabin_num','side']] = df_train['Cabin'].fillna('//').str.split('/',expand=True)
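The split above relies on `'//'` breaking into three empty strings, so rows with a missing `Cabin` still produce three columns. A minimal sketch on a made-up three-row frame (the `toy` data is an assumption for illustration, not taken from `train.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are invented for illustration.
toy = pd.DataFrame({"Cabin": ["B/0/P", None, "F/12/S"]})

# '//' splits into three empty strings, so a missing Cabin still yields three columns.
parts = toy["Cabin"].fillna("//").str.split("/", expand=True)
parts.columns = ["deck", "cabin_num", "side"]

# Turn the empty placeholders back into real missing values.
parts = parts.replace("", np.nan)
print(parts)
```

The empty strings have to be converted back to NaN before imputing, otherwise `''` is treated as a category of its own.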

In [12]:
df_train.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,deck,cabin_num,side
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,B,0,P
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,F,0,S
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,A,0,S
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,A,0,S
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,F,1,S


In [13]:
df_train.dropna().shape[0]/df_train.shape[0] # fraction of rows kept if every row with a missing value is dropped (~76% kept, so ~24% would be lost)

0.7599217761417232

In [14]:
df_train.info() # data types of all features

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
 14  deck          8693 non-null   object 
 15  cabin_num     8693 non-null   object 
 16  side          8693 non-null   object 
dtypes: bool(1), float64(6), object(10)
memory usage: 1.1+ MB


In [15]:
df_train['deck'].unique()

array(['B', 'F', 'A', 'G', '', 'E', 'D', 'C', 'T'], dtype=object)

In [16]:
df_train['side'].unique()

array(['P', 'S', ''], dtype=object)

In [17]:
df_train['cabin_num'].unique()

array(['0', '1', '2', ..., '1892', '1893', '1894'], dtype=object)

In [18]:
df_train[['deck','cabin_num','side']] = df_train[['deck','cabin_num','side']].replace('',np.nan)  # np.nan: the np.NaN alias was removed in NumPy 2.0

In [19]:
df_train[df_train['Cabin'].isnull()][['deck','cabin_num','side']].head()

Unnamed: 0,deck,cabin_num,side
15,,,
93,,,
103,,,
222,,,
227,,,


In [20]:
df_train.drop('Cabin',axis=1,inplace=True) # dropping Cabin column 

In [21]:
df_train[df_train['Spa'].isnull()].head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,deck,cabin_num,side
48,0050_01,Earth,False,55 Cancri e,35.0,False,790.0,0.0,0.0,,0.0,Sony Lancis,False,E,1,S
143,0164_01,Earth,False,TRAPPIST-1e,57.0,False,50.0,1688.0,0.0,,135.0,Fany Hutchinton,True,G,28,S
245,0265_01,Europa,True,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,,0.0,Etair Herpumble,True,D,8,S
269,0294_01,Europa,True,TRAPPIST-1e,50.0,False,0.0,0.0,0.0,,0.0,Phonons Roforhauge,True,B,8,S
289,0320_01,Earth,False,TRAPPIST-1e,18.0,False,0.0,2.0,0.0,,0.0,Breney Bellarkerd,False,G,44,S


In [22]:
df_train.describe()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
count,8514.0,8512.0,8510.0,8485.0,8510.0,8505.0
mean,28.82793,224.687617,458.077203,173.729169,311.138778,304.854791
std,14.489021,666.717663,1611.48924,604.696458,1136.705535,1145.717189
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,19.0,0.0,0.0,0.0,0.0,0.0
50%,27.0,0.0,0.0,0.0,0.0,0.0
75%,38.0,47.0,76.0,27.0,59.0,46.0
max,79.0,14327.0,29813.0,23492.0,22408.0,24133.0


In [23]:
# For the luxury expenditure columns (Spa, RoomService, FoodCourt, ShoppingMall and VRDeck),
# missing values are replaced with 0, i.e. treated as no expenditure
df_train[['Spa','RoomService','FoodCourt','ShoppingMall','VRDeck']]= df_train[['Spa','RoomService','FoodCourt','ShoppingMall','VRDeck']].fillna(0)

In [24]:
df_train.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Destination     182
Age             179
VIP             203
RoomService       0
FoodCourt         0
ShoppingMall      0
Spa               0
VRDeck            0
Name            200
Transported       0
deck            199
cabin_num       199
side            199
dtype: int64

In [25]:
df_train['VIP'].value_counts(normalize=True)

False    0.976561
True     0.023439
Name: VIP, dtype: float64

In [26]:
## Treat passengers with a missing VIP flag as non-VIP (about 98% of the known values are False)
df_train['VIP'] = df_train['VIP'].fillna(False)

In [27]:
df_train['deck'].value_counts(normalize=True)

F    0.328938
G    0.301271
E    0.103132
B    0.091712
C    0.087944
D    0.056275
A    0.030139
T    0.000589
Name: deck, dtype: float64

In [28]:
df_train['side'].value_counts()

S    4288
P    4206
Name: side, dtype: int64

In [29]:
df_train['HomePlanet'].value_counts(normalize=True)

Earth     0.541922
Europa    0.250942
Mars      0.207136
Name: HomePlanet, dtype: float64

In [30]:
## Fill missing HomePlanet with the most frequent value (Earth)
df_train['HomePlanet'] = df_train['HomePlanet'].fillna(df_train['HomePlanet'].mode()[0])

In [31]:
df_train['Age'].median()

27.0

In [32]:
df_train['Age'] = df_train['Age'].fillna(df_train['Age'].median())

In [33]:
df_train.isnull().sum()

PassengerId       0
HomePlanet        0
CryoSleep       217
Destination     182
Age               0
VIP               0
RoomService       0
FoodCourt         0
ShoppingMall      0
Spa               0
VRDeck            0
Name            200
Transported       0
deck            199
cabin_num       199
side            199
dtype: int64

In [34]:
df_train['CryoSleep']=df_train['CryoSleep'].fillna(df_train['CryoSleep'].mode()[0])

In [35]:
impute=dict()
impute['deck'] = df_train['deck'].mode()[0]
impute['cabin_num'] = df_train['cabin_num'].mode()[0]
impute['side'] = df_train['side'].mode()[0]
impute['Destination'] = df_train['Destination'].mode()[0]

In [36]:
impute

{'Destination': 'TRAPPIST-1e', 'cabin_num': '82', 'deck': 'F', 'side': 'S'}

In [37]:
for k in impute.keys():
  df_train[k] = df_train[k].fillna(impute[k]) 

In [38]:
df_train.isnull().sum()

PassengerId       0
HomePlanet        0
CryoSleep         0
Destination       0
Age               0
VIP               0
RoomService       0
FoodCourt         0
ShoppingMall      0
Spa               0
VRDeck            0
Name            200
Transported       0
deck              0
cabin_num         0
side              0
dtype: int64

In [39]:
## check for missing values in test data
df_test.isnull().sum()

PassengerId       0
HomePlanet       87
CryoSleep        93
Cabin           100
Destination      92
Age              91
VIP              93
RoomService      82
FoodCourt       106
ShoppingMall     98
Spa             101
VRDeck           80
Name             94
dtype: int64

In [40]:
## Apply the same conversions to the test data, reusing the fill values learned from the training data
df_test[['deck','cabin_num','side']] = df_test['Cabin'].fillna('//').str.split('/',expand=True)
df_test[['deck','cabin_num','side']] = df_test[['deck','cabin_num','side']].replace('',np.nan)
df_test[['Spa','RoomService','FoodCourt','ShoppingMall','VRDeck']]= df_test[['Spa','RoomService','FoodCourt','ShoppingMall','VRDeck']].fillna(0)
df_test['VIP'] = df_test['VIP'].fillna(False)
df_test['HomePlanet'] = df_test['HomePlanet'].fillna('Earth')
df_test['Age'] = df_test['Age'].fillna(27)
df_test['CryoSleep']=df_test['CryoSleep'].fillna(False)
for k in impute.keys():
  df_test[k] = df_test[k].fillna(impute[k]) 
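The cell above hard-codes the values found on the training data (`'Earth'`, `27`, `False`). One way to avoid retyping them is to collect every fill value from the training frame once and apply the same dictionary to both frames. The sketch below is not the notebook's code: the helper names `learn_fill_values` and `apply_fill_values` and the toy frames are assumptions made for illustration.

```python
import pandas as pd

def learn_fill_values(train, median_cols, mode_cols):
    """Collect one fill value per column, using the training frame only."""
    fills = {c: train[c].median() for c in median_cols}
    fills.update({c: train[c].mode()[0] for c in mode_cols})
    return fills

def apply_fill_values(df, fills):
    """Return a copy of df with missing values replaced by the learned values."""
    return df.fillna(value=fills)

# Toy frames standing in for df_train / df_test (invented values).
toy_train = pd.DataFrame({"Age": [20.0, 30.0, None, 40.0],
                          "HomePlanet": ["Earth", "Earth", "Mars", None]})
toy_test = pd.DataFrame({"Age": [None, 50.0],
                         "HomePlanet": [None, "Europa"]})

fills = learn_fill_values(toy_train, median_cols=["Age"], mode_cols=["HomePlanet"])
toy_train_filled = apply_fill_values(toy_train, fills)
toy_test_filled = apply_fill_values(toy_test, fills)
print(fills)
```

Learning the values from the training frame only keeps the test data out of the preprocessing decisions, which matches what the notebook does by hand.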

In [41]:
df_train.drop('Name',axis=1,inplace=True)
df_test.drop('Cabin',axis=1,inplace=True)
df_test.drop('Name',axis=1,inplace=True)

In [42]:
df_train.isnull().sum()

PassengerId     0
HomePlanet      0
CryoSleep       0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Transported     0
deck            0
cabin_num       0
side            0
dtype: int64

In [43]:
df_test.isnull().sum()

PassengerId     0
HomePlanet      0
CryoSleep       0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
deck            0
cabin_num       0
side            0
dtype: int64

In [44]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8693 non-null   object 
 2   CryoSleep     8693 non-null   bool   
 3   Destination   8693 non-null   object 
 4   Age           8693 non-null   float64
 5   VIP           8693 non-null   bool   
 6   RoomService   8693 non-null   float64
 7   FoodCourt     8693 non-null   float64
 8   ShoppingMall  8693 non-null   float64
 9   Spa           8693 non-null   float64
 10  VRDeck        8693 non-null   float64
 11  Transported   8693 non-null   bool   
 12  deck          8693 non-null   object 
 13  cabin_num     8693 non-null   object 
 14  side          8693 non-null   object 
dtypes: bool(3), float64(6), object(6)
memory usage: 840.6+ KB


In [45]:
## correcting the datatypes:
df_train[['CryoSleep','VIP','Transported','cabin_num']] = df_train[['CryoSleep','VIP','Transported','cabin_num']].astype('int')
df_test[['CryoSleep','VIP','cabin_num']] = df_test[['CryoSleep','VIP','cabin_num']].astype('int')

In [46]:
categ_cols = ['HomePlanet','Destination','deck','side']
for categ in categ_cols:
  df_train = pd.concat([df_train,pd.get_dummies(df_train[categ],drop_first=True)],axis=1)
  df_test = pd.concat([df_test,pd.get_dummies(df_test[categ],drop_first=True)],axis=1)

In [47]:
df_train.drop(categ_cols,axis=1,inplace=True)
df_test.drop(categ_cols,axis=1,inplace=True)
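Calling `pd.get_dummies` separately on train and test only gives matching columns when both frames contain the same category levels, which happens to be the case here. A hedged sketch of one way to guard against a level that is missing from the test set, using invented toy frames:

```python
import pandas as pd

# Toy frames (invented values): deck 'T' appears in train but never in test.
toy_train = pd.DataFrame({"deck": ["A", "B", "T"]})
toy_test = pd.DataFrame({"deck": ["A", "B", "B"]})

train_d = pd.get_dummies(toy_train["deck"], drop_first=True)
test_d = pd.get_dummies(toy_test["deck"], drop_first=True)

# Reindex the test dummies to the training columns;
# levels unseen in test become all-zero columns, extra levels are dropped.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(train_d.columns), list(test_d.columns))
```

After the reindex, a model fitted on the training columns can be applied to the test frame without a column mismatch.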

In [48]:
df_train.head()

Unnamed: 0,PassengerId,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,cabin_num,Europa,Mars,PSO J318.5-22,TRAPPIST-1e,B,C,D,E,F,G,T,S
0,0001_01,0,39.0,0,0.0,0.0,0.0,0.0,0.0,0,0,1,0,0,1,1,0,0,0,0,0,0,0
1,0002_01,0,24.0,0,109.0,9.0,25.0,549.0,44.0,1,0,0,0,0,1,0,0,0,0,1,0,0,1
2,0003_01,0,58.0,1,43.0,3576.0,0.0,6715.0,49.0,0,0,1,0,0,1,0,0,0,0,0,0,0,1
3,0003_02,0,33.0,0,0.0,1283.0,371.0,3329.0,193.0,0,0,1,0,0,1,0,0,0,0,0,0,0,1
4,0004_01,0,16.0,0,303.0,70.0,151.0,565.0,2.0,1,1,0,0,0,1,0,0,0,0,1,0,0,1


In [49]:
df_test.head()

Unnamed: 0,PassengerId,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,cabin_num,Europa,Mars,PSO J318.5-22,TRAPPIST-1e,B,C,D,E,F,G,T,S
0,0013_01,1,27.0,0,0.0,0.0,0.0,0.0,0.0,3,0,0,0,1,0,0,0,0,0,1,0,1
1,0018_01,0,19.0,0,0.0,9.0,0.0,2823.0,0.0,4,0,0,0,1,0,0,0,0,1,0,0,1
2,0019_01,1,31.0,0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0,0,1,0,0,0,0,0,1
3,0021_01,0,38.0,0,0.0,6652.0,0.0,181.0,585.0,1,1,0,0,1,0,1,0,0,0,0,0,1
4,0023_01,0,20.0,0,10.0,0.0,635.0,0.0,0.0,5,0,0,0,1,0,0,0,0,1,0,0,1


## Data looks clean now
The next step is modelling.

In [50]:
df_train

Unnamed: 0,PassengerId,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,cabin_num,Europa,Mars,PSO J318.5-22,TRAPPIST-1e,B,C,D,E,F,G,T,S
0,0001_01,0,39.0,0,0.0,0.0,0.0,0.0,0.0,0,0,1,0,0,1,1,0,0,0,0,0,0,0
1,0002_01,0,24.0,0,109.0,9.0,25.0,549.0,44.0,1,0,0,0,0,1,0,0,0,0,1,0,0,1
2,0003_01,0,58.0,1,43.0,3576.0,0.0,6715.0,49.0,0,0,1,0,0,1,0,0,0,0,0,0,0,1
3,0003_02,0,33.0,0,0.0,1283.0,371.0,3329.0,193.0,0,0,1,0,0,1,0,0,0,0,0,0,0,1
4,0004_01,0,16.0,0,303.0,70.0,151.0,565.0,2.0,1,1,0,0,0,1,0,0,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,0,41.0,1,0.0,6819.0,0.0,1643.0,74.0,0,98,1,0,0,0,0,0,0,0,0,0,0,0
8689,9278_01,1,18.0,0,0.0,0.0,0.0,0.0,0.0,0,1499,0,0,1,0,0,0,0,0,0,1,0,1
8690,9279_01,0,26.0,0,0.0,0.0,1872.0,1.0,0.0,1,1500,0,0,0,1,0,0,0,0,0,1,0,1
8691,9280_01,0,32.0,0,0.0,1049.0,0.0,353.0,3235.0,0,608,1,0,0,0,0,0,0,1,0,0,0,1


In [51]:
y = df_train['Transported'].copy()
X = df_train.drop(['PassengerId','Transported'],axis=1)

In [52]:
from sklearn.model_selection import train_test_split
X_train,X_validate,y_train,y_validate = train_test_split(X,y,train_size=0.8,stratify=y,random_state=42)

## Logistic regression based model

In [53]:
import statsmodels.api as sm
model = sm.GLM(y_train,sm.add_constant(X_train),family=sm.families.Binomial())
results = model.fit()
results.summary()



0,1,2,3
Dep. Variable:,Transported,No. Observations:,6954.0
Model:,GLM,Df Residuals:,6932.0
Model Family:,Binomial,Df Model:,21.0
Link Function:,logit,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-3022.8
Date:,"Sun, 27 Feb 2022",Deviance:,6045.6
Time:,10:33:19,Pearson chi2:,6940.0
No. Iterations:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-0.5576,0.332,-1.677,0.094,-1.209,0.094
CryoSleep,1.3950,0.087,15.952,0.000,1.224,1.566
Age,-0.0096,0.002,-4.169,0.000,-0.014,-0.005
VIP,-0.3363,0.272,-1.235,0.217,-0.870,0.197
RoomService,-0.0015,0.000,-14.870,0.000,-0.002,-0.001
FoodCourt,0.0005,4.5e-05,11.260,0.000,0.000,0.001
ShoppingMall,0.0005,7.23e-05,6.499,0.000,0.000,0.001
Spa,-0.0021,0.000,-17.008,0.000,-0.002,-0.002
VRDeck,-0.0018,0.000,-16.291,0.000,-0.002,-0.002


According to this summary, some features are not statistically significant (for example VIP, with p = 0.217). VIP and the deck dummies E, F, G and T are therefore removed from the model.

In [54]:
features = list(X_train.columns)

In [55]:
features.remove('VIP')
features.remove('E')
features.remove('F')
features.remove('G')
features.remove('T')

In [56]:
model = sm.GLM(y_train,sm.add_constant(X_train[features]),family=sm.families.Binomial())
results = model.fit()
results.summary()



0,1,2,3
Dep. Variable:,Transported,No. Observations:,6954.0
Model:,GLM,Df Residuals:,6937.0
Model Family:,Binomial,Df Model:,16.0
Link Function:,logit,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-3036.0
Date:,"Sun, 27 Feb 2022",Deviance:,6072.1
Time:,10:33:19,Pearson chi2:,6910.0
No. Iterations:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-0.3098,0.122,-2.536,0.011,-0.549,-0.070
CryoSleep,1.2969,0.080,16.140,0.000,1.139,1.454
Age,-0.0088,0.002,-3.876,0.000,-0.013,-0.004
RoomService,-0.0015,0.000,-14.841,0.000,-0.002,-0.001
FoodCourt,0.0005,4.46e-05,11.156,0.000,0.000,0.001
ShoppingMall,0.0005,7.23e-05,6.589,0.000,0.000,0.001
Spa,-0.0020,0.000,-16.975,0.000,-0.002,-0.002
VRDeck,-0.0018,0.000,-16.216,0.000,-0.002,-0.002
cabin_num,0.0004,6.74e-05,5.555,0.000,0.000,0.001


In [57]:
y_train_logreg_prob = results.predict(sm.add_constant(X_train[features]))



In [58]:
y_train_logreg_prob.head()

3600    0.519947
1262    0.744029
8612    0.515836
5075    0.961922
4758    0.000055
dtype: float64

In [59]:
from sklearn.metrics import confusion_matrix,accuracy_score,roc_auc_score,roc_curve
print(roc_auc_score(y_train,y_train_logreg_prob))

0.8777373862841495


In [60]:
y_validate_logreg_prob = results.predict(sm.add_constant(X_validate[features]))
print(roc_auc_score(y_validate,y_validate_logreg_prob))

0.8807996952332576




In [61]:
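# The evaluation metric is accuracy, so the probabilities must be converted to class labels.
# The cut-off used below is the ROC point that maximises TPR - FPR (Youden's J statistic),
# which serves as a proxy for the accuracy-maximising threshold.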
fpr,tpr,thresholds = roc_curve(y_train,y_train_logreg_prob)

In [62]:
tune = pd.DataFrame(thresholds,columns=['Threshold'])
tune['TPR'] = tpr
tune['FPR'] = fpr

In [63]:
tune['Diff'] = tune['TPR']-tune['FPR']

In [64]:
threshold = tune[tune['Diff']==tune['Diff'].max()]['Threshold'].values[0]
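The cells above pick the threshold with the largest TPR - FPR. Since the stated metric is accuracy, an alternative is to score each candidate threshold by accuracy directly. The sketch below compares both choices on synthetic scores; `y_true` and `scores` are invented stand-ins for `y_train` and `y_train_logreg_prob`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
# Synthetic scores: positives tend to score higher than negatives.
scores = np.clip(0.5 + 0.25 * (2 * y_true - 1)
                 + rng.normal(scale=0.25, size=500), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Threshold with the largest TPR - FPR (same rule as the cells above).
j_threshold = thresholds[np.argmax(tpr - fpr)]

# Alternative: evaluate accuracy at every candidate threshold.
accuracies = [accuracy_score(y_true, (scores >= t).astype(int)) for t in thresholds]
acc_threshold = thresholds[int(np.argmax(accuracies))]

acc_at_j = accuracy_score(y_true, (scores >= j_threshold).astype(int))
acc_at_best = max(accuracies)
print(j_threshold, acc_at_j)
print(acc_threshold, acc_at_best)
```

With balanced classes, as in this competition, the two thresholds are usually close. Either way, a threshold chosen on the training split should be checked on the validation split.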

In [65]:
y_train_logreg = y_train_logreg_prob.apply(lambda x : 1 if x>=threshold else 0)
y_validate_logreg = y_validate_logreg_prob.apply(lambda x : 1 if x>=threshold else 0)

In [66]:
print('Training accuracy is ',accuracy_score(y_train,y_train_logreg))
print('Validation accuracy is ',accuracy_score(y_validate,y_validate_logreg))

Training accuracy is  0.7965199884958297
Validation accuracy is  0.7918343875790684


In [67]:
df_train['logreg_prob'] = results.predict(sm.add_constant(df_train[features]))
df_train['logreg_pred'] = df_train['logreg_prob'].apply(lambda x : 1 if x>=threshold else 0)



In [68]:
df_test['logreg_prob'] = results.predict(sm.add_constant(df_test[features]))
df_test['logreg_pred'] = df_test['logreg_prob'].apply(lambda x : 1 if x>=threshold else 0)



In [69]:
logreg_submission= df_test[['PassengerId','logreg_pred']].rename(columns={"logreg_pred":'Transported'}).copy()
logreg_submission['Transported'] = logreg_submission['Transported'].map({0:False,1:True})

In [70]:
logreg_submission.to_csv('Submission.csv',index=False)
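Before uploading, it can help to compare the submission frame with `sample_submission.csv`, which was downloaded earlier. The helper below is a hypothetical sketch (`check_submission` is not part of the notebook or of the Kaggle tooling), shown on invented two-row frames.

```python
import pandas as pd

def check_submission(submission, sample):
    """Return a list of problems found; an empty list means the layout matches."""
    if list(submission.columns) != list(sample.columns):
        return ["column names or order differ"]
    problems = []
    if len(submission) != len(sample):
        problems.append("row count differs")
    elif submission["PassengerId"].tolist() != sample["PassengerId"].tolist():
        problems.append("PassengerId values or order differ")
    if submission["Transported"].dtype != bool:
        problems.append("Transported is not boolean")
    return problems

# Toy frames standing in for sample_submission.csv and a submission (invented rows).
sample = pd.DataFrame({"PassengerId": ["0013_01", "0018_01"],
                       "Transported": [False, False]})
good = pd.DataFrame({"PassengerId": ["0013_01", "0018_01"],
                     "Transported": [True, False]})
bad = good.assign(Transported=good["Transported"].astype(int))

print(check_submission(good, sample))
print(check_submission(bad, sample))
```

In the notebook this would be called with `logreg_submission` and `pd.read_csv('sample_submission.csv')`.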

In [71]:
#!kaggle competitions submit spaceship-titanic -f Submission.csv -m "Base Model using Logistic Regression"

## Model 2 : Decision Tree Based

In [72]:
from sklearn.tree import DecisionTreeClassifier
dt_base = DecisionTreeClassifier(random_state=42)
model = dt_base.fit(X_train,y_train)
y_train_dt = dt_base.predict(X_train)
y_validate_dt = dt_base.predict(X_validate)

In [73]:
print('DT : Train accuracy is ',accuracy_score(y_train,y_train_dt))
print('DT : Validation accuracy is ',accuracy_score(y_validate,y_validate_dt))

DT : Train accuracy is  0.999424791486914
DT : Validation accuracy is  0.7429557216791259


In [74]:
print(model.tree_.max_depth)

43


In [75]:
## Train accuracy of ~1.00 against validation accuracy of ~0.74 (tree depth 43): the model is clearly overfitting
## Let's tune it

In [76]:
grid={
    'max_depth' : [10,15,20,25,30,35,40],
    'min_samples_split': [10,15,20,25,30,50,60,70,80,90,100,110,120,130,150,200],
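    'max_features':[None,'auto','sqrt',5]  # note: 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3; drop it on newer versions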
}

In [77]:
from sklearn.model_selection import GridSearchCV
dt =DecisionTreeClassifier()
cv = GridSearchCV(dt,param_grid=grid,cv=5,scoring='accuracy',return_train_score=True,verbose=2,n_jobs=-1)


In [78]:
cv.fit(X_train,y_train)

Fitting 5 folds for each of 448 candidates, totalling 2240 fits


GridSearchCV(cv=5, estimator=DecisionTreeClassifier(), n_jobs=-1,
             param_grid={'max_depth': [10, 15, 20, 25, 30, 35, 40],
                         'max_features': [None, 'auto', 'sqrt', 5],
                         'min_samples_split': [10, 15, 20, 25, 30, 50, 60, 70,
                                               80, 90, 100, 110, 120, 130, 150,
                                               200]},
             return_train_score=True, scoring='accuracy', verbose=2)

In [79]:
cv.best_params_

{'max_depth': 15, 'max_features': None, 'min_samples_split': 150}

In [80]:
cv_scores = pd.DataFrame(cv.cv_results_)

In [81]:
cv.best_score_

0.7855916503317835
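`cv_scores` holds the full `cv_results_` table, which can be used to compare train and cross-validation accuracy per parameter combination instead of looking at the best score only. A minimal sketch on synthetic data (`toy_X`, `toy_y`, `toy_grid` and the small grid are assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

toy_X, toy_y = make_classification(n_samples=300, n_features=8, random_state=0)
toy_grid = {"max_depth": [2, 5, None], "min_samples_split": [2, 20]}

toy_cv = GridSearchCV(DecisionTreeClassifier(random_state=0), toy_grid, cv=3,
                      scoring="accuracy", return_train_score=True)
toy_cv.fit(toy_X, toy_y)

scores = pd.DataFrame(toy_cv.cv_results_)
# A large train/test gap suggests the combination still overfits.
scores["gap"] = scores["mean_train_score"] - scores["mean_test_score"]
cols = ["param_max_depth", "param_min_samples_split",
        "mean_train_score", "mean_test_score", "gap"]
print(scores.sort_values("mean_test_score", ascending=False)[cols])
```

The `mean_train_score` column is only present because `return_train_score=True` was passed, as it is in the notebook's grid search.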

In [82]:
dt_best = cv.best_estimator_ 

In [83]:
## The best cross-validation accuracy (0.786) is lower than the logistic regression validation accuracy (0.792), so this model is not the preferred one.
## Predictions are still generated because the blending step needs them.
features_dt = list(X_train.columns)
df_train['dt_pred'] = dt_best.predict(df_train[features_dt])
df_test['dt_pred'] = dt_best.predict(df_test[features_dt])
df_train['dt_pred'] = df_train['dt_pred'].map({0:False,1:True})
df_test['dt_pred'] = df_test['dt_pred'].map({0:False,1:True})

In [84]:
df_test[['PassengerId','dt_pred']].rename(columns={'dt_pred':'Transported'}).to_csv('Submission_dt.csv',index=False)

In [85]:
#!kaggle competitions submit spaceship-titanic -f Submission_dt.csv -m "Decision Tree Based Model."

## Model 3: Random Forest Based ensemble model

In [86]:
from sklearn.ensemble import RandomForestClassifier
rf_base = RandomForestClassifier(random_state=42)
model = rf_base.fit(X_train,y_train)
y_train_rfbase = model.predict(X_train)
y_validate_rfbase = model.predict(X_validate)

In [87]:
print('RF-Base : Train accuracy is ',accuracy_score(y_train,y_train_rfbase))
print('RF-Base : Validation accuracy is ',accuracy_score(y_validate,y_validate_rfbase))

RF-Base : Train accuracy is  0.999424791486914
RF-Base : Validation accuracy is  0.8067855089131685


In [88]:
model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [89]:
## Train accuracy (~1.00) is far above validation accuracy (~0.81), so the base model is overfitting

In [90]:
## tune the model
grid={
    'max_depth' : [10,15,20,25,30],
    'min_samples_split': [60,70,80,90,100,110,120,130,150,200],
    'n_estimators':[50,75,100]
}

In [91]:
rf = RandomForestClassifier(random_state=42)
cv = GridSearchCV(rf,param_grid=grid,cv=5,scoring='accuracy',return_train_score=True,verbose=2,n_jobs=-1)

In [92]:
cv.fit(X_train,y_train)

Fitting 5 folds for each of 150 candidates, totalling 750 fits


GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=42), n_jobs=-1,
             param_grid={'max_depth': [10, 15, 20, 25, 30],
                         'min_samples_split': [60, 70, 80, 90, 100, 110, 120,
                                               130, 150, 200],
                         'n_estimators': [50, 75, 100]},
             return_train_score=True, scoring='accuracy', verbose=2)

In [93]:
cv.best_params_

{'max_depth': 25, 'min_samples_split': 60, 'n_estimators': 50}

In [94]:
RFCV_scores = pd.DataFrame(cv.cv_results_)

In [95]:
rf_best = cv.best_estimator_
print("Best Cross-Validation score is ",cv.best_score_)

Best Cross-Validation score is  0.802561275207009


In [96]:
df_train['rf_pred'] = rf_best.predict(df_train[features_dt])
df_test['rf_pred'] = rf_best.predict(df_test[features_dt])
df_train['rf_pred'] = df_train['rf_pred'].map({0:False,1:True})
df_test['rf_pred'] = df_test['rf_pred'].map({0:False,1:True})

In [97]:
df_test[['PassengerId','rf_pred']].rename(columns={'rf_pred':'Transported'}).to_csv('Submission_rf.csv',index=False)

In [98]:
#!kaggle competitions submit spaceship-titanic -f Submission_rf.csv -m "Random Forest Ensemble based model"

## Model 4 : Blending the models

In [106]:
blend_df = df_train[['PassengerId','logreg_pred','dt_pred','rf_pred','Transported']].copy()
blend_df[['dt_pred','rf_pred']] = blend_df[['dt_pred','rf_pred']].astype('int')
blend_train,blend_validate = train_test_split(blend_df,train_size=0.8,stratify=blend_df['Transported'],random_state=42)
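Most rows of `df_train` were also used to fit the base models, so the predictions used as blend features here are largely in-sample and can look more accurate than they would be on new data. A common alternative is to build the blend features from out-of-fold predictions. The sketch below uses `cross_val_predict` on synthetic data. It is not the notebook's approach, and the model settings and names (`toy_X`, `toy_y`, `base_models`) are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

toy_X, toy_y = make_classification(n_samples=400, n_features=8, random_state=0)

base_models = {
    "dt_pred": DecisionTreeClassifier(max_depth=5, random_state=0),
    "rf_pred": RandomForestClassifier(n_estimators=25, random_state=0),
}

# Each row's prediction comes from a model fitted on the other folds only.
oof = np.column_stack([
    cross_val_predict(model, toy_X, toy_y, cv=5)
    for model in base_models.values()
])
print(oof.shape)
```

The columns of `oof` could then replace the in-sample `dt_pred` and `rf_pred` columns when fitting the blend model, while the base models refitted on all training rows produce the test-set features.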

In [114]:
features_blend = ['logreg_pred','dt_pred','rf_pred']

In [116]:
blend_model = sm.GLM(blend_train['Transported'],sm.add_constant(blend_train[features_blend]),family=sm.families.Binomial())
result_blend = blend_model.fit()
result_blend.summary()



0,1,2,3
Dep. Variable:,Transported,No. Observations:,6954.0
Model:,GLM,Df Residuals:,6950.0
Model Family:,Binomial,Df Model:,3.0
Link Function:,logit,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-2948.3
Date:,"Sun, 27 Feb 2022",Deviance:,5896.5
Time:,10:53:21,Pearson chi2:,6950.0
No. Iterations:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-1.8290,0.051,-36.183,0.000,-1.928,-1.730
logreg_pred,0.1005,0.113,0.892,0.372,-0.120,0.321
dt_pred,1.2525,0.095,13.117,0.000,1.065,1.440
rf_pred,2.3508,0.106,22.226,0.000,2.144,2.558


In [117]:
features_blend2 = ['dt_pred','rf_pred']

In [130]:
blend_model2 = sm.GLM(blend_train['Transported'],sm.add_constant(blend_train[features_blend2]),family=sm.families.Binomial())
result_blend2 = blend_model2.fit()
result_blend2.summary()



0,1,2,3
Dep. Variable:,Transported,No. Observations:,6954.0
Model:,GLM,Df Residuals:,6951.0
Model Family:,Binomial,Df Model:,2.0
Link Function:,logit,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-2948.6
Date:,"Sun, 27 Feb 2022",Deviance:,5897.3
Time:,11:02:52,Pearson chi2:,6960.0
No. Iterations:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-1.8198,0.049,-36.841,0.000,-1.917,-1.723
dt_pred,1.2876,0.087,14.808,0.000,1.117,1.458
rf_pred,2.4050,0.087,27.647,0.000,2.235,2.575


In [131]:
blend_train['prob_b1'] = result_blend.predict(sm.add_constant(blend_train[features_blend]))
blend_train['pred_b1'] = blend_train['prob_b1'].apply(lambda x:1 if x>=0.5 else 0)
blend_validate['prob_b1'] = result_blend.predict(sm.add_constant(blend_validate[features_blend]))
blend_validate['pred_b1'] = blend_validate['prob_b1'].apply(lambda x:1 if x>=0.5 else 0)



In [132]:
print('Blend-1 : Train accuracy is ',accuracy_score(blend_train['Transported'],blend_train['pred_b1']))
print('Blend-1 : Validation accuracy is ',accuracy_score(blend_validate['Transported'],blend_validate['pred_b1']))

Blend-1 : Train accuracy is  0.8408110440034513
Blend-1 : Validation accuracy is  0.8027602070155262


In [133]:
blend_train['prob_b2'] = result_blend2.predict(sm.add_constant(blend_train[features_blend2]))
blend_train['pred_b2'] = blend_train['prob_b2'].apply(lambda x:1 if x>=0.5 else 0)
blend_validate['prob_b2'] = result_blend2.predict(sm.add_constant(blend_validate[features_blend2]))
blend_validate['pred_b2'] = blend_validate['prob_b2'].apply(lambda x:1 if x>=0.5 else 0)



In [134]:
print('Blend-2 : Train accuracy is ',accuracy_score(blend_train['Transported'],blend_train['pred_b2']))
print('Blend-2 : Validation accuracy is ',accuracy_score(blend_validate['Transported'],blend_validate['pred_b2']))

Blend-2 : Train accuracy is  0.8408110440034513
Blend-2 : Validation accuracy is  0.8027602070155262


In [136]:
df_test['blend_prob'] = result_blend2.predict(sm.add_constant(df_test[features_blend2].astype('int')))



In [137]:
df_test['blend_pred'] = df_test['blend_prob'].apply(lambda x: 1 if x>=0.5 else 0)

In [140]:
df_test['blend_pred'] =df_test['blend_pred'].map({0:False,1:True})

In [141]:
df_test[['PassengerId','blend_pred']].rename(columns={'blend_pred':'Transported'}).to_csv('Submission_blend.csv',index=False)

In [143]:
!kaggle competitions submit spaceship-titanic -f Submission_blend.csv -m "Blend of logreg,dt and rf models"

100% 56.4k/56.4k [00:05<00:00, 10.6kB/s]
Successfully submitted to Spaceship Titanic