# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Check the shape of your data**

In [3]:
# Number of rows and columns
print(spaceship.shape)

(8693, 14)


**Check for data types**

In [4]:
# Data type of each column
print(spaceship.dtypes)

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object


**Check for missing values**

There are multiple strategies for handling missing data:

- Removing all rows or all columns that contain missing data.
- Filling all missing values with a single value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values using an algorithm.

For this exercise, because the number of null values is so low, we will drop every row that contains a missing value.
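As a rough illustration of the three strategies listed above, the following sketch applies each one to a tiny made-up frame. The `toy` DataFrame and its values are invented for illustration and are not the Spaceship Titanic data. The third strategy is shown with scikit-learn's `KNNImputer`, which is only one possible algorithm among several.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (hypothetical values, not the lab dataset)
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Spa": [0.0, 10.0, np.nan, 20.0],
    "HomePlanet": ["Earth", "Mars", None, "Earth"],
})

# 1) Drop every row that has at least one missing value
dropped = toy.dropna()  # how="any" is the default

# 2) Fill with one value per column: mean for numeric, mode for categorical
filled = toy.copy()
for col in ["Age", "Spa"]:
    filled[col] = filled[col].fillna(filled[col].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

# 3) Fill the numeric columns with an algorithm (here: nearest neighbours)
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(toy[["Age", "Spa"]]), columns=["Age", "Spa"]
)

print(len(toy), "rows before,", len(dropped), "rows after dropna")
print(filled)
print(imputed)
```

Note that `dropna` returns a new DataFrame, so the result has to be assigned to a name for the rows to stay dropped in later cells.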

In [5]:
# Count missing values per column
print(spaceship.isnull().sum())

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64


In [6]:
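# Note: how='all' removes only rows in which every column is missing, and the
# result is printed without being assigned, so spaceship itself is unchanged.
# Dropping rows with any missing value would be: spaceship = spaceship.dropna()
print(spaceship.dropna(how='all'))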


     PassengerId HomePlanet CryoSleep     Cabin    Destination   Age    VIP  \
0        0001_01     Europa     False     B/0/P    TRAPPIST-1e  39.0  False   
1        0002_01      Earth     False     F/0/S    TRAPPIST-1e  24.0  False   
2        0003_01     Europa     False     A/0/S    TRAPPIST-1e  58.0   True   
3        0003_02     Europa     False     A/0/S    TRAPPIST-1e  33.0  False   
4        0004_01      Earth     False     F/1/S    TRAPPIST-1e  16.0  False   
...          ...        ...       ...       ...            ...   ...    ...   
8688     9276_01     Europa     False    A/98/P    55 Cancri e  41.0   True   
8689     9278_01      Earth      True  G/1499/S  PSO J318.5-22  18.0  False   
8690     9279_01      Earth     False  G/1500/S    TRAPPIST-1e  26.0  False   
8691     9280_01     Europa     False   E/608/S    55 Cancri e  32.0  False   
8692     9280_02     Europa     False   E/608/S    TRAPPIST-1e  44.0  False   

      RoomService  FoodCourt  ShoppingMall     Spa 

- **Cabin** is too granular: transform it so that only the deck letter remains, one of {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [7]:
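# Keep only the first character of Cabin (the deck letter)
spaceship['Cabin'] = spaceship['Cabin'].str[0]

# Merge deck 'T' into 'A' (an extra step: the instruction above keeps 'T')
spaceship['Cabin'] = spaceship['Cabin'].replace('T', 'A')

print(spaceship['Cabin'].head())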

0    B
1    F
2    A
3    A
4    F
Name: Cabin, dtype: object


In [8]:
print(spaceship.columns)

Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age',
       'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Name', 'Transported'],
      dtype='object')


- Drop PassengerId and Name

In [9]:
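# Drop the identifier columns PassengerId and Name.
# ShoppingMall and Age are dropped here as well, which the task does not ask for.
spaceship.drop(columns=['PassengerId', 'Name', 'ShoppingMall', 'Age'], inplace=True)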

print(spaceship.head())

  HomePlanet CryoSleep Cabin  Destination    VIP  RoomService  FoodCourt  \
0     Europa     False     B  TRAPPIST-1e  False          0.0        0.0   
1      Earth     False     F  TRAPPIST-1e  False        109.0        9.0   
2     Europa     False     A  TRAPPIST-1e   True         43.0     3576.0   
3     Europa     False     A  TRAPPIST-1e  False          0.0     1283.0   
4      Earth     False     F  TRAPPIST-1e  False        303.0       70.0   

      Spa  VRDeck  Transported  
0     0.0     0.0        False  
1   549.0    44.0         True  
2  6715.0    49.0        False  
3  3329.0   193.0        False  
4   565.0     2.0         True  


- For non-numerical columns, create dummy variables.

In [10]:
# One-hot encode the categorical columns as 0/1 integer columns
encoded_df = pd.get_dummies(spaceship, columns=['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP'], dtype=int)
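
One detail worth checking at this step is how `pd.get_dummies` treats missing categories. The sketch below uses a made-up column (the `demo` frame is hypothetical): by default, a missing value produces zeros in every dummy column instead of raising an error, while `dummy_na=True` adds an explicit indicator column for it.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing entry (row 2)
demo = pd.DataFrame({"HomePlanet": ["Earth", "Mars", np.nan, "Earth"]})

# Default behaviour: the missing row gets 0 in every dummy column
default = pd.get_dummies(demo, columns=["HomePlanet"], dtype=int)

# dummy_na=True: one extra column flags the missing row
with_na = pd.get_dummies(demo, columns=["HomePlanet"], dtype=int, dummy_na=True)

print(default)
print(with_na)
```

Missing values in numeric columns are not touched by `get_dummies`, so they are still `NaN` in the encoded frame.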

**Perform Train Test Split**

In [11]:
# Separate the predictors from the target column

features = encoded_df.drop(columns=['Transported'])

features

,RoomService,FoodCourt,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,CryoSleep_True,Cabin_A,...,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
0,0.0,0.0,0.0,0.0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,1,1,0
1,109.0,9.0,549.0,44.0,1,0,0,1,0,0,...,0,0,0,1,0,0,0,1,1,0
2,43.0,3576.0,6715.0,49.0,0,1,0,1,0,1,...,0,0,0,0,0,0,0,1,0,1
3,0.0,1283.0,3329.0,193.0,0,1,0,1,0,1,...,0,0,0,0,0,0,0,1,1,0
4,303.0,70.0,565.0,2.0,1,0,0,1,0,0,...,0,0,0,1,0,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,0.0,6819.0,1643.0,74.0,0,1,0,1,0,1,...,0,0,0,0,0,1,0,0,0,1
8689,0.0,0.0,0.0,0.0,1,0,0,0,1,0,...,0,0,0,0,1,0,1,0,1,0
8690,0.0,0.0,1.0,0.0,1,0,0,1,0,0,...,0,0,0,0,1,0,0,1,1,0
8691,0.0,1049.0,353.0,3235.0,0,1,0,1,0,0,...,0,0,1,0,0,1,0,0,1,0


In [12]:
target = encoded_df['Transported']

In [13]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=12)

X_train, X_test, y_train, y_test

(      RoomService  FoodCourt    Spa  VRDeck  HomePlanet_Earth  \
 7503          0.0        0.0    0.0     0.0                 0   
 7300          0.0        0.0    0.0     0.0                 0   
 1853          0.0        0.0    0.0     0.0                 0   
 5962          0.0        0.0    0.0     0.0                 0   
 4805          0.0        0.0    0.0     0.0                 1   
 ...           ...        ...    ...     ...               ...   
 278           0.0        0.0    0.0     0.0                 1   
 3714          0.0        0.0    0.0     0.0                 1   
 7409          0.0        0.0    0.0     0.0                 0   
 3325          0.0        0.0    0.0     0.0                 0   
 5787          0.0     3095.0  197.0    40.0                 0   
 
       HomePlanet_Europa  HomePlanet_Mars  CryoSleep_False  CryoSleep_True  \
 7503                  0                1                0               1   
 7300                  0                1         
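
Displaying the four objects in full is hard to read; checking their shapes and the class balance is usually enough to confirm the split. The sketch below runs on a small synthetic frame (`toy_X` and `toy_y` are hypothetical names) and adds `stratify`, an optional argument that the cell above does not use, to keep the share of each class the same in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for features/target: 100 rows, balanced boolean target
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
toy_y = pd.Series([True] * 50 + [False] * 50, name="Transported")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=12, stratify=toy_y
)

# Shapes confirm the 80/20 split; the means show the share of True labels
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```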

**Model Selection**

In this exercise we will be using **KNN** as our predictive model.
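
KNN classifies a point by the labels of its nearest neighbours, so the result depends on how distances are measured. Columns with large ranges, such as spending amounts in the thousands, can outweigh 0/1 dummy columns. The following is a sketch on synthetic data of wrapping the classifier in a pipeline with `StandardScaler`. Scaling is not part of this lab's instructions and is shown only as an option to consider.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic features: an informative 0/1 flag and a large-scale noise column
flag = rng.integers(0, 2, size=n)
noise = rng.normal(0, 1000, size=n)
X = np.column_stack([flag, noise])
y = flag.astype(bool)  # the label depends only on the flag

X_fit, y_fit = X[:150], y[:150]
X_eval, y_eval = X[150:], y[150:]

# Without scaling, distances are driven almost entirely by the noise column
raw_knn = KNeighborsClassifier(n_neighbors=3).fit(X_fit, y_fit)
raw_score = raw_knn.score(X_eval, y_eval)

# With scaling, both columns contribute on a comparable scale
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(X_fit, y_fit)
scaled_score = scaled_knn.score(X_eval, y_eval)

print("unscaled accuracy:", raw_score)
print("scaled accuracy:  ", scaled_score)
```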

In [14]:
from sklearn.neighbors import KNeighborsClassifier

In [15]:
# KNN classifier that votes among the 3 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3)
knn

- Evaluate your model's performance and comment on it.

In [19]:
# Fit on the training split
knn.fit(X_train, y_train)

# Mean accuracy on the held-out test split
knn.score(X_test, y_test)

ValueError: Input X contains NaN.
KNeighborsClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
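
The traceback above means that the feature matrix still contains missing values when it reaches `fit`. `DataFrame.dropna` returns a new frame, so its result has to be assigned, and `how='all'` removes only rows in which every column is missing. A minimal sketch of the difference, on a made-up frame (the hypothetical `toy` data stands in for `spaceship`):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data with two rows that each have one missing value
toy = pd.DataFrame({
    "RoomService": [0.0, 109.0, np.nan, 0.0, 303.0, 12.0, 0.0, 50.0],
    "Spa": [0.0, 549.0, 6715.0, np.nan, 565.0, 3.0, 0.0, 20.0],
    "Transported": [False, True, False, False, True, True, False, True],
})

# how="all": no row is entirely empty, so nothing is removed
still_dirty = toy.dropna(how="all")
print(len(still_dirty))

# Default how="any": rows with at least one NaN are removed; keep the result
clean = toy.dropna()
print(len(clean))

X = clean.drop(columns=["Transported"])
y = clean["Transported"]

# Fitting works once no NaN is left in X
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
predictions = knn.predict(X)
print(predictions)
```

In the notebook, the equivalent step would need to run before the dummy encoding and the train/test split so that `X_train` and `X_test` are built from the cleaned frame. Imputation, which the error message itself points to, is an alternative to dropping rows.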