# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [28]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [29]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Check the shape of your data**

In [30]:
#your code here
spaceship.shape

(8693, 14)

**Check for data types**

In [31]:
#your code here
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

**Check for missing values**

In [32]:
#your code here
spaceship.isna().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values with an algorithm.

For this exercise, because each column is missing only about 2% of its values, we will drop rows containing any missing value.

In [33]:
#your code here
spaceship = spaceship.dropna()

In [34]:
spaceship.isna().sum()

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
Transported     0
dtype: int64
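
Dropping rows is simple, but because the missing values are spread across different rows, it reduces this dataset from 8693 rows to 6606. For comparison, here is a minimal sketch of the drop and fill strategies listed above, run on a small made-up frame. The frame `toy` and its values are illustrative assumptions, not part of the lab data.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the real data (illustrative values only).
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
})

# Strategy 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Strategy 2: fill the numeric column with its mean
# and the categorical column with its mode (most frequent value).
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

print(len(dropped), "rows left after dropna")
print(filled)
```

Filling keeps every row at the cost of introducing estimated values, so which strategy is preferable depends on how much data can be spared.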

- **Cabin** is too granular. Keep only its first character (the deck letter) so that the column takes values in {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [35]:
#your code here
spaceship['Cabin'] = spaceship['Cabin'].str[0]

In [36]:
spaceship['Cabin'].value_counts()

Cabin
F    2152
G    1973
E     683
B     628
C     587
D     374
A     207
T       2
Name: count, dtype: int64
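
Taking the first character works because the cabin strings shown earlier (for example B/0/P) start with a single letter. A slightly more explicit variant, sketched below on a few of those sample values, splits the string on "/" instead. The column names `Deck`, `Num` and `Side` are labels chosen here for illustration.

```python
import pandas as pd

# A few cabin strings copied from the head() output shown earlier.
cabins = pd.Series(["B/0/P", "F/0/S", "A/0/S", "F/1/S"], name="Cabin")

# Split on "/" into three columns; the column names are our own labels.
parts = cabins.str.split("/", expand=True)
parts.columns = ["Deck", "Num", "Side"]

print(parts["Deck"].tolist())
```

Both approaches give the same deck letter; the split version also keeps the other two components available if they are needed later.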

- Drop PassengerId and Name

In [37]:
#your code here
spaceship = spaceship.drop(['PassengerId', 'Name'], axis=1)

In [38]:
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6606 entries, 0 to 8692
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   HomePlanet    6606 non-null   object 
 1   CryoSleep     6606 non-null   object 
 2   Cabin         6606 non-null   object 
 3   Destination   6606 non-null   object 
 4   Age           6606 non-null   float64
 5   VIP           6606 non-null   object 
 6   RoomService   6606 non-null   float64
 7   FoodCourt     6606 non-null   float64
 8   ShoppingMall  6606 non-null   float64
 9   Spa           6606 non-null   float64
 10  VRDeck        6606 non-null   float64
 11  Transported   6606 non-null   bool   
dtypes: bool(1), float64(6), object(5)
memory usage: 625.8+ KB


- For non-numerical columns, create dummy variables (one-hot encoding).

In [39]:
#your code here
non_numeric_columns = ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination']

spaceship = pd.get_dummies(spaceship, columns=non_numeric_columns)

In [40]:
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6606 entries, 0 to 8692
Data columns (total 24 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Age                        6606 non-null   float64
 1   VIP                        6606 non-null   object 
 2   RoomService                6606 non-null   float64
 3   FoodCourt                  6606 non-null   float64
 4   ShoppingMall               6606 non-null   float64
 5   Spa                        6606 non-null   float64
 6   VRDeck                     6606 non-null   float64
 7   Transported                6606 non-null   bool   
 8   HomePlanet_Earth           6606 non-null   bool   
 9   HomePlanet_Europa          6606 non-null   bool   
 10  HomePlanet_Mars            6606 non-null   bool   
 11  CryoSleep_False            6606 non-null   bool   
 12  CryoSleep_True             6606 non-null   bool   
 13  Cabin_A                    6606 non-null   bool   
 1

In [41]:
spaceship = spaceship.astype(int)
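
The dummy columns produced by `pd.get_dummies` are boolean, which is why the whole frame is cast with `astype(int)` here. One possible alternative, sketched below on a made-up frame, is to pass `dtype=int` to `get_dummies`, so that only the dummy columns become integers and float columns such as Age keep their type. The frame `toy` is an illustrative assumption. In the lab data, the VIP column would still need its own conversion.

```python
import pandas as pd

# Made-up frame with one float column and two columns to encode.
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Earth"],
    "CryoSleep": [False, True, False],
    "Age": [39.0, 24.0, 58.0],
})

# dtype=int makes the dummy columns 0/1 integers instead of booleans.
encoded = pd.get_dummies(toy, columns=["HomePlanet", "CryoSleep"], dtype=int)

print(encoded.columns.tolist())
print(encoded.dtypes)
```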

In [42]:
spaceship

Unnamed: 0,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Earth,HomePlanet_Europa,...,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,39,0,0,0,0,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,1
1,24,0,109,9,25,549,44,1,1,0,...,0,0,0,0,1,0,0,0,0,1
2,58,1,43,3576,0,6715,49,0,0,1,...,0,0,0,0,0,0,0,0,0,1
3,33,0,0,1283,371,3329,193,0,0,1,...,0,0,0,0,0,0,0,0,0,1
4,16,0,303,70,151,565,2,1,1,0,...,0,0,0,0,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,41,1,0,6819,0,1643,74,0,0,1,...,0,0,0,0,0,0,0,1,0,0
8689,18,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,1,0
8690,26,0,0,0,1872,1,0,1,1,0,...,0,0,0,0,0,1,0,0,0,1
8691,32,0,0,1049,0,353,3235,0,0,1,...,0,0,0,1,0,0,0,1,0,0


In [43]:
#your code here
from sklearn.preprocessing import LabelEncoder

more_non_numerical = ['VIP','Transported']

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Apply LabelEncoder to each remaining non-numerical column
# (after astype(int) above, these columns already hold 0/1, so the values do not change)
for column in more_non_numerical:
    spaceship[column] = label_encoder.fit_transform(spaceship[column])

In [44]:
spaceship.head()

Unnamed: 0,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Earth,HomePlanet_Europa,...,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,39,0,0,0,0,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,1
1,24,0,109,9,25,549,44,1,1,0,...,0,0,0,0,1,0,0,0,0,1
2,58,1,43,3576,0,6715,49,0,0,1,...,0,0,0,0,0,0,0,0,0,1
3,33,0,0,1283,371,3329,193,0,0,1,...,0,0,0,0,0,0,0,0,0,1
4,16,0,303,70,151,565,2,1,1,0,...,0,0,0,0,1,0,0,0,0,1


In [45]:
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6606 entries, 0 to 8692
Data columns (total 24 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   Age                        6606 non-null   int64
 1   VIP                        6606 non-null   int64
 2   RoomService                6606 non-null   int64
 3   FoodCourt                  6606 non-null   int64
 4   ShoppingMall               6606 non-null   int64
 5   Spa                        6606 non-null   int64
 6   VRDeck                     6606 non-null   int64
 7   Transported                6606 non-null   int64
 8   HomePlanet_Earth           6606 non-null   int64
 9   HomePlanet_Europa          6606 non-null   int64
 10  HomePlanet_Mars            6606 non-null   int64
 11  CryoSleep_False            6606 non-null   int64
 12  CryoSleep_True             6606 non-null   int64
 13  Cabin_A                    6606 non-null   int64
 14  Cabin_B                    66

**Perform Train Test Split**

In [46]:
#your code here
features = spaceship.drop(columns = ["Transported"])
target = spaceship["Transported"]

In [47]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=0)
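
With `test_size = 0.20`, 80% of the rows go to training and 20% to testing. The sketch below repeats the same call on synthetic data and additionally passes `stratify`, an optional argument that keeps the class proportions equal in both parts. The data, the 70/30 class balance and the use of `stratify` are assumptions made for illustration, not part of the lab solution.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced binary target (70 zeros, 30 ones).
rng = np.random.default_rng(0)
features = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
target = pd.Series([0] * 70 + [1] * 30, name="label")

# stratify=target keeps the share of each class the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.20, random_state=0, stratify=target
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```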

**Model Selection**

In this exercise we will use **KNN** as our predictive model. Because KNN relies on distances between points, we first scale the features to a common range.

In [48]:
normalizer = MinMaxScaler()

In [49]:
normalizer.fit(X_train)

In [50]:
X_train_norm = normalizer.transform(X_train)

X_test_norm = normalizer.transform(X_test)

In [51]:
X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
X_train_norm.head()

Unnamed: 0,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,...,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,0.405063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.050633,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
2,0.379747,0.0,0.0,0.007916,0.0,0.051276,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
3,0.21519,0.0,0.00131,0.0,0.046111,0.016378,4.9e-05,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.329114,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0


In [52]:
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns)
X_test_norm.head()

Unnamed: 0,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,...,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,0.632911,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
1,0.227848,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2,0.189873,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
3,0.658228,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,0.78481,1.0,0.0,0.054775,0.0,0.07774,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
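
The scaler is fitted on the training set only and then applied to both sets, so the test data has no influence on the minimum and maximum used for scaling. A minimal sketch with made-up numbers shows one consequence: test values outside the training range are mapped outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data; the test set contains a value above the training maximum.
X_train = np.array([[0.0], [50.0], [100.0]])
X_test = np.array([[25.0], [150.0]])

scaler = MinMaxScaler()
scaler.fit(X_train)  # learns min=0 and max=100 from the training data only

X_train_norm = scaler.transform(X_train)
X_test_norm = scaler.transform(X_test)

print(X_train_norm.ravel())
print(X_test_norm.ravel())
```

Here the training values become 0, 0.5 and 1, while the test values become 0.25 and 1.5.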


In [63]:
#your code here
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=2)

In [64]:
knn.fit(X_train_norm, y_train)

In [65]:
knn.score(X_test_norm, y_test)

0.7465960665658093
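
The value `n_neighbors=2` is only one possible choice. One way to compare several values, sketched below, is cross-validation on the training data with the scaler and the classifier combined in a pipeline. The sketch uses synthetic data from `make_classification` so that it runs on its own; the candidate values of k and the 5 folds are assumptions. In this notebook, `X_train` and `y_train` would take the place of `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

scores = {}
for k in (1, 3, 5, 7, 9):
    # The pipeline refits the scaler inside each fold, so no fold leaks into another.
    model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

# Pick the k with the highest mean cross-validated accuracy.
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```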

- Evaluate your model's performance and comment on it.

The score of 0.7466 is the model's accuracy on the test set: the KNN classifier with n_neighbors=2 correctly predicts whether a passenger was transported for about 74.66% of the test instances.
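
Accuracy alone does not show which kind of mistakes the model makes or how it compares with a trivial baseline. The sketch below computes accuracy, a confusion matrix and a majority-class baseline on made-up labels. The arrays `y_true` and `y_pred` are illustrative assumptions; in this notebook they would be `y_test` and `knn.predict(X_test_norm)`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions (illustrative only).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Accuracy obtained by always predicting the most frequent class.
baseline = max(np.mean(y_true), 1 - np.mean(y_true))

print("accuracy:", acc)
print(cm)
print("majority baseline:", baseline)
```

A model is only useful if its accuracy is clearly above the baseline, and the confusion matrix shows whether the errors fall mostly on one class.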