# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [31]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [32]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

|   | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


**Check the shape of your data**

In [33]:
spaceship.shape

(8693, 14)

**Check for data types**

In [34]:
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

**Check for missing values**

In [35]:
spaceship.isna().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a fixed value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling missing values with an algorithm (model-based imputation).

For this exercise, because each column has relatively few null values (around 200 out of 8693 rows), we will drop every row that contains a missing value. Note that this leaves 6606 rows.

In [36]:
spaceship=spaceship.dropna()
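
For comparison with dropping rows, the second strategy listed above (filling with a statistic) could be sketched roughly as follows. The `toy` frame and its values are made up for illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics two columns of the dataset
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
})

filled = toy.copy()
# Continuous column: fill with the mean of the observed values
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
# Categorical column: fill with the most frequent value (the mode)
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

print(filled)
```

In a modelling workflow the fill values would normally be computed on the training split only and then reused on the test split.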

- **Cabin** is too granular - transform it in order to obtain {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [37]:
# Extract the deck letter (the character before the first slash) with a regex
spaceship['Cabin'] = spaceship['Cabin'].str.extract(r'^([A-Z])', expand=False)

In [38]:
spaceship["Cabin"].unique()

array(['B', 'F', 'A', 'G', 'E', 'C', 'D', 'T'], dtype=object)
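
Judging by values such as `B/0/P`, the Cabin string appears to follow a letter/number/letter pattern. As an alternative sketch to the regex, `str.split` can separate the three parts. The names `Deck`, `Num` and `Side` are labels chosen here for illustration, not names taken from the dataset.

```python
import pandas as pd

# Cabin values copied from the first rows shown in this notebook
cabins = pd.Series(["B/0/P", "F/0/S", "A/0/S", "F/1/S"], name="Cabin")

# Split on the slash; expand=True returns one column per part
parts = cabins.str.split("/", expand=True)
parts.columns = ["Deck", "Num", "Side"]  # hypothetical column names

print(parts)
```

Only the first part is needed for this lab, but splitting keeps the other two available if they turn out to be useful features.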

In [39]:
target=spaceship["Transported"]

- Drop PassengerId and Name (Transported is dropped as well, since it was saved above as the target)

In [40]:
spaceship = spaceship.drop(columns=["PassengerId", "Name", "Transported"])

In [41]:
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6606 entries, 0 to 8692
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   HomePlanet    6606 non-null   object 
 1   CryoSleep     6606 non-null   object 
 2   Cabin         6606 non-null   object 
 3   Destination   6606 non-null   object 
 4   Age           6606 non-null   float64
 5   VIP           6606 non-null   object 
 6   RoomService   6606 non-null   float64
 7   FoodCourt     6606 non-null   float64
 8   ShoppingMall  6606 non-null   float64
 9   Spa           6606 non-null   float64
 10  VRDeck        6606 non-null   float64
dtypes: float64(6), object(5)
memory usage: 619.3+ KB


In [42]:
spaceship.columns

Index(['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age', 'VIP',
       'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'],
      dtype='object')

- For non-numerical columns, create dummy variables.

In [43]:
# Column groups by type (kept for reference; later cells do not use these lists)
categorical=['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP']
numerical=['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

In [44]:
spaceship = pd.get_dummies(spaceship, columns=["HomePlanet","Cabin", "Destination"])
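
One-hot columns from the same variable are redundant, because the last one is fully determined by the others. A minimal sketch of the `drop_first` option of `pd.get_dummies`, on a made-up `toy` frame, shows the difference:

```python
import pandas as pd

# Hypothetical toy frame with a single categorical column
toy = pd.DataFrame({"HomePlanet": ["Europa", "Earth", "Mars", "Earth"]})

# One column per category
full = pd.get_dummies(toy, columns=["HomePlanet"])
# Same encoding, but the first category (alphabetically) is dropped
reduced = pd.get_dummies(toy, columns=["HomePlanet"], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Whether to drop a level is a modelling choice; this notebook keeps all levels.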

In [45]:
# ABT (analytical base table)
spaceship

|   | CryoSleep | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | HomePlanet_Earth | HomePlanet_Europa | ... | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Destination_55 Cancri e | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | True | ... | True | False | False | False | False | False | False | False | False | True |
| 1 | False | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True | False | ... | False | False | False | False | True | False | False | False | False | True |
| 2 | False | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False | True | ... | False | False | False | False | False | False | False | False | False | True |
| 3 | False | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False | True | ... | False | False | False | False | False | False | False | False | False | True |
| 4 | False | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True | False | ... | False | False | False | False | True | False | False | False | False | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8688 | False | 41.0 | True | 0.0 | 6819.0 | 0.0 | 1643.0 | 74.0 | False | True | ... | False | False | False | False | False | False | False | True | False | False |
| 8689 | True | 18.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | True | False | ... | False | False | False | False | False | True | False | False | True | False |
| 8690 | False | 26.0 | False | 0.0 | 0.0 | 1872.0 | 1.0 | 0.0 | True | False | ... | False | False | False | False | False | True | False | False | False | True |
| 8691 | False | 32.0 | False | 0.0 | 1049.0 | 0.0 | 353.0 | 3235.0 | False | True | ... | False | False | False | True | False | False | False | True | False | False |


**Perform Train Test Split**

In [46]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

In [47]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(spaceship, target, test_size=0.2, random_state=42)

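# Standardize the features (the scaler is fitted on the training split only).
# Note: the KNN cells below are fitted on the unscaled X_train, so these scaled arrays are not used there.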
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
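
KNN is distance-based, so features on large scales (such as the spending columns) can dominate the distance. The sketch below shows one way to make sure the scaled values are the ones the model sees, by chaining the scaler and the classifier in a scikit-learn pipeline. It uses synthetic data from `make_classification` as a stand-in, not the Spaceship Titanic features, and the names `X_tr`, `X_te` and `scaled_knn` are chosen here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and target
X, y = make_classification(n_samples=300, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training split only, then passes
# the scaled values to KNN; the same scaling is reused at predict time.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=9))
scaled_knn.fit(X_tr, y_tr)

accuracy = scaled_knn.score(X_te, y_te)
print(f"Accuracy on the synthetic test split: {accuracy:.3f}")
```

The equivalent without a pipeline would be to call `fit` and `predict` on `X_train_scaled` and `X_test_scaled`.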

**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [83]:
knn = KNeighborsClassifier(n_neighbors=9)

In [84]:
knn.fit(X_train, y_train)

- Evaluate your model's performance and comment on it.

In [85]:
pred = knn.predict(X_test)
pred

array([ True,  True, False, ...,  True,  True,  True])

In [86]:
y_test.values

array([ True,  True,  True, ...,  True,  True,  True])

In [87]:
knn.score(X_test, y_test)

0.8048411497730711

In [88]:
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

precision = precision_score(y_test, pred, average='macro')
print(f'Precision: {precision}')

Precision: 0.8052636287930406


In [89]:
# Recall
recall = recall_score(y_test, pred, average='macro')
print(f'Recall: {recall}')

Recall: 0.8045928072572948


In [90]:
# F1-score
f1 = f1_score(y_test, pred, average='macro')
print(f'F1-score: {f1}')

F1-score: 0.8046623186513364


In [91]:
# Full classification report: 'y_test' holds the true labels and 'pred' the model's predictions
report = classification_report(y_test, pred)
print(report)

              precision    recall  f1-score   support

       False       0.81      0.78      0.80       653
        True       0.80      0.83      0.81       669

    accuracy                           0.80      1322
   macro avg       0.81      0.80      0.80      1322
weighted avg       0.81      0.80      0.80      1322
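
The per-class precision and recall in a report like this one come from the counts in a confusion matrix. A minimal sketch on made-up labels (not the notebook's predictions) shows how the counts map to the metrics for the `True` class:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions, for illustration only
y_true = np.array([True, True, False, False, True, False, True, False])
y_hat = np.array([True, False, False, True, True, False, True, False])

# For binary labels sorted as [False, True], ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_manual = tp / (tp + fp)  # share of predicted True that are correct
recall_manual = tp / (tp + fn)     # share of actual True that are found

# The same values computed by scikit-learn
precision_sklearn = precision_score(y_true, y_hat, pos_label=True)
recall_sklearn = recall_score(y_true, y_hat, pos_label=True)

print(tn, fp, fn, tp)
print(precision_manual, recall_manual)
```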



In [95]:
import plotly.express as px
import pandas as pd
import numpy as np
# Calculate the absolute correlation matrix (values range from 0 to 1)
correlation_matrix = np.abs(spaceship.corr())

# Create the heatmap using Plotly Express
fig = px.imshow(correlation_matrix,
                x=correlation_matrix.columns,
                y=correlation_matrix.columns,
                color_continuous_scale='Blues',  # Sequential scale, since absolute values are non-negative
                zmin=0,
                zmax=1,
                aspect="auto",
                title='Absolute Correlation Heatmap of All Features')

# Update the layout for better readability
fig.update_layout(
    xaxis_title="",
    yaxis_title="",
    xaxis={'side': 'top'},  # Move x-axis labels to the top
    width=800,
    height=700
)

# Add correlation values as text annotations
for i, row in enumerate(correlation_matrix.values):
    for j, value in enumerate(row):
        fig.add_annotation(
            x=correlation_matrix.columns[j],
            y=correlation_matrix.columns[i],
            text=f"{value:.2f}",
            showarrow=False,
            font=dict(size=8)
        )

# Show the plot
fig.show()

In [92]:
X_train.columns

Index(['CryoSleep', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall',
       'Spa', 'VRDeck', 'HomePlanet_Earth', 'HomePlanet_Europa',
       'HomePlanet_Mars', 'Cabin_A', 'Cabin_B', 'Cabin_C', 'Cabin_D',
       'Cabin_E', 'Cabin_F', 'Cabin_G', 'Cabin_T', 'Destination_55 Cancri e',
       'Destination_PSO J318.5-22', 'Destination_TRAPPIST-1e'],
      dtype='object')

In [93]:
# PCA: compress the three Destination dummy columns into a single component
from sklearn.decomposition import PCA

columns_dest=['Destination_55 Cancri e', 'Destination_PSO J318.5-22', 'Destination_TRAPPIST-1e']
pca_train_destination=PCA(n_components=1)
# Fit on the training split only, then apply the same transformation to both splits
pca_train_destination.fit(X_train[columns_dest])
X_train["Destination_pca"]=pca_train_destination.transform(X_train[columns_dest]).ravel()
X_test["Destination_pca"]=pca_train_destination.transform(X_test[columns_dest]).ravel()

In [None]:
# PCA to reduce the dimension of the eight Cabin dummy columns
columns=['Cabin_A','Cabin_B', 'Cabin_C', 'Cabin_D', 'Cabin_E', 'Cabin_F', 'Cabin_G','Cabin_T']
from sklearn.decomposition import PCA
# A float n_components keeps as many components as needed to explain 70% of the variance
pca_train_cabin=PCA(n_components=0.7)
pca_train_cabin.fit(X_train[columns])
pca_result_train=pca_train_cabin.transform(X_train[columns])
pca_result_test=pca_train_cabin.transform(X_test[columns])

# Create generic column names based on the number of components kept
num_components = pca_result_train.shape[1]
pca_column_names = [f'PCA_cabin_{i+1}' for i in range(num_components)]

# Add the new columns to both X_train and X_test
X_train[pca_column_names] = pca_result_train
X_test[pca_column_names] = pca_result_test
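
When `n_components` is a float between 0 and 1, scikit-learn's `PCA` chooses the number of components from the explained variance, so the number of `PCA_cabin_*` columns depends on the data. The sketch below illustrates this on a synthetic one-hot block (random categories, not the real Cabin columns) and reads back how many components were kept.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical one-hot block: each row has exactly one active category out of 8
labels = rng.integers(0, 8, size=500)
one_hot = np.eye(8)[labels]

# Keep the smallest number of components whose cumulative
# explained variance exceeds the requested fraction
pca = PCA(n_components=0.7).fit(one_hot)

kept = pca.n_components_
explained = pca.explained_variance_ratio_.sum()
print(f"Components kept: {kept}, variance explained: {explained:.3f}")
```

Checking `n_components_` after fitting is a simple way to confirm how many component columns a later feature list should name.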

In [None]:
# PCA to reduce the dimension of 'HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars'
columns=['HomePlanet_Earth', 'HomePlanet_Europa','HomePlanet_Mars']
from sklearn.decomposition import PCA
pca_train_planet=PCA(n_components=1)
# Fit on the training split only, then apply the same transformation to both splits
pca_train_planet.fit(X_train[columns])
X_train['homeplanet_PCA']=pca_train_planet.transform(X_train[columns]).ravel()
X_test['homeplanet_PCA']=pca_train_planet.transform(X_test[columns]).ravel()

In [None]:
# Names must match the columns created above ('Destination_pca');
# this list assumes the Cabin PCA kept three components
features_selected=['CryoSleep', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall',
       'Spa', 'VRDeck','Destination_pca', 'PCA_cabin_1', 'PCA_cabin_2', 'PCA_cabin_3',
       'homeplanet_PCA']

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

knn = KNeighborsClassifier(n_neighbors=10)

# Fit on the selected features only, so the PCA components replace the dummy columns they summarise
knn.fit(X_train[features_selected], y_train)

pred = knn.predict(X_test[features_selected])

# Full classification report: 'y_test' holds the true labels and 'pred' the model's predictions
report = classification_report(y_test, pred)
print(report)

In [None]:
"""
I have these variables:
- Continuous numerical
- Binary (originally object columns, or produced by get_dummies)
Is it acceptable to apply a linear correlation analysis to all of them?

Split the data into features and target.
Obtain the X_train, X_test, etc. DataFrames.

--> this is where standardization and normalization come in

If I apply PCA to generate a single vector for the VIP variables:
- VIP comes from 2 binary variables. Is it standardized and normalized? No, and I only include one of them.

Never standardize or normalize before the split.

- Cabin comes from 8 binary variables. Are they standardized and normalized? They do not need to be normalized or standardized, and neither does data that is all of the same kind (prices among prices, etc.).

Is the PCA fitted on X_train and then applied to X_train and X_test?

When I have two binary categories that are so closely related, I drop one of them.

PCA IS APPLIED AFTER THE SPLIT; THE FITTED PCA IS SAVED TO A PKL FILE.

"""