# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [17]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [18]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


**Check the shape of your data**

In [19]:
#your code here
spaceship.shape

(8693, 14)

**Check for data types**

In [20]:
#your code here
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object
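
One likely reason `CryoSleep` and `VIP` show up as `object` while `Transported` is `bool`: a plain `bool` column cannot hold missing values, so a boolean column containing NaN falls back to `object`. The sketch below uses made-up values (not the lab data) to illustrate this, and assumes that converting back with `astype(bool)` after dropping the gaps is acceptable.

```python
import numpy as np
import pandas as pd

# Toy series; the values are illustrative only
complete = pd.Series([True, False, True])
with_gap = pd.Series([True, False, np.nan])

print(complete.dtype)  # bool
print(with_gap.dtype)  # object

# Dropping the gap does not restore the dtype by itself; an explicit cast does
recovered = with_gap.dropna().astype(bool)
print(recovered.dtype)  # bool
```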

**Check for missing values**

In [21]:
#your code here
spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

There are multiple strategies for handling missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a single value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values using an algorithm.
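
The three strategies above can be sketched on a small toy frame. This is only an illustration under assumptions: the values are invented, and `KNNImputer` from scikit-learn is just one possible choice of algorithm, not something the lab prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data, illustrative only (not the Spaceship Titanic values)
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Spa": [0.0, 10.0, np.nan, 20.0],
    "HomePlanet": ["Earth", "Mars", None, "Earth"],
})

# 1. Remove every row that contains a missing value
dropped = toy.dropna()

# 2. Fill with a fixed statistic: mean for continuous, mode for categorical
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["Spa"] = filled["Spa"].fillna(filled["Spa"].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

# 3. Fill with an algorithm: KNNImputer estimates each gap from the nearest rows
numeric_cols = ["Age", "Spa"]
imputed = toy.copy()
imputed[numeric_cols] = KNNImputer(n_neighbors=2).fit_transform(toy[numeric_cols])

print(len(toy), len(dropped))
print(filled)
print(imputed)
```

Dropping is the simplest option but discards every row with at least one gap, so it is worth comparing the row count before and after.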

For this exercise, because each column is missing only about 2% of its values, we will drop the rows containing any missing value.

In [22]:
#your code here
spaceship_cleaned = spaceship.dropna()
spaceship_cleaned.isnull().sum()

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
Transported     0
dtype: int64

- **Cabin** is too granular: transform it to keep only its first letter, one of {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [23]:
#your code here
# Cabin values look like 'B/0/P'; keep only the first letter.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
spaceship_cleaned = spaceship_cleaned.assign(Cabin=spaceship_cleaned['Cabin'].str[0])

# Check the unique values to ensure the transformation worked
spaceship_cleaned['Cabin'].unique()


array(['B', 'F', 'A', 'G', 'E', 'C', 'D', 'T'], dtype=object)
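
To see what the transformation discards, the cabin string can also be split on `/`. The sketch below uses a few cabins in the `B/0/P` format; the column names `deck`, `num` and `side` are hypothetical labels chosen for illustration, not names taken from the lab.

```python
import pandas as pd

# Toy cabins in the same 'B/0/P' format shown in the lab output
cabins = pd.DataFrame({"Cabin": ["B/0/P", "F/0/S", "A/0/S", "F/1/S"]})

# Split into three parts; the column names here are hypothetical labels
parts = cabins["Cabin"].str.split("/", expand=True)
parts.columns = ["deck", "num", "side"]

# Keeping only the first part gives the same result as taking the first character
deck_from_split = parts["deck"]
deck_from_index = cabins["Cabin"].str[0]

print(deck_from_split.tolist())
print(deck_from_index.tolist())
```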

- Drop PassengerId and Name

In [24]:
#your code here
# Dropping the 'PassengerId' and 'Name' columns
spaceship_cleaned = spaceship_cleaned.drop(columns=['PassengerId', 'Name'])

# Verify the columns have been dropped
spaceship_cleaned.head()


| | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Europa | False | B | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False |
| 1 | Earth | False | F | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True |
| 2 | Europa | False | A | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False |
| 3 | Europa | False | A | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False |
| 4 | Earth | False | F | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True |


- For non-numerical columns, create dummy variables.

In [25]:
#your code here
# Identify non-numerical (categorical) columns
categorical_columns = spaceship_cleaned.select_dtypes(include=['object', 'bool']).columns

# Create dummy variables for the categorical columns
spaceship_cleaned_dummies = pd.get_dummies(spaceship_cleaned, columns=categorical_columns, drop_first=True)

# Check the resulting DataFrame
spaceship_cleaned_dummies.head()


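| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | HomePlanet_Europa | HomePlanet_Mars | CryoSleep_True | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e | VIP_True | Transported_True |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 24.0 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 58.0 | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 3 | 33.0 | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 16.0 | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |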


In [26]:
spaceship_cleaned_dummies.dtypes

Age                          float64
RoomService                  float64
FoodCourt                    float64
ShoppingMall                 float64
Spa                          float64
VRDeck                       float64
HomePlanet_Europa              uint8
HomePlanet_Mars                uint8
CryoSleep_True                 uint8
Cabin_B                        uint8
Cabin_C                        uint8
Cabin_D                        uint8
Cabin_E                        uint8
Cabin_F                        uint8
Cabin_G                        uint8
Cabin_T                        uint8
Destination_PSO J318.5-22      uint8
Destination_TRAPPIST-1e        uint8
VIP_True                       uint8
Transported_True               uint8
dtype: object
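
The effect of `drop_first=True` is easier to see on a single toy column. In this sketch the data is invented, and `dtype=int` is passed explicitly as an assumption about the desired output: recent pandas versions return `bool` dummies by default, whereas the `uint8` shown above comes from an older default.

```python
import pandas as pd

# Toy column with three categories, illustrative only
toy = pd.DataFrame({"HomePlanet": ["Earth", "Europa", "Mars", "Earth"]})

# One column per category
full = pd.get_dummies(toy, columns=["HomePlanet"], dtype=int)

# The first category (alphabetically) is dropped and becomes the implicit baseline
reduced = pd.get_dummies(toy, columns=["HomePlanet"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

A row where every remaining dummy is 0 corresponds to the dropped baseline category, here `Earth`, so no information is lost.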

**Perform Train Test Split**

In [27]:
#your code here
# Define the features (X) and target variable (y)
X = spaceship_cleaned_dummies.drop(columns=['Transported_True'])   # Remove the target variable from the features
y = spaceship_cleaned_dummies['Transported_True'] # This is the variable we want to predict

# Perform the train-test split (80% training, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shapes of the resulting datasets
print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'y_test shape: {y_test.shape}')

X_train shape: (5284, 19)
X_test shape: (1322, 19)
y_train shape: (5284,)
y_test shape: (1322,)
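
A variation worth knowing is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both subsets. The sketch below uses an invented, perfectly balanced toy target; whether stratifying is needed for the lab data is an open choice, not something the exercise requires.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features, classes split 50/50 (illustrative only)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# stratify=y_toy keeps the 50/50 class balance in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(sorted(y_te.tolist()))
```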


In [28]:
X_test.dtypes

Age                          float64
RoomService                  float64
FoodCourt                    float64
ShoppingMall                 float64
Spa                          float64
VRDeck                       float64
HomePlanet_Europa              uint8
HomePlanet_Mars                uint8
CryoSleep_True                 uint8
Cabin_B                        uint8
Cabin_C                        uint8
Cabin_D                        uint8
Cabin_E                        uint8
Cabin_F                        uint8
Cabin_G                        uint8
Cabin_T                        uint8
Destination_PSO J318.5-22      uint8
Destination_TRAPPIST-1e        uint8
VIP_True                       uint8
dtype: object

In [29]:
y_train

7832    0
5842    0
3928    1
4091    1
7679    1
       ..
4984    1
6864    1
6919    0
7137    0
1162    1
Name: Transported_True, Length: 5284, dtype: uint8

**Model Selection**

In this exercise we will be using **KNN** as our predictive model.
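
As a reminder of what KNN does, here is a minimal from-scratch sketch: it measures the distance from a new point to every training point and takes a majority vote among the `k` closest. The function `knn_predict` and the toy data are hypothetical, written only for illustration; the lab itself uses scikit-learn's `KNeighborsClassifier`.

```python
from collections import Counter

import numpy as np


def knn_predict(X_known, y_known, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest points."""
    # Euclidean distance from x_new to every known point
    distances = np.linalg.norm(X_known - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most frequent label among those neighbours
    votes = Counter(y_known[nearest].tolist())
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters, illustrative only
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(X_toy, y_toy, np.array([0.5, 0.5]))
pred_high = knn_predict(X_toy, y_toy, np.array([5.5, 5.5]))
print(pred_low, pred_high)
```

Because the prediction depends entirely on distances, columns with large numeric ranges weigh more in the vote than 0/1 dummy columns.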

In [42]:
# 1. Import KNeighborsClassifier and the evaluation metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# 2. Define the KNN model with the number of neighbors (k)
k = 9  # You can adjust this value to tune the model
knn = KNeighborsClassifier(n_neighbors=k)

# 3. Train the model on the training set
knn.fit(X_train, y_train)

# 4. Make predictions on the test set
y_pred = knn.predict(X_test)

# 5. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"KNN model accuracy with k={k}: {accuracy:.2f}")

# 6. Print the classification report
print("\nClassification report:")
print(classification_report(y_test, y_pred))

# 7. Confusion matrix
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))


KNN model accuracy with k=9: 0.81

Classification report:
              precision    recall  f1-score   support

           0       0.81      0.79      0.80       653
           1       0.80      0.82      0.81       669

    accuracy                           0.81      1322
   macro avg       0.81      0.81      0.81      1322
weighted avg       0.81      0.81      0.81      1322

Confusion matrix:
[[516 137]
 [119 550]]


- Evaluate your model's performance and comment on it.
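
As a starting point for that comment, the headline metrics can be re-derived by hand from the confusion matrix printed above. This sketch assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and treats class 1 as the positive class.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([[516, 137],
               [119, 550]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision_1 = tp / (tp + fp)      # of those predicted 1, how many are truly 1
recall_1 = tp / (tp + fn)         # of those truly 1, how many were found

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} recall={recall_1:.2f}")
# accuracy=0.81 precision=0.80 recall=0.82
```

The two error counts (137 false positives and 119 false negatives) are of similar size, which suggests the model is not strongly biased toward either class.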