# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


**Check the shape of your data**

In [6]:
spaceship_shape = spaceship.shape
print("Shape of the dataset:", spaceship_shape)

Shape of the dataset: (8693, 14)


**Check for data types**

In [8]:
# Check data types of each column
print("Data types of each column:")
print(spaceship.dtypes)

Data types of each column:
PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object


**Check for missing values**

In [10]:
# Check for missing values in each column
print("Missing values in each column:")
print(spaceship.isnull().sum())

Missing values in each column:
PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64


There are multiple strategies for handling missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a single value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values with an algorithm.

For this exercise, because each column has only a small share of null values (roughly 2% per column), we will drop rows containing any missing value.

In [12]:
# Drop rows with any missing values
spaceship_cleaned = spaceship.dropna()

# Check the shape after dropping missing values
print("Shape after dropping missing values:", spaceship_cleaned.shape)

Shape after dropping missing values: (6606, 14)
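
As an alternative to dropping rows, the second strategy listed above (filling with a single value) could look roughly like the sketch below. The `toy` frame is a hypothetical stand-in for a few Spaceship Titanic columns, not the real data, and the mean/mode choice simply follows the example given in the list.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a few Spaceship Titanic columns
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "RoomService": [0.0, 109.0, np.nan, 0.0],
    "HomePlanet": ["Europa", "Earth", None, "Europa"],
})

imputed = toy.copy()

# Continuous columns: fill missing values with the column mean
numeric_cols = imputed.select_dtypes(include="number").columns
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].mean())

# Categorical columns: fill missing values with the most frequent value (mode)
categorical_cols = imputed.select_dtypes(exclude="number").columns
for col in categorical_cols:
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
print("Remaining missing values:", int(imputed.isnull().sum().sum()))
```

In a modeling workflow, the fill values would normally be computed on the training split only and then reused on the test split, so that no test-set statistics leak into training.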


- **Cabin** is too granular - transform it in order to obtain {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [14]:
# Extract the deck from Cabin (the part before the first '/').
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# that direct column assignment on the result of dropna() would trigger.
spaceship_cleaned = spaceship_cleaned.assign(
    Cabin=spaceship_cleaned['Cabin'].str.split('/').str[0]
)

# Check the unique values of Cabin to confirm
print("Unique values in Cabin after transformation:", spaceship_cleaned['Cabin'].unique())

Unique values in Cabin after transformation: ['B' 'F' 'A' 'G' 'E' 'C' 'D' 'T']
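
Keeping only the deck discards the other parts of the cabin string. A minimal sketch of retaining them as separate features is shown here, assuming the `deck/num/side` format suggested by values such as `B/0/P`. The `cabins` frame and the column names `Deck`, `CabinNum`, and `Side` are hypothetical and chosen for illustration only.

```python
import pandas as pd

# Hypothetical sample of Cabin strings in the assumed "deck/num/side" format
cabins = pd.DataFrame({"Cabin": ["B/0/P", "F/0/S", "A/0/S", "F/1/S"]})

# Split once into three parts; expand=True returns one column per part
parts = cabins["Cabin"].str.split("/", expand=True)
parts.columns = ["Deck", "CabinNum", "Side"]

# assign() returns a new DataFrame instead of modifying a possible view
cabins = cabins.assign(
    Deck=parts["Deck"],
    CabinNum=parts["CabinNum"].astype(int),
    Side=parts["Side"],
)

print(cabins)
```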


- Drop PassengerId and Name

In [16]:
# Drop PassengerId and Name columns
spaceship_cleaned = spaceship_cleaned.drop(columns=['PassengerId', 'Name'])

# Check the shape after dropping columns
print("Shape after dropping PassengerId and Name:", spaceship_cleaned.shape)

Shape after dropping PassengerId and Name: (6606, 12)


- For non-numerical columns, create dummy variables (one-hot encoding).

In [19]:
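# Convert CryoSleep, VIP, and Transported to boolean.
# This is safe only because rows with NaN were dropped earlier: astype(bool) would map NaN to True.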
spaceship_cleaned['CryoSleep'] = spaceship_cleaned['CryoSleep'].astype(bool)
spaceship_cleaned['VIP'] = spaceship_cleaned['VIP'].astype(bool)
spaceship_cleaned['Transported'] = spaceship_cleaned['Transported'].astype(bool)

# Perform one-hot encoding on non-numerical columns
non_numerical_cols = ['HomePlanet', 'Cabin', 'Destination']
spaceship_encoded = pd.get_dummies(spaceship_cleaned, columns=non_numerical_cols, drop_first=True)

# Check the shape after encoding
print("Shape after one-hot encoding:", spaceship_encoded.shape)

# Check the first few rows of the encoded dataset
print(spaceship_encoded.head())

Shape after one-hot encoding: (6606, 20)
   CryoSleep   Age    VIP  RoomService  FoodCourt  ShoppingMall     Spa  \
0      False  39.0  False          0.0        0.0           0.0     0.0   
1      False  24.0  False        109.0        9.0          25.0   549.0   
2      False  58.0   True         43.0     3576.0           0.0  6715.0   
3      False  33.0  False          0.0     1283.0         371.0  3329.0   
4      False  16.0  False        303.0       70.0         151.0   565.0   

   VRDeck  Transported  HomePlanet_Europa  HomePlanet_Mars  Cabin_B  Cabin_C  \
0     0.0        False               True            False     True    False   
1    44.0         True              False            False    False    False   
2    49.0        False               True            False    False    False   
3   193.0        False               True            False    False    False   
4     2.0         True              False            False    False    False   

   Cabin_D  Cabin_E  Cabin_
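
To make the effect of `drop_first=True` concrete, the following small sketch encodes a hypothetical three-value `HomePlanet` column with and without that option. The `toy` frame is illustrative only.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({"HomePlanet": ["Europa", "Earth", "Mars", "Earth"]})

# One indicator column per category
full = pd.get_dummies(toy, columns=["HomePlanet"])

# drop_first=True drops the first category (alphabetically), which becomes
# the implicit baseline: a row with all indicators False is "Earth"
reduced = pd.get_dummies(toy, columns=["HomePlanet"], drop_first=True)

print(list(full.columns))     # ['HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars']
print(list(reduced.columns))  # ['HomePlanet_Europa', 'HomePlanet_Mars']
```

This is why the encoded dataset contains `HomePlanet_Europa` and `HomePlanet_Mars` but no `HomePlanet_Earth` column.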

**Perform Train Test Split**

In [21]:
from sklearn.model_selection import train_test_split

# Define features (X) and target (y)
X = spaceship_encoded.drop(columns=['Transported'])  # All columns except the target
y = spaceship_encoded['Transported']  # Target variable

# Perform train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shapes of the resulting sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (5284, 19)
X_test shape: (1322, 19)
y_train shape: (5284,)
y_test shape: (1322,)
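
An optional variation on the split above is to pass `stratify=y`, which keeps the class proportions the same in the train and test sets. The sketch below uses synthetic, deliberately imbalanced data (`X_toy`, `y_toy` are hypothetical) to make the effect visible. With classes as close to balanced as in this dataset, it is unlikely to change much.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data: 30% True, 70% False
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({"feature": rng.normal(size=1000)})
y_toy = pd.Series([True] * 300 + [False] * 700, name="Transported")

# stratify=y_toy preserves the 30/70 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Overall positive rate:", y_toy.mean())
print("Train positive rate:  ", y_tr.mean())
print("Test positive rate:   ", y_te.mean())
```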


**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [23]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Standardize all features; the scaler is fit on the training set only so no test statistics leak into training
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the KNN model (using k=5 as default)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = knn.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of KNN model:", accuracy)

# Detailed classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy of KNN model: 0.7874432677760969
Classification Report:
              precision    recall  f1-score   support

       False       0.78      0.79      0.79       653
        True       0.79      0.79      0.79       669

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322



- Evaluate your model's performance. Comment on it.

In [None]:
The KNN model with k=5 achieved an accuracy of 78.74% on the test set, indicating decent performance for a baseline model on the Spaceship Titanic dataset.
Precision, recall, and F1-score are balanced between the False and True classes (~0.78-0.79), suggesting that the model performs equally well for both classes and that the dataset is fairly balanced.
However, there is room for improvement. Performance could be enhanced by tuning hyperparameters (e.g., trying different values of k) or by incorporating additional feature engineering, such as creating a total expenditure feature or extracting more information from the Cabin column.
The loss of 24% of the data due to dropping missing values might impact the model's generalizability. In a real-world scenario, imputation could be considered as an alternative approach to retain more data.
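
The hyperparameter tuning mentioned above could be sketched as follows. This is one possible approach rather than a prescribed one: it wraps scaling and KNN in a `Pipeline` and searches over `k` with 5-fold cross-validation. The data comes from `make_classification` as a synthetic stand-in, and the candidate values of `k` are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix and target
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Putting the scaler inside the pipeline means it is re-fit on each
# cross-validation training fold, so validation folds never influence scaling
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate values of k (an arbitrary illustrative grid)
param_grid = {"knn__n_neighbors": [3, 5, 7, 11, 15, 21]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best k:", search.best_params_["knn__n_neighbors"])
print("Cross-validated accuracy:", round(search.best_score_, 3))
print("Held-out test accuracy:", round(search.score(X_te, y_te), 3))
```

Applied to this lab, `X_train` and `y_train` would take the place of `X_tr` and `y_tr`, with the test set touched only once at the end.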