# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Check the shape of your data**

In [3]:
spaceship.shape

(8693, 14)

**Check for data types**

In [4]:
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

**Check for missing values**

In [5]:
spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a single value (for example, the mean for continuous columns or the mode for categorical ones).
- Estimating the missing values with an algorithm.
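As a rough illustration of the second strategy, the sketch below fills a numeric column with its mean and a categorical column with its mode. The `toy` frame and its values are made up for illustration and are not taken from the lab data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
})

filled = toy.copy()

# Continuous column: fill with the mean of the observed values
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())

# Categorical column: fill with the most frequent value (the mode)
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

print(filled)
```

In a modeling workflow, the fill values would normally be computed on the training split only and then reused on the test split, so that no information from the test data leaks into training.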

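For this exercise, because each column is missing only about 2% of its values, we will drop every row that contains a missing value. The gaps fall in different rows, so this still removes roughly a quarter of the rows (8693 down to 6606, as the shape check below shows).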

In [6]:
# Drop any row that contains at least one missing value
spaceship_cleaned = spaceship.dropna()

# Verify the new shape to see how many rows were removed
spaceship_cleaned.shape

(6606, 14)

- **Cabin** is too granular - transform it in order to obtain {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [7]:
# Extract the first character (the deck) from the Cabin column.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
spaceship_cleaned = spaceship_cleaned.assign(Cabin=spaceship_cleaned['Cabin'].str[0])

# Verify the unique values to ensure we have {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}
print(spaceship_cleaned['Cabin'].unique())

['B' 'F' 'A' 'G' 'E' 'C' 'D' 'T']
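The Cabin strings in the sample rows (for example `B/0/P`) have three slash-separated parts. The sketch below shows one way to split them into separate columns, assuming every non-missing value follows that pattern. The column names `Deck`, `Num`, and `Side` are hypothetical labels chosen for this example.

```python
import pandas as pd

# A few Cabin values copied from the head() output above
cabins = pd.Series(["B/0/P", "F/0/S", "A/0/S", "F/1/S"], name="Cabin")

# Split on "/" into three columns; expand=True returns a DataFrame
parts = cabins.str.split("/", expand=True)
parts.columns = ["Deck", "Num", "Side"]  # hypothetical column names

print(parts["Deck"].tolist())
```

The lab keeps only the first part. For single-letter codes, taking `.str[0]` gives the same result as the `Deck` column here.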


- Drop PassengerId and Name

In [8]:
# Drop the columns that are not useful for prediction
spaceship_cleaned = spaceship_cleaned.drop(['PassengerId', 'Name'], axis=1)

# Check the first few rows to confirm they are gone
spaceship_cleaned.head()

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False
1,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True
2,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False
3,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False
4,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True


- For non-numerical columns, create dummy variables.

In [9]:
# Convert categorical variables into dummy/indicator variables
# We use drop_first=True to avoid the "dummy variable trap" (multicollinearity)
spaceship_dummies = pd.get_dummies(spaceship_cleaned, drop_first=True)

# View the new columns
spaceship_dummies.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_True,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_True
0,39.0,0.0,0.0,0.0,0.0,0.0,False,True,False,False,True,False,False,False,False,False,False,False,True,False
1,24.0,109.0,9.0,25.0,549.0,44.0,True,False,False,False,False,False,False,False,True,False,False,False,True,False
2,58.0,43.0,3576.0,0.0,6715.0,49.0,False,True,False,False,False,False,False,False,False,False,False,False,True,True
3,33.0,0.0,1283.0,371.0,3329.0,193.0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
4,16.0,303.0,70.0,151.0,565.0,2.0,True,False,False,False,False,False,False,False,True,False,False,False,True,False
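To see what `drop_first=True` does, the sketch below encodes a single made-up column with and without it. The `toy` frame is an assumption for illustration only.

```python
import pandas as pd

# Made-up column with the three HomePlanet categories
toy = pd.DataFrame({"HomePlanet": ["Europa", "Earth", "Mars", "Earth"]})

full = pd.get_dummies(toy)                      # one indicator column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category in sorted order is dropped

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=True`, a row whose remaining indicators are all False belongs to the dropped category (`Earth` here), so no information is lost. This matches the `HomePlanet_Europa` and `HomePlanet_Mars` columns in the output above.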


**Perform Train Test Split**

In [11]:
# 1. Separate features and target
X = spaceship_dummies.drop('Transported', axis=1)
y = spaceship_dummies['Transported']

# 2. Perform the split (using 20% for testing and a random_state for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shapes to confirm the split
print(f"Training set size: {X_train.shape}")
print(f"Testing set size: {X_test.shape}")

Training set size: (5284, 19)
Testing set size: (1322, 19)
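The split above is purely random. As an optional variation that the lab itself does not use, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both splits. The sketch below uses made-up data with a 30/70 class ratio to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 30% positive class (made up for illustration)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([True] * 30 + [False] * 70)

# stratify=y_toy keeps the True/False ratio equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of the positive class in each split
print(y_tr.mean(), y_te.mean())
```

Stratifying matters most when classes are imbalanced. In this lab the two classes are close to equal, so a plain random split is already reasonable.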


**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [12]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# 1. Scale the data (Crucial for KNN performance)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 2. Initialize the KNN model
# We'll start with k=5 (the default)
knn = KNeighborsClassifier(n_neighbors=5)

# 3. Train the model
knn.fit(X_train_scaled, y_train)

print("KNN model training complete.")

KNN model training complete.
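The scale-then-fit steps can also be bundled into a single estimator. The sketch below uses scikit-learn's `make_pipeline` on synthetic stand-in data. `X_demo` and `y_demo` are assumptions for illustration, not the Spaceship Titanic features.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 1000.0])
y_demo = X_demo[:, 0] > 0

# The pipeline fits the scaler on the data passed to fit(),
# then applies the same transformation before every predict() or score()
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_demo[:150], y_demo[:150])

score = model.score(X_demo[150:], y_demo[150:])
print(score)
```

Bundling the steps makes it harder to fit the scaler on test data by accident, and it lets the whole model be passed to tools such as cross-validation as one object.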


- Evaluate your model's performance and comment on it.

In [13]:
from sklearn.metrics import accuracy_score, classification_report

# 1. Generate predictions using the scaled test data
y_pred = knn.predict(X_test_scaled)

# 2. Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2%}")

# 3. Detailed performance report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Model Accuracy: 78.59%

Classification Report:
              precision    recall  f1-score   support

       False       0.79      0.78      0.78       653
        True       0.79      0.79      0.79       669

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322
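A confusion matrix shows the raw counts behind the precision and recall figures. The sketch below uses hypothetical labels chosen only to show the layout. They are not the lab's predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, chosen only to show the layout of the matrix
y_true = [False, False, False, True, True, True, True, False]
y_hat = [False, True, False, True, False, True, True, False]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_hat, labels=[False, True])
tn, fp, fn, tp = cm.ravel()

print(cm)
```

Applied to the lab, the same call would take `y_test` and `y_pred` in place of the hypothetical lists.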



In [None]:
'''
Model Performance Analysis
The KNN model performed reasonably well. It reached an accuracy of 78.59%, meaning it correctly predicted whether a passenger was transported for about 79% of the test passengers.

My observations:

Balanced Results: Precision and recall are both about 0.79 for each class, so the model identifies transported passengers about as well as passengers who were not transported.

Equal Classes: The "support" column shows a similar number of False (653) and True (669) cases. Because the classes are balanced, accuracy is a trustworthy summary here: a model that always predicted the majority class would score only about 51%.

Why it worked: I think cleaning the data and scaling the features before training helped the KNN algorithm. KNN finds neighbors by distance, so without scaling, features with large values (like spending) would drown out features with small values (like age).
'''
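The point about distance can be checked numerically. The sketch below compares Euclidean distances before and after standardization for three made-up passengers. The values and the helper `dists_from_first` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up passengers: columns are [Age, Spa spending]
X_demo = np.array([
    [20.0, 0.0],
    [60.0, 100.0],
    [21.0, 3000.0],
])

def dists_from_first(X):
    """Euclidean distance from row 0 to rows 1 and 2."""
    return np.linalg.norm(X[1:] - X[0], axis=1)

raw = dists_from_first(X_demo)
scaled = dists_from_first(StandardScaler().fit_transform(X_demo))

print(raw)     # the spending gap dominates the raw distances
print(scaled)  # after scaling, the age gap and the spending gap weigh about the same
```

On the raw values, the second passenger looks far closer to the first than the third does, purely because spending is measured in larger numbers. After scaling, the two distances are nearly equal.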