# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

In [3]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Check the shape of your data**

In [7]:
#your code here
print("Data shape:", spaceship.shape)

Data shape: (8693, 14)


**Check for data types**

In [15]:
#your code here
# Check the data type of each column
print(spaceship.dtypes)


PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object
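
`CryoSleep` and `VIP` contain True/False values but are reported as `object`, while `Transported` is `bool`. A likely explanation (an assumption, not something the lab states) is that pandas cannot store a column as `bool` when it mixes booleans with missing values. The minimal sketch below uses a hypothetical three-value Series to show the effect:

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing booleans with a missing value
flags = pd.Series([True, False, np.nan])
print(flags.dtype)        # object

# Once the missing value is gone, the column can be cast to a real boolean dtype
flags_clean = flags.dropna().astype(bool)
print(flags_clean.dtype)  # bool
```
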


In [43]:
# Select numerical columns only
num_cols = spaceship.select_dtypes(include=[np.number])

# Check for null values in numerical columns
print("Null values in numerical columns:\n", num_cols.isnull().sum())


Null values in numerical columns:
 Age             179
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
dtype: int64


**Check for missing values**

In [11]:
#your code here
print("Missing values:\n", spaceship.isnull().sum())

Missing values:
 PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64


In [45]:
# Alternative to dropping rows: impute on a copy so `spaceship` itself keeps its nulls
spaceship_filled = spaceship.copy()

# Categorical columns: fill with the mode (most frequent value)
for col in ['HomePlanet', 'CryoSleep', 'Destination', 'VIP']:
    mode_value = spaceship_filled[col].mode()[0]
    spaceship_filled[col] = spaceship_filled[col].fillna(mode_value)

# Numerical columns: fill with the median
for col in ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']:
    median_value = spaceship_filled[col].median()
    spaceship_filled[col] = spaceship_filled[col].fillna(median_value)


There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a fixed value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values with an algorithm.

For this exercise, because each column is missing only about 2% of its values, we will drop rows containing any missing value.
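
The three strategies listed above can be compared on a small example. This is a minimal sketch on a hypothetical toy DataFrame (not the Spaceship Titanic file); `KNNImputer` is used here as one possible "algorithm" and is not prescribed by the lab.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data with a few gaps
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 60.0],
    "Spa": [0.0, 10.0, np.nan, 30.0],
    "VRDeck": [1.0, 2.0, 3.0, 4.0],
    "HomePlanet": ["Earth", "Mars", None, "Earth"],
})

# Strategy 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Strategy 2: fill with one value per column (mean for numeric, mode for categorical)
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["Spa"] = filled["Spa"].fillna(filled["Spa"].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

# Strategy 3: fill with an algorithm (numeric columns only in this sketch)
numeric_cols = ["Age", "Spa", "VRDeck"]
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(toy[numeric_cols]), columns=numeric_cols)

print(len(toy), "rows ->", len(dropped), "rows after dropna")
print(filled)
print(imputed)
```

Dropping is the simplest option but discards whole rows, while the two filling strategies keep every row at the cost of introducing estimated values.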

In [17]:
#your code here
# Drop any rows with null values
spaceship_clean = spaceship.dropna()
print("Data shape after dropping nulls:", spaceship_clean.shape)

Data shape after dropping nulls: (6606, 14)


- **Cabin** is too granular. Transform it to keep only its first character (the deck letter), so that its values are in {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [28]:
#your code here
# Work on an explicit copy of the rows without nulls
spaceship_clean = spaceship.dropna().copy()

# Keep only the first character of Cabin (the deck letter)
spaceship_clean['Cabin'] = spaceship_clean['Cabin'].str[0]
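
As a minimal sketch of the Cabin transformation, the example below uses four cabin values copied from the data preview. It shows the first-character approach next to a hypothetical alternative that splits the string on `/`; the column names `Deck`, `Num` and `Side` are assumptions chosen for illustration.

```python
import pandas as pd

# Sample values taken from the preview of the dataset
cabins = pd.Series(["B/0/P", "F/0/S", "A/0/S", "F/1/S"], name="Cabin")

# Option 1: keep only the first character
deck = cabins.str[0]

# Option 2: split on "/" and keep all three parts as separate columns
parts = cabins.str.split("/", expand=True)
parts.columns = ["Deck", "Num", "Side"]  # hypothetical column names

print(deck.tolist())  # ['B', 'F', 'A', 'F']
print(parts)
```
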


- Drop PassengerId and Name

In [30]:
#your code here
# Check first
print(spaceship_clean.columns)

# Drop if exists
cols_to_drop = ['PassengerId', 'Name']
cols_existing = [col for col in cols_to_drop if col in spaceship_clean.columns]

spaceship_clean = spaceship_clean.drop(columns=cols_existing)


Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age',
       'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Name', 'Transported'],
      dtype='object')


- For non-numerical columns, create dummy variables.

In [32]:
#your code here
# Get dummies for categorical columns (including 'Cabin')
spaceship_clean = pd.get_dummies(spaceship_clean, drop_first=True)
spaceship_clean.head()


,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_True,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_True
0,39.0,0.0,0.0,0.0,0.0,0.0,False,True,False,False,True,False,False,False,False,False,False,False,True,False
1,24.0,109.0,9.0,25.0,549.0,44.0,True,False,False,False,False,False,False,False,True,False,False,False,True,False
2,58.0,43.0,3576.0,0.0,6715.0,49.0,False,True,False,False,False,False,False,False,False,False,False,False,True,True
3,33.0,0.0,1283.0,371.0,3329.0,193.0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
4,16.0,303.0,70.0,151.0,565.0,2.0,True,False,False,False,False,False,False,False,True,False,False,False,True,False
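
The output above has no `HomePlanet_Earth` or `Cabin_A` column because `drop_first=True` removes the first category of each encoded column and treats it as the baseline. A minimal sketch on a hypothetical one-column DataFrame makes the difference visible:

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({"HomePlanet": ["Earth", "Europa", "Mars", "Earth"]})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category dropped

print(list(full.columns))     # ['HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars']
print(list(reduced.columns))  # ['HomePlanet_Europa', 'HomePlanet_Mars']
```

A row whose remaining dummy columns are all `False` therefore belongs to the dropped category, here `Earth`.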


**Perform Train Test Split**

In [36]:
#your code here
# Define X and y
X = spaceship_clean.drop('Transported', axis=1)
y = spaceship_clean['Transported']

# Split into train and test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
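
The split above is purely random. As a hedged side note, `train_test_split` also accepts a `stratify` argument that preserves the class proportions in both subsets. The sketch below uses hypothetical imbalanced labels to show the effect; the Spaceship Titanic target looks close to balanced, so the difference there is probably small.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 False, 30 True
y_toy = np.array([False] * 70 + [True] * 30)
X_toy = np.arange(100).reshape(-1, 1)

Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both subsets keep the original 30% share of True labels
print(len(Xt_train), len(Xt_test))
print(yt_train.mean(), yt_test.mean())
```
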

**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [38]:
#your code here
# Initialize KNN
knn = KNeighborsClassifier(n_neighbors=5)

# Train the model
knn.fit(X_train, y_train)

# Predictions
y_pred = knn.predict(X_test)
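
KNN classifies by distance, so features with large ranges (such as the spending columns, which reach the thousands) can outweigh the 0/1 dummy columns. The sketch below is not part of the lab solution; it uses synthetic stand-in data to show one common way of standardizing features before KNN with a scikit-learn pipeline. All variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the Spaceship Titanic features)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=42)
X_demo[:, 0] *= 1000  # inflate one feature so it dominates raw distances

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# KNN on the raw features
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(Xd_train, yd_train)

# Same KNN, but every feature is standardized first;
# the pipeline fits the scaler on the training data only
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(Xd_train, yd_train)

raw_acc = raw_knn.score(Xd_test, yd_test)
scaled_acc = scaled_knn.score(Xd_test, yd_test)
print("Raw features:   ", raw_acc)
print("Scaled features:", scaled_acc)
```

Whether scaling helps on the real data would need to be checked by running the same comparison on `X_train` and `X_test`.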


- Evaluate your model's performance and comment on it.

In [41]:
#your code here
# Evaluation
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy Score: 0.7874432677760969

Classification Report:
               precision    recall  f1-score   support

       False       0.78      0.79      0.79       653
        True       0.79      0.78      0.79       669

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322
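
In the report above, accuracy is about 0.79, and precision and recall are between 0.78 and 0.79 for both classes, so the model does not appear to favor either class. To make those metrics concrete, the sketch below derives them from a confusion matrix on a handful of hypothetical labels (encoded as 0/1 rather than False/True to keep the example simple).

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels, chosen so the arithmetic is easy to follow (1 = transported)
y_true_toy = [0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes, ordered [0, 1]
cm = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Each metric is a ratio of confusion-matrix cells
acc = accuracy_score(y_true_toy, y_pred_toy)     # (tp + tn) / total
prec = precision_score(y_true_toy, y_pred_toy)   # tp / (tp + fp)
rec = recall_score(y_true_toy, y_pred_toy)       # tp / (tp + fn)
print("accuracy :", acc)
print("precision:", prec)
print("recall   :", rec)
```
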

