# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [2]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [3]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


**Check the shape of your data**

In [4]:
#your code here
spaceship.shape

(8693, 14)

**Check for data types**

In [5]:
#your code here
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

**Check for missing values**

In [11]:
#your code here
spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a single value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values with an algorithm.
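
The three strategies listed above can be sketched side by side. This is a minimal illustration on a small made-up DataFrame (the rows and the `toy` name are hypothetical, not the lab data), and `KNNImputer` is just one possible choice for the algorithmic option, assumed here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (hypothetical rows), mixing numeric and categorical columns
toy = pd.DataFrame({
    "Age": [39.0, 24.0, np.nan, 33.0, 16.0],
    "RoomService": [0.0, 109.0, 43.0, np.nan, 303.0],
    "Spa": [0.0, 549.0, 6715.0, 3329.0, 565.0],
    "HomePlanet": ["Europa", "Earth", "Europa", None, "Earth"],
})

# Strategy 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Strategy 2: fill with one value per column (mean for numeric, mode for categorical)
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["RoomService"] = filled["RoomService"].fillna(filled["RoomService"].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

# Strategy 3: fill with an algorithm (here, the average of the 2 nearest rows;
# applied to the numeric columns only)
numeric_cols = ["Age", "RoomService", "Spa"]
imputed = toy.copy()
imputed[numeric_cols] = KNNImputer(n_neighbors=2).fit_transform(toy[numeric_cols])

print(dropped.shape)
print(filled.isnull().sum().sum())
print(imputed[numeric_cols].isnull().sum().sum())
```

Dropping is the simplest option but discards whole rows; the two filling options keep every row at the cost of introducing estimated values.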

For this exercise, because each column is missing only about 2% of its values, we will drop rows containing any missing value.

In [12]:
#your code here
# Drop rows with any missing values
spaceship_clean = spaceship.dropna()

# Check the shape of the cleaned dataset
print(spaceship_clean.shape)

# Verify that there are no more missing values
missing_values_cleaned = spaceship_clean.isnull().sum()
print(missing_values_cleaned)

(6606, 14)
PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
Transported     0
dtype: int64


- **Cabin** is too granular. Transform it to keep only the deck letter, so that the values are one of {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [13]:
#your code here
# Extract first letter of 'Cabin' to represent the deck
spaceship_clean['Deck'] = spaceship_clean['Cabin'].apply(lambda x: x[0] if pd.notnull(x) else np.nan)

# Check the unique values of 'Deck'
print("Unique values in 'Deck':", spaceship_clean['Deck'].unique())

Unique values in 'Deck': ['B' 'F' 'A' 'G' 'E' 'C' 'D' 'T']


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  spaceship_clean['Deck'] = spaceship_clean['Cabin'].apply(lambda x: x[0] if pd.notnull(x) else np.nan)
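
The message above is pandas' `SettingWithCopyWarning`. It is raised when a column is assigned on a DataFrame that pandas considers derived from another one, because it cannot tell whether the original was meant to change too. The `Deck` column is still created correctly here. A minimal sketch of one way to avoid the warning, assuming an explicit `.copy()` after `dropna()`; the `raw` and `clean` frames are hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw data (hypothetical rows)
raw = pd.DataFrame({
    "Cabin": ["B/0/P", "F/0/S", np.nan, "A/0/S"],
    "Age": [39.0, 24.0, 58.0, np.nan],
})

# .copy() makes the cleaned frame independent of `raw`,
# so adding a column to it is unambiguous
clean = raw.dropna().copy()

# The deck is the first character of the Cabin string
clean["Deck"] = clean["Cabin"].str[0]

print(clean["Deck"].tolist())
```

The `.str[0]` accessor does the same job as the `apply` with a lambda and propagates missing values on its own.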


- Drop PassengerId and Name

In [14]:
#your code here
# Drop 'PassengerId' and 'Name' columns
spaceship_clean = spaceship_clean.drop(['PassengerId', 'Name'], axis=1)

# Check the first few rows of the cleaned data
print(spaceship_clean.head())


  HomePlanet CryoSleep  Cabin  Destination   Age    VIP  RoomService  \
0     Europa     False  B/0/P  TRAPPIST-1e  39.0  False          0.0   
1      Earth     False  F/0/S  TRAPPIST-1e  24.0  False        109.0   
2     Europa     False  A/0/S  TRAPPIST-1e  58.0   True         43.0   
3     Europa     False  A/0/S  TRAPPIST-1e  33.0  False          0.0   
4      Earth     False  F/1/S  TRAPPIST-1e  16.0  False        303.0   

   FoodCourt  ShoppingMall     Spa  VRDeck  Transported Deck  
0        0.0           0.0     0.0     0.0        False    B  
1        9.0          25.0   549.0    44.0         True    F  
2     3576.0           0.0  6715.0    49.0        False    A  
3     1283.0         371.0  3329.0   193.0        False    A  
4       70.0         151.0   565.0     2.0         True    F  


- For non-numerical columns, create dummy variables.

In [15]:
#your code here
# Convert non-numerical columns to dummy variables
spaceship_clean = pd.get_dummies(spaceship_clean)

# Check the first few rows of the cleaned data
print(spaceship_clean.head())

    Age  RoomService  FoodCourt  ShoppingMall     Spa  VRDeck  Transported  \
0  39.0          0.0        0.0           0.0     0.0     0.0        False   
1  24.0        109.0        9.0          25.0   549.0    44.0         True   
2  58.0         43.0     3576.0           0.0  6715.0    49.0        False   
3  33.0          0.0     1283.0         371.0  3329.0   193.0        False   
4  16.0        303.0       70.0         151.0   565.0     2.0         True   

   HomePlanet_Earth  HomePlanet_Europa  HomePlanet_Mars  ...  VIP_False  \
0             False               True            False  ...       True   
1              True              False            False  ...       True   
2             False               True            False  ...      False   
3             False               True            False  ...       True   
4              True              False            False  ...       True   

   VIP_True  Deck_A  Deck_B  Deck_C  Deck_D  Deck_E  Deck_F  Deck_G  Deck_T  
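0     False   False    True   False   False   False   False   False   False  
1     False   False   False   False   False   False    True   False   False  
2      True    True   False   False   False   False   False   False   False  
3     False    True   False   False   False   False   False   False   False  
4     False   False   False   False   False   False    True   False   False  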

**Perform Train Test Split**

In [17]:
#your code here
from sklearn.model_selection import train_test_split

# Define the target variable ('Transported') and the feature columns
target = 'Transported'
features = spaceship_clean.drop(columns=target).columns

# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(spaceship_clean[features], spaceship_clean[target], test_size=0.2, random_state=42)

# Check the size of the train and test sets
print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)


Training set size: (5284, 5329)
Test set size: (1322, 5329)
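
The split reports 5,329 feature columns. The raw `Cabin` column was still in the DataFrame when `pd.get_dummies` ran (it is visible in the `head()` output after `PassengerId` and `Name` were dropped), so every distinct cabin string became its own dummy column. A minimal sketch of the effect on toy rows (hypothetical data), assuming `Cabin` is dropped once `Deck` has been extracted; the scores reported in this lab were obtained without this step and would likely change with it:

```python
import pandas as pd

# Toy frame (hypothetical rows) with the same kinds of columns as the lab data
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Europa", "Earth"],
    "Cabin": ["B/0/P", "F/0/S", "A/0/S", "F/1/S"],
    "Age": [39.0, 24.0, 58.0, 16.0],
    "Transported": [False, True, False, True],
})
toy["Deck"] = toy["Cabin"].str[0]

# Keeping Cabin: one dummy column per distinct cabin string
with_cabin = pd.get_dummies(toy)

# Dropping Cabin first: only the coarse Deck categories are encoded
without_cabin = pd.get_dummies(toy.drop(columns="Cabin"))

print(with_cabin.shape[1], without_cabin.shape[1])
```

Near-unique dummy columns add almost no signal a distance-based model can generalize from, which is the reason for reducing `Cabin` to `Deck` in the first place.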


**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [23]:
#your code here

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Initialize the KNN model
knn = KNeighborsClassifier(n_neighbors=15)

# Fit the model to the training data
knn.fit(X_train, y_train)

# Make predictions on the test data
y_pred = knn.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Model accuracy:", accuracy)


Model accuracy: 0.8071104387291982
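
KNN classifies each point by its distance to its neighbors, so features with large ranges (such as the spending columns) can weigh more than 0/1 dummy columns. The model above was fitted on unscaled features. A minimal sketch of how scaling could be combined with KNN, assuming a `StandardScaler` in a pipeline and synthetic data from `make_classification` (not the Spaceship Titanic set); the variable names are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the lab dataset)
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X[:, 0] *= 1000  # exaggerate one feature's scale, like a spending column

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# The scaler is fitted on the training split only, then applied to the test split
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
scaled_knn.fit(X_tr, y_tr)

print("Accuracy with scaling:", scaled_knn.score(X_te, y_te))
```

Putting the scaler inside the pipeline keeps test-set statistics out of the fitting step.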


- Evaluate your model's performance and comment on it.

In [24]:
#your code here
from sklearn.metrics import classification_report

# Make predictions on the test data
y_pred = knn.predict(X_test)

# Print classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.82      0.78      0.80       653
        True       0.79      0.84      0.81       669

    accuracy                           0.81      1322
   macro avg       0.81      0.81      0.81      1322
weighted avg       0.81      0.81      0.81      1322
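
Reading the report: the two classes are nearly balanced in the test set (653 `False` vs 669 `True`), so the 0.81 accuracy is a fair summary. Recall is higher for `True` (0.84) than for `False` (0.78), meaning the model finds transported passengers somewhat more reliably than non-transported ones. As a minimal sketch of where these columns come from, the snippet below derives precision and recall from a confusion matrix, using hypothetical labels rather than the lab's predictions:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, only to show how the report's columns are computed
y_true = [True, True, True, False, False, False, True, False]
y_hat = [True, True, False, False, True, False, True, False]

# Rows are actual classes, columns are predicted classes, ordered [False, True]
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[False, True]).ravel()

precision_true = tp / (tp + fp)  # of the rows predicted True, the share that were True
recall_true = tp / (tp + fn)     # of the rows that were True, the share that were found

print(tn, fp, fn, tp)
print(precision_true, recall_true)
```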

