# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [6]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

from tabulate import tabulate

In [9]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Check the shape of your data**

In [10]:
# Check the shape of your data
print(tabulate([spaceship.shape], headers=['Rows', 'Columns'], tablefmt='grid'))

+--------+-----------+
|   Rows |   Columns |
+========+===========+
|   8693 |        14 |
+--------+-----------+


**Check for data types**

In [11]:
#Check for data types
print(tabulate(spaceship.dtypes.reset_index().values, headers=['Column', 'Data Type'], tablefmt='grid'))

+--------------+-------------+
| Column       | Data Type   |
+==============+=============+
| PassengerId  | object      |
+--------------+-------------+
| HomePlanet   | object      |
+--------------+-------------+
| CryoSleep    | object      |
+--------------+-------------+
| Cabin        | object      |
+--------------+-------------+
| Destination  | object      |
+--------------+-------------+
| Age          | float64     |
+--------------+-------------+
| VIP          | object      |
+--------------+-------------+
| RoomService  | float64     |
+--------------+-------------+
| FoodCourt    | float64     |
+--------------+-------------+
| ShoppingMall | float64     |
+--------------+-------------+
| Spa          | float64     |
+--------------+-------------+
| VRDeck       | float64     |
+--------------+-------------+
| Name         | object      |
+--------------+-------------+
| Transported  | bool        |
+--------------+-------------+
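
`CryoSleep` and `VIP` hold True/False values in the sample rows, yet they are reported as `object`. A likely cause is the missing values in those columns: a plain `bool` column cannot store a missing entry, so pandas falls back to `object`. A minimal sketch on a toy Series (the values are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy column mimicking CryoSleep: booleans plus one missing value (made-up data)
cryo = pd.Series([True, False, None, True])
print(cryo.dtype)  # object, because None cannot be stored in a bool column

# Once the missing entries are removed, the column can be cast to a real bool
cryo_clean = cryo.dropna().astype(bool)
print(cryo_clean.dtype)  # bool
```

Whether the cast is worth doing depends on the later steps; it only makes sense after the missing values have been dropped or filled.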


**Check for missing values**

In [12]:
#Check for missing values
print(tabulate(spaceship.isnull().sum().reset_index().values, headers=['Column', 'Missing Values'], tablefmt='grid'))

+--------------+------------------+
| Column       |   Missing Values |
+==============+==================+
| PassengerId  |                0 |
+--------------+------------------+
| HomePlanet   |              201 |
+--------------+------------------+
| CryoSleep    |              217 |
+--------------+------------------+
| Cabin        |              199 |
+--------------+------------------+
| Destination  |              182 |
+--------------+------------------+
| Age          |              179 |
+--------------+------------------+
| VIP          |              203 |
+--------------+------------------+
| RoomService  |              181 |
+--------------+------------------+
| FoodCourt    |              183 |
+--------------+------------------+
| ShoppingMall |              208 |
+--------------+------------------+
| Spa          |              183 |
+--------------+------------------+
| VRDeck       |              188 |
+--------------+------------------+
| Name         |              200 |
+--------------+------------

There are multiple strategies for handling missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a single value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values with an algorithm.

For this exercise, because each column has only about 2% null values (roughly 180 to 220 out of 8,693 rows), we will drop every row that contains a missing value.
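
For comparison, the three strategies listed above can be sketched on a small made-up DataFrame. The column values are invented for illustration, and `KNNImputer` is just one possible choice for the "algorithm" option, not something this lab prescribes:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up rows with a few gaps; not taken from the real dataset
toy = pd.DataFrame({
    "Age": [39.0, 24.0, np.nan, 33.0],
    "Spa": [0.0, 549.0, 6715.0, np.nan],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
})

# Strategy 1: drop every row that has at least one missing value
dropped = toy.dropna(axis=0, how="any")

# Strategy 2: fill with a single value (mean for numeric, mode for categorical)
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["Spa"] = filled["Spa"].fillna(filled["Spa"].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

# Strategy 3: fill with an algorithm, here nearest-neighbour imputation
# applied to the numeric columns only
numeric_cols = ["Age", "Spa"]
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(toy[numeric_cols]), columns=numeric_cols)

print(dropped.shape, filled.isnull().sum().sum(), imputed.isnull().sum().sum())
```

Dropping is the simplest option but discards whole rows; the two filling strategies keep every row at the cost of introducing estimated values.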

In [13]:
# Drop every row that contains at least one missing value
spaceship.dropna(axis=0, how='any', inplace=True)

print(tabulate(spaceship.isnull().sum().reset_index().values, headers=['Column', 'Missing Values'], tablefmt='grid'))


+--------------+------------------+
| Column       |   Missing Values |
+==============+==================+
| PassengerId  |                0 |
+--------------+------------------+
| HomePlanet   |                0 |
+--------------+------------------+
| CryoSleep    |                0 |
+--------------+------------------+
| Cabin        |                0 |
+--------------+------------------+
| Destination  |                0 |
+--------------+------------------+
| Age          |                0 |
+--------------+------------------+
| VIP          |                0 |
+--------------+------------------+
| RoomService  |                0 |
+--------------+------------------+
| FoodCourt    |                0 |
+--------------+------------------+
| ShoppingMall |                0 |
+--------------+------------------+
| Spa          |                0 |
+--------------+------------------+
| VRDeck       |                0 |
+--------------+------------------+
| Name         |                0 |
+--------------+------------

- **Cabin** is too granular: transform it to keep only the deck letter, one of {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [14]:
import pandas as pd

# Function to extract the deck from the Cabin string
def extract_deck(cabin):
    if isinstance(cabin, str):
        return cabin.split('/')[0]
    return None  # Handle potential missing or malformed values

# Apply the function to the 'Cabin' column to create a new 'Deck' column
spaceship['Deck'] = spaceship['Cabin'].apply(extract_deck)

# Display the first few rows with the new 'Deck' column
print(spaceship[['Cabin', 'Deck']].head())

# Check the unique values in the 'Deck' column
print("\nUnique values in 'Deck' column:")
print(spaceship['Deck'].unique())

   Cabin Deck
0  B/0/P    B
1  F/0/S    F
2  A/0/S    A
3  A/0/S    A
4  F/1/S    F

Unique values in 'Deck' column:
['B' 'F' 'A' 'G' 'E' 'C' 'D' 'T']
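
The same extraction can also be written with pandas string methods instead of `apply`. The sketch below assumes every cabin follows the three-part pattern visible in the sample rows (for example `B/0/P`); the column names `Deck`, `Num`, and `Side` are labels chosen here for illustration:

```python
import pandas as pd

# A few cabin strings from the sample rows, plus one missing value
cabins = pd.Series(["B/0/P", "F/0/S", "A/0/S", None])

# Vectorized equivalent of extract_deck: keep the token before the first "/"
deck = cabins.str.split("/").str[0]

# Split into all three parts at once; missing cabins stay missing in every column
parts = cabins.str.split("/", expand=True)
parts.columns = ["Deck", "Num", "Side"]  # hypothetical column names

print(deck.tolist())
print(parts)
```

The `.str` accessor propagates missing values on its own, so no explicit `isinstance` check is needed.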


- Drop PassengerId and Name

In [15]:
# Drop the 'PassengerId' and 'Name' columns
spaceship_dropped = spaceship.drop(columns=['PassengerId', 'Name'])

# Display the first few rows of the DataFrame after dropping the columns
print(spaceship_dropped.head())

# Print the remaining columns to verify
print("\nRemaining Columns:")
print(spaceship_dropped.columns)

  HomePlanet CryoSleep  Cabin  Destination   Age    VIP  RoomService  \
0     Europa     False  B/0/P  TRAPPIST-1e  39.0  False          0.0   
1      Earth     False  F/0/S  TRAPPIST-1e  24.0  False        109.0   
2     Europa     False  A/0/S  TRAPPIST-1e  58.0   True         43.0   
3     Europa     False  A/0/S  TRAPPIST-1e  33.0  False          0.0   
4      Earth     False  F/1/S  TRAPPIST-1e  16.0  False        303.0   

   FoodCourt  ShoppingMall     Spa  VRDeck  Transported Deck  
0        0.0           0.0     0.0     0.0        False    B  
1        9.0          25.0   549.0    44.0         True    F  
2     3576.0           0.0  6715.0    49.0        False    A  
3     1283.0         371.0  3329.0   193.0        False    A  
4       70.0         151.0   565.0     2.0         True    F  

Remaining Columns:
Index(['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age', 'VIP',
       'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Transported', 'Deck'],
      dtype='object')
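
Note that `Cabin` is still among the remaining columns even though `Deck` was derived from it. If it is passed on to one-hot encoding, it produces one dummy column per distinct cabin string, which can inflate the feature matrix considerably. A minimal sketch on made-up rows, assuming the raw column is no longer needed once `Deck` exists:

```python
import pandas as pd

# Made-up rows: five distinct cabins spread over three decks
toy = pd.DataFrame({
    "Cabin": ["B/0/P", "F/0/S", "A/0/S", "A/1/S", "F/1/S"],
    "Deck": ["B", "F", "A", "A", "F"],
    "Age": [39.0, 24.0, 58.0, 33.0, 16.0],
})

# Encoding with Cabin kept: one dummy per distinct cabin (minus the dropped first level)
with_cabin = pd.get_dummies(toy, columns=["Cabin", "Deck"], drop_first=True)

# Encoding after dropping Cabin: only the coarse Deck dummies remain
without_cabin = pd.get_dummies(toy.drop(columns=["Cabin"]), columns=["Deck"], drop_first=True)

print(with_cabin.shape[1], without_cabin.shape[1])
```

On the toy data the column count falls from 7 to 3. On the real data the effect would be far larger, since most cabins are shared by only a few passengers.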

- For non-numerical columns, create dummy variables.

In [16]:
# Select non-numerical columns
categorical_cols = spaceship_dropped.select_dtypes(include=['object']).columns

print("Categorical Columns before One-Hot Encoding:")
print(categorical_cols)

# Perform one-hot encoding (creating dummy variables)
spaceship_dummies = pd.get_dummies(spaceship_dropped, columns=categorical_cols, drop_first=True)

# Display the first few rows of the DataFrame with dummy variables
print("\nDataFrame with Dummy Variables (first 5 rows):")
print(spaceship_dummies.head())

# Print the new columns after one-hot encoding
print("\nColumns after One-Hot Encoding:")
print(spaceship_dummies.columns)

Categorical Columns before One-Hot Encoding:
Index(['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP', 'Deck'], dtype='object')

DataFrame with Dummy Variables (first 5 rows):
    Age  RoomService  FoodCourt  ShoppingMall     Spa  VRDeck  Transported  \
0  39.0          0.0        0.0           0.0     0.0     0.0        False   
1  24.0        109.0        9.0          25.0   549.0    44.0         True   
2  58.0         43.0     3576.0           0.0  6715.0    49.0        False   
3  33.0          0.0     1283.0         371.0  3329.0   193.0        False   
4  16.0        303.0       70.0         151.0   565.0     2.0         True   

   HomePlanet_Europa  HomePlanet_Mars  CryoSleep_True  ...  \
0               True            False           False  ...   
1              False            False           False  ...   
2               True            False           False  ...   
3               True            False           False  ...   
4              False            False 
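
The output above shows `HomePlanet_Europa` and `HomePlanet_Mars` but no `HomePlanet_Earth`. This is the effect of `drop_first=True`, which removes the first level of each encoded column because it is implied when all the other dummies are False. A small sketch on made-up values:

```python
import pandas as pd

# Made-up column with the three planet names seen in the output above
planets = pd.DataFrame({"HomePlanet": ["Europa", "Earth", "Mars", "Earth"]})

# Full encoding: one column per level
full = pd.get_dummies(planets, columns=["HomePlanet"])

# Reduced encoding: the first level (alphabetically, Earth) is dropped
reduced = pd.get_dummies(planets, columns=["HomePlanet"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

A row where both remaining dummies are False therefore stands for Earth.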

**Perform Train Test Split**

In [17]:
from sklearn.model_selection import train_test_split

# Separate features (X) and target (y) from the dataframe with dummies
X_processed = spaceship_dummies.drop(columns=['Transported'])
y = spaceship_dummies['Transported'].astype(int) # Ensure target is numerical

# Split the processed data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (5284, 5323)
X_test shape: (1322, 5323)
y_train shape: (5284,)
y_test shape: (1322,)
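
The split above relies on random sampling only. If the class balance of `Transported` should be preserved in both subsets, `train_test_split` accepts a `stratify` argument. A small sketch on synthetic arrays (not the lab data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples, perfectly balanced binary target
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 10 + [1] * 10)

# stratify=y_demo keeps the 50/50 class ratio in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Test size:", X_te.shape[0])
print("Positives in test:", int(y_te.sum()))
```

Stratification matters most when classes are imbalanced; with a roughly balanced target the plain random split is usually close enough.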


**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [19]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the training features and transform them
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test features using the fitted scaler
X_test_scaled = scaler.transform(X_test)

# Initialize the KNN Classifier (you can start with a default number of neighbors, e.g., 5)
knn_classifier = KNeighborsClassifier(n_neighbors=5)

# Fit the KNN classifier to the scaled training data
knn_classifier.fit(X_train_scaled, y_train)

print("KNN Classifier initialized and fitted to the scaled training data.")

KNN Classifier initialized and fitted to the scaled training data.
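
Scaling and classification can also be bundled into a single scikit-learn pipeline, so the scaler is always fitted on the training data only. The sketch below uses synthetic data from `make_classification` rather than the lab's DataFrame, and the name `knn_pipeline` is chosen here for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed features and target
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on X_tr, then fits KNN on the scaled result
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipeline.fit(X_tr, y_tr)

# score() scales X_te with the already-fitted scaler before predicting
demo_accuracy = knn_pipeline.score(X_te, y_te)
print(f"Accuracy on the synthetic test split: {demo_accuracy:.4f}")
```

Scaling matters for KNN because the algorithm is distance-based: an unscaled column with large values, such as the spending columns, would otherwise dominate the distance.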


- Evaluate your model's performance and comment on it.
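
As a quick reference for the metrics used in this step, the sketch below derives accuracy, precision, recall, and F1 by hand from a confusion matrix and compares them with scikit-learn's functions. The labels are made up and unrelated to the lab's predictions:

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up true labels and predictions (1 = transported, 0 = not transported)
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0, 1, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
manual_precision = tp / (tp + fp)   # share of predicted positives that are correct
manual_recall = tp / (tp + fn)      # share of actual positives that are found
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

print(manual_accuracy, accuracy_score(y_true_demo, y_pred_demo))
print(manual_precision, precision_score(y_true_demo, y_pred_demo))
print(manual_recall, recall_score(y_true_demo, y_pred_demo))
print(manual_f1, f1_score(y_true_demo, y_pred_demo))
```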

In [20]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


# Make predictions on the scaled test data
y_pred = knn_classifier.predict(X_test_scaled)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the evaluation metrics
print("Evaluation Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

print("\nComments on Model Performance:")
print("- **Accuracy:** This metric tells us the overall percentage of passengers whose transportation status was correctly predicted by the model on the test set.")
print(f"  With an accuracy of {accuracy:.4f}, the model correctly predicted approximately {accuracy * 100:.2f}% of the cases.")
print("- **Precision:** Precision measures out of all the passengers the model predicted as 'Transported', what proportion were actually transported. It's about the model's ability to avoid false positives.")
print(f"  A precision of {precision:.4f} indicates that when the model predicted a passenger would be transported, it was correct about {precision * 100:.2f}% of the time.")
print("- **Recall:** Recall measures out of all the passengers who were actually 'Transported', what proportion did the model correctly identify. It's about the model's ability to avoid false negatives.")
print(f"  A recall of {recall:.4f} suggests that the model correctly identified {recall * 100:.2f}% of all the passengers who were actually transported.")
print("- **F1-score:** The F1-score is the harmonic mean of precision and recall. It provides a single score that balances both concerns. It's particularly useful when you have an uneven class distribution.")
print(f"  An F1-score of {f1:.4f} gives a balanced view of the model's performance in terms of both precision and recall.")

print("\nFurther Considerations:")
print("- The specific values of these metrics should be interpreted in the context of the problem and the importance of avoiding false positives versus false negatives.")
print("- Depending on the goals of predicting passenger transportation, you might prioritize one metric over others.")
print("- This evaluation is based on the default `n_neighbors` value (5). Tuning this and other hyperparameters could potentially improve the model's performance.")

Evaluation Metrics:
Accuracy: 0.6369
Precision: 0.6056
Recall: 0.8102
F1-score: 0.6931

Comments on Model Performance:
- **Accuracy:** This metric tells us the overall percentage of passengers whose transportation status was correctly predicted by the model on the test set.
  With an accuracy of 0.6369, the model correctly predicted approximately 63.69% of the cases.
- **Precision:** Precision measures out of all the passengers the model predicted as 'Transported', what proportion were actually transported. It's about the model's ability to avoid false positives.
  A precision of 0.6056 indicates that when the model predicted a passenger would be transported, it was correct about 60.56% of the time.
- **Recall:** Recall measures out of all the passengers who were actually 'Transported', what proportion did the model correctly identify. It's about the model's ability to avoid false negatives.
  A recall of 0.8102 suggests that the model correctly identified 81.02% of all the passengers 