# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, you should apply the same feature scaling and feature engineering as in the previous Lab.

In [3]:
import pandas as pd
import numpy as np


from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import warnings
import matplotlib.pyplot as plt
# Suppress future warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


In [4]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [6]:
spaceship.isna().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [7]:
#your code here
def fill_missing_values(spaceship):
    for column in spaceship.columns:
        # 'Name' is handled separately below
        if column in ["Name"]:
            continue
        if spaceship[column].dtype == 'float64':
            # Numeric columns: fill with the column mean
            spaceship[column] = spaceship[column].fillna(spaceship[column].mean())
        elif spaceship[column].dtype == 'object':
            # Categorical columns: fill with the most frequent value
            spaceship[column] = spaceship[column].fillna(spaceship[column].mode()[0])

    # Special case for 'Name' column: use a placeholder name
    spaceship['Name'] = spaceship['Name'].fillna('J. D.')

    return spaceship

# Apply the function to DataFrame
spaceship = fill_missing_values(spaceship)

# Verify that there are no remaining missing values
print(spaceship.isnull().sum())

# Output the DataFrame to ensure correct replacements
spaceship.head()

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
Transported     0
dtype: int64


Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True
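
The fill rule above can be checked on a tiny example. The sketch below uses a made-up three-row frame (the values are hypothetical; only the column names come from the lab data) and applies the same mean/mode logic.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column names mirror the lab data, values are made up.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "HomePlanet": ["Earth", "Earth", np.nan],
})

# Numeric column: fill with the column mean, here (20 + 40) / 2 = 30.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Categorical column: fill with the most frequent value, here "Earth".
toy["HomePlanet"] = toy["HomePlanet"].fillna(toy["HomePlanet"].mode()[0])

print(toy)
```

Note that in this Lab the means and modes are computed on the full dataset before the train/test split, so the test rows contribute to the fill values.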


In [8]:
#your code here
spaceship['Cabin'] = spaceship['Cabin'].str[0]
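
Taking the first character of `Cabin` keeps only the deck letter. As a sketch, assuming every cabin follows the `deck/num/side` pattern visible in the rows above, splitting on `/` gives the same deck while keeping the other two parts available. The `cabins` sample and the column names `deck`, `num` and `side` are illustrative, not part of the lab.

```python
import pandas as pd

# Sample values copied from the first rows of the dataset.
cabins = pd.Series(["B/0/P", "F/1/S", "A/0/S"])

# Approach used in this Lab: first character only.
deck_by_index = cabins.str[0]

# Alternative: split into three parts (assumes the deck/num/side layout).
parts = cabins.str.split("/", expand=True)
parts.columns = ["deck", "num", "side"]

print(deck_by_index.tolist())
print(parts)
```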

In [13]:
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8693 non-null   object 
 2   CryoSleep     8693 non-null   bool   
 3   Cabin         8693 non-null   object 
 4   Destination   8693 non-null   object 
 5   Age           8693 non-null   float64
 6   VIP           8693 non-null   bool   
 7   RoomService   8693 non-null   float64
 8   FoodCourt     8693 non-null   float64
 9   ShoppingMall  8693 non-null   float64
 10  Spa           8693 non-null   float64
 11  VRDeck        8693 non-null   float64
 12  Name          8693 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(3), float64(6), object(5)
memory usage: 772.6+ KB


In [14]:
#Drop the 'PassengerId' and 'Name' columns
spaceship.drop(columns=['PassengerId', 'Name'], inplace=True)


In [15]:
#your code here
df_dummies = pd.get_dummies(spaceship, columns=spaceship.select_dtypes(include=['object']).columns, drop_first=True)
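
To see what `drop_first=True` does, here is a small sketch on a made-up frame (the rows are hypothetical). Each categorical column becomes one indicator column per category, except the first category in sorted order, which is dropped and is represented by all indicators being false.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars", "Earth"],
    "Age": [24.0, 39.0, 31.0, 16.0],
})

encoded = pd.get_dummies(toy, columns=["HomePlanet"], drop_first=True)

# "Earth" is the first category alphabetically, so it gets no column.
print(encoded.columns.tolist())
print(encoded)
```

This is why the `info()` output of the encoded data lists `HomePlanet_Europa` and `HomePlanet_Mars` but no `HomePlanet_Earth`.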

In [16]:
df_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 20 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   CryoSleep                  8693 non-null   bool   
 1   Age                        8693 non-null   float64
 2   VIP                        8693 non-null   bool   
 3   RoomService                8693 non-null   float64
 4   FoodCourt                  8693 non-null   float64
 5   ShoppingMall               8693 non-null   float64
 6   Spa                        8693 non-null   float64
 7   VRDeck                     8693 non-null   float64
 8   Transported                8693 non-null   bool   
 9   HomePlanet_Europa          8693 non-null   bool   
 10  HomePlanet_Mars            8693 non-null   bool   
 11  Cabin_B                    8693 non-null   bool   
 12  Cabin_C                    8693 non-null   bool   
 13  Cabin_D                    8693 non-null   bool 

Now perform the same as before:
- Feature Scaling
- Feature Selection


In [19]:
#your code here
numerical_spaceship = df_dummies.select_dtypes(include=['float64','bool'])
numerical_spaceship

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Europa,HomePlanet_Mars,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,False,39.0,False,0.0,0.0,0.0,0.0,0.0,False,True,False,True,False,False,False,False,False,False,False,True
1,False,24.0,False,109.0,9.0,25.0,549.0,44.0,True,False,False,False,False,False,False,True,False,False,False,True
2,False,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,True,False,False,False,False,False,False,False,False,False,True
3,False,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,True,False,False,False,False,False,False,False,False,False,True
4,False,16.0,False,303.0,70.0,151.0,565.0,2.0,True,False,False,False,False,False,False,True,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,False,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False,True,False,False,False,False,False,False,False,False,False,False
8689,True,18.0,False,0.0,0.0,0.0,0.0,0.0,False,False,False,False,False,False,False,False,True,False,True,False
8690,False,26.0,False,0.0,0.0,1872.0,1.0,0.0,True,False,False,False,False,False,False,False,True,False,False,True
8691,False,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False,True,False,False,False,False,True,False,False,False,False,False


In [20]:
#your code here
features = numerical_spaceship.drop(columns = ["Transported"])
target = numerical_spaceship["Transported"]

**Perform Train Test Split**

In [21]:
#your code here
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)
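
A quick sketch of what the split does, on synthetic arrays rather than the lab data. The `stratify=y` argument is an optional addition that is not used in this Lab; it keeps the class proportions equal in both parts, which can help when the target is a label such as `Transported`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 rows, 2 features, perfectly balanced 0/1 target.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Same test_size and random_state as in the Lab; stratify is an extra.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

print(X_tr.shape, X_te.shape)            # 16 training rows, 4 test rows
print(int(y_tr.sum()), int(y_te.sum()))  # positives in each part
```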

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

# **Normalizing data**

In [22]:
normalizer = MinMaxScaler()

normalizer.fit(X_train)

In [23]:
X_train_norm = normalizer.transform(X_train)

X_test_norm = normalizer.transform(X_test)
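
`MinMaxScaler` rescales each feature with `(x - min) / (max - min)`, where `min` and `max` come from the data passed to `fit`. A minimal sketch with made-up numbers shows why the scaler is fitted on the training set only: test values outside the training range land outside `[0, 1]`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data.
train = np.array([[0.0], [50.0], [100.0]])
test = np.array([[25.0], [150.0]])

scaler = MinMaxScaler().fit(train)  # learns min=0 and max=100 from train only

train_scaled = scaler.transform(train).ravel()
test_scaled = scaler.transform(test).ravel()

print(train_scaled)  # 0.0, 0.5, 1.0
print(test_scaled)   # 0.25, 1.5 (150 is above the training maximum)
```

Tree-based models split on thresholds, so rescaling features like this generally does not change their predictions; it is kept here for a fair comparison with the previous Lab.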

# **Starting the tests**

## Bagging and Pasting

In [53]:
bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                               n_estimators=100,
                               max_samples = 1000)

Training Bagging model with our normalized data

In [54]:
bagging_reg.fit(X_train_norm, y_train)

Evaluate model's performance

In [55]:
pred = bagging_reg.predict(X_test_norm)

print("Bagging and Pasting MAE", mean_absolute_error(y_test, pred))
print("Bagging and Pasting RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("Bagging and Pasting R2 score", bagging_reg.score(X_test_norm, y_test))

Bagging and Pasting MAE 0.28502580408897515
Bagging and Pasting RMSE 0.37564094312630664
Bagging and Pasting R2 score 0.4355439833245667


## Random Forests

In [49]:
forest = RandomForestRegressor(n_estimators=100,
                             max_depth=20)

- Training the model

In [50]:
forest.fit(X_train_norm, y_train)

- Evaluate model's performance

In [51]:
pred = forest.predict(X_test_norm)

print("Random Forests MAE", mean_absolute_error(y_test, pred))
print("Random Forests RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("Random Forests R2 score", forest.score(X_test_norm, y_test))

Random Forests MAE 0.27338195042315583
Random Forests RMSE 0.38078751393533594
Random Forests R2 score 0.41997106258242023


## Gradient Boosting

- Initialize a Gradient Boosting model

In [32]:
gb_reg = GradientBoostingRegressor(max_depth=20,
                                   n_estimators=100)

- Training the model

In [33]:
gb_reg.fit(X_train_norm, y_train)

Evaluate model's performance

In [36]:
pred = gb_reg.predict(X_test_norm)

print("Gradient Boosting MAE", mean_absolute_error(y_test, pred))
print("Gradient Boosting RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("Gradient Boosting R2 score", gb_reg.score(X_test_norm, y_test))

Gradient Boosting MAE 0.2661578334011716
Gradient Boosting RMSE 0.42718627195503867
Gradient Boosting R2 score 0.27000676126515033


## Adaptive Boosting

- Initialize an AdaBoost model

In [29]:
ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=20),
                            n_estimators=100)

- Training the model

In [30]:
ada_reg.fit(X_train_norm, y_train)

Evaluate model's performance

In [37]:
pred = ada_reg.predict(X_test_norm)

print("Adaptive Boosting MAE", mean_absolute_error(y_test, pred))
print("Adaptive Boosting RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("Adaptive Boosting R2 score", ada_reg.score(X_test_norm, y_test))

Adaptive Boosting MAE 0.26345880687749407
Adaptive Boosting RMSE 0.4338803391356253
Adaptive Boosting R2 score 0.24694932175662898
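
To see how the number of boosting rounds affects the error, `staged_predict` yields the ensemble prediction after each fitted estimator. This is a sketch on synthetic data with shallower, hypothetical settings; in practice the stage should be chosen on a validation split rather than on the test set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, not the lab data.
X, y = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

ada = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3),
                        n_estimators=50, random_state=0)
ada.fit(X_tr, y_tr)

# One RMSE value per boosting stage (1 estimator, 2 estimators, ...).
rmse_per_stage = [
    np.sqrt(mean_squared_error(y_te, stage_pred))
    for stage_pred in ada.staged_predict(X_te)
]
best_stage = int(np.argmin(rmse_per_stage)) + 1

print("estimators fitted:", len(rmse_per_stage))
print("lowest RMSE at stage:", best_stage)
```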


Which model is the best and why?

Bagging and Pasting MAE 0.2843287072744124

Bagging and Pasting RMSE 0.3738103315001174

Bagging and Pasting R2 score 0.44103210688927774

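Bagging and Pasting gives the best results overall: it has the lowest RMSE and the highest R2 score, even though the boosting models reach a slightly lower MAE. The gap might be because we evaluate on a single train/test split rather than on all the data, or because some hyperparameters are poorly chosen.
We'll be able to account for both with what we learn in the next class.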
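
As a preview of how both concerns could be addressed, here is a minimal sketch using cross-validation and a small grid search. It runs on synthetic data, and the parameter grid is a hypothetical example rather than a recommendation for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic 0/1 target standing in for Transported.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# 5-fold cross-validation: every row is used for testing exactly once.
cv_scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, max_depth=20, random_state=0),
    X, y, cv=5, scoring="r2",
)
print("R2 per fold:", cv_scores)
print("mean R2:", cv_scores.mean())

# Grid search: try each hyperparameter combination with cross-validation.
param_grid = {"max_depth": [3, 10, 20], "n_estimators": [50, 100]}
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=5, scoring="r2")
search.fit(X, y)

print("best params:", search.best_params_)
print("best mean R2:", search.best_score_)
```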