# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [41]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [42]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [43]:
print(spaceship.shape)
print(spaceship.dtypes)
print(spaceship.isnull().sum())

spaceship = spaceship.dropna()

(8693, 14)
PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object
PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64


Now apply the same preprocessing as before:
- Feature Scaling
- Feature Selection


In [44]:
print(spaceship["Cabin"]) 
spaceship['Cabin'] = spaceship['Cabin'].str[0] #keeping only the first letter of the cabin
print(spaceship["Cabin"])

0          B/0/P
1          F/0/S
2          A/0/S
3          A/0/S
4          F/1/S
          ...   
8688      A/98/P
8689    G/1499/S
8690    G/1500/S
8691     E/608/S
8692     E/608/S
Name: Cabin, Length: 6606, dtype: object
0       B
1       F
2       A
3       A
4       F
       ..
8688    A
8689    G
8690    G
8691    E
8692    E
Name: Cabin, Length: 6606, dtype: object


In [45]:
spaceship = spaceship.drop(columns=["PassengerId", "Name"]) # dropping identifier columns that carry no predictive signal

In [46]:
# Separate features and target
X = spaceship.drop(columns=['Transported'])
y = spaceship['Transported']


In [47]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [48]:
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# One-hot encode categorical features 
categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns

oh_encoder = OneHotEncoder(drop='first', sparse_output=False)

X_train_cat = oh_encoder.fit_transform(X_train[categorical_cols])
X_test_cat = oh_encoder.transform(X_test[categorical_cols])

X_train_cat = pd.DataFrame(
    X_train_cat,
    columns=oh_encoder.get_feature_names_out(categorical_cols),
    index=X_train.index
)
X_test_cat = pd.DataFrame(
    X_test_cat,
    columns=oh_encoder.get_feature_names_out(categorical_cols),
    index=X_test.index
)

# Scale numerical features
numeric_cols = X_train.drop(categorical_cols, axis=1).columns

scaler = MinMaxScaler()
X_train_num = pd.DataFrame(
    scaler.fit_transform(X_train[numeric_cols]),
    columns=numeric_cols,
    index=X_train.index
)
X_test_num = pd.DataFrame(
    scaler.transform(X_test[numeric_cols]),
    columns=numeric_cols,
    index=X_test.index
)


X_train_t = pd.concat([X_train_num, X_train_cat], axis=1)
X_test_t = pd.concat([X_test_num, X_test_cat], axis=1)



**Perform Train Test Split**

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bg_reg = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=10), n_estimators=100)
bg_reg.fit(X_train_t, y_train)

print(bg_reg.score(X_train_t, y_train)) # train accuracy, about 0.86 (score() on a classifier is accuracy, not R2)
print(bg_reg.score(X_test_t, y_test)) # test accuracy, about 0.80

0.8590083270249811
0.8010590015128594


- Random Forests

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_boot = RandomForestClassifier(n_estimators=100, max_depth=10, bootstrap=True)
rf_boot.fit(X_train_t, y_train)

print(rf_boot.score(X_train_t, y_train)) # train accuracy, about 0.86
print(rf_boot.score(X_test_t, y_test)) # test accuracy, about 0.82

0.8599545798637396
0.8169440242057489


In [None]:
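# Note: with bootstrap=False, RandomForestClassifier trains every tree on the full training set
# (only the per-split feature subsets vary), so this is not pasting in the strict sense.
rf_pasting = RandomForestClassifier(n_estimators=100, max_depth=10, bootstrap=False)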
rf_pasting.fit(X_train_t, y_train)

print(rf_pasting.score(X_train_t, y_train)) # train accuracy, about 0.87
print(rf_pasting.score(X_test_t, y_test)) # test accuracy, about 0.81

0.8711203633610901
0.8116490166414524


- Gradient Boosting

In [None]:

from sklearn.ensemble import GradientBoostingClassifier


gb = GradientBoostingClassifier(max_depth=10, n_estimators=100)
gb.fit(X_train_t, y_train)

print(gb.score(X_train_t, y_train)) # train accuracy, about 0.94
print(gb.score(X_test_t, y_test)) # test accuracy, about 0.81



0.9383043149129447
0.81089258698941


- Adaptive Boosting

In [None]:
from sklearn.ensemble import AdaBoostClassifier

ada_boost = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=10), n_estimators=100)
ada_boost.fit(X_train_t, y_train)

print(ada_boost.score(X_train_t, y_train)) # train accuracy, about 0.94
print(ada_boost.score(X_test_t, y_test)) # test accuracy, about 0.79


0.9398183194549584
0.7927382753403933


Which model is the best and why?

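Even though all the models overfit to some degree, the Random Forest with bootstrap=True (rf_boot) is the best model: it has the highest test accuracy (about 0.82) and the smallest gap between train and test accuracy (about 0.04). The boosting models fit the training set much more closely (about 0.94) but do not generalize better. Note that score() on a classifier returns accuracy, not R2, and that no random_state is set on the models, so these figures will vary slightly between runs.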
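
To make the comparison concrete, the train-test gap can be computed directly from the scores printed above. This is a small sketch: the numbers are copied from the cell outputs and rounded to four decimals, the labels in `scores` are informal names chosen here, and the values are accuracies, since that is what `score` returns for a classifier.

```python
# (train accuracy, test accuracy) copied from the outputs above, rounded to 4 decimals.
scores = {
    "Bagging (decision trees)": (0.8590, 0.8011),
    "Random Forest (bootstrap=True)": (0.8600, 0.8169),
    "Random Forest (bootstrap=False)": (0.8711, 0.8116),
    "Gradient Boosting": (0.9383, 0.8109),
    "AdaBoost": (0.9398, 0.7927),
}

# Overfitting gap = train accuracy minus test accuracy.
gaps = {name: round(train - test, 4) for name, (train, test) in scores.items()}

best_test = max(scores, key=lambda name: scores[name][1])
smallest_gap = min(gaps, key=gaps.get)

for name, (train, test) in scores.items():
    print(f"{name:32s} train={train:.4f} test={test:.4f} gap={gaps[name]:.4f}")
print("Highest test accuracy:", best_test)
print("Smallest train-test gap:", smallest_gap)
```

Both criteria point to the same model here, but the margins between the bagging-style models are under two percentage points on a single split, so they should be read with some caution.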