# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
# Libraries
import pandas as pd
import numpy as np

from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn import tree


In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now apply the same steps as before:
- Feature Scaling
- Feature Selection


In [3]:
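# Keep only the first character of Cabin (the deck letter, e.g. "B" from "B/0/P").
# Missing values become the string "nan", so they end up as the letter "n".
spaceship["Cabin"] = spaceship["Cabin"].apply(lambda x: str(x).split()[0][0])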
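
The lambda in the cell above is compact, so here is a minimal standalone sketch of what it does to a single value. The helper name `deck_letter` is hypothetical and only mirrors the notebook's expression; the two inputs are a Cabin value from the preview table and a missing value.

```python
def deck_letter(cabin):
    """Mirror the notebook's lambda: first character of the first whitespace-separated token."""
    return str(cabin).split()[0][0]

# A regular Cabin value keeps its leading deck letter.
print(deck_letter("B/0/P"))

# A missing value is first turned into the string "nan", so only "n" survives.
print(deck_letter(float("nan")))
```

This is why a `Cabin_n` column shows up among the encoded features: it marks the rows whose Cabin was missing.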

In [4]:
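# One-hot encode the categorical columns; drop="first" removes one level per feature.
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop="first")

# Categorical (object-dtype) columns, excluding the identifier columns.
non_num = spaceship.select_dtypes(include="object").drop(columns=["Name", "PassengerId"]).columns.tolist()
encoded = enc.fit_transform(spaceship[non_num])
encoded_df = pd.DataFrame(encoded, columns=enc.get_feature_names_out(non_num))
# Swap the raw categorical columns for their encoded versions.
spaceship = spaceship.drop(columns=non_num)
spaceship = pd.concat([spaceship, encoded_df], axis=1)
# Drop the identifier columns.
spaceship = spaceship.drop(columns=["PassengerId", "Name"])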

In [5]:
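# Impute the six numeric columns (Age through VRDeck) with their column mean.
# Assigning the result back avoids the chained-assignment pitfall of inplace=True.
for col in spaceship.columns[:6]:
    spaceship[col] = spaceship[col].fillna(spaceship[col].mean())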

In [6]:
features = spaceship.drop(columns=["Transported"])
target = spaceship["Transported"]

**Perform Train Test Split**

In [7]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=42)

In [8]:
normalizer = MinMaxScaler()

# Fit on the training split only, then apply the same scaling to both splits.
normalizer.fit(X_train)
X_train_norm = pd.DataFrame(normalizer.transform(X_train), columns=X_train.columns)
X_test_norm = pd.DataFrame(normalizer.transform(X_test), columns=X_test.columns)
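
To make the scaling step concrete, here is a minimal sketch of the min-max formula, (x - min) / (max - min), with the minimum and maximum taken from the training data only. The toy arrays and the helper `min_max` are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical toy values standing in for one numeric column.
train = np.array([10.0, 20.0, 40.0])
test = np.array([5.0, 30.0, 50.0])

# "fit": learn the minimum and maximum from the training values only.
col_min, col_max = train.min(), train.max()

def min_max(x):
    """Rescale x using the training minimum and maximum."""
    return (x - col_min) / (col_max - col_min)

train_scaled = min_max(train)  # lies in [0, 1] by construction
test_scaled = min_max(test)    # may fall outside [0, 1]

print(train_scaled)
print(test_scaled)
```

Because the test values are scaled with the training statistics, they are not guaranteed to stay inside [0, 1]. That is expected and keeps information about the test split out of the fitted scaler.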


In [9]:
X_train_norm.isna().sum()

Age                          0
RoomService                  0
FoodCourt                    0
ShoppingMall                 0
Spa                          0
VRDeck                       0
HomePlanet_Europa            0
HomePlanet_Mars              0
HomePlanet_nan               0
CryoSleep_True               0
CryoSleep_nan                0
Cabin_B                      0
Cabin_C                      0
Cabin_D                      0
Cabin_E                      0
Cabin_F                      0
Cabin_G                      0
Cabin_T                      0
Cabin_n                      0
Destination_PSO J318.5-22    0
Destination_TRAPPIST-1e      0
Destination_nan              0
VIP_True                     0
VIP_nan                      0
dtype: int64

**Model Selection** - now try different ensemble methods to see whether you can get a better model

- Bagging and Pasting

In [10]:
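# NOTE: BaggingRegressor is the regression variant. Its predictions are averaged
# tree outputs and its .score() returns R^2, not accuracy, so the value printed
# below is not comparable with the classifier accuracies further down.
# BaggingClassifier is the classification counterpart.
bagreg = BaggingRegressor(tree.DecisionTreeClassifier(max_depth=20),
                          n_estimators=100,
                          max_samples=1000,
                          random_state=42)

bagreg.fit(X_train_norm, y_train)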

In [11]:
bagreg.score(X_test_norm, y_test)

0.42051460425050047
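
The score above comes from `BaggingRegressor`, whose `.score()` reports R² and not accuracy. The following is a minimal sketch of the classification counterpart, `BaggingClassifier`, showing both bagging (`bootstrap=True`, sampling with replacement) and pasting (`bootstrap=False`, sampling without replacement). It runs on synthetic data from `make_classification`, which is an assumption made to keep the example self-contained, so its scores say nothing about the Spaceship Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Spaceship Titanic features).
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

def fit_and_score(bootstrap):
    """Fit a bagged tree ensemble and return its test accuracy."""
    clf = BaggingClassifier(
        DecisionTreeClassifier(max_depth=20),
        n_estimators=100,
        max_samples=1000,      # each tree sees 1000 training rows
        bootstrap=bootstrap,   # True: bagging, False: pasting
        random_state=42,
    )
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)  # accuracy for classifiers

bagging_acc = fit_and_score(bootstrap=True)
pasting_acc = fit_and_score(bootstrap=False)
print("bagging accuracy:", bagging_acc)
print("pasting accuracy:", pasting_acc)
```

In the notebook, the same constructor arguments could be used with `X_train_norm` and `y_train` to get an accuracy that is comparable with the other models.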

- Random Forests

In [12]:
ranfor = RandomForestClassifier(max_depth=20,
                               random_state=42)

ranfor.fit(X_train_norm, y_train)

In [13]:
ranfor.score(X_test_norm, y_test)

0.7780333525014376

- Gradient Boosting

In [14]:
grabo = GradientBoostingClassifier(max_depth=20,
                                  random_state=42)

grabo.fit(X_train_norm, y_train)

In [15]:
grabo.score(X_test_norm, y_test)

0.7607820586543991

- Adaptive Boosting

In [17]:
adabo = AdaBoostClassifier(tree.DecisionTreeClassifier(max_depth=20),
                          random_state=42)
adabo.fit(X_train_norm, y_train)

In [18]:
adabo.score(X_test_norm, y_test)

0.7630822311673375

Which model is the best and why?

In [20]:
import lightgbm as lgb

In [21]:
model = lgb.LGBMClassifier(learning_rate=0.09, max_depth=15, random_state=42)
model.fit(X_train_norm, y_train,
          eval_set=[(X_test_norm, y_test), (X_train_norm, y_train)],
          eval_metric='logloss')

[LightGBM] [Info] Number of positive: 3500, number of negative: 3454
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000839 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1388
[LightGBM] [Info] Number of data points in the train set: 6954, number of used features: 23
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.503307 -> initscore=0.013230
[LightGBM] [Info] Start training from score 0.013230


In [22]:
model.score(X_test_norm, y_test)



0.7935595169637722

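LightGBM gave the highest test accuracy (0.794). Among the scikit-learn ensembles, RandomForestClassifier worked best (0.778), ahead of AdaBoostClassifier (0.763) and GradientBoostingClassifier (0.761). The bagging score (0.42) is an R² value from BaggingRegressor, so it cannot be compared with these accuracies.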
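
As a small aid for the comparison, the sketch below ranks the test accuracies reported in the cell outputs of this notebook. The numbers are copied from those outputs; the dictionary and variable names are hypothetical.

```python
# Test-set accuracies copied from the notebook's cell outputs.
# The bagging cell is left out: its 0.42 is an R^2 value, not an accuracy.
scores = {
    "RandomForestClassifier": 0.7780333525014376,
    "GradientBoostingClassifier": 0.7607820586543991,
    "AdaBoostClassifier": 0.7630822311673375,
    "LGBMClassifier": 0.7935595169637722,
}

# Sort from highest to lowest accuracy.
ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranking:
    print(f"{name:<28} {acc:.4f}")

best_name, best_acc = ranking[0]
gap = best_acc - ranking[1][1]  # margin between the top two models
print(f"best: {best_name}, margin over runner-up: {gap:.4f}")
```

The margin between the top two models is about 1.6 percentage points on a single train/test split, so the ranking should be read as indicative only.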