# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [11]:
#Libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Only classification metrics are needed in this Lab
from sklearn.metrics import precision_score, accuracy_score, recall_score


In [3]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [4]:
#your code here

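# Drop rows with missing values
spaceship.dropna(inplace=True)
# Keep only the first part of the cabin code (e.g. "B/0/P" -> "B")
spaceship["Cabin"] = spaceship["Cabin"].apply(lambda x: x.split("/")[0])
# Drop the identifier columns
spaceship.drop(columns=["PassengerId", "Name"], inplace=True)

# One-hot encode the categorical columns; keep a single dummy for each True/False column
spaceship2 = spaceship.copy()
spacedummies = pd.get_dummies(spaceship2, columns=['HomePlanet', 'Cabin', 'Destination'], drop_first=False)
spacedummies = pd.get_dummies(spacedummies, columns=['CryoSleep', 'VIP', 'Transported'], drop_first=True)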

**Perform Train Test Split**

In [5]:
#your code here
target = spacedummies["Transported_True"]
features = spacedummies.drop("Transported_True", axis=1)
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

In [8]:
# With normalization (MinMaxScaler is already imported in the Libraries cell)

binary_features = ['HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars', 
                  'Cabin_A', 'Cabin_B', 'Cabin_C', 'Cabin_D', 'Cabin_E', 
                  'Cabin_F', 'Cabin_G', 'Cabin_T', 
                  'Destination_55 Cancri e', 'Destination_PSO J318.5-22', 
                  'Destination_TRAPPIST-1e', 'CryoSleep_True', 'VIP_True']

numerical_features = X_train.columns.difference(binary_features) # Identify numerical features

# Normalize only the numerical features
normalizer = MinMaxScaler()
normalizer.fit(X_train[numerical_features])

# Transform the numerical features
X_train_norm = normalizer.transform(X_train[numerical_features])
X_test_norm = normalizer.transform(X_test[numerical_features])

# Convert to DataFrame, keeping the original indices
X_train_norm = pd.DataFrame(X_train_norm, columns=numerical_features, index=X_train.index)
X_test_norm = pd.DataFrame(X_test_norm, columns=numerical_features, index=X_test.index)

# Combine normalized numerical features with unchanged binary features
X_train_norm = pd.concat([X_train_norm, X_train[binary_features]], axis=1)
X_test_norm = pd.concat([X_test_norm, X_test[binary_features]], axis=1)
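
The same "scale only the numerical columns" step can also be written with scikit-learn's `ColumnTransformer`. The sketch below is an alternative, not the approach used in this notebook, and it runs on small made-up frames (`toy_train`, `toy_test`) rather than on the real split.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frames standing in for X_train / X_test
toy_train = pd.DataFrame({"Age": [10.0, 30.0, 50.0],
                          "Spa": [0.0, 100.0, 400.0],
                          "VIP_True": [0, 1, 0]})
toy_test = pd.DataFrame({"Age": [20.0], "Spa": [200.0], "VIP_True": [1]})

numeric_cols = ["Age", "Spa"]

# Scale the numeric columns, pass every other column through unchanged
scaler = ColumnTransformer([("num", MinMaxScaler(), numeric_cols)],
                           remainder="passthrough")

# Fit on the training data only, then reuse the fitted ranges on the test data
toy_train_scaled = scaler.fit_transform(toy_train)
toy_test_scaled = scaler.transform(toy_test)

print(toy_train_scaled)
print(toy_test_scaled)
```

The result is a NumPy array with the scaled columns first and the passthrough columns after them, so it would still need to be wrapped in a DataFrame if column names are required later.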

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

Previous performance:

- Accuracy: 0.7844175491679274
- Precision: 0.7993630573248408
- Recall: 0.7594553706505295
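
As a reminder of what these three scores measure, the sketch below computes them by hand from confusion-matrix counts and compares them with scikit-learn's functions. The labels are made up for illustration and are not taken from the Spaceship Titanic data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels for illustration only (1 = transported, 0 = not transported)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_hat = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # true negatives

manual_accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
manual_precision = tp / (tp + fp)                  # share of predicted positives that are real
manual_recall = tp / (tp + fn)                     # share of real positives that were found

print("Accuracy:", manual_accuracy, accuracy_score(y_true, y_hat))
print("Precision:", manual_precision, precision_score(y_true, y_hat))
print("Recall:", manual_recall, recall_score(y_true, y_hat))
```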


- Bagging and Pasting

In [97]:
#your code here -> bagging
# Bagging: each tree is trained on 1000 rows drawn with replacement (bootstrap=True is the default)
bagging_clf = BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                                n_estimators=200,
                                max_samples=1000,
                                n_jobs=-1
                                )

bagging_clf.fit(X_train_norm, y_train)

pred = bagging_clf.predict(X_test_norm)

precision = precision_score(y_test, pred)
accuracy = accuracy_score(y_test, pred)
recall = recall_score(y_test, pred)

print("Accuracy bag:", accuracy)
print("Precision bag:", precision)
print("Recall bag:", recall)

print("\n")

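#your code here -> paste
# Pasting: each tree is trained on 1000 rows drawn without replacement (bootstrap=False)
# Note: max_depth is left at its default (unlimited) here, unlike the bagging run above
pasting_clf = BaggingClassifier(DecisionTreeClassifier(),
                                n_estimators=200,
                                max_samples=1000,
                                bootstrap=False,
                                n_jobs=-1
                                )

pasting_clf.fit(X_train_norm, y_train)

pred = pasting_clf.predict(X_test_norm)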

precision = precision_score(y_test, pred)
accuracy = accuracy_score(y_test, pred)
recall = recall_score(y_test, pred)

print("Accuracy paste:", accuracy)
print("Precision paste:", precision)
print("Recall paste:", recall)



Accuracy bag: 0.7829046898638427
Precision bag: 0.7766272189349113
Recall bag: 0.794251134644478


Accuracy paste: 0.7768532526475038
Precision paste: 0.7798165137614679
Recall paste: 0.7715582450832073
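
In terms of sampling, the difference between the two runs is the `bootstrap` flag: bagging draws each estimator's training subset with replacement, pasting draws it without replacement. A minimal NumPy sketch of that difference, assuming a hypothetical training set of 5000 rows (the real size depends on the split above):

```python
import numpy as np

rng = np.random.default_rng(0)

n_train = 5000      # hypothetical training-set size, for illustration only
max_samples = 1000  # same subset size as in the cell above

# Bagging: rows are drawn with replacement, so the same row can appear several times
bag_idx = rng.choice(n_train, size=max_samples, replace=True)

# Pasting: rows are drawn without replacement, so every row appears at most once
paste_idx = rng.choice(n_train, size=max_samples, replace=False)

n_unique_bag = np.unique(bag_idx).size
n_unique_paste = np.unique(paste_idx).size

print("Distinct rows in one bagging subset:", n_unique_bag)
print("Distinct rows in one pasting subset:", n_unique_paste)
```

Because a bagging subset repeats some rows, it contains fewer distinct rows than a pasting subset of the same size.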


- Random Forests

In [28]:
forest = RandomForestClassifier(n_estimators=200,
                             max_depth=20)

forest.fit(X_train_norm, y_train)

pred = forest.predict(X_test_norm)

precision = precision_score(y_test, pred)
accuracy = accuracy_score(y_test, pred)
recall = recall_score(y_test, pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.7881996974281392
Precision: 0.7917304747320061
Recall: 0.7821482602118003


- Gradient Boosting

In [34]:
#your code here

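# Gradient boosting: trees are added one after another, each one fitted to the errors of the ensemble so far
gb_clf = GradientBoostingClassifier(max_depth=20,
                                    n_estimators=200)

gb_clf.fit(X_train_norm, y_train)

pred = gb_clf.predict(X_test_norm)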

precision = precision_score(y_test, pred)
accuracy = accuracy_score(y_test, pred)
recall = recall_score(y_test, pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.791981845688351
Precision: 0.7880597014925373
Recall: 0.7987897125567323


- Adaptive Boosting

In [39]:
# AdaBoost: each new tree gives more weight to the rows the previous trees misclassified
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=20),
                             n_estimators=200
                             )

ada_clf.fit(X_train_norm, y_train)

pred = ada_clf.predict(X_test_norm)

precision = precision_score(y_test, pred)
accuracy = accuracy_score(y_test, pred)
recall = recall_score(y_test, pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)



Accuracy: 0.7715582450832073
Precision: 0.7597684515195369
Recall: 0.794251134644478
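
To make the comparison easier, the scores printed above can be collected in one table. This is only a convenience sketch: the numbers are copied from the outputs in this notebook and rounded to four decimals, and the row labels are chosen here for readability.

```python
import pandas as pd

# Scores copied from the outputs above, rounded to four decimals
scores = pd.DataFrame(
    {
        "accuracy":  [0.7844, 0.7829, 0.7769, 0.7882, 0.7920, 0.7716],
        "precision": [0.7994, 0.7766, 0.7798, 0.7917, 0.7881, 0.7598],
        "recall":    [0.7595, 0.7943, 0.7716, 0.7821, 0.7988, 0.7943],
    },
    index=["Previous lab", "Bagging", "Pasting",
           "Random Forest", "Gradient Boosting", "AdaBoost"],
)

# Model with the highest value for each metric
best_per_metric = scores.idxmax()

print(scores.sort_values("recall", ascending=False))
print(best_per_metric)
```

On this single train/test split the gaps are small (all accuracies lie within about two percentage points), so the ranking should be read with caution.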


Which model is the best and why?

In [98]:
#comment here
"""
Most models reach an accuracy between 0.77 and 0.79.
While KNN scored higher in Precision, Gradient Boosting and Bagging performed better in Recall.

If the goal is to identify the passengers who were transported, Recall might be our most valued metric, which points to Gradient Boosting.
However, we should proceed with cross-validation on all models, because the scores vary noticeably with the sampling and our metrics are too close for a definitive answer.
"""

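
One possible way to run the cross-validation suggested above is sketched here. It uses a synthetic dataset from `make_classification` as a stand-in so that the cell runs on its own; in the notebook, `X` and `y` would be replaced by the prepared features and target. The number of folds, the two models compared and the reduced `n_estimators` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (assumption): replace with the real features and target
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Stratified folds keep the class balance similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

metrics = ("accuracy", "precision", "recall")
cv_summary = {}
for name, model in models.items():
    results = cross_validate(model, X, y, cv=cv, scoring=list(metrics))
    # Store the mean and standard deviation of each metric over the folds
    cv_summary[name] = {
        m: (results[f"test_{m}"].mean(), results[f"test_{m}"].std())
        for m in metrics
    }

for name, summary in cv_summary.items():
    for m, (mean, std) in summary.items():
        print(f"{name} {m}: {mean:.3f} +/- {std:.3f}")
```

Comparing the mean scores together with their spread across folds gives a better idea of whether one model is really ahead than a single split does. With real data, the scaler would ideally be fitted inside each fold (for example through a `Pipeline`) rather than once up front.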