# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [64]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [65]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [66]:
# let's do the usual cleaning.

# check the data types
spaceship.info()

# cast the VIP and CryoSleep columns to bool
# note: astype('bool') maps missing values (NaN) to True, and it runs before dropna() below,
# so rows with a missing VIP or CryoSleep are kept and treated as True
spaceship['VIP'] = spaceship['VIP'].astype('bool')
spaceship['CryoSleep'] = spaceship['CryoSleep'].astype('bool')

# drop the identifier columns PassengerId and Name
spaceship = spaceship.drop(['PassengerId', 'Name'], axis=1)

# define categorical vs numerical columns (bool columns count as numerical here)
cat_var = spaceship.select_dtypes(include="object").columns
num_var = spaceship.select_dtypes(exclude="object").columns

print("Categorical Variables: ", cat_var)
print("Numerical Variables: ", num_var)

# sanity check: every column belongs to exactly one of the two groups
assert len(cat_var) + len(num_var) == len(spaceship.columns)

# we will drop rows containing any missing value
spaceship = spaceship.dropna(axis=0)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB
Categorical Variables:  Index(['HomePlanet', 'Cabin', 'Destination'], dtype='object')
Numerical Variables:  Index(['CryoSleep', 'Age', 'VIP', 
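
The order of the cast and the `dropna` call matters, because casting an object column to `bool` turns missing values into `True`. The following is a minimal sketch on a toy column (the `vip` series is made up for illustration and is not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy column mimicking VIP: object dtype with one missing value (illustrative data)
vip = pd.Series([True, False, np.nan], dtype="object")

# Casting directly turns the missing value into True
cast_first = vip.astype("bool")

# Dropping missing values first keeps only the observed entries
drop_first = vip.dropna().astype("bool")

print(cast_first.tolist())  # [True, False, True]
print(drop_first.tolist())  # [True, False]
```

If missing `VIP` or `CryoSleep` values should be removed rather than counted as `True`, one option is to call `dropna` before the cast. That would change the number of rows and therefore the scores reported in this notebook.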

In [67]:
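# let's explore the columns of type float:
# they all appear to hold integer values.
float_columns = spaceship.select_dtypes(include="float").columns

# for non-negative values, casting to int can only lower a column sum,
# so a difference of zero means no value had a fractional part
float_sums = spaceship[float_columns].sum()
int_sums = spaceship[float_columns].astype(int).sum()

print(float_sums - int_sums)

# we will cast them to int
spaceship[float_columns] = spaceship[float_columns].astype(int)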
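
A per-value check is a slightly more direct way to confirm that a float column holds only whole numbers. This is a small sketch on toy data; the helper name `is_integer_valued` and the example values are assumptions for illustration:

```python
import pandas as pd

def is_integer_valued(df: pd.DataFrame) -> pd.Series:
    """Return, per column, whether every value has a zero fractional part."""
    return (df % 1 == 0).all()

# Toy frame: Age holds whole numbers, Spa contains one fractional value
toy = pd.DataFrame({"Age": [39.0, 24.0, 58.0], "Spa": [0.0, 549.5, 6715.0]})

result = is_integer_valued(toy)
print(result)
```

In the notebook this could be applied as `is_integer_valued(spaceship[float_columns])` before the cast to `int`.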

Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [68]:
# feature engineering
# keep only the deck letter of the cabin, one of {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}
spaceship['Cabin'] = spaceship['Cabin'].str.split('/').str[0]

# create dummies for the non-numerical columns;
# drop_first=True drops the first level of each categorical variable to avoid perfectly correlated dummy columns
spaceship = pd.get_dummies(spaceship, columns=cat_var, drop_first=True)
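
To make the effect of the two steps above concrete, here is a minimal sketch on three toy cabin values (the `toy` frame is illustrative only):

```python
import pandas as pd

# Three toy cabin codes in the deck/number/side format
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/0/S", "A/0/S"]})

# Keep only the deck letter
toy["Cabin"] = toy["Cabin"].str.split("/").str[0]

# One-hot encode; drop_first=True removes the first level (deck A here)
dummies = pd.get_dummies(toy, columns=["Cabin"], drop_first=True)

print(dummies.columns.tolist())  # ['Cabin_B', 'Cabin_F']
```

A row with both dummy columns set to `False` then stands for the dropped level, so no information is lost.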

In [69]:
# feature selection
# let's see if any variables are correlated. 

import plotly.express as px

# Calculate the absolute correlation matrix (all values lie in [0, 1])
correlation_matrix = np.abs(spaceship.corr())

# Create the heatmap using Plotly Express
fig = px.imshow(correlation_matrix,
                x=correlation_matrix.columns,
                y=correlation_matrix.columns,
                color_continuous_scale='Reds',  # sequential scale, since absolute values are non-negative
                zmin=0,
                zmax=1,
                aspect="auto",
                title='Correlation Heatmap (Absolute Values)')

# Update the layout for better readability
fig.update_layout(
    xaxis_title="",
    yaxis_title="",
    xaxis={'side': 'top'},  # Move x-axis labels to the top
    width=800,
    height=700
)

# Add correlation values as text annotations
for i, row in enumerate(correlation_matrix.values):
    for j, value in enumerate(row):
        fig.add_annotation(
            x=correlation_matrix.columns[j],
            y=correlation_matrix.columns[i],
            text=f"{value:.2f}",
            showarrow=False,
            font=dict(size=8)
        )

# Show the plot
fig.show()


# all the variables seem quite independent. 
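
Reading many small cells off a heatmap is error-prone, so it can help to list the strongest pairs directly. The sketch below uses synthetic data, and the helper `top_correlated_pairs` is a hypothetical name, not part of pandas:

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n largest absolute pairwise correlations, each pair once."""
    corr = df.corr().abs()
    # Keep only the upper triangle; k=1 excludes the diagonal of ones
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Synthetic example: b is almost a linear function of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pairs = top_correlated_pairs(toy, n=3)
print(pairs)
```

Calling `top_correlated_pairs(spaceship)` in the notebook would give a ranked list to back up the visual impression from the heatmap.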

**Perform Train Test Split**

In [70]:
# train test split
X = spaceship.drop('Transported', axis=1)
y = spaceship['Transported']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [71]:
# scale all features to the [0, 1] range (the scaler is fitted on the training set only)
from sklearn.preprocessing import MinMaxScaler

normalizer = MinMaxScaler()

X_train_norm = normalizer.fit_transform(X_train)

X_test_norm = normalizer.transform(X_test)
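
Fitting the scaler on the training set and only transforming the test set avoids leaking test information into the model. One way to make that hard to get wrong is to wrap scaler and model in a scikit-learn pipeline. This is a sketch on synthetic data from `make_classification`, not the notebook's own setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Spaceship Titanic features (assumption for illustration)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

# fit() fits the scaler on X_tr only; score() reuses that fitted scaler on X_te
pipe = make_pipeline(MinMaxScaler(), DecisionTreeClassifier(max_depth=5, random_state=42))
pipe.fit(X_tr, y_tr)

toy_accuracy = pipe.score(X_te, y_te)
print("toy accuracy:", toy_accuracy)
```

Tree-based models split on thresholds, so min-max scaling generally should not change their predictions much; it is kept in this Lab to stay comparable with the previous one.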

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

In [82]:
# import the base estimator and the ensemble classifiers
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier

- Bagging and Pasting

In [74]:
# bagging: each tree is trained on 1000 samples drawn with replacement (bootstrap=True is the default)
bagging_reg = BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                               n_estimators=100,
                               max_samples=1000)

bagging_reg.fit(X_train_norm, y_train)

# predict the labels of the test set
pred = bagging_reg.predict(X_test_norm)



In [79]:
# evaluate the model
from sklearn.metrics import f1_score

# Predict the labels of the test set
pred = bagging_reg.predict(X_test_norm)

# Calculate the accuracy
print("Accuracy:", bagging_reg.score(X_test_norm, y_test))

# Compute the F1 score
f1 = f1_score(y_test, pred)

print("F1 score:", f1)

# compared with the model from the previous Lab, the accuracy improved only slightly

Accuracy: 0.7970946579194002
F1 score: 0.798510935318753
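
The heading mentions pasting as well, but the cell above only uses bagging, which samples with replacement. In scikit-learn, pasting is obtained from the same `BaggingClassifier` by setting `bootstrap=False`. A minimal sketch on synthetic data (sizes and hyperparameters are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled training and test sets
X_toy, y_toy = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

scores = {}
for name, bootstrap in [("bagging", True), ("pasting", False)]:
    # bootstrap=True samples with replacement, bootstrap=False without
    model = BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                              n_estimators=50,
                              max_samples=100,
                              bootstrap=bootstrap,
                              random_state=0)
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)

print(scores)
```

With pasting, `max_samples` cannot exceed the number of training rows, since each row can be drawn at most once per estimator.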


- Random Forests

In [80]:
# initialise Random Forest Classifier

forest = RandomForestClassifier(n_estimators=100,
                             max_depth=20)
forest.fit(X_train_norm, y_train)


# evaluate the model
pred = forest.predict(X_test_norm)

print("accuracy", forest.score(X_test_norm, y_test))
print("f1 score", f1_score(y_test, pred))

accuracy 0.7877225866916588
f1 score 0.788020589611605


- Gradient Boosting

In [81]:
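# initialise Gradient Boosting Classifier
# note: max_depth=20 is far deeper than the default of 3; boosting is usually built from shallow trees
gb_reg = GradientBoostingClassifier(max_depth=20,
                            n_estimators=100)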
gb_reg.fit(X_train_norm, y_train)


# evaluate the model
pred = gb_reg.predict(X_test_norm)

print("accuracy", gb_reg.score(X_test_norm, y_test))
print("f1 score", f1_score(y_test, pred))

accuracy 0.7460168697282099
f1 score 0.7554151624548736
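
Gradient Boosting scores lowest here, and one plausible reason is the tree depth: `max_depth=20` is far above the scikit-learn default of 3, and very deep trees in a boosting ensemble tend to overfit. The sketch below compares the two depths on synthetic data; it is an illustration of how to check this, not a result for the Spaceship Titanic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in with some label noise (flip_y) so that overfitting can show up
X_toy, y_toy = make_classification(n_samples=500, n_features=10, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

depth_scores = {}
for depth in (3, 20):
    gb = GradientBoostingClassifier(max_depth=depth, n_estimators=50, random_state=0)
    gb.fit(X_tr, y_tr)
    # A large gap between train and test accuracy hints at overfitting
    depth_scores[depth] = {"train": gb.score(X_tr, y_tr), "test": gb.score(X_te, y_te)}

print(depth_scores)
```

Running the same loop on `X_train_norm` and `X_test_norm` would show whether a shallower setting helps on the actual data.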


- Adaptive Boosting

In [83]:
# initialise AdaBoost Classifier
ada_reg = AdaBoostClassifier(n_estimators=100)
ada_reg.fit(X_train_norm, y_train)


# evaluate the model
pred = ada_reg.predict(X_test_norm)

print("accuracy", ada_reg.score(X_test_norm, y_test))
print("f1 score", f1_score(y_test, pred))





accuracy 0.7938144329896907
f1 score 0.8051372896368467


Which model is the best and why?

In [None]:
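# AdaBoost has the highest F1 score (0.805), but bagging has the highest accuracy (0.797 vs. 0.794).
# The two are very close, so neither is clearly better; Random Forest follows (0.788 on both metrics)
# and Gradient Boosting is the weakest (0.746 accuracy, 0.755 F1).
# None of the models sets random_state, so differences this small may change between runs.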
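
Collecting the metrics of all models in one table makes the comparison easier to read than scattered print statements. The following is a sketch on synthetic data; the helper `compare_models` and the hyperparameters are assumptions for illustration:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def compare_models(models, X_tr, X_te, y_tr, y_te):
    """Fit each model and return accuracy and F1 on the test set, sorted by F1."""
    rows = []
    for name, model in models.items():
        pred = model.fit(X_tr, y_tr).predict(X_te)
        rows.append({"model": name,
                     "accuracy": accuracy_score(y_te, pred),
                     "f1": f1_score(y_te, pred)})
    return pd.DataFrame(rows).set_index("model").sort_values("f1", ascending=False)

# Synthetic stand-in for the scaled training and test sets
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)

# Fixed random_state values make the comparison repeatable
models = {
    "bagging": BaggingClassifier(DecisionTreeClassifier(max_depth=5), n_estimators=30, random_state=1),
    "random_forest": RandomForestClassifier(n_estimators=30, max_depth=5, random_state=1),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=30, random_state=1),
    "adaboost": AdaBoostClassifier(n_estimators=30, random_state=1),
}

results = compare_models(models, X_tr, X_te, y_tr, y_te)
print(results)
```

A single train/test split can still rank close models differently from run to run; cross-validation (for example with `cross_val_score`) would give a more stable comparison.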