# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this lab, try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous lab.

In [12]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier

In [13]:
# 1. Load the data
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [14]:
# 2. Drop missing values 
spaceship_cleaned = spaceship.dropna()
spaceship_cleaned.isnull().sum()

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
Transported     0
dtype: int64

Now perform the same as before:
- Feature Scaling
- Feature Selection


In [15]:
# 3. Convert to numerical data --> dummify
# Work on an explicit copy so the assignment below does not raise SettingWithCopyWarning
spaceship_cleaned = spaceship_cleaned.copy()

# Keep only the first character of Cabin (the deck letter)
spaceship_cleaned['Cabin'] = spaceship_cleaned['Cabin'].str[0]

# Drop the identifier columns
spaceship_cleaned_2 = spaceship_cleaned.drop(columns=['PassengerId', 'Name'])

# One-hot encode the categorical columns, then cast every boolean column to 0/1
spaceship_cleaned_2 = pd.get_dummies(spaceship_cleaned_2)
boolean_columns = spaceship_cleaned_2.select_dtypes(include=['bool']).columns
spaceship_cleaned_2[boolean_columns] = spaceship_cleaned_2[boolean_columns].astype(int)

# spaceship_cleaned_2.head()
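Assigning to a column of a frame returned by `dropna()` can trigger pandas' SettingWithCopyWarning. The following is a minimal sketch of one way to avoid it, using a made-up toy frame (`toy` and its values are illustrative assumptions, not the real data): take an explicit `.copy()` and extract the deck letter with the vectorised `.str[0]` accessor.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "Cabin": ["B/0/P", "F/0/S", None, "A/0/S"],
    "Age": [39.0, 24.0, 30.0, 58.0],
})

# dropna() followed by .copy() yields an independent frame,
# so the assignment below modifies it without a warning
toy_cleaned = toy.dropna().copy()

# .str[0] takes the first character of each string, i.e. the deck letter
toy_cleaned["Cabin"] = toy_cleaned["Cabin"].str[0]

print(toy_cleaned["Cabin"].tolist())
```

The row with a missing `Cabin` is dropped, and the remaining cabins are reduced to their deck letters.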


In [16]:
# 4. Feature scaling
# Standardise every numeric column except the target (zero mean, unit variance);
# after get_dummies this includes the 0/1 dummy columns as well
numerical_features = spaceship_cleaned_2.select_dtypes(include=[np.number]).columns.drop('Transported')

scaler = StandardScaler()

spaceship_cleaned_2[numerical_features] = scaler.fit_transform(spaceship_cleaned_2[numerical_features])

spaceship_cleaned_2.head()

| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | HomePlanet_Earth | HomePlanet_Europa | HomePlanet_Mars | ... | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Destination_55 Cancri e | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e | VIP_False | VIP_True |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.695413 | -0.345756 | -0.285355 | -0.309494 | -0.273759 | -0.269534 | 0 | -1.083063 | 1.717147 | -0.510811 | ... | -0.244975 | -0.339578 | -0.695098 | -0.652578 | -0.017402 | -0.52022 | -0.322689 | 0.666047 | 0.158555 | -0.158555 |
| 1 | -0.336769 | -0.176748 | -0.279993 | -0.266112 | 0.206165 | -0.230494 | 1 | 0.923307 | -0.582361 | -0.510811 | ... | -0.244975 | -0.339578 | 1.438646 | -0.652578 | -0.017402 | -0.52022 | -0.322689 | 0.666047 | 0.158555 | -0.158555 |
| 2 | 2.002842 | -0.279083 | 1.845163 | -0.309494 | 5.596357 | -0.226058 | 0 | -1.083063 | 1.717147 | -0.510811 | ... | -0.244975 | -0.339578 | -0.695098 | -0.652578 | -0.017402 | -0.52022 | -0.322689 | 0.666047 | -6.306963 | 6.306963 |
| 3 | 0.28254 | -0.345756 | 0.479034 | 0.334285 | 2.636384 | -0.098291 | 0 | -1.083063 | 1.717147 | -0.510811 | ... | -0.244975 | -0.339578 | -0.695098 | -0.652578 | -0.017402 | -0.52022 | -0.322689 | 0.666047 | 0.158555 | -0.158555 |
| 4 | -0.887266 | 0.124056 | -0.24365 | -0.04747 | 0.220152 | -0.267759 | 1 | 0.923307 | -0.582361 | -0.510811 | ... | -0.244975 | -0.339578 | 1.438646 | -0.652578 | -0.017402 | -0.52022 | -0.322689 | 0.666047 | 0.158555 | -0.158555 |
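The scaling cell fits `StandardScaler` on the full dataset before any train/test split. A common alternative is to split first and fit the scaler on the training rows only, so that test-set statistics do not influence preprocessing. The sketch below shows this on synthetic data; `X_demo`, `y_demo` and `demo_scaler` are illustrative names, not part of the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 200 rows, 3 numeric features)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # mean and std learned from training rows only
X_te_scaled = demo_scaler.transform(X_te)      # test rows reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(6))
print(X_te_scaled.mean(axis=0).round(3))
```

The training columns come out with zero mean and unit variance, while the test columns are only approximately centred, which is the expected behaviour. Tree-based ensembles are largely insensitive to this kind of monotonic rescaling, so the choice matters more for models such as KNN or logistic regression.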


In [17]:
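# 5. Feature selection --> rank features by random forest importance (no feature is dropped here)
# Encode categorical features
# (after get_dummies no object columns remain, so the loop below has nothing to encode)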
spaceship_encoded = spaceship_cleaned_2.copy()
categorical_features = spaceship_cleaned_2.select_dtypes(include=[object]).columns

for col in categorical_features:
    spaceship_encoded[col] = LabelEncoder().fit_transform(spaceship_cleaned_2[col].astype(str))

# Define target and features
X = spaceship_encoded.drop(columns=['Transported'])  
y = spaceship_encoded['Transported']

# Initialize the model
model = RandomForestClassifier()

# Fit the model
model.fit(X, y)

# Get feature importances
importances = model.feature_importances_

# Create a DataFrame for visualization
feature_importances = pd.DataFrame({'Feature': X.columns, 'Importance': importances})

# Sort the DataFrame by importance
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)

# Display the most important features
feature_importances

| | Feature | Importance |
|---|---|---|
| 0 | Age | 0.165152 |
| 4 | Spa | 0.125644 |
| 5 | VRDeck | 0.115458 |
| 1 | RoomService | 0.113112 |
| 2 | FoodCourt | 0.108686 |
| 3 | ShoppingMall | 0.09183 |
| 10 | CryoSleep_True | 0.07356 |
| 9 | CryoSleep_False | 0.064795 |
| 6 | HomePlanet_Earth | 0.021194 |
| 7 | HomePlanet_Europa | 0.018957 |
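The importance ranking by itself does not remove any feature. If you actually want to select a subset, one option is scikit-learn's `SelectFromModel`, sketched below on synthetic data. The `threshold="median"` setting and all `*_demo` names are assumptions chosen for illustration, not the method used in this lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 12 features, of which 4 are informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=12, n_informative=4, n_redundant=0, random_state=42
)

# Keep the features whose importance is at least the median importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median",
)
selector.fit(X_demo, y_demo)

mask = selector.get_support()          # boolean mask over the original columns
X_selected = selector.transform(X_demo)

print(f"Kept {mask.sum()} of {mask.size} features")
```

To keep the comparison fair, the selector would be fitted on the training split only and then applied to the test split with `transform`.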


**Perform Train Test Split**

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shape of the resulting splits
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (5284, 24)
X_test shape: (1322, 24)
y_train shape: (5284,)
y_test shape: (1322,)


**Model Selection** - now try different ensemble methods to see whether any of them gives a better model

- Bagging and Pasting

In [19]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Define the base learner
base_learner = DecisionTreeClassifier()

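# Initialize the BaggingClassifier
# The base learner is passed positionally because the keyword changed:
# `base_estimator` before scikit-learn 1.2, `estimator` in current releases
bagging = BaggingClassifier(base_learner, n_estimators=100, bootstrap=True, random_state=42)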

# Fit the BaggingClassifier
bagging.fit(X_train, y_train)

# Predict and evaluate
y_pred_bagging = bagging.predict(X_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)

print(f"Bagging Accuracy: {accuracy_bagging}")



Bagging Accuracy: 0.8048411497730711
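The cell above covers bagging only (`bootstrap=True`, sampling with replacement). Pasting is the same ensemble with sampling without replacement, which in scikit-learn means `bootstrap=False` together with `max_samples` below 1.0. A minimal sketch on synthetic data follows; the 0.8 sample fraction and the `*_demo` names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Pasting: each tree sees a random 80% of the training rows, drawn WITHOUT replacement.
# With bootstrap=False and max_samples=1.0 every tree would see identical data.
pasting = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),  # base learner, passed positionally
    n_estimators=50,
    max_samples=0.8,
    bootstrap=False,
    random_state=42,
)
pasting.fit(X_tr, y_tr)

acc_pasting = accuracy_score(y_te, pasting.predict(X_te))
print(f"Pasting accuracy: {acc_pasting:.4f}")
```

On the Spaceship data the same configuration can be fitted on `X_train` and `y_train` and compared against the bagging score.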


- Random Forests

In [20]:
# RandomForestClassifier and accuracy_score are already imported in earlier cells

# Initialize the Random Forest classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model
rf_clf.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Accuracy: {accuracy}")

Random Forest Accuracy: 0.8116490166414524
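Because each tree in a random forest is trained on a bootstrap sample, the rows left out of that sample can serve as a built-in validation set. The sketch below shows how to read this out-of-bag (OOB) estimate via `oob_score=True`. It uses synthetic data, and the choice of 200 trees is an assumption made so that every row is very likely to be out of bag at least once.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)

# oob_score=True scores each row using only the trees that did not train on it
rf_oob = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf_oob.fit(X_demo, y_demo)

print(f"OOB accuracy estimate: {rf_oob.oob_score_:.4f}")
```

The OOB score gives a second accuracy estimate without touching the test set, which can help judge how stable the single-split result is.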


- Gradient Boosting

In [21]:
from sklearn.ensemble import GradientBoostingClassifier

# Initialize the Gradient Boosting classifier
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Fit the model
gb_clf.fit(X_train, y_train)

# Predict on the test set
y_pred = gb_clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Gradient Boosting Accuracy: {accuracy}")

Gradient Boosting Accuracy: 0.8101361573373677
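Gradient boosting adds trees sequentially, so accuracy depends on how many boosting stages are used. `staged_predict` yields the prediction after each stage, which makes it possible to see where a validation score peaks. The sketch below uses synthetic data and a separate validation split; the 200-stage budget and the `*_demo` names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a held-out validation split
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

gb_demo = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, random_state=42
)
gb_demo.fit(X_tr, y_tr)

# Validation accuracy after 1, 2, ..., 200 boosting stages
staged_acc = [
    accuracy_score(y_val, y_stage) for y_stage in gb_demo.staged_predict(X_val)
]
best_n = int(np.argmax(staged_acc)) + 1  # stages are 1-based

print(f"Best number of stages on validation data: {best_n}")
```

The stage count should be chosen on validation data, not on the final test set, otherwise the reported test accuracy becomes optimistic.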


- Adaptive Boosting

In [22]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

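# Initialize the AdaBoost classifier with a depth-1 decision tree (a stump) as the base learner
# The base learner is passed positionally because the keyword changed:
# `base_estimator` before scikit-learn 1.2, `estimator` in current releases
base_learner = DecisionTreeClassifier(max_depth=1)
ada_clf = AdaBoostClassifier(base_learner, n_estimators=100, learning_rate=0.1, random_state=42)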

# Fit the model
ada_clf.fit(X_train, y_train)

# Predict on the test set
y_pred = ada_clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"AdaBoost Accuracy: {accuracy}")



AdaBoost Accuracy: 0.7760968229954615
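With depth-1 stumps, a small `learning_rate` shrinks each stump's contribution, so 100 estimators may not be enough for the ensemble to fit well. Whether that explains the lower score here has not been checked; the sketch below only shows how one might compare a few learning rates on a validation split. It uses synthetic data, and the candidate values 0.1, 0.5 and 1.0 are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a held-out validation split
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

ada_scores = {}
for lr in (0.1, 0.5, 1.0):
    ada_demo = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1),  # stump, passed positionally
        n_estimators=100,
        learning_rate=lr,
        random_state=42,
    )
    ada_demo.fit(X_tr, y_tr)
    ada_scores[lr] = accuracy_score(y_val, ada_demo.predict(X_val))

for lr, acc in ada_scores.items():
    print(f"learning_rate={lr}: validation accuracy {acc:.4f}")
```

The same loop can be run on `X_train` with a validation split carved out of it, keeping `X_test` untouched for the final comparison.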


Which model is the best and why?

# Summary of Model Performance

Test-set accuracy on the 1,322 held-out rows:

- Random Forest: 0.8116
- Gradient Boosting: 0.8101
- Bagging: 0.8048
- AdaBoost: 0.7761

Random Forest is the best model in this context, although only narrowly:

- Highest accuracy: it achieved the highest test accuracy of the four models. Its lead over Gradient Boosting is about 0.0015, which corresponds to 2 of the 1,322 test predictions, so the two are effectively tied on a single split.
- Robustness: averaging many decorrelated decision trees reduces variance, so Random Forests tend to overfit less than a single tree.
- Flexibility: tree-based models handle a mix of numeric and one-hot encoded features and do not depend on feature scaling.
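The accuracies above come from a single train/test split, so small gaps between models may be noise. A hedged way to check is k-fold cross-validation, which reports a mean and spread per model. The sketch below runs on synthetic data with 5 folds and 50 estimators per ensemble; these settings and the `*_demo` names are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)

models = {
    "Bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=42
    ),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
    "AdaBoost": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=42
    ),
}

# Mean and standard deviation of accuracy over 5 folds for each model
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())

for name, (mean_acc, std_acc) in cv_results.items():
    print(f"{name}: {mean_acc:.4f} +/- {std_acc:.4f}")
```

If two models differ by less than their fold-to-fold standard deviation, the ranking between them should not be treated as conclusive.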