# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this lab, try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

|   | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [4]:
# Inspect column types and non-null counts
spaceship.info()

# Count missing values per column
spaceship.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [8]:
from sklearn.preprocessing import StandardScaler

# Select the numeric columns
num_cols = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Initialize the scaler
scaler = StandardScaler()

# Scale the numeric columns
# (note: the scaler is fit on the full dataset, before the train/test split)
scaled_features = scaler.fit_transform(spaceship[num_cols])

# Store the scaled values in a new DataFrame
scaled_df = pd.DataFrame(scaled_features, columns=num_cols)
scaled_df.head()

|   | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|
| 0 | 0.702095 | -0.337025 | -0.284274 | -0.287317 | -0.273736 | -0.266098 |
| 1 | -0.333233 | -0.173528 | -0.278689 | -0.245971 | 0.209267 | -0.227692 |
| 2 | 2.01351 | -0.272527 | 1.934922 | -0.287317 | 5.634034 | -0.223327 |
| 3 | 0.287964 | -0.337025 | 0.511931 | 0.32625 | 2.655075 | -0.097634 |
| 4 | -0.885407 | 0.117466 | -0.240833 | -0.03759 | 0.223344 | -0.264352 |
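
The cell above fits the scaler on every row, so statistics from rows that later end up in the test set influence the scaling. A minimal sketch of the split-first alternative is shown here. The toy DataFrame holds random stand-in values (an assumption for illustration, not the real dataset), and `toy`, `toy_train`, and `toy_test` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the numeric columns (random values, NOT the real data)
rng = np.random.default_rng(42)
num_cols = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
toy = pd.DataFrame(rng.exponential(scale=100.0, size=(200, len(num_cols))),
                   columns=num_cols)

# Split first, so the test rows never influence the scaler
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)

# Fit on the training rows only, then apply the same transform to both parts
scaler = StandardScaler()
train_scaled = pd.DataFrame(scaler.fit_transform(toy_train),
                            columns=num_cols, index=toy_train.index)
test_scaled = pd.DataFrame(scaler.transform(toy_test),
                           columns=num_cols, index=toy_test.index)

# Training columns have mean ~0; test columns are only approximately centered
print(train_scaled.mean().round(3))
print(test_scaled.mean().round(3))
```

Whether this changes the accuracy noticeably on the real data is not something the lab measured; the sketch only shows the order of operations.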


In [10]:
# Merge the scaled features back into the main data
spaceship_scaled = spaceship.copy()
spaceship_scaled[num_cols] = scaled_df

# Drop columns that are not used as features
spaceship_scaled = spaceship_scaled.drop(['PassengerId', 'Name', 'Cabin'], axis=1)

# Inspect the result
spaceship_scaled.head()

|   | HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Europa | False | TRAPPIST-1e | 0.702095 | False | -0.337025 | -0.284274 | -0.287317 | -0.273736 | -0.266098 | False |
| 1 | Earth | False | TRAPPIST-1e | -0.333233 | False | -0.173528 | -0.278689 | -0.245971 | 0.209267 | -0.227692 | True |
| 2 | Europa | False | TRAPPIST-1e | 2.01351 | True | -0.272527 | 1.934922 | -0.287317 | 5.634034 | -0.223327 | False |
| 3 | Europa | False | TRAPPIST-1e | 0.287964 | False | -0.337025 | 0.511931 | 0.32625 | 2.655075 | -0.097634 | False |
| 4 | Earth | False | TRAPPIST-1e | -0.885407 | False | 0.117466 | -0.240833 | -0.03759 | 0.223344 | -0.264352 | True |


**Perform Train Test Split**

In [26]:
from sklearn.model_selection import train_test_split

# One-hot encode all categorical columns
spaceship_encoded = pd.get_dummies(spaceship_scaled)

# Separate the features (X) from the target (y)
X = spaceship_encoded.drop('Transported', axis=1)
y = spaceship_encoded['Transported']

# Hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shapes
X_train.shape, X_test.shape

((6954, 16), (1739, 16))
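
The categorical columns still contain missing values when they are encoded. By default, `pd.get_dummies` does not create a column for them, so a row with a missing category becomes all zeros. The sketch below illustrates this on a tiny made-up frame (the values are an assumption for illustration), and shows the `dummy_na=True` option that gives missing values their own indicator column.

```python
import numpy as np
import pandas as pd

# Tiny made-up example with one missing category (NOT the real data)
toy = pd.DataFrame({'HomePlanet': ['Earth', 'Europa', np.nan, 'Mars']})

# Default behaviour: no column for NaN, so row 2 is all zeros
default_dummies = pd.get_dummies(toy)

# dummy_na=True adds an explicit indicator column for missing values
na_dummies = pd.get_dummies(toy, dummy_na=True)

print(default_dummies)
print(na_dummies)
```

Either encoding is usable; the explicit indicator simply lets a model distinguish "missing" from "none of the listed categories".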

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [30]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Base model
base_model = DecisionTreeClassifier(random_state=42)

# Bagging: each tree is trained on a bootstrap sample (drawn with replacement)
bagging_model = BaggingClassifier(estimator=base_model, n_estimators=100, bootstrap=True, random_state=42)
bagging_model.fit(X_train, y_train)
bagging_preds = bagging_model.predict(X_test)
bagging_acc = accuracy_score(y_test, bagging_preds)
print("Bagging Accuracy:", bagging_acc)

# Pasting: samples are drawn without replacement
# (note: with the default max_samples=1.0, every tree sees the full training set)
pasting_model = BaggingClassifier(estimator=base_model, n_estimators=100, bootstrap=False, random_state=42)
pasting_model.fit(X_train, y_train)
pasting_preds = pasting_model.predict(X_test)
pasting_acc = accuracy_score(y_test, pasting_preds)
print("Pasting Accuracy:", pasting_acc)

Bagging Accuracy: 0.772857964347326
Pasting Accuracy: 0.7222541690626797


- Random Forests

In [32]:
from sklearn.ensemble import RandomForestClassifier

# Initialize and train the model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict and evaluate
rf_preds = rf_model.predict(X_test)
rf_acc = accuracy_score(y_test, rf_preds)

print("Random Forest Accuracy:", rf_acc)


Random Forest Accuracy: 0.7757331799884991
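
Because each tree in a random forest is trained on a bootstrap sample, the rows a tree did not see can serve as a built-in validation set. The sketch below shows this out-of-bag (OOB) estimate via `oob_score=True`, using synthetic data from `make_classification` as a stand-in (an assumption, not the lab dataset); `rf_oob` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (NOT the Spaceship Titanic dataset)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=42)

# oob_score=True scores each row using only the trees that did not train on it
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(X_toy, y_toy)

print("OOB accuracy:", rf_oob.oob_score_)
```

The OOB score gives a second accuracy estimate without touching the test set, which can be compared against the held-out accuracy as a sanity check.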


- Gradient Boosting

In [36]:
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier

# Initialize the imputer (replace NaN with the column mean)
imputer = SimpleImputer(strategy='mean')

# Apply it to all features; fit_transform returns a new array, so X is left unchanged
X_filled = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

# Re-split the data (same random_state, so the same rows land in the test set)
X_train, X_test, y_train, y_test = train_test_split(X_filled, y, test_size=0.2, random_state=42)

# Initialize and train the model
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb_model.fit(X_train, y_train)

# Predict and evaluate
gb_preds = gb_model.predict(X_test)
gb_acc = accuracy_score(y_test, gb_preds)

print("Gradient Boosting Accuracy:", gb_acc)

Gradient Boosting Accuracy: 0.7832087406555491
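
The imputer above is fit on all rows before the split, so test-set values contribute to the column means. One way to keep imputation inside the training data is to wrap it in a pipeline with the model. The following is a minimal sketch on synthetic data with artificially injected missing values (both are assumptions for illustration); `gb_pipe` and the `*_toy` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with ~2% of cells set to NaN (NOT the real dataset)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=42)
rng = np.random.default_rng(42)
X_toy[rng.random(X_toy.shape) < 0.02] = np.nan

X_tr_toy, X_te_toy, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# The imputer's means are learned from the training rows only;
# predict() reuses those means when filling gaps in the test rows
gb_pipe = make_pipeline(
    SimpleImputer(strategy='mean'),
    GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42),
)
gb_pipe.fit(X_tr_toy, y_tr_toy)

toy_acc = accuracy_score(y_te_toy, gb_pipe.predict(X_te_toy))
print("Pipeline accuracy on toy data:", toy_acc)
```

The same pipeline object can be passed to any of the other ensemble models in place of the bare estimator, which keeps the preprocessing identical across models.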


- Adaptive Boosting

In [38]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Base model (weak learner): a depth-1 decision tree, i.e. a decision stump
base_estimator = DecisionTreeClassifier(max_depth=1, random_state=42)

# Initialize and train AdaBoost
ada_model = AdaBoostClassifier(estimator=base_estimator, n_estimators=100, learning_rate=0.5, random_state=42)
ada_model.fit(X_train, y_train)

# Predict and evaluate
ada_preds = ada_model.predict(X_test)
ada_acc = accuracy_score(y_test, ada_preds)

print("AdaBoost Accuracy:", ada_acc)



AdaBoost Accuracy: 0.7763082231167338


Which model is the best and why?

In [None]:
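In this lab, we evaluated five ensemble models (Pasting, Bagging, Random Forest, AdaBoost, and Gradient Boosting) to predict the Transported variable from the Spaceship Titanic dataset. Gradient Boosting achieved the highest test accuracy at 78.32%, which is consistent with its iterative approach: each new tree is fit to the errors of the ensemble built so far. AdaBoost (77.63%), Random Forest (77.57%), and Bagging (77.29%) followed within roughly one percentage point, while Pasting trailed at 72.23%. Pasting was run with bootstrap=False and the default max_samples, so every tree was trained on the full training set, which likely explains why it gained little from ensembling. Two caveats apply to the ranking: it rests on a single 80/20 split, and Gradient Boosting and AdaBoost were trained on mean-imputed features, whereas the other three models were fit in cells that run before the imputation step. With those caveats, Gradient Boosting is the best model here and our first choice for similar classification tasks, with Random Forest and AdaBoost (using decision stumps as a lightweight base learner) as close alternatives.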