# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [3]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [4]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same preprocessing as before:
- Feature Scaling
- Feature Selection


**Perform Train Test Split**

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

# Separate features and target
X = spaceship.drop(columns='Transported')
y = spaceship['Transported'].astype(int)

# One-hot encode categorical features
X_encoded = pd.get_dummies(X, drop_first=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42
)

# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Feature Selection
selector = SelectKBest(score_func=f_classif, k=10)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

selected_features = X_encoded.columns[selector.get_support()].tolist()
selected_features



['Age',
 'RoomService',
 'FoodCourt',
 'Spa',
 'VRDeck',
 'HomePlanet_Europa',
 'CryoSleep_True',
 'Cabin_G/943/S',
 'Destination_TRAPPIST-1e',
 'VIP_True']
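
To keep the comparison between models fair, the scaling and selection steps can be bundled with each estimator so that every model sees identical preprocessing, fitted on the training split only. The following is a minimal sketch using scikit-learn's `Pipeline`. It runs on synthetic data from `make_classification` (an assumption, so the snippet is self-contained), and the step names `scale`, `select`, and `model` are illustrative choices rather than part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded Spaceship Titanic features (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scaling, selection, and the model are fitted together on the training split
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipe.fit(X_tr, y_tr)

# score() applies the fitted scaler and selector to the test split, then predicts
demo_acc = pipe.score(X_te, y_te)
n_selected = int(pipe.named_steps["select"].get_support().sum())
print("Features kept:", n_selected)
print("Demo accuracy:", round(demo_acc, 4))
```

Swapping the `model` step for another ensemble leaves the preprocessing untouched, which is the property a like-for-like comparison needs.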

**Model Selection** - now apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [15]:
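# Imports needed by this cell
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Define the base estimator
base_model = DecisionTreeClassifier(random_state=42)

# Bagging: each tree is trained on 80% of the rows, sampled WITH replacement
# The base estimator is passed positionally because its keyword changed from
# `base_estimator` to `estimator` in scikit-learn 1.2
bagging = BaggingClassifier(
    base_model,
    n_estimators=50,
    max_samples=0.8,
    bootstrap=True,
    random_state=42
)
bagging.fit(X_train_selected, y_train)
y_pred_bag = bagging.predict(X_test_selected)
bagging_acc = accuracy_score(y_test, y_pred_bag)

# Pasting: same setup, but rows are sampled WITHOUT replacement
pasting = BaggingClassifier(
    base_model,
    n_estimators=50,
    max_samples=0.8,
    bootstrap=False,
    random_state=42
)
pasting.fit(X_train_selected, y_train)
y_pred_paste = pasting.predict(X_test_selected)
pasting_acc = accuracy_score(y_test, y_pred_paste)

# Print both scores (a bare expression would only display the last one)
print("Bagging Accuracy:", round(bagging_acc, 4))
print("Pasting Accuracy:", round(pasting_acc, 4))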
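
The only difference between the two `BaggingClassifier` configurations above is the `bootstrap` flag, which controls whether each estimator's training rows are drawn with or without replacement. The sketch below illustrates that sampling idea with the standard library alone. It is not scikit-learn's internal implementation, and the row count of 100 is an arbitrary assumption.

```python
import random

rng = random.Random(42)
n_rows = 100                      # hypothetical training-set size
sample_size = int(0.8 * n_rows)   # mirrors max_samples=0.8
row_ids = list(range(n_rows))

# Bagging (bootstrap=True): rows are drawn with replacement, so repeats occur
bagging_ids = rng.choices(row_ids, k=sample_size)

# Pasting (bootstrap=False): rows are drawn without replacement, so all are distinct
pasting_ids = rng.sample(row_ids, k=sample_size)

print("Bagging unique rows:", len(set(bagging_ids)), "of", sample_size)
print("Pasting unique rows:", len(set(pasting_ids)), "of", sample_size)
```

With replacement, each estimator sees fewer distinct rows, which tends to make the individual trees more diverse.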

- Random Forests

In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    random_state=42
)
rf_model.fit(X_train_selected, y_train)

y_pred_rf = rf_model.predict(X_test_selected)

rf_acc = accuracy_score(y_test, y_pred_rf)
print("Random Forest Accuracy:", round(rf_acc, 4))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))


Random Forest Accuracy: 0.7805

Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.77      0.78       699
           1       0.76      0.79      0.78       654

    accuracy                           0.78      1353
   macro avg       0.78      0.78      0.78      1353
weighted avg       0.78      0.78      0.78      1353



- Gradient Boosting

In [20]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report

gb_model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

gb_model.fit(X_train_selected, y_train)

y_pred_gb = gb_model.predict(X_test_selected)

gb_acc = accuracy_score(y_test, y_pred_gb)
print("Gradient Boosting Accuracy:", round(gb_acc, 4))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_gb))


Gradient Boosting Accuracy: 0.7938

Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.74      0.79       699
           1       0.76      0.85      0.80       654

    accuracy                           0.79      1353
   macro avg       0.80      0.80      0.79      1353
weighted avg       0.80      0.79      0.79      1353
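
As a reading aid for the report above, the short sketch below recomputes the F1 scores from the printed precision and recall, and approximates the accuracy as the support-weighted average of per-class recall. The inputs are the rounded values from the Gradient Boosting report, so the results can differ from the unrounded report in the last digit. The helper name `f1_from` is made up for this illustration.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the Gradient Boosting classification report
report = {
    0: {"precision": 0.84, "recall": 0.74, "support": 699},
    1: {"precision": 0.76, "recall": 0.85, "support": 654},
}

for label, row in report.items():
    f1 = f1_from(row["precision"], row["recall"])
    print(f"class {label}: f1 = {f1:.2f}")

# Accuracy equals the support-weighted average of per-class recall
total = sum(row["support"] for row in report.values())
approx_acc = sum(row["recall"] * row["support"] for row in report.values()) / total
print(f"approximate accuracy from rounded recalls: {approx_acc:.4f}")
```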



- Adaptive Boosting

In [22]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, classification_report

ada_model = AdaBoostClassifier(
    n_estimators=100,
    learning_rate=1.0,
    random_state=42
)

ada_model.fit(X_train_selected, y_train)

y_pred_ada = ada_model.predict(X_test_selected)

ada_acc = accuracy_score(y_test, y_pred_ada)
print("AdaBoost Accuracy:", round(ada_acc, 4))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_ada))

AdaBoost Accuracy: 0.7886

Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.73      0.78       699
           1       0.75      0.85      0.80       654

    accuracy                           0.79      1353
   macro avg       0.79      0.79      0.79      1353
weighted avg       0.79      0.79      0.79      1353





Which model is the best and why?

In [None]:
#comment here
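
One way to approach the question is to rank the accuracies printed above and set the gaps against the sampling noise of a 1353-row test set. The sketch below does this with the standard library. Bagging and pasting are left out because no accuracy is recorded for them above, and the binomial standard error is a rough assumption that treats each test prediction as independent.

```python
import math

n_test = 1353  # test-set size from the classification reports

# Accuracies copied from the printed outputs above
accuracies = {
    "Random Forest": 0.7805,
    "Gradient Boosting": 0.7938,
    "AdaBoost": 0.7886,
}

# Rank the models from highest to lowest accuracy
ranked = sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True)
best_name, best_acc = ranked[0]

for name, acc in ranked:
    # Binomial standard error of an accuracy estimated on n_test samples
    se = math.sqrt(acc * (1 - acc) / n_test)
    print(f"{name:<18} accuracy = {acc:.4f}  (std. error ~ {se:.4f})")

gap = best_acc - ranked[-1][1]
se_best = math.sqrt(best_acc * (1 - best_acc) / n_test)
print(f"Best on this split: {best_name}")
print(f"Gap between best and worst: {gap:.4f}")
```

Gradient Boosting comes out on top for this split, but the gap to the other models is of the same order as the standard error, so the ranking is suggestive rather than conclusive. Cross-validation would give a firmer answer.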