# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship_encoded = pd.read_csv(r"C:\Users\Eros\IH-labs\lab-feature-engineering\spaceship_encoded")
print(spaceship_encoded.head())

    Age  RoomService  FoodCourt  ShoppingMall     Spa  VRDeck  Transported  \
0  39.0          0.0        0.0           0.0     0.0     0.0        False   
1  24.0        109.0        9.0          25.0   549.0    44.0         True   
2  58.0         43.0     3576.0           0.0  6715.0    49.0        False   
3  33.0          0.0     1283.0         371.0  3329.0   193.0        False   
4  16.0        303.0       70.0         151.0   565.0     2.0         True   

   HomePlanet_Europa  HomePlanet_Mars  CryoSleep_True  Cabin_B  Cabin_C  \
0               True            False           False     True    False   
1              False            False           False    False    False   
2               True            False           False    False    False   
3               True            False           False    False    False   
4              False            False           False    False    False   

   Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Destination_PSO J318.5-22  \
0  

Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [6]:
#your code here
from sklearn.preprocessing import StandardScaler

# Select the numerical features
numerical_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the numerical features
spaceship_encoded[numerical_features] = scaler.fit_transform(spaceship_encoded[numerical_features])

spaceship_encoded.head()


Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_True,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_True
0,0.695413,-0.345756,-0.285355,-0.309494,-0.273759,-0.269534,False,True,False,False,True,False,False,False,False,False,False,False,True,False
1,-0.336769,-0.176748,-0.279993,-0.266112,0.206165,-0.230494,True,False,False,False,False,False,False,False,True,False,False,False,True,False
2,2.002842,-0.279083,1.845163,-0.309494,5.596357,-0.226058,False,True,False,False,False,False,False,False,False,False,False,False,True,True
3,0.28254,-0.345756,0.479034,0.334285,2.636384,-0.098291,False,True,False,False,False,False,False,False,False,False,False,False,True,False
4,-0.887266,0.124056,-0.24365,-0.04747,0.220152,-0.267759,True,False,False,False,False,False,False,False,True,False,False,False,True,False


**Perform Train Test Split**

In [9]:
#your code here
from sklearn.model_selection import train_test_split

# Separate the features and the target variable
X = spaceship_encoded.drop(columns=['Transported'])
y = spaceship_encoded['Transported']

# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting datasets
X_train.shape, X_test.shape, y_train.shape, y_test.shape


((5284, 19), (1322, 19), (5284,), (1322,))
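
The cells above fit the `StandardScaler` on the full dataset before splitting, which lets test-set statistics influence the scaling. A minimal sketch of a leakage-free variant is shown below: split first, then fit the scaler on the training rows only. The frame `demo`, its values, and the names `numeric_cols` and `demo_scaler` are hypothetical stand-ins, not the real `spaceship_encoded` data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for spaceship_encoded (made-up values, for illustration only)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Age": rng.uniform(0, 80, size=200),
    "Spa": rng.exponential(300, size=200),
    "CryoSleep_True": rng.integers(0, 2, size=200).astype(bool),
    "Transported": rng.integers(0, 2, size=200).astype(bool),
})
numeric_cols = ["Age", "Spa"]

X_demo = demo.drop(columns=["Transported"])
y_demo = demo["Transported"]

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
X_tr = X_tr.copy()
X_te = X_te.copy()

# Learn mean and standard deviation from the training rows only,
# then apply the same transformation to the test rows
demo_scaler = StandardScaler()
X_tr[numeric_cols] = demo_scaler.fit_transform(X_tr[numeric_cols])
X_te[numeric_cols] = demo_scaler.transform(X_te[numeric_cols])

print(X_tr[numeric_cols].mean().round(3))
```

Tree-based ensembles are largely insensitive to feature scaling, so the effect on the models in this Lab is likely small, but the split-then-fit order keeps the evaluation honest for any model.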

**Model Selection** - now try different ensemble methods to see whether they give a better model

- Bagging and Pasting

In [None]:
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.metrics import accuracy_score

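# X_train, X_test, y_train, y_test come from the train-test split above

# Bagging draws each estimator's training sample with replacement (bootstrap=True);
# pasting draws it without replacement (bootstrap=False).
# max_samples must be below 1.0 for pasting, otherwise every estimator is trained on
# the identical full training set. The value 0.8 is an illustrative choice.
# Note: a random forest already bags decision trees internally, so a single
# DecisionTreeClassifier is the more common (and much faster) base estimator.
bagging_model = BaggingClassifier(estimator=RandomForestClassifier(random_state=42), n_estimators=100, max_samples=0.8, bootstrap=True, random_state=42)
pasting_model = BaggingClassifier(estimator=RandomForestClassifier(random_state=42), n_estimators=100, max_samples=0.8, bootstrap=False, random_state=42)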

# Train and evaluate Bagging
bagging_model.fit(X_train, y_train)
y_pred_bagging = bagging_model.predict(X_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)

# Train and evaluate Pasting
pasting_model.fit(X_train, y_train)
y_pred_pasting = pasting_model.predict(X_test)
accuracy_pasting = accuracy_score(y_test, y_pred_pasting)

# Display the accuracies
accuracy_bagging, accuracy_pasting




- Random Forests

In [15]:
#your code here
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize the Random Forest model
rf_model = RandomForestClassifier(random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model
accuracy_rf = accuracy_score(y_test, y_pred_rf)

accuracy_rf

0.8063540090771558
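
A fitted random forest also exposes impurity-based feature importances, which can help with the feature selection step mentioned earlier. The sketch below shows one way to rank them. It runs on synthetic data with hypothetical column names (`feature_0` to `feature_5`); on the Lab data the same pattern would presumably use `rf_model` and `X_train.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with made-up column names (stand-in for the encoded features)
X_arr, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

rf_demo = RandomForestClassifier(random_state=42)
rf_demo.fit(X_demo, y_demo)

# feature_importances_ holds one non-negative value per column; the values sum to 1
importances = pd.Series(
    rf_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

print(importances)
```

Impurity-based importances tend to favor features with many distinct values, so they are best read as a rough ranking and not as an exact measure.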

- Gradient Boosting

In [None]:
#your code here
from sklearn.ensemble import GradientBoostingClassifier

# Initialize the Gradient Boosting model
gb_model = GradientBoostingClassifier(random_state=42)

# Train the model
gb_model.fit(X_train, y_train)

# Make predictions
y_pred_gb = gb_model.predict(X_test)

# Evaluate the model
accuracy_gb = accuracy_score(y_test, y_pred_gb)

accuracy_gb
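
Gradient boosting adds trees one at a time, so the number of trees is a key setting. The sketch below uses `staged_predict` to score the ensemble after each added tree on a held-out validation split, without refitting. The synthetic data, the choice of 200 trees, and the names `gb_demo`, `staged_acc`, and `best_n` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (stand-in, not the Spaceship Titanic data)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)

# Use a validation split for tuning so the test set stays untouched
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

gb_demo = GradientBoostingClassifier(n_estimators=200, random_state=42)
gb_demo.fit(X_tr, y_tr)

# staged_predict yields the predictions after 1, 2, ..., n_estimators trees
staged_acc = [
    accuracy_score(y_val, y_stage) for y_stage in gb_demo.staged_predict(X_val)
]

# argmax returns the first index with the highest accuracy; +1 converts it to a tree count
best_n = int(np.argmax(staged_acc)) + 1
print("Best number of trees on validation data:", best_n)
print("Validation accuracy at that point:", staged_acc[best_n - 1])
```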


- Adaptive Boosting

In [None]:
#your code here
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Reuse X_train, X_test, y_train, y_test from the train-test split above, so that
# AdaBoost is evaluated on the same scaled features and the same split as the other models

# Initialize the AdaBoost model
ada_model = AdaBoostClassifier(n_estimators=100, random_state=42)

# Train the model
ada_model.fit(X_train, y_train)

# Make predictions
y_pred_ada = ada_model.predict(X_test)

# Evaluate the model
accuracy_ada = accuracy_score(y_test, y_pred_ada)

accuracy_ada


Which model is the best and why?

In [None]:
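# Random Forests: it is the only model with a recorded score in this notebook
# (test accuracy of about 0.806). The bagging, pasting, gradient boosting, and
# AdaBoost cells have no saved output, so their accuracies still need to be
# computed before this ranking can be confirmed.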