# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier




In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [3]:
# Checking data
print(spaceship.shape)
print(spaceship.dtypes)
print(spaceship.isnull())
# Drop every row that contains at least one missing value
df = spaceship.dropna()

(8693, 14)
PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object
      PassengerId  HomePlanet  CryoSleep  Cabin  Destination    Age    VIP  \
0           False       False      False  False        False  False  False   
1           False       False      False  False        False  False  False   
2           False       False      False  False        False  False  False   
3           False       False      False  False        False  False  False   
4           False       False      False  False        False  False  False   
...           ...         ...        ...    ...          ...    ...    ...   
8688        False       False      False  False        False  False  False   
8689        Fal

Now apply the same preprocessing as in the previous lab:
- Feature Scaling
- Feature Selection
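
The two steps listed above can be sketched as follows. This is a minimal illustration on synthetic data, not the preprocessing from the previous lab: the `make_classification` dataset, the `_demo` variable names, and the choice of `k=6` are all assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded Spaceship Titanic features (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=12, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Feature scaling: fit on the training split only, then transform both splits,
# so no information from the test split leaks into the scaler.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Feature selection: keep the k features with the highest ANOVA F-score.
# k=6 is an arbitrary value chosen for illustration.
selector = SelectKBest(score_func=f_classif, k=6)
X_tr_sel = selector.fit_transform(X_tr_scaled, y_tr)
X_te_sel = selector.transform(X_te_scaled)

print(X_tr_sel.shape, X_te_sel.shape)
```

Distance-based models such as KNN are sensitive to feature scale, so this step matters most for the KNN baseline and the bagged KNN ensemble.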


In [4]:
# Drop identifier columns that carry no predictive signal.
# Reassigning the result (instead of inplace=True on a derived frame)
# avoids pandas' SettingWithCopyWarning.
df = df.drop(columns=['PassengerId', 'Name'])


In [5]:
# Creating dummy variables
df = pd.get_dummies(df, drop_first=True)

**Perform Train Test Split**

In [6]:
from sklearn.model_selection import train_test_split

# Define features and target
X = df.drop('Transported', axis=1)
y = df['Transported']

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [7]:
# Initialize the KNN model
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the model to the training data
knn.fit(X_train, y_train)

# Predict on the test set
y_pred = knn.predict(X_test)

# Evaluate the model and display the results
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(f"KNN accuracy: {accuracy:.3f}")
print(report)

**Model Selection** - now apply different ensemble methods to try to obtain a better model

- Bagging and Pasting

In [8]:
# Create a BaggingClassifier with your chosen base estimator (e.g., KNN)
bagging_model = BaggingClassifier(KNeighborsClassifier(n_neighbors=5), n_estimators=100, random_state=42)

# Train the Bagging model
bagging_model.fit(X_train, y_train)
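
The heading above mentions both bagging and pasting, while the cell only fits the default (bagging) variant. The difference can be sketched as below, assuming a small synthetic dataset; the `_demo` names, `n_estimators=20`, and `max_samples=0.8` are illustrative choices rather than values from the lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), so the example runs on its own.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Bagging: each base estimator is trained on a sample drawn WITH replacement.
bagging_demo = BaggingClassifier(
    KNeighborsClassifier(n_neighbors=5),
    n_estimators=20, max_samples=0.8, bootstrap=True, random_state=42,
)

# Pasting: the same ensemble, but each sample is drawn WITHOUT replacement.
pasting_demo = BaggingClassifier(
    KNeighborsClassifier(n_neighbors=5),
    n_estimators=20, max_samples=0.8, bootstrap=False, random_state=42,
)

scores_demo = {}
for name, model in [("bagging", bagging_demo), ("pasting", pasting_demo)]:
    model.fit(X_tr, y_tr)
    scores_demo[name] = model.score(X_te, y_te)  # mean accuracy on the test split
    print(name, round(scores_demo[name], 3))
```

Which variant scores higher depends on the data, so it is worth evaluating both on the real features before choosing one.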

- Random Forests

In [9]:
# Create a Random Forest classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

In [10]:
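from sklearn.ensemble import RandomForestRegressor

# Create a Random Forest Regressor model.
# Note: 'Transported' is a binary target, so this regressor predicts
# continuous values between 0 and 1 rather than class labels. It is kept
# for illustration only; the classifier above is the appropriate model here.
rf_model_reg = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_model_reg.fit(X_train, y_train)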


- Gradient Boosting

In [11]:
from sklearn.ensemble import GradientBoostingClassifier  # For classification
from sklearn.ensemble import GradientBoostingRegressor  # For regression

# Create a Gradient Boosting model (adjust hyperparameters as needed)
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the Gradient Boosting model
gb_model.fit(X_train, y_train)
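
The comment above suggests adjusting hyperparameters as needed. One way to pick the number of boosting stages is sketched below using `staged_predict`, which yields predictions after each added tree. The synthetic data, the `_demo` names, and the use of a separate validation split are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), so the example runs on its own.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

gb_demo = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, random_state=42
)
gb_demo.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 100 trees,
# so one fit is enough to score every ensemble size.
staged_acc = [accuracy_score(y_val, pred) for pred in gb_demo.staged_predict(X_val)]

# argmax returns the first (smallest) tree count reaching the best accuracy.
best_n_trees = int(np.argmax(staged_acc)) + 1
print(best_n_trees, round(max(staged_acc), 3))
```

The tree count should be chosen on a validation split rather than the final test set, otherwise the reported test score is optimistic.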

- Adaptive Boosting

In [12]:
from sklearn.ensemble import AdaBoostClassifier  # For classification
from sklearn.ensemble import AdaBoostRegressor  # For regression

# Create an AdaBoost model (adjust hyperparameters as needed)
ada_model = AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the AdaBoost model
ada_model.fit(X_train, y_train)



Which model is the best and why?

In [15]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score  # For classification
from sklearn.metrics import mean_squared_error, mean_absolute_error  # For regression

# Example for classification:
y_pred_rf = rf_model.predict(X_test)
y_pred_ada = ada_model.predict(X_test)

accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf)
# ... (calculate other metrics)

# Compare metrics for all models

# Example for regression.
# Only rf_model_reg was trained above; no AdaBoostRegressor was ever fitted,
# so there is no ada_model_reg to evaluate here.
y_pred_rf_reg = rf_model_reg.predict(X_test)

# Cast the boolean target to int so the error metrics operate on numbers
y_test_num = y_test.astype(int)
mse_rf_reg = mean_squared_error(y_test_num, y_pred_rf_reg)
mae_rf_reg = mean_absolute_error(y_test_num, y_pred_rf_reg)
# ... (calculate other metrics)

# Compare metrics for all models
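
To answer the question "Which model is the best and why?", the same metrics need to be computed for every classifier. A minimal sketch of such a comparison loop is shown below. It runs on synthetic data with freshly created models, so the `_demo` and `candidates` names, the reduced `n_estimators=50`, and the use of F1 as the ranking metric are assumptions. With the lab's data, the fitted `rf_model`, `gb_model`, `ada_model`, and `bagging_model` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), so the example runs on its own.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Hypothetical candidate set mirroring the ensembles used in this lab.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=42),
}

results = {}
for name, model in candidates.items():
    preds = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, preds),
        "precision": precision_score(y_te, preds),
        "recall": recall_score(y_te, preds),
        "f1": f1_score(y_te, preds),
    }

# Rank by F1 here; accuracy is an equally reasonable choice on balanced classes.
best_name = max(results, key=lambda n: results[n]["f1"])
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
print("best by F1:", best_name)
```

A single train/test split gives a noisy ranking, so small differences between models should not be over-interpreted.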