# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, you should apply the same feature scaling and feature engineering as in the previous Lab.

Now perform the same steps as before:
- Feature Scaling
- Feature Selection
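
As a minimal sketch of the scaling step, the toy example below fits `MinMaxScaler` on the training rows only and then reuses those learned bounds on the test rows. The single-feature arrays are hypothetical values chosen for illustration, not taken from the Spaceship Titanic data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (hypothetical values, not taken from the dataset)
X_train = np.array([[0.0], [20.0], [40.0]])
X_test = np.array([[10.0], [50.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns min and max from the training rows only
X_test_scaled = scaler.transform(X_test)        # reuses the training min and max

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())  # 50 exceeds the training max, so it maps above 1
```

Fitting on the training split alone keeps test-set information out of preprocessing, at the cost that unseen test values can land outside the [0, 1] range.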


**Perform Train Test Split**

**Model Selection** - now you will apply different ensemble methods to try to get a better model

- Bagging and Pasting

- Random Forests

- Gradient Boosting

- Adaptive Boosting
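
For the first item in the list, bagging and pasting differ only in how each base estimator's training subset is drawn: with replacement (bagging) or without (pasting). In scikit-learn this is the `bootstrap` flag of `BaggingClassifier`. The sketch below uses synthetic data from `make_classification` as a stand-in, and the helper name `fit_subsampled_trees` and all hyperparameter values are illustrative assumptions rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Spaceship Titanic set)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

def fit_subsampled_trees(bootstrap):
    """bootstrap=True samples with replacement (bagging); False without (pasting)."""
    clf = BaggingClassifier(
        DecisionTreeClassifier(max_depth=5),
        n_estimators=50,
        max_samples=200,      # each tree sees 200 of the 800 training rows
        bootstrap=bootstrap,
        random_state=0,
    )
    return clf.fit(X_tr, y_tr)

bagging = fit_subsampled_trees(bootstrap=True)
pasting = fit_subsampled_trees(bootstrap=False)

print("bagging accuracy:", bagging.score(X_te, y_te))
print("pasting accuracy:", pasting.score(X_te, y_te))
```

Which variant scores higher depends on the data, so it is worth trying both on the real features.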

Which model is the best and why?

In [7]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from xgboost import XGBClassifier
import time

# Load dataset
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")

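# Caution: astype(bool) turns missing values (NaN) into True, so rows with a missing
# CryoSleep or VIP survive the dropna() below; call dropna() first to drop them instead
spaceship['CryoSleep'] = spaceship['CryoSleep'].astype(bool)
spaceship['VIP'] = spaceship['VIP'].astype(bool)
spaceship = spaceship.dropna()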
spaceship['Cabin'] = spaceship['Cabin'].astype(str)  # Ensure 'Cabin' is of type string
spaceship['cabin_class'] = spaceship['Cabin'].str[0]  # taking the first letter of the cabin
spaceship = spaceship.drop(columns=['PassengerId', 'Name', 'Cabin'])  # dropping the columns
spaceship = pd.get_dummies(spaceship, columns=['HomePlanet', 'cabin_class', 'Destination'])  # adding dummies

# Convert target to integer
spaceship['Transported'] = spaceship['Transported'].astype(int)

# Feature Selection
features = spaceship[['CryoSleep', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall',
                      'Spa', 'VRDeck', 'HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars',
                      'cabin_class_A', 'cabin_class_B', 'cabin_class_C', 'cabin_class_D',
                      'cabin_class_E', 'cabin_class_F', 'cabin_class_G', 'cabin_class_T',
                      'Destination_55 Cancri e', 'Destination_PSO J318.5-22', 'Destination_TRAPPIST-1e']]
target = spaceship['Transported']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=4)

# Scale the data
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert to DataFrame for consistency
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

# Fit a model, compute test-set metrics, and time the fit + predict step
def evaluate_model(model, X_train_scaled, y_train, X_test_scaled, y_test):
    start_time = time.time()  # Record start time
    model.fit(X_train_scaled, y_train)
    pred = model.predict(X_test_scaled)
    end_time = time.time()  # Record end time
    
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    # For classifiers, model.score returns mean accuracy, so this duplicates `accuracy` above
    score = model.score(X_test_scaled, y_test)
    computation_time = end_time - start_time  # Calculate computation time
    
    return accuracy, precision, recall, f1, score, computation_time

# Initialize results dictionary
results = {}

# KNN
knn = KNeighborsClassifier(n_neighbors=10)
results['KNN'] = evaluate_model(knn, X_train_scaled, y_train, X_test_scaled, y_test)

# Bagging Classifier
bagging_clf = BaggingClassifier(DecisionTreeClassifier(max_depth=20), n_estimators=1000, max_samples=1000, random_state=1)
results['Bagging'] = evaluate_model(bagging_clf, X_train_scaled, y_train, X_test_scaled, y_test)

# Random Forest Classifier
forest_clf = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=1)
results['Random Forest'] = evaluate_model(forest_clf, X_train_scaled, y_train, X_test_scaled, y_test)

# Gradient Boosting Classifier
gb_clf = GradientBoostingClassifier(max_depth=20, n_estimators=100, random_state=1)
results['Gradient Boosting'] = evaluate_model(gb_clf, X_train_scaled, y_train, X_test_scaled, y_test)

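# AdaBoost Classifier with the SAMME algorithm
# (recent scikit-learn releases deprecate the `algorithm` argument; drop it if your version warns)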
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=20), n_estimators=100, algorithm='SAMME', random_state=1)
results['AdaBoost'] = evaluate_model(ada_clf, X_train_scaled, y_train, X_test_scaled, y_test)

# XGBoost Classifier
xgb_clf = XGBClassifier()
results['XGBoost'] = evaluate_model(xgb_clf, X_train_scaled, y_train, X_test_scaled, y_test)

# Display results in a table
results_df = pd.DataFrame(results, index=['Accuracy', 'Precision', 'Recall', 'F1 Score', 'Model Score', 'Time (s)']).T
print(results_df)


                   Accuracy  Precision    Recall  F1 Score  Model Score  Time (s)
KNN                0.751620   0.751261  0.694099  0.721550     0.751620  0.028537
Bagging            0.791217   0.770642  0.782609  0.776579     0.791217  2.418979
Random Forest      0.776818   0.759317  0.759317  0.759317     0.776818  0.248019
Gradient Boosting  0.753780   0.720760  0.765528  0.742470     0.753780  3.242196
AdaBoost           0.750180   0.713056  0.771739  0.741238     0.750180  1.342693
XGBoost            0.786177   0.754772  0.798137  0.775849     0.786177  0.141908
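
To approach the question of which model is best, one option is to rank the models by each metric instead of reading the table by eye. The sketch below re-enters the printed accuracy, recall, F1, and timing values by hand; the variable names `ranked`, `best_by_metric`, and `fastest` are illustrative.

```python
import pandas as pd

# Test-set metrics copied from the printed results table
results_df = pd.DataFrame(
    {
        "Accuracy": [0.751620, 0.791217, 0.776818, 0.753780, 0.750180, 0.786177],
        "Recall":   [0.694099, 0.782609, 0.759317, 0.765528, 0.771739, 0.798137],
        "F1 Score": [0.721550, 0.776579, 0.759317, 0.742470, 0.741238, 0.775849],
        "Time (s)": [0.028537, 2.418979, 0.248019, 3.242196, 1.342693, 0.141908],
    },
    index=["KNN", "Bagging", "Random Forest", "Gradient Boosting", "AdaBoost", "XGBoost"],
)

# Models ordered from best to worst F1 score
ranked = results_df.sort_values("F1 Score", ascending=False)

# Best model per quality metric, and the fastest model overall
best_by_metric = results_df[["Accuracy", "Recall", "F1 Score"]].idxmax()
fastest = results_df["Time (s)"].idxmin()

print(ranked)
print(best_by_metric)
print("Fastest:", fastest)
```

On this single split, Bagging leads on accuracy and F1 by a narrow margin, while XGBoost has the highest recall and runs in about 0.14 s against 2.42 s for Bagging. Gaps this small may not hold on a different split, so the ranking is best read as tentative.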
