# LAB | Ensemble Methods

## Load the data

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this lab, try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous lab.

**Libraries**

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score


**Load Dataset**

In [34]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")

## Data Explorations

In [35]:
spaceship

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8688 | 9276_01 | Europa | False | A/98/P | 55 Cancri e | 41.0 | True | 0.0 | 6819.0 | 0.0 | 1643.0 | 74.0 | Gravior Noxnuther | False |
| 8689 | 9278_01 | Earth | True | G/1499/S | PSO J318.5-22 | 18.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Kurta Mondalley | False |
| 8690 | 9279_01 | Earth | False | G/1500/S | TRAPPIST-1e | 26.0 | False | 0.0 | 0.0 | 1872.0 | 1.0 | 0.0 | Fayey Connon | True |
| 8691 | 9280_01 | Europa | False | E/608/S | 55 Cancri e | 32.0 | False | 0.0 | 1049.0 | 0.0 | 353.0 | 3235.0 | Celeon Hontichre | False |


**Check the shape of your data**

In [3]:
spaceship.shape

(8693, 14)

**Check for data types**

In [32]:
spaceship.dtypes

HomePlanet                    object
CryoSleep                     object
Cabin                         object
Destination                   object
Age                          float64
VIP                           object
RoomService                  float64
FoodCourt                    float64
ShoppingMall                 float64
Spa                          float64
VRDeck                       float64
Transported                     bool
HomePlanet_Earth               int32
HomePlanet_Europa              int32
HomePlanet_Mars                int32
CryoSleep_False                int32
CryoSleep_True                 int32
Cabin_A                        int32
Cabin_B                        int32
Cabin_C                        int32
Cabin_D                        int32
Cabin_E                        int32
Cabin_F                        int32
Cabin_G                        int32
Cabin_T                        int32
Destination_55 Cancri e        int32
Destination_PSO J318.5-22      int32
D

In [38]:
spaceship.columns

Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age',
       'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Name', 'Transported'],
      dtype='object')

In [46]:
spaceship['HomePlanet'].unique()

array(['Europa', 'Earth', 'Mars', nan], dtype=object)

In [48]:
spaceship['CryoSleep'].unique()

array([False, True, nan], dtype=object)

In [51]:
spaceship['Destination'].unique()

array(['TRAPPIST-1e', 'PSO J318.5-22', '55 Cancri e', nan], dtype=object)

In [52]:
spaceship['Age'].nunique()

80

In [54]:
spaceship['VIP'].unique()

array([False, True, nan], dtype=object)

In [62]:
spaceship['Transported'].unique()

array([False,  True])

## Data Wrangling

Now apply the same steps as in the previous lab:
- Feature Scaling
- Feature Selection

### Feature Selection

In [63]:
spaceship.dropna(inplace=True)

In [64]:
spaceship['Cabin'] = spaceship['Cabin'].str[0]

In [65]:
spaceship.drop(columns=['PassengerId', 'Name'], inplace=True)

In [66]:
# List of non-numerical columns
spaceship_non_numerical_columns = ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination']

In [67]:
def create_dummies(df, column_names):
    """
    Create dummy variables for non-numerical columns in the DataFrame.
    
    Args:
    - df (pd.DataFrame): The input DataFrame.
    - column_names (list of str): List of column names to create dummy variables for.
    
    Returns:
    - pd.DataFrame: A DataFrame with dummy variables added.
    """
    for column in column_names:
        dummies = pd.get_dummies(df[column], prefix=column, dtype=int)
        df = pd.concat([df, dummies], axis=1)
    return df

columns_to_dummy = spaceship_non_numerical_columns 
spaceship = create_dummies(spaceship, columns_to_dummy)
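The same one-hot encoding can also be done in a single call by passing `columns=` to `pd.get_dummies`, which encodes the listed columns and drops the originals. The sketch below is only an illustration: the `toy` frame and its values are made up to stand in for the cleaned spaceship data.

```python
import pandas as pd

# Toy frame standing in for the cleaned spaceship data (values are made up).
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Mars"],
    "CryoSleep": [False, True, False],
    "Age": [39.0, 24.0, 58.0],
})

# columns= encodes the listed columns and drops the originals in one step;
# dtype=int yields 0/1 integers instead of booleans.
encoded = pd.get_dummies(toy, columns=["HomePlanet", "CryoSleep"], dtype=int)
print(encoded.columns.tolist())
```

Unlike `create_dummies`, this variant does not keep the original object columns, so no later `select_dtypes` step is needed to remove them.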

### Feature Scaling

**Perform Train Test Split**

In [68]:
spaceship_features = spaceship.select_dtypes(include=np.number)

In [69]:
target = spaceship['Transported']

In [70]:
# random_state=42 fixes the random seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(spaceship_features, target, test_size=0.20, random_state=42)

**Scaling with MinMax Scaler**

In [71]:
normalizer = MinMaxScaler()

normalizer.fit(X_train)

In [72]:
X_train_norm = normalizer.transform(X_train)

X_test_norm = normalizer.transform(X_test)

In [73]:
X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
X_train_norm.head()

| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | HomePlanet_Earth | HomePlanet_Europa | HomePlanet_Mars | CryoSleep_False | ... | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Destination_55 Cancri e | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.316456 | 0.0 | 0.056116 | 0.0 | 0.02865 | 0.030094 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.455696 | 0.0 | 0.088015 | 0.135232 | 0.124911 | 4.9e-05 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.43038 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 0.468354 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 0.278481 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |


In [74]:
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns)
X_test_norm.head()

| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | HomePlanet_Earth | HomePlanet_Europa | HomePlanet_Mars | CryoSleep_False | ... | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Destination_55 Cancri e | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.367089 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.164557 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.632911 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.075949 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 0.468354 | 0.024458 | 0.000101 | 0.049049 | 4.5e-05 | 0.003344 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
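One way to make sure the same scaling is applied at both fit time and scoring time is to wrap the scaler and the model in a scikit-learn pipeline. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_classification` rather than the spaceship features, and the names `pipe`, `X_tr`, and `X_te` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption; not the spaceship dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=42)

# The pipeline fits the scaler on the training data only and reuses
# that fitted scaler whenever predict() or score() is called.
pipe = make_pipeline(MinMaxScaler(),
                     RandomForestClassifier(n_estimators=50, random_state=42))
pipe.fit(X_tr, y_tr)

# Raw test features go in; scaling happens inside the pipeline.
pipe_score = pipe.score(X_te, y_te)
print(pipe_score)
```

With this design there is no separate scaled copy of the test set to keep track of, so the model cannot accidentally be scored on unscaled features.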


## Modeling

**Model Selection** - now apply different ensemble methods to try to get a better model

### Bagging and Pasting

In [75]:
bagging_class = BaggingClassifier(DecisionTreeClassifier(max_depth=20), # each tree grows to a depth of at most 20
                               n_estimators=100, # 100 trees in the ensemble
                               max_samples = 1000) # each tree is trained on 1,000 rows drawn with replacement (bootstrap=True by default)

In [76]:
bagging_class.fit(X_train_norm, y_train)

In [77]:
# Evaluate model's performance
pred = bagging_class.predict(X_test_norm)

In [78]:
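# Score on the scaled test set, because the model was fit on scaled features.
# The value below was recorded with the unscaled X_test, so re-run this cell to refresh it.
bagging_class.score(X_test_norm, y_test)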

0.7473524962178517
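The cell above uses bagging, since `BaggingClassifier` samples with replacement by default. Pasting uses the same class with `bootstrap=False`, so each tree is trained on rows drawn without replacement. The sketch below assumes synthetic data from `make_classification` and smaller, arbitrarily chosen hyperparameters to keep it quick; `pasting_class` is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the spaceship dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=42)

# bootstrap=False switches from bagging to pasting: each tree sees
# 100 distinct training rows, sampled without replacement.
pasting_class = BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                                  n_estimators=20,
                                  max_samples=100,
                                  bootstrap=False,
                                  random_state=42)
pasting_class.fit(X_tr, y_tr)

pasting_score = pasting_class.score(X_te, y_te)
print(pasting_score)
```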

### Random Forests

In [79]:
# - Initialize a Random Forest
forest = RandomForestClassifier(n_estimators=100, max_depth=20)

In [80]:
forest.fit(X_train_norm, y_train)

In [81]:
# Evaluate model's performance
pred = forest.predict(X_test_norm)

In [82]:
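# Score on the scaled test set, because the model was fit on scaled features.
# The value below was recorded with the unscaled X_test, so re-run this cell to refresh it.
forest.score(X_test_norm, y_test)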

0.7382753403933434

### Adaptive Boosting

In [83]:
ada_class = AdaBoostClassifier(DecisionTreeClassifier(max_depth=20),
                            n_estimators=100)

In [84]:
ada_class.fit(X_train_norm, y_train)



In [85]:
# Evaluate the model
pred = ada_class.predict(X_test_norm)

In [86]:
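# Score on the scaled test set, because the model was fit on scaled features.
# The value below was recorded with the unscaled X_test, so re-run this cell to refresh it.
ada_class.score(X_test_norm, y_test)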

0.7004538577912254

### Gradient Boosting

In [87]:
# Initialize a Gradient Boosting model
gb_class = GradientBoostingClassifier(max_depth=20,
                                   n_estimators=100)

In [88]:
gb_class.fit(X_train_norm, y_train)

In [89]:
# Evaluate the model
pred = gb_class.predict(X_test_norm)

In [90]:
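# Score on the scaled test set, because the model was fit on scaled features.
# The value below was recorded with the unscaled X_test, so re-run this cell to refresh it.
gb_class.score(X_test_norm, y_test)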

0.7254160363086233

## Insights

### Which model is the best and why?

To determine the best of the models tested above, compare their test accuracy:
1. *Bagging and Pasting* performs best with a score of **0.7474**
2. *Random Forests* follows at **0.7383**
3. *Gradient Boosting* ranks third at **0.7254**
4. *Adaptive Boosting* comes last at **0.7005**

**Analysis:**

Based on test accuracy in this lab, the bagging ensemble achieved the highest score of the four ensemble methods, 74.74%. This means it correctly classified the largest share of test-set instances among them.
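A ranking like the one above can be produced in one pass by fitting every model on the same scaled training set and scoring each on the same scaled test set. This is a rough sketch, not the lab's actual run: it assumes synthetic data from `make_classification`, fixed seeds, and smaller hyperparameters than the cells above, so its scores are not comparable to the lab results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the spaceship dataset).
X, y = make_classification(n_samples=600, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=0)

# Fit the scaler on the training split only, then transform both splits.
scaler = MinMaxScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    "Bagging": BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                                 n_estimators=30, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=30, max_depth=5,
                                            random_state=0),
    "AdaBoost": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                   n_estimators=30, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=30, max_depth=3,
                                                    random_state=0),
}

# Every model is fit and scored on the same scaled splits.
scores = {}
for name, model in models.items():
    model.fit(X_tr_s, y_tr)
    scores[name] = model.score(X_te_s, y_te)

# Print the models from highest to lowest test accuracy.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.4f}")
```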

### Comparison with KNN

In [95]:
knn = KNeighborsClassifier()

In [96]:
knn.fit(X_train_norm, y_train)

In [97]:
pred = knn.predict(X_test_norm)
print("Accuracy: ", knn.score(X_test_norm, y_test))

Accuracy:  0.7776096822995462


**Conclusions**

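The K-Nearest Neighbors (KNN) model from the previous lab, however, achieved an accuracy of approximately 0.7776 on the test set. In other words, 77.76% of its predictions were correct, which is the highest score recorded so far. Note that this comparison is not like-for-like: KNN was scored on the scaled test set (`X_test_norm`), whereas the ensemble scores above were recorded on the unscaled `X_test`, even though those models were fit on scaled features. The ensembles should be re-scored on `X_test_norm` before drawing a firm conclusion.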