# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata:

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [26]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import classification_report

In [3]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same preprocessing as before:
- Feature Scaling
- Feature Selection


In [6]:
# Drop every row that has at least one missing value
spaceship = spaceship.dropna(how="any")
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6606 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   6606 non-null   object 
 1   HomePlanet    6606 non-null   object 
 2   CryoSleep     6606 non-null   object 
 3   Cabin         6606 non-null   object 
 4   Destination   6606 non-null   object 
 5   Age           6606 non-null   float64
 6   VIP           6606 non-null   object 
 7   RoomService   6606 non-null   float64
 8   FoodCourt     6606 non-null   float64
 9   ShoppingMall  6606 non-null   float64
 10  Spa           6606 non-null   float64
 11  VRDeck        6606 non-null   float64
 12  Name          6606 non-null   object 
 13  Transported   6606 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 729.0+ KB
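
Dropping every row with a missing value is simple but costly: in the output above the index still runs up to 8692 while only 6606 rows remain. Before dropping, it can help to see which columns contain the gaps. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not the Lab data) to show the idea:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the spaceship data
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Europa"],
    "Spa": [0.0, 549.0, 6715.0, 3329.0],
})

# Count missing values per column before deciding how to handle them
missing_per_column = toy.isna().sum()

# how="any" removes a row as soon as a single value is missing
rows_before = len(toy)
toy_clean = toy.dropna(how="any")
rows_after = len(toy_clean)

print(missing_per_column)
print(f"kept {rows_after} of {rows_before} rows")
```

Running `spaceship.isna().sum()` before the `dropna` call would give the same per-column view on the real data.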


In [10]:
# Cabin has the form deck/num/side (e.g. B/0/P); keep only the deck letter
spaceship['Cabin_Category'] = spaceship['Cabin'].str[0]

In [12]:
# One-hot encode the categorical columns, dropping the first level of each
dummies = pd.get_dummies(spaceship[['HomePlanet', 'Cabin_Category', 'Destination']], drop_first=True)
dummies.shape

(6606, 11)
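
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical one-column frame (`toy` is made up for illustration). With three levels, the full encoding yields three columns and the reduced one yields two, because the dropped level is implied when all remaining dummies are False:

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({"HomePlanet": ["Earth", "Europa", "Mars", "Earth"]})

# Full one-hot encoding: one column per category
full = pd.get_dummies(toy, drop_first=False)

# Reduced encoding: the first category (alphabetically "Earth") is dropped
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

The same logic explains the 11 columns above: the column names in the output further down show 2 dummies for HomePlanet, 7 for Cabin_Category and 2 for Destination.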

In [14]:
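# Give both frames the same fresh 0..n-1 index so the column-wise concat aligns row by row
spaceship.reset_index(drop=True, inplace=True)
dummies.reset_index(drop=True, inplace=True)
# Caution: running this cell more than once appends the dummy columns again
spaceship = pd.concat([spaceship, dummies], axis=1)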
spaceship.columns

Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age',
       'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Name', 'Transported', 'Cabin_Category', 'HomePlanet_Europa',
       'HomePlanet_Mars', 'Cabin_Category_B', 'Cabin_Category_C',
       'Cabin_Category_D', 'Cabin_Category_E', 'Cabin_Category_F',
       'Cabin_Category_G', 'Cabin_Category_T', 'Destination_PSO J318.5-22',
       'Destination_TRAPPIST-1e', 'HomePlanet_Europa', 'HomePlanet_Mars',
       'Cabin_Category_B', 'Cabin_Category_C', 'Cabin_Category_D',
       'Cabin_Category_E', 'Cabin_Category_F', 'Cabin_Category_G',
       'Cabin_Category_T', 'Destination_PSO J318.5-22',
       'Destination_TRAPPIST-1e'],
      dtype='object')
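
The column list above contains every dummy column twice, most likely because the concat cell was executed more than once in the same session. Duplicated features add no information and inflate the feature matrix. A minimal sketch of removing them, using a hypothetical small frame (`base`, `once`, `twice` are illustrative names, not part of the Lab):

```python
import pandas as pd

# Hypothetical stand-in for the spaceship frame
base = pd.DataFrame({"Age": [39.0, 24.0], "HomePlanet": ["Europa", "Earth"]})
dummies = pd.get_dummies(base[["HomePlanet"]], drop_first=True)

once = pd.concat([base, dummies], axis=1)
twice = pd.concat([once, dummies], axis=1)  # simulates re-running the concat cell

# columns.duplicated() flags the second and later occurrences of a column name
deduped = twice.loc[:, ~twice.columns.duplicated()]

print(list(twice.columns))
print(list(deduped.columns))
```

Applying the same `.loc[:, ~df.columns.duplicated()]` filter to `spaceship` before building `features` would keep one copy of each dummy column.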

In [22]:
features = spaceship.drop(columns=["Transported", "Cabin", "Destination", "HomePlanet", "PassengerId", "Name", "Cabin_Category"])
target = spaceship["Transported"]
features.head()

| | CryoSleep | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | HomePlanet_Europa | HomePlanet_Mars | ... | HomePlanet_Mars.1 | Cabin_Category_B | Cabin_Category_C | Cabin_Category_D | Cabin_Category_E | Cabin_Category_F | Cabin_Category_G | Cabin_Category_T | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | True | False | ... | False | True | False | False | False | False | False | False | False | True |
| 1 | False | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | False | False | ... | False | False | False | False | False | True | False | False | False | True |
| 2 | False | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | True | False | ... | False | False | False | False | False | False | False | False | False | True |
| 3 | False | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | True | False | ... | False | False | False | False | False | False | False | False | False | True |
| 4 | False | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | False | False | ... | False | False | False | False | False | True | False | False | False | True |


**Perform Train Test Split**

In [28]:
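# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)
# Fit the scaler on the training split only, so no test statistics leak into it
normalizer = MinMaxScaler()
normalizer.fit(X_train)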

In [30]:
X_train_norm = normalizer.transform(X_train)
X_test_norm = normalizer.transform(X_test)
X_train_norm = pd.DataFrame(X_train_norm, columns=X_train.columns)
X_test_norm = pd.DataFrame(X_test_norm, columns=X_test.columns)
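
The scaler above is fitted on the training split and then reused for the test split. The following minimal sketch, on hypothetical one-column arrays, shows what that means in practice: the minimum and maximum come from the training data only, so test values outside that range map outside [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data, for illustration only
train = np.array([[0.0], [50.0], [100.0]])
test = np.array([[25.0], [150.0]])

# Min and max are learned from the training data alone
scaler = MinMaxScaler().fit(train)

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # 150 exceeds the training max, so it maps above 1

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Tree-based ensembles split on thresholds, so min-max scaling generally has little effect on their predictions; it is kept here so the comparison with the previous Lab stays fair.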

**Model Selection** - now apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [32]:
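# 100 trees, each trained on 1000 rows drawn from the training set
# bootstrap=True is the default, so rows are sampled with replacement (bagging)
bagging_class = BaggingClassifier(DecisionTreeClassifier(max_depth=20), n_estimators=100, max_samples=1000)

bagging_class.fit(X_train_norm, y_train)

pred_bag = bagging_class.predict(X_test_norm)

print(classification_report(y_test, pred_bag))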

              precision    recall  f1-score   support

       False       0.80      0.79      0.79       661
        True       0.79      0.80      0.79       661

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322
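
The heading mentions both bagging and pasting, but the cell above only covers bagging. In scikit-learn the difference is the `bootstrap` flag of `BaggingClassifier`: `True` samples with replacement (bagging), `False` samples without replacement (pasting). The sketch below compares the two on synthetic data from `make_classification`; the data and the smaller hyperparameters are assumptions chosen to keep the example quick, not the Lab's setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled spaceship features
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Bagging: each tree sees 200 rows drawn WITH replacement
bagging = BaggingClassifier(
    DecisionTreeClassifier(max_depth=5),
    n_estimators=50, max_samples=200,
    bootstrap=True, oob_score=True, random_state=0,
).fit(X_tr, y_tr)

# Pasting: each tree sees 200 distinct rows drawn WITHOUT replacement
pasting = BaggingClassifier(
    DecisionTreeClassifier(max_depth=5),
    n_estimators=50, max_samples=200,
    bootstrap=False, random_state=0,
).fit(X_tr, y_tr)

bagging_acc = bagging.score(X_te, y_te)
pasting_acc = pasting.score(X_te, y_te)
print(f"bagging test accuracy: {bagging_acc:.3f} (OOB estimate: {bagging.oob_score_:.3f})")
print(f"pasting test accuracy: {pasting_acc:.3f}")
```

With bagging, the rows a tree never saw provide an out-of-bag accuracy estimate (`oob_score_`) without touching the test set; scikit-learn only supports this when `bootstrap=True`.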



- Random Forests

In [34]:
# Random forest: 100 trees, each limited to depth 20
forest_class = RandomForestClassifier(n_estimators=100, max_depth=20)
forest_class.fit(X_train_norm, y_train)
pred_forest = forest_class.predict(X_test_norm)

print(classification_report(y_test, pred_forest))

              precision    recall  f1-score   support

       False       0.78      0.80      0.79       661
        True       0.79      0.78      0.79       661

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322
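
A fitted random forest also exposes `feature_importances_`, which can hint at which columns drive the predictions. The sketch below shows how to read them, using synthetic data and made-up column names (`feat_0` ... `feat_5`) as assumptions:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical column names
X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=0).fit(X, y)

# Impurity-based importances; they are normalized to sum to 1
importances = (
    pd.Series(forest.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
print(importances)
```

On the Lab data the equivalent would be `pd.Series(forest_class.feature_importances_, index=X_train_norm.columns)`, provided the column names are unique.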



- Gradient Boosting

In [36]:
# Gradient boosting: 100 sequential trees, each limited to depth 20
gb_class = GradientBoostingClassifier(max_depth=20, n_estimators=100)
gb_class.fit(X_train_norm, y_train)

pred_gb = gb_class.predict(X_test_norm)

print(classification_report(y_test, pred_gb))

              precision    recall  f1-score   support

       False       0.80      0.78      0.79       661
        True       0.79      0.81      0.80       661

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322
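
Gradient boosting is usually run with shallow trees (the scikit-learn default is `max_depth=3`), since each tree only needs to correct the errors of the ones before it. Whether that would beat `max_depth=20` on this dataset is not something the results above establish. The following sketch, on synthetic data and with assumed hyperparameters, shows one way to investigate: `staged_predict` yields the ensemble's predictions after each boosting round, so accuracy can be tracked as trees are added:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Lab data
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

gb = GradientBoostingClassifier(max_depth=3, n_estimators=100, random_state=0)
gb.fit(X_tr, y_tr)

# Held-out accuracy after 1, 2, ..., 100 boosting rounds
staged_acc = [accuracy_score(y_te, pred) for pred in gb.staged_predict(X_te)]
best_n = int(np.argmax(staged_acc)) + 1

print(f"best number of trees on this split: {best_n}")
print(f"accuracy at that point: {max(staged_acc):.3f}")
```

In practice the number of rounds should be chosen on a validation split rather than the test set, which is used here only to keep the sketch short.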



- Adaptive Boosting

In [37]:
# AdaBoost on top of depth-20 decision trees
ada_class = AdaBoostClassifier(DecisionTreeClassifier(max_depth=20),
                               n_estimators=100)

ada_class.fit(X_train_norm, y_train)

# predict() already returns the boolean class labels, so no cast is needed
pred_ada = ada_class.predict(X_test_norm)

print(classification_report(y_test, pred_ada))



              precision    recall  f1-score   support

       False       0.78      0.75      0.76       661
        True       0.76      0.78      0.77       661

    accuracy                           0.76      1322
   macro avg       0.77      0.76      0.76      1322
weighted avg       0.77      0.76      0.76      1322
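
AdaBoost is designed around weak learners: scikit-learn's default base estimator is a decision stump (`max_depth=1`). A depth-20 tree can fit the training set almost perfectly on its own, which leaves little error for later rounds to reweight and may explain the lower score above, though this has not been verified on the Lab data. The sketch below compares both settings on synthetic data; the data and hyperparameters are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Lab data
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Weak learners: one-split stumps
ada_stumps = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=0
).fit(X_tr, y_tr)

# Strong learners: deep trees, as in the cell above
ada_deep = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=20), n_estimators=100, random_state=0
).fit(X_tr, y_tr)

stump_acc = ada_stumps.score(X_te, y_te)
deep_acc = ada_deep.score(X_te, y_te)
print(f"stumps:     {stump_acc:.3f} using {len(ada_stumps.estimators_)} estimators")
print(f"deep trees: {deep_acc:.3f} using {len(ada_deep.estimators_)} estimators")
```

The number of fitted estimators is printed because AdaBoost stops early when a base learner fits the weighted training data perfectly.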



Which model is the best and why?

In [None]:
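#comment here
# I would choose the random forest classifier, although the margin is negligible:
# bagging, random forest and gradient boosting all reach 0.79 accuracy and macro F1
# (gradient boosting has the highest F1 on the True class, 0.80), while AdaBoost trails at 0.76.
# With a single train/test split, differences of about 0.01 are too small to separate the top three.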
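
Because the scores are so close on a single split, cross-validation gives a steadier basis for picking a model. The sketch below runs 5-fold cross-validation for the four ensemble types on synthetic data. The data, the reduced `n_estimators`, and the `models` dictionary are assumptions made to keep the example fast; on the Lab data `X` and `y` would be the scaled features and `Transported`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Lab data
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Hypothetical, scaled-down versions of the four models
models = {
    "bagging": BaggingClassifier(DecisionTreeClassifier(max_depth=5), n_estimators=25, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=25, max_depth=5, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=25, max_depth=3, random_state=0),
    "adaboost": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=25, random_state=0),
}

# One accuracy value per fold and per model
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy")
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the fold-to-fold standard deviation is larger than the gap between two models, the data does not support preferring one over the other on accuracy alone.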