# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this lab, try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous lab.

In [29]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import accuracy_score, classification_report

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [4]:
# check the shape of your data
spaceship.shape

(8693, 14)

In [5]:
# check for data types
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

In [6]:
# check for missing values 
spaceship.isna().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [7]:
# drop missing values
spaceship_clean = spaceship.dropna()
spaceship_clean.isna().sum()

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
Transported     0
dtype: int64

In [8]:
# transform cabin variable: keep only the deck letter (first character)
spaceship_trans = spaceship_clean.copy()  # explicit copy avoids SettingWithCopyWarning
spaceship_trans['Cabin'] = spaceship_trans['Cabin'].str[0]
spaceship_trans['Cabin'].unique()

array(['B', 'F', 'A', 'G', 'E', 'C', 'D', 'T'], dtype=object)

In [9]:
# drop passengerid and name
spaceship_trans = spaceship_trans.drop(columns=["PassengerId", "Name"])


In [11]:
# one-hot encode the non-numerical columns
spaceship_trans = pd.get_dummies(spaceship_trans, columns=['Destination', 'VIP', 'HomePlanet','CryoSleep', 'Cabin'])
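
To make the two transformations above concrete, here is a minimal sketch on a toy frame. The three rows reuse `HomePlanet`, `Cabin` and `Age` values from the `head()` output earlier; the variable names `toy` and `encoded` are made up for illustration.

```python
import pandas as pd

# Toy frame with values copied from the first three rows shown by head().
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Europa"],
    "Cabin": ["B/0/P", "F/0/S", "A/0/S"],
    "Age": [39.0, 24.0, 58.0],
})

# Keep only the deck letter, as done for the real data.
toy["Cabin"] = toy["Cabin"].str[0]

# Each listed column is replaced by one indicator column per category.
encoded = pd.get_dummies(toy, columns=["HomePlanet", "Cabin"])
print(encoded.columns.tolist())
```

Columns that are not listed (here `Age`) are kept unchanged and appear first in the result.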

**Perform Train Test Split**

In [12]:
features = spaceship_trans.drop(columns=["Transported"])
target = spaceship_trans["Transported"]

In [13]:
# hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

In [16]:
# normalize data: fit the scaler on the training set only, so no test-set statistics leak into it
normalizer = MinMaxScaler()

normalizer.fit(X_train)

In [18]:
X_train_norm = normalizer.transform(X_train)

X_test_norm = normalizer.transform(X_test)
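
The fit-on-train, transform-both pattern used above can also be expressed as a scikit-learn `Pipeline`, which applies the same rule automatically. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the Spaceship Titanic features, and the names `pipe` and `scaler` are assumptions, not part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: NOT the real Spaceship Titanic features).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# fit() learns the min/max from X_train only; score() reuses those values
# to transform X_test before predicting.
pipe = make_pipeline(
    MinMaxScaler(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")

# The fitted scaler only ever saw the training split.
scaler = pipe.named_steps["minmaxscaler"]
```

A pipeline like this is also convenient for cross-validation, because the scaler is refit inside every fold.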

In [19]:
X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
X_train_norm.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,...,CryoSleep_False,CryoSleep_True,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T
0,0.405063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.050633,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.379747,0.0,0.007916,0.0,0.051276,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.21519,0.00131,0.0,0.046111,0.016378,4.9e-05,0.0,0.0,1.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.329114,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [20]:
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns)
X_test_norm.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,...,CryoSleep_False,CryoSleep_True,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T
0,0.632911,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.227848,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.189873,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.658228,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.78481,0.0,0.054775,0.0,0.07774,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


**Model Selection** - now try different ensemble methods to see whether they give a better model

- Bagging and Pasting

In [33]:
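# bootstrap=True (the default) samples with replacement, i.e. bagging;
# passing bootstrap=False would give pasting instead
bagging_reg = BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                               n_estimators=100,
                               max_samples=1000)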

In [34]:
bagging_reg.fit(X_train_norm, y_train)

In [38]:
pred = bagging_reg.predict(X_test_norm)

acc_bagging = accuracy_score(y_test, pred)
print("Accuracy", acc_bagging)
print(classification_report(y_test, pred))

Accuracy 0.7829046898638427
              precision    recall  f1-score   support

       False       0.78      0.78      0.78       661
        True       0.78      0.79      0.78       661

    accuracy                           0.78      1322
   macro avg       0.78      0.78      0.78      1322
weighted avg       0.78      0.78      0.78      1322
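
The cell above only runs bagging, since `bootstrap` defaults to `True`. As a rough sketch of how the two variants could be compared, the block below trains both on synthetic data (an assumption made so the example runs on its own; the tree depth, sample sizes and the `results` dictionary are illustrative choices, not the lab's settings).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the lab's dataset).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

results = {}
for name, bootstrap in [("bagging", True), ("pasting", False)]:
    model = BaggingClassifier(
        DecisionTreeClassifier(max_depth=5),
        n_estimators=50,
        max_samples=200,      # each tree sees 200 training rows
        bootstrap=bootstrap,  # True: with replacement, False: without
        random_state=0,
    )
    model.fit(X_train, y_train)
    results[name] = model.score(X_test, y_test)

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

On the real data, the same loop could be run with `X_train_norm` and `X_test_norm` in place of the synthetic arrays.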



- Random Forests

In [39]:
# Initialize a Random Forest
forest = RandomForestClassifier(n_estimators=100,
                             max_depth=20)
# Training the model
forest.fit(X_train_norm, y_train)
# Evaluate the model
pred = forest.predict(X_test_norm)

acc_random_forest = accuracy_score(y_test, pred)
print("Accuracy", acc_random_forest)
print(classification_report(y_test, pred))


Accuracy 0.7904689863842662
              precision    recall  f1-score   support

       False       0.78      0.81      0.79       661
        True       0.80      0.77      0.79       661

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322
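
A fitted random forest also exposes impurity-based feature importances through its `feature_importances_` attribute, which can help explain what the model relies on. The sketch below shows the idea on synthetic data; the feature names `feature_0` ... `feature_5` are placeholders, and on the lab data you would use `X_train_norm.columns` instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder feature names (assumptions).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=0)
forest.fit(X, y)

# Importances are normalized to sum to 1; sort to see the strongest features first.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```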



- Gradient Boosting

In [40]:
# Initialize the model
gb_reg = GradientBoostingClassifier(max_depth=20,
                                   n_estimators=100)

# Training the model
gb_reg.fit(X_train_norm, y_train)

# Evaluate the model
pred = gb_reg.predict(X_test_norm)

acc_gradient_boosting = accuracy_score(y_test, pred)
print("Accuracy", acc_gradient_boosting)
print(classification_report(y_test, pred))

Accuracy 0.7844175491679274
              precision    recall  f1-score   support

       False       0.80      0.76      0.78       661
        True       0.77      0.80      0.79       661

    accuracy                           0.78      1322
   macro avg       0.78      0.78      0.78      1322
weighted avg       0.78      0.78      0.78      1322
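
Gradient boosting is usually run with shallow trees (scikit-learn's default is `max_depth=3`), so `max_depth=20` above is worth questioning. One way to explore this is `staged_predict`, which yields the predictions after each boosting round. The block below is a sketch on synthetic data, not a tuned setup for this lab; `staged_acc` and `best_n` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the lab's dataset).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

gb = GradientBoostingClassifier(max_depth=3, n_estimators=100, random_state=0)
gb.fit(X_train, y_train)

# Accuracy on the held-out split after 1, 2, ..., 100 trees.
staged_acc = [accuracy_score(y_test, pred) for pred in gb.staged_predict(X_test)]
best_n = staged_acc.index(max(staged_acc)) + 1
print(f"Best accuracy {max(staged_acc):.3f} reached with {best_n} trees")
```

In a real experiment the number of trees should be chosen on a validation split or by cross-validation rather than on the test set.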



- Adaptive Boosting

In [41]:
# Initialize the model
ada_reg = AdaBoostClassifier(DecisionTreeClassifier(max_depth=20),
                            n_estimators=100)
# Training the model
ada_reg.fit(X_train_norm, y_train)
# Evaluate the model
pred = ada_reg.predict(X_test_norm)

acc_adap_boosting = accuracy_score(y_test, pred)
print("Accuracy", acc_adap_boosting)
print(classification_report(y_test, pred))

Accuracy 0.7768532526475038
              precision    recall  f1-score   support

       False       0.79      0.76      0.77       661
        True       0.77      0.79      0.78       661

    accuracy                           0.78      1322
   macro avg       0.78      0.78      0.78      1322
weighted avg       0.78      0.78      0.78      1322
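
AdaBoost is most often paired with very shallow trees ("stumps"), because each round is meant to correct the mistakes of a weak learner. With depth-20 trees, a single tree may already fit the training data with zero error, in which case scikit-learn stops adding estimators. The sketch below compares both depths on synthetic data and reports how many trees were actually fitted; the data and the `ada_results` dictionary are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the lab's dataset).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

ada_results = {}
for depth in (1, 20):
    model = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=depth, random_state=0),
        n_estimators=50,
        random_state=0,
    )
    model.fit(X_train, y_train)
    ada_results[depth] = {
        "trees_fitted": len(model.estimators_),  # may be < 50 if boosting stopped early
        "test_accuracy": model.score(X_test, y_test),
    }
    print(depth, ada_results[depth])
```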



Which model is the best and why?

In [42]:
print(f'Accuracy of bagging and pasting: {acc_bagging}\n')
print(f'Accuracy of random forest: {acc_random_forest}\n')
print(f'Accuracy of gradient boosting: {acc_gradient_boosting}\n')
print(f'Accuracy of adaptive boosting: {acc_adap_boosting}\n')

Accuracy of bagging and pasting: 0.7829046898638427

Accuracy of random forest: 0.7904689863842662

Accuracy of gradient boosting: 0.7844175491679274

Accuracy of adaptive boosting: 0.7768532526475038



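Random forest is the best model here because it has the highest test accuracy (0.790). The margin is small, though: all four models score between 0.777 and 0.790, and no `random_state` was set on the models, so the ranking may change from run to run.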
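
Because a single train/test split gives only one accuracy estimate per model, a more robust comparison would average over several folds. The block below is a minimal sketch of that idea with `cross_val_score`. It runs on synthetic data so that it is self-contained, and the hyperparameters and the `cv_results` dictionary are illustrative assumptions rather than the lab's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the lab's dataset).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

models = {
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(max_depth=5), n_estimators=30, random_state=0
    ),
    "random forest": RandomForestClassifier(n_estimators=30, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=30, random_state=0),
    "adaptive boosting": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=30, random_state=0
    ),
}

cv_results = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy: mean and spread across folds.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the spread across folds is larger than the gap between two models, the data do not clearly favour one over the other.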