# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [202]:
# Libraries
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

In [88]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [None]:
spaceship.drop(columns = ["PassengerId", "Name"], inplace = True)

In [118]:
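# Drop every row that has at least one missing value
spaceship = spaceship.dropna(how="any")

# Keep only the deck letter from Cabin (format: deck/num/side)
spaceship["Cabin"] = spaceship["Cabin"].str.split("/").str[0]

# Split the columns by dtype (selected before the target is cast to float,
# so "Transported" is still part of bool_df here)
cat_df = spaceship.select_dtypes("object")
num_df = spaceship.select_dtypes("number")
bool_df = spaceship.select_dtypes("bool")

# Encode the target as 0.0 / 1.0
spaceship["Transported"] = spaceship["Transported"].astype(float)

# One-hot encode the categorical columns and rebuild the feature matrix
df_cat_num = pd.get_dummies(cat_df, dtype=int)
spaceship_num = pd.concat([df_cat_num, num_df, bool_df.drop(columns="Transported")], axis=1)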



**Perform Train Test Split**

In [120]:
features = spaceship_num
target = spaceship["Transported"]

X_train, X_test, y_train, y_test = train_test_split(features, target, random_state = 0)

scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)



**Model Selection** - now try different ensemble methods to see whether they improve on the baseline models below

In [122]:
log_reg = LogisticRegression()

In [124]:
log_reg.fit(X_train_scaled, y_train)

In [238]:
pred = log_reg.predict(X_test_scaled)

print("Accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))

Accuracy: 0.7602905569007264
              precision    recall  f1-score   support

         0.0       0.74      0.81      0.77       831
         1.0       0.79      0.71      0.75       821

    accuracy                           0.76      1652
   macro avg       0.76      0.76      0.76      1652
weighted avg       0.76      0.76      0.76      1652



### Decision Tree

In [214]:
tree = DecisionTreeClassifier(max_depth=50)

In [216]:
tree.fit(X_train_scaled, y_train)

In [218]:
pred = tree.predict(X_test_scaled)

print("MAE", mean_absolute_error(y_test, pred))
# RMSE via np.sqrt; the squared=False argument is deprecated in recent scikit-learn
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("Accuracy:", accuracy_score(y_test, pred))

print(classification_report(y_test, pred))

MAE 0.24334140435835352
RMSE 0.4932964670037213
Accuracy: 0.7566585956416465
              precision    recall  f1-score   support

         0.0       0.78      0.72      0.75       831
         1.0       0.74      0.79      0.76       821

    accuracy                           0.76      1652
   macro avg       0.76      0.76      0.76      1652
weighted avg       0.76      0.76      0.76      1652
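A note on the regression metrics printed above: with 0/1 labels every absolute error is either 0 or 1, so MAE is simply the misclassification rate (1 - accuracy) and RMSE is its square root. The numbers in the output agree: 1 - 0.7567 = 0.2433 and sqrt(0.2433) ≈ 0.4933. The sketch below checks this relation on a small set of made-up labels (the values are hypothetical, not taken from the dataset).

```python
import math

# Hypothetical 0/1 labels and predictions, made up for illustration.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

n = len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

print("accuracy:", accuracy)
print("MAE equals 1 - accuracy:", math.isclose(mae, 1 - accuracy))
print("RMSE equals sqrt(MAE):", math.isclose(rmse, math.sqrt(mae)))
```

For a classifier, accuracy and the classification report therefore already carry everything MAE and RMSE would tell you.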





In [220]:
tree = DecisionTreeClassifier(max_depth=5)

In [222]:
tree.fit(X_train_scaled, y_train)

In [224]:
pred = tree.predict(X_test_scaled)

print("MAE", mean_absolute_error(y_test, pred))
# RMSE via np.sqrt; the squared=False argument is deprecated in recent scikit-learn
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("Accuracy:", accuracy_score(y_test, pred))

print(classification_report(y_test, pred))


MAE 0.23002421307506055
RMSE 0.47960839554271834
Accuracy: 0.7699757869249395
              precision    recall  f1-score   support

         0.0       0.82      0.69      0.75       831
         1.0       0.73      0.85      0.79       821

    accuracy                           0.77      1652
   macro avg       0.78      0.77      0.77      1652
weighted avg       0.78      0.77      0.77      1652





In [226]:
tree = DecisionTreeClassifier(max_depth=15)

In [228]:
tree.fit(X_train_scaled, y_train)

In [230]:
pred = tree.predict(X_test_scaled)

print("MAE", mean_absolute_error(y_test, pred))
# RMSE via np.sqrt; the squared=False argument is deprecated in recent scikit-learn
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("Accuracy:", accuracy_score(y_test, pred))

print(classification_report(y_test, pred))

MAE 0.23365617433414043
RMSE 0.48337994821272884
Accuracy: 0.7663438256658596
              precision    recall  f1-score   support

         0.0       0.78      0.74      0.76       831
         1.0       0.75      0.79      0.77       821

    accuracy                           0.77      1652
   macro avg       0.77      0.77      0.77      1652
weighted avg       0.77      0.77      0.77      1652
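The three decision tree cells above repeat the same fit and predict steps with a different `max_depth`. A loop keeps the comparison in one place. This is a minimal sketch on synthetic stand-in data from `make_classification` (an assumption made so the block runs on its own); in the notebook you would use `X_train_scaled`, `X_test_scaled`, `y_train` and `y_test` instead.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Spaceship Titanic data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit one tree per depth and record its test accuracy.
depth_scores = {}
for depth in (5, 15, 50):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    depth_scores[depth] = accuracy_score(y_te, model.predict(X_te))

for depth, score in depth_scores.items():
    print(f"max_depth={depth}: test accuracy {score:.3f}")
```

Choosing `max_depth` by test accuracy makes the test score optimistic; a validation split or cross-validation on the training data is the safer way to pick it.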





- Bagging and Pasting
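Bagging trains each base model on a random sample drawn with replacement, while pasting draws the sample without replacement. In scikit-learn both are available through `BaggingClassifier`, switched by the `bootstrap` flag. The sketch below is one possible starting point, not the expected solution: it uses synthetic stand-in data (an assumption) and arbitrary hyperparameters, and in the notebook you would fit on `X_train_scaled` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Spaceship Titanic data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# bootstrap=True samples with replacement (bagging);
# bootstrap=False samples without replacement (pasting).
bag_scores = {}
for name, use_bootstrap in (("bagging", True), ("pasting", False)):
    ensemble = BaggingClassifier(
        DecisionTreeClassifier(max_depth=5),  # base estimator, passed positionally
        n_estimators=50,       # illustrative value
        max_samples=0.8,       # each tree sees 80% of the training rows
        bootstrap=use_bootstrap,
        random_state=0,
    )
    ensemble.fit(X_tr, y_tr)
    bag_scores[name] = accuracy_score(y_te, ensemble.predict(X_te))

print(bag_scores)
```

The base estimator is passed positionally because its keyword name differs between scikit-learn versions.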

- Random Forests

In [None]:
#your code here
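A random forest is a bagged ensemble of decision trees in which each split also considers only a random subset of the features. The following is a hedged sketch on synthetic stand-in data (an assumption); the hyperparameter values are illustrative, and in the notebook the model would be fitted on `X_train_scaled` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the Spaceship Titanic data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,  # number of trees (illustrative value)
    max_depth=10,      # cap on tree depth (illustrative value)
    random_state=0,
)
forest.fit(X_tr, y_tr)

rf_accuracy = accuracy_score(y_te, forest.predict(X_te))
print("Accuracy:", rf_accuracy)

# Impurity-based importances, one value per input feature.
importances = forest.feature_importances_
print("Feature importances:", importances.round(3))
```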

- Gradient Boosting

In [None]:
#your code here
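Gradient boosting builds trees one after another, with each new tree fitted to correct the errors of the ensemble so far. A minimal sketch with `GradientBoostingClassifier` follows, again on synthetic stand-in data (an assumption) and with illustrative hyperparameters rather than tuned ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the Spaceship Titanic data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=100,   # number of boosting stages (illustrative value)
    learning_rate=0.1,  # shrinks the contribution of each tree
    max_depth=3,        # shallow trees are typical for boosting
    random_state=0,
)
gb.fit(X_tr, y_tr)

gb_accuracy = accuracy_score(y_te, gb.predict(X_te))
print("Accuracy:", gb_accuracy)
```

A smaller `learning_rate` usually needs more estimators to reach the same fit, so the two are normally tuned together.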

- Adaptive Boosting

In [None]:
#your code here
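AdaBoost also trains its learners sequentially, but instead of fitting to residual errors it reweights the training samples so that later learners concentrate on the examples earlier ones misclassified. The sketch below uses decision stumps (`max_depth=1`) as weak learners on synthetic stand-in data (an assumption); all hyperparameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Spaceship Titanic data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # weak learner, passed positionally
    n_estimators=100,    # illustrative value
    learning_rate=0.5,   # illustrative value
    random_state=0,
)
ada.fit(X_tr, y_tr)

ada_accuracy = accuracy_score(y_te, ada.predict(X_te))
print("Accuracy:", ada_accuracy)
```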

Which model is the best and why?

In [None]:
#comment here
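One way to support the answer with evidence is to compare the candidates by cross-validated accuracy rather than by a single train/test split. The sketch below shows the pattern on synthetic stand-in data (an assumption); the model list and hyperparameters are illustrative, and in the notebook `X` and `y` would be `X_train_scaled` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the Spaceship Titanic data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "ada_boost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# Mean and standard deviation of 5-fold accuracy for each model.
cv_results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())

for name, (mean, std) in cv_results.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

If two models differ by less than their fold-to-fold spread, the difference is probably not meaningful, and the simpler or faster model is a reasonable choice.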