# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [35]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler, StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [37]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")


In [39]:
spaceship

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,Europa,False,A/98/P,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,Gravior Noxnuther,False
8689,9278_01,Earth,True,G/1499/S,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,Kurta Mondalley,False
8690,9279_01,Earth,False,G/1500/S,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,Fayey Connon,True
8691,9280_01,Europa,False,E/608/S,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,Celeon Hontichre,False


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [42]:
# Drop rows with missing values; .copy() avoids pandas' SettingWithCopyWarning on the assignment below
spaceship = spaceship.dropna().copy()

# Keep only the deck letter (first character) of each Cabin value
spaceship['Cabin'] = spaceship['Cabin'].str[0]
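Cabin values take the form deck/number/side (for example `B/0/P`), so `.str[0]` keeps only the deck letter. A minimal sketch of that step, using a hypothetical miniature frame `toy` with cabin strings copied from the preview above plus one missing value:

```python
import pandas as pd

# Hypothetical miniature frame; cabin strings copied from the preview above
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/0/S", "A/0/S", None]})

# Drop missing rows and take an explicit copy, so the assignment below
# modifies an independent DataFrame rather than a slice of the original
toy = toy.dropna().copy()

# Keep only the first character, i.e. the deck letter
toy["Cabin"] = toy["Cabin"].str[0]

print(toy["Cabin"].tolist())
```

Reducing the column to a handful of deck letters also keeps the later one-hot encoding small, since each distinct value becomes its own column.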


**Perform Train Test Split**

In [46]:
# Drop identifier columns; reassigning (instead of inplace=True) avoids the SettingWithCopyWarning
spaceship = spaceship.drop(columns=['PassengerId', 'Name'])
spaceship = pd.get_dummies(spaceship)

features = spaceship.drop(columns='Transported')
target = spaceship["Transported"]

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

# Normalize the data after the train/test split
normalizer = MinMaxScaler()  # define the normalizer

normalizer.fit(X_train)  # fit on the training data only

X_train_norm = normalizer.transform(X_train)  # normalize the 80% training data
X_test_norm = normalizer.transform(X_test)  # normalize the 20% test data

# Convert the scaled arrays back to DataFrames with the original column names
X_train_norm = pd.DataFrame(X_train_norm, columns=X_train.columns)
X_test_norm = pd.DataFrame(X_test_norm, columns=X_test.columns)
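Fitting the scaler on the training split only means the test split is transformed with the training minimum and maximum, so scaled test values can fall outside [0, 1]. A small sketch with made-up numbers (the arrays are illustrative, not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [10.0]])  # illustrative training column
test = np.array([[5.0], [20.0]])   # illustrative test column

# The scaler learns min=0 and max=10 from the training data only
scaler = MinMaxScaler().fit(train)
scaled_test = scaler.transform(test)

# 5 maps to 0.5; 20 maps to 2.0, which lies outside [0, 1]
print(scaled_test.ravel())
```

This is expected behaviour rather than a bug: the test data must not influence the scaling parameters.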


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

In [49]:
#Logistic Regression


In [51]:
# Initialize the model
log_reg = LogisticRegression()

# Fit the model to the training data
log_reg.fit(X_train_norm, y_train)

# Predict and evaluate
pred = log_reg.predict(X_test_norm)
accuracy = accuracy_score(y_test, pred)


accuracy

0.7700453857791225

In [53]:
pred


array([ True,  True,  True, ...,  True,  True,  True])

In [55]:
#Decision Tree

tree = DecisionTreeClassifier(max_depth=10)

tree.fit(X_train, y_train)

In [57]:
pred = tree.predict(X_test)

pred

array([ True,  True,  True, ...,  True,  True,  True])

In [59]:
accuracy = accuracy_score(y_test, pred)
accuracy

0.7723146747352496

In [61]:
tree_importance = {feature : importance for feature, importance in zip(X_train_norm.columns, tree.feature_importances_)}
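The importance dictionary is easier to read once sorted. One possible way to do that, sketched on synthetic data from `make_classification` rather than the Spaceship Titanic features (the column names `f0` to `f4` and the name `demo_tree` are placeholders):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature matrix
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])

demo_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Pair each importance with its column name and sort, largest first
importances = pd.Series(demo_tree.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(importances)
```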

In [72]:
from sklearn.tree import export_text

tree_viz = export_text(tree, feature_names=list(X_train_norm.columns))
#print(tree_viz)


In [65]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt  # required for plt.figure / plt.show below

In [67]:
plt.figure(figsize=(12,8))  # Set plot size
plot_tree(tree, feature_names=list(X_train.columns), filled=True)

plt.show()
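A depth-10 tree is hard to read when drawn in full. `plot_tree` accepts a `max_depth` argument that limits how many levels are drawn. A sketch on synthetic data (the name `demo_tree` is a placeholder), using the non-interactive `Agg` backend so it also runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in for the real training data
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
demo_tree = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X, y)

fig, ax = plt.subplots(figsize=(12, 8))
# Draw only the top levels of the fitted tree; deeper nodes are collapsed
nodes = plot_tree(demo_tree, max_depth=2, filled=True, ax=ax)
# In a notebook, call plt.show() here to display the figure
```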

- Bagging and Pasting

In [148]:
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor,AdaBoostRegressor, GradientBoostingRegressor
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier,AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Initialize the Bagging Classifier with DecisionTreeClassifier as the base estimator
bagging_cl = BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                               n_estimators=100,
                               max_samples=1000)

# Fit the model on the training data
bagging_cl.fit(X_train_norm, y_train)

# Make predictions on the test data
pred = bagging_cl.predict(X_test_norm)

# Calculate and print evaluation metrics
print("Accuracy:", accuracy_score(y_test, pred))
print("F1 Score:", f1_score(y_test, pred, average='weighted'))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))

Accuracy: 0.7813918305597579
F1 Score: 0.7813917054751921
Confusion Matrix:
 [[516 145]
 [144 517]]
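The heading mentions pasting, but the cell above only uses bagging: `bootstrap=True` is the default, so each tree's sample is drawn with replacement. Pasting draws without replacement via `bootstrap=False`. A sketch comparing the two on synthetic data; the sample sizes and tree depth here are arbitrary choices, not the settings used above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

scores = {}
for name, bootstrap in [("bagging", True), ("pasting", False)]:
    clf = BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                            n_estimators=50,
                            max_samples=200,      # each tree sees 200 rows
                            bootstrap=bootstrap,  # True: with replacement, False: without
                            random_state=0)
    clf.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, clf.predict(X_te))

print(scores)
```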


- Random Forests

In [146]:
# Initialize the Random Forest Classifier
forest = RandomForestClassifier(n_estimators=100, max_depth=20)

# Fit the model on the training data
forest.fit(X_train_norm, y_train)

# Make predictions on the test data
pred = forest.predict(X_test_norm)

# Calculate and print evaluation metrics for classification
print("Accuracy:", accuracy_score(y_test, pred))
print("F1 Score:", f1_score(y_test, pred, average='weighted'))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))

Accuracy: 0.7844175491679274
F1 Score: 0.7844075570939165
Confusion Matrix:
 [[523 138]
 [147 514]]
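A single 80/20 split yields one accuracy number, and without a fixed `random_state` the forest itself changes between runs. Cross-validation reports a mean and a spread instead. A sketch on synthetic data (the 5 folds and the name `demo_forest` are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Fixing random_state makes the forest reproducible across runs
demo_forest = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=0)

# One accuracy value per fold
cv_scores = cross_val_score(demo_forest, X, y, cv=5, scoring="accuracy")
print("mean:", cv_scores.mean(), "std:", cv_scores.std())
```

If the gap between two models is smaller than the fold-to-fold standard deviation, the ranking between them is not well supported.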


- Gradient Boosting

In [142]:
# Initialize the Gradient Boosting Classifier
gb_clf = GradientBoostingClassifier(max_depth=20, n_estimators=100)

# Fit the model on the training data
gb_clf.fit(X_train_norm, y_train)

# Make predictions on the test data
pred = gb_clf.predict(X_test_norm)

# Calculate and print evaluation metrics for classification
print("Accuracy:", accuracy_score(y_test, pred))
print("F1 Score:", f1_score(y_test, pred, average='weighted'))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))

Accuracy: 0.783661119515885
F1 Score: 0.7836011904761905
Confusion Matrix:
 [[507 154]
 [132 529]]
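Gradient boosting adds trees one at a time, so accuracy can be tracked after each stage with `staged_predict`. Boosting is usually run with shallow trees; the `max_depth=3` below is scikit-learn's default, not the value used above. A sketch on synthetic data, with placeholder names `gb` and `stage_acc`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

gb = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
gb.fit(X_tr, y_tr)

# Accuracy on the held-out split after 1, 2, ..., 100 trees
stage_acc = [accuracy_score(y_te, pred) for pred in gb.staged_predict(X_te)]

# Stage with the highest held-out accuracy (1-based). For real tuning,
# use a separate validation split rather than the test set.
best_stage = int(np.argmax(stage_acc)) + 1
print(best_stage, stage_acc[best_stage - 1])
```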


- Adaptive Boosting

In [144]:
# Initialize the AdaBoost Classifier with DecisionTreeClassifier as the base estimator
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=20),
                              n_estimators=100)

# Fit the model on the training data
ada_clf.fit(X_train_norm, y_train)

# Make predictions on the test data
pred = ada_clf.predict(X_test_norm)

# Calculate and print evaluation metrics for classification
print("Accuracy:", accuracy_score(y_test, pred))
print("F1 Score:", f1_score(y_test, pred, average='weighted'))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))



Accuracy: 0.7700453857791225
F1 Score: 0.7698346700387839
Confusion Matrix:
 [[489 172]
 [132 529]]
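AdaBoost is typically paired with very shallow trees (its default base estimator in scikit-learn is a depth-1 stump). With deep trees, each base learner can already fit the training data almost perfectly, leaving little for later rounds to correct; scikit-learn also stops boosting early once a base learner reaches zero weighted training error. A sketch comparing two depths on synthetic data, where the depths and the `results` layout are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for depth in (1, 20):
    clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=depth),
                             n_estimators=50, random_state=0)
    clf.fit(X_tr, y_tr)
    results[depth] = {
        "n_fitted": len(clf.estimators_),  # may be fewer than n_estimators
        "train_acc": accuracy_score(y_tr, clf.predict(X_tr)),
        "test_acc": accuracy_score(y_te, clf.predict(X_te)),
    }

for depth, row in results.items():
    print(depth, row)
```

A large gap between `train_acc` and `test_acc` for the deep trees would point to overfitting of the base learners.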


Which model is the best and why?

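Random Forest has the highest accuracy (0.7844) and weighted F1 score (0.7844), but only by a narrow margin: Gradient Boosting (0.7837) is a single test sample behind (1036 vs. 1037 correct out of 1322), and Bagging follows at 0.7814. Random Forest has the most true negatives (523) and the fewest false positives (138). It does not have the most true positives, though: Gradient Boosting and AdaBoost both reach 529 true positives with 132 false negatives, against 514 and 147 for Random Forest. Because none of the ensembles was given a fixed `random_state`, the ranking among the top models may change between runs.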
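The four confusion matrices printed above can be broken down into per-class counts to check how the models differ. scikit-learn's convention is rows = actual class and columns = predicted class, so for a binary problem `ravel()` yields TN, FP, FN, TP. A sketch using the numbers copied from the outputs above (the `summary` layout is an illustrative choice):

```python
import numpy as np

# Confusion matrices copied from the cell outputs above
matrices = {
    "Bagging": np.array([[516, 145], [144, 517]]),
    "Random Forest": np.array([[523, 138], [147, 514]]),
    "Gradient Boosting": np.array([[507, 154], [132, 529]]),
    "AdaBoost": np.array([[489, 172], [132, 529]]),
}

summary = {}
for name, cm in matrices.items():
    tn, fp, fn, tp = cm.ravel()
    summary[name] = {
        "accuracy": (tp + tn) / cm.sum(),
        "precision": tp / (tp + fp),  # share of predicted positives that are correct
        "recall": tp / (tp + fn),     # share of actual positives that are found
    }

for name, row in summary.items():
    print(f"{name}: " + ", ".join(f"{k}={v:.4f}" for k, v in row.items()))
```

The recomputed accuracies match the printed ones, and the precision and recall columns show where each model's errors fall.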