# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. To make the comparison fair, apply the same feature scaling and feature engineering as in the previous Lab.

In [57]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error
from sklearn.ensemble import (
    BaggingClassifier, BaggingRegressor,
    VotingClassifier, VotingRegressor,
    RandomForestClassifier, RandomForestRegressor,
    GradientBoostingClassifier, GradientBoostingRegressor,
    AdaBoostClassifier, AdaBoostRegressor,
)

In [58]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [59]:
spaceship.shape

(8693, 14)

In [60]:
spaceship.dtypes

Unnamed: 0,0
PassengerId,object
HomePlanet,object
CryoSleep,object
Cabin,object
Destination,object
Age,float64
VIP,object
RoomService,float64
FoodCourt,float64
ShoppingMall,float64


In [61]:
spaceship.isnull().sum()

Unnamed: 0,0
PassengerId,0
HomePlanet,201
CryoSleep,217
Cabin,199
Destination,182
Age,179
VIP,203
RoomService,181
FoodCourt,183
ShoppingMall,208


In [62]:
spaceship_cleaned = spaceship.dropna()

In [63]:
spaceship_cleaned.shape

(6606, 14)

In [64]:
def transform_cabin(cabin):
    if pd.isna(cabin):
        return 'T'  # Handling missing values by assigning them 'T'
    first_letter = cabin[0]
    if first_letter in ['A', 'B', 'C', 'D', 'E', 'F', 'G']:
        return first_letter
    else:
        return 'T'  # Assign 'T' for any other letters (e.g., 'X')

# Work on an explicit copy so the assignment does not raise SettingWithCopyWarning
spaceship_cleaned = spaceship_cleaned.copy()
spaceship_cleaned['Cabin'] = spaceship_cleaned['Cabin'].apply(transform_cabin)

spaceship_cleaned.head()


Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [65]:
spaceship_cleaned = spaceship_cleaned.drop(columns=['PassengerId', 'Name'])

spaceship_cleaned.head()

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False
1,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True
2,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False
3,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False
4,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [66]:
# One-hot encode the categorical columns (drop_first removes one redundant dummy per category)
spaceship_dummies = pd.get_dummies(spaceship_cleaned, drop_first=True)

# Select the numeric columns to scale; 'Transported' and the dummies are boolean, so they are left untouched
numeric_cols = spaceship_dummies.select_dtypes(include=['float64', 'int64']).columns

# Initialize the StandardScaler
scaler = StandardScaler()

# Perform the scaling on the numeric columns
spaceship_dummies[numeric_cols] = scaler.fit_transform(spaceship_dummies[numeric_cols])

# Display the first few rows to check the scaling
spaceship_dummies.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_True,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_True
0,0.695413,-0.345756,-0.285355,-0.309494,-0.273759,-0.269534,False,True,False,False,True,False,False,False,False,False,False,False,True,False
1,-0.336769,-0.176748,-0.279993,-0.266112,0.206165,-0.230494,True,False,False,False,False,False,False,False,True,False,False,False,True,False
2,2.002842,-0.279083,1.845163,-0.309494,5.596357,-0.226058,False,True,False,False,False,False,False,False,False,False,False,False,True,True
3,0.28254,-0.345756,0.479034,0.334285,2.636384,-0.098291,False,True,False,False,False,False,False,False,False,False,False,False,True,False
4,-0.887266,0.124056,-0.24365,-0.04747,0.220152,-0.267759,True,False,False,False,False,False,False,False,True,False,False,False,True,False
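
Note that the scaler above is fitted on the full dataset before the train/test split, so test-set statistics leak into the scaling. One common way to avoid this is to put the scaler inside a pipeline that is fitted on the training split only. The sketch below uses synthetic data from `make_classification` as a stand-in for the lab data; the variable names (`X_demo`, `pipe`, and so on) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Spaceship Titanic features and target
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fitted only on the training rows when the pipeline is fitted
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# The test rows are transformed with the training mean and standard deviation
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.4f}")
```

Any of the ensemble estimators used later could take the place of `LogisticRegression` as the final pipeline step.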


In [67]:
# Calculate correlation matrix
correlation_matrix = spaceship_dummies.corr()

# Check the correlation of each feature with the target (Transported)
target_corr = correlation_matrix['Transported'].sort_values(ascending=False)

# Display the correlation of all features with 'Transported'
target_corr

Unnamed: 0,Transported
Transported,1.0
CryoSleep_True,0.462803
HomePlanet_Europa,0.182004
Cabin_B,0.146288
Cabin_C,0.109988
FoodCourt,0.055025
Cabin_G,0.022711
HomePlanet_Mars,0.012357
ShoppingMall,0.011602
Destination_PSO J318.5-22,0.001281


In [68]:
# Initialize the model for feature selection
model = LogisticRegression(max_iter=1000)

# Use RFE for feature selection
selector = RFE(model, n_features_to_select=5)  # Selecting top 5 features
selector = selector.fit(spaceship_dummies.drop('Transported', axis=1), spaceship_dummies['Transported'])

# Get the features that were selected
selected_features = spaceship_dummies.drop('Transported', axis=1).columns[selector.support_]

# Display the selected features
print(f"Selected features: {selected_features}")

Selected features: Index(['Spa', 'VRDeck', 'HomePlanet_Europa', 'CryoSleep_True', 'Cabin_C'], dtype='object')
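
The selected features are printed here, but the train/test split that follows keeps all columns. If the selection is meant to be applied, the boolean mask in `selector.support_` can be used to subset the feature matrix. The sketch below shows this on synthetic data; the column names `f0` to `f9` and the `X_demo` frame are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in with named columns
X_arr, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])

# Recursive feature elimination down to 5 features
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X_demo, y_demo)

# Keep only the columns flagged by the selector
selected = X_demo.columns[selector.support_]
X_selected = X_demo[selected]

print(list(selected))
print(X_selected.shape)
```

To avoid selection bias, the selector would ideally be fitted on the training split only, for example as a pipeline step.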


**Perform Train Test Split**

In [69]:
# Separate the features (X) and target (y)
# Note: all features are kept here; the 5 RFE-selected features above are not applied
X = spaceship_dummies.drop('Transported', axis=1)  # Features (all columns except 'Transported')
y = spaceship_dummies['Transported']  # Target (the 'Transported' column)

# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting datasets
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (5284, 19)
X_test shape: (1322, 19)
y_train shape: (5284,)
y_test shape: (1322,)


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [70]:
# Initialize the base model (Decision Tree in this case)
base_model = DecisionTreeClassifier(random_state=42)

# Initialize the Bagging Classifier with Decision Tree as the base model
bagging_model = BaggingClassifier(base_model, n_estimators=100, random_state=42)

# Fit the Bagging model on the training data
bagging_model.fit(X_train, y_train)

# Predict on the test data
y_pred_bagging = bagging_model.predict(X_test)

# Evaluate the model
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
print(f"Accuracy of Bagging Classifier: {accuracy_bagging:.4f}")

Accuracy of Bagging Classifier: 0.8033


In [71]:
# Initialize individual classifiers
lr = LogisticRegression(random_state=42)
svc = SVC(random_state=42)
dt = DecisionTreeClassifier(random_state=42)

# Combine the classifiers in a hard-voting ensemble (majority vote across different model types).
# Note: this is not pasting. Pasting is bagging with sampling without replacement (bootstrap=False).
voting_model = VotingClassifier(estimators=[('lr', lr), ('svc', svc), ('dt', dt)], voting='hard')

# Fit the Voting model on the training data
voting_model.fit(X_train, y_train)

# Predict on the test data
y_pred_voting = voting_model.predict(X_test)

# Evaluate the model
accuracy_voting = accuracy_score(y_test, y_pred_voting)
print(f"Accuracy of Pasting (Voting) Classifier: {accuracy_voting:.4f}")

Accuracy of Pasting (Voting) Classifier: 0.8139
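
For comparison, pasting in the strict sense trains copies of one base model on random subsets drawn without replacement. In scikit-learn this is `BaggingClassifier` with `bootstrap=False`. A minimal sketch on synthetic data follows; the data, the `max_samples=0.7` fraction, and the variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab data
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# bootstrap=False draws each subset WITHOUT replacement (pasting);
# max_samples < 1.0 is needed so the subsets differ from one another
pasting_model = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=50,
    max_samples=0.7,
    bootstrap=False,
    random_state=42,
)
pasting_model.fit(X_tr, y_tr)

pasting_accuracy = accuracy_score(y_te, pasting_model.predict(X_te))
print(f"Accuracy of Pasting Classifier: {pasting_accuracy:.4f}")
```

With `bootstrap=True` (the default) the same class performs bagging, so the two approaches differ only in how the subsets are sampled.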


In [72]:
# Initialize the base model (Linear Regression in this case)
base_model = LinearRegression()

# Initialize the Bagging Regressor with Linear Regression as the base model
bagging_regressor = BaggingRegressor(base_model, n_estimators=100, random_state=42)

# Fit the Bagging Regressor on the training data
bagging_regressor.fit(X_train, y_train)

# Predict on the test data
y_pred_bagging_regressor = bagging_regressor.predict(X_test)

# Evaluate the model
mae_bagging = mean_absolute_error(y_test, y_pred_bagging_regressor)
mse_bagging = mean_squared_error(y_test, y_pred_bagging_regressor)
print(f"MAE of Bagging Regressor: {mae_bagging:.4f}")
print(f"MSE of Bagging Regressor: {mse_bagging:.4f}")

MAE of Bagging Regressor: 0.3349
MSE of Bagging Regressor: 0.1574


In [73]:
# Initialize individual regression models
lr = LinearRegression()
dt = DecisionTreeRegressor(random_state=42)
svr = SVR()

# Combine the regression models in a voting ensemble
voting_regressor = VotingRegressor(estimators=[('lr', lr), ('dt', dt), ('svr', svr)])

# Fit the Voting Regressor on the training data
voting_regressor.fit(X_train, y_train)

# Predict on the test data
y_pred_voting_regressor = voting_regressor.predict(X_test)

# Evaluate the model
mae_voting = mean_absolute_error(y_test, y_pred_voting_regressor)
mse_voting = mean_squared_error(y_test, y_pred_voting_regressor)
print(f"MAE of Pasting (Voting) Regressor: {mae_voting:.4f}")
print(f"MSE of Pasting (Voting) Regressor: {mse_voting:.4f}")

MAE of Pasting (Voting) Regressor: 0.2697
MSE of Pasting (Voting) Regressor: 0.1401
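
Because `Transported` is a binary target, the MAE and MSE of a regressor are not on the same scale as classification accuracy. One way to put a regressor on that scale is to threshold its continuous output at 0.5. The sketch below illustrates the idea on synthetic data; the threshold and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary target standing in for 'Transported'
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit a regressor directly on the 0/1 labels
reg = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Continuous predictions: averages of 0/1 values, so they lie in [0, 1]
scores = reg.predict(X_te)

# Convert to class labels with a 0.5 cut-off
labels = (scores >= 0.5).astype(int)
thresholded_accuracy = accuracy_score(y_te, labels)
print(f"Accuracy after thresholding: {thresholded_accuracy:.4f}")
```

This only works cleanly for regressors whose output stays near the 0 to 1 range; a linear regression can predict values outside it.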


- Random Forests

In [74]:
# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model on the training data
rf_classifier.fit(X_train, y_train)

# Predict on the test data
y_pred_rf = rf_classifier.predict(X_test)

# Evaluate the model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Accuracy of Random Forest Classifier: {accuracy_rf:.4f}")

Accuracy of Random Forest Classifier: 0.8064
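
Random forests also offer an out-of-bag (OOB) estimate: each tree is scored on the training rows left out of its bootstrap sample, which gives a validation-style accuracy without a separate hold-out set. A minimal sketch on synthetic data, with illustrative variable names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)

# oob_score=True asks the forest to score each row using only the trees
# whose bootstrap sample did not contain that row
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X_demo, y_demo)

oob_accuracy = rf.oob_score_
print(f"Out-of-bag accuracy: {oob_accuracy:.4f}")
```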


In [75]:
# Initialize the Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model on the training data
rf_regressor.fit(X_train, y_train)

# Predict on the test data
y_pred_rf_regressor = rf_regressor.predict(X_test)

# Evaluate the model
mae_rf = mean_absolute_error(y_test, y_pred_rf_regressor)
mse_rf = mean_squared_error(y_test, y_pred_rf_regressor)

print(f"MAE of Random Forest Regressor: {mae_rf:.4f}")
print(f"MSE of Random Forest Regressor: {mse_rf:.4f}")

MAE of Random Forest Regressor: 0.2595
MSE of Random Forest Regressor: 0.1418


- Gradient Boosting

In [76]:
# Initialize the Gradient Boosting Classifier
gb_classifier = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Fit the model on the training data
gb_classifier.fit(X_train, y_train)

# Predict on the test data
y_pred_gb = gb_classifier.predict(X_test)

# Evaluate the model
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print(f"Accuracy of Gradient Boosting Classifier: {accuracy_gb:.4f}")

Accuracy of Gradient Boosting Classifier: 0.8071
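
Since gradient boosting adds trees sequentially, the accuracy after each stage can be inspected with `staged_predict` to see whether 100 trees is too few or too many. The sketch below does this on synthetic data with a separate validation split; the data and names are assumptions, and the stage should be chosen on validation data rather than on the final test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 100 trees
staged_accuracy = [
    accuracy_score(y_val, y_stage) for y_stage in gb.staged_predict(X_val)
]

# Number of trees with the highest validation accuracy (first one on ties)
best_n = int(np.argmax(staged_accuracy)) + 1
print(f"Best number of trees on validation data: {best_n}")
```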


In [77]:
# Initialize the Gradient Boosting Regressor
gb_regressor = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Fit the model on the training data
gb_regressor.fit(X_train, y_train)

# Predict on the test data
y_pred_gb_regressor = gb_regressor.predict(X_test)

# Evaluate the model
mae_gb = mean_absolute_error(y_test, y_pred_gb_regressor)
mse_gb = mean_squared_error(y_test, y_pred_gb_regressor)

print(f"MAE of Gradient Boosting Regressor: {mae_gb:.4f}")
print(f"MSE of Gradient Boosting Regressor: {mse_gb:.4f}")

MAE of Gradient Boosting Regressor: 0.2777
MSE of Gradient Boosting Regressor: 0.1335


- Adaptive Boosting

In [79]:
# Initialize a decision tree as the weak learner (stump)
base_learner = DecisionTreeClassifier(max_depth=1)

# Initialize the AdaBoost classifier (the 'estimator' argument requires scikit-learn >= 1.2; older versions call it 'base_estimator')
ada_boost_classifier = AdaBoostClassifier(estimator=base_learner, n_estimators=50, learning_rate=1, random_state=42)

# Fit the model on the training data
ada_boost_classifier.fit(X_train, y_train)

# Predict on the test data
y_pred_ada = ada_boost_classifier.predict(X_test)

# Evaluate the model
accuracy_ada = accuracy_score(y_test, y_pred_ada)
print(f"Accuracy of AdaBoost Classifier: {accuracy_ada:.4f}")



Accuracy of AdaBoost Classifier: 0.7920


In [81]:
# Initialize a decision tree regressor as the base learner
base_learner_regressor = DecisionTreeRegressor(max_depth=1)

# Initialize the AdaBoost regressor (the 'estimator' argument requires scikit-learn >= 1.2; older versions call it 'base_estimator')
ada_boost_regressor = AdaBoostRegressor(estimator=base_learner_regressor, n_estimators=50, learning_rate=1, random_state=42)

# Fit the model on the training data
ada_boost_regressor.fit(X_train, y_train)

# Predict on the test data
y_pred_ada_regressor = ada_boost_regressor.predict(X_test)

# Evaluate the model
mse_ada = mean_squared_error(y_test, y_pred_ada_regressor)
print(f"Mean Squared Error of AdaBoost Regressor: {mse_ada:.4f}")

Mean Squared Error of AdaBoost Regressor: 0.1952


Which model is the best and why?

# Model Results and Analysis

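This section presents and analyzes the performance of the ensemble models above. The target `Transported` is binary, so the regressors are fitted to 0/1 values; their MAE and MSE measure distance from those labels and are not directly comparable with classification accuracy.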

## **Classification Models:**

### 1. **Bagging Classifier**
   - **Accuracy**: **0.8033**
   - **Analysis**:
     - The Bagging Classifier performs decently with an accuracy of 80.33%.
     - Bagging reduces variance by averaging multiple base learners, leading to stable predictions with reduced overfitting.

### 2. **Pasting (Voting) Classifier**
   - **Accuracy**: **0.8139**
   - **Analysis**:
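     - The Pasting (Voting) Classifier achieves slightly better accuracy (81.39%) compared to Bagging.
     - Despite its label, this model is a hard-voting ensemble of logistic regression, SVC, and a decision tree, not pasting in the strict sense (bagging without replacement). Combining different model types helps when their errors are not strongly correlated, since a majority vote can then cancel out individual mistakes.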

### 3. **Random Forest Classifier**
   - **Accuracy**: **0.8064**
   - **Analysis**:
     - Random Forest performs similarly to Bagging (80.64% accuracy).
     - It combines multiple decision trees trained on different subsets of data and features, providing reduced overfitting and increased generalization.

### 4. **Gradient Boosting Classifier**
   - **Accuracy**: **0.8071**
   - **Analysis**:
     - Gradient Boosting achieves an accuracy of 80.71%, slightly higher than Random Forest.
     - It builds trees sequentially, each correcting the errors of the previous tree. Gradient Boosting can overfit if not tuned properly, but it performs well when set up correctly.

### 5. **AdaBoost Classifier**
   - **Accuracy**: **0.7920**
   - **Analysis**:
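     - AdaBoost achieves the lowest accuracy at 79.20%.
     - AdaBoost reweights the training samples so that each new weak learner (here a depth-1 stump) concentrates on previously misclassified cases. This makes it sensitive to noisy data, and with only 50 stumps it may also be underfitting, which may contribute to its lower accuracy.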

---

## **Regression Models:**

### 1. **Bagging Regressor**
   - **MAE**: **0.3349**
   - **MSE**: **0.1574**
   - **Analysis**:
     - The Bagging Regressor has the highest MAE (0.3349) and the second-highest MSE (0.1574) among the regressors tested.
     - Its base learner is linear regression, which is already a low-variance model, so bagging has little variance left to reduce; the linear fit to a 0/1 target is the more likely limitation.

### 2. **Pasting (Voting) Regressor**
   - **MAE**: **0.2697**
   - **MSE**: **0.1401**
   - **Analysis**:
     - Pasting (Voting) Regressor outperforms Bagging Regressor with lower MAE (0.2697) and MSE (0.1401).
     - This suggests that the ensemble nature of the Voting Regressor provides better predictions, leveraging the strengths of multiple models.

### 3. **Random Forest Regressor**
   - **MAE**: **0.2595**
   - **MSE**: **0.1418**
   - **Analysis**:
     - Random Forest Regressor has the lowest MAE of all regressors (0.2595), with an MSE (0.1418) marginally above that of the Pasting (Voting) Regressor (0.1401).
     - Its error metrics are balanced: it ranks first on MAE and third on MSE.

### 4. **Gradient Boosting Regressor**
   - **MAE**: **0.2777**
   - **MSE**: **0.1335**
   - **Analysis**:
     - Gradient Boosting Regressor has the lowest MSE (0.1335) of the regressors tested.
     - Its MAE (0.2777) is slightly higher than that of Random Forest and Pasting (Voting). Because MSE penalizes large errors more heavily than MAE, this combination suggests fewer large errors but slightly larger typical errors.
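
How MAE and MSE can rank two models differently is easiest to see with a small worked example. The two error vectors below are made up for illustration and are not taken from the models above.

```python
import numpy as np

# Model A: moderate errors on every sample
errors_a = np.array([0.3, 0.3, 0.3, 0.3])
# Model B: perfect on three samples, one large miss
errors_b = np.array([0.0, 0.0, 0.0, 1.0])

mae_a, mse_a = np.mean(np.abs(errors_a)), np.mean(errors_a ** 2)
mae_b, mse_b = np.mean(np.abs(errors_b)), np.mean(errors_b ** 2)

print(f"Model A: MAE={mae_a:.4f}, MSE={mse_a:.4f}")
print(f"Model B: MAE={mae_b:.4f}, MSE={mse_b:.4f}")
```

Model A has the higher MAE but the lower MSE, because squaring makes the single large miss of model B dominate. A model with lower MSE and higher MAE therefore tends to resemble model A: errors that are more uniform in size, without large outliers.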

### 5. **AdaBoost Regressor**
   - **MSE**: **0.1952**
   - **Analysis**:
     - AdaBoost Regressor has the highest MSE (0.1952) of the regressors tested; its MAE was not computed, so it can only be compared on MSE.
     - It was run with depth-1 stumps and 50 estimators, a weaker configuration than the other regressors, which may explain part of the gap alongside AdaBoost's sensitivity to noise and outliers.

---

## **Key Takeaways:**

### **Classification Performance:**
- The **Pasting (Voting) Classifier** performs the best with an accuracy of **81.39%**, closely followed by **Gradient Boosting** at **80.71%**.
- **AdaBoost Classifier** performs the worst, with an accuracy of **79.20%**.
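
The accuracies above come from a single test split of 1,322 rows, so it helps to gauge how much of the gap between models could be sampling noise. A rough check, assuming each test prediction is an independent Bernoulli trial (a simplification, and one that ignores that all models share the same test set), is the binomial standard error:

```python
import math

n_test = 1322          # size of the test split reported above
best_acc = 0.8139      # voting classifier
worst_acc = 0.7920     # AdaBoost classifier

# Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n)
standard_error = math.sqrt(best_acc * (1 - best_acc) / n_test)
gap = best_acc - worst_acc

print(f"Standard error: {standard_error:.4f}")
print(f"Best-to-worst gap: {gap:.4f} ({gap / standard_error:.1f} standard errors)")
```

The standard error is roughly 0.011, so the differences among the top four classifiers are each within about one standard error, and only the gap to AdaBoost is noticeably larger.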

### **Regression Performance:**
- **Gradient Boosting** Regressor achieves the lowest **MSE (0.1335)**, the best result in terms of squared error.
- **Random Forest** and **Pasting** Regressors are solid performers with competitive MAE and MSE values.
- **AdaBoost** Regressor performs the worst in terms of **MSE (0.1952)**.

---

## **Conclusion:**

### **For Classification:**
- **Pasting (Voting) Classifier** is the best performing model, followed by **Gradient Boosting**.
- **AdaBoost** has the lowest accuracy and may require tuning for better performance.

### **For Regression:**
- **Gradient Boosting** performs the best in terms of **MSE**; its slightly higher MAE points to larger typical errors rather than to more large ones.
- **Random Forest** and **Pasting** are also strong contenders with balanced error metrics.
- **AdaBoost** is the least effective regressor here, with the highest **MSE**.
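
Since the models are separated by small margins on a single split, a more robust ranking could come from k-fold cross-validation, which reports a mean score and its spread across folds. The sketch below compares two of the classifiers on synthetic data; the data, the choice of 5 folds, and the `cv_results` dictionary are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the lab data
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

cv_results = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy for each model
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

If the fold-to-fold spread is larger than the gap between two models, the single-split ranking should be treated as inconclusive.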