<center>
 <img src = "JHU.png"  width="200" alt="Johns Hopkins University logo"/>
</center>

## Heart Failure Disease Classification Using Ensemble Machine Learning Techniques

Estimated time needed: **60** minutes

### Overview:

In this hands-on lab, we will build ensembles of classifiers with the Bagging method and compare their performance to that of single (regular) classifiers. The task is to classify heart failure disease using the provided **heart_dataset**. The steps are: create basic classifiers, form ensembles, and evaluate both with cross-validation. The classifiers in focus are GaussianNB, a linear-kernel SVC, MLPClassifier, DecisionTreeClassifier, and RandomForestClassifier.

### Learning objectives:

- Load and preprocess data, including feature scaling and one-hot encoding.
- Train and evaluate multiple classifiers using 10-fold cross-validation.
- Build and evaluate ensemble models using the Bagging method.
- Implement custom ensemble functions for subsampling and majority voting.
- Compare and visualize performance between regular classifiers and ensemble models.


### Dataset Information:

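   **heart_dataset** contains features related to heart health and patient outcomes: numerical columns, categorical columns (`Sex`, `ChestPainType`, `RestingECG`, `ExerciseAngina`, `ST_Slope`), and the target column `HeartDisease`.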


### Step 1: Load and Preprocess the Dataset
- Use the `heart_dataset.csv` dataset from the module content.
- Load the dataset into your development environment.
- Examine the features and note that they include a mixture of numerical and nominal (categorical) data.
- **Hint:** Use appropriate pre-processing techniques such as converting nominal data to numerical data using `OneHotEncoder`.
- Ensure your preprocessing pipeline is correct. It might be helpful to start by running a baseline classifier.


In [None]:
#Load the heart failure dataset and preprocess it.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [None]:
#Load the dataset
df = pd.read_csv('heart_dataset.csv')

In [None]:
# Write your code here!
# Encode categorical features
categorical_cols = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']

# Split into features and target
X = df_encoded.drop('HeartDisease', axis=1)

# Scale the features


# Split the dataset into training and testing sets




<details>
<summary>Click here to view/hide solution.</summary>

```   
# Encode categorical features
categorical_cols = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
```


```
# Split into features and target
X = df_encoded.drop('HeartDisease', axis=1)
y = df_encoded['HeartDisease']
```

```
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

```
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
```
</details>
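
The hint above mentions `OneHotEncoder`, while the solution uses `pd.get_dummies` and scales before splitting. As an alternative, here is a minimal sketch that splits first and then learns the encoding and scaling from the training rows only, so no test-set statistics leak into preprocessing. The toy DataFrame, the `Age` column, and the category values are invented for illustration and are not taken from `heart_dataset.csv`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for heart_dataset.csv; all values are invented for illustration.
toy = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 39, 45, 58],
    "Sex": ["M", "F", "M", "F", "M", "M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ASY"],
    "HeartDisease": [0, 1, 0, 1, 0, 0, 0, 1],
})
X_toy = toy.drop("HeartDisease", axis=1)
y_toy = toy["HeartDisease"]

# Split first (here simply: first six rows for training, last two for testing) ...
X_train_toy, X_test_toy = X_toy.iloc[:6], X_toy.iloc[6:]

# ... then fit the scaler and the encoder on the training rows only.
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "ChestPainType"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
X_train_prepared = preprocess.fit_transform(X_train_toy)
X_test_prepared = preprocess.transform(X_test_toy)  # reuse the training statistics

print(X_train_prepared.shape, X_test_prepared.shape)
```

With `handle_unknown="ignore"`, a category that appears only in the test rows is encoded as all zeros instead of raising an error.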

### Step 2: 10-Fold Cross-Validation of Basic Classifiers
-  Perform 10-fold cross-validation for the following classifiers using default parameters:
  - **GaussianNB**
  - **Linear SVC** (`SVC(kernel='linear', probability=True)`)
  - **MLPClassifier**
  - **DecisionTreeClassifier**
- **Hint:** Use `cross_val_score` to perform cross-validation.
- **Bonus:** Report the performance of `RandomForestClassifier` (no need for CV since it’s already an ensemble).


> **Note**: Running 10-fold cross-validation for every classifier can take a while, especially for slower models such as MLPClassifier and the linear SVC with `probability=True`. Please be patient while the computation completes; each run fits and evaluates one model per fold.

In [None]:
# Load necessary libraries
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Task: Perform 10-fold cross-validation on the specified classifiers and report their performance.
# Initialize classifiers
# Write your code here! 

classifiers = {

    
}
# 10-fold CV for each classifier




<details>
<summary>Click here to view/hide solution.</summary>

```
# Task: Perform 10-fold cross-validation on the specified classifiers and report their performance.

# Initialize classifiers
classifiers = {
    'GaussianNB': GaussianNB(),
    'Linear SVC': SVC(kernel='linear', probability=True),
    'MLPClassifier': MLPClassifier(),
    'DecisionTreeClassifier': DecisionTreeClassifier(),
    'RandomForestClassifier': RandomForestClassifier()
}

# 10-fold CV for each classifier
cv_results = {}
for name, clf in classifiers.items():
    if name != 'RandomForestClassifier':
        scores = cross_val_score(clf, X_train, y_train, cv=10)
        cv_results[name] = scores.mean()
    else:
        clf.fit(X_train, y_train)
        cv_results[name] = clf.score(X_test, y_test)

print(cv_results)
    
```
</details>
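
To make clear what `cross_val_score` does with `cv=10`, the sketch below runs the same folds by hand and compares the result with the library call. It uses synthetic data from `make_classification` as a stand-in for the heart dataset, and the names `X_demo`, `y_demo`, and `manual_scores` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (not the heart dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# For classifiers, an integer cv is interpreted as stratified k-fold splitting
folds = StratifiedKFold(n_splits=10)

manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    clf = GaussianNB()  # fresh, unfitted model for every fold
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(clf.score(X_demo[val_idx], y_demo[val_idx]))

library_scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=folds)

print(f"manual:  {np.mean(manual_scores):.3f} +/- {np.std(manual_scores):.3f}")
print(f"library: {library_scores.mean():.3f} +/- {library_scores.std():.3f}")
```

Reporting the standard deviation next to the mean gives a rough sense of how much the score varies from fold to fold.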

### Step 3: Create Weak Classifiers and Ensembles
-  Create an ensemble of 100 classifiers for each of the basic classifiers from Step 2.
- To generate weak classifiers within your ensembles:
  - For **MLPClassifier**, set hidden sizes to `(3, 3)`, max iterations to `30`, and tolerance to `1e-1`.
  - For **DecisionTreeClassifier**, set max depth to `5` and max features to `5`.
- **Task:** Report the performance of the first classifier in each ensemble.
  - **Hint:** Just run the first classifier in each ensemble on your test data and compare results.


In [None]:
# Load necessary libraries
import random
from sklearn.ensemble import BaggingClassifier

In [None]:
# Write your code here! 
# Generate an ensemble of 100 classifiers for the specified models with underpowered hyperparameters.

def create_ensemble(base_classifier, n_estimators=100, **params):


# Create weak ensembles



<details>
<summary>Click here to view/hide solution.</summary>

```
# Generate an ensemble of 100 classifiers for the specified models with underpowered hyperparameters.
    
def create_ensemble(base_classifier, n_estimators=100, **params):
    ensemble = [base_classifier(**params) for _ in range(n_estimators)]
    return ensemble

```

```
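# Create weak ensembles
mlp_ensemble = create_ensemble(MLPClassifier, hidden_layer_sizes=(3,3), max_iter=30, tol=1e-1)
dt_ensemble = create_ensemble(DecisionTreeClassifier, max_depth=5, max_features=5)

# GaussianNB and the linear SVC keep their Step 2 settings
nb_ensemble = create_ensemble(GaussianNB)
svc_ensemble = create_ensemble(SVC, kernel='linear', probability=True)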
```
</details>
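
The task of reporting the first classifier in each ensemble could be approached as in the sketch below. It is self-contained, so it uses synthetic data and small three-member ensembles rather than the lab's variables; `demo_ensembles` and `first_scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Small ensembles built with the same weak hyperparameters as in the lab
demo_ensembles = {
    "MLP": [MLPClassifier(hidden_layer_sizes=(3, 3), max_iter=30, tol=1e-1) for _ in range(3)],
    "DT": [DecisionTreeClassifier(max_depth=5, max_features=5) for _ in range(3)],
}

# Fit only the first member of each ensemble and score it on the test split
first_scores = {}
for name, ensemble in demo_ensembles.items():
    first = ensemble[0]
    first.fit(X_tr, y_tr)
    first_scores[name] = first.score(X_te, y_te)

print(first_scores)
```

The weak MLP will likely emit a convergence warning because of the low `max_iter`; that is expected for a deliberately underpowered model.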

### Step 4: Implement Ensemble Training with Bagging
  - Write a function `ensemble_fit()` to train your ensemble using the Bagging method.
  - **Hint:** Use `random.sample` to create subsets of your training data for each classifier in the ensemble.
  - Ensure that each classifier is trained on its own randomly drawn subset of the training data.


In [None]:
# Write your code here! 
# Write a function to train the ensemble on subsets of the data.

def ensemble_fit(ensemble, X_train, y_train, subsample_ratio=0.2):

    
    

<details>
<summary>Click here to view/hide solution.</summary>
    
```
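# Train each classifier in the ensemble on its own random subset of the data.
def ensemble_fit(ensemble, X_train, y_train, subsample_ratio=0.2):
    # Number of rows each classifier sees (at least one)
    n_samples = max(1, int(len(X_train) * subsample_ratio))
    # Pair every feature row with its label once, outside the loop
    samples = list(zip(X_train, y_train))
    for clf in ensemble:
        # random.sample draws without replacement, so rows are unique within a subset
        X_sub, y_sub = zip(*random.sample(samples, n_samples))
        clf.fit(X_sub, y_sub)

ensemble_fit(mlp_ensemble, X_train, y_train, subsample_ratio=0.2)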
```
</details>
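
The subsampling step can also be written with row indices, which avoids zipping and unzipping the data. The sketch below is one possible helper (the name `draw_subsample` and its `replace` flag are illustrative, not part of the lab). It also shows the difference between drawing without replacement, as `random.sample` does, and drawing with replacement, as in classic bootstrap bagging.

```python
import random
import numpy as np

def draw_subsample(X, y, subsample_ratio, rng, replace=False):
    """Return a random subset of rows from X and y, plus the chosen indices."""
    n_rows = len(X)
    n_samples = max(1, int(n_rows * subsample_ratio))
    if replace:
        # Bootstrap-style: the same row may be drawn more than once
        idx = [rng.randrange(n_rows) for _ in range(n_samples)]
    else:
        # Same behaviour as random.sample in the lab: every row at most once
        idx = rng.sample(range(n_rows), n_samples)
    return X[idx], y[idx], idx

# Tiny demo arrays: 20 rows, 2 features
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20) % 2

rng = random.Random(0)  # seeded so the draw is repeatable
X_sub, y_sub, idx = draw_subsample(X_demo, y_demo, subsample_ratio=0.2, rng=rng)
print(idx, X_sub.shape, y_sub.shape)
```

This assumes `X` and `y` are NumPy arrays; a pandas Series such as `y_train` would need `.to_numpy()` first.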

### Step 5: Implement Ensemble Prediction with Voting
  -  Write a function `ensemble_predict()` to make predictions using your trained ensemble.
  - **Hint:** Aggregate the predictions from all classifiers in the ensemble using a majority voting scheme.
  - **Note:** Count the votes for each test sample (for example with `np.bincount()`) and use `argmax()` to pick the class with the most votes.


In [None]:
# Load necessary libraries
import numpy as np

In [None]:
# Write your code here! 
# Task: Implement the ensemble_predict() function to make predictions using the trained ensemble.
def ensemble_predict(ensemble, X_test):


# Example of predicting with one ensemble
y_pred = ensemble_predict(mlp_ensemble, X_test)

<details>
<summary>Click here to view/hide solution.</summary>
    
```
# Task: Implement the ensemble_predict() function to make predictions using the trained ensemble.

def ensemble_predict(ensemble, X_test):
    predictions = np.array([clf.predict(X_test) for clf in ensemble])
    # Majority vote for the final prediction
    final_prediction = np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=predictions)
    return final_prediction

# Example of predicting with one ensemble
y_pred = ensemble_predict(mlp_ensemble, X_test)
```
    
</details>
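
As a small worked example of majority voting, the sketch below counts votes per class explicitly and then applies `np.argmax()` across the counts. The helper name `majority_vote` and the prediction matrix are made up for illustration; each row holds one classifier's predictions and each column one test sample.

```python
import numpy as np

def majority_vote(predictions, classes):
    """Pick, for every sample (column), the class predicted by most classifiers."""
    predictions = np.asarray(predictions)
    classes = np.asarray(classes)
    # vote_counts[i, j] = number of classifiers that predicted classes[i] for sample j
    vote_counts = np.stack([(predictions == c).sum(axis=0) for c in classes])
    return classes[np.argmax(vote_counts, axis=0)]

predictions = np.array([
    [0, 1, 1],  # classifier 1
    [1, 1, 0],  # classifier 2
    [0, 1, 0],  # classifier 3
])
final = majority_vote(predictions, classes=[0, 1])
print(final)  # [0 1 0]
```

Because `np.argmax()` returns the first maximum, a tie goes to the class listed first in `classes`; using an odd number of classifiers avoids ties in the two-class case.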

### Step 6: Evaluate Ensemble Performance with Different Subsample Ratios
- Perform 10-fold CV for the ensembles with a subsample ratio of `0.2`.
- Compare these results to a regular decision tree trained on the same subsample ratio.
- Repeat the process with a subsample ratio of `0.05`.
- **Hint:** You can use the same function from Step 4 to train on different subsample ratios.


In [None]:
# Write your code here! 
# Perform 10-fold CV for ensembles with different subsample ratios and compare them to regular classifiers.
def evaluate_ensemble(ensemble, X_train, y_train, X_test, y_test, subsample_ratio):


# Perform CV with different subsample ratios
subsample_ratios = [0.2, 0.05]


# Compare to regular classifiers



<details>
<summary>Click here to view/hide solution.</summary>
    
```
# Evaluate ensembles with different subsample ratios and compare them to a regular classifier.
# Note: for simplicity this solution scores a single held-out test split rather than running a full 10-fold CV.

def evaluate_ensemble(ensemble, X_train, y_train, X_test, y_test, subsample_ratio):
    # Re-train every member on a fresh subsample, then score the majority vote on the test set
    ensemble_fit(ensemble, X_train, y_train, subsample_ratio=subsample_ratio)
    y_pred = ensemble_predict(ensemble, X_test)
    accuracy = np.mean(y_pred == y_test)
    return accuracy

# Evaluate with different subsample ratios
subsample_ratios = [0.2, 0.05]
ensemble_accuracies = {}
for ratio in subsample_ratios:
    ensemble_accuracies[f'MLP_{ratio}'] = evaluate_ensemble(mlp_ensemble, X_train, y_train, X_test, y_test, ratio)
    ensemble_accuracies[f'DT_{ratio}'] = evaluate_ensemble(dt_ensemble, X_train, y_train, X_test, y_test, ratio)
    # Compare to a regular decision tree trained on the same subsample ratio
    # (a one-element list lets it reuse the ensemble code path)
    ensemble_accuracies[f'Regular_DT_{ratio}'] = evaluate_ensemble([DecisionTreeClassifier()], X_train, y_train, X_test, y_test, ratio)

ensemble_accuracies
```
</details>
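
The step asks for 10-fold CV, which `cross_val_score` cannot run directly on a plain list of classifiers. One way to do it by hand is sketched below: for every fold, a fresh ensemble is trained on subsamples of the training part and its majority vote is scored on the validation part. The function name `cv_ensemble`, its parameters, and the synthetic data are assumptions made for this example; the ensemble size is kept small so the demo runs quickly.

```python
import random
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def cv_ensemble(base_clf, X, y, subsample_ratio, n_estimators=10, n_splits=10, seed=0):
    """10-fold CV accuracy of a subsampled, majority-voting ensemble (one score per fold)."""
    rng = random.Random(seed)
    fold_scores = []
    for train_idx, val_idx in StratifiedKFold(n_splits=n_splits).split(X, y):
        n_sub = max(1, int(len(train_idx) * subsample_ratio))
        members = []
        for _ in range(n_estimators):
            # Each member gets its own subsample of this fold's training rows
            sub_idx = rng.sample(list(train_idx), n_sub)
            members.append(clone(base_clf).fit(X[sub_idx], y[sub_idx]))
        # Rows: classifiers, columns: validation samples
        votes = np.array([m.predict(X[val_idx]) for m in members])
        y_pred = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
        fold_scores.append(np.mean(y_pred == y[val_idx]))
    return np.array(fold_scores)

# Synthetic stand-in data (not the heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
weak_tree = DecisionTreeClassifier(max_depth=5, max_features=5)

for ratio in (0.2, 0.05):
    scores = cv_ensemble(weak_tree, X_demo, y_demo, subsample_ratio=ratio)
    print(ratio, round(scores.mean(), 3))
```

The sketch expects NumPy arrays with non-negative integer labels; with the lab's data, `y_train.to_numpy()` would be needed. Passing `n_estimators=1` gives the corresponding regular classifier on the same subsample ratio.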

### Step 7: Evaluate Performance Across Multiple Subsample Ratios
- Report the 10-fold CV performances of the ensembles for subsample ratios of:
  - `0.005`, `0.01`, `0.03`, `0.05`, `0.1`, `0.2`
- Train regular versions of the classifiers from Step 2 on these subsample ratios and report their performance.
  - **Hint:** You can use a list containing one element (the regular classifier) to pass through the same ensemble CV function.


In [None]:
# Write your code here! 
# Evaluate performance of ensembles at different subsample ratios and compare to regular classifiers.
subsample_ratios = [0.005, 0.01, 0.03, 0.05, 0.1, 0.2]


# Evaluate regular classifiers
regular_results = {}
for ratio in subsample_ratios:
    


<details>
<summary>Click here to view/hide solution.</summary>
    
```
# Evaluate performance of ensembles at different subsample ratios and compare to regular classifiers.

subsample_ratios = [0.005, 0.01, 0.03, 0.05, 0.1, 0.2]
ensemble_cv_results = {}

for ratio in subsample_ratios:
    ensemble_cv_results[f'MLP_{ratio}'] = evaluate_ensemble(mlp_ensemble, X_train, y_train, X_test, y_test, ratio)
    ensemble_cv_results[f'DT_{ratio}'] = evaluate_ensemble(dt_ensemble, X_train, y_train, X_test, y_test, ratio)

# Evaluate regular classifiers on the same subsample ratios
# (a one-element list reuses the ensemble fit/predict code path)
regular_results = {}
for ratio in subsample_ratios:
    regular_dt = DecisionTreeClassifier(max_depth=5, max_features=5)
    regular_results[f'Regular_DT_{ratio}'] = evaluate_ensemble([regular_dt], X_train, y_train, X_test, y_test, ratio)

ensemble_cv_results.update(regular_results)
ensemble_cv_results
```
</details>
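
As a cross-check for the hand-written functions, scikit-learn's `BaggingClassifier` (imported in Step 3) can run a similar experiment. The sketch below is an assumption about how one might set it up, not part of the lab's solution: `max_samples` plays the role of the subsample ratio and `bootstrap=False` draws without replacement, like `random.sample`. Synthetic data stands in for the heart dataset, and only the larger ratios are used to keep every subsample non-trivial.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the heart dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

bagging_scores = {}
for ratio in [0.05, 0.1, 0.2]:
    bagger = BaggingClassifier(
        DecisionTreeClassifier(max_depth=5, max_features=5),  # weak base learner
        n_estimators=20,
        max_samples=ratio,   # fraction of training rows given to each member
        bootstrap=False,     # sample without replacement
        random_state=0,
    )
    # BaggingClassifier is a regular estimator, so cross_val_score works directly
    bagging_scores[ratio] = cross_val_score(bagger, X_demo, y_demo, cv=10).mean()

print(bagging_scores)
```

One difference to keep in mind: when the base estimator provides `predict_proba`, `BaggingClassifier` averages predicted probabilities rather than counting hard votes, so its numbers need not match the custom ensemble exactly.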

### Step 8: Plot Performance Comparisons
- For each classifier, plot the performance of the ensemble at different subsample ratios.
- On the same plot, include the performance of the regular classifier at the same subsample ratios.
- **Hint:** Use two different colors to distinguish between ensemble and regular classifier performances.
- **Deliverable:** You should have four plots, one for each classifier type.
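
One possible layout for the four-plot deliverable is a 2x2 grid with one panel per classifier type. The sketch below assumes the results dictionary uses keys of the form `<name>_<ratio>` for ensembles and `Regular_<name>_<ratio>` for regular classifiers, following the `DT_0.2` / `Regular_DT_0.2` pattern; the function name `plot_all` and the log-scaled x-axis are choices made for this example.

```python
import matplotlib.pyplot as plt

def plot_all(results, subsample_ratios, names):
    """Draw one ensemble-vs-regular panel per classifier name in a 2x2 grid."""
    fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharey=True)
    for ax, name in zip(axes.ravel(), names):
        ensemble_acc = [results[f'{name}_{r}'] for r in subsample_ratios]
        regular_acc = [results[f'Regular_{name}_{r}'] for r in subsample_ratios]
        ax.plot(subsample_ratios, ensemble_acc, marker='o', color='blue', label='Ensemble')
        ax.plot(subsample_ratios, regular_acc, marker='o', color='orange', label='Regular')
        ax.set_xscale('log')  # the ratios span more than one order of magnitude
        ax.set_xlabel('Subsample Ratio')
        ax.set_ylabel('Accuracy')
        ax.set_title(name)
        ax.legend()
    fig.tight_layout()
    return fig
```

Once results for all four classifier types have been collected under matching keys, a call such as `plot_all(ensemble_cv_results, subsample_ratios, ['NB', 'SVC', 'MLP', 'DT'])` followed by `plt.show()` would produce the figure; the short names `NB` and `SVC` are placeholders.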

In [None]:
# Load necessary libraries
import matplotlib.pyplot as plt

In [None]:
# Write your code here! 
# Plot the performance of the ensembles vs regular classifiers at different subsample ratios.
def plot_performance(ensemble_cv_results, title):

    

<details>
<summary>Click here to view/hide solution.</summary>
    
```
# Plot the performance of the ensembles vs regular classifiers at different subsample ratios.

import matplotlib.pyplot as plt

def plot_performance(ensemble_cv_results, title):
    subsample_ratios = [0.005, 0.01, 0.03, 0.05, 0.1, 0.2]
    # Compare like with like: the decision-tree ensemble against the regular decision tree
    ensemble_acc = [ensemble_cv_results[f'DT_{r}'] for r in subsample_ratios]
    regular_acc = [ensemble_cv_results[f'Regular_DT_{r}'] for r in subsample_ratios]

    plt.plot(subsample_ratios, ensemble_acc, label='Ensemble', color='blue')
    plt.plot(subsample_ratios, regular_acc, label='Regular', color='orange')
    plt.xlabel('Subsample Ratio')
    plt.ylabel('Accuracy')
    plt.title(title)
    plt.legend()
    plt.show()

plot_performance(ensemble_cv_results, "Decision Tree: Ensemble vs Regular")
```
</details>

### Summary:

In this lab, you practiced building multiple classifiers and ensembles using the Bagging method, evaluating their performance using cross-validation. You learned how to create and evaluate weak classifiers, generate ensembles, subsample data, and compare the performance of ensemble models to regular classifiers.

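This process demonstrated how Bagging can improve accuracy and stability: combining the votes of many weak classifiers, each trained on a different subsample, reduces the variance of the final prediction and with it the tendency to overfit.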