# Bagging

Bagging, or bootstrap aggregating, is an ensemble method that trains multiple instances of the same model on different resampled versions of the training data. Specifically, the training data is randomly sampled with replacement to create multiple bootstrap samples, each typically the same size as the original training set. Each sample is used to train a model, and the final prediction combines the predictions of all models: an average for regression, or a majority vote for classification. This method is particularly useful for reducing overfitting and improving the stability and accuracy of the model. Since every model in the ensemble is of the same type, bagging does little to reduce bias; its benefit comes from reducing variance. Bagging is commonly used with decision trees, as this notebook explores below.
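
To make the sampling step concrete, here is a minimal sketch of a single bootstrap draw. It assumes a hypothetical dataset of 1,000 rows identified by integer indices; `n_samples`, the seed, and the variable names are illustrative and not part of the notebook's pipeline. Because sampling is with replacement, a bootstrap sample of size n contains on average about 1 - 1/e (roughly 63%) of the distinct original rows, which is why each model sees a slightly different dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
n_samples = 1000                # hypothetical dataset size

# Draw one bootstrap sample: n_samples row indices, sampled with replacement
bootstrap_indices = rng.choice(n_samples, size=n_samples, replace=True)

# Some rows appear several times, others not at all
unique_fraction = np.unique(bootstrap_indices).size / n_samples
print(f"Fraction of distinct rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"Theoretical limit 1 - 1/e: {1 - np.exp(-1):.3f}")
```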

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("fedesoriano/heart-failure-prediction")

print("Path to dataset files:", path)

# Convert to pandas dataframe
df = pd.read_csv(path + "/heart.csv")

In [None]:
print(df.head())

# Split data into features and target
X = df.drop("HeartDisease", axis=1)
y = df["HeartDisease"]

# Print number of positive versus negative samples
print("Number of positive samples:", np.sum(y == 1))
print("Number of negative samples:", np.sum(y == 0))

In [None]:
# Convert sex to numerical values
X['Sex'] = X['Sex'].map({'M': 0, 'F': 1})

# Convert chest pain type to numerical values
X['ChestPainType'] = X['ChestPainType'].map({'TA': 0, 'ATA': 1, 'NAP': 2, 'ASY': 3})

# Convert resting ECG to numerical values
X['RestingECG'] = X['RestingECG'].map({'Normal': 0, 'ST': 1, 'LVH': 2})

# Convert exercise angina to numerical values
X['ExerciseAngina'] = X['ExerciseAngina'].map({'N': 0, 'Y': 1})

# Convert ST slope to numerical values
X['ST_Slope'] = X['ST_Slope'].map({'Up': 0, 'Flat': 1, 'Down': 2})

print(X.head())

In [6]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Split the training set into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42)
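
As a quick check of what these two calls produce, the sketch below applies the same two-stage split to a hypothetical array of 1,000 row indices (the array and its size are assumptions for illustration, not the heart dataset). Taking 10% of the remaining 80% leaves roughly 72% of the data for training, 8% for validation, and 20% for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1000)  # hypothetical dataset of 1,000 rows

# Same two-stage split as above: 20% test, then 10% of the remainder for validation
train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)
train_rows, val_rows = train_test_split(train_rows, test_size=0.1, random_state=42)

for name, part in [("train", train_rows), ("validation", val_rows), ("test", test_rows)]:
    print(f"{name}: {len(part)} rows ({len(part) / len(rows):.0%})")
```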

# One model versus many

To make a fair comparison between a single model and an ensemble, we first need to tune the single model on the same data and find the hyperparameters that give it the best performance.

We set aside a small validation set above, which we use to compare single trees of varying depth. The same validation set is used later to choose the size of the ensemble. Once the best settings are found, we retrain on the combined training and validation data and make predictions on the test set.

In [None]:
# Determine the best single model based on depth
best_depth = 0
best_accuracy = 0
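for depth in range(1, 21):
    # Fix the seed so that ties between equally good splits are broken reproducibly
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_val)
    accuracy = accuracy_score(y_val, y_pred)
    print(f"Depth: {depth}, Accuracy: {accuracy}")
    # Strict comparison keeps the shallowest tree when several depths tie
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_depth = depth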

print(f"Best depth: {best_depth}, Best accuracy: {best_accuracy}")

# Retrain the best configuration on the combined training and validation sets
X_combined = pd.concat([X_train, X_val])
y_combined = pd.concat([y_train, y_val])
clf = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
clf.fit(X_combined, y_combined)

# Evaluate the single tree on the held-out test set
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Single-tree test accuracy: {accuracy}")

In [None]:
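# Build a bagging ensemble of DecisionTreeClassifier models by hand
np.random.seed(1337)  # seeds both the bootstrap draws and the trees' internal randomness
num_models = 30
max_depth = 10  # fixed tree depth; only the ensemble size is tuned on the validation set below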

models = []

for i in range(num_models):
    # Draw a bootstrap sample: len(X_train) row labels chosen with replacement, used to select matching X and y rows
    sample_indices = np.random.choice(X_train.index, size=len(X_train), replace=True)
    X_train_sample = X_train.loc[sample_indices]
    y_train_sample = y_train.loc[sample_indices]

    model = DecisionTreeClassifier(max_depth=max_depth)
    model.fit(X_train_sample, y_train_sample)
    models.append(model)

# Predict using all models
predictions = np.zeros((num_models, len(X_val)))

for i, model in enumerate(models):
    predictions[i] = model.predict(X_val)

# Determine the best number of models
best_num_models = 0
best_accuracy = 0
for i in range(1, num_models + 1):
    # Majority vote over the first i models; np.round sends an exact 0.5 tie to class 0
    final_predictions = np.round(np.mean(predictions[:i], axis=0))
    accuracy = accuracy_score(y_val, final_predictions)
    print(f"Number of models: {i}, Accuracy: {accuracy}")
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_num_models = i

print(f"Best number of models: {best_num_models}, Best accuracy: {best_accuracy}")
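
The aggregation above relies on `np.round` applied to the mean of 0/1 predictions. The sketch below, using a small hypothetical vote matrix, shows one subtlety: when an even number of models splits exactly evenly, `np.round(0.5)` returns 0.0 (NumPy rounds halves to the nearest even value), so ties fall to class 0. An explicit threshold is one way to make the tie-breaking rule visible; which class ties should favor is a modeling choice, not something the notebook prescribes.

```python
import numpy as np

# Hypothetical hard predictions from 4 models (rows) on 3 samples (columns)
votes = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
])

# Fraction of models voting for class 1 on each sample: 0.75, 0.25, 0.5
mean_vote = votes.mean(axis=0)

# np.round uses round-half-to-even, so the 0.5 tie becomes class 0
rounded = np.round(mean_vote).astype(int)

# An explicit threshold sends the same tie to class 1 instead
thresholded = (mean_vote >= 0.5).astype(int)

print("mean vote:  ", mean_vote)
print("np.round:   ", rounded)
print(">= 0.5 rule:", thresholded)
```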

In [None]:
# Retrain the best configuration on the combined training and validation sets
models = []

for i in range(best_num_models):
    # Draw a bootstrap sample from the combined set: len(X_combined) row labels chosen with replacement
    sample_indices = np.random.choice(X_combined.index, size=len(X_combined), replace=True)
    X_combined_sample = X_combined.loc[sample_indices]
    y_combined_sample = y_combined.loc[sample_indices]

    model = DecisionTreeClassifier(max_depth=max_depth)
    model.fit(X_combined_sample, y_combined_sample)
    models.append(model)

# Predict using all models
predictions = np.zeros((best_num_models, len(X_test)))

for i, model in enumerate(models):
    predictions[i] = model.predict(X_test)

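# Majority vote across the ensemble (np.round sends an exact 0.5 tie to class 0)
final_predictions = np.round(np.mean(predictions, axis=0))
accuracy = accuracy_score(y_test, final_predictions)
print(f"Bagged ensemble test accuracy: {accuracy}")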
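
For comparison, scikit-learn provides `BaggingClassifier`, which packages the same bootstrap-and-aggregate loop. The sketch below is not the notebook's method: it runs on synthetic data from `make_classification` (an assumption, so that the block is self-contained), and all `*_demo` names are illustrative. One difference worth knowing is that `BaggingClassifier` averages predicted class probabilities when the base model provides `predict_proba`, rather than taking a hard majority vote, so results can differ slightly from the manual loop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the heart-failure dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5, random_state=0
)
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# 30 depth-limited trees, each fit on a bootstrap sample of the training data
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(max_depth=10),  # base model, passed positionally
    n_estimators=30,
    bootstrap=True,
    random_state=0,
)
bagged_trees.fit(X_demo_train, y_demo_train)

demo_accuracy = accuracy_score(y_demo_test, bagged_trees.predict(X_demo_test))
print(f"BaggingClassifier accuracy on synthetic data: {demo_accuracy:.3f}")
```

Passing the base model positionally sidesteps the keyword rename (`base_estimator` to `estimator`) between scikit-learn versions.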
