## Ensemble Machine Learning P5

# Ensemble ML, Spiral with Wine
# **Author:** Brenda Fuemmeler
# **Date:** November 21, 2025
# **Objective:** Use Ensemble Models to improve performance over individual models

## Introduction
This project uses the Wine dataset to show how ensemble models perform compared to individual models.

## Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
    VotingClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)


## Section 1. Load and Inspect the data

### 1.1 Load the Wine dataset and display the first 10 rows


In [4]:
# Load the dataset (download from UCI and save in the same folder)
df = pd.read_csv("winequalityred.csv", sep=";")

# Display structure and first few rows
df.info()
print(df.head(10))  # first 10 rows; a bare df.head() mid-cell displays nothing

# The dataset includes 11 physicochemical input variables (features):
# ---------------------------------------------------------------
# - fixed acidity          mostly tartaric acid
# - volatile acidity       mostly acetic acid (vinegar)
# - citric acid            can add freshness and flavor
# - residual sugar         remaining sugar after fermentation
# - chlorides              salt content
# - free sulfur dioxide    protects wine from microbes
# - total sulfur dioxide   sum of free and bound forms
# - density                related to sugar content
# - pH                     acidity level (lower = more acidic)
# - sulphates              antioxidant and microbial stabilizer
# - alcohol                % alcohol by volume

# The target variable is:
# - quality (integer score from 0 to 10, rated by wine tasters)

# We will simplify this target into three categories:
#   - low (3–4), medium (5–6), high (7–8) to make classification feasible.
#   - we will also make this numeric (we want both for clarity)
# The dataset contains 1599 samples and 12 columns (11 features + target).



# Load spiral dataset
spiral = pd.read_csv("spiral.csv")

# Display basic information
spiral.info()

# Display first few rows
print(spiral.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 798 entries, 0 to 797
Data columns (total 1 columns):
 #   Column     Non-Nu

## Section 2. Prepare the Data
# Includes cleaning, feature engineering, encoding, splitting, helper functions
# In this section we will take the numeric wine quality score and convert it into two new forms: a text label and a numeric category. This will help us interpret and visualize the data in our models going forward.

In [5]:
# Define helper function that:

# Takes one input, the quality (which we will temporarily name q while in the function)
# And returns a string of the quality label (low, medium, high)
# This function will be used to create the quality_label column
def quality_to_label(q):
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"


# Call the apply() method on the quality column to create the new quality_label column
df["quality_label"] = df["quality"].apply(quality_to_label)


# Then, create a numeric column for modeling: 0 = low, 1 = medium, 2 = high
def quality_to_number(q):
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2


df["quality_numeric"] = df["quality"].apply(quality_to_number)
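
The same three-way binning can also be written with `pd.cut`, which produces the text label and the numeric code in one pass. This is only a sketch of an alternative, not the approach used in this notebook: the sample scores are made up for illustration, and the bin edges are assumed to mirror the thresholds in `quality_to_label` (up to 4 is low, 5 to 6 is medium, 7 and above is high).

```python
import pandas as pd

# Hypothetical quality scores for illustration (not rows from the dataset)
quality = pd.Series([3, 4, 5, 6, 7, 8])

# Bins are right-inclusive by default: (-inf, 4] low, (4, 6] medium, (6, inf] high
labels = pd.cut(
    quality,
    bins=[float("-inf"), 4, 6, float("inf")],
    labels=["low", "medium", "high"],
)

# Category codes follow the label order: 0 = low, 1 = medium, 2 = high
codes = labels.cat.codes

print(pd.DataFrame({"quality": quality, "label": labels, "code": codes}))
```

Keeping the label and the code derived from one set of bin edges avoids the two helper functions drifting apart if the thresholds change.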


## Section 3. Feature Selection and Justification
# First we define the input features (X) and the target (y). We drop the Quality, Quality Label and Quality Numeric columns from the inputs and keep the remaining 11 columns. This ensures the model is trained to predict wine quality using only the chemical measurements.

In [6]:
# Define input features (X) and target (y)
# Features: all columns except 'quality', 'quality_label' and 'quality_numeric' - drop these from the input array
# Target: quality_numeric (the numeric category column we just created)
X = df.drop(columns=["quality", "quality_label", "quality_numeric"])  # Features
y = df["quality_numeric"]  # Target

## Section 4. Split the Data into Train and Test

In [7]:
# Train/test split (stratify to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
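
As a quick illustration of what `stratify=y` does, the sketch below splits a synthetic, imbalanced label array and compares the class shares in the two halves. The class counts are hypothetical and chosen only to make the proportions easy to read; they are not the wine class counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (hypothetical counts, for illustration only)
y_demo = np.array([0] * 50 + [1] * 800 + [2] * 150)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of each class in the train and test portions
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)

print("train share:", train_share.round(3))
print("test share: ", test_share.round(3))
```

With stratification the two sets of shares should agree closely, so the rare class is not left out of the test set by chance.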

## Section 5.  Evaluate Model Performance (Choose 2)
# 1 Random Forest (100)	A strong baseline model using 100 decision trees.
# 2 Bagging (DT, 100)	Builds many trees in parallel on different samples.

In [8]:
# Helper function to train and evaluate models
def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    train_f1 = f1_score(y_train, y_train_pred, average="weighted")
    test_f1 = f1_score(y_test, y_test_pred, average="weighted")

    print(f"\n{name} Results")
    print("Confusion Matrix (Test):")
    print(confusion_matrix(y_test, y_test_pred))
    print(f"Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")
    print(f"Train F1 Score: {train_f1:.4f}, Test F1 Score: {test_f1:.4f}")

    results.append(
        {
            "Model": name,
            "Train Accuracy": train_acc,
            "Test Accuracy": test_acc,
            "Train F1": train_f1,
            "Test F1": test_f1,
        }
    )

In [10]:
# 1. Random Forest
results = []

evaluate_model(
    "Random Forest (100)",
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)


Random Forest (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 256   8]
 [  0  15  28]]
Train Accuracy: 1.0000, Test Accuracy: 0.8875
Train F1 Score: 1.0000, Test F1 Score: 0.8661


# Explanation of Random Forest results: 
Row 1 Actual Low-Quality: [0 13 0]
0 predicted correctly
13 mislabeled as medium
0 mislabeled as high

Row 2 Actual Medium-Quality: [0 256 8]
256 correctly predicted
8 mislabeled as high

Row 3 Actual High-Quality: [0 15 28]
28 predicted correctly
15 mislabeled as medium

The predictions lean heavily toward the medium class. The model identifies high-quality wines reasonably well (28 of 43) but fails completely on low-quality wines, labeling all 13 of them as medium.
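
The row-by-row reading above amounts to computing per-class recall: the diagonal entry divided by the row total. A minimal sketch, using the Random Forest test confusion matrix exactly as printed above (the variable names are illustrative):

```python
import numpy as np

# Random Forest test confusion matrix from the output above
# (rows = actual class, columns = predicted class)
cm = np.array([
    [0, 13, 0],
    [0, 256, 8],
    [0, 15, 28],
])

class_names = ["low", "medium", "high"]

# Recall per class = correctly predicted samples / actual samples in that class
recall = cm.diagonal() / cm.sum(axis=1)

for name, r in zip(class_names, recall):
    print(f"{name:>6}: recall = {r:.3f}")
```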

In [11]:
# 2. Bagging
# Reuse the results list from the Random Forest cell so both models are kept

evaluate_model(
    "Bagging (DT, 100)",
    BaggingClassifier(
        estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42
    ),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)


Bagging (DT, 100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 252  12]
 [  0  12  31]]
Train Accuracy: 1.0000, Test Accuracy: 0.8844
Train F1 Score: 1.0000, Test F1 Score: 0.8655


# Explanation of Bagging results: 
Row 1 Actual Low-Quality: [0 13 0]
0 predicted correctly
13 mislabeled as medium
0 mislabeled as high

Row 2 Actual Medium-Quality: [0 252 12]
252 correctly predicted
12 mislabeled as high

Row 3 Actual High-Quality: [0 12 31]
31 predicted correctly
12 mislabeled as medium

This model also strongly favors the medium class. It performed poorly on low-quality wines, while performing moderately on high-quality wines. Most mistakes center around predicting medium instead of low or high.

## Section 6. Compare Results
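Random Forest (100) and Bagging showed very similar patterns. Both models had difficulty with low-quality wines, predicting none of the 13 test samples correctly and misclassifying all of them as medium. Random Forest was more accurate on the medium class (256 vs. 252 correct), while Bagging was more accurate on the high class (31 vs. 28 correct).
Random Forest has slightly higher overall test accuracy (0.8875 vs. 0.8844, a difference of one test sample) and weighted F1 (0.8661 vs. 0.8655). The summary table below uses macro-averaged F1, which weights all three classes equally, and on that metric Bagging is slightly ahead (0.5508 vs. 0.5477).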
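
The size of the difference between the two models can be read straight off the two test confusion matrices reported above. The sketch below only re-counts those printed values; nothing is retrained, and the variable names are illustrative.

```python
import numpy as np

# Test confusion matrices copied from the outputs above
# (rows = actual class, columns = predicted class; order: low, medium, high)
matrices = {
    "Random Forest (100)": np.array([[0, 13, 0], [0, 256, 8], [0, 15, 28]]),
    "Bagging (DT, 100)": np.array([[0, 13, 0], [0, 252, 12], [0, 12, 31]]),
}

summary = {}
for name, cm in matrices.items():
    correct = int(np.trace(cm))  # sum of the diagonal = correct predictions
    total = int(cm.sum())        # all test samples
    summary[name] = (correct, total)
    print(
        f"{name}: {correct}/{total} correct "
        f"(accuracy {correct / total:.4f}), "
        f"per-class correct = {cm.diagonal().tolist()}"
    )
```

Counting this way makes it clear how many test samples separate the two accuracy figures, which helps judge whether the gap is meaningful.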

In [15]:
# Create a table of results
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.metrics import accuracy_score, f1_score
import pandas as pd

# Create a list to store results
results = []

# ----- 1. Bagging (Decision Trees) -----
bagging = BaggingClassifier(n_estimators=100, random_state=42)
bagging.fit(X_train, y_train)

# Predictions
y_train_pred_bag = bagging.predict(X_train)
y_test_pred_bag = bagging.predict(X_test)

# Metrics (F1 is macro-averaged here; evaluate_model above uses average="weighted")
train_acc_bag = accuracy_score(y_train, y_train_pred_bag)
test_acc_bag = accuracy_score(y_test, y_test_pred_bag)
train_f1_bag = f1_score(y_train, y_train_pred_bag, average='macro')
test_f1_bag = f1_score(y_test, y_test_pred_bag, average='macro')

results.append({
    "Model": "Bagging (DT, 100)",
    "Train Accuracy": train_acc_bag,
    "Test Accuracy": test_acc_bag,
    "Train F1": train_f1_bag,
    "Test F1": test_f1_bag,
    "Accuracy Gap": train_acc_bag - test_acc_bag,
    "F1 Gap": train_f1_bag - test_f1_bag
})

# ----- 2. Random Forest -----
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predictions
y_train_pred_rf = rf.predict(X_train)
y_test_pred_rf = rf.predict(X_test)

# Metrics (macro-averaged F1, same as for Bagging above)
train_acc_rf = accuracy_score(y_train, y_train_pred_rf)
test_acc_rf = accuracy_score(y_test, y_test_pred_rf)
train_f1_rf = f1_score(y_train, y_train_pred_rf, average='macro')
test_f1_rf = f1_score(y_test, y_test_pred_rf, average='macro')

results.append({
    "Model": "Random Forest (100)",
    "Train Accuracy": train_acc_rf,
    "Test Accuracy": test_acc_rf,
    "Train F1": train_f1_rf,
    "Test F1": test_f1_rf,
    "Accuracy Gap": train_acc_rf - test_acc_rf,
    "F1 Gap": train_f1_rf - test_f1_rf
})

# ----- 3. Create summary table -----
results_df = pd.DataFrame(results)

# Sort by Test Accuracy (descending)
results_df = results_df.sort_values(by="Test Accuracy", ascending=False).reset_index(drop=True)

print("\nSummary of All Models (Sorted by Test Accuracy):")
display(results_df)



Summary of All Models (Sorted by Test Accuracy):


| | Model | Train Accuracy | Test Accuracy | Train F1 | Test F1 | Accuracy Gap | F1 Gap |
|---|---|---|---|---|---|---|---|
| 0 | Random Forest (100) | 1.0 | 0.8875 | 1.0 | 0.547722 | 0.1125 | 0.452278 |
| 1 | Bagging (DT, 100) | 1.0 | 0.884375 | 1.0 | 0.550846 | 0.115625 | 0.449154 |
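
The Test F1 values in this table (about 0.55) are much lower than the ones printed by `evaluate_model` (about 0.87) because the table uses `average='macro'` while the helper uses `average="weighted"`. The sketch below recomputes both averages from the test confusion matrices shown earlier, assuming a class with no correct predictions gets an F1 of 0. The helper name `f1_from_confusion` is hypothetical and not part of scikit-learn.

```python
import numpy as np


def f1_from_confusion(cm):
    """Return (macro F1, weighted F1) for a matrix with rows = actual class."""
    tp = cm.diagonal().astype(float)
    support = cm.sum(axis=1)    # actual samples per class
    predicted = cm.sum(axis=0)  # predicted samples per class
    denom = support + predicted
    # Per-class F1 = 2*TP / (actual + predicted); 0 where the denominator is 0
    per_class = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    macro = per_class.mean()                           # every class counts equally
    weighted = np.average(per_class, weights=support)  # classes weighted by size
    return macro, weighted


rf_cm = np.array([[0, 13, 0], [0, 256, 8], [0, 15, 28]])
bag_cm = np.array([[0, 13, 0], [0, 252, 12], [0, 12, 31]])

rf_macro, rf_weighted = f1_from_confusion(rf_cm)
bag_macro, bag_weighted = f1_from_confusion(bag_cm)

print(f"Random Forest: macro = {rf_macro:.6f}, weighted = {rf_weighted:.4f}")
print(f"Bagging:       macro = {bag_macro:.6f}, weighted = {bag_weighted:.4f}")
```

Because the low class has an F1 of 0 and macro averaging gives it a full one-third weight, the macro score drops sharply, while the weighted score is dominated by the large medium class.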


## Section 7. Conclusions and Insights
Both models I used achieved a perfect training accuracy of 1.0. When comparing against models that my peers used, I noticed that other models resulted in lower training accuracy. The test accuracy of my two models is very close (0.8875 vs. 0.8844), with Random Forest (100) slightly ahead, and I did not see a peer model that performed higher.

I chose to include Accuracy Gap and F1 Gap in my model summary. The accuracy gap is about 0.11 for both models, while the macro F1 gap is much larger at about 0.45. Together with the perfect training scores, these gaps suggest overfitting.

The results also point to class imbalance in our dataset, which can cause misclassification. Both models favored the majority class (medium) and never correctly predicted the smallest class (low), so overall accuracy does not reflect true performance on that class. These ensemble methods were very interesting to use. They clearly show that a model's performance can easily be misrepresented when the dataset is imbalanced.
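
One common way to probe the effect of class imbalance is the `class_weight="balanced"` option of `RandomForestClassifier`, which reweights classes inversely to their frequency. The sketch below is an assumption-laden illustration, not a result from this project: it uses synthetic data from `make_classification` with made-up class proportions, and whether reweighting helps on the wine data is untested here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data with hypothetical imbalance (not the wine dataset)
X_demo, y_demo = make_classification(
    n_samples=1500,
    n_features=11,
    n_informative=6,
    n_classes=3,
    weights=[0.05, 0.80, 0.15],
    random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit once with default weights and once with balanced class weights
recalls = {}
for label, cw in [("default", None), ("balanced", "balanced")]:
    model = RandomForestClassifier(
        n_estimators=100, class_weight=cw, random_state=42
    )
    model.fit(X_tr, y_tr)
    # average=None returns one recall value per class
    recalls[label] = recall_score(y_te, model.predict(X_te), average=None)
    print(f"{label:>8}: per-class recall = {recalls[label].round(3)}")
```

Comparing per-class recall rather than overall accuracy keeps the minority class visible, which is the measure the conclusions above are concerned with.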