# Lab 5: Ensemble Machine Learning – Wine Quality Dataset  
**Name:** Prince  
**Date:** 04/09/2025  

## Introduction

In this project, I’m exploring different ensemble machine learning models to predict wine quality based on its chemical properties. The dataset comes from the UCI Machine Learning Repository and includes red wine samples rated by experts.

Instead of relying on a single model, I'm comparing ensemble methods such as Random Forest and AdaBoost, which combine many learners, to see if we can get better results. The goal is to figure out which approach gives us the most accurate and balanced predictions for classifying wine as **low**, **medium**, or **high** quality.

I'll be checking the performance of each model using accuracy and F1 score, and also paying attention to whether the model overfits or generalizes well to new data.


## Section 0: Imports


In [1]:
# Section 0: Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
    VotingClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)


## Section 1: Load and Inspect the Data

In [3]:
# Section 1: Load and Inspect the Data

# Load the red wine dataset (semicolon-separated)
df = pd.read_csv("winequality-red.csv", sep=";")

# Check dataset structure
df.info()

# Preview the first few rows
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


## Section 2: Prepare the Data

The original `quality` column is scored on a 0 to 10 scale, but the scores in this dataset only range from 3 to 8. Instead of predicting each score, I'm grouping them into 3 categories to make the classification easier:

- 3–4 = low quality  
- 5–6 = medium quality  
- 7–8 = high quality  

I created two new columns:
- `quality_label`: shows the category as text (low, medium, high)
- `quality_numeric`: turns that into numbers (0, 1, 2) for the model

This makes the target simpler and easier for the model to learn.


In [4]:
# Section 2: Prepare the Data

# Helper function to convert wine quality score to a label (string)
def quality_to_label(q):
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"

# Apply function to create new label column
df["quality_label"] = df["quality"].apply(quality_to_label)

# Helper function to convert wine quality score to a numeric class
def quality_to_number(q):
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2

# Apply function to create numeric column
df["quality_numeric"] = df["quality"].apply(quality_to_number)

# Check the distribution of each label
df["quality_label"].value_counts()


quality_label
medium    1319
high       217
low         63
Name: count, dtype: int64
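
As a more compact alternative to the two helper functions, the same binning can be sketched with `pd.cut`. This is only an illustration on a toy series: `scores`, `labels`, and `codes` are hypothetical names, and the bin edges `[0, 4, 6, 10]` are my assumption for reproducing the 3–4 / 5–6 / 7–8 grouping.

```python
import pandas as pd

# Toy scores covering the range seen in the dataset (hypothetical data)
scores = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Right-closed bins: (0, 4] -> low, (4, 6] -> medium, (6, 10] -> high
labels = pd.cut(
    scores,
    bins=[0, 4, 6, 10],
    labels=["low", "medium", "high"],
    include_lowest=True,  # so a score of 0 would also land in "low"
)

# The ordered categorical already carries the numeric codes 0, 1, 2
codes = labels.cat.codes

print(pd.DataFrame({"quality": scores, "label": labels, "numeric": codes}))
```

Both columns come from one set of bin edges here, so the text label and the numeric class cannot drift apart if the thresholds change.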

## Section 3: Feature Selection and Justification

For this project, I’m using the 11 chemical properties (like acidity, sugar, alcohol, etc.) as input features.

I dropped these columns:
- `quality`: the original raw score (not needed anymore)
- `quality_label`: the text version of the label (only used for display)
- `quality_numeric`: this is our target, so we don’t want it in the features

My target variable is `quality_numeric`, which has 3 classes: 0 = low, 1 = medium, 2 = high.
This format works best with the models we'll train.


In [5]:
# Section 3: Feature Selection

# Features: all chemical properties only (drop target-related columns)
X = df.drop(columns=["quality", "quality_label", "quality_numeric"])

# Target: use numeric quality for training
y = df["quality_numeric"]

# Quick check on shapes
X.shape, y.shape


((1599, 11), (1599,))

## Section 4: Split the Data into Train and Test

Now that we’ve selected our features and target, it’s time to split the dataset.  
I’m using an 80/20 train-test split so the model can learn from most of the data but still be tested on unseen examples.

I'm also using `stratify=y` so the class distribution (low, medium, high) stays consistent in both the training and testing sets. With classes this imbalanced, that keeps the test set representative and makes the evaluation more reliable.


In [6]:
# Section 4: Split the Data into Train and Test

# Train/test split (stratify to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Check split shapes
X_train.shape, X_test.shape, y_train.shape, y_test.shape


((1279, 11), (320, 11), (1279,), (320,))
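
To check what `stratify=y` does, the sketch below rebuilds a label vector with the same class counts as the dataset (63 low, 1319 medium, 217 high) and compares class shares before and after the split. The names `y_demo`, `X_demo`, and `class_shares` are hypothetical, and the single placeholder feature is only there so the split has something to index.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the wine data: 0=low, 1=medium, 2=high
y_demo = np.repeat([0, 1, 2], [63, 1319, 217])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

def class_shares(labels):
    """Fraction of samples in each of the three classes."""
    return np.bincount(labels, minlength=3) / len(labels)

print("full :", class_shares(y_demo).round(3))
print("train:", class_shares(y_tr).round(3))
print("test :", class_shares(y_te).round(3))
```

The three rows should be nearly identical, which is the point of stratifying: the small low-quality class is not left out of the test set by chance.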

## Section 5: Evaluate Model Performance

Now I’ll evaluate two ensemble models: Random Forest and AdaBoost. These models are designed to improve performance by combining the predictions of multiple learners.

I’m using a helper function to:
- Train the model
- Predict on train and test sets
- Print a confusion matrix
- Calculate accuracy and F1 score
- Save results for comparison

### Models Chosen:
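1. **Random Forest (100 trees)** – trains many decision trees on bootstrap samples of the data and combines their predictions (scikit-learn averages the trees' class probabilities).
2. **AdaBoost (100 estimators)** – trains weak learners sequentially, giving more weight to the samples the previous learners misclassified.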
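
The difference between the two ensembles can be seen by looking inside fitted models. The sketch below uses synthetic data from `make_classification` (not the wine data) and only 10 estimators to keep it fast. `X_demo`, `y_demo`, and the other variable names are hypothetical. It checks that the forest's predicted probabilities equal the average over its trees, and that AdaBoost's default weak learners are very shallow trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Synthetic 3-class problem with 11 features (stand-in for the wine data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=6, n_classes=3, random_state=0
)

# Random Forest: independent trees whose class probabilities are averaged
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)
tree_probs = np.mean([tree.predict_proba(X_demo) for tree in rf.estimators_], axis=0)
rf_matches_average = np.allclose(rf.predict_proba(X_demo), tree_probs)

# AdaBoost: a sequence of weak learners (depth-1 trees by default)
ada = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)
stump_depths = [est.get_depth() for est in ada.estimators_]

print("RF probabilities equal the tree average:", rf_matches_average)
print("AdaBoost weak learner depths:", stump_depths)
```

Fully grown trees can fit the training set almost perfectly, while stumps cannot, which helps explain why the two models show such different train/test gaps.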


In [7]:
# Section 5: Evaluate Model Performance

# Helper function to train and evaluate models
def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    train_f1 = f1_score(y_train, y_train_pred, average="weighted")
    test_f1 = f1_score(y_test, y_test_pred, average="weighted")

    print(f"\n{name} Results")
    print("Confusion Matrix (Test):")
    print(confusion_matrix(y_test, y_test_pred))
    print(f"Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")
    print(f"Train F1 Score: {train_f1:.4f}, Test F1 Score: {test_f1:.4f}")

    results.append(
        {
            "Model": name,
            "Train Accuracy": train_acc,
            "Test Accuracy": test_acc,
            "Train F1": train_f1,
            "Test F1": test_f1,
        }
    )

# Create list to store results
results = []

# 1. Random Forest (100 trees)
evaluate_model(
    "Random Forest (100)",
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)

# 2. AdaBoost (100 estimators)
evaluate_model(
    "AdaBoost (100)",
    AdaBoostClassifier(n_estimators=100, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)



Random Forest (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 256   8]
 [  0  15  28]]
Train Accuracy: 1.0000, Test Accuracy: 0.8875
Train F1 Score: 1.0000, Test F1 Score: 0.8661

AdaBoost (100) Results
Confusion Matrix (Test):
[[  1  12   0]
 [  5 240  19]
 [  0  20  23]]
Train Accuracy: 0.8342, Test Accuracy: 0.8250
Train F1 Score: 0.8209, Test F1 Score: 0.8158
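
Because the F1 scores above use `average="weighted"`, the large medium class dominates them. As a rough check, the sketch below recomputes per-class recall, weighted F1, and macro F1 directly from the two test confusion matrices printed above. The function name `per_class_scores` and the `summary` dictionary are hypothetical helpers, not part of the lab code.

```python
import numpy as np

def per_class_scores(cm):
    """Return (recall, f1, support) per class from a confusion matrix
    whose rows are true labels and columns are predicted labels."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    support = cm.sum(axis=1)    # true samples per class
    predicted = cm.sum(axis=0)  # predictions per class
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    denom = support + predicted
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    return recall, f1, support

# Test confusion matrices copied from the output above (low, medium, high)
matrices = {
    "Random Forest (100)": [[0, 13, 0], [0, 256, 8], [0, 15, 28]],
    "AdaBoost (100)": [[1, 12, 0], [5, 240, 19], [0, 20, 23]],
}

summary = {}
for name, cm in matrices.items():
    recall, f1, support = per_class_scores(cm)
    summary[name] = {
        "recall": recall,
        "weighted_f1": np.average(f1, weights=support),
        "macro_f1": f1.mean(),
    }
    print(name)
    print("  recall (low, medium, high):", recall.round(3))
    print(f"  weighted F1: {summary[name]['weighted_f1']:.4f}")
    print(f"  macro F1:    {summary[name]['macro_f1']:.4f}")
```

The weighted F1 values reproduce the reported 0.8661 and 0.8158. The macro F1 values come out much lower, roughly 0.55 and 0.51, because recall on the low-quality class is 0 of 13 for Random Forest and 1 of 13 for AdaBoost.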


## Section 6: Compare Results

Now that both models have been evaluated, I’ll compare them side-by-side in a results table.

I also added two extra columns:
- **Accuracy Gap**: Difference between train and test accuracy
- **F1 Gap**: Difference between train and test F1 score

Smaller gaps are better — they mean the model is generalizing well and not overfitting to the training data.


In [8]:
# Section 6: Compare Results

# Turn list of results into a DataFrame
results_df = pd.DataFrame(results)

# Calculate generalization gaps
results_df["Accuracy Gap"] = results_df["Train Accuracy"] - results_df["Test Accuracy"]
results_df["F1 Gap"] = results_df["Train F1"] - results_df["Test F1"]

# Sort by test accuracy (best performing at top)
results_df = results_df.sort_values(by="Test Accuracy", ascending=False)

# Show summary table
print("\nSummary of All Models:")
results_df.reset_index(drop=True, inplace=True)
display(results_df)



Summary of All Models:


|   | Model | Train Accuracy | Test Accuracy | Train F1 | Test F1 | Accuracy Gap | F1 Gap |
|---|---|---|---|---|---|---|---|
| 0 | Random Forest (100) | 1.000000 | 0.8875 | 1.000000 | 0.866056 | 0.112500 | 0.133944 |
| 1 | AdaBoost (100) | 0.834246 | 0.8250 | 0.820863 | 0.815803 | 0.009246 | 0.005060 |


## Section 7: Conclusions and Insights

In this lab, I evaluated two ensemble models — **Random Forest (100)** and **AdaBoost (100)** — to predict red wine quality using 11 chemical features. Both models performed well but in different ways.

### My Results:
- **Random Forest (100)** had the best test performance with **88.75% accuracy** and a weighted **F1 score of 0.8661**, but it also had **perfect training scores**. That gap of about 0.11 in accuracy suggests overfitting: the model may be memorizing the training data instead of generalizing to new data. Its confusion matrix also shows that it classified none of the 13 low-quality test wines correctly.
- **AdaBoost (100)** had lower test accuracy (**82.5%**) and F1 score (**0.8158**), but its train and test results were very close, so it shows little sign of overfitting. It still identified only 1 of the 13 low-quality test wines, so neither model handles the minority class well.

### Comparison with Other Student Models:
Looking at another student’s results:
- **AdaBoost**: 82.5% accuracy, 0.816 F1 score (matching mine, which suggests AdaBoost is stable across experiments).
- **MLP Classifier**: 84.4% accuracy, 0.807 F1 score. Slightly better accuracy than AdaBoost but lower F1, which suggests it may not perform as well on underrepresented classes.

This comparison suggests that **AdaBoost is a solid general-purpose model**, while **MLP and Random Forest** may need more tuning or regularization to perform at their best.

### Key Takeaways:
- **Random Forest** gives great performance, but the high training scores make me cautious. I would reduce tree depth or limit features to improve generalization.
- **AdaBoost** may not be the flashiest, but it's consistent — and in real-world projects, consistency is gold.
- **MLP Classifier** looks promising and is worth exploring further, especially with parameter tuning.

### If this were a competition

I’d tune Random Forest to reduce overfitting, try Gradient Boosting or Voting Classifiers, and fix the class imbalance. I’d also use cross-validation to make results more reliable.
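
A minimal sketch of what those next steps could look like is below. It runs on synthetic imbalanced data from `make_classification` rather than the wine data. The hyperparameter values (`max_depth=6`, `min_samples_leaf=3`) are untuned assumptions chosen only for illustration, and `X_demo`, `y_demo`, `rf`, and `scores` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with roughly the same imbalance (few low, many medium)
X_demo, y_demo = make_classification(
    n_samples=600,
    n_features=11,
    n_informative=6,
    n_classes=3,
    weights=[0.05, 0.80, 0.15],
    random_state=42,
)

# Shallower trees to limit memorization; class_weight to counter imbalance
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=6,
    min_samples_leaf=3,
    class_weight="balanced",
    random_state=42,
)

# Stratified 5-fold CV scored with macro F1 so every class counts equally
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(rf, X_demo, y_demo, cv=cv, scoring="f1_macro")

print("Macro F1 per fold:", scores.round(3))
print(f"Mean: {scores.mean():.3f}, Std: {scores.std():.3f}")
```

On the real data, `X` and `y` from Section 3 would replace the synthetic arrays. Reporting the mean and spread across folds gives a steadier estimate than a single 80/20 split.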


### Final Thoughts:
My top pick for now is **AdaBoost**. It does not win every metric, but its train and test scores are nearly identical, so it shows little sign of overfitting. And as an analyst, I care about models that work just as well tomorrow as they do today.
