# Lab 5: Ensemble ML, Spiral Project
**Author:** Derek Fintel

**Date:** April 11, 2025

**Objective:** Ensemble models combine the outputs of multiple models to improve predictive performance. Common types of ensemble models include:

- Boosted Decision Trees – Models train sequentially, with each new tree correcting the errors of the previous one.
- Random Forest – Multiple decision trees train in parallel, each on a random subset of the data, and their predictions are averaged.
- Voting Classifier (Heterogeneous Models) – Combines different types of models (e.g., Decision Tree, SVM, and Neural Network) by taking the majority vote or average prediction.
- Cross Validation – Not an ensemble method itself, but an evaluation technique that divides data into multiple folds to improve the reliability of performance estimates.

**Data Source:**
We use the Wine Quality Dataset made available by the UCI Machine Learning Repository.

https://archive.ics.uci.edu/ml/datasets/Wine+Quality

Data originally published by:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.  
Modeling wine preferences by data mining from physicochemical properties.  
In Decision Support Systems, Elsevier, 47(4):547–553, 2009.


## Introduction
In this project we use the red wine subset of the Wine Quality dataset to train and compare ensemble models.

This project is organized into the following Sections:
- Section 0: Imports
- Section 1: Load and Inspect the Data
- Section 2: Prepare the Data
- Section 3: Feature Selection and Justification
- Section 4: Split the Data into Train and Test
- Section 5: Evaluate Model Performance (Choose 2)
- Section 6: Compare Results 
- Section 7: Conclusions and Insights

### Section 0. Imports

In [54]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
    VotingClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

#### The dataset includes 11 physicochemical input variables (features):

- fixed acidity: mostly tartaric acid
- volatile acidity: mostly acetic acid (vinegar)
- citric acid: can add freshness and flavor
- residual sugar: remaining sugar after fermentation
- chlorides: salt content
- free sulfur dioxide: protects wine from microbes
- total sulfur dioxide: sum of free and bound forms
- density: related to sugar content
- pH: acidity level (lower = more acidic)
- sulphates: antioxidant and microbial stabilizer
- alcohol: % alcohol by volume

The target variable is:
- quality (integer score from 0 to 10, rated by wine tasters)

We will simplify this target into three categories to make classification feasible:
- low (3–4), medium (5–6), high (7–8)
- we will also encode these categories numerically (we want both for clarity)

The dataset contains 1599 samples and 12 columns (11 features + target).

### Section 1. Load and Inspect the Data
In this section we load the designated dataset and display the summary information and first five rows.

In [55]:
# Load the dataset (download from UCI and save in the same folder)
df = pd.read_csv("winequality-red.csv", sep=";")

# Display structure and first few rows
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


### Section 2. Prepare the Data
In this section we engineer two new target columns from the raw quality score using helper functions. Feature selection and the train/test split follow in Sections 3 and 4.

1) The helper functions below take a quality score (temporarily named "q" inside the function) and return a category based on its value.

2) We use the returned values to create a new column called "quality_label".

3) Next we create a second column for modeling, "quality_numeric", that maps each quality score to an integer class (0 = low, 1 = medium, 2 = high).

In [56]:
# Helper Function:
# Takes one input, the quality (which we will temporarily name "q" while in the function)
# And returns a string of the quality label (low, medium, high)
# This function will be used to create the "quality_label" column
def quality_to_label(q):
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"


# Call the apply() method on the quality column to create the new quality_label column
df["quality_label"] = df["quality"].apply(quality_to_label)


# Then, create a numeric column for modeling: 0 = low, 1 = medium, 2 = high
def quality_to_number(q):
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2


df["quality_numeric"] = df["quality"].apply(quality_to_number)
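
The same binning can also be expressed without per-row Python functions. The following is a minimal sketch, assuming the same thresholds as the helper functions above (4 and 6) and using a small illustrative series rather than the actual CSV; the names `quality_demo`, `bins`, `labels_demo`, and `numeric_demo` are hypothetical.

```python
import pandas as pd

# Small illustrative series of quality scores (not read from the CSV).
quality_demo = pd.Series([3, 4, 5, 6, 7, 8])

# pd.cut uses right-inclusive bins by default:
# (-inf, 4] -> low, (4, 6] -> medium, (6, inf] -> high.
bins = [-float("inf"), 4, 6, float("inf")]
labels_demo = pd.cut(quality_demo, bins=bins, labels=["low", "medium", "high"])

# The category codes follow the label order: 0 = low, 1 = medium, 2 = high.
numeric_demo = labels_demo.cat.codes

print(labels_demo.tolist())
print(numeric_demo.tolist())
```

Because the labels come back as an ordered categorical, the numeric codes are derived from one set of thresholds instead of being maintained in two separate functions.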

### Section 3. Feature Selection and Justification
In this section, we assign our feature & target variables for model training:

X = all columns except "quality", "quality_label", & "quality_numeric", i.e. the 11 physicochemical measurements. (Features)

y = "quality_numeric" (Target)

In [57]:
# Define input features (X) and target (y)
# Features: all columns except 'quality', 'quality_label', and 'quality_numeric' - drop these from the input array
# Target: quality_numeric (the numeric class column we just created)
X = df.drop(columns=["quality", "quality_label", "quality_numeric"])  # Features
y = df["quality_numeric"]  # Target

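### Section 4. Split the Data into Train and Test
The code below splits the data into training (80%) and test (20%) sets, stratified on the target so that both sets keep the same class proportions.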

In [58]:
# Train/test split (stratify to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Train Size:", X_train.shape)
print("Test Size:", X_test.shape)

Train Size: (1279, 11)
Test Size: (320, 11)
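
To see what `stratify=y` contributes, the class shares in each split can be compared directly. This is a minimal sketch on synthetic, imbalanced labels (not the wine data); the names `X_demo`, `y_demo`, and `class_shares` are hypothetical and the class probabilities are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced 3-class target (illustrative only).
rng = np.random.default_rng(42)
y_demo = rng.choice([0, 1, 2], size=1000, p=[0.05, 0.80, 0.15])
X_demo = rng.normal(size=(1000, 4))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)


def class_shares(labels):
    """Return the fraction of samples in each of the three classes."""
    return np.bincount(labels, minlength=3) / len(labels)


print("full :", class_shares(y_demo).round(3))
print("train:", class_shares(y_tr).round(3))
print("test :", class_shares(y_te).round(3))
```

With stratification the three rows should be nearly identical, which matters most for the rare class: without it, a small test set could end up with very few or no samples of that class.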


### Section 5. Evaluate Model Performance (Choose 2)
Below is a list of 9 model variations. Choose two to focus on for your comparison.

Models selected in this exercise are **BOLD.**

| Option | Model Name | Notes |
|---|---|---|
| 1 | **Random Forest (100)** | A strong baseline model using 100 decision trees. |
| 2 | Random Forest (200, max_depth=10) | Adds more trees, but limits tree depth to reduce overfitting. |
| 3 | AdaBoost (100) | Boosting method that focuses on correcting previous errors. |
| 4 | AdaBoost (200, lr=0.5) | More iterations and slower learning for better generalization. |
| 5 | **Gradient Boosting (100)** | Boosting approach using gradient descent. |
| 6 | Voting (DT + SVM + NN) | Combines diverse models by averaging their predictions. |
| 7 | Voting (RF + LR + KNN) | Another mix of different model types. |
| 8 | Bagging (DT, 100) | Builds many trees in parallel on different samples. |
| 9 | MLP Classifier | A basic neural network with one hidden layer. |
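
The voting options combine models either by averaging predicted probabilities (soft voting) or by counting each model's predicted class (hard voting). The sketch below illustrates both rules with NumPy only; the probability values are hypothetical and do not come from any model in this notebook.

```python
import numpy as np

# Hypothetical class probabilities from three models for two samples.
# Axis 0: model, axis 1: sample, axis 2: class (low, medium, high).
probs = np.array([
    [[0.2, 0.5, 0.3], [0.1, 0.3, 0.6]],  # model A
    [[0.1, 0.6, 0.3], [0.2, 0.5, 0.3]],  # model B
    [[0.3, 0.3, 0.4], [0.1, 0.2, 0.7]],  # model C
])

# Soft voting: average the probabilities across models, then pick the top class.
soft_pred = probs.mean(axis=0).argmax(axis=1)

# Hard voting: each model votes for its own top class; the most common vote wins.
votes = probs.argmax(axis=2)  # shape (n_models, n_samples)
hard_pred = np.array(
    [np.bincount(sample_votes, minlength=3).argmax() for sample_votes in votes.T]
)

print("soft:", soft_pred)
print("hard:", hard_pred)
```

Soft voting needs every member to expose probabilities, which is why the commented-out `SVC` in the voting cells is created with `probability=True`.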
 

In [59]:
# Helper function to train and evaluate models
def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    train_f1 = f1_score(y_train, y_train_pred, average="weighted")
    test_f1 = f1_score(y_test, y_test_pred, average="weighted")

    print(f"\n{name} Results")
    print("Confusion Matrix (Test):")
    print(confusion_matrix(y_test, y_test_pred))
    print(f"Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")
    print(f"Train F1 Score: {train_f1:.4f}, Test F1 Score: {test_f1:.4f}")

    results.append(
        {
            "Model": name,
            "Train Accuracy": train_acc,
            "Test Accuracy": test_acc,
            "Train F1": train_f1,
            "Test F1": test_f1,
        }
    )

### Here we train our selected models:
- Random Forest (100)
- Gradient Boosting (100)

In [60]:
results = []

# 1. Random Forest
evaluate_model(
    "Random Forest (100)",
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)

# 2. Random Forest (200, max depth=10) 
#evaluate_model(
#    "Random Forest (200, max_depth=10)",
#    RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42),
#    X_train,
#    y_train,
#    X_test,
#    y_test,
#    results,
#)

# 3. AdaBoost 
#evaluate_model(
#    "AdaBoost (100)",
#    AdaBoostClassifier(n_estimators=100, random_state=42),
#    X_train,
#    y_train,
#    X_test,
#    y_test,
#    results,
#)

# 4. AdaBoost (200, lr=0.5) 
#evaluate_model(
#    "AdaBoost (200, lr=0.5)",
#    AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=42),
#    X_train,
#    y_train,
#    X_test,
#    y_test,
#    results,
#)

# 5. Gradient Boosting
evaluate_model(
    "Gradient Boosting (100)",
    GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
    ),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)

# 6. Voting Classifier (DT, SVM, NN) 
#voting1 = VotingClassifier(
#    estimators=[
#        ("DT", DecisionTreeClassifier()),
#        ("SVM", SVC(probability=True)),
#        ("NN", MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000)),
#    ],
#    voting="soft",
#)
#evaluate_model(
#    "Voting (DT + SVM + NN)", voting1, X_train, y_train, X_test, y_test, results
#)

# 7. Voting Classifier (RF, LR, KNN) 
#voting2 = VotingClassifier(
#    estimators=[
#        ("RF", RandomForestClassifier(n_estimators=100)),
#        ("LR", LogisticRegression(max_iter=1000)),
#        ("KNN", KNeighborsClassifier()),
#    ],
#    voting="soft",
#)
#evaluate_model(
#    "Voting (RF + LR + KNN)", voting2, X_train, y_train, X_test, y_test, results
#)

# 8. Bagging 
#evaluate_model(
#    "Bagging (DT, 100)",
#    BaggingClassifier(
#        estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42
#    ),
#    X_train,
#    y_train,
#    X_test,
#    y_test,
#    results,
#)

# 9. MLP Classifier 
#evaluate_model(
#    "MLP Classifier",
#    MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42),
#    X_train,
#    y_train,
#    X_test,
#    y_test,
#    results,
#)



Random Forest (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 256   8]
 [  0  15  28]]
Train Accuracy: 1.0000, Test Accuracy: 0.8875
Train F1 Score: 1.0000, Test F1 Score: 0.8661

Gradient Boosting (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  3 247  14]
 [  0  16  27]]
Train Accuracy: 0.9601, Test Accuracy: 0.8562
Train F1 Score: 0.9584, Test F1 Score: 0.8411
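
Weighted scores can hide how each class fares, so it helps to read per-class recall off the confusion matrix. The sketch below uses the Random Forest test matrix copied from the output above and assumes the scikit-learn convention that rows are true classes and columns are predicted classes; the variable names are hypothetical.

```python
import numpy as np

# Test confusion matrix for Random Forest (100), copied from the printed output.
# Rows: true class, columns: predicted class (0 = low, 1 = medium, 2 = high).
cm = np.array([
    [0, 13, 0],
    [0, 256, 8],
    [0, 15, 28],
])

# Overall accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: correct predictions for a class over its true sample count.
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(f"accuracy: {accuracy:.4f}")
print("recall (low, medium, high):", recall_per_class.round(4))
```

The accuracy reproduces the reported 0.8875, while the recall for the low class is zero: all 13 low-quality test wines were predicted as medium.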


### Section 6. Compare Results 
Here we collect the results in a DataFrame, sort them by test accuracy, and display them.

In [61]:
# Create a table of results 
results_df = pd.DataFrame(results)

# Sort by 'Test Accuracy' in descending order
df_sorted = results_df.sort_values(by="Test Accuracy", ascending=False)

print("\nSummary of All Models:")
# Display the sorted table (the original cell displayed the unsorted results_df)
display(df_sorted)


Summary of All Models:


| | Model | Train Accuracy | Test Accuracy | Train F1 | Test F1 |
|---|---|---|---|---|---|
| 0 | Random Forest (100) | 1.0 | 0.8875 | 1.0 | 0.866056 |
| 1 | Gradient Boosting (100) | 0.960125 | 0.85625 | 0.95841 | 0.841106 |
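
The scores above come from a single train/test split. Cross validation, listed in the objective, would give a more reliable estimate by averaging over several splits. This is a minimal sketch on synthetic data generated with `make_classification` (not the wine data); the names `X_cv`, `y_cv`, and `cv_scores`, the fold count, and the smaller forest size are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class problem with 11 features (illustrative stand-in only).
X_cv, y_cv = make_classification(
    n_samples=300, n_features=11, n_informative=5, n_classes=3, random_state=42
)

# Five stratified folds: each fold keeps roughly the same class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_cv,
    y_cv,
    cv=cv,
    scoring="accuracy",
)

print("fold accuracies:", cv_scores.round(3))
print(f"mean = {cv_scores.mean():.3f}, std = {cv_scores.std():.3f}")
```

Applied to the wine features and target, the spread across folds would indicate whether the gap between the two models is larger than the split-to-split variation.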


### Section 7. Conclusions and Insights
Using both your results and the results from others, which options are performing well, and why do you think so?

This is your value as an analyst - narrate your story, link to other notebooks, provide a comprehensive view of what you feel is the best model for predicting quality in red wine. Base all your reasoning on data. Feel free to tune parameters if you like.  Discuss the types of models and why you think some seem to be more helpful. List the next steps you'd like to try if you were in a competition to build the best predictor. 

Don't just copy code and don't just copy AI insights - use them to learn, but we all get them for free. Use all your tools to provide your own unique value and insights. Professional communication skills are critical. Evaluate your work in the context of others - how well can you craft a unique data story and present a compelling project to your clients / readers / self. 