<h1 style="color: blue; font-family: Arial;">Lab 05 - Ensemble Machine Learning - Wine Dataset</h1>


### Author: Cera Drake
### Date: 4/11/2025
### Objective: Practice combining multiple models (ensemble models) using a wine dataset

In [114]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
    VotingClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

### Section 1: Load and Inspect Data

In [115]:
df = pd.read_csv("winequality-red.csv", sep=";")

# Display structure and first few rows
df.info()
print('The first five rows of the dataset:')
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
The first five rows of the dataset:


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


### Section 2: Prepare the Data

#### Clean the Data

In [116]:
# Check for missing values
df.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

In [117]:
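# Report the number of duplicate rows, then drop them
# (a bare expression is only displayed when it is the last line of a cell)
print(f"Duplicate rows: {df.duplicated().sum()}")
df.drop_duplicates(inplace=True)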

##### Reflection: I checked for missing values and found none. I also dropped all duplicate rows.

#### Create quality_label column

In [118]:
def quality_to_label(q):
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"

In [119]:
# Call the apply() method on the quality column to create the new quality_label column
df["quality_label"] = df["quality"].apply(quality_to_label)

#### Create numeric column for modeling

In [120]:

# 0 = low, 1 = medium, 2 = high
def quality_to_number(q):
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2


df["quality_numeric"] = df["quality"].apply(quality_to_number)


##### Reflection: Collapsing the quality scores into three classes (low, medium, high) gives the classification model a simpler target to learn than the raw scores.
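
The two helper functions above apply the same thresholds (4 and 6) twice. As a hedged alternative, the sketch below shows how `pd.cut` could produce both the label and the numeric code in one pass. The `quality` series here is a small made-up sample, not the lab data, and this is not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical sample of quality scores (stands in for df["quality"])
quality = pd.Series([3, 4, 5, 6, 7, 8])

# Right-inclusive bins: <= 4 is low, 5-6 is medium, >= 7 is high
labels = pd.cut(
    quality,
    bins=[-float("inf"), 4, 6, float("inf")],
    labels=["low", "medium", "high"],
)

# Category codes follow the label order: 0 = low, 1 = medium, 2 = high
codes = labels.cat.codes

print(pd.DataFrame({"quality": quality, "label": labels, "code": codes}))
```

Keeping the thresholds in a single `bins` list means a later change to the class boundaries only has to be made in one place.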

### Section 3: Feature Selection and Justification

In [121]:
# Define input features (X) and target (y)
# Features: all columns except 'quality', 'quality_label', and 'quality_numeric' - drop these from the input array
# Target: quality_numeric (the numeric class column we just created)
X = df.drop(columns=["quality", "quality_label", "quality_numeric"])  # Features
y = df["quality_numeric"]  # Target


##### Reflection: Our target is quality_numeric, the simplified column created in the previous step. We drop quality, quality_label, and quality_numeric from the features because each of them encodes the target; leaving any of them in would leak the answer to the model. The remaining 11 measurement columns are the inputs.

### Section 4: Split the Data into Train and Test

In [122]:
# Train/test split (stratify to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
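
To illustrate what `stratify=y` does, here is a minimal sketch on made-up, imbalanced labels (`X_demo` and `y_demo` are hypothetical and not the wine data). It compares the class shares in the train and test splits, which should come out nearly identical when stratification is used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10% class 0, 75% class 1, 15% class 2
y_demo = np.array([0] * 20 + [1] * 150 + [2] * 30)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of each class within each split
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)

print("train class shares:", train_share.round(3))
print("test class shares: ", test_share.round(3))
```

The same check could be run on `y_train` and `y_test` from the cell above to confirm that the rare low-quality class is represented in both splits.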

### Section 5: Evaluate Model Performance

In [123]:
# Helper function to train and evaluate models
def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    train_f1 = f1_score(y_train, y_train_pred, average="weighted")
    test_f1 = f1_score(y_test, y_test_pred, average="weighted")

    print(f"\n{name} Results")
    print("Confusion Matrix (Test):")
    print(confusion_matrix(y_test, y_test_pred))
    print(f"Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")
    print(f"Train F1 Score: {train_f1:.4f}, Test F1 Score: {test_f1:.4f}")

    results.append(
        {
            "Model": name,
            "Train Accuracy": train_acc,
            "Test Accuracy": test_acc,
            "Train F1": train_f1,
            "Test F1": test_f1,
        }
    )

In [124]:
# 8. Bagging 
results = []
evaluate_model(
    "Bagging (DT, 100)",
    BaggingClassifier(
        estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42
    ),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)


Bagging (DT, 100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  1 210  11]
 [  0  17  20]]
Train Accuracy: 1.0000, Test Accuracy: 0.8456
Train F1 Score: 1.0000, Test F1 Score: 0.8220


In [125]:
# 5. Gradient Boosting
evaluate_model(
    "Gradient Boosting (100)",
    GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
    ),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)


Gradient Boosting (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  2 210  10]
 [  0  17  20]]
Train Accuracy: 0.9696, Test Accuracy: 0.8456
Train F1 Score: 0.9686, Test F1 Score: 0.8232


### Section 6: Compare Results

In [126]:
# Create a table of results 
results_df = pd.DataFrame(results)

print("\nSummary of All Models:")
display(results_df)


Summary of All Models:


| | Model | Train Accuracy | Test Accuracy | Train F1 | Test F1 |
|---|---|---|---|---|---|
| 0 | Bagging (DT, 100) | 1.0 | 0.845588 | 1.0 | 0.821996 |
| 1 | Gradient Boosting (100) | 0.969641 | 0.845588 | 0.968605 | 0.82319 |


In [127]:
# Put results into a table
results_df = pd.DataFrame(results)

# Add a gap column
results_df["Gap"] = results_df["Train Accuracy"] - results_df["Test Accuracy"]

# Sort by Test Accuracy
results_df = results_df.sort_values(by="Test Accuracy", ascending=False).reset_index(drop=True)

print("\nSummary of both models - sorted by Test Accuracy:")
display(results_df)


Summary of both models - sorted by Test Accuracy:


| | Model | Train Accuracy | Test Accuracy | Train F1 | Test F1 | Gap |
|---|---|---|---|---|---|---|
| 0 | Bagging (DT, 100) | 1.0 | 0.845588 | 1.0 | 0.821996 | 0.154412 |
| 1 | Gradient Boosting (100) | 0.969641 | 0.845588 | 0.968605 | 0.82319 | 0.124053 |


### Section 7

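##### Of the 2 models that I used, Gradient Boosting performed slightly better. Test accuracy was identical (0.8456), but Gradient Boosting had a marginally higher test F1 (0.8232 vs. 0.8220) and a smaller train-test accuracy gap (0.124 vs. 0.154), which indicates less overfitting. Both confusion matrices also show that neither model correctly identified any of the 13 low-quality wines in the test set, so the accuracy figures mostly reflect the dominant medium class.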
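
Overall accuracy can hide how each class fares. The sketch below takes the Gradient Boosting test confusion matrix printed in Section 5 and derives per-class recall from it (rows are true classes, columns are predicted classes; 0 = low, 1 = medium, 2 = high).

```python
import numpy as np

# Test confusion matrix reported for Gradient Boosting (100) in Section 5
cm = np.array([
    [0, 13, 0],    # true low
    [2, 210, 10],  # true medium
    [0, 17, 20],   # true high
])

# Recall per class = correct predictions / number of true samples in that class
recall = cm.diagonal() / cm.sum(axis=1)

# Overall accuracy = all correct predictions / all samples
accuracy = cm.trace() / cm.sum()

for name, r in zip(["low", "medium", "high"], recall):
    print(f"{name:>6} recall: {r:.3f}")
print(f"accuracy: {accuracy:.4f}")
```

The same two lines of arithmetic work on the Bagging matrix, which makes it easy to compare the models class by class rather than by a single score.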

##### I also looked at a classmate's results for Random Forest (100) and Voting (DT + SVM + NN). Random Forest had the highest test accuracy but showed signs of overfitting (a training accuracy of 1.0). Of the 4 models that I looked at, Voting was the most balanced: it did not have the highest accuracy, but it had the smallest gap, which is a good sign that it generalizes well.
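
For reference, a voting ensemble of a decision tree, an SVM, and a neural network could be assembled roughly as follows. This is only a sketch: the classmate's actual settings are not known, the data is synthetic (`make_classification`), and the scaling pipelines and hard voting are my own assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the wine features (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# SVM and the neural network are scale-sensitive, so each gets a scaler
voting = VotingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(random_state=42)),
        ("svm", make_pipeline(StandardScaler(), SVC(random_state=42))),
        (
            "nn",
            make_pipeline(
                StandardScaler(), MLPClassifier(max_iter=1000, random_state=42)
            ),
        ),
    ],
    voting="hard",  # majority vote over predicted class labels
)
voting.fit(X_tr, y_tr)

train_acc = voting.score(X_tr, y_tr)
test_acc = voting.score(X_te, y_te)
print(f"Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")
print(f"Gap: {train_acc - test_acc:.4f}")
```

Because the three base models make different kinds of errors, a majority vote can smooth out the overfitting of any single one, which is one plausible reason for the smaller gap.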

##### My next step in fine-tuning these models might be to combine features. I'd find features that appear to be related and combine them, similar to what we did with the Titanic and Mushroom datasets. I would also tune the hyperparameters of my chosen model with a search tool suited to it.
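
One possible tool for the hyperparameter step is scikit-learn's `GridSearchCV`. The sketch below is an assumption about how that might look for Gradient Boosting: the data is synthetic, and the grid values are illustrative guesses rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for the training split (assumption)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# Small illustrative grid; real searches would usually cover more values
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="f1_weighted",  # same weighted F1 used in evaluate_model
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated weighted F1: {search.best_score_:.4f}")
```

On the lab data, the search would be fit on `X_train` and `y_train` only, so that the test set stays untouched for the final comparison.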