# Wine Quality Classification 🍷

## 1. Introduction

The Wine Quality dataset is a well-known resource on Kaggle, widely used for data science and machine learning projects. It provides a detailed chemical analysis of red wine samples, with variables such as acidity, sugar, sulphates, and alcohol, together with a subjective quality rating. The dataset reflects the real-world complexity and subjectivity of wine quality assessment, and its class imbalance makes it a useful case study for exploring classification techniques and imbalance handling in machine learning. This notebook works with a version of the Wine Quality dataset that has only 3 labels (0 = poor, 1 = medium, 2 = premium quality). This version makes the data distribution easier to visualize and allows more classifiers to be tested smoothly. The original dataset can be found on Kaggle: https://www.kaggle.com/datasets/yasserh/wine-quality-dataset

## 2. Importing the necessary Modules

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC 
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import cross_val_score
from tensorflow.keras.optimizers import SGD

import tensorflow as tf
import keras_tuner as kt
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('TkAgg')  # Fixes plt.show() not working
import matplotlib.pyplot as plt
import seaborn as sns

## 3. Data Exploration

## 3.1 Descriptive Statistics

In [3]:
#Read Dataset
df = pd.read_csv('Wine_Test_02.csv')

# Get a list of quality classes: [0,1,2]
quality_classes = df['quality'].unique()

attributes = df.columns[:-1]  # exclude the last column 'quality' from attributes to plot

# Display the first few rows of the dataframe
print("---Snippet of the first 5 instances---\n",df.head(),sep="")

# Print the number of samples per class
print("---Label: Number of Samples---\n",df['quality'].value_counts(),sep="")

# Descriptive statistics
#print(df.describe())

# Check for missing values
print("---Missing Values---\n",df.isnull().sum(),sep="")

# Remove duplicates
df = df.drop_duplicates()


---Snippet of the first 5 instances---
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        1  
1  

## 3.2 Data Visualization

## 3.2.1 Plotting the Correlation Matrix between Features

In [4]:
# Correlation matrix heatmap
"""
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap='coolwarm')
plt.title("Correlation Matrix of Features")
plt.show()
"""
# correlation_matrix = df.corr()
# print(correlation_matrix)



![CorrelationMatrix](https://i.imgur.com/9PWelWL.png)


## Analyzing the Correlation Matrix

### Fixed Acidity:

Has a strong positive correlation with density (0.68) and a strong negative correlation with pH (-0.69), which is expected because acids lower the pH.
It also has a similarly strong positive correlation with citric acid (0.67), suggesting that wines with higher fixed acidity tend to contain more citric acid.

### Volatile Acidity:

Shows a moderate negative correlation with citric acid (-0.54), indicating that wines with higher volatile acidity tend to have lower levels of citric acid.

### Density:

Shows a moderate negative correlation with alcohol (-0.49). This suggests that wines with higher alcohol content tend to be less dense, which is consistent with alcohol being less dense than water.
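
To make the sign of the fixed acidity/pH relationship concrete, here is a minimal sketch that computes Pearson's r by hand for the five rows printed in Section 3.1. Five rows (two of them identical) are far too few to reproduce the values in the matrix, so this only illustrates the calculation; the helper name `pearson_r` is an assumption for illustration and is not part of the notebook.

```python
import math

# First five rows of `fixed acidity` and `pH` as printed in Section 3.1.
fixed_acidity = [7.4, 7.8, 7.8, 11.2, 7.4]
ph = [3.51, 3.20, 3.26, 3.16, 3.51]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(fixed_acidity, ph)
print(f"Pearson r (fixed acidity vs. pH, 5 rows): {r:.2f}")
```

On the full dataframe, `df.corr()` performs the same calculation for every pair of columns.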

## 3.2.2 Plotting the Histograms of each Feature

In [5]:
# Plot histograms for each attribute by quality class SEPARATELY
"""
for attribute in attributes:
    plt.figure(figsize=(10, 4))
    for quality in quality_classes:
        # Select the rows where the quality matches the current class
        subset = df[df['quality'] == quality]
        # Plot the histogram
        sns.histplot(subset[attribute], kde=False, label=str(quality))
    plt.title(f'Histogram of {attribute} by wine quality class')
    plt.xlabel(attribute)
    plt.ylabel('Frequency')
    plt.legend(title='Quality class')
    plt.show()
"""

![Histograms](https://i.imgur.com/RqI81tc.png)


## Optionally, we can also plot all histograms in one view:

In [6]:
# Stacked histograms in one display
bins = 15

# Define the layout size based on the number of features
num_features = len(df.columns) - 1  # exclude the 'quality' column
num_rows = int(np.ceil(num_features / 3))
fig, axes = plt.subplots(nrows=num_rows, ncols=3, figsize=(15, num_rows * 3))

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Plot stacked histograms for each feature
"""
for i, feature in enumerate(df.columns[:-1]):  # exclude the 'quality' column
    for quality in sorted(df['quality'].unique()):
        subset = df[df['quality'] == quality][feature]
        axes[i].hist(subset, bins=bins, alpha=0.5, label=f'Quality {quality}', stacked=True)
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Frequency')
    axes[i].legend()

# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()
"""

![Histograms](https://i.imgur.com/SvUEELs.png)


## Analyzing the Data Distribution

The histograms illustrate the distribution of features in the dataset, segmented by the quality classes (0, 1, and 2). The x-axis represents the feature values, and the y-axis represents the frequency of observations. 

The distribution of quality classes is highly imbalanced: the majority class '1' has 944 samples, far more than class '2' (159 samples) and class '0' (40 samples).

Because of this imbalance, a classifier may be biased towards predicting the majority class, leading to high overall accuracy but poor performance on the minority classes. Furthermore, the classes overlap significantly for most attributes, which suggests that the features are not strong discriminators between the classes and could limit the performance of the classifier.
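
As a rough illustration of why accuracy alone is misleading here, the sketch below computes the accuracy of a trivial model that always predicts the majority class, using the class counts quoted above. The variable names are illustrative and not part of the notebook.

```python
# Class counts quoted in the text above (label: number of samples).
class_counts = {0: 40, 1: 944, 2: 159}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# Accuracy of a model that predicts the majority label for every sample.
baseline_accuracy = class_counts[majority_label] / total

# Share of each class in the dataset.
class_shares = {label: count / total for label, count in class_counts.items()}

print(f"Majority class: {majority_label}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
for label, share in sorted(class_shares.items()):
    print(f"Class {label}: {share:.1%} of samples")
```

Any classifier evaluated on this data should be compared against that baseline rather than against 0%.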

## 4. Model Selection

## 4.1 Test Run using Support Vector Machine classifier

In [7]:
# Separate features and labels
X = df.drop('quality', axis=1)
y = df['quality']

# Split the dataset into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=33)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Possible base classifiers: SVM, LogReg, RF, KNN
# One vs. All Classifier
ova_classifier = OneVsRestClassifier(SVC(random_state=22)) #initialize the OVA classifier with an SVM
ova_classifier.fit(X_train_scaled, y_train) #fit the scaled training data and their labels
y_pred_ova = ova_classifier.predict(X_test_scaled) #test the predictions using test data
accuracy_ova = accuracy_score(y_test, y_pred_ova) #compute the accuracy of test data using the test labels
print(f'One vs. All Classifier Accuracy: {accuracy_ova}')

# One vs. One Classifier (exact steps as before, except OvO is used)
ovo_classifier = OneVsOneClassifier(SVC(random_state=11))
ovo_classifier.fit(X_train_scaled, y_train)
y_pred_ovo = ovo_classifier.predict(X_test_scaled)
accuracy_ovo = accuracy_score(y_test, y_pred_ovo)
print(f'One vs. One Classifier Accuracy: {accuracy_ovo}')

One vs. All Classifier Accuracy: 0.8133971291866029
One vs. One Classifier Accuracy: 0.8133971291866029


## 4.2 Plotting the Confusion Matrix

In [8]:
# Compute confusion matrices using predictions from previous step
cm_ova = confusion_matrix(y_test, y_pred_ova)
cm_ovo = confusion_matrix(y_test, y_pred_ovo)

"""
# Plot confusion matrix for OVA
ConfusionMatrixDisplay(cm_ova, display_labels=ova_classifier.classes_).plot()
plt.title('Confusion Matrix -  OneVsAll SVC')
plt.show()

# Plot confusion matrix for OVO
ConfusionMatrixDisplay(cm_ovo, display_labels=ovo_classifier.classes_).plot()
plt.title('Confusion Matrix - OneVsOne SVC')
plt.show()
"""

![ConfusionMatrix](https://i.imgur.com/5Z2lKFE.png)


## 4.3 Computing the Classification Report

In [9]:
# Print classification report for OVA
print("OneVsAll Classifier Report ")
print(classification_report(y_test, y_pred_ova, target_names=ova_classifier.classes_.astype(str), zero_division=0))

# Print classification report for OVO
print("OneVsOne Classifier Report")
print(classification_report(y_test, y_pred_ovo, target_names=ovo_classifier.classes_.astype(str), zero_division=0))

OneVsAll Classifier Report 
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         8
           1       0.81      1.00      0.90       170
           2       0.00      0.00      0.00        31

    accuracy                           0.81       209
   macro avg       0.27      0.33      0.30       209
weighted avg       0.66      0.81      0.73       209

OneVsOne Classifier Report
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         8
           1       0.81      1.00      0.90       170
           2       0.00      0.00      0.00        31

    accuracy                           0.81       209
   macro avg       0.27      0.33      0.30       209
weighted avg       0.66      0.81      0.73       209



## 4.4 Analysis of Current Results

Class 0:
True Positives (TP): 0 (no instances were correctly predicted as class 0)
False Negatives (FN): 8 (all 8 instances of class 0 were incorrectly predicted as class 1)
There are no instances of class 0 predicted as class 2.

Class 1:
True Positives (TP): 170 (all 170 instances were correctly predicted as class 1, matching the support in the classification report)
There are no instances of class 1 incorrectly predicted as class 0 or class 2.

Class 2:
True Positives (TP): 0 (no instances were correctly predicted as class 2)
False Negatives (FN): 31 (all 31 instances of class 2 were incorrectly predicted as class 1)
There are no instances of class 2 predicted as class 0.

### Conclusion:

As expected given the significant imbalance, the model did not predict a single instance as class 0 or class 2: every instance of these classes was misclassified as class 1, so the bias towards the majority class is clear. OvA and OvO also produced identical accuracy on this single train/test split (no cross-validation was used).
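
A small sketch of how a class-balanced metric exposes this failure: the confusion matrix below is reconstructed from the classification report in Section 4.3 (every test sample predicted as class 1), and the helper names `accuracy` and `balanced_accuracy` are illustrative, not taken from the notebook.

```python
# Rows are true classes, columns are predicted classes (0, 1, 2).
# Reconstructed from the report in Section 4.3: all samples predicted as 1.
cm_baseline = [
    [0, 8, 0],
    [0, 170, 0],
    [0, 31, 0],
]


def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(cm[k][k] for k in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def balanced_accuracy(cm):
    """Mean of per-class recall, so every class counts equally."""
    recalls = [cm[k][k] / sum(cm[k]) for k in range(len(cm)) if sum(cm[k]) > 0]
    return sum(recalls) / len(recalls)


print(f"Accuracy:          {accuracy(cm_baseline):.4f}")
print(f"Balanced accuracy: {balanced_accuracy(cm_baseline):.4f}")
```

The balanced accuracy equals the macro-averaged recall shown in the report, which is the value a three-class model gets by always predicting one class.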

Back to the drawing board!

## 5. Data Preprocessing using SMOTE

In [17]:
from imblearn.over_sampling import SMOTE

# Initialize SMOTE
smote = SMOTE(random_state=22)

# Fit SMOTE on the training data
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_smote)
X_test_scaled = scaler.transform(X_test)

print('New Value Counts for each label after SMOTE:\n',y_train_smote.value_counts())

New Value Counts for each label after SMOTE:
 1    683
2    683
0    683
Name: quality, dtype: int64
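
For intuition about what the resampling step does, here is a simplified sketch of the interpolation idea behind SMOTE: a synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbours. This is not imbalanced-learn's implementation; the function names and the four two-feature points are made up for illustration.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean distance)."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]


def synthesize(minority, k, rng):
    """Create one synthetic point between a sample and a near neighbour."""
    base = rng.choice(minority)
    neighbor = rng.choice(k_nearest(base, minority, k))
    lam = rng.random()  # interpolation factor in [0, 1)
    new_point = [b + lam * (n - b) for b, n in zip(base, neighbor)]
    return base, neighbor, new_point


# Hypothetical minority-class samples with two features each.
minority = [[7.0, 0.30], [7.4, 0.35], [8.1, 0.28], [6.8, 0.40]]

base, neighbor, new_point = synthesize(minority, k=2, rng=random.Random(22))
print("base:     ", base)
print("neighbor: ", neighbor)
print("synthetic:", new_point)
```

Because the distances are Euclidean, features on larger scales dominate the neighbour search, which is worth keeping in mind when resampling before standardization as done above.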


## 6. Model Selection after Preprocessing

## 6.1 Random Forest Classifier with Cross Validation

In [19]:

# Initialize the Random Forest classifier
rf_classifier = RFC(random_state=4)

# Perform cross-validation
cv_scores = cross_val_score(rf_classifier, X_train_scaled, y_train_smote, cv=5)

# Output the mean cross-validation score
print(f'CV mean score: {cv_scores.mean():.4f}')

# Fit the model on the training data
rf_classifier.fit(X_train_scaled, y_train_smote)

# Predict on the test data
y_pred_rfc = rf_classifier.predict(X_test_scaled)

print(f'Test set accuracy: {accuracy_score(y_test, y_pred_rfc):.4f}')


# Compute confusion matrices
cm_rfc = confusion_matrix(y_test, y_pred_rfc)
print(cm_rfc)


# Plot confusion matrix for RFC
"""
ConfusionMatrixDisplay(cm_rfc, display_labels=rf_classifier.classes_).plot()
plt.title('Confusion Matrix - Upsampled Data - Random Forest Classifier')
plt.show()
"""

# Print classification report for RFC
print("Random Forest Classifier Report ")
print(classification_report(y_test, y_pred_rfc, target_names=rf_classifier.classes_.astype(str), zero_division=0))


CV mean score: 0.8814
Test set accuracy: 0.6651
[[  0   7   1]
 [  7 137  26]
 [  3  26   2]]
Random Forest Classifier Report 
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         8
           1       0.81      0.81      0.81       170
           2       0.07      0.06      0.07        31

    accuracy                           0.67       209
   macro avg       0.29      0.29      0.29       209
weighted avg       0.67      0.67      0.67       209



## 6.2 Neural Network with Dropout, L2 Regularization and Early Stopping

In [20]:
# Define the model with dropout and L2 regularization
model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train_smote.shape[1],), kernel_regularizer=l2(0.001)),
    Dropout(0.5),
    Dense(128, activation='relu', kernel_regularizer=l2(0.001)),
    Dropout(0.5),
    Dense(3, activation='softmax')  
])

# Compile the model
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',  
    metrics=['accuracy']
)

# Define early stopping callback
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)

# Train the model with early stopping
history = model.fit(
    X_train_scaled,
    y_train_smote,
    epochs=70,
    batch_size=32,
    validation_split=0.2,
    callbacks=[early_stopping],
    verbose=1
)

# Evaluate the model on the scaled test set
test_loss, test_accuracy = model.evaluate(X_test_scaled, y_test, verbose=0)

print(f"Test set accuracy: {test_accuracy:.4f}")

# Make predictions on the test set
y_pred_probs = model.predict(X_test_scaled)
y_pred_labels = np.argmax(y_pred_probs, axis=1)

# Compute the confusion matrix
cm_nn = confusion_matrix(y_test, y_pred_labels)

# Plot the confusion matrix
"""
ConfusionMatrixDisplay.from_predictions(y_test, y_pred_labels)
plt.title('Confusion Matrix - Neural Network')
plt.show()
"""

# Print the classification report
print("Neural Network Classifier Report")
print(classification_report(y_test, y_pred_labels))

Epoch 1/70
...
Epoch 69/70
Test set accuracy: 0.6986
Neural Network Classifier Report
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         8
           1       0.81      0.83      0.82       170
           2       0.21      0.16      0.18        31

    accuracy                           0.70       209
   macro avg       0.34      0.33      0.33       209
weighted avg       0.69      0.70      0.69       209
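
To clarify what `patience` and `restore_best_weights` mean in the callback above, here is a simplified pure-Python model of the stopping rule: training stops once the monitored loss has failed to improve for `patience` consecutive epochs, and the best epoch is the one whose weights would be restored. It is a sketch of the logic only, not Keras code; the function name and the loss values are made up for illustration.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience was never exhausted: training runs through the last epoch.
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses, one per epoch (0-indexed).
val_losses = [1.00, 0.80, 0.70, 0.75, 0.72, 0.71, 0.90]

stopped_epoch, best_epoch = simulate_early_stopping(val_losses, patience=3)
print(f"Stopped at epoch {stopped_epoch}, best epoch was {best_epoch}")
```

With `patience=10`, as in the notebook, the same loss sequence would run to its final epoch because the loss never stalls for ten epochs in a row.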



## 7. Insights and Interpretation

## 7.1 Test Accuracy and Confusion Matrices

In [24]:
# Plot confusion matrices side by side
"""
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

ConfusionMatrixDisplay(cm_nn, display_labels=[0,1,2]).plot(ax=axes[0], cmap=plt.cm.Blues)
axes[0].set_title('NN Confusion Matrix')

ConfusionMatrixDisplay(cm_rfc, display_labels=[0,1,2]).plot(ax=axes[1], cmap=plt.cm.Blues)
axes[1].set_title('RFC Confusion Matrix')

plt.show()
"""

![ConfusionMatrix](https://i.imgur.com/BrMK0D8.png)

### Accuracy
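The NN has an accuracy of 64.63%, while the RFC has a higher accuracy of 75.11%, which suggests that the RFC performs better overall on the test set. These values do not match the cell outputs printed in Section 6 (69.86% for the NN and 66.51% for the RFC), so the ranking of the two models should be read as tentative.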

### Confusion Matrix: 
The confusion matrices show the distribution of predictions across the actual classes. Both models struggle with class 0 and class 2, but the RFC seems to make more accurate predictions for the majority class (class 1).

## 7.2 Precision, Recall, and F1-Score

In [23]:
# Bar chart for F1-scores
"""
nn_f1_scores = [0.11, 0.77, 0.30]
rfc_f1_scores = [0.11, 0.85, 0.34]

index = np.arange(len(quality_classes))
bar_width = 0.35

fig, ax = plt.subplots()
nn_bar = ax.bar(index, nn_f1_scores, bar_width, label='NN')
rfc_bar = ax.bar(index + bar_width, rfc_f1_scores, bar_width, label='RFC')

ax.set_xlabel('Class')
ax.set_ylabel('F1 Score')
ax.set_title('F1 Score by class and model')
ax.set_xticks(index + bar_width / 2)
ax.set_xticklabels([0,1,2])
ax.legend()

plt.show()
"""


![F1 score](https://i.imgur.com/rig6a6T.png)

For class 0 and class 2, both models have relatively low scores, indicating difficulty in correctly predicting these classes. However, for the majority class, the RFC performs better in terms of precision and recall, leading to a higher F1-score.
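
As a sketch of how per-class precision, recall and F1 follow directly from a confusion matrix, the snippet below uses the Random Forest matrix printed in Section 6.1. The function name `per_class_scores` is an illustrative helper, not part of the notebook.

```python
# Random Forest confusion matrix as printed in Section 6.1.
# Rows are true classes, columns are predicted classes (0, 1, 2).
cm_rfc = [
    [0, 7, 1],
    [7, 137, 26],
    [3, 26, 2],
]


def per_class_scores(cm):
    """Return {class: (precision, recall, f1)} from a confusion matrix."""
    n = len(cm)
    scores = {}
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # rest of the true-class row
        fp = sum(cm[i][k] for i in range(n)) - tp  # rest of the predicted column
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[k] = (precision, recall, f1)
    return scores


scores = per_class_scores(cm_rfc)
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)

for label, (precision, recall, f1) in scores.items():
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"macro F1: {macro_f1:.2f}")
```

Computing the scores this way, or with `classification_report` as in the earlier cells, keeps a chart in sync with the predictions instead of relying on values typed in by hand.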

## 7.3 Final Analysis

The overall accuracy of the Random Forest Classifier decreased after upsampling (79% to 75.11%), but this is not a negative outcome: it gives a more truthful picture of the model's predictive capability across all classes. Nevertheless, both models show limited effectiveness in classifying the minority classes, likely because the features are only weakly informative and the effects of class imbalance persist despite the attempt to mitigate them with SMOTE.

The RFC demonstrates better performance than the Neural Network (NN) in terms of accuracy and F1-score, particularly for the predominant class, suggesting better feature handling and resilience to overfitting. Other ensemble techniques such as Gradient Boosting or AdaBoost may also be worth exploring.

The NN could potentially see improvements with more hyperparameter tuning or increasing the dataset size.

Ultimately, while the RFC currently leads in performance, there is room for enhancement in both models through more refined tuning and a more strategic approach to class imbalance.

## 8. Conclusion

The project highlights the common challenge of class imbalances in classification tasks, where the RFC's robustness to overfitting and its impressive handling of features have given it an edge over the NN, particularly after upsampling efforts attempted to mitigate class imbalances. The evaluation through precision, recall, and F1-scores underscores the limitations of relying only on accuracy as a performance metric, revealing a more comprehensive picture of the model's performance in imbalanced datasets. While the NN's complexity suggests a need for larger datasets to achieve better generalization, the RFC's ensemble approach demonstrates a strong capacity for working with the available data. This project reinforces the importance of an all-round assessment of model performance, considering the complex balance between various types of errors and their implications in real-world applications.