# Analysis of Wine Quality and Prediction Using Logistic Regression

by Alix, Paramveer, Susannah, Zoe 2024/11/23

In [1]:
import pandas as pd
import numpy as np
from ucimlrepo import fetch_ucirepo 
from sklearn.model_selection import train_test_split
import altair as alt
import altair_ally as aly

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold

from sklearn.model_selection import RandomizedSearchCV
import scipy.stats as stats

from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
import os
import pandera as pa

## Summary

This analysis investigates the relationship between physicochemical properties and wine quality using the Wine Quality dataset from the UCI Machine Learning Repository, which contains data for both red and white wine. Through exploratory data analysis, we examined 11 physicochemical features and their relationships with wine quality scores. Our analysis revealed that higher quality wines typically have higher alcohol content and lower volatile acidity, with white wines generally receiving higher quality scores than red wines. Most features showed right-skewed distributions with notable outliers, particularly in the sulfur dioxide and residual sugar measurements. The quality scores themselves followed a roughly bell-shaped distribution centered around scores 5-6.

We implemented a logistic regression model with standardized features and one-hot encoded categorical variables, using randomized search cross-validation to optimize the regularization parameter. The final model achieved an accuracy of 53.7% on the test set. While this performance suggests room for improvement, the analysis provides valuable insights for future research directions.

## Introduction

The quality of wine is influenced by various chemical properties and sensory factors that determine its taste, aroma, and overall acceptability. Here, we aim to predict the quality of wine using a publicly available wine quality dataset. Machine learning-based predictive modeling is commonly used in the field of wine quality to identify patterns and relationships in key features such as alcohol, sulfates, and volatile acidity, which are critical factors impacting wine quality (Jain et al. 2023). By applying a machine learning model, we seek to enhance the accuracy of wine quality predictions and contribute to the advancement of data-driven approaches in wine evaluation methodologies.

## Methods

### Data

The dataset used in this project is the Wine Quality dataset from the UCI Machine Learning Repository (Cortez et al. 2009) and can be found here: https://archive.ics.uci.edu/dataset/186/wine+quality. It combines two datasets related to the red and white variants of the Portuguese "Vinho Verde" wine. They contain physicochemical properties (e.g., acidity, sugar content, and alcohol) of different wine samples, alongside a sensory score representing the quality of the wine, rated by experts on a scale from 0 to 10. Each row in the dataset represents a wine sample, with the columns detailing 11 physicochemical attributes and the quality score. The classes are ordered and not balanced (e.g., there are many more normal wines than excellent or poor ones).

Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

#### 0. Import the dataset and inspect the data

In [2]:
# Get the complete dataset
wine_quality = fetch_ucirepo(id=186)
raw_data = wine_quality.data.original 

# Ensure directories exist
os.makedirs('../data/raw', exist_ok=True)
os.makedirs('../data/processed', exist_ok=True)

# Save data into data folder as well
raw_data.to_csv('../data/raw/wine_quality.csv', index=False)

# reorder columns
raw_data['quality'] = raw_data.pop('quality')

In [3]:
# validate data
schema = pa.DataFrameSchema(
    {
        "color": pa.Column(str, pa.Check.isin(["red", "white"])),
        "fixed_acidity": pa.Column(float, pa.Check.between(0, 16), nullable=True),
        "volatile_acidity": pa.Column(float, pa.Check.between(0, 1.8), nullable=True),
        "citric_acid": pa.Column(float, pa.Check.between(0, 1.4), nullable=True), 
        "residual_sugar": pa.Column(float, pa.Check.between(0, 30), nullable=True),
        "chlorides": pa.Column(float, pa.Check.between(0, 0.7), nullable=True),
        "free_sulfur_dioxide": pa.Column(float, pa.Check.between(0, 160), nullable=True),
        "total_sulfur_dioxide": pa.Column(float, pa.Check.between(0, 400), nullable=True),
        "density": pa.Column(float, pa.Check.between(0, 1.5), nullable=True),
        "pH": pa.Column(float, pa.Check.between(0, 5), nullable=True),
        "sulphates": pa.Column(float, pa.Check.between(0, 2.5), nullable=True),
        "alcohol": pa.Column(float, pa.Check.between(9, 15), nullable=True),
        "quality": pa.Column(float, pa.Check.between(1, 10), nullable=True)
    },
    checks=[
        pa.Check(lambda df: ~df.duplicated().any(), error="Duplicate rows found."),
        pa.Check(lambda df: ~(df.isna().all(axis=1)).any(), error="Empty rows found.")
    ],
    drop_invalid_rows=True
)

clean_data = schema.validate(raw_data, lazy=True).drop_duplicates().dropna(how="all")

In [4]:
# Split training and testing data
train_df, test_df = train_test_split(clean_data, test_size=0.2, random_state=522)

# Store split data in data folder
train_df.to_csv('../data/processed/training_set.csv', index=False)
test_df.to_csv('../data/processed/test_set.csv', index=False)

train_df.head()

|  | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | color | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2697 | 6.5 | 0.29 | 0.25 | 10.6 | 0.039 | 32.0 | 120.0 | 0.9962 | 3.31 | 0.34 | 10.1 | white | 6 |
| 4493 | 6.4 | 0.125 | 0.36 | 1.4 | 0.044 | 22.0 | 68.0 | 0.99014 | 3.15 | 0.5 | 11.7 | white | 7 |
| 6137 | 7.1 | 0.09 | 0.3 | 6.2 | 0.032 | 24.0 | 134.0 | 0.993 | 2.99 | 0.39 | 10.9 | white | 6 |
| 4973 | 5.9 | 0.19 | 0.21 | 1.7 | 0.045 | 57.0 | 135.0 | 0.99341 | 3.32 | 0.44 | 9.5 | white | 5 |
| 2252 | 5.9 | 0.24 | 0.26 | 12.3 | 0.053 | 34.0 | 134.0 | 0.9972 | 3.34 | 0.45 | 9.5 | white | 6 |


In [5]:
# Check data info
print(f"Training data shape: {train_df.shape}")
print(f"Testing data shape: {test_df.shape}")
print('-'*50)
train_df.info()

Training data shape: (4089, 13)
Testing data shape: (1023, 13)
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Index: 4089 entries, 2697 to 5021
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         4089 non-null   float64
 1   volatile_acidity      4089 non-null   float64
 2   citric_acid           4089 non-null   float64
 3   residual_sugar        4089 non-null   float64
 4   chlorides             4089 non-null   float64
 5   free_sulfur_dioxide   4089 non-null   float64
 6   total_sulfur_dioxide  4089 non-null   float64
 7   density               4089 non-null   float64
 8   pH                    4089 non-null   float64
 9   sulphates             4089 non-null   float64
 10  alcohol               4089 non-null   float64
 11  color                 4089 non-null   object 
 12  quality               4089 non-null   int64  
dtypes: float64(11), int64(1), object(1)

In [6]:
# Data description
train_df.describe()

|  | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4089.0 | 4089.0 | 4089.0 | 4089.0 | 4089.0 | 4089.0 | 4089.0 | 4089.0 | 4089.0 | 4089.0 | 4089.0 | 4089.0 |
| mean | 7.20889 | 0.343518 | 0.314935 | 4.794302 | 0.056625 | 29.302641 | 111.920885 | 0.994377 | 3.229751 | 0.534749 | 10.621882 | 5.809978 |
| std | 1.326447 | 0.167266 | 0.143004 | 4.185769 | 0.036748 | 17.104893 | 55.638768 | 0.002851 | 0.15924 | 0.150186 | 1.151641 | 0.869599 |
| min | 3.8 | 0.08 | 0.0 | 0.6 | 0.012 | 1.0 | 6.0 | 0.98713 | 2.72 | 0.25 | 9.0 | 3.0 |
| 25% | 6.4 | 0.23 | 0.25 | 1.7 | 0.037 | 16.0 | 73.0 | 0.9921 | 3.12 | 0.43 | 9.6 | 5.0 |
| 50% | 6.9 | 0.3 | 0.31 | 2.6 | 0.047 | 27.0 | 114.0 | 0.9944 | 3.22 | 0.51 | 10.4 | 6.0 |
| 75% | 7.7 | 0.41 | 0.39 | 7.1 | 0.067 | 40.0 | 150.0 | 0.9966 | 3.33 | 0.6 | 11.4 | 6.0 |
| max | 15.9 | 1.33 | 1.23 | 26.05 | 0.61 | 138.5 | 366.5 | 1.00369 | 4.01 | 2.0 | 14.9 | 9.0 |


**Preprocessing data requirements**

From the data info and description, we can see that:
1. The numerical features are on different scales, so we need to standardize them.
2. There is one categorical feature, `color`, which we need to encode.
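
The two requirements above can be illustrated with a small dependency-free sketch. It is not the preprocessing used in this notebook (that is done later with scikit-learn's `StandardScaler` and `OneHotEncoder`); the helper names `standardize` and `encode_binary` are hypothetical, the numeric values are the five rows shown by `train_df.head()`, and the color labels are made up so that both categories appear. The z-score uses the population standard deviation, and the binary encoding assigns 0 to the alphabetically first category, which is our understanding of how `drop='if_binary'` behaves.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def encode_binary(labels):
    """Map a two-category column to 0/1; the alphabetically first category becomes 0."""
    categories = sorted(set(labels))
    if len(categories) != 2:
        raise ValueError("expected exactly two categories")
    return [categories.index(label) for label in labels]


# Values from the five rows displayed by train_df.head()
alcohol = [10.1, 11.7, 10.9, 9.5, 9.5]
total_sulfur_dioxide = [120.0, 68.0, 134.0, 135.0, 134.0]
# Hypothetical labels (the five displayed rows are all white wines)
color = ["white", "red", "white", "red", "white"]

alcohol_z = standardize(alcohol)
sulfur_z = standardize(total_sulfur_dioxide)
color_encoded = encode_binary(color)

# After standardization both columns are on the same unitless scale
print([round(z, 2) for z in alcohol_z])
print([round(z, 2) for z in sulfur_z])
print(color_encoded)
```

After this transformation, a feature measured in the hundreds (total sulfur dioxide) and one measured around ten (alcohol) contribute on a comparable scale, which matters for a regularized model.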

#### 1. EDA

**1.1 Distribution of quality scores across numerical features**

In [7]:
aly.alt.data_transformers.enable('vegafusion')

aly.dist(train_df, color='quality')

From the distribution plots above, we have the following findings:
1. Higher quality wines tend to have higher alcohol content
2. Higher quality wines generally have lower volatile acidity
3. pH seems to have little discriminative power for quality (all quality levels overlap significantly)
4. The `density` feature does not show any meaningful relationship with wine quality

**1.2 Distribution of quality scores by categorical feature (wine color)**

In [8]:
# Calculate the proportions of each quality score for different wine colors
proportions = (train_df.groupby(['color', 'quality'])
              .size()
              .reset_index(name='count')
              .assign(proportion=lambda x: x.groupby('color')['count'].transform(lambda y: y / y.sum()))
              .reset_index(drop=True))

# Create a line plot showing the proportions of each quality score for different wine colors
alt.Chart(proportions).mark_line(
    interpolate='monotone',  
    point=True,             
    tension=0.7,           
    strokeWidth=2          
).encode(
    x=alt.X('quality:Q', 
            title='Quality Score',
            scale=alt.Scale(domain=[2.5, 9.5])),
    y=alt.Y('proportion:Q', 
            title='Proportion', 
            axis=alt.Axis(format='.0%')),
    color=alt.Color('color:N', 
                   title='Wine Type',
                   scale=alt.Scale(domain=['red', 'white'],
                                 range=['#1f77b4', '#ff7f0e'])), 
    tooltip=[
        alt.Tooltip('quality:Q', title='Quality'),
        alt.Tooltip('proportion:Q', title='Proportion', format='.1%'),
        alt.Tooltip('color:N', title='Wine Type')
    ]
).properties(
    width=500,
    height=300
)

This plot shows that white wine, on average, tends to have higher quality scores than red wine.

**1.3 Correlation matrix**

In [9]:
aly.corr(train_df)

As shown above, the correlation between total sulfur dioxide and free sulfur dioxide is high, so we might want to use one of them to represent the other. First, let's look at the scatter plot for these two features.

In [10]:
# Create scatter plot with regression line.
# Sample once (with a fixed seed) so the points and the fitted line use the same rows.
sulfur_sample = train_df[['free_sulfur_dioxide', 'total_sulfur_dioxide']].sample(600, random_state=522)

scatter = alt.Chart(sulfur_sample).mark_circle().encode(
    x='free_sulfur_dioxide',
    y='total_sulfur_dioxide'
).properties(
    width=300,
    height=200
)

# Overlay a linear regression of total on free sulfur dioxide
scatter + scatter.transform_regression(
    'free_sulfur_dioxide',
    'total_sulfur_dioxide'
).mark_line(color='red')

From the scatter plot, we can see a positive correlation between free and total sulfur dioxide, but the relationship is not perfectly linear. Since keeping both features would not make the model too complex, we leave them both in the model for now.
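
To put a number on "positive but not perfectly linear", one option is the Pearson correlation coefficient r, whose square is the share of variance in one variable that a straight line in the other accounts for. The sketch below is only an illustration: `pearson_r` is a hypothetical helper, and the inputs are the five rows shown by `train_df.head()`, far too few to say anything about the full dataset. On the real data the same quantity should be available via `train_df['free_sulfur_dioxide'].corr(train_df['total_sulfur_dioxide'])`.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Values from the five rows displayed by train_df.head()
free_sulfur_dioxide = [32.0, 22.0, 24.0, 57.0, 34.0]
total_sulfur_dioxide = [120.0, 68.0, 134.0, 135.0, 134.0]

r = pearson_r(free_sulfur_dioxide, total_sulfur_dioxide)
# r close to 1 would suggest one feature is nearly redundant given the other
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```

An r well below 1 means each feature still carries information the other does not, which is consistent with keeping both.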



**1.4 Outlier detection**

In [11]:
# Get numerical columns only (exclude 'quality' and 'color')
numerical_cols = train_df.select_dtypes(include=['float64', 'int64']).columns
numerical_cols = [col for col in numerical_cols if col != 'quality']

# Create box plots
charts = []
for col in numerical_cols:
    chart = alt.Chart(train_df).mark_boxplot().encode(
        x=alt.X(col + ':Q', scale=alt.Scale(zero=False)),
        y=alt.Y('color:N', title=None),  # one box per wine color; hide the axis title
        color=alt.Color('color:N', legend=alt.Legend(title="Wine Type"))
    ).properties(
        title=col,
        width=250,
        height=80
    )
    charts.append(chart)

# Display all the box plots together in a grid with three charts per row
n_cols = 3
grid = alt.vconcat(*[alt.hconcat(*charts[i:i+n_cols]) for i in range(0, len(charts), n_cols)])

grid

From the box plots above, we have the following findings:

1. Outliers:
   - Many features show significant outliers
   - Particularly noticeable in sulfur dioxide and residual sugar

2. Distributions:
   - Most features show right-skewed distributions
   - pH shows relatively normal distribution for both types
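
The points drawn beyond the whiskers of a box plot are conventionally those lying more than 1.5 times the interquartile range (IQR) outside the quartiles; we assume the box plots above use that default. A minimal sketch of the rule follows, with a hypothetical helper `iqr_outliers` and illustrative residual sugar values (some borrowed from the tables above, but not the real column).

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Illustrative residual sugar values, right-skewed like the real feature
residual_sugar = [1.4, 1.7, 2.6, 6.2, 7.1, 10.6, 12.3, 26.05]

flagged = iqr_outliers(residual_sugar)
print(f"{len(flagged)} of {len(residual_sugar)} values flagged: {flagged}")
```

Counting flagged values per feature in this way would turn "many features show significant outliers" into a comparable number for each column.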

**1.5 The distribution of the target variable (quality)**

In [12]:
# Create a DataFrame with the quality counts
quality_df = pd.DataFrame({
    'quality': train_df['quality'].value_counts().index,
    'count': train_df['quality'].value_counts().values
})
quality_df['percentage'] = (quality_df['count'] / len(train_df) * 100).round(1)
quality_df = quality_df.sort_values('quality')

# Create the bar plot
chart = alt.Chart(quality_df).mark_bar().encode(
    x=alt.X('quality:O', title='Quality Score'),
    y=alt.Y('percentage:Q', title='Percentage (%)')
).properties(
    width=350,
    height=200,
    title='Distribution of Wine Quality Scores'
)

chart

The target variable has a roughly bell-shaped distribution: scores are centered around 5-6, with frequencies decreasing toward both the low and high ends.
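
Because the scores are concentrated in the middle, a classifier that always predicts the most common score already reaches an accuracy equal to that score's share of the data. This is a useful reference point when accuracy is the metric. The sketch below uses made-up scores and a hypothetical helper `majority_baseline_accuracy`; on the real training set the same number should be obtainable with `train_df['quality'].value_counts(normalize=True).max()`.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Hypothetical quality scores, concentrated around 5-6 like the real target
scores = [4] * 1 + [5] * 6 + [6] * 9 + [7] * 3 + [8] * 1

baseline = majority_baseline_accuracy(scores)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```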

### Analysis

A logistic regression algorithm was used to build a classification model that predicts wine quality, treating each integer score in the `quality` column as a separate class. All variables in the original dataset, including wine color (i.e., red or white), were used to fit the model. The data were split with 80% partitioned into the training set and 20% into the test set. The hyperparameter C was chosen by randomized search with 3-fold cross-validation, using accuracy as the scoring metric; three folds were used instead of five because the least populated quality class has only four members in the training set. All numeric variables were standardized just prior to model fitting. The `color` column was converted to a single binary column using one-hot encoding with `drop='if_binary'`.

## Results and Discussion

We separate the features from the target, transform the features (encoding wine color as a binary variable and applying a standard scaler to all other features), and build our logistic regression model:

In [13]:
# Split dataset
X_train, X_test, y_train, y_test = (train_df.drop(columns='quality'), test_df.drop(columns='quality'),
                                    train_df['quality'], test_df['quality']
                                    )

numeric_features = X_train.select_dtypes(include='number').columns.tolist()
binary_features = ['color']

# Make column transformer
preprocessor = make_column_transformer(
    (OneHotEncoder(drop='if_binary'), binary_features),
    (StandardScaler(), numeric_features)
)

# Make pipeline combining the preprocessor and LogisticRegression
model = make_pipeline(
    preprocessor,
    LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
)

We find the best value of the hyperparameter C for the model:

In [14]:
# Define parameter distribution
param_dist = {
    'logisticregression__C': stats.uniform(0.001, 100),
}

# Perform randomized search
random_search = RandomizedSearchCV(model, param_distributions=param_dist,
                                   cv=3, # The least populated class in y has only 4 members, which is less than n_splits=5.
                                   n_iter=50,
                                   scoring='accuracy', random_state=42)
random_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", random_search.best_params_)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Best Parameters: {'logisticregression__C': 49.51869101112702}
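
A note on the search space: `stats.uniform(0.001, 100)` takes `loc` and `scale` arguments, so it draws C uniformly from the interval [0.001, 100.001]. Under a uniform draw, small values of C (strong regularization) are rarely proposed. The sketch below is not part of the analysis; it uses only the standard library, with hypothetical helpers `sample_uniform` and `sample_log_uniform`, to compare how often each scheme proposes C < 1. A log-uniform draw (for example via `scipy.stats.loguniform`) is one possible alternative, mentioned here only as an assumption worth testing.

```python
import math
import random


def sample_uniform(rng, loc, scale):
    """Mimic a uniform draw on [loc, loc + scale]."""
    return rng.uniform(loc, loc + scale)


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
n_draws = 10_000

uniform_draws = [sample_uniform(rng, 0.001, 100) for _ in range(n_draws)]
log_draws = [sample_log_uniform(rng, 0.001, 100) for _ in range(n_draws)]

# Share of candidate C values below 1 under each sampling scheme
share_small_uniform = sum(c < 1 for c in uniform_draws) / n_draws
share_small_log = sum(c < 1 for c in log_draws) / n_draws

print(f"uniform:     {share_small_uniform:.1%} of draws have C < 1")
print(f"log-uniform: {share_small_log:.1%} of draws have C < 1")
```

With only 50 iterations, a uniform search over this range is therefore expected to try very few strongly regularized models.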


With our tuned model using the best C hyperparameter found above, we find the accuracy score of our predictions, comparing them to actual wine quality in the test set:

In [15]:
# Evaluate
y_pred = random_search.predict(X_test)
# print(classification_report(y_test, y_pred))
accuracy_score(y_test, y_pred)

0.5366568914956011

With an accuracy of 0.54, this model is unlikely to be useful for predicting wine quality in practice, but the analysis points to directions that could be explored further. First, we chose logistic regression because it is an intuitive first step for a dataset whose features are largely numeric measurements of a wine's contents. Its results can be examined further to assess how linear the relationships between the features and quality are. If those linear relationships turn out to be weak, we can try another model, e.g., a tree-based one such as Random Forest, to see whether it predicts wine quality better. Second, data cleaning might inform our choice of model, since our EDA in the previous section found outliers across many features. It may be worthwhile to understand what each feature represents and apply domain knowledge to "treat" the data so that it is better suited for training than in its current form. This would involve consulting professionals who understand wine composition and quality about why the outliers occur and what they indicate. We believe these two next steps will give us a better foundation for choosing a model that performs better in the future.

## References

1. Jain, K., Kaushik, K., Gupta, S. K., & Others. (2023). Machine learning-based predictive modelling for the enhancement of wine quality. Scientific Reports, 13, 17042. https://doi.org/10.1038/s41598-023-44111-9

2. Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Wine Quality [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C56S3T

3. Kniazieva, Y. (2023, October 12). A digital sommelier: Machine learning for wine quality prediction. Label Your Data. https://labelyourdata.com/articles/machine-learning-for-wine-quality-prediction

4. Aich, S., Al-Absi, A. A., Hui, K. L., Lee, J. T., & Sain, M. (2018). A classification approach with different feature sets to predict the quality of different types of wine using machine learning techniques. In International Conference on Advanced Communication Technology (ICACT) (pp. 139–143). https://doi.org/10.23919/ICACT.2018.8323674