# A Practical Exploration of the Wine Quality Dataset

#### By Jordan Cairns, Chris Gao, Yingzi Jin and Chun Li
#### In fulfillment of DSCI 522 Milestone 1

## Executive Summary

Our analysis aimed to develop a predictive model to distinguish between red and white wines based on various physicochemical properties. This study employed logistic regression, a model renowned for its balance between predictive power and interpretability.

The regression result suggested that residual sugar and total sulfur dioxide had high positive coefficients, indicating a strong association with white wine, whereas density showed the most substantial negative impact, followed by alcohol and volatile acidity, suggesting these are key indicators of red wine.


The logistic regression model not only achieved high accuracy but also provided valuable insights into the features most indicative of wine type. This model can assist vintners in quality control and classification tasks. Moreover, the interpretability of the model offers a foundation for further research into wine composition and its impact on sensory attributes. Future studies might explore more complex models or delve deeper into feature engineering to enhance predictive accuracy and understanding.




# Introduction

In the intricate world of oenology, the distinction between red and white wines extends beyond color, embedding itself in the nuanced spectrum of their physicochemical properties. This project embarks on a data-driven journey to unravel these complexities by leveraging statistical models to classify wines as red or white based on their inherent characteristics. Utilizing a rich dataset that encapsulates key attributes like acidity, sugar content, sulfur dioxide levels, alcohol concentration, and more, we aim to build a predictive model that not only accurately classifies the wines but also sheds light on the influential factors that underpin this classification. Through this analysis, we intend to blend the art of winemaking with the precision of data science, offering insights that could prove valuable to vintners, sommeliers, and wine enthusiasts alike in understanding the subtle distinctions between these two celebrated categories of wine.

# Data

The dataset used in our project comes from the UCI Machine Learning Repository and covers red and white variants of Portuguese "Vinho Verde" wine. It was assembled to model wine quality from physicochemical tests, capturing a range of variables that reflect the sensory and chemical composition of the wine samples. Its input features include acidity, sugar content, and alcohol level, while its output variable is a sensory-driven quality rating. In this project we use wine color as the classification target and keep the quality rating as one of the predictors. The dataset contains no information on grape types, wine brands, or prices because of privacy and logistic constraints, so our analysis rests entirely on measurable physicochemical and sensory attributes, free from commercial biases. Its structure supports both classification and regression tasks, which makes it a convenient basis for exploring machine learning applications in wine evaluation.
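The UCI archive distributes the red and white samples as two separate tables, so a combined file needs an explicit color label. The sketch below shows one way such a file could be assembled; the function name `combine_wines` and the tiny demo frames are hypothetical, and this is not necessarily how `winequality.csv` was produced.

```python
import pandas as pd


def combine_wines(red: pd.DataFrame, white: pd.DataFrame) -> pd.DataFrame:
    """Stack the red and white tables and label each row with its color."""
    red = red.assign(color="red")
    white = white.assign(color="white")
    return pd.concat([red, white], ignore_index=True)


# Made-up rows standing in for the real files (values are illustrative only).
red_demo = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})
white_demo = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1], "quality": [6, 6, 6]})

combined_demo = combine_wines(red_demo, white_demo)
print(combined_demo["color"].value_counts())
```

With the real files, the two frames would come from `pd.read_csv` calls on the downloaded tables instead of the demo frames.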

## Data Overview:

In [47]:
import numpy as np
import pandas as pd
import requests
import zipfile
import altair as alt
import os
import sys
from myst_nb import glue
import pickle

from sklearn import set_config
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sys.path.insert(0, "../src")
from helper_func_wine_classification_plot import create_wine_prediction_chart 

In [48]:

df_combined = pd.read_csv("../data/raw/winequality.csv")

In [49]:
'''
np.random.seed(522)
set_config(transform_output="pandas")

# Creating the split
wine_train, wine_test = train_test_split(df_combined, train_size=0.70, stratify=df_combined["color"])
'''


## Exploratory Data Analysis

The first step of our EDA is to plot histograms of every numerical variable, split by wine type. Comparing these distributions side by side shows which features differ most between the two categories, which in turn informs feature selection for predictive modeling. These plots give an intuitive view of the relationships in the data, highlight factors that could influence the wine's classification, and guide the subsequent analytical steps.




![comparative_distribution](../results/figures/comparative_distribution.png)

![numeric_histogram_plots](../results/figures/numeric_histogram.png)

Visually, some features do show significant differences between red and white wines and may be particularly relevant in distinguishing between the two. In particular, the following five features stand out in the histograms and could be considered significant for predicting the color of the wine.

1. Fixed & Volatile Acidity: There's a noticeable difference in the distributions, with red wines generally exhibiting higher fixed and volatile acidity.

2. Residual Sugar: White wines display a much higher residual sugar content, which could be a strong differentiator.

3. Total Sulfur Dioxide: The levels are significantly higher in white wines, suggesting this feature could be key in classification.

4. Free Sulfur Dioxide: Similar to total sulfur dioxide, this feature is also markedly higher in white wines.

5. pH value: The majority of red wines seem to have higher pH values overall.
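The visual comparison above can be backed by a simple number per feature. The sketch below computes a standardized mean difference between the two wine types; the helper name and the synthetic columns (`shifted_feature`, `flat_feature`) are assumptions made for illustration and do not come from the wine data.

```python
import numpy as np
import pandas as pd


def standardized_mean_difference(df, feature, label="color"):
    """(mean_white - mean_red) divided by the pooled standard deviation."""
    red = df.loc[df[label] == "red", feature]
    white = df.loc[df[label] == "white", feature]
    pooled_sd = np.sqrt((red.var(ddof=1) + white.var(ddof=1)) / 2)
    return float((white.mean() - red.mean()) / pooled_sd)


# Synthetic stand-in data: one feature differs by class, the other does not.
rng = np.random.default_rng(522)
demo = pd.DataFrame({
    "color": ["red"] * 200 + ["white"] * 200,
    "shifted_feature": np.concatenate(
        [rng.normal(0.0, 1.0, 200), rng.normal(2.0, 1.0, 200)]
    ),
    "flat_feature": rng.normal(0.0, 1.0, 400),
})

smd = {f: standardized_mean_difference(demo, f)
       for f in ["shifted_feature", "flat_feature"]}
for feature, value in smd.items():
    print(f"{feature}: {value:+.2f}")
```

A value far from zero suggests the histograms for that feature barely overlap, while a value near zero suggests the feature alone separates the classes poorly.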


![corr_plot](../results/figures/correlation_matrix_heatmap.png)

The correlation heatmap shows that most of the explanatory variables are not strongly correlated. However, two pairs, `free sulfur dioxide` with `total sulfur dioxide` and `density` with `alcohol`, have relatively high correlations (absolute value exceeding 0.7). This could make it harder for the model to estimate the relationship between each independent variable and the dependent variable separately.
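The same check can be done programmatically rather than by reading the heatmap. The following is a minimal sketch, assuming a 0.7 threshold as in the text; the function name `high_correlation_pairs` and the synthetic columns are hypothetical.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df, threshold=0.7):
    """Return (col_a, col_b, r) for each pair of numeric columns with |r| > threshold."""
    corr = df.select_dtypes("number").corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = float(corr.iloc[i, j])
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs


# Synthetic stand-in data: feature_b is built from feature_a, feature_c is independent.
rng = np.random.default_rng(522)
feature_a = rng.normal(size=300)
corr_demo = pd.DataFrame({
    "feature_a": feature_a,
    "feature_b": feature_a + 0.3 * rng.normal(size=300),
    "feature_c": rng.normal(size=300),
})

flagged = high_correlation_pairs(corr_demo)
print(flagged)
```

Applied to the training split, the function would take the wine DataFrame in place of `corr_demo`.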

## Models and Results

In [50]:
""" # Build a transformer to further process the data
numeric_features = wine_train.columns.tolist()[:-2]
categorical_features = wine_train.columns.tolist()[-2:-1]
columns_to_passthrough = ['color']

wine_preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OrdinalEncoder(), categorical_features),
    ('passthrough', columns_to_passthrough)
)

wine_preprocessor.fit(wine_train)
scaled_wine_train = wine_preprocessor.transform(wine_train)
scaled_wine_test = wine_preprocessor.transform(wine_test)

scaled_wine_train.to_csv("../data/processed/scaled_wine_train.csv")
scaled_wine_test.to_csv("../data/processed/scaled_wine_test.csv") """


In [51]:
""" # Rename 'passthrough__color' back to 'color'
scaled_wine_train.rename(columns={'passthrough__color': 'color'}, inplace=True)
scaled_wine_test.rename(columns={'passthrough__color': 'color'}, inplace=True)

# Preparing data for machine learning model
X_train = scaled_wine_train.drop(columns=['color'])
y_train = scaled_wine_train['color']

X_test = scaled_wine_test.drop(columns=['color'])
y_test = scaled_wine_test['color'] """

In [52]:
""" # Creating the DummyClassifier to get the baseline score
dummy_scores = pd.DataFrame(cross_validate(
    DummyClassifier(strategy="most_frequent"),
    X_train,
    y_train,
    return_train_score=True,
    scoring=["accuracy"]
))

dummy_scores """


In [53]:
""" from helper_func_model_selection import model_selection
models = model_selection("dummy", "dtree", "knn", "svm", "nb", "lr")
models """


In [54]:
""" # The following block of code was inspired by DSCI 571 Lab 4

results_list = []

for name, model in models.items():
    
    # Wrap the current model in a pipeline (the data is already preprocessed)
    pipeline = make_pipeline(model)
    
    # Perform cross-validation
    cv_results = cross_validate(pipeline, X_train, y_train, cv=5,
    return_train_score=True,
    scoring='accuracy',
    n_jobs=-1)
    
    # Append results for the current model to the results_list
    results_list.append({
        "model": name,
        "fit_time": np.mean(cv_results['fit_time']),
        "score_time": np.mean(cv_results['score_time']),
        "test_score": np.mean(cv_results['test_score']),
        "train_score": np.mean(cv_results['train_score']),
    })

# Create a DataFrame from the results_list
results_df = pd.DataFrame(results_list)

# Set the model name as the index
results_df.set_index('model', inplace=True)

# Show the resulting DataFrame
results_df """


Cross-validation shows that the Decision Tree, KNN, and RBF SVM models all reach high accuracy, with the SVM achieving the highest test scores. Accuracy alone, however, should not decide the choice of model. Logistic Regression is only marginally behind the SVM on test score and is far easier to interpret: its coefficients show the direction and relative strength of each feature's association with wine color. We therefore select Logistic Regression, accepting a slightly lower score in exchange for a model whose behavior we can explain.
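Whether one model is only "marginally" ahead of another is easier to judge when the gap in mean score is set against the fold-to-fold spread. The sketch below illustrates that comparison on synthetic data from `make_classification`; the data, the names `candidates` and `summary`, and the parameter choices are assumptions, not the project's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; these are not the wine measurements.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    n_clusters_per_class=1, class_sep=2.0, random_state=522,
)

candidates = {
    "lr": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

summary = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")

# A gap smaller than the fold-to-fold spread is weak evidence for either model.
gap = summary["svm"][0] - summary["lr"][0]
print(f"svm - lr gap: {gap:+.3f}")
```

When the gap is of the same order as the standard deviations, preferring the more interpretable model costs little in expected accuracy.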




In [55]:
""" # Fitting the Logistic Regression model and score it on the test portion
model = LogisticRegression(max_iter=1000)
model.fit(X_train,y_train)
model.score(X_test, y_test) """


In [56]:
""" # Producing the table to present the marginal contribution of each feature.
reg_data = {
    'Feature Name': model.feature_names_in_,
    'Coefficient': model.coef_[0]
}

result_df = pd.DataFrame(reg_data)

result_df['Feature Name'] = result_df['Feature Name'].str.replace('standardscaler__', '')
result_df['Feature Name'] = result_df['Feature Name'].str.replace('ordinalencoder__', '')

result_df """

In [57]:

![wine_prediction_chart](../results/figures/predict_visualization.png)



In [58]:
'''
wine_train.head(5).to_csv("../data/test/test_data_alt_distri.cvs")
'''


The coefficients of the logistic regression model quantify how each feature shifts the predicted probability that a wine is red or white. Features with positive coefficients, such as residual sugar and total sulfur dioxide, increase the probability that a wine is classified as white, which the `model.classes_` array identifies as the positive class. Conversely, features with negative coefficients, such as alcohol, volatile acidity, chlorides, and most notably density, which has the largest negative coefficient, push the classification toward red. Because the numeric features were standardized before fitting, the magnitudes of the coefficients give a rough ranking of relative importance: density and alcohol have the strongest influence in the negative direction, while residual sugar most strongly increases the odds of white wine. The feature `quality` also plays a role, albeit a smaller one, in swaying the classification toward red wine. Taken together, the coefficients show how each physicochemical characteristic contributes to the predicted wine color in our dataset.
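To make the sign convention concrete: in a binary scikit-learn `LogisticRegression`, a positive coefficient raises the probability of the class listed second in `classes_`, and exponentiating a coefficient gives an odds ratio per unit increase in the feature. The sketch below demonstrates this on a single synthetic feature; the data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic standardized feature that tends to be larger for "white" rows.
rng = np.random.default_rng(522)
n = 300
x_red = rng.normal(loc=-1.0, size=n)
x_white = rng.normal(loc=1.0, size=n)
X_toy = np.concatenate([x_red, x_white]).reshape(-1, 1)
y_toy = np.array(["red"] * n + ["white"] * n)

toy_model = LogisticRegression().fit(X_toy, y_toy)

# classes_ is sorted, so the second entry is the class that positive coefficients favor.
positive_class = toy_model.classes_[1]
coef = float(toy_model.coef_[0, 0])
odds_ratio = float(np.exp(coef))  # multiplicative change in odds per one-unit increase

print(f"positive class: {positive_class}")
print(f"coefficient: {coef:+.2f}, odds ratio: {odds_ratio:.2f}")
```

With standardized inputs, "one unit" means one standard deviation of the original feature, which is why the magnitudes are comparable across features.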

## Conclusion

The logistic regression analysis reveals expected relationships between wine characteristics and their classification as red or white. Residual sugar's positive coefficient aligns with the higher levels typically found in white wines, indicating a greater likelihood of a wine being classified as white as the sugar content increases. Similarly, the positive coefficient for sulfur dioxide corresponds with the higher concentrations in white wines. The negative coefficients for alcohol and density suggest a higher probability of wine being classified as red with increasing values, which is consistent with red wines generally having higher alcohol content. These insights highlight the intricate balance of physicochemical properties influencing wine color, reaffirming the importance of considering the context and interactions of features within the dataset when interpreting model outcomes.



Nevertheless, it's important to remember that the signs and magnitudes of coefficients in logistic regression are influenced by the scale of the features and the correlations between them. These factors can affect the interpretability of the coefficients in complex ways, especially if there is multicollinearity in the data. Therefore, while the results are plausible and show some expected trends, any surprising findings would warrant a deeper investigation into the data and the model's behavior.
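One common way to gauge multicollinearity is the variance inflation factor (VIF), which measures how well each feature is predicted by the others. The sketch below is a plain NumPy version run on synthetic columns; it was not part of the original analysis, and the function name and data are assumptions.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # add an intercept
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ beta
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)


# Synthetic stand-in data: the first two columns are strongly related, the third is not.
rng = np.random.default_rng(522)
col_a = rng.normal(size=300)
col_b = col_a + 0.3 * rng.normal(size=300)
col_c = rng.normal(size=300)

vifs = variance_inflation_factors(np.column_stack([col_a, col_b, col_c]))
print(np.round(vifs, 2))
```

A VIF near 1 means a feature is nearly independent of the rest; a common rule of thumb treats values above roughly 5 to 10 as a sign that the corresponding coefficients should be interpreted with caution.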



# References

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Wine Quality Dataset. UCI Machine Learning Repository. Retrieved from https://archive.ics.uci.edu/dataset/186/wine+quality

Timbers, T. (2023). Breast Cancer Predictor Python Repository. GitHub. Retrieved from https://github.com/ttimbers/breast_cancer_predictor_py/tree/v0.0.2

Mor, N. S. (2022). Wine Quality and Type Prediction from Physicochemical Properties Using Neural Networks for Machine Learning: A Free Software for Winemakers and Customer. Retrieved from https://osf.io/ph4cu/download

UBC Master of Data Science. (2023). DSCI 571: Supervised Learning I. UBC GitHub. Retrieved from https://github.ubc.ca/MDS-2023-24/DSCI_571_sup-learn-1_students