## Predicting Portuguese white wine quality

by Kittipong Wongwipasamitkun, Nicole Tu, Sho Inagaki
2023/12/02

In [1]:
import pandas as pd
from myst_nb import glue
import pickle

In [2]:
# Test-set score of the best model
test_scores_df = pd.read_csv("../results/tables/test_scores.csv", index_col=0).round(2)
glue("test_score", test_scores_df.iloc[0,0])

# Mean cross-validation scores (read once; row positions follow the mean-score table)
mean_scores_df = pd.read_csv("../results/tables/mean_scores.csv", index_col=0).round(2)
glue("avg_train_r2", mean_scores_df.iloc[3,0])        # train_r2
glue("avg_test_r2", mean_scores_df.iloc[2,0])         # test_r2
glue("avg_train_neg_rmse", mean_scores_df.iloc[7,0])  # train_neg_root_mean_square_error
glue("avg_test_neg_rmse", mean_scores_df.iloc[6,0])   # test_neg_root_mean_square_error

# Load the fitted best model so it can be displayed later in the report
with open('../results/models/best_model.pkl', 'rb') as f:
    wine_fit = pickle.load(f)

0.32

0.36

0.33

-0.72

-0.73

# Summary

We built a regression model, using polynomial regression with ridge regularization and hyperparameters tuned by randomized search, to predict the quality rating (on a scale of 0-10) of Portuguese white wine from the wine's physicochemical properties. The model was trained on data from the Portuguese white wine data set, which contains 4898 observations. In conclusion, the model does not perform well on either the training data or the unseen test data: the test score is around {glue:text}`test_score`, the average train $R^2$ is {glue:text}`avg_train_r2`, and the average test $R^2$ is {glue:text}`avg_test_r2`. We also observed high RMSE (root mean squared error) and MSE (mean squared error).

We suspect the model cannot predict well because judging wine quality is highly subjective and varies widely with individual preferences. Moreover, there is no standard for what counts as better taste. For example, a high or low level of acidity, alcohol or sulfur does not by itself indicate whether a wine is of good quality; it can go either way. As such, we believe this model can serve as a starting point for further studies, and that it can be improved by collecting more data to analyze which combinations of physicochemical properties determine wine quality. The study would also benefit from the knowledge of wine experts about the determining factors of wine quality, which would help us identify patterns in the correct and incorrect predictions and so improve model performance.

# Introduction

According to WSET (the Wine & Spirit Education Trust), wine tasting notes can be structured using the Systematic Approach to Tasting (SAT) {cite}`WEST_GLOBAL_2023` (https://www.wsetglobal.com/media/13271/wset_l4wines_sat_en_aug2023.pdf), which consists of:
1. Appearance (colour, clarity, intensity, and other observations)
2. Nose (condition, intensity, aroma characteristics, aroma development)
3. Palate (sweetness, acidity, tannin, alcohol, body, flavour intensity, flavour characteristics, finish, and other observations)
4. Conclusion (quality, readiness for drinking and potential for ageing).
   
Wine quality, which is assessed in the conclusion part, is judged mainly on Balance, Length, Intensity and Complexity (BLIC). These characteristics all derive from the chemical components of the wine. Consequently, it is now considered that the quality of a wine can be determined from its physicochemical components.
 
Because physicochemical properties are related to wine quality, we aim to build a machine learning model that predicts the quality of a wine from its physicochemical values, reducing the bias and subjectivity involved in traditional wine quality evaluation. This would allow consistent evaluation across wine labels and make comparisons easier. Ultimately, such a model could help both customers and winemakers: customers could make more objective purchasing decisions, and winemakers could improve wine quality by adjusting the physicochemical values.

As a starting point for assessing wine quality from physicochemical features, this study covers only Portuguese white wine, which reduces bias arising from differences in wine type and origin.

# Methods

## Data
The data set used in this project consists of white vinho verde wine samples from the north of Portugal and was created by P. Cortez, A. Cerdeira, Fernando Almeida, Telmo Matos, and J. Reis in 2009. The data set, which contains various physicochemical properties and wine ratings, was sourced from the University of California, Irvine Machine Learning Repository {cite}`misc_wine_quality_186` (https://archive.ics.uci.edu/dataset/186/wine+quality).

## Analysis
With 11 explanatory variables that are thought to be related to wine quality, it is unlikely that all of those relationships are linear. Therefore, we applied a polynomial transformation to our features and used the transformed features in our regression model, letting the model learn the shape of the relationships. This way, we can capture not only non-linear relationships between the features and the response variable, but also interactions among the explanatory variables. As a result, the model can capture more complex relationships that we could not test otherwise, hopefully with better predictive capability {cite}`tikhonov1977solutions`. Moreover, polynomial regression provides greater flexibility, allowing us to uncover potential curvilinear associations between the features and wine quality that linear models might overlook. This approach is intended to give a more robust and accurate predictive model for wine quality assessment.

Therefore, we chose **polynomial regression with ridge regularization** to reduce the effect of multicollinearity, and used **random search** to optimize the hyperparameters {cite}`MDS_UBC`. The data were partitioned into a 70% training set and a 30% test set. All variables were standardized as part of the preprocessing to avoid potential scale issues. The Python programming language {cite}`Python` and the following Python packages were used to perform the analysis: numpy {cite}`numpy`, Pandas, altair {cite}`altair`, scikit-learn {cite}`scikit-learn`. The code used to perform the analysis and create this report can be found here: https://github.com/UBC-MDS/DSCI_522_group16/tree/main/src/portugal_wine_quality_predictor.ipynb
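The following is a minimal sketch of this kind of workflow, not the project's actual script. It runs on synthetic data, and the search space, number of iterations, random seed, and use of a plain `Pipeline` are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the wine data (assumption: 4 features, 300 rows)
rng = np.random.default_rng(522)
X = rng.normal(size=(300, 4))
y = 6 + 0.5 * X[:, 0] - 0.3 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

# 70% training / 30% test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=522
)

# Median imputation, standardization, polynomial expansion, then ridge regression
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    PolynomialFeatures(include_bias=False),
    Ridge(),
)

# Hypothetical search space: polynomial degree and ridge penalty strength
param_distributions = {
    "polynomialfeatures__degree": [1, 2, 3],
    "ridge__alpha": loguniform(1e-3, 1e3),
}

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=10,
    cv=5,
    scoring="r2",
    random_state=522,
)
search.fit(X_train, y_train)

test_r2 = search.score(X_test, y_test)
print(search.best_params_)
print(round(test_r2, 2))
```

Keeping the preprocessing inside the pipeline means each cross-validation fold is imputed and scaled using only its own training portion, so the validation scores are not affected by information from the held-out fold.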

# Discussion

As part of our initial assessment, we tried to identify whether each of the physicochemical properties might be useful in predicting the wine quality rating. We used the correlation matrix of all the variables to gauge how useful each feature is for predicting quality and to investigate potential multicollinearity issues.

In [3]:
pd.read_csv("../results/tables/correlation_matrix.csv", index_col=0).style.background_gradient()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed acidity | 1.0 | -0.022159 | 0.283248 | 0.098448 | 0.017915 | -0.051813 | 0.090352 | 0.269401 | -0.42345 | -0.024174 | -0.125424 | -0.118615 |
| volatile acidity | -0.022159 | 1.0 | -0.147688 | 0.064137 | 0.076459 | -0.088009 | 0.08572 | 0.029973 | -0.029339 | -0.050048 | 0.062962 | -0.195553 |
| citric acid | 0.283248 | -0.147688 | 1.0 | 0.107066 | 0.12768 | 0.092379 | 0.11829 | 0.160669 | -0.178773 | 0.064732 | -0.093406 | -0.01886 |
| residual sugar | 0.098448 | 0.064137 | 0.107066 | 1.0 | 0.076111 | 0.280611 | 0.396009 | 0.843845 | -0.198103 | -0.029428 | -0.457366 | -0.085984 |
| chlorides | 0.017915 | 0.076459 | 0.12768 | 0.076111 | 1.0 | 0.099581 | 0.19203 | 0.241153 | -0.097633 | 0.016289 | -0.349867 | -0.200642 |
| free sulfur dioxide | -0.051813 | -0.088009 | 0.092379 | 0.280611 | 0.099581 | 1.0 | 0.620482 | 0.282654 | 0.021046 | 0.067162 | -0.244811 | 0.008141 |
| total sulfur dioxide | 0.090352 | 0.08572 | 0.11829 | 0.396009 | 0.19203 | 0.620482 | 1.0 | 0.526727 | 0.014098 | 0.135311 | -0.456719 | -0.175026 |
| density | 0.269401 | 0.029973 | 0.160669 | 0.843845 | 0.241153 | 0.282654 | 0.526727 | 1.0 | -0.096938 | 0.071511 | -0.776427 | -0.292935 |
| pH | -0.42345 | -0.029339 | -0.178773 | -0.198103 | -0.097633 | 0.021046 | 0.014098 | -0.096938 | 1.0 | 0.170854 | 0.123852 | 0.098827 |
| sulphates | -0.024174 | -0.050048 | 0.064732 | -0.029428 | 0.016289 | 0.067162 | 0.135311 | 0.071511 | 0.170854 | 1.0 | -0.014943 | 0.070664 |


From the correlation matrix shown above, we find that, among all the features, `alcohol` has the highest correlation with our target variable, `quality`. Among the explanatory features, `density` and `residual sugar` are highly correlated (0.844), while `density` and `total sulfur dioxide` (0.527) and `total sulfur dioxide` and `free sulfur dioxide` (0.620) are moderately correlated.
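Pairs like these can also be picked out programmatically rather than by eye. The sketch below uses synthetic data with hypothetical values, and the 0.5 cut-off is an assumption, not a threshold taken from the analysis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: density is constructed to track residual sugar
rng = np.random.default_rng(0)
n = 500
sugar = rng.normal(size=n)
df = pd.DataFrame({
    "residual sugar": sugar,
    "density": 0.9 * sugar + rng.normal(scale=0.3, size=n),
    "pH": rng.normal(size=n),
})

corr = df.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Assumed cut-off of 0.5 for flagging a pair as notably correlated
high = pairs[pairs.abs() > 0.5].sort_values(key=np.abs, ascending=False)
print(high)
```

Only the `residual sugar` and `density` pair is flagged here, because the synthetic `pH` column is generated independently of the other two.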

Free sulfur dioxide is the active, unbound form that contributes to antioxidant and antimicrobial properties, while total sulfur dioxide includes both the free and bound forms and so provides an overall measure of the sulfur dioxide content in the wine. Because the total already accounts for the free form, we drop `free sulfur dioxide` from the data.

Let's examine the relationship between the three highly correlated features and the target variable using the scatter plot matrix in figure 1, shown below:

```{figure} ../results/figures/scatter_matrix.png
---
width: 800px
name: scatter_plot_matrix
---
Scatter plot matrix showing the relationships between the three highly correlated features and the target variable
```

From figure 1, we observe that `residual sugar`, `density`, and `total sulfur dioxide` are all correlated with the target, `quality`, so it is not a good idea to drop them even though they are correlated with each other. Instead, we use **polynomial regression with ridge regularization** to reduce the effect of multicollinearity, and **random search** to optimize the hyperparameters {cite}`horel1962application`.

We use polynomial regression rather than linear regression because, in a real-world scenario with multiple explanatory variables, the relationships are unlikely to follow a purely linear pattern, and we believe that polynomial regression is a more realistic model for predicting wine quality {cite}`fan19961`.

Now let's examine the distributions of the features to decide how to preprocess them.

Histograms of some of the variables are shown below. Most of the features follow an approximately normal distribution, while some show signs of skewness. Given these distributions, we preprocess the data by **standardization and imputation with the median value** {cite}`van2015lecture`.


```{figure} ../results/figures/histogram_alcohol.png
---
width: 800px
name: alcohol_histogram
---
Histogram for the feature, alcohol
```

```{figure} ../results/figures/histogram_chlorides.png
---
width: 800px
name:  chlorides_histogram
---
Histogram for the feature, chlorides
```

```{figure} ../results/figures/histogram_density.png
---
width: 800px
name:  density_histogram
---
Histogram for the feature, density
```

```{figure} ../results/figures/histogram_pH.png
---
width: 800px
name:  pH_histogram
---
Histogram for the feature, pH
```

```{figure} ../results/figures/histogram_quality.png
---
width: 800px
name:  quality_histogram
---
Histogram for the target variable, quality
```

```{figure} ../results/figures/histogram_sulphates.png
---
width: 800px
name:  sulphates_histogram
---
Histogram for the feature, sulphates
```

Below is our model pipeline with the parameters of the best model. It consists of a `ColumnTransformer` (with `SimpleImputer`, `StandardScaler` and `PolynomialFeatures`) followed by our regression model, `Ridge`.

In [4]:
wine_fit

Below are the cross-validation scores on our training data set.

In [5]:
pd.read_csv("../results/tables/score_table.csv", index_col=0)

| metric | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| fit_time | 0.022363 | 0.011994 | 0.014427 | 0.012 | 0.013 |
| score_time | 0.007998 | 0.006994 | 0.008007 | 0.005998 | 0.008014 |
| test_r2 | 0.315056 | 0.356524 | 0.352406 | 0.317041 | 0.289271 |
| train_r2 | 0.363229 | 0.353133 | 0.351876 | 0.360865 | 0.368786 |
| test_sklearn MAPE | -0.097206 | -0.102735 | -0.096511 | -0.10333 | -0.102554 |
| train_sklearn MAPE | -0.098717 | -0.097416 | -0.099154 | -0.09767 | -0.097268 |
| test_neg_root_mean_square_error | -0.720164 | -0.741907 | -0.716911 | -0.734411 | -0.75246 |
| train_neg_root_mean_square_error | -0.717869 | -0.712449 | -0.720107 | -0.715402 | -0.710361 |
| test_neg_mean_squared_error | -0.518636 | -0.550426 | -0.513962 | -0.53936 | -0.566196 |
| train_neg_mean_squared_error | -0.515336 | -0.507584 | -0.518555 | -0.5118 | -0.504613 |


The scores averaged across the 5 folds are shown below.

In [6]:
pd.read_csv("../results/tables/mean_scores.csv", index_col=0)


| metric | mean_value |
|---|---|
| fit_time | 0.014757 |
| score_time | 0.007402 |
| test_r2 | 0.32606 |
| train_r2 | 0.359578 |
| test_sklearn MAPE | -0.100467 |
| train_sklearn MAPE | -0.098045 |
| test_neg_root_mean_square_error | -0.733171 |
| train_neg_root_mean_square_error | -0.715238 |
| test_neg_mean_squared_error | -0.537716 |
| train_neg_mean_squared_error | -0.511578 |
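For readers who want to reproduce tables of this shape, the following is a hedged sketch of how per-fold scores and their means could be generated with scikit-learn's `cross_validate`. The data are synthetic, a bare `Ridge` model stands in for the full pipeline, and the scorer names are scikit-learn's built-in ones rather than the exact labels used in the tables above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: 3 features, 200 rows)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.7, size=200)

# Built-in scikit-learn scorers; error metrics are negated so that higher is better
scoring = {
    "r2": "r2",
    "neg_root_mean_squared_error": "neg_root_mean_squared_error",
    "neg_mean_squared_error": "neg_mean_squared_error",
}

cv_results = cross_validate(
    Ridge(), X, y, cv=5, scoring=scoring, return_train_score=True
)

# Rows are metrics, columns are folds
score_table = pd.DataFrame(cv_results).T

# Average each metric across the 5 folds
mean_scores = score_table.mean(axis=1).to_frame("mean_value")
print(mean_scores)
```

Note that the `test_` prefix in `cross_validate` output refers to the validation portion of each fold, not to a separately held-out test set.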


# Result
**Model Performance**:

R-squared (test): The model explains a fraction of around {glue:text}`avg_test_r2` of the variance in wine quality on the test folds. Because this value is positive, the model does better than always predicting the mean, but its predictive capability is limited.

R-squared (train): The slightly higher value on the training set (around {glue:text}`avg_train_r2`) suggests mild overfitting, though with scores this low and a gap this small, overfitting is not the main issue.


**Error Metrics**:

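Negative RMSE (test): The negated root mean squared error on the test folds is approximately {glue:text}`avg_test_neg_rmse`. The negative sign only reflects scikit-learn's scoring convention, in which errors are negated so that higher scores are better; it does not mean the model performs worse than predicting the mean. The corresponding RMSE of about 0.73 means predictions are typically off by roughly 0.7 quality points.

Negative RMSE (train): Similarly, the negated RMSE on the training data is around {glue:text}`avg_train_neg_rmse` (an RMSE of about 0.72), indicating room for improvement.

Negative MSE (test & train): The negated MSEs for the test and training sets are about -0.54 and -0.51, respectively. As with RMSE, the sign reflects the scoring convention rather than performance relative to a model that predicts the mean.

Negative MAPE (test & train): The negated mean absolute percentage errors (MAPE) for both the test and training sets are about -0.10, meaning predictions deviate from the actual rating by roughly 10% on average. Again, the negative sign comes from the scoring convention.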
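To put error metrics like these in context, it helps to compare them against a baseline that always predicts the mean. The sketch below does this on synthetic data with scikit-learn's `DummyRegressor`; the data and the bare `Ridge` model are stand-ins, not the project's pipeline. It also shows that scikit-learn's `neg_*` scorers are negative by construction, so the sign alone carries no information about model quality.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with a clear linear signal in the first feature
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = 6 + 0.8 * X[:, 0] + rng.normal(scale=0.5, size=300)

scoring = "neg_root_mean_squared_error"
neg_rmse_model = cross_val_score(Ridge(), X, y, cv=5, scoring=scoring)
neg_rmse_dummy = cross_val_score(DummyRegressor(strategy="mean"), X, y, cv=5, scoring=scoring)

# Flip the sign to recover the usual (non-negative) RMSE
rmse_model = -neg_rmse_model.mean()
rmse_dummy = -neg_rmse_dummy.mean()

# R^2 of the mean-only baseline is at most zero on held-out folds
r2_dummy = cross_val_score(DummyRegressor(strategy="mean"), X, y, cv=5, scoring="r2")

print(f"model RMSE: {rmse_model:.2f}, mean-baseline RMSE: {rmse_dummy:.2f}")
print(f"mean-baseline R^2: {r2_dummy.mean():.3f}")
```

A model is worse than the mean baseline only when its RMSE is larger than the baseline's, or equivalently when its $R^2$ is below zero.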

## References
```{bibliography}
```