## Predicting Portuguese white wine quality

by Kittipong Wongwipasamitkun, Nicole Tu, Sho Inagaki
2023/12/02

In [5]:
import pandas as pd
from myst_nb import glue
import pickle

In [15]:
# Test-set score of the final model.
test_scores_df = pd.read_csv("../results/tables/test_scores.csv", index_col=0).round(2)
glue("test_score", test_scores_df.iloc[0, 0])

# Mean cross-validation scores; rows are looked up by metric name instead of position.
mean_scores_df = pd.read_csv("../results/tables/mean_scores.csv", index_col=0).round(2)
glue("avg_train_r2", mean_scores_df.loc["train_r2", "mean_value"])
glue("avg_test_r2", mean_scores_df.loc["test_r2", "mean_value"])
glue("avg_train_neg_rmse", mean_scores_df.loc["train_neg_root_mean_square_error", "mean_value"])
glue("avg_test_neg_rmse", mean_scores_df.loc["test_neg_root_mean_square_error", "mean_value"])


Unnamed: 0,mean_value
fit_time,0.02
score_time,0.01
test_r2,0.33
train_r2,0.36
test_sklearn MAPE,-0.1
train_sklearn MAPE,-0.1
test_neg_root_mean_square_error,-0.73
train_neg_root_mean_square_error,-0.72
test_neg_mean_squared_error,-0.54
train_neg_mean_squared_error,-0.51


In [43]:
with open('../results/models/best_model.pkl', 'rb') as f:
    wine_fit = pickle.load(f)

# Summary

We built a regression model, polynomial regression with ridge regularization and hyperparameters tuned by randomized search, to predict the quality rating (on a 0-10 scale) of Portuguese white wine from its physicochemical properties. The model was trained on the Portuguese white wine data set, which has 4898 observations. The model does not perform well on either the training data or the unseen test data: the test score is around {glue:text}`test_score`, the average train $R^2$ is {glue:text}`avg_train_r2`, and the average test $R^2$ is {glue:text}`avg_test_r2`, with correspondingly high RMSE (root mean squared error) and MSE (mean squared error).

We suspect the model predicts poorly because wine quality is judged subjectively and varies with individual taste. Moreover, there is no standard for taste: a high or low level of acidity, alcohol, or sulfur does not by itself indicate whether a wine is of good quality (it can go both ways). As such, we believe this model is at, or close to, a starting point for further study. With more data, it could be used to analyze which combinations of physicochemical properties signal wine quality, although more research is needed to improve the model's performance and to understand the characteristics of the incorrectly predicted observations.

# Introduction

According to the WSET (Wine & Spirit Education Trust), wine tasting notes can be structured with the Systematic Approach to Tasting (SAT) (https://www.wsetglobal.com/media/13271/wset_l4wines_sat_en_aug2023.pdf), which consists of:
1. Appearance (colour, clarity, intensity, and other observations)
2. Nose (condition, intensity, aroma characteristics, aroma development)
3. Palate (sweetness, acidity, tannin, alcohol, body, flavour intensity, flavour characteristics, finish, and other observations)
4. Conclusion (quality, readiness for drinking and potential for ageing).
   
Wine quality, which belongs to the conclusion part, is judged mainly on Balance, Length, Intensity and Complexity (BLIC). These qualities all stem from the chemical components of the wine, so the quality of a wine can be roughly determined from its physicochemical components.

Because physicochemical properties are related to wine quality, we aim to build a machine learning model that predicts the quality of a wine from its physicochemical measurements. Such a model can help both customers and winemakers screen wines, adjust them, or make decisions based on a preliminary quality rating derived from the physicochemical values.

The model is restricted to Portuguese white wine to reduce bias from differences in wine type and origin, and serves as a starting point for assessing wine quality from physicochemical features.

# Methods

## Data
The data set used in this project consists of white vinho verde wine samples from the north of Portugal and was created by P. Cortez, A. Cerdeira, Fernando Almeida, Telmo Matos, and J. Reis (2009). It was sourced from the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/dataset/186/wine+quality). The data set stores the physicochemical properties of each wine together with its quality rating, which we use to build the quality prediction model.

## Analysis
We have many variables to consider, and most of them are correlated with the target, quality, so it is best not to drop these correlated features.

In the real world, with multiple explanatory variables, the relationship with the target is unlikely to be purely linear, so we believe polynomial regression is a more realistic model for predicting wine quality. By introducing polynomial terms, the model can capture non-linear (curvilinear) relationships between the predictors and wine quality that a linear regression might overlook.

Therefore, we chose **polynomial regression with ridge regularization** to reduce the effect of multicollinearity, and used **random search** to optimize the hyperparameters. The data was split into a training set (70%) and a test set (30%). All variables were standardized just prior to model fitting. The Python programming language (Van Rossum and Drake 2009) and the following Python packages were used to perform the analysis: numpy (Harris et al. 2020), pandas (McKinney 2010), altair (VanderPlas 2018), and scikit-learn (Pedregosa et al. 2011). The code used to perform the analysis and create this report can be found here: https://github.com/UBC-MDS/DSCI_522_group16/tree/main/src/portugal_wine_quality_predictor.ipynb
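
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the data is synthetic, and the search space (`degree` in 1 to 3, a log-uniform range for `alpha`), `n_iter=10`, and the random seed are assumptions chosen only for the example.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the wine data: 3 features and a non-linear target.
rng = np.random.default_rng(522)
X = rng.normal(size=(300, 3))
y = 6 + 0.5 * X[:, 0] - 0.3 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

# 70% training / 30% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=522
)

# Standardize, expand to polynomial terms, then fit a ridge-penalized linear model.
pipe = make_pipeline(StandardScaler(), PolynomialFeatures(), Ridge())

# Hypothetical search space: polynomial degree and ridge penalty strength.
param_distributions = {
    "polynomialfeatures__degree": [1, 2, 3],
    "ridge__alpha": loguniform(1e-3, 1e3),
}

search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=10, cv=5, scoring="r2", random_state=522
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.score(X_test, y_test), 2))  # R^2 on the held-out test set
```

Putting the scaler and the polynomial expansion inside the pipeline means they are refit on each cross-validation training fold, so no information from the validation fold leaks into preprocessing.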

# Discussion

We first checked whether each physicochemical property might be useful for predicting the wine quality rating. We examined the data set and found no null values or values needing adjustment. We then computed a correlation matrix for all variables to identify patterns and decide whether any variables should be dropped.

In [45]:
pd.read_csv("../results/tables/correlation_matrix.csv", index_col=0).style.background_gradient()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
fixed acidity,1.0,-0.022159,0.283248,0.098448,0.017915,-0.051813,0.090352,0.269401,-0.42345,-0.024174,-0.125424,-0.118615
volatile acidity,-0.022159,1.0,-0.147688,0.064137,0.076459,-0.088009,0.08572,0.029973,-0.029339,-0.050048,0.062962,-0.195553
citric acid,0.283248,-0.147688,1.0,0.107066,0.12768,0.092379,0.11829,0.160669,-0.178773,0.064732,-0.093406,-0.01886
residual sugar,0.098448,0.064137,0.107066,1.0,0.076111,0.280611,0.396009,0.843845,-0.198103,-0.029428,-0.457366,-0.085984
chlorides,0.017915,0.076459,0.12768,0.076111,1.0,0.099581,0.19203,0.241153,-0.097633,0.016289,-0.349867,-0.200642
free sulfur dioxide,-0.051813,-0.088009,0.092379,0.280611,0.099581,1.0,0.620482,0.282654,0.021046,0.067162,-0.244811,0.008141
total sulfur dioxide,0.090352,0.08572,0.11829,0.396009,0.19203,0.620482,1.0,0.526727,0.014098,0.135311,-0.456719,-0.175026
density,0.269401,0.029973,0.160669,0.843845,0.241153,0.282654,0.526727,1.0,-0.096938,0.071511,-0.776427,-0.292935
pH,-0.42345,-0.029339,-0.178773,-0.198103,-0.097633,0.021046,0.014098,-0.096938,1.0,0.170854,0.123852,0.098827
sulphates,-0.024174,-0.050048,0.064732,-0.029428,0.016289,0.067162,0.135311,0.071511,0.170854,1.0,-0.014943,0.070664


From the correlation matrix above, several explanatory features are correlated with each other: density and residual sugar are highly correlated (0.844), total sulfur dioxide and free sulfur dioxide are moderately correlated (0.620), and density and total sulfur dioxide are moderately correlated (0.527).
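
Pairs like these can also be pulled out of a correlation matrix programmatically instead of by eye. The sketch below uses a four-feature subset of the matrix above (values rounded to three decimals); the `threshold` of 0.5 is an assumption for illustration, not a cut-off taken from the analysis.

```python
import numpy as np
import pandas as pd

# Subset of the correlation matrix reported above, rounded to 3 decimals.
cols = ["residual sugar", "free sulfur dioxide", "total sulfur dioxide", "density"]
corr = pd.DataFrame(
    [
        [1.000, 0.281, 0.396, 0.844],
        [0.281, 1.000, 0.620, 0.283],
        [0.396, 0.620, 1.000, 0.527],
        [0.844, 0.283, 0.527, 1.000],
    ],
    index=cols,
    columns=cols,
)

threshold = 0.5  # hypothetical cut-off for "correlated enough to note"

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (self-correlation of 1.0) is excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs.abs() > threshold].sort_values(ascending=False)
print(high_pairs)
```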

Free SO2 is the active, unbound form that contributes antioxidant and antimicrobial properties, while total SO2 includes both free and bound forms and so provides an overall measure of the sulfur dioxide content of the wine. Because total SO2 already accounts for free SO2, we drop free sulfur dioxide from the data.

Let's examine the relationship between the three highly correlated features and the target variable using the scatter plot matrix in figure 1, shown below:

```{figure} ../results/figures/scatter_matrix.png
---
width: 800px
name: scatter_plot_matrix
---
Scatter plot matrix showing the relationship between the three highly correlated features and the target variable
```

From figure 1, since residual sugar, density, and total sulfur dioxide are all correlated with the target, quality, it is not a good idea to drop these correlated features. Instead, we pick **polynomial regression with ridge regularization** to reduce multicollinearity, and use **random search** to optimize the hyperparameters.

We use polynomial regression rather than linear regression because, in a real-world scenario with multiple explanatory variables, the relationship with the target is unlikely to be linear, and we believe polynomial regression is a more realistic model for predicting wine quality.

Now let's examine the distribution of the features to decide how to preprocess them.

From the figures below, most of the features follow an approximately normal distribution, while residual sugar is slightly skewed. Given these distributions, we preprocess the data by **standardization and imputation with the median value**.


```{figure} ../results/figures/histogram_alcohol.png
---
width: 800px
name: alcohol_histogram
---
Histogram for the feature, alcohol
```

```{figure} ../results/figures/histogram_chlorides.png
---
width: 800px
name:  chlorides_histogram
---
Histogram for the feature, chlorides
```

```{figure} ../results/figures/histogram_citric%20acid.png
---
width: 800px
name:  citricacid_histogram
---
Histogram for the feature, citric acid
```

```{figure} ../results/figures/histogram_density.png
---
width: 800px
name:  density_histogram
---
Histogram for the feature, density
```

```{figure} ../results/figures/histogram_fixed%20acidity.png
---
width: 800px
name:  fixedacidity_histogram
---
Histogram for the feature, fixed acidity
```

```{figure} ../results/figures/histogram_free%20sulfter%20dioxide.png
---
width: 800px
name:  free_sulfer_dioxide_histogram
---
Histogram for the feature, free sulfur dioxide
```

```{figure} ../results/figures/histogram_pH.png
---
width: 800px
name:  pH_histogram
---
Histogram for the feature, pH
```

```{figure} ../results/figures/histogram_quality.png
---
width: 800px
name:  quality_histogram
---
Histogram for the target, quality
```

```{figure} ../results/figures/histogram_residual%20sugar.png
---
width: 800px
name:  residual_sugar_histogram
---
Histogram for the feature, residual sugar
```

```{figure} ../results/figures/histogram_sulphates.png
---
width: 800px
name:  sulphates_histogram
---
Histogram for the feature, sulphates
```

```{figure} ../results/figures/histogram_total%20sulfur%20dioxide.png
---
width: 800px
name:  total_sulfer_dioxide_histogram
---
Histogram for the feature, total sulfur dioxide
```

```{figure} ../results/figures/histogram_volatile%20acidity.png
---
width: 800px
name:  volatile_acidity_histogram
---
Histogram for the feature, volatile acidity
```
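
The visual impression from histograms like those above can be complemented with a numeric skewness check. The sketch below uses synthetic columns, not the wine data: `symmetric_feature` and `skewed_feature` are hypothetical names, and the distributions are chosen only to show how a roughly normal and a right-skewed feature differ under `DataFrame.skew()`.

```python
import numpy as np
import pandas as pd

# Synthetic columns standing in for real features (assumed for illustration).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "symmetric_feature": rng.normal(loc=10.5, scale=1.2, size=2000),
        "skewed_feature": rng.lognormal(mean=1.5, sigma=0.8, size=2000),
    }
)

# Sample skewness per column: values near 0 indicate a symmetric distribution,
# clearly positive values indicate a long right tail.
skewness = df.skew()
print(skewness.round(2))
```

A right-skewed feature is one reason to prefer the median over the mean when imputing, since the median is less affected by the long tail.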

Below is our model pipeline, which consists of a ColumnTransformer (SimpleImputer, StandardScaler and PolynomialFeatures) followed by our regression model, Ridge.

In [46]:
wine_fit
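
For readers without access to the pickled model, a pipeline of this shape could be assembled roughly as follows. This is a sketch under assumptions, not a reconstruction of `wine_fit`: the three-column feature list, the synthetic data, `degree=2`, and `alpha=1.0` are all placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Illustrative subset of feature names; the data itself is synthetic.
numeric_features = ["alcohol", "density", "residual sugar"]
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=numeric_features)
X.iloc[0, 0] = np.nan  # one missing value so the imputer has work to do
y = rng.integers(3, 10, size=50)

# Median imputation, standardization, then polynomial expansion.
numeric_transformer = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
)
preprocessor = make_column_transformer((numeric_transformer, numeric_features))

# Preprocessing followed by the ridge regression model.
sketch_pipe = make_pipeline(preprocessor, Ridge(alpha=1.0))
sketch_pipe.fit(X, y)

# Number of columns reaching Ridge: 3 linear terms + 6 degree-2 terms.
n_terms = sketch_pipe[:-1].transform(X).shape[1]
print(n_terms)
```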

Below are our cross-validation scores on the training data set.

In [47]:
pd.read_csv("../results/tables/score_table.csv", index_col=0)

Unnamed: 0,0,1,2,3,4
fit_time,0.023944,0.013026,0.012648,0.015523,0.014615
score_time,0.007685,0.006513,0.006672,0.007515,0.005942
test_r2,0.315056,0.356524,0.352406,0.317041,0.289271
train_r2,0.363229,0.353133,0.351876,0.360865,0.368786
test_sklearn MAPE,-0.097206,-0.102735,-0.096511,-0.10333,-0.102554
train_sklearn MAPE,-0.098717,-0.097416,-0.099154,-0.09767,-0.097268
test_neg_root_mean_square_error,-0.720164,-0.741907,-0.716911,-0.734411,-0.75246
train_neg_root_mean_square_error,-0.717869,-0.712449,-0.720107,-0.715402,-0.710361
test_neg_mean_squared_error,-0.518636,-0.550426,-0.513962,-0.53936,-0.566196
train_neg_mean_squared_error,-0.515336,-0.507584,-0.518555,-0.5118,-0.504613


And below are the average scores across the 5 folds.

In [48]:
pd.read_csv("../results/tables/mean_scores.csv", index_col=0)


Unnamed: 0,mean_value
fit_time,0.015951
score_time,0.006865
test_r2,0.32606
train_r2,0.359578
test_sklearn MAPE,-0.100467
train_sklearn MAPE,-0.098045
test_neg_root_mean_square_error,-0.733171
train_neg_root_mean_square_error,-0.715238
test_neg_mean_squared_error,-0.537716
train_neg_mean_squared_error,-0.511578
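
Tables with this layout (one row per metric, one column per fold, then a per-metric mean) can be produced with scikit-learn's `cross_validate`. The sketch below is an assumption about how such tables might be generated, using synthetic data and a plain `Ridge` model; the dictionary keys in `scoring` are chosen to mirror the row names above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic regression data standing in for the training set.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -0.3, 0.0, 0.2]) + rng.normal(scale=0.7, size=200)

# Keys become the row-name suffixes; values are built-in scikit-learn scorer names.
scoring = {
    "r2": "r2",
    "neg_root_mean_square_error": "neg_root_mean_squared_error",
    "neg_mean_squared_error": "neg_mean_squared_error",
}

cv_results = cross_validate(
    Ridge(), X, y, cv=5, scoring=scoring, return_train_score=True
)

# One row per metric, one column per fold.
score_table = pd.DataFrame(cv_results).T
# Average each metric across the 5 folds.
mean_scores = score_table.mean(axis=1).to_frame("mean_value")
print(mean_scores.round(3))
```

In this output, the `test_` rows are scores on the held-out fold of each split, not on a separate test set.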


# Results
**Model Performance**:

R-squared (test): The model explains around {glue:text}`avg_test_r2` of the variance in wine quality on the held-out cross-validation folds, indicating some predictive capability, even though it is not good enough.

R-squared (train): A slightly higher value (around {glue:text}`avg_train_r2`) on the training folds shows that the model fits even the training data only modestly. The small gap between the two values suggests the model is not overfitting, but it still does not perform well.


**Error Metrics**:

Negative RMSE (test): The model's negative root mean squared error on the held-out folds is approximately {glue:text}`avg_test_neg_rmse`. The negative sign is only scikit-learn's scoring convention (errors are negated so that higher is always better); the RMSE itself is about 0.73 quality points.

Negative RMSE (train): Similarly, the negative RMSE on the training folds is around {glue:text}`avg_train_neg_rmse`, an RMSE of about 0.72, indicating room for improvement.

Negative MSE (test & train): The MSE is about 0.54 on the held-out folds and 0.51 on the training folds, again reported with a negative sign by convention.

Negative MAPE (test & train): The mean absolute percentage error (MAPE) is about 0.10 (10%) on both the held-out and training folds. As with the other error metrics, the negative sign reflects the scoring convention and does not mean the model performs worse than predicting the mean.
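
To make the sign convention concrete, the sketch below negates scikit-learn's `neg_root_mean_squared_error` score to recover an ordinary RMSE and compares a ridge model with a baseline that always predicts the mean. The data is synthetic and the helper `cv_rmse` is a hypothetical name introduced for this example; it is not part of the project's code.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data: one informative feature plus noise.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
y = 6 + 0.6 * X[:, 0] + rng.normal(scale=0.7, size=300)


def cv_rmse(model):
    """Return the mean cross-validated RMSE as a positive number."""
    neg_rmse = cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    # The scorer returns -RMSE, so flip the sign to get the usual RMSE.
    return -neg_rmse.mean()


baseline_rmse = cv_rmse(DummyRegressor(strategy="mean"))  # always predicts the mean
ridge_rmse = cv_rmse(Ridge())

print(round(baseline_rmse, 2), round(ridge_rmse, 2))
```

Comparing against a mean-predicting baseline of this kind is what actually shows whether a model beats "predicting the mean"; the sign of the reported score does not.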

## References


```{bibliography}
```