## Determining the origin of wine using physiochemical properties

by Hina Bandukwala, Yimeng Xia, Sean McKay, Julia Everitt 2023/12/02

In [1]:
import pickle
import warnings

import numpy as np
import pandas as pd
from myst_nb import glue
from sklearn.exceptions import InconsistentVersionWarning

warnings.filterwarnings("ignore", category=InconsistentVersionWarning)

In [2]:
#Attribution: Code adapted from https://github.com/ttimbers/breast_cancer_predictor_py/blob/main/report/breast_cancer_predictor_report.ipynb

test_scores_df = pd.read_csv("../results/tables/test_results.csv")
glue("accuracy", test_scores_df['accuracy'].values[0], display=False)
glue("f1", test_scores_df['F1 score'].values[0], display=False)
test_scores_df = test_scores_df.style.format().hide()
glue("test_scores_df", test_scores_df, display=False)

In [3]:
with open('../results/models/wine_pipeline.pickle', 'rb') as model_file:
    wine_model = pickle.load(model_file)

# Report the scores at the best C rather than the average over every C in the grid
best_idx = wine_model.best_index_
glue("best_C", wine_model.best_params_["logisticregression__C"], display=False)
glue("train_score", wine_model.cv_results_["mean_train_score"][best_idx], display=False)
glue("valid_score", wine_model.cv_results_["mean_test_score"][best_idx], display=False)

# Summary

In this project, we built a classification model as a proof of concept for how logistic regression can be used to classify wine samples by origin using their physiochemical characteristics. We built our classifier using a simple dataset that summarizes 13 physiochemical properties per wine sample along with its corresponding class based on its origin/cultivar. Since we are using a "perfect" dataset, our final classifier performed very well on the unseen test wine samples, with an accuracy score of {glue:text}`accuracy` and an F1 score of {glue:text}`f1`.

With this project, we intend to showcase that this methodology has the potential to streamline wine identification processes for the benefit of the industry. Since this model is intended as a proof of concept, it could be improved significantly by considering other important physiochemical properties and modern techniques for feature selection. The model could also be refined through further testing on larger and more complex datasets.

# Introduction

With increased globalization, wine is consumed across a wider range of nations, making the wine trade an important part of the global economy {cite:p}`Orlandi2015`. For example, wine exports reached a global total of $40.7 billion in 2021, an average increase of 15% since 2017 {cite:p}`jain_machine_2023`. Italy is one of the top 5 exporters of wine, and together these countries contribute 70.4% of the total wine exported globally {cite:p}`jain_machine_2023`. As wine consumption becomes integrated into more cultures, there is an increased need for faster and more efficient methods for wine certification, identification, and quality evaluation. Our project focuses on one of these: wine identification.

Identification of the wine cultivar (e.g. 'Chardonnay' and 'Merlot') is an important element of consuming and selling wine {cite:p}`ohana-levi_long-term_2023`. Traditional methods rely heavily on the knowledge and experience of individual experts, which makes the process inherently subjective and labour-intensive. In this project, we instead aim to use a machine learning algorithm to identify the cultivar of Italian wines from 13 different physiochemical properties. This method takes advantage of the extensive knowledge base that exists about the important physiochemical properties of wine and combines quantitative measurements of these properties with machine learning to systematically identify wine cultivars. Given that the wine industry has carved itself a name in global trade, it is crucial to develop and apply methods that make these processes more accurate, less labour-intensive, and more cost-efficient. We think that this data-driven approach could be highly beneficial to the wine industry for these reasons.

# Methods

## Data

For this project, we are using a multivariate dataset that combines 13 physiochemical properties for 178 Italian wine samples. These samples correspond to 3 distinct cultivars from the same geographical location. The data was originally collected by M. Forina et al. {cite:p}`forina1998` and contributed to the UC Irvine Machine Learning Repository by Stefan Aeberhard and M. Forina in 1992 (last updated on Aug 28 2023). Details associated with the dataset can be found in the UC Irvine repository (https://archive.ics.uci.edu/dataset/109/wine), and the data can be read directly from https://archive.ics.uci.edu/static/public/109/data.csv. Each row of the dataset corresponds to one wine sample and contains measurements for each of the 13 physiochemical components. Identification and quantification of the different chemical constituents and properties of the wine was based on chromatographic profiles obtained through mass spectrometry {cite:p}`ballabio_classification_2008`. The sample collection and experiments were performed by Ballabio, D. et al.
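
As a minimal sketch of the dataset's shape, the snippet below loads the copy of the UCI Wine data that scikit-learn bundles as `load_wine`. Using the bundled copy is an assumption made so that the example runs offline; the report itself reads the CSV from the UCI URL above, where the target is stored in the `class` column. Note that scikit-learn labels the cultivars 0 to 2 rather than 1 to 3.

```python
from sklearn.datasets import load_wine

# Assumption: scikit-learn's bundled copy stands in for the UCI CSV used in the report.
# The online equivalent would be:
#   pd.read_csv("https://archive.ics.uci.edu/static/public/109/data.csv")
wine = load_wine(as_frame=True)
X, y = wine.data, wine.target

# One row per wine sample, one column per physiochemical property
print(X.shape)

# Number of samples per cultivar
class_counts = y.value_counts().sort_index()
print(class_counts)
```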

## Analysis 

For our classification task, we used the logistic regression (LR) algorithm to develop a model that categorizes wine samples into one of three cultivar types based on their origin. These targets can be found in the class column of our dataset. All physiochemical features included in our dataset were used for classification. As a benchmark, we employed scikit-learn’s DummyClassifier as our baseline model, which achieved an accuracy of 40.33% on our training dataset. For the LR model, a grid search over the C hyperparameter was performed for values ranging from 0.01 to 1000 using 5-fold cross-validation. The optimal value of {glue:text}`best_C` resulted in an accuracy of {glue:text}`train_score` on our training set and {glue:text}`valid_score` on our validation set. We primarily used the Python programming language for our analysis. In particular, the following packages were used: NumPy {cite:p}`harris_array_2020`, Pandas {cite:p}`mckinney2010`, Altair {cite:p}`VanderPlas2018`, Matplotlib {cite:p}`Hunter:2007`, scikit-learn {cite:p}`scikit-learn`, and ucimlrepo {cite:p}`misc_wine_109`.
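
The baseline and grid search described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual script: the data come from scikit-learn's bundled `load_wine` copy, and the 70/30 split, the `random_state`, the `most_frequent` dummy strategy, and the powers-of-ten grid for C are all choices made for this example. Only the `logisticregression__C` parameter name and the 0.01 to 1000 range come from the report.

```python
from sklearn.datasets import load_wine
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True, as_frame=True)

# Assumed split: 30% held out, stratified by cultivar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=123
)

# Baseline: always predict the most frequent class, scored with 5-fold CV
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X_train, y_train, cv=5
)

# Scaling lives inside the pipeline so it is re-fit on each training fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))

# Assumed grid: powers of ten spanning the 0.01 to 1000 range
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100, 1000]}
search = GridSearchCV(pipe, param_grid, cv=5, return_train_score=True)
search.fit(X_train, y_train)

# Scores for the best C only
best_idx = search.best_index_
train_score = search.cv_results_["mean_train_score"][best_idx]
valid_score = search.cv_results_["mean_test_score"][best_idx]

print(f"baseline accuracy: {baseline_scores.mean():.3f}")
print(f"best C: {search.best_params_['logisticregression__C']}")
print(f"train accuracy: {train_score:.3f}, validation accuracy: {valid_score:.3f}")
```

Keeping the scaler inside the pipeline means that each cross-validation fold is scaled using only its own training portion, which avoids leaking information from the validation fold.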

# Results and Discussion

For our data analysis, we first split the data into train and test sets, stratifying by target class so that each set has a similar class distribution and the test score is representative. The train-test split was done before any further data analysis and scaling to avoid information leakage. All of the features in the dataset are numerical, so we applied the standard scaler to all of them so that each has zero mean and unit variance and they are on a comparable scale.
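
A minimal sketch of this split-then-scale order is shown below. The 30% test size and the `random_state` are assumptions for illustration, and scikit-learn's bundled `load_wine` copy stands in for the UCI CSV. The scaler is fit on the training set only and then applied to both sets.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True, as_frame=True)

# Split first, stratifying on the target so class proportions are preserved
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=123
)

# Class proportions in each set should be close to each other
train_props = y_train.value_counts(normalize=True).sort_index()
test_props = y_test.value_counts(normalize=True).sort_index()
print(train_props.round(3).to_dict())
print(test_props.round(3).to_dict())

# Fit the scaler on the training data only, then reuse it for the test data
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training features now have zero mean and unit variance
print(np.round(X_train_scaled.mean(axis=0), 6))
print(np.round(X_train_scaled.std(axis=0), 6))
```

The test features will not have exactly zero mean and unit variance, because they are transformed with statistics learned from the training set.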

Next, we looked at the distribution of values for each numerical feature for each of the three target classes. The density curves overlap but still show different shapes and mean values, with some exhibiting bimodal distributions. The least predictive features appear to be Magnesium and Ash, as there is significant overlap between the three class distributions. We decided to keep all of the features in our model, as those features may still be predictive when combined with other features.

![Density plots per class of wine for the 13 physiochemical properties included in the dataset](../results/figures/densities_plot_by_class.png)

In terms of model performance and its applicability to wine origin prediction, our current logistic regression model performs quite well, with a high test accuracy of 98.15%. To further improve the classification accuracy, we may explore other models such as Support Vector Machines (SVM) and Random Forest to assess whether they offer improved test accuracy.

In addition, diversifying our evaluation metrics can provide a more comprehensive understanding of our model's performance. Metrics such as precision, recall, and F1 score are good choices for imbalanced classes. Our baseline model has an accuracy of 40.33%, which indicates that the most prevalent class makes up about 40% of the samples. With three perfectly balanced classes this share would be about 33%, so the data show a mild class imbalance, prompting a closer examination of the class distribution during EDA.
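
The reasoning about class balance and per-class metrics can be checked with a short sketch. It is illustrative only: it uses scikit-learn's bundled `load_wine` copy, an assumed 70/30 stratified split, the default `C=1.0` rather than the tuned value, and macro averaging for F1, which may differ from the averaging used in the report.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True, as_frame=True)

# Share of the most frequent class: the accuracy a most-frequent baseline would get
majority_share = y.value_counts(normalize=True).max()
# Share each class would have if the three classes were perfectly balanced
balanced_share = 1 / y.nunique()
print(f"majority share: {majority_share:.3f}, balanced share: {balanced_share:.3f}")

# Assumed split and default hyperparameters, for illustration only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=123
)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall, and F1 alongside overall accuracy
print(classification_report(y_test, y_pred))

# Macro F1 weights each class equally regardless of its size
macro_f1 = f1_score(y_test, y_pred, average="macro")
print(f"macro F1: {macro_f1:.3f}")
```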

![Accuracy scores for training and validation sets during hyperparameter optimization](../results/figures/wine_cv_C.png)

```{glue:figure} test_scores_df
:figwidth: 400px
:name: "test_scores_df"

Accuracy and F1 scores to evaluate model performance on test data
```


# References

```{bibliography}
```