In [1]:
import numpy as np
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer, make_column_selector
from ucimlrepo import fetch_ucirepo 

## Papers
https://www.cabidigitallibrary.org/doi/epdf/10.31220/agriRxiv.2022.00126

https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/epdf/10.1002/cem.1173

https://www.nature.com/articles/s41598-023-44111-9.pdf

https://www.mdpi.com/2077-0472/13/1/224

https://www.sciencedirect.com/science/article/pii/S0167923609001377?via%3Dihub

## Citations

### data source 
Sources: (a) Forina, M. et al, PARVUS - An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.

(b) Stefan Aeberhard, email: stefan@coral.cs.jcu.edu.au (c) July 1991

### UC Irvine repo

If you publish material based on databases obtained from this repository, then, in your acknowledgements, please note the assistance you received by using this repository. This will help others to obtain the same data sets and replicate your experiments. We suggest the following pseudo-APA reference format for referring to this repository:

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Here is a BibTeX citation as well:

@misc{Lichman:2013 ,
author = "M. Lichman",
year = "2013",
title = "{UCI} Machine Learning Repository",
url = "http://archive.ics.uci.edu/ml",
institution = "University of California, Irvine, School of Information and Computer Sciences" }

# Introduction

With increased globalization, wine is consumed across a wider range of nations, making the wine trade an important part of the global economy. For example, global wine exports reached $40.7 billion in 2021 after increasing by an average of 15% since 2017 (Jain et al., 2023). Italy is one of the top five wine exporters, which together account for 70.4% of the wine exported globally (Jain et al., 2023). Certification of wine through quality evaluation is a crucial element of wine production and trade (Cortez et al.). It ensures that the product is both fit for human consumption and of the intended quality (Cortez et al.). Given the importance of wine in global trade, it is useful to develop and apply methods that make certification more accurate, less labour-intensive and more cost-efficient.

In this project, we aim to use a machine learning algorithm to assess the quality of Italian wine using 13 physicochemical properties. With this data-driven approach, we hope to make the certification process less prone to subjectivity than traditional methods. Because traditional methods rely on the knowledge and experience of individual experts, they are inherently subjective and labour-intensive (Cortez et al.). Our approach instead draws on the extensive existing knowledge of the physicochemical properties of wine and combines quantitative measurements with machine learning to identify wine quality. We expect this to benefit the wine industry by making quality assessment more consistent and less dependent on manual expert evaluation.

# Methods

## Data

In [3]:
# important data details: https://www.openml.org/search?type=data&sort=runs&id=187&status=active

We are using a multivariate dataset that records 13 physicochemical properties for 178 Italian wine samples. The samples come from 3 distinct cultivars grown in the same geographical location. The data was originally collected by M. Forina et al. (Forina, M. et al, PARVUS - An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy) and contributed to the UC Irvine Machine Learning Repository by Stefan Aeberhard and M. Forina in 1992 (last updated on Aug 28 2023). Details of the dataset can be found in the UC Irvine repository (https://archive.ics.uci.edu/dataset/109/wine), and the data can be read directly from https://archive.ics.uci.edu/static/public/109/data.csv. Each row corresponds to one wine sample and contains a measurement for each of the 13 physicochemical properties. Identification and quantification of the chemical constituents and properties of the wine were based on chromatographic profiles obtained through mass spectrometry (Ballabio, D. et al.). This collection and experimentation were performed by Ballabio, D. et al.

## Analysis 

# Results and Discussion

For our data analysis, we first split the data into train and test sets, stratified by target class so that each set has roughly the same class proportions as the full dataset and the test set stays representative. The train-test split was done before any further data analysis or scaling to avoid information leakage. All of the features in the dataset are numerical, so we applied the standard scaler to all of them. It is fitted on the training set only and rescales each feature to zero mean and unit variance there, which puts features measured in different units on a comparable scale.
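
To make the scaling step concrete, here is a minimal sketch of standardization on made-up numbers (the arrays are hypothetical and not taken from the wine data). The mean and standard deviation are estimated from the training values only and then reused for the test values, which is what fitting a scaler on the training set and transforming both sets amounts to.

```python
import numpy as np

# Hypothetical values for a single feature; not taken from the wine data.
train = np.array([12.0, 13.0, 14.0, 13.0])
test = np.array([15.0, 11.0])

# Statistics come from the training set only, to avoid information leakage.
mu = train.mean()
sigma = train.std()  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma  # test values reuse the training statistics

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1 by construction. The scaled test values generally do not, and that is expected.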

Next we looked at the distribution of values for each numerical feature within each of the three target classes. The density curves overlap but still differ in shape and mean, and some are bimodal. The least predictive features appear to be Magnesium and Ash, as there is significant overlap between the 3 class distributions. We decided to keep all of the features in our model, as those features may still be predictive when combined with other features.
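
The visual comparison of class overlap can be cross-checked numerically. The sketch below is not part of the original analysis: it swaps in a one-way ANOVA F-test per feature (`f_classif` from scikit-learn), where a low F value suggests that the class means of a feature are hard to tell apart. It assumes scikit-learn's bundled copy of the wine data (`load_wine`) in place of the `ucimlrepo` download, so column names differ slightly, and it uses a hypothetical seed of 123.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.feature_selection import f_classif
from sklearn.model_selection import train_test_split

# Assumption: the bundled scikit-learn copy stands in for the UCI download.
X, y = load_wine(return_X_y=True, as_frame=True)

# Score on the training portion only, mirroring the split used in the analysis.
X_train, _, y_train, _ = train_test_split(
    X, y, train_size=0.70, stratify=y, random_state=123  # hypothetical seed
)

# One-way ANOVA F-statistic and p-value for each feature against the class label.
f_stat, p_val = f_classif(X_train, y_train)
scores = pd.DataFrame(
    {"F": f_stat, "p_value": p_val}, index=X_train.columns
).sort_values("F")

print(scores)  # features with the lowest F values are listed first
```

A univariate score like this ignores interactions between features, so a low rank here is not a reason on its own to drop a feature.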

In [14]:
# fetch dataset 
wine = fetch_ucirepo(id=109) 

#Split into train/test with equal distribution of target classes
wine_train, wine_test = train_test_split(
    wine.data.original, train_size=0.70, stratify=wine.data.original['class']
)

#Save split data
wine_train.to_csv("./data/processed/wine_train.csv")
wine_test.to_csv("./data/processed/wine_test.csv")
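
The split above does not fix a random seed, so the rows in each set change between runs. The following sketch shows one way to make the split repeatable and to confirm that stratification preserved the class proportions. It assumes scikit-learn's bundled copy of the data (`load_wine`), where the label column is named `target` rather than `class`, and the seed value 123 is a hypothetical choice.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Assumption: bundled scikit-learn copy of the data, label column named "target".
wine_df = load_wine(as_frame=True).frame

train_df, test_df = train_test_split(
    wine_df,
    train_size=0.70,
    stratify=wine_df["target"],
    random_state=123,  # hypothetical seed; any fixed integer gives a repeatable split
)

# Class shares should be nearly identical across the full, train and test sets.
shares = {
    name: part["target"].value_counts(normalize=True).sort_index()
    for name, part in [("all", wine_df), ("train", train_df), ("test", test_df)]
}
for name, share in shares.items():
    print(name, share.round(3).to_dict())
```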

In [15]:
#Data info

wine_train.info()
wine.variables

<class 'pandas.core.frame.DataFrame'>
Index: 124 entries, 91 to 55
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Alcohol                       124 non-null    float64
 1   Malicacid                     124 non-null    float64
 2   Ash                           124 non-null    float64
 3   Alcalinity_of_ash             124 non-null    float64
 4   Magnesium                     124 non-null    int64  
 5   Total_phenols                 124 non-null    float64
 6   Flavanoids                    124 non-null    float64
 7   Nonflavanoid_phenols          124 non-null    float64
 8   Proanthocyanins               124 non-null    float64
 9   Color_intensity               124 non-null    float64
 10  Hue                           124 non-null    float64
 11  0D280_0D315_of_diluted_wines  124 non-null    float64
 12  Proline                       124 non-null    int64  
 13  class     

| | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | class | Target | Categorical | | | | no |
| 1 | Alcohol | Feature | Continuous | | | | no |
| 2 | Malicacid | Feature | Continuous | | | | no |
| 3 | Ash | Feature | Continuous | | | | no |
| 4 | Alcalinity_of_ash | Feature | Continuous | | | | no |
| 5 | Magnesium | Feature | Integer | | | | no |
| 6 | Total_phenols | Feature | Continuous | | | | no |
| 7 | Flavanoids | Feature | Continuous | | | | no |
| 8 | Nonflavanoid_phenols | Feature | Continuous | | | | no |
| 9 | Proanthocyanins | Feature | Continuous | | | | no |


In [16]:
cols_to_scale = wine.variables.query('role == "Feature" and type in ["Continuous", "Integer"]')["name"].to_list()
wine_preprocessor = make_column_transformer(
    (StandardScaler(), cols_to_scale),
    remainder='passthrough'
)

wine_preprocessor.fit(wine_train)
wine_preprocessor.set_output(transform='pandas')
scaled_wine_train = wine_preprocessor.transform(wine_train)
scaled_wine_test = wine_preprocessor.transform(wine_test)

#Remove prefix added by column transformer
scaled_wine_train.columns = [col.split('__')[1] for col in scaled_wine_train.columns]
scaled_wine_test.columns = [col.split('__')[1] for col in scaled_wine_test.columns]

scaled_wine_train.to_csv("./data/processed/scaled_wine_train.csv")
scaled_wine_test.to_csv("./data/processed/scaled_wine_test.csv")

In [17]:
#Density plot for each numerical variable. Code adapted from https://github.com/ttimbers/breast_cancer_predictor_py

# melt for plotting via facets 
wine_melted = scaled_wine_train.melt(
    id_vars=['class'],
    var_name='predictor',
    value_name='value'
)

# Plot the distribution of each feature for each class of wine
alt.Chart(wine_melted, width=150, height=100).transform_density(
    'value',
    groupby=['class', 'predictor']
).mark_area(opacity=0.7).encode(
    x=alt.X("value:Q"),
    y=alt.Y('density:Q').stack(False),
    color='class:N'
).facet(
    'predictor:N',
    columns=4
).resolve_scale(
    y='independent'
)