<a href="https://colab.research.google.com/github/DeepCodeSec/ml1000-p1/blob/working_models/Project1_wine_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working code for wine classification problem.

**Dataset:**
* wine quality data retrieved from https://archive.ics.uci.edu/ml/datasets/Wine+Quality

* using the white wine csv only (had more observations than the red)


**Problem:**
* how do we test if the wine is high quality and should be priced accordingly?

## Install packages and load in dataset

* When using Colab, pycaret needs to be installed at the start of every session
* Read in the wine quality dataset using the "raw" link from the git repository
  * alternatively, any other URL for the data that ends in .csv can be used
  * for reproducibility, we are trying to avoid linking the Colab notebook to git, Google Drive, or a local server

In [2]:
pip install pycaret

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pycaret
  Downloading pycaret-2.3.10-py3-none-any.whl (320 kB)
Collecting kmodes>=0.10.1
  Downloading kmodes-0.12.2-py2.py3-none-any.whl (20 kB)
Collecting spacy<2.4.0
  Downloading spacy-2.3.9-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.0 MB)
Collecting pyyaml<6.0.0
  Downloading PyYAML-5.4.1-cp38-cp38-manylinux1_x86_64.whl (662 kB)
Collecting scipy<=1.5.4
  Downloading scipy-1.5.4-cp38-cp38-manylinux1_x86_64.whl (25.8 MB)

In [6]:
import pandas as pd

df_path = 'https://raw.githubusercontent.com/DeepCodeSec/ml1000-p1/working_models/data/winequality-white.csv'
# The raw file is semicolon-delimited, so sep=';' is needed for the columns to be parsed correctly
data = pd.read_csv(df_path, sep=';')
data.head()


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


In [10]:
# What is the distribution of the target variable (quality)
data.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0
mean,6.854788,0.278241,0.334192,6.391415,0.045772,35.308085,138.360657,0.994027,3.188267,0.489847,10.514267,5.877909
std,0.843868,0.100795,0.12102,5.072058,0.021848,17.007137,42.498065,0.002991,0.151001,0.114126,1.230621,0.885639
min,3.8,0.08,0.0,0.6,0.009,2.0,9.0,0.98711,2.72,0.22,8.0,3.0
25%,6.3,0.21,0.27,1.7,0.036,23.0,108.0,0.991723,3.09,0.41,9.5,5.0
50%,6.8,0.26,0.32,5.2,0.043,34.0,134.0,0.99374,3.18,0.47,10.4,6.0
75%,7.3,0.32,0.39,9.9,0.05,46.0,167.0,0.9961,3.28,0.55,11.4,6.0
max,14.2,1.1,1.66,65.8,0.346,289.0,440.0,1.03898,3.82,1.08,14.2,9.0
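
Because `quality` only takes integer values, a frequency count may show its distribution more directly than the summary statistics do. The following is a minimal sketch on a small hypothetical sample (the `toy` frame is invented for illustration and stands in for `data`):

```python
import pandas as pd

# Hypothetical toy ratings standing in for data['quality']
toy = pd.DataFrame({"quality": [6, 6, 5, 7, 6, 5, 8, 6, 4, 7]})

# Number of wines per rating, ordered by rating rather than by frequency
counts = toy["quality"].value_counts().sort_index()

# The same distribution expressed as proportions
shares = toy["quality"].value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```

On the real dataset, the same two calls on `data['quality']` would show how many wines fall at each rating.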


### Recode quality to a binary label
**Original**: quality of wine rated from 0-10 with 10 as the best

The summary above shows that the minimum rating is 3 and the maximum is 9. The mean (5.88) and median (6) are both close to 6.

According to this wine rating guide, a rating of 7+ indicates a good wine. Imbalanced classes are not a big deal; the imbalance just dictates which performance metric we use. It is more important to define the classes based on real-world knowledge. https://vineroutes.com/wine-rating-system/#:~:text=Wines%20rated%2089%20and%20above,outstanding%20for%20its%20particular%20type.

**New**: binary label (target variable) of 'standard' or 'high quality', where a rating of 7 or above is high quality and 6 or below is standard

In [12]:
import numpy as np

#add binary classification label
data['new_quality'] = np.where(data['quality'] > 6, 
                               'high_quality',
                               'standard')
data.head(100)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,new_quality
0,7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.0010,3.00,0.45,8.8,6,standard
1,6.3,0.300,0.34,1.6,0.049,14.0,132.0,0.9940,3.30,0.49,9.5,6,standard
2,8.1,0.280,0.40,6.9,0.050,30.0,97.0,0.9951,3.26,0.44,10.1,6,standard
3,7.2,0.230,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.40,9.9,6,standard
4,7.2,0.230,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.40,9.9,6,standard
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,7.1,0.260,0.29,12.4,0.044,62.0,240.0,0.9969,3.04,0.42,9.2,6,standard
96,6.0,0.340,0.66,15.9,0.046,26.0,164.0,0.9979,3.14,0.50,8.8,6,standard
97,8.6,0.265,0.36,1.2,0.034,15.0,80.0,0.9913,2.95,0.36,11.4,7,high_quality
98,9.8,0.360,0.46,10.5,0.038,4.0,83.0,0.9956,2.89,0.30,10.1,4,standard


In [13]:
#drop old quality column and rename new
data = data.drop(columns=['quality']) #drops old column
data = data.rename(columns={'new_quality':'quality'}) #renames back to quality

data.head() #double check it did what we asked

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,standard
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,standard
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,standard
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,standard
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,standard


## Exploratory analysis report

The code below automatically creates an exploratory data analysis (EDA) report. The report is output as an HTML file in the local files (see the files pane on the left).

For the final report/project, we will want to highlight specific aspects of the EDA document that justify the decisions below and code these explicitly. For example, if we use a parametric classifier that needs independent (uncorrelated) predictors, we would show the correlation matrix in the published notebook and comment on how strongly the predictors are correlated and how we are dealing with that.

In [15]:
# Load libraries for the exploratory analysis
!pip3 install pandas_profiling --upgrade
from pandas_profiling import ProfileReport

# Build the profiling report from the recoded dataset
pr = ProfileReport(data)

# Write the report to an HTML file in the working directory
pr.to_file(output_file="EDA.html")

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Dataset has:
* 12 variables (11 numeric predictors and 1 categorical target/label)
* 4898 observations
* no missing values

Distributions:
* imbalanced label classes (~20% high quality and ~80% standard)
  * again, this is not an issue; it just means we need to think about resampling (e.g. undersampling) and choose an appropriate performance metric
* most of the predictor variables are fairly normally distributed
* alcohol, volatile acidity, and residual sugar are not normally distributed, so we can consider transforming these columns (e.g. log transformation) if needed
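
As a rough illustration of the log transformation mentioned above, the sketch below applies `np.log1p` to a small set of right-skewed values and compares the skewness before and after. The `toy` values are hypothetical and only stand in for a column such as `residual sugar`; whether a transform is actually needed would have to be checked on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values standing in for data['residual sugar']
toy = pd.DataFrame(
    {"residual sugar": [0.6, 1.2, 1.6, 1.7, 5.2, 6.9, 8.5, 9.9, 20.7, 65.8]}
)

# log1p computes log(1 + x), which stays defined when a value is 0
toy["residual sugar (log)"] = np.log1p(toy["residual sugar"])

# Skewness closer to 0 indicates a more symmetric distribution
skew_before = toy["residual sugar"].skew()
skew_after = toy["residual sugar (log)"].skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```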


Interactions:
* skipping; not very interesting for classification because pycaret is going to run a dozen models anyway. Interactions matter more for a regression-type problem where, for example, revenue is predicted better from TV x radio advertisements together than from TV ads or radio ads alone

Correlations:
* moderate correlation between density + residual sugar, density + alcohol, alcohol + chlorides
* leaving these as they are for now, but revisit during fine-tuning if model performance is poor
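
To show the correlation matrix explicitly in the notebook rather than only in the HTML report, something along these lines could be used. This is a sketch on synthetic data: the `toy` frame, its column relationships, and the 0.5 threshold are all assumptions made for illustration, not values taken from the wine dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic predictors; the notebook itself would use `data`
rng = np.random.default_rng(0)
sugar = rng.normal(size=200)
toy = pd.DataFrame({
    "residual sugar": sugar,
    # constructed to correlate with residual sugar
    "density": 0.8 * sugar + rng.normal(scale=0.5, size=200),
    # constructed to be independent of the other two columns
    "pH": rng.normal(size=200),
})

# Pearson correlation between every pair of numeric columns
corr = toy.corr()

# Collect each pair once if its absolute correlation exceeds the assumed threshold
threshold = 0.5
pairs = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(pairs)
```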

Duplicate rows:
* there are a number of duplicate rows. It is hard to say whether these are genuine duplicates that should be removed or coincidental duplicates (i.e. two wine samples that happen to have the same measurements)
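
Before deciding what to do with the duplicates, it may help to count them. The sketch below uses a small hypothetical frame (`toy`) in place of `data`; dropping the rows is shown only as one option, not as the decision made in this project.

```python
import pandas as pd

# Hypothetical toy frame; the last two rows are exact duplicates
toy = pd.DataFrame({
    "fixed acidity": [7.0, 6.3, 7.2, 7.2],
    "pH": [3.00, 3.30, 3.19, 3.19],
    "quality": ["standard"] * 4,
})

# True for every row that exactly repeats an earlier row
is_dup = toy.duplicated()
n_dup = int(is_dup.sum())
print(f"{n_dup} duplicate row(s) out of {len(toy)}")

# One option: keep the first copy of each repeated row
deduped = toy.drop_duplicates(keep="first")
```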


## Data cleaning decisions

* Do outlier/anomaly analysis here. Since the dataset has a relatively small number of observations, opt for solutions that retain as much data as possible (e.g. capping values as opposed to removing rows)
* Make decisions about transformations etc. here
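
One possible way to cap values instead of removing rows is to clip each column at IQR-based fences. The helper below (`cap_outliers`, a hypothetical name) and the multiplier `k = 1.5` are assumptions for illustration; the actual cleaning rule for this project has not been decided.

```python
import pandas as pd

def cap_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the fence values."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical values standing in for data['chlorides']; 0.35 is an extreme value
toy = pd.Series([0.03, 0.04, 0.04, 0.05, 0.05, 0.06, 0.35], name="chlorides")
capped = cap_outliers(toy)

# No rows are lost; only the extreme value is pulled in to the upper fence
print(pd.DataFrame({"original": toy, "capped": capped}))
```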

In [None]:
# fill in the blanks

## Split dataset

* 5% test set (unseen until after model is finalized) 
* 5% validation (used to tune)
* 90% training (want large training set especially with relatively few observations)

We also need to take the sampling method into account because the classes are imbalanced:
* under/over sample
* stratified sampling 
* etc
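
A 90/5/5 split with stratified sampling could be sketched as two successive calls to scikit-learn's `train_test_split`. The `toy` frame below is hypothetical (built with roughly the 80/20 class balance seen above), and the `random_state` value is arbitrary; this is one way to do the split, not necessarily the one the project will settle on.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with an 80/20 class balance, standing in for `data`
toy = pd.DataFrame({
    "alcohol": [9.0 + 0.01 * i for i in range(1000)],
    "quality": ["high_quality"] * 200 + ["standard"] * 800,
})

# Step 1: hold out 10% of the rows, stratified on the label
train, holdout = train_test_split(
    toy, test_size=0.10, stratify=toy["quality"], random_state=42
)

# Step 2: split the holdout in half to get 5% validation and 5% test
valid, test = train_test_split(
    holdout, test_size=0.50, stratify=holdout["quality"], random_state=42
)

# Check that each part keeps roughly the original class balance
shares = {}
for name, part in [("train", train), ("valid", valid), ("test", test)]:
    shares[name] = (part["quality"] == "high_quality").mean()
    print(f"{name}: {len(part)} rows, {shares[name]:.0%} high quality")
```

Stratifying keeps the proportion of high-quality wines about the same in all three parts, which matters most for the small validation and test sets.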