# Wine Quality Dataset 

## Data Description

### Red Wine Quality - Parameters
* fixed.acidity (tartaric acid - g / dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily) 
* volatile.acidity (acetic acid - g / dm^3): the amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegar taste 
* citric.acid (g / dm^3): found in small quantities, citric acid can add 'freshness' and flavor to wines 
* residual.sugar (g / dm^3): the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet 
* chlorides (sodium chloride - g / dm^3): the amount of salt in the wine 
* free.sulfur.dioxide (mg / dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine 
* total.sulfur.dioxide (mg / dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine 
* density (g / cm^3): the density of wine is close to that of water depending on the percent alcohol and sugar content 
* pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale 
* sulphates (potassium sulphate - g / dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant 
* alcohol (% by volume): the percent alcohol content of the wine 
* quality: quality score between 0 and 10

### Objective

* To explore the physicochemical properties of red wine
* To determine an optimal machine learning model for red wine quality classification


In [3]:
# Import libraries
import pandas as pd
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt

# Sklearn modules.
from sklearn.model_selection import train_test_split


In [193]:
# Include any additional modules or libraries your code might need here.

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

In [194]:
# Code to load the data from file. Here we use the pandas library to read the csv file. 
datafile = "../data/red-wine-dataset/wineQualityReds.csv"
wine_df = pd.read_csv(datafile)
wine_df.drop(wine_df.columns[0],axis=1,inplace=True)


In [195]:
# Split the data into training and testing sets using the sklearn function train_test_split.
# Notice that 25% of the samples are held out for testing and that random_state=42 makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(wine_df.drop('quality',axis=1), wine_df['quality'], test_size=.25, random_state=42)


# Challenge 1
Use the variables `X_train`, `X_test`, `y_train`, and `y_test` to explore your data. In particular, calculate and display the following information.

* Number of samples in the training set in total and in each class.
* Number of samples in the testing set in total and in each class.
* Number of features in the dataset. 
* Number of classes in the dataset.
* IDs (labels) of the classes.
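
One possible way to compute these summaries is sketched below. It runs on a small synthetic stand-in (the hypothetical `toy_df`, with made-up values for two columns) rather than the wine CSV, so the printed numbers are illustrative only. The same calls should apply unchanged to `X_train`, `X_test`, `y_train`, and `y_test`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (assumption: NOT the real dataset).
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "alcohol": rng.normal(10.0, 1.0, size=200),
    "pH": rng.normal(3.3, 0.15, size=200),
    "quality": rng.integers(3, 9, size=200),  # integer scores from 3 to 8
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df.drop("quality", axis=1), toy_df["quality"],
    test_size=0.25, random_state=42,
)

# Totals come from the shape of the feature matrices.
n_train, n_features = X_tr.shape
n_test = X_te.shape[0]

# Per-class counts come from the label Series.
train_counts = y_tr.value_counts().sort_index()
test_counts = y_te.value_counts().sort_index()

# Class IDs are the distinct label values.
class_ids = np.unique(toy_df["quality"])

print("Training samples:", n_train)
print("Training samples per class:\n", train_counts)
print("Testing samples:", n_test)
print("Testing samples per class:\n", test_counts)
print("Number of features:", n_features)
print("Number of classes:", len(class_ids))
print("Class IDs:", class_ids)
```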


In [220]:
# Your Solution here




# Challenge 2

Train an SVM classifier using the `(X_train,y_train)` dataset and use the trained model to predict the underlying classes for the observations in the test dataset `X_test`. Store your predictions in a variable called `y_pred`.
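
As a rough illustration of the fit-then-predict pattern, the sketch below trains scikit-learn's `SVC` on synthetic data from `make_classification` (an assumption standing in for the wine features). The `StandardScaler` step is an optional addition not required by the challenge; SVMs are often sensitive to feature scale, so it is included here as one reasonable choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data with 11 features (assumption: NOT the wine dataset).
X, y = make_classification(
    n_samples=300, n_features=11, n_informative=5,
    n_classes=3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Scale the features, then fit an SVM with default parameters.
clf = make_pipeline(StandardScaler(), SVC())
clf.fit(X_tr, y_tr)

# Predict one class label per test observation.
y_pred = clf.predict(X_te)
print(y_pred[:10])
```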

In [10]:
# Your solution 


# Challenge 3

Evaluate the performance of your classifier. Calculate and display the following:
* the `confusion matrix`.
* the `normalized confusion matrix`. 
* the probability of correct classification (accuracy score). 
* the `precision`, `recall`, and `f1-score` for each class.
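
The functions in `sklearn.metrics` cover each of these quantities. The sketch below applies them to a tiny hand-written pair of label lists (made up for illustration, not model output), so the arithmetic is easy to check by hand; in the challenge you would pass `y_test` and `y_pred` instead.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
)

# Made-up labels for illustration only.
y_true_demo = [5, 5, 5, 6, 6, 6, 7, 7]
y_pred_demo = [5, 5, 6, 6, 6, 5, 7, 6]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)

# normalize="true" divides each row by the number of true samples in that class.
cm_norm = confusion_matrix(y_true_demo, y_pred_demo, normalize="true")

# Fraction of predictions that match the true label (5 of 8 here).
acc = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print(cm_norm)
print("Accuracy:", acc)

# Precision, recall and f1-score for each class.
print(classification_report(y_true_demo, y_pred_demo))
```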

In [11]:
# Your solution 


# Challenge 4

The code below loads the same dataset, but treats it as a binary classification problem. That is, instead of classifying an observation into one of 11 categories (scores 0 to 10), we consider all observations with a score above 5 as good and all observations with a score of 5 or below as bad.
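
To make the threshold concrete, here is a minimal example on a handful of made-up scores (not taken from the dataset). Note that a score of exactly 5 falls on the "Bad" side.

```python
import numpy as np
import pandas as pd

# Made-up quality scores for illustration only.
scores = pd.Series([3, 5, 6, 8, 5, 7])

# Scores strictly above 5 become "Good"; everything else becomes "Bad".
labels = np.where(scores > 5, "Good", "Bad")
print(labels.tolist())
```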





In [12]:
# Code to load the data from file. Here we use the pandas library to read the csv file. 
datafile = "../data/red-wine-dataset/wineQualityReds.csv"
wine_df = pd.read_csv(datafile)
wine_df.drop(wine_df.columns[0],axis=1,inplace=True)

wine_df['quality'] = np.where(wine_df['quality']>5,"Good","Bad")

In [13]:
X_train, X_test, y_train, y_test = train_test_split(wine_df.drop('quality',axis=1), wine_df['quality'], test_size=.25, random_state=42)


## Challenge 4.1
Use the variables `X_train`, `X_test`, `y_train`, and `y_test` to explore your data. In particular, calculate and display the following information.
* Number of samples in the training set in total and in each class.
* Number of samples in the testing set in total and in each class.
* Number of features in the dataset. 
* Number of classes in the dataset.
* IDs (labels) of the classes.




In [14]:
# Your Solution 


## Challenge 4.2 
Train an SVM classifier using the `(X_train,y_train)` dataset and use the trained model to predict the underlying classes for the observations in the test dataset `X_test`. Store your predictions in a variable called `y_pred`.

In [15]:
# Your solution 


## Challenge 4.3
Evaluate the performance of your classifier. Calculate and display the following:
* the `confusion matrix`.
* the `normalized confusion matrix`. 
* the probability of correct classification (accuracy score). 
* the `precision`, `recall`, and `f1-score` for each class.

In [16]:
# Your Solution 


# Challenge 5

The SVM classifier accepts a number of parameters, including `C`, `kernel`, `degree`, and `gamma`. Evaluate the classifier for different values of these parameters and identify which configuration achieves the best performance on the testing set. Plot or print your results.
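
One way to organize such a sweep is sketched below, using scikit-learn's `ParameterGrid` to enumerate configurations. The data is synthetic (`make_classification`), the parameter values are arbitrary starting points rather than recommendations, and the scaling step is an optional assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary data with 11 features (assumption: NOT the wine dataset).
X, y = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Each dict lists the parameters that are relevant for one kernel.
param_grid = ParameterGrid([
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
    {"kernel": ["poly"], "C": [1], "degree": [2, 3]},
])

results = []
for params in param_grid:
    model = make_pipeline(StandardScaler(), SVC(**params))
    model.fit(X_tr, y_tr)
    # score() returns the accuracy on the given data.
    results.append((model.score(X_te, y_te), params))

for acc, params in results:
    print(f"{acc:.3f}  {params}")

best_acc, best_params = max(results, key=lambda r: r[0])
print("Best:", best_acc, best_params)
```

Picking the configuration on the testing set, as the challenge asks, tends to give an optimistic estimate; cross-validating on the training set (for example with `GridSearchCV`) is a common alternative.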


In [17]:
# Your solution here.


# Challenge 6
Select the best parameter configuration for the SVM classifier on this dataset (i.e. the best combination of C, kernel, and gamma or degree). Also choose the best parameter configuration for the KNN classifier. Evaluate both classifiers by running them on 100 random train-test splits of the dataset. You can achieve that using a loop (e.g. a `for` loop) and by calling the `train_test_split` function without specifying the `random_state` parameter, so that each call produces a new random split. For example, the foundation of your code could look like this:

```python
for i in range(100):
    # Omitting random_state gives a different random split on every iteration.
    X_train, X_test, y_train, y_test = train_test_split(wine_df.drop('quality',axis=1), wine_df['quality'], test_size=.25)

    # Your code to train, test and evaluate the classifier.
```
Your code should report the __mean__ and __standard deviation__ of the __accuracy__ of each classifier. Based on your findings, which algorithm performs better? 
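
A fuller version of that loop might look like the sketch below. It uses synthetic data and placeholder parameter values (`C=1.0`, `n_neighbors=5`) purely as assumptions; substitute the wine data and the best configurations you found.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary data with 11 features (assumption: NOT the wine dataset).
X, y = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=0,
)

# Factories so that every split trains a fresh, unfitted model.
# The parameter values are placeholders, not tuned choices.
models = {
    "SVM": lambda: SVC(C=1.0, kernel="rbf", gamma="scale"),
    "KNN": lambda: KNeighborsClassifier(n_neighbors=5),
}
scores = {name: [] for name in models}

for i in range(100):
    # No random_state, so each iteration draws a new random split.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25)
    for name, make_model in models.items():
        model = make_model().fit(X_tr, y_tr)
        scores[name].append(model.score(X_te, y_te))  # accuracy on this split

for name, vals in scores.items():
    print(f"{name}: mean={np.mean(vals):.3f}, std={np.std(vals):.3f}")
```

Both classifiers are evaluated on the same split in each iteration, which keeps the comparison paired.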