<a href="https://colab.research.google.com/github/cagBRT/Machine-Learning/blob/master/LinearRegression4_WineQuality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Clone the entire repo.
!git clone -l -s https://github.com/cagBRT/Machine-Learning.git cloned-repo
%cd cloned-repo

# **Can the quality of wine be predicted from its measurable characteristics?**

**Fixed acidity**: acids are major wine properties and contribute greatly to the wine’s taste. Usually, the total acidity is divided into two groups: the volatile acids and the nonvolatile or fixed acids. Among the fixed acids that you can find in wines are the following: tartaric, malic, citric, and succinic. This variable is expressed in g(tartaricacid)/dm3 in the data sets.<br>
**Volatile acidity**: volatile acids, chiefly acetic acid, give wine a vinegar-like taste when their level is too high. In the U.S., the legal limits of volatile acidity are 1.2 g/L for red table wine and 1.1 g/L for white table wine. In these data sets, volatile acidity is expressed in g(acetic acid)/dm3.<br>
**Citric acid** is one of the fixed acids that you’ll find in wines. It’s expressed in g/dm3 in the two data sets.<br>
**Residual sugar** typically refers to the sugar remaining after fermentation stops, or is stopped. It’s expressed in g/dm3 in the red and white data.<br>
**Chlorides** can be a significant contributor to saltiness in wine. Here, you’ll see that it’s expressed in g(sodium chloride)/dm3.<br>
**Free sulfur dioxide**: the part of the sulfur dioxide added to a wine that reacts with other compounds is said to be bound, while the part that remains active is said to be free. Winemakers aim to keep a high proportion of the sulfur dioxide free rather than bound. This variable is expressed in mg/dm3 in the data.<br>
**Total sulfur dioxide** is the sum of the bound and the free sulfur dioxide (SO2). Here, it’s expressed in mg/dm3. There are legal limits for sulfur levels in wines: in the EU, red wines can only have 160mg/L, while white and rose wines can have about 210mg/L. Sweet wines are allowed to have 400mg/L. For the US, the legal limits are set at 350mg/L, and for Australia, this is 250mg/L.<br>
**Density** is generally used as a measure of the conversion of sugar to alcohol. Here, it’s expressed in g/cm3.<br>
**pH**, or the potential of hydrogen, is a numeric scale that specifies the acidity or basicity of the wine. As you might know, solutions with a pH less than 7 are acidic, while solutions with a pH greater than 7 are basic. With a pH of 7, pure water is neutral. Most wines have a pH between 2.9 and 3.9 and are therefore acidic.<br>
**Sulphates** are to wine as gluten is to food. You might already know the related sulfites from the headaches they are said to cause. They are a regular part of winemaking around the world and are considered necessary. In this case, they are expressed in g(potassium sulphate)/dm3.<br>
**Alcohol**: wine is an alcoholic beverage, and as you know, the percentage of alcohol can vary from wine to wine. It shouldn’t be surprising that this variable is included in the data sets, where it’s expressed in % vol.<br>
**Quality**: wine experts graded the wine quality between 0 (very bad) and 10 (excellent). The final score is the median of at least three evaluations made by those same wine experts.<br>

# **Load the libraries**

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

# Install TensorFlow
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
import pathlib

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np

from sklearn.linear_model import LinearRegression
from keras.utils import to_categorical
from tensorflow import keras
from tensorflow.keras import layers
from sklearn import metrics

print(tf.__version__)

# **Load the data**
The data is in a file:<br>
>winequality-red.csv<br>

In [None]:
# Read in the red wine data (the white wine alternatives below are commented out)
#white = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=';')
#white = pd.read_csv("winequality-white.csv", sep=';')

wines = pd.read_csv("winequality-red.csv", sep=';')

In [None]:
wines.tail()

# **Check the distribution of wine quality scores**

In [None]:
wines['quality'].value_counts()

In [None]:
plt.figure(figsize=(15,10))
# distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.histplot(wines['quality'], discrete=True)
plt.tight_layout()

# **Check for missing data in wines**

In [None]:
wines.isna().sum()
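
As a quick illustration of what this check reports, the sketch below builds a small made-up DataFrame and counts the missing values per column. The name `toy`, its values and the missing entry are all invented for the example; none of it comes from the wine data.

```python
import numpy as np
import pandas as pd

# Toy data with one deliberately missing value; not the real wine data.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, np.nan, 10.0],
    "pH": [3.51, 3.20, 3.26, 3.16],
})

# isna() marks missing cells as True; sum() counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# One simple option is to drop rows that contain any missing value.
toy_clean = toy.dropna()
print(len(toy_clean))
```

If every count is zero, as is hoped for `wines`, no cleaning step is needed before modeling.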

# **Is there a strong correlation between any of the features?**

Can the model be simplified (without increasing the MSE) by removing features? <br>


In [None]:
corr = wines.corr()
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
plt.show()
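
A heatmap can be hard to read precisely, so it may help to list the most strongly correlated feature pairs as numbers. The sketch below does this on synthetic data: the column names `a`, `b` and `c` are made up, and `b` is constructed to be almost a copy of `a`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=n),  # nearly a scaled copy of "a"
    "c": rng.normal(size=n),                        # unrelated noise
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair of features appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs)

strongest_pair = pairs.index[0]
print(strongest_pair)
```

Two features that are strongly correlated with each other carry largely the same information, so one of them is a reasonable candidate for removal.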

# **Split the dataset into a training set and a test set**
Models with few hyperparameters are easy to validate and tune, so you can probably reduce the size of the held-out data. If your model has many hyperparameters, you will want a larger validation set, and a large test set as well.

In [None]:
# Consider changing the ratio of the train/test split
# (try frac values between 0.5 and 0.95)
wines_train = wines.sample(frac=.8,random_state=0)
wines_test = wines.drop(wines_train.index)
print("done")
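
The `sample`/`drop` pattern used for the split can be checked on a tiny example. In this sketch, `toy` is a made-up ten-row DataFrame; the point is only to show that the two resulting sets share no rows and together cover all of the data.

```python
import pandas as pd

# Ten made-up rows standing in for the wine data.
toy = pd.DataFrame({"x": range(10), "quality": [5, 6] * 5})

# Sample 80% of the rows for training; a fixed seed keeps the split repeatable.
toy_train = toy.sample(frac=0.8, random_state=0)
# The test set is every row whose index label was not sampled.
toy_test = toy.drop(toy_train.index)

print(len(toy_train), len(toy_test))

# No index label should appear in both sets.
overlap = set(toy_train.index) & set(toy_test.index)
print(overlap)
```

This relies on the index labels being unique, which holds for a DataFrame freshly read with `pd.read_csv`.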

In [None]:
print("Training shape:",wines_train.shape)
print("Test shape:",wines_test.shape)

In [None]:
wines_train

# **Remove the labels from the dataset**

In [None]:
train_labels = wines_train.pop('quality')
test_labels = wines_test.pop('quality')

In [None]:
wines_train

In [None]:
train_stats = wines_train.describe()
train_stats = train_stats.transpose()
train_stats

In [None]:
test_stats = wines_test.describe()
test_stats = test_stats.transpose()
test_stats

# **Normalize the data**

In [None]:
def norm(x):
  return (x - train_stats['mean']) / train_stats['std']

normed_train_data = norm(wines_train)
normed_test_data = norm(wines_test)
print("done")
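
To see what this normalization does, the sketch below applies the same formula to synthetic data. The frames `toy_train` and `toy_test` and their values are invented. Note that the test data is scaled with the training statistics, as in the cell above, so no information from the test set is used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
toy_train = pd.DataFrame({
    "alcohol": rng.normal(10.0, 1.0, size=100),
    "density": rng.normal(0.997, 0.002, size=100),
})
toy_test = pd.DataFrame({"alcohol": [9.0, 12.0], "density": [0.995, 0.999]})

# Per-column mean and standard deviation of the training data only.
stats = toy_train.describe().transpose()

def norm(x):
    # Subtraction and division align on column names.
    return (x - stats["mean"]) / stats["std"]

normed_train = norm(toy_train)
normed_test = norm(toy_test)

# After normalization each training column has mean ~0 and std ~1.
print(normed_train.mean().round(6))
print(normed_train.std().round(6))
```

The test columns will generally not have mean 0 and standard deviation 1 exactly, which is expected.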

# **Use the scikit-learn Linear Regression Model**


In [None]:
regressor = LinearRegression()  
regressor.fit(wines_train, train_labels)

In [None]:
# score() returns the coefficient of determination (R^2) on the test set
print(regressor.score(wines_test, test_labels))
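
For a regressor, `score` reports the coefficient of determination, R². The sketch below computes it by hand on a few made-up values (`y_true` and `y_hat` are invented) to show what the number means: one minus the ratio of the model's squared error to the squared error of always predicting the mean.

```python
import numpy as np

# Made-up actual and predicted quality scores.
y_true = np.array([5.0, 6.0, 5.0, 7.0, 6.0])
y_hat = np.array([5.2, 5.8, 5.4, 6.5, 6.1])

# Squared error of the model's predictions.
ss_res = np.sum((y_true - y_hat) ** 2)
# Squared error of a baseline that always predicts the mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1.0 - ss_res / ss_tot
print(round(r2, 3))
```

A value of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values mean worse than that baseline.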

The regression model has to find the optimal coefficients for all the attributes. To see which coefficients the regression model has chosen:

In [None]:
coeff_df = pd.DataFrame(regressor.coef_, wines_train.columns, columns=['Coefficient'])  
coeff_df

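This means that, with all other features held fixed, a unit increase in “density” corresponds to a decrease of 31.51 units in the predicted quality of the wine. Similarly, a unit decrease in “chlorides” corresponds to an increase of 1.87 units. The remaining coefficients are much smaller, but raw coefficients depend on each feature's units and range, so a small coefficient does not by itself mean that a feature has little effect. Density, for example, varies over only a narrow range across wines, so a full unit change never occurs in practice.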
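
One way to compare effect sizes across features measured in different units is to fit the model on standardized features, so each coefficient describes the effect of a change of one standard deviation. The sketch below illustrates this on synthetic data. The column names `small_range` and `large_range` and the target are hypothetical, built so that both features matter equally per standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "small_range": rng.normal(1.0, 0.002, size=n),  # varies very little
    "large_range": rng.normal(10.0, 1.0, size=n),   # varies a lot
})

# Hypothetical target: each feature contributes 0.5 per standard deviation.
y = (0.5 * (toy["small_range"] - 1.0) / 0.002
     + 0.5 * (toy["large_range"] - 10.0)
     + rng.normal(0, 0.1, size=n))

# Fit once on the raw features and once on standardized features.
raw_model = LinearRegression().fit(toy, y)
scaled = (toy - toy.mean()) / toy.std()
std_model = LinearRegression().fit(scaled, y)

comparison = pd.DataFrame(
    {"raw": raw_model.coef_, "standardized": std_model.coef_},
    index=toy.columns,
)
print(comparison)
```

The raw coefficient of `small_range` comes out far larger than that of `large_range`, while the standardized coefficients are nearly equal. In this notebook the same comparison could be made by fitting on `normed_train_data` instead of `wines_train`.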

In [None]:
y_pred = regressor.predict(wines_test)
y_pred.shape

In [None]:
test_labels

**Check the difference between the actual and predicted value**

In [None]:
df1 = pd.DataFrame({'Actual': test_labels, 'Predicted': y_pred})
df1

**Plot the actual and predicted values**

In [None]:
df1.plot(kind='bar',figsize=(15,8))
plt.show()

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(test_labels, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(test_labels, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(test_labels, y_pred)))
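
The three metrics above are simple functions of the prediction errors. The sketch below computes them with NumPy on a few made-up values (`y_true` and `y_hat` are invented) to show the formulas behind the `metrics` calls.

```python
import numpy as np

# Made-up actual and predicted quality scores.
y_true = np.array([5.0, 6.0, 5.0, 7.0])
y_hat = np.array([5.5, 6.0, 4.0, 6.5])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))   # average size of the error
mse = np.mean(errors ** 2)      # squaring penalizes large errors more
rmse = np.sqrt(mse)             # back in the units of the quality score

print(mae, mse, rmse)
```

Because RMSE is in the same units as the quality score, it can be read directly against the 0 to 10 scale.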


There are several possible reasons why the predictions are not more accurate:<br>
**Need more data**: Predictions generally improve with more training data.<br>
**Bad assumptions**: We assumed that this data has a linear relationship, but that might not be the case. Visualizing the data may help you determine that.<br>
**Poor features**: The features we used may not correlate strongly enough with the values we were trying to predict.<br>
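
One way to probe the linearity assumption is to look at the residuals of a straight-line fit: if the relationship is really curved, the residuals show a systematic pattern. The sketch below demonstrates this on synthetic data that is quadratic by construction. The variables `x` and `y` are made up and unrelated to the wine features.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = x ** 2 + rng.normal(0, 0.2, size=x.size)  # curved relationship plus noise

# Fit a straight line and compute what it fails to explain.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# If the linear assumption held, residuals would be unrelated to x**2.
curvature_corr = np.corrcoef(residuals, x ** 2)[0, 1]
print(round(curvature_corr, 2))
```

A correlation near zero would be consistent with a linear relationship. A value near one, as produced here, means the straight line is missing curvature. Plotting the residuals against each feature gives the same information visually.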

# **Assignment** <br>
Select a feature to remove from the model input. It should be one that has zero or close to zero correlation with quality. 
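
Instead of reading the value off the heatmap, the correlation of every feature with the target can be sorted numerically. The sketch below shows the idea on synthetic data. The columns `useful` and `irrelevant` are made up, and `quality` here is generated from `useful` only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "useful": rng.normal(size=n),
    "irrelevant": rng.normal(size=n),
})
# Synthetic target that depends only on "useful".
toy["quality"] = 2.0 * toy["useful"] + rng.normal(scale=0.5, size=n)

# Absolute correlation of each feature with the target, weakest first.
target_corr = toy.corr()["quality"].drop("quality").abs().sort_values()
print(target_corr)

candidate = target_corr.index[0]
print("Candidate to remove:", candidate)
```

Applied to this notebook, the same expression with `wines.corr()["quality"]` would list the wine features from weakest to strongest correlation with quality.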

In [None]:
corr = wines.corr()
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
plt.show()

In [None]:
wines.columns

In [None]:
wines.pop('the column you want to remove')

In [None]:
# Consider changing the ratio of the train/test split
# (try frac values between 0.5 and 0.95)
wines_train = wines.sample(frac=.8,random_state=0)
wines_test = wines.drop(wines_train.index)
print("done")

In [None]:
train_labels = wines_train.pop('quality')
test_labels = wines_test.pop('quality')

In [None]:
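# Recompute the statistics so they match the remaining columns
train_stats = wines_train.describe().transpose()

def norm(x):
  return (x - train_stats['mean']) / train_stats['std']

normed_train_data = norm(wines_train)
normed_test_data = norm(wines_test)
print("done")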

In [None]:
regressor = LinearRegression()  
regressor.fit(wines_train, train_labels)

In [None]:
coeff_df = pd.DataFrame(regressor.coef_, wines_train.columns, columns=['Coefficient'])  
coeff_df

In [None]:
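# Predict with the retrained model before comparing against the actual values
y_pred = regressor.predict(wines_test)
df1 = pd.DataFrame({'Actual': test_labels, 'Predicted': y_pred})
df1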

In [None]:
df1.plot(kind='bar',figsize=(15,8))
plt.show()

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(test_labels, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(test_labels, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(test_labels, y_pred)))
