<a href="https://colab.research.google.com/github/cagBRT/Machine-Learning/blob/master/HP4%20WineQuality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Clone the entire repo.
!git clone -l -s https://github.com/cagBRT/Machine-Learning.git cloned-repo
%cd cloned-repo
!ls

# **Can the quality of wine be predicted from its measurable characteristics?**

**Fixed acidity**: acids are major wine properties and contribute greatly to the wine’s taste. Usually, the total acidity is divided into two groups: the volatile acids and the nonvolatile or fixed acids. Among the fixed acids that you can find in wines are the following: tartaric, malic, citric, and succinic. This variable is expressed in g(tartaric acid)/dm3 in the data sets.<br>
**Volatile acidity**: volatile acidity measures the volatile acids in a wine, chiefly acetic acid, which at high levels give the wine a vinegar taste. In the U.S., the legal limits of volatile acidity are 1.2 g/L for red table wine and 1.1 g/L for white table wine. In these data sets, the volatile acidity is expressed in g(acetic acid)/dm3.<br>
**Citric acid** is one of the fixed acids that you’ll find in wines. It’s expressed in g/dm3 in the two data sets.<br>
**Residual sugar** typically refers to the sugar remaining after fermentation stops, or is stopped. It’s expressed in g/dm3 in the red and white data.<br>
**Chlorides** can be a significant contributor to saltiness in wine. Here, you’ll see that it’s expressed in g(sodium chloride)/dm3.<br>
**Free sulfur dioxide**: the part of the sulfur dioxide added to a wine that is lost into it is said to be bound, while the part that remains active is said to be free. The winemaker will always try to keep the proportion of free sulfur dioxide as high as possible. This variable is expressed in mg/dm3 in the data.<br>
**Total sulfur dioxide** is the sum of the bound and the free sulfur dioxide (SO2). Here, it’s expressed in mg/dm3. There are legal limits for sulfur levels in wines: in the EU, red wines can only have 160mg/L, while white and rose wines can have about 210mg/L. Sweet wines are allowed to have 400mg/L. For the US, the legal limits are set at 350mg/L, and for Australia, this is 250mg/L.<br>
**Density** is generally used as a measure of the conversion of sugar to alcohol. Here, it’s expressed in g/cm3.<br>
**pH** or the potential of hydrogen is a numeric scale to specify the acidity or basicity of the wine. As you might know, solutions with a pH less than 7 are acidic, while solutions with a pH greater than 7 are basic. With a pH of 7, pure water is neutral. Most wines have a pH between 2.9 and 3.9 and are therefore acidic.<br>
**Sulfates** are to wine as gluten is to food. You might already know sulfites from the headaches that they can cause. They are a regular part of winemaking around the world and are considered necessary. In this case, they are expressed in g(potassium sulphate)/dm3.<br>
**Alcohol**: wine is an alcoholic beverage, and as you know, the percentage of alcohol can vary from wine to wine. It shouldn’t be surprising that this variable is included in the data sets, where it’s expressed in % vol.<br>
**Quality**: wine experts graded the wine quality between 0 (very bad) and 10 (very excellent). The eventual number is the median of at least three evaluations made by those same wine experts.<br>

# **Load the libraries**

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

# Install TensorFlow
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
import pathlib

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np

from tensorflow.keras.utils import to_categorical
from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)

# **Load the data**
The data is in two files:<br>
>winequality-white.csv<br>
>winequality-red.csv<br>

In [None]:
# Read in white wine data 
#white = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=';')
white = pd.read_csv("winequality-white.csv", sep=';')

# Read in red wine data 
red = pd.read_csv("winequality-red.csv", sep=';')

In [None]:
# Show the last five rows of the white wine data
white.tail()

In [None]:
red.tail()

# **Combine the two files into one DataFrame called wines**


**Add a column called type**: <br>
red = 1<br>
white = 0

In [None]:
# Combine reds and whites into one dataset
# Add `type` column to `red` with value 1
red['type'] = 1

# Add `type` column to `white` with value 0
white['type'] = 0

# Stack `white` below `red` (DataFrame.append was removed in pandas 2.0)
wines = pd.concat([red, white], ignore_index=True)

In [None]:
wines

# **Check out the distribution of the quality scores of the wines**

In [None]:
wines['quality'].value_counts()

# **Check for missing data in wines**

In [None]:
wines.isna().sum()

# **Is there a strong correlation between any of the features?**

Can the model be simplified (without increasing the MSE) by removing features? <br>
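
One way to read the heatmap numerically is to list the feature pairs with the largest absolute correlation. The sketch below is only an illustration: it uses a small made-up DataFrame (the columns `a`, `b`, `c` are placeholders, not wine features) and a hypothetical helper named `top_correlated_pairs`. On the real data you would pass `wines` instead of `demo`.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=3):
    """Return the n feature pairs with the largest absolute correlation.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (self-correlation of 1.0) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Made-up data: 'b' is a scaled copy of 'a', 'c' is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + 1.0,
    "c": rng.normal(size=200),
})

top_pairs = top_correlated_pairs(demo, n=1)
print(top_pairs)
```

A pair with a correlation close to 1 carries largely redundant information, so dropping one of the two features is a reasonable first experiment.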


In [None]:
corr = wines.corr()
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
plt.show()

# **Split the dataset into a training set and a test set**
Models with very few hyperparameters are easy to validate and tune, so you can probably reduce the size of your validation set. If your model has many hyperparameters, you will want a large validation set as well.

**The `frac` argument is a hyperparameter that can be tuned.**
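
To see what changing `frac` does, the sketch below applies the same sample-then-drop recipe to a made-up 100-row DataFrame (a stand-in for `wines`, with placeholder columns) and records the resulting train and test sizes. The name `split_sizes` is introduced here only for illustration.

```python
import pandas as pd

# Stand-in for the combined `wines` DataFrame: 100 made-up rows.
demo = pd.DataFrame({"feature": range(100), "quality": [5, 6] * 50})

split_sizes = {}
for frac in (0.5, 0.8, 0.95):
    # Same split recipe as the notebook: sample a fraction for training,
    # then use every remaining row for testing.
    train = demo.sample(frac=frac, random_state=0)
    test = demo.drop(train.index)
    split_sizes[frac] = (len(train), len(test))
    print(f"frac={frac}: {len(train)} training rows, {len(test)} test rows")
```

A larger `frac` gives the model more rows to learn from but leaves fewer rows for estimating the test error, so the test metric becomes noisier.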

In [None]:
# Consider changing the ratio of the train/test split
# Try frac values from 0.5 to 0.95
wines_train = wines.sample(frac=0.5,random_state=0)
wines_test = wines.drop(wines_train.index)
print("done")

In [None]:
wines_train

# **Remove the labels from the dataset**

In [None]:
train_labels = wines_train.pop('quality')
test_labels = wines_test.pop('quality')

In [None]:
wines_train

In [None]:
train_stats = wines_train.describe()
train_stats = train_stats.transpose()
train_stats

In [None]:
test_stats = wines_test.describe()
test_stats = test_stats.transpose()
test_stats

# **Normalize the data**

In [None]:
def norm(x):
  return (x - train_stats['mean']) / train_stats['std']

normed_train_data = norm(wines_train)
normed_test_data = norm(wines_test)
print("done")

# **Build the model**
There are several hyperparameters that can be tuned in this cell. <br>
>**The number of layers**: Increasing the number of hidden layers may or may not improve accuracy; it depends on the complexity of the problem that you are trying to solve.<br>
Adding many more hidden layers than the problem needs causes the network to overfit the training set, so accuracy on the test set decreases.<br>
**The activation functions**: <br>
>*Sigmoid*: Suffers from the vanishing gradient problem. For a shallow network with only a few layers that use this activation, this isn’t a big problem.<br>
>*Tanh* (hyperbolic tangent): Also suffers from the vanishing gradient problem.<br>
>*ReLU*: Mitigates the vanishing gradient problem, but can result in dead neurons.<br>
**The size of layers**: Increasing or decreasing the number of nodes per layer changes the capacity of the model. Adding layers can be a shortcut to more capacity with fewer resources, and modern techniques allow learning algorithms to successfully train deep models.<br>
**The optimizer**: Optimizers update the weight parameters to minimize the loss function. The loss function acts as a guide to the terrain, telling the optimizer whether it is moving in the right direction to reach the global minimum.<br>
**The learning rate of the optimizer**: Must be >= 0. Varies from .001 --><br>
**The loss function**: The Mean Squared Error, or MSE, loss is the default loss to use for regression problems. Mathematically, it is the preferred loss function under the inference framework of maximum likelihood if the distribution of the target variable is Gaussian. It is the loss function to evaluate first and to change only if you have a good reason.
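
To make the vanishing gradient remarks more concrete, the sketch below compares the derivatives of the three activations using NumPy only. The helper names (`sigmoid_grad`, `tanh_grad`, `relu_grad`) and the sample inputs are assumptions made for this illustration, and the depth calculation ignores the weight matrices, so treat it as a rough intuition rather than a description of how Keras computes gradients.

```python
import numpy as np

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)), never larger than 0.25.
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2, close to 0 for large |x|.
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    # Derivative is 1 for positive inputs and 0 otherwise.
    return (x > 0).astype(float)

x = np.array([-6.0, -2.0, 0.5, 2.0, 6.0])
grads = {
    "sigmoid": sigmoid_grad(x),
    "tanh": tanh_grad(x),
    "relu": relu_grad(x),
}
for name, g in grads.items():
    print(name, np.round(g, 4))

# Backpropagation multiplies one such factor per layer, so even the largest
# possible sigmoid derivative (0.25) shrinks quickly with depth.
depth = 10
sigmoid_chain = 0.25 ** depth
print("best-case sigmoid gradient factor after", depth, "layers:", sigmoid_chain)
```

The zeros in the ReLU row are the "dead neuron" case: a unit whose input stays negative passes no gradient back and stops learning.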

<br>



In [None]:

# Import from tensorflow.keras to match the other Keras imports in this notebook
from tensorflow.keras import metrics

inputs = len(normed_train_data.keys())
print("number of inputs to the model = " + str(inputs))

def build_model():
  # Two hidden layers followed by a single linear output for regression
  model = keras.Sequential([
    layers.Dense(124, activation=tf.nn.relu, input_shape=[inputs]),
    layers.Dense(124, activation=tf.nn.relu),
    layers.Dense(1)
  ])

  optimizer = tf.keras.optimizers.RMSprop(0.001)

  model.compile(loss='mean_squared_error',
                optimizer=optimizer,
                metrics=['mean_absolute_error', 'mean_squared_error'])
  return model

In [None]:
model = build_model()
print("done")

In [None]:
model.summary()

# **Train the model**
Changing the number of epochs can change the performance of the model. <br>

In [None]:
EPOCHS = 20
history = model.fit(
  normed_train_data, train_labels,
  epochs=EPOCHS, validation_split=0.2, verbose=0, batch_size=16)

In [None]:
hist = pd.DataFrame(history.history)
hist['epochs'] = history.epoch
hist.tail(20)

# **Plot the loss**
This is not a good plot. There is a good possibility that the model is too complex.
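
A quick numeric check can complement the plot. The sketch below uses made-up loss values (not results from this notebook) in a DataFrame shaped like `hist`, finds the epoch where the validation loss was lowest, and measures the gap between validation and training loss at the end. The names `hist_demo`, `best_epoch`, and `final_gap` are introduced only for this example.

```python
import pandas as pd

# Made-up loss curves for illustration only: training loss keeps falling
# while validation loss turns upward after a few epochs.
hist_demo = pd.DataFrame({
    "loss":     [0.90, 0.60, 0.45, 0.38, 0.33, 0.29],
    "val_loss": [0.95, 0.70, 0.58, 0.57, 0.61, 0.66],
})

# Epoch (0-based) at which the validation loss was lowest.
best_epoch = int(hist_demo["val_loss"].idxmin())

# Gap between validation and training loss at the final epoch.
final_gap = hist_demo["val_loss"].iloc[-1] - hist_demo["loss"].iloc[-1]

print("lowest validation loss at epoch", best_epoch)
print("final validation - training gap:", round(final_gap, 2))
```

If the lowest validation loss occurs well before the last epoch and the gap keeps widening, that pattern is consistent with overfitting, and fewer epochs or a smaller model may be worth trying.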

In [None]:
axes = plt.gca()
axes.set_ylim([0,1])
plt.plot(hist['loss'], label='training loss')
plt.plot(hist['val_loss'], label='validation loss')
plt.title('Loss')
plt.xlabel('epochs')
plt.ylabel('loss')
plt.legend()

In [None]:
loss, mae, mse = model.evaluate(normed_test_data, test_labels, verbose=0)

print("Testing set Mean Abs Error: {:5.2f}".format(mae))


# **Make a prediction with the model**
When making predictions, whatever was done to the data during training must also be done to all data entered into the model for predictions.
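
For normalization, this means new cases are scaled with the mean and standard deviation of the training set, never with statistics computed from the new cases themselves. The sketch below illustrates the idea on made-up numbers; the two columns and the helper `norm_with` are placeholders chosen for this example.

```python
import pandas as pd

# Made-up training data and new cases; values are placeholders.
train = pd.DataFrame({"alcohol": [9.0, 10.0, 11.0], "pH": [3.0, 3.2, 3.4]})
new_cases = pd.DataFrame({"alcohol": [10.0, 12.0], "pH": [3.2, 3.6]})

# Statistics come from the training set only.
stats = train.describe().transpose()

def norm_with(stats, x):
    """Z-score x using the given mean/std table (hypothetical helper)."""
    return (x - stats["mean"]) / stats["std"]

normed_new = norm_with(stats, new_cases)
print(normed_new)
```

A new case that equals the training mean maps to 0, and a case two training standard deviations above the mean maps to 2, regardless of what the other new cases look like.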

In [None]:
# The csv file has 4 wines
# Add the type to the dataframe
predict_case = pd.read_csv("prediction_cases.csv", sep=';')
typew = [0,0,1,1] 
predict_case['type'] = typew

In [None]:
# Pop the quality labels off the dataframe
prediction_labels = predict_case.pop('quality')

In [None]:
#normalize the data
normed_prediction_cases = norm(predict_case)
normed_prediction_cases 

In [None]:
# There are four wines that the model has not seen
# Choose number 0, 1, 2, or 3
number = 3
test1 = normed_prediction_cases.loc[[number]]
test1.transpose()

In [None]:
prediction = model.predict(test1)

In [None]:
print(prediction)

In [None]:
print("prediction is ", prediction)
print("actual value is ", prediction_labels.iloc[number] )