### Example of a simple MLP in TensorFlow with Keras using the [White wine quality dataset](https://archive.ics.uci.edu/ml/datasets/wine+quality)

This is one of two datasets from the UCI ML repository with data on red and white variants of the Portuguese "Vinho Verde" wine [Cortez et al., 2009]. They are appropriate for either classification or regression tasks. The classes are ordered, but not balanced, e.g., there are many more normal wines than excellent or poor ones. The eleven input variables are: 1 fixed acidity, 2 volatile acidity, 3 citric acid, 4 residual sugar, 5 chlorides, 6 free sulfur dioxide, 7 total sulfur dioxide, 8 density, 9 pH, 10 sulphates, 11 alcohol and the output variable is 12 quality (score between 0 and 10).

You can run this notebook on your local computer or on Colab, but on Colab you must take the extra step of uploading your data file. The file will only be there while your Colab session continues. If you stop and come back a few hours later, you will have to upload it again.

In [None]:
import tensorflow as tf
from tensorflow.keras import layers
import keras
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, LabelEncoder
import os

### 1 Get the data

We have to load the file slightly differently depending on whether we are running on Colab or on our local computer. This cell sets the variable **IN_COLAB** to True if and only if we are running on Colab.

In [None]:
try:
    import google.colab
    IN_COLAB = True
    print("Running in colab")
except ImportError:
    IN_COLAB = False
    print("Not running in colab")

This is the name of the file that has our data.

In [None]:
data_file_name = "winequality.csv"

This will ask you to upload the file if you are using Colab, which requires navigating to the file and clicking on it. Note that the file will be deleted from the Colab server when your session ends. The session ends if you disconnect completely or are inactive for too long.

In [None]:
if IN_COLAB:
    if os.path.exists(data_file_name):
        print("Deleting prior version of", data_file_name)
        os.remove(data_file_name)
    import io
    from google.colab import files
    print("When asked to select your file, find and click on", data_file_name)
    uploaded = files.upload()
    the_file = io.BytesIO(uploaded[data_file_name])
else:
    the_file = data_file_name

The numpy function [genfromtxt](https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html) is an easy way to import data from a file.
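
One thing to watch for: by default `genfromtxt` parses every field as a float, so a header row of column names comes back as a row of `nan`. The sketch below uses a small made-up CSV (not the wine data) to show the effect and the `skip_header` argument that avoids it. Whether your copy of the data file has a header row is an assumption you should check.

```python
import io
import numpy as np

# A tiny made-up CSV with a header row, standing in for a real data file.
toy_csv = "alcohol,quality\n9.5,5\n10.1,6\n"

# Without skip_header, the text header cannot be parsed and becomes nan.
raw = np.genfromtxt(io.StringIO(toy_csv), delimiter=',')

# skip_header=1 drops the first line before parsing.
clean = np.genfromtxt(io.StringIO(toy_csv), delimiter=',', skip_header=1)

print(raw)
print(clean)
print("nan in raw:", np.isnan(raw).any(), "| nan in clean:", np.isnan(clean).any())
```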

In [None]:
data = np.genfromtxt(the_file, delimiter=',')
print("data.shape:", data.shape)
print(data)

**X** will be the array minus the last column and **y** will be the last column.

In [None]:
X = data[:,:-1]
y = data[:,-1]
print(f"X.shape={X.shape} y.shape={y.shape}")

### 2 Preprocess the data

We typically need to do some preprocessing of the data to make it work better with our machine learning system.

The unique y values are floating-point versions of the integers between 3 and 9. TensorFlow will be happier if the y values are in {0,1,2,3,4,5,6}. sklearn's [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) does this easily by finding the unique values, sorting them, and replacing each value with its position in the sorted list, starting at 0.
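
To see what that mapping does, here is a minimal NumPy sketch of the same sort-then-index idea on made-up scores. It is an illustration, not LabelEncoder's actual source; the names `toy_y`, `classes`, and `encoded` are invented for this example.

```python
import numpy as np

# Made-up quality scores stored as floats, as genfromtxt would return them.
toy_y = np.array([6.0, 5.0, 9.0, 3.0, 6.0])

# np.unique returns the sorted unique values; return_inverse gives, for each
# element, the index of its value in that sorted list.
classes, encoded = np.unique(toy_y, return_inverse=True)

print(classes)   # sorted unique original values
print(encoded)   # integer labels starting at 0

# The original scores can be recovered by indexing back into classes.
decoded = classes[encoded]
print(decoded)
```

A fitted LabelEncoder keeps the sorted values in its `classes_` attribute and offers `inverse_transform` for the reverse mapping, which is handy when reporting predictions as original quality scores.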

In [None]:
print(f"Initial unique y values = {np.unique(y)}")
y = LabelEncoder().fit_transform(y)
print(f"After encoding, unique y values = {np.unique(y)}")

We'll use the variables num_features and num_classes later when we define the TF model.

In [None]:
num_features = X.shape[1]
num_classes = len(set(y))
print(f"num_features={num_features}; num_classes={num_classes}")

It's generally a good idea to scale the X columns, so let's use sklearn's [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to do this. TensorFlow has scalers also, but the sklearn version is familiar and easy to use. We print the first three rows of X before and after scaling.
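
Standard scaling shifts each column to mean 0 and rescales it to standard deviation 1. The following NumPy sketch does this by hand on a made-up two-column array (`toy_X` is invented for illustration), assuming the population standard deviation, which is what `np.std` computes by default.

```python
import numpy as np

# Made-up feature matrix: two columns on very different scales.
toy_X = np.array([[7.0, 170.0],
                  [6.3, 132.0],
                  [8.1,  97.0],
                  [7.2, 186.0]])

# Standardize each column: subtract its mean, divide by its standard deviation.
col_mean = toy_X.mean(axis=0)
col_std = toy_X.std(axis=0)   # population standard deviation (ddof=0)
toy_X_scaled = (toy_X - col_mean) / col_std

print(toy_X_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_X_scaled.std(axis=0))   # approximately 1 for each column
```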

In [None]:
print(f"Before scaling the input\n{X[:3]}")
X_scaled = StandardScaler().fit(X).transform(X)
print(f"After scaling the input\n{X_scaled[:3]}")

### 3 Build the model

Using Keras makes constructing a neural network model easy. 

Keras keeps global state from previously built models when we rerun cells repeatedly while debugging. Calling the clear_session() function resets that state so each run starts clean.

In [None]:
tf.keras.backend.clear_session()

Our model will have three dense hidden layers of size 64, 32, and 16, followed by a final output layer with as many nodes as we have classes. The final layer uses the softmax activation rather than relu because its outputs are positive and sum to 1, so they can be read as the model's estimated probability for each class.
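
As a rough illustration of what softmax does to the final layer's raw outputs, here is a small NumPy version applied to made-up scores. The function `softmax` and the values in `toy_logits` are written for this example and are not taken from Keras.

```python
import numpy as np

def softmax(logits):
    """Map a vector of raw scores to positive values that sum to 1."""
    shifted = logits - np.max(logits)  # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

toy_logits = np.array([2.0, 1.0, 0.1, -1.0])  # made-up raw outputs
probs = softmax(toy_logits)

print(probs)
print(probs.sum())            # 1, up to floating-point rounding
print(int(np.argmax(probs)))  # index of the most probable class
```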

Calling model.summary() prints a description of the model so we can confirm it looks OK.
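
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output. The sketch below assumes 11 input features, 7 classes, and hidden layers of 64, 32, and 16 units; substitute your own values of num_features and num_classes if they differ.

```python
# Assumed sizes for illustration: 11 features, hidden layers 64/32/16, 7 classes.
n_features, n_classes = 11, 7
layer_sizes = [n_features, 64, 32, 16, n_classes]

# A dense layer has (inputs * outputs) weights plus one bias per output.
params_per_layer = [n_in * n_out + n_out
                    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(params_per_layer)
print(sum(params_per_layer))
```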

In [None]:
model = tf.keras.models.Sequential([
  layers.Dense(64, input_shape=(num_features,), activation='relu'),
  layers.Dense(32, activation='relu'),
  layers.Dense(16, activation='relu'),
  layers.Dense(num_classes, activation='softmax'),
])
model.summary()

### 4 Compile the Keras model into a neural network 

This step prepares the network for training and lets us specify additional parameters. These are typical values for the optimizer and metrics. The loss function *sparse_categorical_crossentropy* is a good choice for a multiclass problem whose labels are integers.
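
To make the loss concrete: for each sample it takes the negative log of the probability the model assigned to the true class, then averages over samples. This NumPy sketch uses made-up probabilities and labels; Keras's own implementation also guards against numerical edge cases such as a probability of exactly 0, which this version omits.

```python
import numpy as np

# Made-up softmax outputs for 3 samples over 4 classes (each row sums to 1).
toy_probs = np.array([[0.7, 0.1, 0.1, 0.1],
                      [0.2, 0.5, 0.2, 0.1],
                      [0.1, 0.1, 0.2, 0.6]])
# Integer class labels ("sparse" means they are not one-hot encoded).
toy_labels = np.array([0, 1, 2])

# Pick out the probability assigned to each sample's true class ...
p_true = toy_probs[np.arange(len(toy_labels)), toy_labels]
# ... and average the negative log of those probabilities.
loss = -np.log(p_true).mean()

print(p_true)
print(loss)
```

The third sample, where the true class received a probability of only 0.2, contributes the most to the loss.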

In [None]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
print("compiled, ready to train")

### 5 Train the model

TensorFlow's callback mechanism lets us add several extras. This [**EarlyStopping**](https://keras.io/api/callbacks/early_stopping/) callback monitors the loss on the validation data and stops training when it has not improved (i.e., gotten lower) for `patience` successive epochs, here 4.
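
The patience rule can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not Keras's implementation; the function `epochs_until_stop` and the loss values are made up for the example.

```python
def epochs_until_stop(val_losses, patience):
    """Return how many epochs run before a patience-based stop triggers."""
    best = float("inf")
    wait = 0  # consecutive epochs without a new best validation loss
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch  # stop after this epoch
    return len(val_losses)  # never triggered; all epochs ran

# Made-up validation losses: the best value occurs at epoch 3.
toy_val_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.705, 0.73, 0.9]
stopped_at = epochs_until_stop(toy_val_losses, patience=4)
print(stopped_at)
```

Note that training stops several epochs after the best validation loss was seen. Keras's EarlyStopping accepts a `restore_best_weights` argument for rolling back to the best epoch, which may be worth trying here.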

In [None]:
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=4, verbose=1)

We train the model with a validation_split that holds out a fraction of the data (here 20%) as validation data, and records the loss and accuracy on both the main training data and this validation data after each epoch.
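
According to the Keras documentation, validation_split takes the held-out samples from the end of the arrays, before any shuffling. If a data file happens to be sorted (by quality, say), the validation set would then be unrepresentative. Whether that applies to this file is an assumption to check; as a precaution, the rows can be shuffled first. The sketch below shows the idea on made-up arrays (`toy_X`, `toy_y`) with a manual 20% split.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the shuffle is repeatable

# Made-up data: 10 samples, 2 features, with labels sorted by class.
toy_X = np.arange(20, dtype=float).reshape(10, 2)
toy_y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Shuffle rows of X and y with the same permutation so pairs stay aligned.
order = rng.permutation(len(toy_y))
X_shuf, y_shuf = toy_X[order], toy_y[order]

# Mimic a 20% hold-out taken from the end of the arrays.
n_val = int(0.2 * len(toy_y))
split_at = len(toy_y) - n_val
X_train, X_val = X_shuf[:split_at], X_shuf[split_at:]
y_train, y_val = y_shuf[:split_at], y_shuf[split_at:]

print(X_train.shape, X_val.shape)
print("validation labels:", y_val)
```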

The fit() method returns a history object which records useful information about the model's training. We'll examine it after we train.

This will take 5-20 seconds to run, depending on your computer.  Wait for it to print the message.

In [None]:
history = model.fit(X_scaled, y, verbose=0, epochs=900, validation_split = 0.2, callbacks=[callback])
print(f"{len(history.history['loss'])} epochs run; final accuracy = {history.history['accuracy'][-1]}")

### 6 Review the results, draw conclusions, and make adjustments

We plot the training accuracy and validation accuracy recorded in the history. The **EarlyStopping** callback kicked in when the val_loss had not improved (i.e., gotten lower) for several epochs. This suggests that the model is beginning to overfit the training data and it's time to stop training.

While the resulting accuracy may not sound impressive, it is about as good as one can get with this data using any machine learning method without overfitting. If you edit this notebook to train on X rather than X_scaled, the accuracy is significantly lower, about 0.46.
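
One way to put an accuracy figure in context on imbalanced data is to compare it with the accuracy of always predicting the most common class. The sketch below computes that baseline for made-up labels; replace `toy_y` with the encoded y to get the figure for the wine data.

```python
import numpy as np

# Made-up encoded labels with one dominant class, standing in for y.
toy_y = np.array([2, 3, 3, 3, 2, 3, 4, 3, 1, 3])

counts = np.bincount(toy_y)              # number of samples per class index
majority_class = int(np.argmax(counts))  # the most frequent class
baseline_accuracy = counts.max() / len(toy_y)

print(counts)
print(majority_class, baseline_accuracy)
```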

We can plot the accuracy and validation accuracy for each epoch easily.

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend(['accuracy','validation accuracy'])

history.history is a dictionary that records values after each epoch. Looking at its keys shows us the metrics it records.
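
Because each entry is a list with one value per epoch, ordinary NumPy functions work on it. As a sketch, the following finds the epoch with the lowest validation loss in a hypothetical dictionary (`toy_history`, with made-up numbers) shaped like history.history.

```python
import numpy as np

# Hypothetical stand-in for history.history: one list per metric, one value per epoch.
toy_history = {
    "loss":         [1.30, 1.10, 1.00, 0.95, 0.90],
    "accuracy":     [0.45, 0.52, 0.55, 0.57, 0.59],
    "val_loss":     [1.25, 1.12, 1.08, 1.10, 1.15],
    "val_accuracy": [0.47, 0.51, 0.54, 0.53, 0.52],
}

best_epoch = int(np.argmin(toy_history["val_loss"]))  # zero-based index
print(f"lowest val_loss at epoch {best_epoch + 1}: "
      f"val_loss={toy_history['val_loss'][best_epoch]}, "
      f"val_accuracy={toy_history['val_accuracy'][best_epoch]}")
```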

In [None]:
history.history.keys()

Plotting the loss and validation loss is also informative.

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['loss','validation loss'])

fin