# Dataset import and exploration
- https://www.kaggle.com/shelvigarg/wine-quality-dataset
- Refer to https://github.com/better-data-science/TensorFlow/blob/main/003_TensorFlow_Classification.ipynb for detailed preparation instructions

In [1]:
import os
import numpy as np
import pandas as pd
import warnings
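# Hide TensorFlow INFO and WARNING log messages
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
# Silence Python warnings to keep the notebook output short
warnings.filterwarnings('ignore')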

df = pd.read_csv('data/winequalityN.csv')
df.sample(5)

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
349,white,7.6,0.17,0.45,11.2,0.054,56.0,137.0,0.997,3.15,0.47,10.0,5
4416,white,5.7,0.24,0.47,6.3,0.069,35.0,182.0,0.99391,3.11,0.46,9.75,5
4866,white,5.7,0.41,0.21,1.9,0.048,30.0,112.0,0.99138,3.29,0.55,11.2,6
4338,white,7.3,0.19,0.27,13.9,0.057,45.0,155.0,0.99807,2.94,0.41,8.8,8
5244,red,6.6,0.815,0.02,2.7,0.072,17.0,34.0,0.9955,3.58,0.89,12.3,7


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


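# Prepare the data
# Drop rows with missing values; copy so the new columns are added to an independent frame
df = df.dropna().copy()
# Binary feature: 1 for white wine, 0 for red
df['is_white_wine'] = (df['type'] == 'white').astype(int)
# Binary target: 1 if quality is 6 or higher, 0 otherwise
df['is_good_wine'] = (df['quality'] >= 6).astype(int)
df = df.drop(['type', 'quality'], axis=1)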

# Train/test split
X = df.drop('is_good_wine', axis=1)
y = df['is_good_wine']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, random_state=42
)

# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

<br>

# Modelling

In [3]:
import tensorflow as tf
tf.random.set_seed(42)

<br>

## Callbacks list
- I like to declare the callbacks up front, before building the model

### `ModelCheckpoint`
- It saves the model locally at the end of an epoch if it beats the best performance seen so far
- The configuration below saves it to an `hdf5` file named in the following format:
    - `<dir>/model-<epoch>-<val_accuracy>.hdf5`
- With `save_best_only=True`, the model is saved only if the validation accuracy is higher than the best value from all previous epochs, not just the last one

In [4]:
cb_checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='checkpoints/model-{epoch:02d}-{val_accuracy:.2f}.hdf5',
    monitor='val_accuracy',
    mode='max',
    save_best_only=True,
    verbose=1
)
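
To see how the filename template and `save_best_only` interact, here is a minimal pure-Python sketch. It does not call Keras: `simulate_checkpoints` is a hypothetical helper, and the validation accuracies are illustrative values, not real training output. It assumes the template is filled in with the 1-based epoch number and the logged metric value.

```python
FILEPATH = 'checkpoints/model-{epoch:02d}-{val_accuracy:.2f}.hdf5'


def simulate_checkpoints(val_accuracies, filepath=FILEPATH):
    """Return the filenames a save-best-only policy (mode='max') would write."""
    best = float('-inf')
    saved = []
    for epoch, val_accuracy in enumerate(val_accuracies, start=1):
        # Save only when the metric beats the best value seen so far
        if val_accuracy > best:
            best = val_accuracy
            saved.append(filepath.format(epoch=epoch, val_accuracy=val_accuracy))
    return saved


# Illustrative values only (epochs 2 and 4 do not improve on the best so far)
illustrative_history = [0.761, 0.755, 0.766, 0.760, 0.771]
for name in simulate_checkpoints(illustrative_history):
    print(name)
```

Epochs 2 and 4 write nothing, because their accuracy is below the running best. Note that two files can share the same rounded accuracy in their names while the underlying values differ.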

### `ReduceLROnPlateau`
- If a metric (validation loss) doesn't decrease for a number of epochs (10), reduce the learning rate
- New learning rate = old learning rate * factor (0.1)
    - For example, starting from 0.01: nlr = 0.01 * 0.1 = 0.001
- You can also set a minimum learning rate (`min_lr`) below which the learning rate won't drop

In [5]:
cb_reducelr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    mode='min',
    factor=0.1,
    patience=10,
    verbose=1,
    min_lr=0.00001
)
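
The schedule this produces can be sketched in plain Python. This is a simplified simulation, not the Keras implementation: `simulate_reduce_lr` is a hypothetical helper that ignores the callback's `cooldown` and `min_delta` options, and the validation losses are made up. The starting rate of 0.001 assumes Adam's default learning rate.

```python
def simulate_reduce_lr(val_losses, initial_lr=0.001, factor=0.1,
                       patience=10, min_lr=0.00001):
    """Return the learning rate in effect after each epoch (mode='min')."""
    lr = initial_lr
    best = float('inf')
    wait = 0
    lrs = []
    for loss in val_losses:
        if loss < best:
            # Improvement: remember it and reset the patience counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Plateau: shrink the rate, but never below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
        lrs.append(lr)
    return lrs


# Made-up losses: one good epoch followed by a long plateau
lrs = simulate_reduce_lr([0.5] + [0.6] * 30)
print(lrs[9], lrs[10], lrs[20], lrs[30])
```

In this run the rate drops after every 10 epochs without improvement and then stays pinned at `min_lr`.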

### `EarlyStopping`
- If a metric (validation accuracy) doesn't increase by at least some minimum delta (0.001) for a given number of epochs (10), stop the training process


In [6]:
cb_earlystop = tf.keras.callbacks.EarlyStopping(
    monitor='val_accuracy',
    mode='max',
    min_delta=0.001,
    patience=10,
    verbose=1
)
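
The stopping rule can be sketched without Keras. This is a simplified simulation under my own assumptions: `early_stopping_epoch` is a hypothetical helper, the accuracies are made up, and an epoch counts as an improvement only if it beats the best value by more than `min_delta`.

```python
def early_stopping_epoch(val_accuracies, min_delta=0.001, patience=10):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('-inf')
    wait = 0
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy - min_delta > best:
            # A large enough gain resets the patience counter
            best = accuracy
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up history: the gain at epoch 3 (0.0005) is smaller than min_delta
history = [0.70, 0.75, 0.7505] + [0.75] * 12
print(early_stopping_epoch(history))
```

The tiny gain at epoch 3 does not reset the counter, so the 10 epochs of patience are counted from epoch 3 onward.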

### `CSVLogger`
- Captures model training history and dumps it to a CSV file
- Useful for analyzing the performance later

In [7]:
cb_csvlogger = tf.keras.callbacks.CSVLogger(
    filename='training_log.csv',
    separator=',',
    append=False
)
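
A rough sketch of the later analysis, using only the standard library. The CSV content below is made up for illustration, and the column names (`epoch`, `val_accuracy`, and so on) and the zero-based `epoch` column are assumptions about what the log contains; check the header of your own `training_log.csv` first.

```python
import csv
import io

# Made-up stand-in for the contents of training_log.csv
sample_log = io.StringIO(
    "epoch,accuracy,loss,val_accuracy,val_loss\n"
    "0,0.71,0.58,0.74,0.53\n"
    "1,0.75,0.52,0.76,0.50\n"
    "2,0.76,0.50,0.75,0.51\n"
)

rows = list(csv.DictReader(sample_log))

# Find the row with the highest validation accuracy
best_row = max(rows, key=lambda row: float(row["val_accuracy"]))
best_epoch = int(best_row["epoch"]) + 1  # assumes the logged epoch is zero-based

print(f"Best val_accuracy {best_row['val_accuracy']} at epoch {best_epoch}")
```

With a real log you would open the file instead, for example with `open('training_log.csv', newline='')`.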

<br>

- For simplicity's sake we'll treat the test set as a validation set
- In real deep learning projects you'll want to have 3 sets: training, validation, and test
- We'll tell the model to train for 1000 epochs, but the `EarlyStopping` callback will stop it well before that
- Specify callbacks in the `fit()` function
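
The three-set split mentioned above can be sketched as follows. This is a dependency-free illustration, not what this notebook does: `three_way_split` is a hypothetical helper and the 60/20/20 proportions are an arbitrary choice. With scikit-learn you could get the same effect by calling `train_test_split` twice.

```python
import random


def three_way_split(n_rows, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into train, validation and test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    n_test = int(n_rows * test_fraction)
    n_val = int(n_rows * val_fraction)

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

In that setup the callbacks would monitor the validation set, and the test set would be used only once, for the final evaluation.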

In [8]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(
    loss=tf.keras.losses.binary_crossentropy,
    optimizer=tf.keras.optimizers.Adam(),
    metrics=[tf.keras.metrics.BinaryAccuracy(name='accuracy')]
)

history = model.fit(
    X_train_scaled, 
    y_train, 
    epochs=1000,
    validation_data=(X_test_scaled, y_test),
    callbacks=[cb_checkpoint, cb_reducelr, cb_earlystop, cb_csvlogger]
)

Epoch 1/1000
Epoch 1: val_accuracy improved from -inf to 0.76102, saving model to checkpoints\model-01-0.76.hdf5
Epoch 2/1000
Epoch 2: val_accuracy did not improve from 0.76102
Epoch 3/1000
Epoch 3: val_accuracy improved from 0.76102 to 0.76566, saving model to checkpoints\model-03-0.77.hdf5
Epoch 4/1000
Epoch 4: val_accuracy did not improve from 0.76566
Epoch 5/1000
Epoch 5: val_accuracy improved from 0.76566 to 0.77108, saving model to checkpoints\model-05-0.77.hdf5
Epoch 6/1000
Epoch 6: val_accuracy improved from 0.77108 to 0.77185, saving model to checkpoints\model-06-0.77.hdf5
Epoch 7/1000
Epoch 7: val_accuracy improved from 0.77185 to 0.77572, saving model to checkpoints\model-07-0.78.hdf5
Epoch 8/1000
Epoch 8: val_accuracy did not improve from 0.77572
Epoch 9/1000
Epoch 9: val_accuracy did not improve from 0.77572
Epoch 10/1000
Epoch 10: val_accuracy did not improve from 0.77572
Epoch 11/1000
Epoch 11: val_accuracy did not improve from 0.77572
Epoch 12/1000
Epoch 12: val_accurac

<br>

## Final evaluation
- You can now load the best model. Because `save_best_only=True`, every saved checkpoint beats all earlier ones, so the best model is the one with the highest epoch number

In [9]:
best_model = tf.keras.models.load_model('checkpoints/model-25-0.80.hdf5')
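
If you'd rather not read the filename off the training log by hand, picking the checkpoint with the highest epoch number can be sketched like this. `latest_checkpoint` is a hypothetical helper, and it assumes the `model-<epoch>-<accuracy>.hdf5` naming used above; the filenames in the list are the ones visible in the training output.

```python
import re


def latest_checkpoint(filenames):
    """Return the filename with the highest epoch number, or None."""
    pattern = re.compile(r'model-(\d+)-')
    candidates = []
    for name in filenames:
        match = pattern.search(name)
        if match:
            # Compare epochs as integers so that 100 sorts after 25
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    return max(candidates)[1]


saved_files = [
    'model-01-0.76.hdf5',
    'model-03-0.77.hdf5',
    'model-05-0.77.hdf5',
    'model-06-0.77.hdf5',
    'model-07-0.78.hdf5',
]
print(latest_checkpoint(saved_files))
```

In practice the list would come from something like `os.listdir('checkpoints')`. Parsing the epoch as an integer matters because a plain string sort breaks once the epoch number needs more digits than the `02d` padding provides.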

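- `predict_classes()` was removed from `Sequential` models in TensorFlow 2.6, so call `predict()` and convert the probabilities to classes yourself
- With a single sigmoid output, a probability above 0.5 maps to class 1 and anything else maps to class 0

In [10]:
# predict() returns probabilities; threshold at 0.5 to get class labels
best_model_preds = np.ravel((best_model.predict(X_test_scaled) > 0.5).astype('int32'))
best_model_preds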

array([1, 1, 0, ..., 1, 0, 1], dtype=int32)

- Evaluate as you normally would

In [11]:
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, best_model_preds))

0.7981438515081206
