# Dataset import and exploration
- Dataset: https://www.kaggle.com/shelvigarg/wine-quality-dataset
- Refer to https://github.com/better-data-science/TensorFlow/blob/main/003_TensorFlow_Classification.ipynb for detailed preparation instructions

In [1]:
import os
import numpy as np
import pandas as pd
import warnings
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' 
warnings.filterwarnings('ignore')

df = pd.read_csv('data/winequalityN.csv')
df.sample(5)

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
673,white,6.7,0.31,0.3,2.1,0.038,18.0,130.0,0.9928,3.36,0.63,10.6,6
2652,white,7.3,0.22,0.31,2.3,0.018,45.0,80.0,0.98936,3.06,0.34,12.9,7
5574,red,10.8,0.4,0.41,2.2,0.084,7.0,17.0,0.9984,3.08,0.67,9.3,6
6416,red,7.4,0.47,0.46,2.2,0.114,7.0,20.0,0.99647,3.32,0.63,10.5,5
3837,white,8.0,0.27,0.33,1.2,0.05,41.0,103.0,0.99002,3.0,0.45,12.4,6


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


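# Prepare the data
# Drop rows with missing values; copy so the new columns land on an independent frame
df = df.dropna().copy()
# Binary features/target: 1 for white wine, 1 for quality of 6 or higher
df['is_white_wine'] = (df['type'] == 'white').astype(int)
df['is_good_wine'] = (df['quality'] >= 6).astype(int)
df = df.drop(['type', 'quality'], axis=1)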

# Train/test split
X = df.drop('is_good_wine', axis=1)
y = df['is_good_wine']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, random_state=42
)

# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

<br>

# Modelling

In [3]:
import tensorflow as tf
tf.random.set_seed(42)

Init Plugin
Init Graph Optimizer
Init Kernel


<br>

## Callbacks list
- I like to declare the callbacks up front, before building the model

### `ModelCheckpoint`
- Saves the model to disk at the end of an epoch, but only if it beats the best score seen so far
- The configuration below saves it to an `hdf5` file named in the following format:
    - `<dir>/model-<epoch>-<accuracy>.hdf5`
- With `save_best_only=True`, the model is saved only when the validation accuracy is higher than the best value from all previous epochs, not just the previous one

In [4]:
cb_checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='checkpoints/model-{epoch:02d}-{val_accuracy:.2f}.hdf5',
    monitor='val_accuracy',
    mode='max',
    save_best_only=True,
    verbose=1
)
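To see how the `filepath` template turns into actual file names, here is a minimal sketch using plain `str.format`. Keras fills the placeholders from the epoch number and the logged metrics; the helper name `checkpoint_name` is made up for illustration, and the two accuracy values are taken from the training output further down.

```python
# Same template as in the ModelCheckpoint call above
TEMPLATE = 'checkpoints/model-{epoch:02d}-{val_accuracy:.2f}.hdf5'


def checkpoint_name(epoch, val_accuracy, template=TEMPLATE):
    """Hypothetical helper: fill the template the way str.format does."""
    return template.format(epoch=epoch, val_accuracy=val_accuracy)


print(checkpoint_name(1, 0.76489))  # checkpoints/model-01-0.76.hdf5
print(checkpoint_name(7, 0.77804))  # checkpoints/model-07-0.78.hdf5
```

Two epochs can round to the same accuracy, so keeping the epoch number in the name is what makes every file name unique.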

### `ReduceLROnPlateau`
- If a metric (validation loss) doesn't improve for a number of epochs (10), reduce the learning rate
- New learning rate = old learning rate * factor (0.1)
    - E.g., nlr = 0.01 * 0.1 = 0.001
- You can also set a minimum learning rate (`min_lr`) below which the learning rate won't drop

In [5]:
cb_reducelr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    mode='min',
    factor=0.1,
    patience=10,
    verbose=1,
    min_lr=0.00001
)
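The arithmetic behind the reduction can be sketched in plain Python. This is a simplified illustration, not the Keras source: the helper `reduced_lr` is hypothetical, and the starting value assumes Adam's default learning rate of 0.001.

```python
def reduced_lr(old_lr, factor=0.1, min_lr=0.00001):
    """Hypothetical helper: one reduction step, never going below min_lr."""
    return max(old_lr * factor, min_lr)


lr = 0.001  # Assumption: Adam's default learning rate
schedule = [lr]
for _ in range(3):
    # Each call stands for one plateau of `patience` epochs
    lr = reduced_lr(lr)
    schedule.append(lr)

print(schedule)
```

With these settings the learning rate hits the floor after two reductions, so the third plateau no longer changes anything.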

### `EarlyStopping`
- If a metric (validation accuracy) doesn't increase by at least some minimum delta (0.001) for a given number of epochs (10), stop the training process


In [6]:
cb_earlystop = tf.keras.callbacks.EarlyStopping(
    monitor='val_accuracy',
    mode='max',
    min_delta=0.001,
    patience=10,
    verbose=1
)
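The interplay of `min_delta` and `patience` is easier to see on a toy example. The sketch below is a simplified model of the idea, not the actual Keras implementation; the function `stopping_epoch` and the accuracy values are made up for illustration.

```python
def stopping_epoch(val_accuracies, min_delta=0.001, patience=10):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified model of the patience logic, not Keras's own code.
    """
    best = float('-inf')
    wait = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc - min_delta > best:
            # Improvement large enough: remember it and reset the counter
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation accuracies: tiny gains after epoch 2 don't count
toy_history = [0.70, 0.75, 0.7505, 0.7504, 0.7506]
print(stopping_epoch(toy_history, patience=3))  # 5
```

Epochs 3 to 5 all stay within 0.001 of the best value (0.75), so with a patience of 3 the run stops at epoch 5.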

### `CSVLogger`
- Captures model training history and dumps it to a CSV file
- Useful for analyzing the performance later

In [7]:
cb_csvlogger = tf.keras.callbacks.CSVLogger(
    filename='training_log.csv',
    separator=',',
    append=False
)
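Once training has finished, the log can be read back to find the best epoch. A minimal sketch with the standard `csv` module follows (`pd.read_csv` would work just as well). The helper `best_logged_epoch` is hypothetical, and it assumes the file has `epoch` and `val_accuracy` columns, which is what the logger should write for the metrics used in this notebook.

```python
import csv
import os


def best_logged_epoch(log_path, metric='val_accuracy'):
    """Hypothetical helper: (epoch, value) of the row with the highest metric."""
    with open(log_path, newline='') as f:
        rows = list(csv.DictReader(f))
    best = max(rows, key=lambda row: float(row[metric]))
    return int(best['epoch']), float(best[metric])


# Only runs once the training cell has produced the log file
if os.path.exists('training_log.csv'):
    print(best_logged_epoch('training_log.csv'))
```

Keep in mind that the logger should number epochs from 0, while the checkpoint file names count from 1.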

<br>

- For simplicity's sake, we'll treat the test set as a validation set
- In real deep learning projects you'll want three sets: training, validation, and test
- We'll tell the model to train for 1000 epochs, but the `EarlyStopping` callback will stop it well before that
- Pass the callbacks to the `fit()` function through the `callbacks` argument

In [8]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(
    loss=tf.keras.losses.binary_crossentropy,
    optimizer=tf.keras.optimizers.Adam(),
    metrics=[tf.keras.metrics.BinaryAccuracy(name='accuracy')]
)

history = model.fit(
    X_train_scaled, 
    y_train, 
    epochs=1000,
    validation_data=(X_test_scaled, y_test),
    callbacks=[cb_checkpoint, cb_reducelr, cb_earlystop, cb_csvlogger]
)

Metal device set to: Apple M1
Epoch 1/1000

Epoch 00001: val_accuracy improved from -inf to 0.76489, saving model to checkpoints/model-01-0.76.hdf5
Epoch 2/1000

Epoch 00002: val_accuracy did not improve from 0.76489
Epoch 3/1000

Epoch 00003: val_accuracy improved from 0.76489 to 0.77030, saving model to checkpoints/model-03-0.77.hdf5
Epoch 4/1000

Epoch 00004: val_accuracy did not improve from 0.77030
Epoch 5/1000

Epoch 00005: val_accuracy did not improve from 0.77030
Epoch 6/1000

Epoch 00006: val_accuracy did not improve from 0.77030
Epoch 7/1000

Epoch 00007: val_accuracy improved from 0.77030 to 0.77804, saving model to checkpoints/model-07-0.78.hdf5
Epoch 8/1000

Epoch 00008: val_accuracy improved from 0.77804 to 0.78036, saving model to checkpoints/model-08-0.78.hdf5
Epoch 9/1000

Epoch 00009: val_accuracy did not improve from 0.78036
Epoch 10/1000

Epoch 00010: val_accuracy did not improve from 0.78036
Epoch 11/1000

Epoch 00011: val_accuracy improved from 0.78036 to 0.78113,
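For a real project, the three-set split mentioned before the training cell could be sketched by calling `train_test_split` twice. The arrays below are synthetic stand-ins so the snippet runs on its own, and the 60/20/20 ratio is just an assumption, not something this notebook prescribes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, only here to make the sketch runnable
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 12))
y_demo = rng.integers(0, 2, size=1000)

# First cut off the test set (20% of all rows)
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
# Then split the remainder into train and validation (25% of 80% = 20%)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_tr), len(X_val), len(X_te))  # 600 200 200
```

The validation set then goes into `validation_data`, and the test set stays untouched until the final evaluation.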

<br>

## Final evaluation
- You can now load the best model. Because only improvements were saved, it will be the checkpoint with the highest epoch number

In [9]:
best_model = tf.keras.models.load_model('checkpoints/model-25-0.80.hdf5')
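Instead of typing the file name by hand, the checkpoint with the highest epoch number can be picked programmatically. This is a minimal sketch assuming the naming scheme from the `ModelCheckpoint` section; `latest_checkpoint` is a made-up helper, and the file names are the ones that appear in the training output above.

```python
import re

# Matches names like model-08-0.78.hdf5 and captures the epoch number
PATTERN = re.compile(r'model-(\d+)-\d+\.\d+\.hdf5$')


def latest_checkpoint(filenames):
    """Hypothetical helper: file name with the highest epoch, or None."""
    top_epoch, top_name = -1, None
    for name in filenames:
        match = PATTERN.search(name)
        if match and int(match.group(1)) > top_epoch:
            top_epoch, top_name = int(match.group(1)), name
    return top_name


names = [
    'model-01-0.76.hdf5',
    'model-03-0.77.hdf5',
    'model-07-0.78.hdf5',
    'model-08-0.78.hdf5',
]
print(latest_checkpoint(names))  # model-08-0.78.hdf5
```

In the notebook you would pass `os.listdir('checkpoints')` instead of the hard-coded list. Comparing the parsed epoch as an integer is deliberate: a plain alphabetical sort would put `model-100` before `model-99`.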

- `predict()` returns probabilities, so threshold them at 0.5 to get the classes
- Older TensorFlow versions offered `predict_classes()` as a shortcut, but it was deprecated and then removed in TensorFlow 2.6

In [10]:
# Convert predicted probabilities to 0/1 class labels
best_model_preds = np.ravel((best_model.predict(X_test_scaled) > 0.5).astype('int32'))
best_model_preds

array([1, 1, 0, ..., 1, 0, 1], dtype=int32)

- Evaluate as you normally would

In [11]:
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, best_model_preds))

0.7981438515081206
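For reference, accuracy is just the share of predictions that match the true labels. The sketch below goes from probabilities to classes to accuracy with plain NumPy; all values are made up and have nothing to do with the wine data.

```python
import numpy as np

# Made-up labels and predicted probabilities, for illustration only
y_true_demo = np.array([1, 0, 1, 1, 0, 1, 0, 0])
probs_demo = np.array([0.91, 0.20, 0.45, 0.77, 0.62, 0.88, 0.10, 0.35])

# Threshold at 0.5, then count the matching predictions
preds_demo = (probs_demo > 0.5).astype('int32')
manual_accuracy = float(np.mean(preds_demo == y_true_demo))

print(manual_accuracy)  # 0.75
```

Six of the eight predictions match, which gives 0.75. `accuracy_score` performs the same calculation on `y_test` and `best_model_preds`.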
