# Exploring Keras 
Based on Aurélien Géron's [Chapter 10 Notebook](https://github.com/ageron/handson-ml3/blob/main/10_neural_nets_with_keras.ipynb).

## Objectives
- Continue the basics of neural networks with Keras
- Build and train a simple regression model
- Explore some more features of the Keras/TensorFlow ecosystem

You may want to run this on Colab, but the dataset isn't particularly large, so running locally should work as well.

In [None]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import pandas as pd

## A basic model: predicting house prices
And you thought you were done with the California housing dataset!

In [None]:
# extra code – load and split the California housing dataset, like earlier
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target, random_state=42)
# split it again to get a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, random_state=42)

print("Training instances: ", y_train.shape)
print("Validation instances: ", y_val.shape)
print("Testing instances: ", y_test.shape)

This time, let's use the approach of a constant number of neurons per layer and see how things go.

The [Adam optimizer](https://arxiv.org/abs/1412.6980) is a very popular adaptive learning rate method that tracks running estimates of both the first and second moments of the gradients (their mean and uncentered variance). Each step is scaled by the first moment divided by the square root of the second, so the steps shrink when the gradients are noisy or frequently change sign, and approach the full learning rate when they are consistent from step to step.
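To make the two moments concrete, here is a minimal pure-Python sketch of the Adam update on a one-dimensional toy problem. The function name `adam_minimize` and the toy quadratic are made up for illustration; the default `beta1`, `beta2` and `epsilon` values are assumed to match the ones Keras uses, and the real optimizer works on tensors rather than single floats.

```python
import math

def adam_minimize(grad_fn, x0, learning_rate=1e-3, beta1=0.9, beta2=0.999,
                  epsilon=1e-7, steps=1000):
    """Minimize a 1-D function with Adam, given a function returning its gradient."""
    x = x0
    m = 0.0  # running mean of the gradients (first moment)
    v = 0.0  # running mean of the squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # correct the bias caused by starting m and v at zero
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # the step is the learning rate scaled by mean / sqrt(second moment)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return x

# toy problem (an assumption for illustration): f(x) = (x - 3)^2, gradient 2 * (x - 3)
x_min = adam_minimize(lambda x: 2 * (x - 3), x0=0.0, learning_rate=0.1, steps=500)
print(x_min)
```

When every gradient has the same sign and similar size, `m_hat / sqrt(v_hat)` is close to 1 in magnitude, so the step is roughly the learning rate itself; when the gradients keep flipping sign, `m_hat` averages towards zero and the step shrinks.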

We'll also use a `Normalization` layer to scale the input features (the `StandardScaler` from scikit-learn would also work).
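As a rough sketch of what the scaling step does, the snippet below standardizes a single feature with plain Python, assuming the layer computes `(x - mean) / std` per feature. The helper names and the numbers are made up; the point is that the statistics come from the training data and are then reused unchanged on the validation data.

```python
import statistics

def fit_scaler(train_column):
    """Return the mean and (population) standard deviation of one training feature."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)
    return mean, std

def scale(column, mean, std):
    """Standardize a feature using previously computed statistics."""
    return [(x - mean) / std for x in column]

# made-up values for a single feature
train_feature = [2.0, 4.0, 6.0, 8.0]
val_feature = [5.0, 10.0]

mean, std = fit_scaler(train_feature)            # statistics from training data only
scaled_train = scale(train_feature, mean, std)   # mean 0, std 1 by construction
scaled_val = scale(val_feature, mean, std)       # same statistics, not recomputed
print(scaled_train, scaled_val)
```

The scaled validation feature will generally not have exactly zero mean or unit variance, and that is expected.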

### Exercise 1
Modify the cell below to have 3 fully connected layers with 50 neurons each.

In [None]:
from tensorflow.keras import layers # just to make the names less unwieldy

tf.random.set_seed(42)

model = tf.keras.Sequential([
    # input shape does not include batch size
    layers.Input(X_train.shape[1:]), 
    layers.Normalization(name="norm"),
    # TODO: Add 3x fully connected layers with 50 neurons each and relu activation
    # output layer: a single neuron with no activation, since this is regression
    layers.Dense(1)
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])

# the adapt method computes the mean and standard deviation of the input features
# Note that only the training data is used to compute the mean and standard deviation!
model.get_layer("norm").adapt(X_train)
model.summary()

In [None]:
# train and plot the training curves
history = model.fit(X_train, y_train, epochs=100, validation_batch_size=len(X_val),
                    validation_data=(X_val, y_val))

pd.DataFrame(history.history).plot(
    figsize=(8, 5), grid=True, xlabel="Epoch",
    style=["r--", "r--.", "b-", "b-*"])

The validation behaviour looks awfully weird. What might be happening?

Take a peek at the [`fit`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) method docs to try to understand the training process (and hyperparameters) better. In particular, pay attention to the **batch size**.
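As a starting point for thinking about batch size, this small sketch counts how many batches make up one epoch. It assumes the 20,640-instance California housing dataset, the default 25% test fraction of `train_test_split` for both splits, and `fit`'s default `batch_size` of 32 for array inputs; the helper `steps_per_epoch` is made up for illustration.

```python
import math

def steps_per_epoch(n_samples, batch_size=32):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

n_total = 20640                                      # instances in the dataset
n_train_full = n_total - math.ceil(0.25 * n_total)   # first split: 25% held out for testing
n_val = math.ceil(0.25 * n_train_full)               # second split: 25% held out for validation
n_train = n_train_full - n_val

print("training instances:", n_train)
print("gradient steps per epoch:", steps_per_epoch(n_train))
print("validation batches with validation_batch_size=len(X_val):",
      steps_per_epoch(n_val, batch_size=n_val))
```

Each training epoch is therefore made up of several hundred small noisy updates, while the validation set is evaluated in a single pass at the end of the epoch.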

In [None]:
# Try looking at the whole validation dataset
mse_val, rmse_val = model.evaluate(X_val, y_val)

Is that a good RMSE value? Let's try a scatter plot to see how we did.
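For a sense of scale, remember that the target is expressed in units of $100,000, so the RMSE is too. The sketch below computes an RMSE by hand on made-up numbers (not model output) and converts it to dollars.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# made-up prices and predictions, in units of $100,000 like the dataset's target
y_true = [1.5, 2.0, 3.5, 0.9]
y_pred = [1.0, 2.5, 3.0, 1.4]

error = rmse(y_true, y_pred)
print(f"RMSE: {error:.2f} -> about ${error * 100_000:,.0f}")
```

Whether an error of that size is acceptable depends on the typical house price, which is one reason the scatter plot is a useful complement.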

In [None]:
plt.scatter(y_val, model.predict(X_val), alpha=0.1)
plt.xlabel("Actual House price ($100,000)")
plt.ylabel("Predicted House price ($100,000)")
# Probably better than our random forest model, but still not great

## Building Complex Models Using the Functional API

Not all neural network models are simply sequential. Some may have complex topologies. Some may have multiple inputs and/or multiple outputs. For example, a Wide & Deep neural network (see [paper](https://research.google/pubs/wide-deep-learning-for-recommender-systems/)) connects all or part of the inputs directly to the output layer.

In [None]:
# extra code – reset the name counters and make the code reproducible
tf.keras.backend.clear_session()
tf.random.set_seed(42)

### Exercise 2
Using the following code as a starting point, build a model with the following architecture:

- Input layer, same as before but we need to be more explicit about it (i.e. specify the shape)
- Normalization (same as before)
- 2x Dense layers with only 30 neurons each this time and relu activation
- Concatenation of the (normalized) input and the output of the second Dense layer - also called a "skip connection"
- Our output layer, same as before

In [None]:
input_ = layers.Input(shape=X_train.shape[1:])
normalized = layers.Normalization(name="norm")(input_)
# TODO: 2x hidden layers
concat = layers.Concatenate()([normalized, hidden_output])
output = layers.Dense(1)(concat)

model = tf.keras.Model(inputs=[input_], outputs=[output])

In [None]:
tf.keras.utils.plot_model(model, show_shapes=True)

In [None]:
model.summary()

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])

# don't forget to calculate the mean/std of your normalization layer
model.get_layer("norm").adapt(X_train)
history = model.fit(X_train, y_train, epochs=20, validation_data=(X_val, y_val))
# evaluate returns the loss followed by each compiled metric
mse_val, rmse_val = model.evaluate(X_val, y_val)
y_pred = model.predict(X_val)

In [None]:
mse_val, rmse_val = model.evaluate(X_val, y_val)
# plot the training curves
pd.DataFrame(history.history).plot(
    figsize=(8, 5), grid=True, xlabel="Epoch",
    style=["r--", "r--.", "b-", "b-*"])

The RMSE is a bit worse than before, but we have far fewer parameters. The functional API also allows for a lot more flexibility than we've used here; the [original notebook](https://github.com/ageron/handson-ml3/blob/main/10_neural_nets_with_keras.ipynb) and the associated text in chapter 10 go into a lot more detail.
Finally, you can also define a model by subclassing the `Model` class and defining your own `call` method to create a more dynamic model. This is also the PyTorch way of doing things.
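To put a number on the parameter comparison, here is a quick back-of-the-envelope count of the trainable parameters in the two architectures described in Exercises 1 and 2. It assumes the 8 input features of the California housing dataset and ignores the non-trainable statistics stored in the `Normalization` layer; `dense_params` is a made-up helper.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

n_features = 8  # the California housing dataset has 8 input features

# Exercise 1: 3 hidden layers of 50 neurons, then the single output neuron
sequential = (dense_params(n_features, 50) + dense_params(50, 50)
              + dense_params(50, 50) + dense_params(50, 1))

# Exercise 2: 2 hidden layers of 30 neurons; the output layer sees the
# 30 hidden activations concatenated with the 8 normalized inputs
wide_and_deep = (dense_params(n_features, 30) + dense_params(30, 30)
                 + dense_params(30 + n_features, 1))

print("sequential:", sequential)
print("wide and deep:", wide_and_deep)
```

You can check these totals against the trainable parameter counts reported by `model.summary()`.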

## Saving and Restoring a Model
After spending all this time training a model, you'll probably want to save it (architecture and weights) so you can use it later. You can also define a **custom callback** to save the model periodically during training, whether to guard against a crash or timeout or to keep the best intermediate result.

In [None]:
model.save("model.keras")

# To load it again:
# model = tf.keras.models.load_model("model.keras")

### Exercise 3
Here we'll define two simple callbacks (built into Keras): one for early stopping and one for saving the model at the end of each epoch. The early stopping callback will stop the training if the validation loss has not improved for a certain number of epochs (the `patience`).
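The patience rule can be sketched in a few lines of plain Python. This is a simplified illustration rather than the Keras implementation (it ignores options such as a minimum improvement threshold), and the function name and validation losses are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once `patience` epochs in a row fail to improve
    on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # ran out of epochs before the patience was exhausted
    return len(val_losses) - 1, best_epoch

# made-up validation losses, one per epoch
val_losses = [0.90, 0.60, 0.50, 0.52, 0.51, 0.55, 0.49, 0.48]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

With a patience of 3, this run stops before reaching the later, slightly better epochs, which illustrates the trade-off: a small patience saves time but can stop during a temporary plateau. Restoring the best weights corresponds to rolling back to `best_epoch`.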

We can also define custom callbacks - again the original notebook goes into a lot more detail.

In [None]:
tf.keras.backend.clear_session()
tf.random.set_seed(42)

# make a copy of the model, with the same architecture, but randomly initialized weights
model = tf.keras.models.clone_model(model)
# cloning also resets the normalization statistics, so adapt the layer again
model.get_layer("norm").adapt(X_train)
model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), metrics=["RootMeanSquaredError"])

early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
# TODO: define a checkpoint_cb using tf.keras.callbacks.ModelCheckpoint
history = model.fit(X_train, y_train, epochs=30, validation_data=(X_val, y_val), callbacks=[checkpoint_cb, early_stopping_cb])


In [None]:
# Plot the model performance
pd.DataFrame(history.history).plot(
    figsize=(8, 5), grid=True, xlabel="Epoch",
    style=["r--", "r--.", "b-", "b-*"])

## Hyperparameter Tuning
Let's use the Keras Tuner to do a quick hyperparameter search on the housing price prediction problem. For this to work, we need to wrap our model creation into a function that takes an `hp` argument (for hyperparameters) and returns a model.

We'll go back to the sequential model for simplicity - it was actually working the best anyway.
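Conceptually, a random search just samples hyperparameter combinations, scores each one, and keeps the best. The sketch below shows that loop in plain Python; it is not how Keras Tuner is implemented, and `toy_objective` is a made-up stand-in for "build the model, train it, and return the validation loss". The search ranges are likewise arbitrary.

```python
import random

def random_search(objective, space, max_trials, seed=42):
    """Sample integer hyperparameters uniformly and keep the lowest-scoring set."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(max_trials):
        # draw one value per hyperparameter from its (low, high) range, inclusive
        params = {name: rng.randint(low, high) for name, (low, high) in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_objective(params):
    """Hypothetical stand-in for training a model and returning val_loss."""
    return (params["n_hidden"] - 3) ** 2 + (params["n_neurons"] - 50) ** 2 / 100

space = {"n_hidden": (0, 8), "n_neurons": (16, 256)}
best_params, best_score = random_search(toy_objective, space, max_trials=5)
print(best_params, best_score)
```

With only a handful of trials, the result depends heavily on the seed, which is worth keeping in mind when interpreting a search with `max_trials=5`.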

### Exercise 4
In the code below, modify `n_neurons` in the `build_model` function to be a tunable hyperparameter. Then, run the random search to find the best hyperparameters for the California housing dataset.

In [None]:
import keras_tuner as kt

def build_model(hp):
    # Original model had 3 hidden layers with 50 neurons each
    n_hidden = hp.Int("n_hidden", min_value=0, max_value=8, default=2)
    n_neurons = 50
    #TODO: Modify n_neurons to be a tunable hyperparameter
    model = tf.keras.Sequential()
    model.add(layers.Input(shape=X_train.shape[1:]))
    model.add(layers.Normalization(name="norm"))
    for _ in range(n_hidden):
        model.add(layers.Dense(n_neurons, activation="relu"))
    model.add(layers.Dense(1))

    # adapt the normalization layer as usual
    model.get_layer("norm").adapt(X_train)

    # add on the optimizer and compile
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
    model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])
    return model

In [None]:
random_search_tuner = kt.RandomSearch(
    build_model,
    objective="val_loss",
    max_trials=5,
    overwrite=True,
    directory="random_search",
    project_name="california_housing",
    seed=42)

random_search_tuner.search(X_train, y_train, epochs=20, validation_data=(X_val, y_val))

In [None]:
random_search_tuner.get_best_models()[0].summary()

## Image Augmentation
For this exercise, we'll need a new dataset. Let's try the CGI `rock_paper_scissors` dataset, where the hands are all shown at pretty much the same angle.

In [None]:
import tensorflow_datasets as tfds
# let's try this rock_paper_scissors dataset
rps_train, info = tfds.load('rock_paper_scissors', split='train', shuffle_files=True, with_info=True)
# info was already returned with the training split; no need to shuffle the validation files
rps_val = tfds.load('rock_paper_scissors', split='test')
print(info)

# show the first 6 training images with their labels
for i, sample in enumerate(rps_train.take(6), start=1):
    ax = plt.subplot(2, 3, i)
    img = sample["image"] / 255
    label = sample["label"]
    ax.imshow(img)
    ax.set_title(info.features["label"].int2str(label))


### Exercise 5
Modify the following two cells to include random rotations in the augmentation layers, then add augmentation to your model.
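One thing to keep in mind while doing this: augmentation is only meant to be random during training, and should pass images through untouched at inference time. The toy function below mimics that behaviour on a tiny nested-list "image"; it is an illustration of the idea with made-up names, not how the Keras layers are implemented.

```python
import random

def random_horizontal_flip(image, training, rng):
    """Flip the image left-to-right half of the time, but only while training."""
    if training and rng.random() < 0.5:
        return [row[::-1] for row in image]
    return image

# a tiny 2x3 "image" with made-up pixel values
image = [[1, 2, 3],
         [4, 5, 6]]
flipped = [row[::-1] for row in image]
rng = random.Random(0)

# at inference the function is a pass-through
inference_outputs = [random_horizontal_flip(image, training=False, rng=rng) for _ in range(10)]
# during training each call returns either the original or the flipped image
training_outputs = [random_horizontal_flip(image, training=True, rng=rng) for _ in range(10)]
print(training_outputs)
```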

In [None]:
# Build the preprocessing and augmentation layers
# The downsampling and rescaling layers will remain active during inference, 
# but the augmentation (random flip, rotation, zoom, crop, etc) will be switched off.
preprocess = tf.keras.Sequential([
    layers.Input(shape=(300, 300, 3)),
    layers.Resizing(112, 112),
    layers.Rescaling(1./255),
])

augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    #TODO: Try a random rotation
])

# Example of applying the pipeline to the last image shown above
# rerun this several times to see how the randomization changes
batch = tf.expand_dims(sample["image"], axis=0)
augmented = augment(preprocess(batch))
augmented = tf.squeeze(augmented, axis=0)
# the whole batch/augment/squeeze thing is all about making sure the dimensions match

plt.imshow(augmented)
plt.title("Augmented Image")
plt.show()


In [None]:
# build a simple model
model = tf.keras.Sequential([
    preprocess,
    #TODO: Add your augmentation here
    layers.Conv2D(16, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(3, activation="softmax")
])

model.summary()

In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

In [None]:
# tfds yields each sample as a dictionary by default, but model.fit needs (input, target) tuples;
# expand_dims adds a leading batch dimension, so each sample becomes a batch of one
def wrangle_tfds(sample):
    image = tf.expand_dims(sample["image"], axis=0)
    label = tf.expand_dims(tf.one_hot(sample["label"], depth=3), axis=0)
    return image, label

epochs=5
history = model.fit(
    rps_train.map(wrangle_tfds),
    validation_data=rps_val.map(wrangle_tfds),
    epochs=epochs
)

In [None]:
# evaluate the validation accuracy with flips/rotations applied
def augment_and_wrangle(sample):
    img, label = wrangle_tfds(sample)
    # training=True keeps the random layers active outside of model.fit
    return augment(img, training=True), label

# Augment the validation dataset
augmented_val_ds = rps_val.map(augment_and_wrangle)

# Evaluate the model on the augmented validation dataset
# (evaluate returns the loss followed by each compiled metric)
loss, accuracy = model.evaluate(augmented_val_ds)
print(f"Accuracy on augmented validation dataset: {accuracy}")