# Adding Regularization with L2 and Dropout

When training accuracy is much higher than validation accuracy, the model is overfitting. Methods that reduce overfitting, and so improve validation accuracy, are called **regularization**. Common regularization techniques include:
- Reducing the network's capacity by removing layers or reducing the number of hidden units
- L2 regularization
- Dropout
- Data augmentation
- Early stopping
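
Of the techniques listed above, early stopping is the easiest to show in isolation. The following is a minimal, framework-free sketch of the usual stopping rule: stop once the validation loss has not improved for a number of consecutive epochs. The function name `early_stopping_epoch` and its `patience` and `min_delta` parameters are hypothetical names chosen for this illustration; they are not part of this notebook.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop, or None.

    val_losses: validation loss recorded after each epoch.
    patience:   number of consecutive non-improving epochs to tolerate.
    min_delta:  minimum decrease that counts as an improvement.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # New best validation loss: reset the patience counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Illustrative (made-up) validation losses: the best value is at epoch 2.
history = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.9]
stop_at = early_stopping_epoch(history, patience=3)
print(stop_at)  # 5: three epochs in a row without beating 0.7
```

In Keras the same idea is available as the `keras.callbacks.EarlyStopping` callback, which requires validation data to be passed to `fit`.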

## 1. Imports and Configuration

In [4]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Enable dynamic GPU memory growth instead of allocating all memory at once.
# Looping over the device list avoids an IndexError on machines without a GPU.
physical_devices = tf.config.list_physical_devices('GPU')
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, True)

## 2. Data Loading and Preprocessing

In [5]:
from tensorflow.keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

## 3. Model Definition

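- BatchNormalization: in addition to speeding up training, batch normalization acts as a **regularizer** and can reduce generalization error. During training it normalizes a layer's output by subtracting the mini-batch mean and dividing by the mini-batch standard deviation, and then applies a learned scale and shift. The regularizing effect is commonly attributed to the noise in these per-batch statistics.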

- L2 regularization: penalizes the squared magnitude of the weights by adding a penalty term to the objective. It is also known as weight decay because it pushes the weights towards zero (but not exactly to zero). With plain gradient descent, this is equivalent to modifying the update so that, at each step, each weight is first shrunk towards zero by a factor proportional to the learning rate and the regularization strength. Only after this step is the usual gradient descent step on the current mini-batch taken.
    - We can add L2 regularization to a layer by passing kernel_regularizer=regularizers.l2(l2) to the layer constructor, where l2 is the regularization strength. This penalizes the layer's kernel weights but not its biases.

- Dropout: a regularization technique that reduces overfitting in neural networks by preventing complex co-adaptations on the training data. It is a very efficient, approximate way of performing model averaging with neural networks. The term "dropout" refers to randomly dropping out units (both hidden and visible) during training.
    - We can add dropout by inserting a Dropout layer into the model and specifying the dropout rate, which is the fraction of the features that are zeroed out; it is usually between 0.2 and 0.5. At test time, no units are dropped. In the original formulation, the layer's outputs are instead scaled down at test time by the keep probability (1 - rate), to compensate for more units being active than at training time. Keras uses the equivalent "inverted" variant: the kept units are scaled up by 1 / (1 - rate) during training, so no rescaling is needed at test time.
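
The link between the L2 penalty and weight decay described above can be checked numerically. The sketch below is pure Python and assumes the penalty has the form `l2 * sum(w ** 2)`, which is how the Keras documentation defines `regularizers.l2`. The helper names (`l2_penalty`, `sgd_step_with_l2`, `sgd_step_with_decay`) and the numbers are hypothetical, chosen only for illustration.

```python
def l2_penalty(weights, l2):
    """Penalty added to the loss: l2 * sum of squared weights."""
    return l2 * sum(w * w for w in weights)


def sgd_step_with_l2(weights, grads, lr, l2):
    """One SGD step on (data loss + L2 penalty).

    The gradient of l2 * w**2 with respect to w is 2 * l2 * w.
    """
    return [w - lr * (g + 2.0 * l2 * w) for w, g in zip(weights, grads)]


def sgd_step_with_decay(weights, grads, lr, l2):
    """Same update written as 'decay first, then the usual gradient step'."""
    decayed = [w * (1.0 - 2.0 * lr * l2) for w in weights]
    return [w - lr * g for w, g in zip(decayed, grads)]


weights = [0.5, -1.0, 2.0]
grads = [0.1, -0.2, 0.3]  # made-up gradients of the data loss
lr, l2 = 0.1, 0.01

a = sgd_step_with_l2(weights, grads, lr, l2)
b = sgd_step_with_decay(weights, grads, lr, l2)
# The two formulations agree up to floating-point rounding.
print(all(abs(x - y) < 1e-12 for x, y in zip(a, b)))
```

This equivalence holds for plain SGD. With adaptive optimizers such as Adam, an L2 penalty in the loss and a decoupled weight-decay step are no longer the same update.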

In [6]:
def my_model():
    inputs = keras.Input(shape=(32, 32, 3))
    x = layers.Conv2D(32, 3, padding="same", kernel_regularizer=regularizers.l2(0.01),)(
        inputs
    )
    x = layers.BatchNormalization()(x)
    x = keras.activations.relu(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, padding="same", kernel_regularizer=regularizers.l2(0.01),)(
        x
    )
    x = layers.BatchNormalization()(x)
    x = keras.activations.relu(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(
        128, 3, padding="same", kernel_regularizer=regularizers.l2(0.01),
    )(x)
    x = layers.BatchNormalization()(x)
    x = keras.activations.relu(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu", kernel_regularizer=regularizers.l2(0.01),)(
        x
    )
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(10)(x)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model


model = my_model()
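
To make the behaviour of the `Dropout(0.5)` layer in the model above more concrete, here is a minimal pure-Python sketch of inverted dropout, the variant in which kept units are scaled up during training and nothing changes at inference. The function name `dropout` and its `rng` parameter are hypothetical; this is an illustration of the technique, not the Keras implementation.

```python
import random


def dropout(values, rate, training, rng=random):
    """Inverted dropout on a list of activations.

    rate:     fraction of units to zero out during training.
    training: if False, the input is returned unchanged.
    rng:      any object with a random() method returning a float in [0, 1).
    """
    if not training or rate == 0.0:
        return list(values)
    keep_prob = 1.0 - rate
    # Keep each unit with probability keep_prob and scale it by 1 / keep_prob
    # so that the expected value of each activation is unchanged.
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)  # seeded for reproducibility
activations = [1.0] * 10000

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
test_out = dropout(activations, rate=0.5, training=False)

print(sorted(set(train_out)))           # [0.0, 2.0]
print(sum(train_out) / len(train_out))  # close to 1.0 on average
print(test_out == activations)          # True
```

Because the scaling happens during training, the layer is the identity at inference time, which is why `model.evaluate` and `model.predict` need no extra correction.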

## 4. Compile Model

In [7]:
model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(learning_rate=3e-4),
    metrics=["accuracy"],
)
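
The loss above is configured with `from_logits=True` because the final `Dense(10)` layer has no softmax, so the model outputs raw scores (logits). The sketch below shows, in pure Python, what a sparse categorical cross-entropy computed from logits amounts to for a single example. The function name `sparse_crossentropy_from_logits` is hypothetical, and this is a simplified illustration rather than the Keras implementation.

```python
import math


def sparse_crossentropy_from_logits(logits, label):
    """Cross-entropy for one example, given raw scores and an integer label.

    Equivalent to -log(softmax(logits)[label]), computed with the
    log-sum-exp trick (subtracting the max) for numerical stability.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# With 10 classes and all-equal logits, the prediction is uniform,
# so the loss is ln(10), about 2.3026.
uniform_loss = sparse_crossentropy_from_logits([0.0] * 10, label=3)
print(round(uniform_loss, 4))  # 2.3026

# A confident, correct prediction gives a loss close to zero.
confident_loss = sparse_crossentropy_from_logits([1000.0, 0.0], label=0)
print(confident_loss)  # 0.0
```

The value ln(10) is a useful reference point for CIFAR-10: it is the cross-entropy of a model that guesses uniformly across the 10 classes.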

## 5. Model Training and Evaluation

Because of the dropout layer, this model takes longer to converge than the previous one, so we train for more epochs.

In [8]:
print("Training model...")
model.fit(x_train, y_train, batch_size=64, epochs=150, verbose=2)

print("\nEvaluating model...")
results = model.evaluate(x_test, y_test, batch_size=64, verbose=0)
print(f"Test loss: {results[0]:.4f}")
print(f"Test accuracy: {results[1]:.4f}")

Training model...
Epoch 1/150
782/782 - 28s - loss: 2.9886 - accuracy: 0.3169
Epoch 2/150
782/782 - 3s - loss: 1.8865 - accuracy: 0.4431
Epoch 3/150
782/782 - 3s - loss: 1.6404 - accuracy: 0.4859
Epoch 4/150
782/782 - 3s - loss: 1.5267 - accuracy: 0.5167
Epoch 5/150
782/782 - 3s - loss: 1.4632 - accuracy: 0.5357
Epoch 6/150
782/782 - 3s - loss: 1.4252 - accuracy: 0.5486
Epoch 7/150
782/782 - 3s - loss: 1.3959 - accuracy: 0.5587
Epoch 8/150
782/782 - 3s - loss: 1.3676 - accuracy: 0.5734
Epoch 9/150
782/782 - 3s - loss: 1.3465 - accuracy: 0.5777
Epoch 10/150
782/782 - 3s - loss: 1.3303 - accuracy: 0.5796
Epoch 11/150
782/782 - 3s - loss: 1.3101 - accuracy: 0.5879
Epoch 12/150
782/782 - 3s - loss: 1.2999 - accuracy: 0.5936
Epoch 13/150
782/782 - 3s - loss: 1.2931 - accuracy: 0.5978
Epoch 14/150
782/782 - 3s - loss: 1.2708 - accuracy: 0.6072
Epoch 15/150
782/782 - 3s - loss: 1.2609 - accuracy: 0.6109
Epoch 16/150
782/782 - 3s - loss: 1.2516 - accuracy: 0.6153
Epoch 17/150
782/782 - 3s - lo