<div><img style="float: right; width: 120px; vertical-align:middle" src="https://www.upm.es/sfs/Rectorado/Gabinete%20del%20Rector/Logos/EU_Informatica/ETSI%20SIST_INFORM_COLOR.png" alt="ETSISI logo" />

# Adaptive Learning Rate<a id="top"></a>

<i><small>Authors: Alberto Díaz Álvarez<br>Last update: 2023-04-09</small></i></div>

***

## Introduction

Although not directly tied to gradient-related problems, the learning factor is a hyperparameter that greatly affects training behavior:

- When it is high, it speeds up movement across the error space in search of promising regions, but makes convergence harder, since large jumps make it difficult to settle into a local minimum.
- When it is low, the opposite happens: it allows us to exploit the region we are in, but makes it difficult (or impossible) to escape from local minima in search of better solutions.

Varying this hyperparameter can help when exploring regions that we suspect to be difficult (e.g. with many local minima). The idea is usually to start training with a high learning factor and decrease it as training progresses. In this way we favor exploration at the beginning, when the algorithm is not yet converging towards a solution, and shift towards exploitation later on, when the algorithm has (supposedly) found a sufficiently promising area.

## Goals

Keras provides implementations of algorithms called "learning rate schedulers". They take the place of a fixed learning factor and vary its value as training progresses.

Throughout the notebook we will use a small helper to visualize how the learning factor evolves.

## Libraries and configuration

Next we will import the libraries that will be used throughout the notebook.

In [None]:
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf

We will also configure some parameters to adjust the graphical presentation.

In [None]:
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams.update({"axes.grid" : False})
plt.rcParams.update({'figure.figsize': (20, 6),'figure.dpi': 64})

***

## Sample model

Let's create a metric that reports the value of the learning factor at each epoch:

In [None]:
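def lr_spy(optimizer):
    """Build a metric that reports the optimizer's current learning factor."""
    def lr(y_true, y_pred):
        # The arguments are ignored; Keras names the metric after this
        # function, so it is recorded under the key 'lr' in the history.
        return optimizer.lr
    return lr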

We can use this metric in the same way as we use the precision, RMSE, etc.

As an example, we will use a small classification problem so that training is much faster, namely the three-input OR gate:

In [None]:
X = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
y = np.array([0, 1, 1, 1, 1, 1, 1, 1])

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(3,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

Let us now look at some implementation options.

### `ExponentialDecay`

This component performs an exponential learning factor decrease. It corresponds to the following formula:

$$
\alpha_i = \alpha_o \cdot \gamma^{\frac{i}{s}}
$$

Here $i$ is the current **batch** (not epoch), $\alpha_i$ the learning factor in the current batch, $\alpha_o$ the initial learning factor, $\gamma$ the decay rate, and $s$ the number of batches after which the learning factor has been multiplied by $\gamma$ once.

The decrement is continuous unless the division in the exponent of $\gamma$ is an integer division, in which case the decrement is stepwise. For example:

In [None]:
lr_exponential_decay = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=5,
    decay_rate=0.95,
    staircase=True,
)

The learning factor starts at 0.1 and is multiplied by 0.95 every 5 batches; moreover, since `staircase` is `True`, the decrease is stepwise instead of continuous.
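
To make the effect of `staircase` concrete, the formula above can be restated in plain Python. This is only an illustrative sketch: the function `exponential_decay` is a hypothetical helper written for this notebook, not the Keras implementation, and its defaults simply mirror the configuration of the example.

```python
import math


def exponential_decay(step, initial_lr=0.1, decay_steps=5, decay_rate=0.95, staircase=True):
    """Plain-Python restatement of alpha_i = alpha_o * gamma ** (i / s)."""
    exponent = step / decay_steps
    if staircase:
        # Integer division: the exponent only changes every `decay_steps` batches.
        exponent = math.floor(exponent)
    return initial_lr * decay_rate ** exponent


# Compare the stepwise and the continuous variants on a few batches.
for step in (0, 4, 5, 10):
    print(step, exponential_decay(step), exponential_decay(step, staircase=False))
```

With `staircase=True` the value stays at 0.1 for batches 0 to 4 and only drops at batch 5, whereas the continuous variant already decreases slightly at every batch.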

Let us see how the learning factor evolves during the training of a model:

In [None]:
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_exponential_decay)
lr_metric = lr_spy(optimizer)

model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=[lr_metric])

history = model.fit(X, y, batch_size=X.shape[0], epochs=500, verbose=0)

We have set the batch size equal to the size of the dataset so that each epoch consists of exactly one batch, but remember that **the iterations defined in the schedule refer to _batches_ and not to _epochs_**.
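
The relationship between batches and epochs is simple arithmetic. The sketch below uses a hypothetical helper, `total_steps`, and assumes that every epoch goes through the whole dataset, with a final partial batch counting as one step.

```python
import math


def total_steps(n_samples, batch_size, epochs):
    """Number of optimizer updates (batches) performed by a training run."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch * epochs


# Our dataset has 8 samples and we train for 500 epochs.
print(total_steps(n_samples=8, batch_size=8, epochs=500))  # one batch per epoch
print(total_steps(n_samples=8, batch_size=2, epochs=500))  # four batches per epoch
```

With a batch size of 2 the scheduler would see four steps per epoch, so a schedule defined in batches would run through its decay four times faster when measured in epochs.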

Graphically, the evolution of the learning factor is as follows:

In [None]:
pd.DataFrame(history.history['lr']).plot()
plt.xlabel('Epoch num.')
plt.show()

### `PiecewiseConstantDecay`

Here the decrease is by means of a step function in which we define exactly what learning factor we want in each range of training _batches_. For example:

In [None]:
lr_piecewise_constant_decay = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[125, 250, 375],
    values=[0.1, 0.075, 0.05, 0.025],
)

The learning factor remains at 0.1 for the first 126 _batches_ (from 0 to 125), at 0.075 for the next 125, at 0.05 for the next 125, and at 0.025 for the rest.
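
The lookup described above can be sketched in plain Python with the standard `bisect` module. The function `piecewise_constant` is a hypothetical helper for illustration only; it assumes, as in the description, that a step equal to a boundary still belongs to the earlier interval.

```python
from bisect import bisect_left


def piecewise_constant(step, boundaries=(125, 250, 375), values=(0.1, 0.075, 0.05, 0.025)):
    """Return the learning factor assigned to the interval containing `step`."""
    # bisect_left counts the boundaries strictly below `step`, so a step equal
    # to a boundary is still mapped to the earlier interval.
    return values[bisect_left(boundaries, step)]


for step in (0, 125, 126, 250, 251, 375, 376):
    print(step, piecewise_constant(step))
```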

Let us see how the learning factor evolves during the training of a model:

In [None]:
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_piecewise_constant_decay)
lr_metric = lr_spy(optimizer)

model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=[lr_metric])

history = model.fit(X, y, batch_size=X.shape[0], epochs=500, verbose=0)

Graphically, the evolution of the learning factor is as follows:

In [None]:
pd.DataFrame(history.history['lr']).plot()
plt.xlabel('Epoch num.')
plt.show()

### `PolynomialDecay`

This component performs a monotonic decrease of the learning factor given an initial and final learning factor, as well as the number of iterations to reach the latter from the former.

It follows the equation:

$$
\alpha_i = (\alpha_o - \alpha_n) \cdot \left(1 - \frac{\min(i, S_\alpha)}{S_\alpha}\right)^p + \alpha_n
$$

Where $i$ is the current batch, $\alpha_i$ the learning factor in the current batch, $\alpha_o$ and $\alpha_n$ the initial and final learning factors, $S_\alpha$ the number of _batches_ needed to reach $\alpha_n$, and $p$ an exponent that determines the degree of the polynomial.

Since $\alpha_n$ may be reached well before training ends, with many iterations still ahead, there is the option of making the descent start again, each time from a value somewhat lower than the previous starting point.

In [None]:
lr_polynomial_decay = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=0.1,
    end_learning_rate=0.001,
    decay_steps=100,
    power=1,
    cycle=True,
)

The learning factor starts at 0.1 and ends at 0.001 after 100 _batches_. The degree of the polynomial is 1, so the decrease will be in a straight line. Moreover, since we have set the `cycle` argument to `True`, the decrease will be repeated, each time from a smaller initial learning factor.
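
The following plain-Python sketch illustrates the polynomial descent and one way the cycling can work. It assumes that, with `cycle=True`, the decay horizon is stretched to the next multiple of `decay_steps` once the previous one has been exceeded; `polynomial_decay` is a hypothetical helper written for this notebook, not the Keras code itself.

```python
import math


def polynomial_decay(step, initial_lr=0.1, end_lr=0.001, decay_steps=100, power=1.0, cycle=True):
    """Polynomial descent from `initial_lr` to `end_lr`, optionally cycling."""
    if cycle:
        # Assumption: the horizon grows to the next multiple of `decay_steps`.
        multiplier = 1 if step == 0 else math.ceil(step / decay_steps)
        horizon = decay_steps * multiplier
    else:
        # Without cycling, the value stays at `end_lr` once it is reached.
        horizon = decay_steps
        step = min(step, decay_steps)
    fraction = step / horizon
    return (initial_lr - end_lr) * (1 - fraction) ** power + end_lr


for step in (0, 50, 100, 101, 200, 201):
    print(step, polynomial_decay(step))
```

Under this assumption, the value jumps back up right after batches 100 and 200, but each time to a lower level than at the start of the previous cycle.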

Let us see how the learning factor evolves during the training of a model:

In [None]:
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_polynomial_decay)
lr_metric = lr_spy(optimizer)

model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=[lr_metric])

history = model.fit(X, y, batch_size=X.shape[0], epochs=500, verbose=0)

Graphically, the evolution of the learning factor is as follows:

In [None]:
pd.DataFrame(history.history['lr']).plot()
plt.xlabel('Epoch num.')
plt.show()

### `InverseTimeDecay`

This strategy applies an inverse-time decay function at each iteration of the optimizer, starting from an initial learning factor. It follows the equation:


$$
\alpha_i = \frac{\alpha_o}{1 + \alpha_r \cdot \frac{i}{\delta i}}
$$

Here $\alpha_i$ is the learning factor in the $i$-th batch, $\alpha_o$ and $\alpha_r$ are its initial value and decay rate, respectively, and $\delta i$ is the number of batches per decay step.

As in other cases, the descent can be made stepwise rather than continuous.

In [None]:
lr_inverse_time_decay = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.1,
    decay_steps=50,
    decay_rate=0.1,
    staircase=True,
)

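The learning factor starts at 0.1, and every 50 _batches_ the denominator of the formula grows by 0.1, so the learning factor is $0.1 / 1.1 \approx 0.091$ after 50 batches, $0.1 / 1.2 \approx 0.083$ after 100, and so on. Also, since `staircase` is `True`, the decrease is stepwise instead of continuous.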
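
The equation can be checked by hand with a short plain-Python restatement. `inverse_time_decay` is a hypothetical helper whose defaults mirror the example configuration; it is a sketch of the formula, not the Keras implementation.

```python
import math


def inverse_time_decay(step, initial_lr=0.1, decay_steps=50, decay_rate=0.1, staircase=True):
    """Plain-Python restatement of alpha_o / (1 + alpha_r * i / delta_i)."""
    ratio = step / decay_steps
    if staircase:
        # Stepwise variant: the ratio only changes every `decay_steps` batches.
        ratio = math.floor(ratio)
    return initial_lr / (1 + decay_rate * ratio)


for step in (0, 49, 50, 100, 500):
    print(step, inverse_time_decay(step))
```

Note that the drops become smaller and smaller: after 500 batches the learning factor has only been halved.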

Let us see how the learning factor evolves during the training of a model:

In [None]:
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_inverse_time_decay)
lr_metric = lr_spy(optimizer)

model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=[lr_metric])

history = model.fit(X, y, batch_size=X.shape[0], epochs=500, verbose=0)

Graphically, the evolution of the learning factor is as follows:

In [None]:
pd.DataFrame(history.history['lr']).plot()
plt.xlabel('Epoch num.')
plt.show()

It is similar to the exponential learning factor decay, although not as pronounced.

### `CosineDecay`

The decrease in this case is achieved by applying a cosine-based function ([SGDR: Stochastic Gradient Descent with Warm Restarts](https://arxiv.org/abs/1608.03983) by Ilya Loshchilov, Frank Hutter). Specifically, the equation of the function is as follows:

$$
\alpha_i = \alpha_o \cdot \frac{1 + \cos\left(\pi \cdot \frac{\min(i, S_\alpha)}{S_\alpha}\right)}{2}
$$

Here $\alpha_i$ is the learning factor in the $i$-th _batch_, $\alpha_o$ its initial value, and $S_\alpha$ the number of _batches_ needed to reach 0. It usually makes sense to set $S_\alpha$ equal to the number of _batches_ our training will perform.

In [None]:
lr_cosine_decay = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1,
    decay_steps=500,
)

The learning factor starts at 0.1, and in 500 batches it reaches the minimum, i.e. 0.
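
A plain-Python sketch of the cosine descent makes the shape easy to verify. It assumes the minimum learning factor is 0, as in this example, and `cosine_decay` is a hypothetical helper rather than the Keras code.

```python
import math


def cosine_decay(step, initial_lr=0.1, decay_steps=500):
    """Half a cosine wave from `initial_lr` down to 0 over `decay_steps` batches."""
    # After `decay_steps` batches the value stays at the minimum.
    step = min(step, decay_steps)
    return initial_lr * 0.5 * (1 + math.cos(math.pi * step / decay_steps))


for step in (0, 125, 250, 375, 500):
    print(step, cosine_decay(step))
```

The descent is slow at the beginning and at the end, and fastest halfway through, where the learning factor is exactly half of its initial value.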

Let us see how the learning factor evolves during the training of a model:

In [None]:
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_cosine_decay)
lr_metric = lr_spy(optimizer)

model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=[lr_metric])

history = model.fit(X, y, batch_size=X.shape[0], epochs=500, verbose=0)

Graphically, the evolution of the learning factor is as follows:

In [None]:
pd.DataFrame(history.history['lr']).plot()
plt.xlabel('Epoch num.')
plt.show()

### `CosineDecayRestarts`

This scheme is very similar to the previous one, but in this case the evolution is repeated by varying the frequency of descent. Example:

In [None]:
lr_cosine_decay_restarts = tf.keras.optimizers.schedules.CosineDecayRestarts(
    initial_learning_rate=0.1,
    first_decay_steps=10,
    t_mul=2.0,
    m_mul=0.9,
)

The learning factor starts at 0.1 and decays to 0 over the first 10 _batches_. At each restart, the length of the descent and the starting learning factor are those of the previous cycle multiplied by 2 (`t_mul`) and by 0.9 (`m_mul`), respectively.

Let us see how the learning factor evolves during the training of a model:

In [None]:
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_cosine_decay_restarts)
lr_metric = lr_spy(optimizer)

model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=[lr_metric])

history = model.fit(X, y, batch_size=X.shape[0], epochs=500, verbose=0)

Graphically, the evolution of the learning factor is as follows:

In [None]:
pd.DataFrame(history.history['lr']).plot()
plt.xlabel('Epoch num.')
plt.show()
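
The restart behavior of `CosineDecayRestarts` can be sketched in plain Python by walking through the cycles one by one. This is an illustrative restatement under the assumption that the minimum learning factor is 0; `cosine_decay_restarts` is a hypothetical helper, and the Keras implementation may compute the same thing in closed form.

```python
import math


def cosine_decay_restarts(step, initial_lr=0.1, first_decay_steps=10, t_mul=2.0, m_mul=0.9):
    """Cosine descent that restarts with longer cycles and lower peaks."""
    period = first_decay_steps  # length of the current cycle, in batches
    peak = initial_lr           # learning factor at the start of the current cycle
    while step >= period:
        # Move on to the next cycle: it is `t_mul` times longer and starts
        # from a peak `m_mul` times lower.
        step -= period
        period *= t_mul
        peak *= m_mul
    return peak * 0.5 * (1 + math.cos(math.pi * step / period))


# With these settings the cycles start at batches 0, 10, 30, 70, 150, ...
for step in (0, 5, 10, 30, 70):
    print(step, cosine_decay_restarts(step))
```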

### Custom learning factor

All the classes we have seen before inherit from the `LearningRateSchedule` class. To create our own adaptive learning factor it would be enough to inherit from this class and implement the `__call__(self, step)` method (well, and the `get_config` and `from_config` methods if we wanted to make it serializable).

However, invoking the power of _duck typing_, it would be enough to provide a class with a `__call__` method (a _functor_) that accepts an `int` (the current iteration) and returns a `float` (the value of the learning factor). Note that the operations we are going to work with are tensor operations.

We are going to create a strategy that generates random values between two configurable limits.

In [None]:
class RandomScheduler(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, lower_bound, upper_bound, name=None):
        self.lower_bound = lower_bound
        self.upper_bound = upper_bound
        self.name = name or 'RandomScheduler'

    def __call__(self, step):
        with tf.name_scope(self.name) as name:
            return tf.random.uniform(
                shape=[],
                minval=self.lower_bound,
                maxval=self.upper_bound,
            )

Yes, it is somewhat absurd and does not take into account the iteration we are in, but it is useful to illustrate how it works. Let's see how it evolves during the training of a model:

In [None]:
optimizer = tf.keras.optimizers.SGD(learning_rate=RandomScheduler(lower_bound=0.001, upper_bound=0.1))
lr_metric = lr_spy(optimizer)

model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=[lr_metric])

history = model.fit(X, y, batch_size=X.shape[0], epochs=500, verbose=0)

Graphically, the evolution of the learning factor is as follows:

In [None]:
pd.DataFrame(history.history['lr']).plot()
plt.xlabel('Epoch num.')
plt.show()
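
As a complement to `RandomScheduler`, the sketch below shows the shape of a schedule that does depend on the current iteration and that also provides the `get_config` and `from_config` methods mentioned earlier. It is a framework-free illustration: `StepHalvingSchedule` and its parameters are hypothetical, and inside Keras `step` would be a tensor, so the arithmetic would have to be written with tensor operations.

```python
class StepHalvingSchedule:
    """Hypothetical schedule: halves the learning factor every `every` steps."""

    def __init__(self, initial_lr, every):
        self.initial_lr = initial_lr
        self.every = every

    def __call__(self, step):
        # Plain Python arithmetic; with Keras, `step` would be a tensor and
        # the same expression would need tensor operations instead.
        return self.initial_lr * 0.5 ** (step // self.every)

    def get_config(self):
        # Everything needed to rebuild the object, as plain Python values.
        return {"initial_lr": self.initial_lr, "every": self.every}

    @classmethod
    def from_config(cls, config):
        return cls(**config)


schedule = StepHalvingSchedule(initial_lr=0.1, every=100)
clone = StepHalvingSchedule.from_config(schedule.get_config())
print(schedule(250), clone(250))
```

The round trip through `get_config` and `from_config` yields an equivalent object, which is the property that makes a schedule serializable.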

## Conclusion

The adaptive learning factor is a useful mechanism for improving the training performance of models based on neural networks. Keras already provides several implementations that adjust the learning factor according to various parameters, and it is also possible to develop a custom implementation adapted to the specific needs of the problem to be solved.

In general, the adaptive learning factor is an essential technique for building neural networks that can be trained effectively on a wide variety of machine learning problems.

***

<div><img style="float: right; width: 120px; vertical-align:top" src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" alt="Creative Commons by-nc-sa logo" />

[Back to top](#top)

</div>