# **Optimization for Training Deep Models**

Deep learning algorithms involve optimization in many contexts. For example, performing inference in models such as PCA requires solving an optimization problem. Analytical optimization is often used to write proofs or design algorithms. Among all optimization problems in deep learning, the most difficult is training neural networks.

It is common to spend days or even months on hundreds of machines to solve a single neural network training problem. Because this task is both important and expensive, a specialized set of optimization techniques has been developed for it. This chapter presents those techniques for neural network training.

We focus on one key optimization goal: finding the parameters $\theta$ of a neural network that significantly reduce a cost function $J(\theta)$, which typically includes a performance measure on the entire training set as well as regularization terms.

# **How Learning Differs from Pure Optimization**

In most machine learning scenarios, we care about a performance measure $P$ defined on the test set. Because $P$ depends on the unknown test distribution, we optimize it only indirectly. Instead, we minimize a cost function $J(\theta)$, where $\theta$ denotes the vector of trainable model parameters (e.g., weights and biases). Improving $J(\theta)$ is intended to improve $P$, even though $P$ itself may be non-differentiable or hard to optimize directly. This contrasts with *pure optimization*, where minimizing $J$ is itself the goal.

Optimization algorithms for deep learning also exploit the structure of machine learning objectives, which can typically be written as an expectation over the **empirical distribution** $\hat{P}_{data}$ defined by the training set $\{(x^{(i)},y^{(i)})\}_{i=1}^m$, where $m$ is the number of training examples. Each pair $(x^{(i)},y^{(i)})$ is a sample consisting of an input $x^{(i)}$ and its corresponding target $y^{(i)}$. Using this notation, the empirical cost function is

$$
J(\theta) = E_{(x,y)\sim \hat{P}_{data}}L(f(x;\theta), y),
$$

where $f(x;\theta)$ is the model’s prediction for input $x$, and $L(f(x;\theta),y)$ is the loss function.

Ideally, we would minimize the **true risk**, whose expectation is taken with respect to the unknown **data-generating distribution** $P_{data}$ rather than the empirical distribution:

$$
J^*(\theta) = E_{(x,y)\sim P_{data}} L(f(x;\theta), y).
$$

Because $P_{data}$ is unknown, $J^*(\theta)$ cannot be computed exactly, motivating the need for learning rather than pure optimization.

## **Empirical Risk Minimization**

The goal of a learning algorithm is to reduce the expected generalization error, i.e., the true risk $J^*(\theta)$, defined over the actual data-generating distribution $P_{data}$. If $P_{data}(x,y)$ were known, minimizing $J^*(\theta)$ would reduce to a standard optimization problem. However, because we only observe a finite dataset, we instead approximate $P_{data}$ with the empirical distribution $\hat{P}_{data}$, which assigns probability $\frac{1}{m}$ to each training example.

This yields the **empirical risk**:

$$
E_{(x,y)\sim \hat{P}_{data}} L(f(x;\theta), y)
= \frac{1}{m}\sum_{i=1}^m L(f(x^{(i)};\theta), y^{(i)}).
$$

Minimizing this quantity is known as **empirical risk minimization (ERM)**.
However, ERM has two drawbacks:

* High-capacity models may overfit by memorizing training examples.
* Some desirable loss functions, such as the 0–1 loss, are non-differentiable, preventing efficient optimization via gradients.

Thus, deep learning rarely applies ERM directly to the loss of interest. Instead, we optimize a differentiable, tractable surrogate.
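The difference between the loss of interest and a surrogate can be seen on a toy example. The sketch below is illustrative only: the scores, the labels, and the function names are hypothetical, and the logistic (negative log-likelihood) loss is just one possible surrogate. It computes the empirical risk as the plain average of a per-example loss, once with the 0–1 loss and once with the surrogate.

```python
import numpy as np

def zero_one_loss(scores, y):
    """0-1 loss per example: 1 if the thresholded prediction is wrong, else 0."""
    predictions = (scores > 0).astype(int)
    return (predictions != y).astype(float)

def logistic_loss(scores, y):
    """Negative log-likelihood of labels y in {0, 1} under a sigmoid model."""
    # log(1 + exp(-s)) when y == 1 and log(1 + exp(s)) when y == 0, computed stably.
    signed = np.where(y == 1, scores, -scores)
    return np.logaddexp(0.0, -signed)

def empirical_risk(loss_fn, scores, y):
    """Average of the per-example loss over the m training examples."""
    return loss_fn(scores, y).mean()

# Hypothetical toy data: model scores f(x; theta) and binary targets.
scores = np.array([2.0, -1.0, 0.5, -0.3])
y = np.array([1, 0, 0, 1])

risk_01 = empirical_risk(zero_one_loss, scores, y)
risk_log = empirical_risk(logistic_loss, scores, y)

# A small change in the scores leaves the 0-1 risk unchanged (no gradient signal),
# while the surrogate risk responds to it.
risk_01_shifted = empirical_risk(zero_one_loss, scores + 0.01, y)
risk_log_shifted = empirical_risk(logistic_loss, scores + 0.01, y)
print(risk_01, risk_01_shifted, risk_log, risk_log_shifted)
```

The 0–1 risk only changes when a score crosses the decision threshold, which is why gradient-based training uses the surrogate instead.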

## **Surrogate Loss Functions and Early Stopping**

Many performance measures (such as classification error) involve non-differentiable losses that cannot be optimized efficiently. For example, the 0–1 loss is flat almost everywhere and provides no useful gradient information. To address this, we replace the true loss with a **surrogate loss function**—a smooth, differentiable objective function whose minimization correlates with better true performance.

To further reduce overfitting, we frequently use **early stopping**: tracking performance on a validation set and stopping training when validation error begins to increase.
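A minimal sketch of the early stopping rule follows. It assumes a `patience` parameter (the number of consecutive epochs without improvement that are tolerated), which is a common refinement and not something specified above. The function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses that start rising after epoch 4.
losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=2)
print(stop_epoch, best_epoch)
```

In practice the parameters saved at `best_epoch` are the ones that are kept.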

## **Batch and Minibatch Algorithms**

Machine learning objective functions typically decompose into sums over training examples. For probabilistic models, the maximum likelihood estimator $\theta_{ML}$ is defined as

$$
\theta_{ML} = \arg\max_{\theta} \sum_{i=1}^m \log P_{model}(x^{(i)}, y^{(i)};\theta),
$$

where $P_{model}(x,y;\theta)$ is the probability assigned by the model to the pair $(x,y)$ under parameters $\theta$, and $\log P_{model}(x,y;\theta)$ is its log-likelihood.

This optimization is equivalent to maximizing the empirical expectation:

$$
J(\theta) = E_{x,y\sim \hat{P}_{data}} \log P_{model}(x,y;\theta).
$$

The gradient, which is the quantity used by gradient-based algorithms, is

$$
\nabla_{\theta}J(\theta) =
E_{x,y\sim \hat{P}_{data}}
\left[\nabla_\theta \log P_{model}(x,y;\theta)\right].
$$

Computing this expectation exactly requires evaluating the model on all $m$ training examples—often prohibitively expensive. Instead, we approximate the expectation by sampling a small **minibatch** $B$ of examples and computing an average:

$$
\nabla_{\theta}J(\theta) \approx
\frac{1}{|B|} \sum_{(x,y)\in B}
\nabla_\theta \log P_{model}(x,y;\theta).
$$

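Written as the minimization of a per-example loss $L$ (for maximum likelihood, $L = -\log P_{model}$), this leads to the stochastic gradient descent (SGD) update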

$$
\theta \leftarrow \theta - \eta \frac{1}{|B|}\sum_{(x,y)\in B}
\nabla_\theta L(f(x;\theta),y),
$$

where $\eta$ is the learning rate.
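The following sketch illustrates the minibatch approximation on a hypothetical linear-regression problem with squared-error loss. The data, the batch size of 32, and the helper name `grad_squared_error` are all assumptions made for the example. It compares one minibatch gradient with the full gradient, checks that the average of many minibatch gradients approaches the full gradient, and applies one update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-regression data: y = X @ w_true + noise.
m, d = 1000, 3
X = rng.normal(size=(m, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=m)

def grad_squared_error(theta, X_batch, y_batch):
    """Gradient of the mean of 0.5 * (x @ theta - y)**2 over the batch."""
    residual = X_batch @ theta - y_batch
    return X_batch.T @ residual / len(y_batch)

theta = np.zeros(d)
full_grad = grad_squared_error(theta, X, y)

# One minibatch B of 32 examples drawn uniformly without replacement.
batch = rng.choice(m, size=32, replace=False)
minibatch_grad = grad_squared_error(theta, X[batch], y[batch])

# Averaging many independent minibatch gradients approaches the full gradient.
estimates = [
    grad_squared_error(theta, X[idx], y[idx])
    for idx in (rng.choice(m, size=32, replace=False) for _ in range(2000))
]
mean_estimate = np.mean(estimates, axis=0)

# One SGD step with learning rate eta.
eta = 0.1
theta_next = theta - eta * minibatch_grad
print(full_grad, minibatch_grad, mean_estimate)
```

A single minibatch gradient costs 32 example evaluations instead of 1000, at the price of some noise in the direction.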

## **Challenges in Neural Network Optimization**

Neural network optimization is difficult due to several structural challenges:

* **Ill-Conditioning:** When the Hessian of $J(\theta)$ has a large condition number, that is, a large ratio of its largest eigenvalue $\lambda_{max}$ to its smallest eigenvalue $\lambda_{min}$, gradient-based updates become unstable or extremely slow.

* **Local Minima:** Nonconvex models like deep networks possess many local minima, although in high-dimensional spaces these are often benign.

* **Plateaus and Saddle Points:** Most critical points in high dimensions are saddle points where gradients are small, causing extremely slow progress.

* **Cliffs and Exploding Gradients:** Sharp increases in curvature can produce enormous gradients, causing updates to overshoot and destabilize training.

* **Long-Term Dependencies:** Very deep computational graphs, especially in recurrent architectures, cause gradients to vanish or explode over long chains of operations.

* **Inexact Gradients:** Because gradients are computed from minibatches and not the full dataset, they are noisy estimates rather than exact derivatives.

* **Poor Correspondence Between Local and Global Structure:** The local geometry of $J(\theta)$ near a point may provide misleading information about the global landscape.

* **Theoretical Limits:** Some optimization problems are provably intractable, though in practice larger or more flexible networks often make finding acceptable solutions easier.
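The effect of ill-conditioning can be reproduced on a toy problem. The sketch below assumes a diagonal quadratic cost $J(\theta) = \frac{1}{2}\sum_i h_i \theta_i^2$, whose Hessian eigenvalues are the $h_i$. The function name, eigenvalues, and tolerance are hypothetical choices. It counts how many gradient descent steps are needed when the condition number is 2 versus 200.

```python
import numpy as np

def gradient_descent_steps(hessian_diag, lr, tol=1e-6, max_steps=100_000):
    """Count gradient descent steps on J(theta) = 0.5 * sum(h_i * theta_i**2)."""
    theta = np.ones_like(hessian_diag)
    for step in range(1, max_steps + 1):
        theta = theta - lr * hessian_diag * theta  # the gradient is H @ theta
        if np.linalg.norm(theta) < tol:
            return step
    return max_steps

well_conditioned = np.array([1.0, 2.0])    # condition number 2
ill_conditioned = np.array([1.0, 200.0])   # condition number 200

# The step size must stay below 2 / lambda_max for stability; 1 / lambda_max is used here.
steps_well = gradient_descent_steps(well_conditioned, lr=1.0 / well_conditioned.max())
steps_ill = gradient_descent_steps(ill_conditioned, lr=1.0 / ill_conditioned.max())
print(steps_well, steps_ill)
```

Because the largest eigenvalue caps the learning rate, progress along the direction of the smallest eigenvalue slows down in proportion to the condition number.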

## **Basic Algorithms**

# **Stochastic Gradient Descent**

Stochastic gradient descent (SGD) and its variants are among the most widely used optimization algorithms in machine learning and deep learning. The key idea is to replace the **true gradient**
$$
\nabla_\theta J(\theta)
$$
(which requires evaluating all $m$ training examples) with an **unbiased stochastic estimate** computed from a minibatch of size $m_b$ sampled i.i.d. from the data-generating distribution. Here, $\theta$ denotes the vector of trainable parameters, and each minibatch consists of input–target pairs $(x^{(i)},y^{(i)})$.

A crucial hyperparameter of SGD is the **learning rate**. Although SGD is often written with a constant learning rate $\epsilon$, in practice we use a **learning rate schedule** $\epsilon_k$ indexed by iteration $k$. This is necessary because the stochastic gradient estimate
$$
\hat{g} \approx \nabla_\theta J(\theta)
$$
contains **noise** due to random sampling, and this noise does not vanish even when the iterates approach a minimum. In contrast, batch gradient descent—using the full dataset—has a gradient that approaches zero near a minimum, allowing a fixed learning rate.
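This behavior can be checked numerically. The sketch below uses a hypothetical noisy linear-regression problem, with data, batch size, and names chosen only for illustration. It evaluates both gradients at the exact minimizer of the full-batch cost.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical noisy linear-regression data.
m, d = 500, 2
X = rng.normal(size=(m, d))
y = X @ np.array([3.0, -1.0]) + rng.normal(size=m)

# Exact minimizer of the full-batch mean squared error.
theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)

def grad(theta, X_batch, y_batch):
    """Gradient of the mean of 0.5 * (x @ theta - y)**2 over the batch."""
    return X_batch.T @ (X_batch @ theta - y_batch) / len(y_batch)

# At the minimizer the full gradient vanishes (up to rounding error) ...
full_grad_norm = np.linalg.norm(grad(theta_star, X, y))

# ... but a minibatch gradient evaluated at the same point does not.
batch = rng.choice(m, size=16, replace=False)
minibatch_grad_norm = np.linalg.norm(grad(theta_star, X[batch], y[batch]))
print(full_grad_norm, minibatch_grad_norm)
```

With a fixed learning rate, SGD would therefore keep jittering around the minimum, which is what a decaying schedule is meant to suppress.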

A standard condition ensuring convergence of SGD is:

$$
\sum_{k=1}^\infty \epsilon_k = \infty,\qquad
\sum_{k=1}^\infty \epsilon_k^2 < \infty.
$$
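As a numerical illustration of these two conditions, the sketch below compares partial sums for two hypothetical schedules: $\epsilon_k = \epsilon_0 / k$, which satisfies both conditions, and a constant learning rate, which violates the second. The function names and constants are assumptions made for the example.

```python
import math

def partial_sums(schedule, n_terms):
    """Return (sum of eps_k, sum of eps_k**2) for k = 1..n_terms."""
    total = sum(schedule(k) for k in range(1, n_terms + 1))
    total_sq = sum(schedule(k) ** 2 for k in range(1, n_terms + 1))
    return total, total_sq

def inverse_decay(k, eps0=1.0):
    """eps_k = eps0 / k: the sum diverges, the sum of squares converges."""
    return eps0 / k

def constant(k, eps0=0.1):
    """eps_k = eps0: the sum diverges, but so does the sum of squares."""
    return eps0

for n in (10**3, 10**4, 10**5):
    print(n, partial_sums(inverse_decay, n), partial_sums(constant, n))
```

For the inverse decay, the first sum keeps growing like $\log n$ while the second stays below $\pi^2/6$. For the constant schedule, both sums grow without bound.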

A common practical schedule linearly decays the learning rate until iteration $\tau$:
$$
\epsilon_k = \left(1 -\frac{k}{\tau}\right)\epsilon_0 + \frac{k}{\tau}\epsilon_{\tau}
$$
After iteration $\tau$, the learning rate is usually held constant. In practice, the choice of learning rate is guided by observing learning curves of the objective function.
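The linear schedule can be written as a small function. The values of $\epsilon_0$, $\epsilon_\tau$, and $\tau$ below are hypothetical and only serve to show the shape of the schedule.

```python
def linear_decay(k, eps0, eps_tau, tau):
    """Learning rate at iteration k: linear from eps0 to eps_tau, then constant."""
    if k >= tau:
        return eps_tau
    fraction = k / tau
    return (1 - fraction) * eps0 + fraction * eps_tau

# Hypothetical values: decay from 0.1 to 0.001 over the first 1000 iterations.
for k in (0, 250, 500, 1000, 5000):
    print(k, linear_decay(k, eps0=0.1, eps_tau=0.001, tau=1000))
```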

### **Algorithm: Stochastic Gradient Descent (SGD)**

- **Require:** Learning rate schedule $\epsilon_1, \epsilon_2, \dots$
- **Require:** Initial parameters $\theta$

* $k \leftarrow 1$
* **while** stopping criterion not met **do**
  - Sample a minibatch of $m_b$ examples $\{(x^{(i)},y^{(i)})\}$
  - Compute gradient estimate:
  $
  \hat{g} \leftarrow \frac{1}{m_b}\sum_i \nabla_\theta L(f(x^{(i)};\theta), y^{(i)})
  $
  - Update parameters:
  $
  \theta \leftarrow \theta - \epsilon_k \hat{g}
  $
  - $k \leftarrow k + 1$
* **end while**
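The algorithm above can be sketched in NumPy as follows. This is an illustration under several assumptions: the stopping criterion is a fixed number of steps, the problem is a hypothetical linear regression with squared-error loss, and the names `sgd`, `grad_squared_error`, and `schedule` are invented for the example.

```python
import numpy as np

def sgd(grad_fn, theta, X, y, schedule, batch_size=32, n_steps=2000, seed=0):
    """Minibatch SGD with a learning rate schedule.

    grad_fn(theta, X_batch, y_batch) returns the average gradient on the batch;
    schedule(k) returns the learning rate eps_k for iteration k.
    """
    rng = np.random.default_rng(seed)
    for k in range(1, n_steps + 1):          # assumed stopping criterion: step budget
        batch = rng.choice(len(y), size=batch_size, replace=False)
        g_hat = grad_fn(theta, X[batch], y[batch])
        theta = theta - schedule(k) * g_hat
    return theta

def grad_squared_error(theta, X_batch, y_batch):
    """Gradient of the mean of 0.5 * (x @ theta - y)**2 over the batch."""
    return X_batch.T @ (X_batch @ theta - y_batch) / len(y_batch)

def schedule(k, eps0=0.1, eps_tau=0.001, tau=1000):
    """Linear decay from eps0 to eps_tau over tau iterations, then constant."""
    fraction = min(k / tau, 1.0)
    return (1 - fraction) * eps0 + fraction * eps_tau

# Hypothetical data: y = X @ w_true + noise.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * data_rng.normal(size=1000)

theta_hat = sgd(grad_squared_error, np.zeros(3), X, y, schedule)
print(theta_hat)
```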

# **Momentum**

Although SGD is effective, learning can be slow in regions of high curvature or where gradients are small or noisy. **Momentum** accelerates SGD by accumulating an exponentially decaying moving average of past gradients. The method introduces a **velocity vector** $v$, which encodes both direction and magnitude of recent updates.

<img src="img\download.png" width="30%" height="30%">

Momentum is controlled by a hyperparameter $\alpha \in [0,1)$, which determines how strongly past gradients influence the current direction. Formally, the momentum update is

$$
v \leftarrow \alpha v - \epsilon \left(\frac{1}{m_b}\sum_{i=1}^{m_b} \nabla_\theta L(f(x^{(i)};\theta), y^{(i)})\right)
$$

$$
\theta \leftarrow \theta + v.
$$

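The step size now depends on how large and how well aligned a sequence of gradients is. If several successive gradients point in the same direction, the velocity builds up, allowing faster progress. In the limiting case where every gradient equals the same vector $g$, the velocity converges to the terminal value $-\epsilon g / (1 - \alpha)$, so $\frac{1}{1 - \alpha}$ is the factor by which momentum can amplify the step. It can also be read as the approximate “effective horizon” of the moving average. For example, $\alpha = 0.9$ corresponds to a maximum step about 10 times larger than that of vanilla SGD.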

Typical choices for $\alpha$ include $0.5$, $0.9$, and $0.99$. Like the learning rate, $\alpha$ can be adjusted over time, though this is less critical than scheduling $\epsilon$.
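The amplification factor $\frac{1}{1-\alpha}$ can be verified by iterating the velocity update with a constant gradient. The sketch below uses a scalar gradient and hypothetical values for $\epsilon$ and $\alpha$.

```python
def velocity_after(n_steps, grad, eps, alpha):
    """Iterate v <- alpha * v - eps * grad for a constant scalar gradient."""
    v = 0.0
    for _ in range(n_steps):
        v = alpha * v - eps * grad
    return v

eps, alpha, grad = 0.01, 0.9, 1.0
plain_sgd_step = -eps * grad                  # step taken by SGD without momentum
terminal_step = -eps * grad / (1 - alpha)     # limit of the velocity

v = velocity_after(200, grad, eps, alpha)
print(plain_sgd_step, v, terminal_step)
```

After enough steps the velocity matches the terminal value, which is ten times the plain SGD step for $\alpha = 0.9$.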

### **Algorithm: SGD with Momentum**

- **Require:** Learning rate $\epsilon$, momentum parameter $\alpha$
- **Require:** Initial parameters $\theta$, initial velocity $v$

* **while** stopping criterion not met **do**
  - Sample minibatch $\{(x^{(i)},y^{(i)})\}$
  - Compute gradient estimate:
  $
  \hat{g} \leftarrow \frac{1}{m_b}\sum_i \nabla_\theta L(f(x^{(i)};\theta), y^{(i)})
  $
  - Velocity update:
  $
  v \leftarrow \alpha v - \epsilon \hat{g}
  $
  - Parameter update:
  $
  \theta \leftarrow \theta + v
  $
* **end while**
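A minimal sketch of the momentum update follows. To keep the run deterministic it uses the exact gradient of a hypothetical ill-conditioned quadratic instead of a minibatch estimate. The function names and the hyperparameter values are assumptions.

```python
import numpy as np

def momentum_step(theta, v, g_hat, eps, alpha):
    """One update of SGD with momentum; returns the new (theta, v)."""
    v = alpha * v - eps * g_hat
    return theta + v, v

# Hypothetical ill-conditioned quadratic J(theta) = 0.5 * sum(h * theta**2).
h = np.array([1.0, 100.0])

def cost(theta):
    return 0.5 * np.sum(h * theta**2)

def grad(theta):
    return h * theta

def run(alpha, eps=0.01, n_steps=300):
    """Run n_steps updates from theta = (1, 1) and return the final cost."""
    theta, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(n_steps):
        theta, v = momentum_step(theta, v, grad(theta), eps, alpha)
    return cost(theta)

cost_plain = run(alpha=0.0)       # alpha = 0 recovers plain gradient descent
cost_momentum = run(alpha=0.9)
print(cost_plain, cost_momentum)
```

With the same learning rate, the run with momentum makes much faster progress along the low-curvature direction.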

# **Nesterov Momentum**

Nesterov momentum modifies standard momentum by evaluating the gradient **after** applying the current velocity. This can be viewed as adding a correction term: the algorithm “looks ahead” using the momentum direction before computing the gradient.

The key idea is to compute an interim parameter value
$
\tilde{\theta} = \theta + \alpha v,
$
evaluate the gradient at $\tilde{\theta}$, and then update velocity and parameters accordingly.

### **Algorithm: SGD with Nesterov Momentum**

- **Require:** Learning rate $\epsilon$, momentum parameter $\alpha$
- **Require:** Initial parameters $\theta$, initial velocity $v$

* **while** stopping criterion not met **do**
  - Sample minibatch $\{(x^{(i)},y^{(i)})\}$
  - Interim lookahead step:
  $
  \tilde{\theta} \leftarrow \theta + \alpha v
  $
  - Compute gradient estimate at $\tilde{\theta}$:
  $
  \hat{g} \leftarrow \frac{1}{m_b}\sum_i \nabla_\theta L(f(x^{(i)};\tilde{\theta}), y^{(i)})
  $
  - Velocity update:
  $
  v \leftarrow \alpha v - \epsilon \hat{g}
  $
  - Parameter update:
  $
  \theta \leftarrow \theta + v
  $
* **end while**
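The only change relative to standard momentum is where the gradient is evaluated. The sketch below makes this explicit by passing a gradient function, so that the step can evaluate it at the interim point. As before, the quadratic test problem, the names, and the hyperparameters are hypothetical, and the exact gradient stands in for a minibatch estimate.

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, eps, alpha):
    """One update of SGD with Nesterov momentum; returns the new (theta, v)."""
    theta_interim = theta + alpha * v     # look ahead along the current velocity
    g_hat = grad_fn(theta_interim)        # gradient at the interim point
    v = alpha * v - eps * g_hat
    return theta + v, v

# Hypothetical ill-conditioned quadratic J(theta) = 0.5 * sum(h * theta**2).
h = np.array([1.0, 100.0])

def grad_fn(theta):
    return h * theta

theta, v = np.array([1.0, 1.0]), np.zeros(2)

# With zero initial velocity the first step equals a plain gradient step.
theta, v = nesterov_step(theta, v, grad_fn, eps=0.01, alpha=0.9)
first_theta = theta.copy()

for _ in range(299):
    theta, v = nesterov_step(theta, v, grad_fn, eps=0.01, alpha=0.9)
final_cost = 0.5 * np.sum(h * theta**2)
print(first_theta, final_cost)
```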


### Algorithms with Adaptive Learning Rates

Neural network researchers have long recognized that the learning rate is one of the most difficult hyperparameters to set, as it strongly affects model performance. The cost is often sensitive in some directions of parameter space and insensitive in others. Momentum can help but introduces an additional hyperparameter.

If the directions of sensitivity are axis-aligned, it makes sense to use a separate learning rate for each parameter and adapt them automatically during training. Several incremental (mini-batch) methods have been developed for this purpose. Below, we briefly review some popular algorithms.

### Algorithm: AdaGrad

* **Require:** Global learning rate $\epsilon$
* **Require:** Initial parameter $\theta$
* **Require:** Small constant $\delta$ (e.g., $10^{-7}$) for numerical stability

* Initialize $r = 0$ (gradient accumulation)

* **while** stopping criterion not met **do**

  * Sample a minibatch $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$
  * Compute gradient:
    $
    g \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})
    $
  * Accumulate squared gradient: $r \leftarrow r + g \odot g$
  * Compute update: $\Delta \theta \leftarrow -\frac{\epsilon}{\delta + \sqrt{r}} \odot g$
  * Apply update: $\theta \leftarrow \theta + \Delta \theta$

* **end while**
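The AdaGrad update can be sketched as a single step function. The example applies it to a hypothetical quadratic whose two gradient components differ by a factor of 100, using exact gradients rather than minibatch estimates. The function name, the learning rate of 0.1, and the test problem are assumptions.

```python
import numpy as np

def adagrad_step(theta, r, g, eps=0.1, delta=1e-7):
    """One AdaGrad update; returns the new (theta, r)."""
    r = r + g * g                                   # accumulate squared gradient
    theta = theta - eps / (delta + np.sqrt(r)) * g  # per-parameter scaled step
    return theta, r

# Hypothetical quadratic J(theta) = 0.5 * sum(h * theta**2).
h = np.array([1.0, 100.0])
theta, r = np.array([1.0, 1.0]), np.zeros(2)

# First step: the gradient is [1, 100], yet both parameters move by about eps.
theta, r = adagrad_step(theta, r, h * theta)
first_theta = theta.copy()

# The effective per-parameter learning rate eps / (delta + sqrt(r)) never increases.
effective_rates = []
for _ in range(50):
    theta, r = adagrad_step(theta, r, h * theta)
    effective_rates.append(0.1 / (1e-7 + np.sqrt(r)))
effective_rates = np.array(effective_rates)
print(first_theta, effective_rates[0], effective_rates[-1])
```

Because `r` only grows, the effective learning rate of every parameter is monotonically non-increasing over training.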

### Algorithm: RMSProp

* **Require:** Global learning rate $\epsilon$, decay rate $\rho$
* **Require:** Initial parameter $\theta$
* **Require:** Small constant $\delta$ (e.g., $10^{-6}$)

* Initialize $r = 0$

* **while** stopping criterion not met **do**

  * Sample a minibatch $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$
  * Compute gradient:
    $
    g \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})
    $
  * Accumulate squared gradient: $r \leftarrow \rho r + (1-\rho) g \odot g$
  * Compute update: $\Delta \theta \leftarrow -\frac{\epsilon}{\delta + \sqrt{r}} \odot g$
  * Apply update: $\theta \leftarrow \theta + \Delta \theta$

* **end while**
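A minimal sketch of the RMSProp step follows. To show the effect of the decaying average, it feeds a constant hypothetical gradient and also tracks, for comparison, what an AdaGrad-style running sum would accumulate. The names and values are assumptions.

```python
import numpy as np

def rmsprop_step(theta, r, g, eps=0.001, rho=0.9, delta=1e-6):
    """One RMSProp update; returns the new (theta, r)."""
    r = rho * r + (1 - rho) * g * g                 # decaying average of g**2
    theta = theta - eps / (delta + np.sqrt(r)) * g
    return theta, r

g = np.array([2.0])                  # hypothetical constant gradient
theta, r = np.array([0.0]), np.zeros(1)
r_adagrad = np.zeros(1)              # AdaGrad-style running sum, for comparison

for _ in range(1000):
    theta_new, r = rmsprop_step(theta, r, g)
    last_step = theta_new - theta
    theta = theta_new
    r_adagrad = r_adagrad + g * g

print(r, r_adagrad, last_step)
```

The decaying average settles at $g^2 = 4$, so the step size stays close to $\epsilon$, whereas the running sum keeps growing and would keep shrinking the step.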

### Algorithm: RMSProp with Nesterov Momentum

* **Require:** Global learning rate $\epsilon$, decay rate $\rho$, momentum coefficient $\alpha$
* **Require:** Initial parameter $\theta$, initial velocity $v$
* **Require:** Small constant $\delta$ (e.g., $10^{-6}$) for numerical stability

* Initialize $r = 0$

* **while** stopping criterion not met **do**

  * Sample a minibatch $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$
  * Compute interim update: $\tilde{\theta} \leftarrow \theta + \alpha v$
  * Compute gradient:
    $
    g \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \tilde{\theta}), y^{(i)})
    $
  * Accumulate squared gradient: $r \leftarrow \rho r + (1-\rho) g \odot g$
  * Compute velocity: $v \leftarrow \alpha v - \frac{\epsilon}{\delta + \sqrt{r}} \odot g$
  * Apply update: $\theta \leftarrow \theta + v$

* **end while**
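The combination can be sketched as below. Two points are assumptions of this sketch rather than part of the listing: the momentum coefficient $\alpha$ and the velocity $v$ are passed in explicitly, and $\delta$ is added to the denominator (as in plain RMSProp) to avoid division by zero. The quadratic test problem and all names are hypothetical.

```python
import numpy as np

def rmsprop_nesterov_step(theta, v, r, grad_fn, eps=0.001, rho=0.9,
                          alpha=0.9, delta=1e-6):
    """One RMSProp update with Nesterov momentum; returns the new (theta, v, r)."""
    theta_interim = theta + alpha * v               # look-ahead point
    g = grad_fn(theta_interim)                      # gradient at the interim point
    r = rho * r + (1 - rho) * g * g                 # decaying average of g**2
    v = alpha * v - eps / (delta + np.sqrt(r)) * g  # assumption: delta in denominator
    return theta + v, v, r

# Hypothetical ill-conditioned quadratic J(theta) = 0.5 * sum(h * theta**2).
h = np.array([1.0, 100.0])

def cost(theta):
    return 0.5 * np.sum(h * theta**2)

def grad_fn(theta):
    return h * theta

theta, v, r = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
initial_cost = cost(theta)

theta, v, r = rmsprop_nesterov_step(theta, v, r, grad_fn)
first_theta = theta.copy()

for _ in range(1999):
    theta, v, r = rmsprop_nesterov_step(theta, v, r, grad_fn)
final_cost = cost(theta)
print(first_theta, initial_cost, final_cost)
```

On the first step both parameters move by the same amount, $\epsilon / \sqrt{1-\rho}$, even though their gradients differ by a factor of 100.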

### Algorithm: Adam

* **Require:** Step size $\epsilon$ (default 0.001)
* **Require:** Exponential decay rates $\rho_1, \rho_2 \in [0,1)$ (default 0.9, 0.999)
* **Require:** Small constant $\delta$ (default $10^{-8}$)
* **Require:** Initial parameter $\theta$

* Initialize $s = 0$, $r = 0$ (1st and 2nd moments)

* Initialize time step $t = 0$

* **while** stopping criterion not met **do**

  * Sample a minibatch $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$
  * Compute gradient: $g \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})$
  * $t \leftarrow t + 1$
  * Update biased 1st moment: $s \leftarrow \rho_1 s + (1-\rho_1) g$
  * Update biased 2nd moment: $r \leftarrow \rho_2 r + (1-\rho_2) g \odot g$
  * Bias correction:
    $
    \hat{s} \leftarrow \frac{s}{1-\rho_1^t}, \quad \hat{r} \leftarrow \frac{r}{1-\rho_2^t}
    $
  * Compute update: $\Delta \theta \leftarrow -\epsilon \frac{\hat{s}}{\sqrt{\hat{r}} + \delta}$
  * Apply update: $\theta \leftarrow \theta + \Delta \theta$

* **end while**
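A minimal sketch of one Adam step is given below, using the default values from the listing. It places $\delta$ outside the square root, that is $\sqrt{\hat{r}} + \delta$, which is the form used in the original Adam formulation. The quadratic test problem and the function names are hypothetical, and exact gradients stand in for minibatch estimates.

```python
import numpy as np

def adam_step(theta, s, r, t, g, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update; returns the new (theta, s, r, t)."""
    t = t + 1
    s = rho1 * s + (1 - rho1) * g          # biased first moment estimate
    r = rho2 * r + (1 - rho2) * g * g      # biased second moment estimate
    s_hat = s / (1 - rho1**t)              # bias-corrected first moment
    r_hat = r / (1 - rho2**t)              # bias-corrected second moment
    theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r, t

# Hypothetical ill-conditioned quadratic J(theta) = 0.5 * sum(h * theta**2).
h = np.array([1.0, 100.0])

def cost(theta):
    return 0.5 * np.sum(h * theta**2)

def grad_fn(theta):
    return h * theta

theta, s, r, t = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2), 0
initial_cost = cost(theta)

# First step: thanks to bias correction, each parameter moves by about eps.
theta, s, r, t = adam_step(theta, s, r, t, grad_fn(theta))
first_theta = theta.copy()

for _ in range(2999):
    theta, s, r, t = adam_step(theta, s, r, t, grad_fn(theta))
final_cost = cost(theta)
print(first_theta, initial_cost, final_cost)
```

Without the bias correction, the first step would be scaled by $(1-\rho_1)/\sqrt{1-\rho_2} \approx 3.16$ instead of 1, because both moment estimates start at zero.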


<img src="img/121381obtV.gif">
<img src="img/56201contours_evaluation_optimizers.gif">


# Hands-on Optimizers with Python

In [None]:
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, y_train.shape)

In [None]:
from tensorflow import keras

# Reshape input data
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)

# Convert labels to one-hot encoding
y_train = keras.utils.to_categorical(y_train) 
y_test = keras.utils.to_categorical(y_test)

# Normalize input data
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255

In [None]:
batch_size = 64
num_classes = 10
epochs = 10
input_shape = (28,28,1)

def build_model(optimizer):
    model = Sequential()
    model.add(Conv2D(32,kernel_size=(3,3),activation='relu',input_shape=input_shape))
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(loss=keras.losses.categorical_crossentropy, optimizer=optimizer, metrics=['accuracy'])
    
    return model

In [None]:
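optimizers = ['Adadelta', 'Adagrad', 'Adam', 'RMSprop', 'SGD']

models_history = {}
for name in optimizers:
    # A string identifier selects the optimizer with Keras' default hyperparameters.
    # A fresh model is built for each optimizer so that runs do not share weights.
    model = build_model(name)
    hist = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
                     verbose=1, validation_data=(x_test, y_test))
    models_history[name] = hist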

We have run our model with a batch size of 64 for 10 epochs. After trying the different optimizers, the results we get are pretty interesting. Before analyzing the results, what do you think will be the best optimizer for this dataset?

## Adadelta

- Epoch 1/10 34s 35ms/step - loss: 2.2782 - accuracy: 0.1459 - val_loss: 2.2228 - val_accuracy: 0.3174
- Epoch 2/10 40s 43ms/step - loss: 2.1887 - accuracy: 0.3049 - val_loss: 2.1213 - val_accuracy: 0.6197
- Epoch 3/10 39s 42ms/step - loss: 2.0875 - accuracy: 0.4569 - val_loss: 2.0019 - val_accuracy: 0.7057
- Epoch 4/10 34s 36ms/step - loss: 1.9704 - accuracy: 0.5544 - val_loss: 1.8638 - val_accuracy: 0.7474
- Epoch 5/10 33s 35ms/step - loss: 1.8359 - accuracy: 0.6152 - val_loss: 1.7099 - val_accuracy: 0.7736
- Epoch 6/10 34s 36ms/step - loss: 1.6921 - accuracy: 0.6535 - val_loss: 1.5487 - val_accuracy: 0.7928
- Epoch 7/10 34s 36ms/step - loss: 1.5433 - accuracy: 0.6831 - val_loss: 1.3900 - val_accuracy: 0.8092
- Epoch 8/10 33s 35ms/step - loss: 1.4021 - accuracy: 0.6992 - val_loss: 1.2432 - val_accuracy: 0.8194
- Epoch 9/10 34s 36ms/step - loss: 1.2757 - accuracy: 0.7164 - val_loss: 1.1137 - val_accuracy: 0.8275
- Epoch 10/10 35s 37ms/step - loss: 1.1663 - accuracy: 0.7314 - val_loss: 1.0028 - val_accuracy: 0.8337

## Adagrad

- Epoch 1/10 33s 35ms/step - loss: 1.7035 - accuracy: 0.5212 - val_loss: 0.8946 - val_accuracy: 0.8293
- Epoch 2/10 33s 35ms/step - loss: 0.8019 - accuracy: 0.7735 - val_loss: 0.5032 - val_accuracy: 0.8791
- Epoch 3/10 33s 35ms/step - loss: 0.5975 - accuracy: 0.8243 - val_loss: 0.4045 - val_accuracy: 0.8958
- Epoch 4/10 32s 34ms/step - loss: 0.5146 - accuracy: 0.8468 - val_loss: 0.3573 - val_accuracy: 0.9049
- Epoch 5/10 33s 35ms/step - loss: 0.4658 - accuracy: 0.8624 - val_loss: 0.3273 - val_accuracy: 0.9099
- Epoch 6/10 32s 34ms/step - loss: 0.4346 - accuracy: 0.8717 - val_loss: 0.3056 - val_accuracy: 0.9163
- Epoch 7/10 34s 36ms/step - loss: 0.4090 - accuracy: 0.8789 - val_loss: 0.2895 - val_accuracy: 0.9204
- Epoch 8/10 35s 37ms/step - loss: 0.3884 - accuracy: 0.8852 - val_loss: 0.2753 - val_accuracy: 0.9233
- Epoch 9/10 31s 33ms/step - loss: 0.3723 - accuracy: 0.8892 - val_loss: 0.2633 - val_accuracy: 0.9257
- Epoch 10/10 103s 110ms/step - loss: 0.3609 - accuracy: 0.8935 - val_loss: 0.2535 - val_accuracy: 0.9282

## Adam

- Epoch 1/10 35s 37ms/step - loss: 0.2290 - accuracy: 0.9317 - val_loss: 0.0656 - val_accuracy: 0.9788
- Epoch 2/10 36s 38ms/step - loss: 0.0906 - accuracy: 0.9724 - val_loss: 0.0495 - val_accuracy: 0.9827
- Epoch 3/10 35s 37ms/step - loss: 0.0670 - accuracy: 0.9796 - val_loss: 0.0394 - val_accuracy: 0.9870
- Epoch 4/10 35s 37ms/step - loss: 0.0537 - accuracy: 0.9828 - val_loss: 0.0418 - val_accuracy: 0.9865
- Epoch 5/10 36s 38ms/step - loss: 0.0460 - accuracy: 0.9855 - val_loss: 0.0338 - val_accuracy: 0.9884
- Epoch 6/10 33s 35ms/step - loss: 0.0393 - accuracy: 0.9868 - val_loss: 0.0353 - val_accuracy: 0.9886
- Epoch 7/10 34s 36ms/step - loss: 0.0334 - accuracy: 0.9891 - val_loss: 0.0347 - val_accuracy: 0.9892
- Epoch 8/10 35s 37ms/step - loss: 0.0318 - accuracy: 0.9897 - val_loss: 0.0323 - val_accuracy: 0.9897
- Epoch 9/10 34s 36ms/step - loss: 0.0260 - accuracy: 0.9911 - val_loss: 0.0331 - val_accuracy: 0.9895
- Epoch 10/10 34s 37ms/step - loss: 0.0258 - accuracy: 0.9918 - val_loss: 0.0330 - val_accuracy: 0.9886

## RMSprop

- Epoch 1/10 40s 41ms/step - loss: 0.2360 - accuracy: 0.9282 - val_loss: 0.0721 - val_accuracy: 0.9777
- Epoch 2/10 38s 41ms/step - loss: 0.0928 - accuracy: 0.9722 - val_loss: 0.0573 - val_accuracy: 0.9810
- Epoch 3/10 38s 41ms/step - loss: 0.0729 - accuracy: 0.9790 - val_loss: 0.0541 - val_accuracy: 0.9828
- Epoch 4/10 39s 41ms/step - loss: 0.0641 - accuracy: 0.9814 - val_loss: 0.0480 - val_accuracy: 0.9857
- Epoch 5/10 39s 41ms/step - loss: 0.0600 - accuracy: 0.9828 - val_loss: 0.0453 - val_accuracy: 0.9862
- Epoch 6/10 68s 72ms/step - loss: 0.0596 - accuracy: 0.9830 - val_loss: 0.0420 - val_accuracy: 0.9874
- Epoch 7/10 47s 50ms/step - loss: 0.0570 - accuracy: 0.9835 - val_loss: 0.0481 - val_accuracy: 0.9856
- Epoch 8/10 44s 47ms/step - loss: 0.0552 - accuracy: 0.9840 - val_loss: 0.0447 - val_accuracy: 0.9870
- Epoch 9/10 41s 43ms/step - loss: 0.0580 - accuracy: 0.9837 - val_loss: 0.0488 - val_accuracy: 0.9869
- Epoch 10/10 41s 43ms/step - loss: 0.0562 - accuracy: 0.9841 - val_loss: 0.0469 - val_accuracy: 0.9860

## SGD

- Epoch 1/10 34s 36ms/step - loss: 0.8495 - accuracy: 0.7438 - val_loss: 0.3122 - val_accuracy: 0.9119
- Epoch 2/10 33s 36ms/step - loss: 0.3903 - accuracy: 0.8814 - val_loss: 0.2336 - val_accuracy: 0.9333
- Epoch 3/10 34s 36ms/step - loss: 0.3161 - accuracy: 0.9040 - val_loss: 0.1938 - val_accuracy: 0.9443
- Epoch 4/10 33s 36ms/step - loss: 0.2748 - accuracy: 0.9183 - val_loss: 0.1687 - val_accuracy: 0.9515
- Epoch 5/10 33s 35ms/step - loss: 0.2467 - accuracy: 0.9244 - val_loss: 0.1492 - val_accuracy: 0.9566
- Epoch 6/10 32s 34ms/step - loss: 0.2263 - accuracy: 0.9315 - val_loss: 0.1367 - val_accuracy: 0.9598
- Epoch 7/10 33s 35ms/step - loss: 0.2123 - accuracy: 0.9357 - val_loss: 0.1279 - val_accuracy: 0.9630
- Epoch 8/10 34s 36ms/step - loss: 0.1982 - accuracy: 0.9405 - val_loss: 0.1189 - val_accuracy: 0.9644
- Epoch 9/10 34s 36ms/step - loss: 0.1896 - accuracy: 0.9430 - val_loss: 0.1144 - val_accuracy: 0.9663
- Epoch 10/10 35s 38ms/step - loss: 0.1796 - accuracy: 0.9463 - val_loss: 0.1078 - val_accuracy: 0.9681


## Analysis of Results
The logs above show the training and validation loss and accuracy after each epoch, together with the time each epoch took, for every optimizer. From them, we can make the following analysis.

- Adam reaches the best validation accuracy (0.9886 after 10 epochs, peaking at 0.9897) in a moderate total time of about 347 s.
- RMSprop reaches similar accuracy (0.9860) but takes about 25% longer in total (435 s), partly because of one slow 68 s epoch.
- Surprisingly, SGD took the least time to train (about 335 s) and produced good results as well (0.9681). Its accuracy was still rising at epoch 10, so reaching the accuracy of Adam would require more iterations, and hence more computation time.
- SGD with momentum, whose log is not listed above, shows similar accuracy to SGD with an unexpectedly larger computation time. This suggests that the momentum value needs to be tuned.
- Adadelta shows the poorest results: the lowest validation accuracy (0.8337) with a total time (about 350 s) no better than Adam's.
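The totals behind this comparison can be recomputed from the logs. The sketch below transcribes the per-epoch times (in seconds) and the final validation accuracies from the logs above. The variable names are chosen only for this example.

```python
# Per-epoch wall-clock times in seconds, transcribed from the training logs above.
epoch_seconds = {
    "Adadelta": [34, 40, 39, 34, 33, 34, 34, 33, 34, 35],
    "Adagrad": [33, 33, 33, 32, 33, 32, 34, 35, 31, 103],
    "Adam": [35, 36, 35, 35, 36, 33, 34, 35, 34, 34],
    "RMSprop": [40, 38, 38, 39, 39, 68, 47, 44, 41, 41],
    "SGD": [34, 33, 34, 33, 33, 32, 33, 34, 34, 35],
}

# Validation accuracy after epoch 10, also taken from the logs.
final_val_accuracy = {
    "Adadelta": 0.8337,
    "Adagrad": 0.9282,
    "Adam": 0.9886,
    "RMSprop": 0.9860,
    "SGD": 0.9681,
}

total_seconds = {name: sum(times) for name, times in epoch_seconds.items()}
fastest = min(total_seconds, key=total_seconds.get)
most_accurate = max(final_val_accuracy, key=final_val_accuracy.get)

for name in epoch_seconds:
    print(f"{name:<9} {total_seconds[name]:>4d} s  val_accuracy={final_val_accuracy[name]:.4f}")
print("fastest:", fastest, "| most accurate:", most_accurate)
```

Note that the Adagrad total includes a single 103 s final epoch, so wall-clock comparisons from one run should be read with some caution.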

The graph produced by the code below shows the training accuracy of each optimizer at each epoch.

In [None]:
import matplotlib.pyplot as plt


for optimizer, hist in models_history.items():
    y_axis = hist.history['accuracy']        # training accuracy per epoch
    x_axis = range(1, len(y_axis) + 1)       # epochs numbered from 1
    plt.plot(x_axis, y_axis, label=optimizer)
plt.title('Training accuracy per optimizer')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

## Summary

- SGD is the basic algorithm. With a constant learning rate it can converge slowly and makes little progress on plateaus and near saddle points, so in practice it is combined with a learning rate schedule or momentum.

- Adagrad improves on SGD by adapting a separate learning rate for each parameter, scaling it by the inverse square root of the accumulated squared gradients. This makes it especially useful for sparse data.

- RMSProp modifies Adagrad by replacing the accumulated sum of squared gradients with an exponentially decaying average, so the effective learning rate does not keep shrinking over long runs.

- Adam combines the per-parameter scaling of RMSProp with momentum and adds bias correction for both moment estimates. It often converges quickly and is fairly robust to its hyperparameters. For these reasons, Adam is often the default choice for many applications.

However, Adam also has drawbacks, and in some cases, simpler algorithms like SGD may outperform it. Understanding your data and requirements is essential to choose the most suitable optimizer and achieve the best results.

### Conclusion

The choice of an optimization algorithm affects a deep learning model’s accuracy, speed, and efficiency. We explored several algorithms and compared their strengths, weaknesses, and appropriate use cases.

**Key Takeaways**

* Popular deep-learning optimizers include Gradient Descent, Stochastic Gradient Descent, Mini-batch Gradient Descent, Adagrad, RMSProp, AdaDelta, and Adam.
* Each optimizer has unique strengths and limitations, and the best choice depends on the task and data characteristics.
* The optimizer can greatly influence training speed, convergence quality, and the final performance of a model.