# 1. Machine Learning & Neural Networks (8 points)



## (a) (4 points) Adam Optimizer


Recall the standard Stochastic Gradient Descent update rule:
$$\theta \leftarrow \theta − \alpha \nabla_\theta J_{minibatch} (\theta)$$
where $\theta$ is a vector containing all of the model parameters, $J$ is the loss function, $\nabla_\theta J_{minibatch} (\theta)$
is the gradient of the loss function with respect to the parameters on a minibatch of data, and $\alpha$ is
the learning rate. Adam Optimization uses a more sophisticated update rule with two additional
steps. 

### i. (2 points) 

First, Adam uses a trick called momentum by keeping track of m, a rolling average
of the gradients:
$$m \leftarrow \beta_1 m + (1 − \beta_1 ) \nabla_\theta J_{minibatch} (\theta)$$
$$\theta \leftarrow \theta - \alpha m$$
where $\beta_1$ is a hyperparameter between 0 and 1 (often set to 0.9). 

Briefly explain (you don’t need
to prove mathematically, just give an intuition) how using m stops the updates from varying
as much and why this low variance may be helpful to learning, overall.

Setting $\beta_1$ close to 1 makes $\mathbf{m}$ depend mostly on its previous value, so each update to $\theta$ reflects an average over many recent minibatch gradients rather than a single one. Setting $\beta_1$ to 0 recovers the plain stochastic gradient descent update.

Minibatch gradients are noisy. Averaging them cancels much of the noise that points in inconsistent directions while preserving the component that consistently points the same way. This can help the optimizer keep moving through saddle points and sharp, narrow local minima instead of stalling or oscillating there, and settle in broader, more generalisable local minima.

With a typical value such as $\beta_1 = 0.9$, the updated $\mathbf{m}$ consists mostly of the previous $\mathbf{m}_{prev}$, with only a 10% contribution from the current gradient. This dampens the optimization path, giving more stable updates, which often allows a larger learning rate than regular SGD.
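The variance-reduction argument can be checked numerically. The following is a minimal sketch, assuming a made-up one-dimensional problem whose true gradient is 1.0 and whose minibatch noise is Gaussian; the function name `momentum_average` and the noise model are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def momentum_average(grads, beta1=0.9):
    """Return the rolling average m after every step, starting from m = 0."""
    m = np.zeros_like(grads[0])
    history = []
    for g in grads:
        # Same rule as in the question: m <- beta1 * m + (1 - beta1) * gradient
        m = beta1 * m + (1.0 - beta1) * g
        history.append(m.copy())
    return np.array(history)

# Hypothetical noisy minibatch gradients: true gradient 1.0 plus unit Gaussian noise.
rng = np.random.default_rng(0)
noisy_grads = 1.0 + rng.normal(scale=1.0, size=(2000, 1))

m_history = momentum_average(noisy_grads, beta1=0.9)

# Skip the first 100 steps so the warm-up from m = 0 does not distort the comparison.
raw_std = noisy_grads[100:].std()
smoothed_std = m_history[100:].std()
print(f"std of raw gradients: {raw_std:.3f}")
print(f"std of m:             {smoothed_std:.3f}")
```

Under these assumptions the spread of $\mathbf{m}$ should come out well below the spread of the raw gradients, and calling `momentum_average` with `beta1=0.0` returns the raw gradients unchanged, which matches the plain SGD case.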

### ii. (2 points) 

Adam also uses adaptive learning rates by keeping track of v, a rolling average of
the magnitudes of the gradients

$$m \leftarrow \beta_1 m + (1 − \beta_1 ) \nabla_\theta J_{minibatch} (\theta)$$

$$v \leftarrow \beta_2 v + (1 − \beta_2 )(\nabla_\theta J_{minibatch} (\theta) \odot \nabla_\theta J_{minibatch} (\theta))$$

$$\theta \leftarrow \theta − \alpha \odot \frac{m}{\sqrt{v}} $$


where $\odot$ and $/$ denote elementwise multiplication and division (so $z\odot z$ is elementwise squaring)
and $\beta_2$ is a hyperparameter between 0 and 1 (often set to 0.99). Since Adam divides the update
by $\sqrt{v}$, which of the model parameters will get larger updates? Why might this help with
learning?

Adam divides each element of the update $\mathbf{m}$ by the corresponding element of $\sqrt{\mathbf{v}}$, where $\mathbf{v}$ is a rolling average of the elementwise squared gradients. The larger the recent gradients for parameter $\theta_i$, the larger its divisor, so parameters with small gradients receive relatively larger updates and parameters with large gradients receive relatively smaller ones. Adam therefore evens out the scale of the updates between flatter and steeper directions. This helps the updates make progress on plateaus and prevents overshooting in steep directions.
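To make the rescaling concrete, here is a minimal sketch of the update rule exactly as written in this question (no bias correction). The small constant `eps` is an assumption added only to avoid division by zero, and the two gradient values are invented for illustration.

```python
import numpy as np

def adam_step(theta, m, v, grad, alpha=0.001, beta1=0.9, beta2=0.99, eps=1e-8):
    """One simplified Adam step following the three update equations above.

    eps is an assumed safeguard against division by zero; it does not
    appear in the equations of this question.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    theta = theta - alpha * m / (np.sqrt(v) + eps)
    return theta, m, v

# Two parameters whose (constant, made-up) gradients differ by a factor of 10,000.
theta = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
grad = np.array([100.0, 0.01])

for _ in range(50):
    theta, m, v = adam_step(theta, m, v, grad)

print("theta after 50 Adam steps:", theta)
print("theta after 50 SGD steps: ", -0.001 * 50 * grad)
```

With constant gradients, the ratio $m / \sqrt{v}$ does not depend on the gradient's magnitude, so both parameters should move by almost the same amount, whereas the SGD updates differ by the full factor of 10,000.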

## (b) (4 points) Dropout

Dropout is a regularization technique. During training, dropout randomly sets units
in the hidden layer h to zero with probability $p_{drop}$ (dropping different units each minibatch), and
then multiplies h by a constant $\gamma$. We can write this as

$$h_{drop} = \gamma d \circ h$$

where $d \in \{0, 1\}^{D_h}$ ($D_h$ is the size of h) is a mask vector where each entry is 0 with probability
$p_{drop}$ and 1 with probability $(1 − p_{drop} )$. $\gamma$ is chosen such that the expected value of $h_{drop}$ is h:

$$E_{p_{drop}} [h_{drop} ]_i = h_i$$

for all $i \in \{1, \dots, D_h\}$.

### i. (2 points) What must $\gamma$ equal in terms of $p_{drop}$? Briefly justify your answer.


Both $h$ and $\gamma$ are fixed, and only the mask $d$ is random, so the expectation acts only on $d_i$. The condition must hold for every value of $h_i$, so we can divide through by $h_i$:

$$E_{p_{drop}} [h_{drop} ]_i = h_i$$
$$E_{p_{drop}} [\gamma d \circ h ]_i = h_i$$
$$\gamma \, E_{p_{drop}} [ d_i ] \, h_i = h_i$$
$$ \gamma = \frac{1}{E_{p_{drop}} [ d_i ]}$$
$$ \gamma = \frac{1}{0 \cdot p_{drop} + 1 \cdot (1-p_{drop})}$$

$$ \gamma = \frac{1}{1-p_{drop}}$$
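The result can be sanity-checked by simulation. This is a minimal sketch, assuming an arbitrary example vector `h` and drop probability; the helper name `dropout_train` is hypothetical.

```python
import numpy as np

def dropout_train(h, p_drop, rng):
    """Apply h_drop = gamma * d * h with gamma = 1 / (1 - p_drop)."""
    # Each mask entry is 1 with probability (1 - p_drop) and 0 otherwise.
    d = rng.random(h.shape) >= p_drop
    gamma = 1.0 / (1.0 - p_drop)
    return gamma * d * h

rng = np.random.default_rng(0)
h = np.array([0.5, -2.0, 3.0])  # made-up hidden activations
p_drop = 0.3

# Average h_drop over many independently sampled masks.
samples = np.stack([dropout_train(h, p_drop, rng) for _ in range(20000)])
mean_h_drop = samples.mean(axis=0)

print("h:               ", h)
print("mean of h_drop:  ", mean_h_drop)
```

If $\gamma$ is chosen correctly, the empirical mean of $h_{drop}$ should be close to $h$ in every coordinate.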

### ii. (2 points) Why should we apply dropout during training but not during evaluation?

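During training, dropout acts as a regulariser that reduces overfitting. A different random subset of units is dropped for each minibatch, so no unit can rely on specific other units being present. This pushes the network to learn redundant, robust features rather than co-adapted ones that fit only the training data.

During evaluation the weights and biases are no longer being updated, so there is nothing to regularise, and randomly dropping units would only add noise and make the predictions non-deterministic. We therefore use the entire network and its full capacity. Because $\gamma$ keeps the expected value of $h_{drop}$ equal to $h$ during training, the activations at evaluation match, in expectation, those the later layers were trained on.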
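The difference between the two modes can be sketched with a single function that takes a flag. The name `dropout_layer` and its `training` argument are illustrative assumptions, not the API of any particular library.

```python
import numpy as np

def dropout_layer(h, p_drop, training, rng=None):
    """Random mask plus rescaling in training mode; identity in evaluation mode."""
    if not training:
        # Evaluation: keep every unit, so the output is deterministic.
        return h
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(h.shape) >= p_drop
    return mask * h / (1.0 - p_drop)

h = np.array([1.0, 2.0, 3.0, 4.0])  # made-up hidden activations

eval_out_1 = dropout_layer(h, 0.5, training=False)
eval_out_2 = dropout_layer(h, 0.5, training=False)
train_out = dropout_layer(h, 0.5, training=True, rng=np.random.default_rng(0))

print("evaluation output:", eval_out_1)
print("training output:  ", train_out)
```

In evaluation mode the function returns `h` unchanged on every call. In training mode each entry is either zeroed or scaled by $1/(1 - p_{drop})$, which is a factor of 2 here.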
