## Assignment 3: Dependency Parsing

### Estimated Time: ~10 hours

In this assignment, you will build a neural dependency parser using PyTorch.  In Part 1, we review two general neural network techniques (Adam optimization and Dropout).  In Part 2, we implement and train a dependency parser using the techniques from Part 1.

## Part 1.  Adam Optimization and Dropout

### a) Adam

Recall the SGD update rule:

$$\theta = \theta - \alpha\triangledown_\theta J_{\text{minibatch}}(\theta)$$

where $\theta$ is a vector containing all of the model parameters, $J$ is the loss function, $\triangledown_\theta J_{\text{minibatch}}(\theta)$ is the gradient of the loss function, and $\alpha$ is the learning rate.  Adam is another possible update rule with two additional steps.

- (2 pts) First, Adam uses a trick called momentum by keeping track of $\mathbf{m}$, a rolling average of the gradients:

$$\mathbf{m} = \beta_1 \mathbf{m} + (1-\beta_1)\triangledown_\theta J_{\text{minibatch}}(\theta)$$
$$\theta = \theta - \alpha \mathbf{m}$$

  where $\beta_1$ is a hyperparameter between 0 and 1 (often set to 0.9).  Briefly explain in 2-4 sentences (just give an intuition) how using $\mathbf{m}$ stops the updates from varying as much and why this low variance may be helpful to learning, overall.
  
#### <font color="red">Write your answer here.</font> 

#### <font color="blue">Solution</font> 

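Because $\mathbf{m}$ is a rolling average, each new minibatch gradient contributes only a fraction $(1-\beta_1)$ of the update, so the noise from any single minibatch is damped.  In dimensions where successive gradients point in the same direction, the contributions accumulate and the update stays large; in dimensions where the gradient keeps changing sign, the contributions largely cancel and the update shrinks.  The result is reduced oscillation and, typically, faster and more stable convergence.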
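
To make the variance argument concrete, here is a minimal sketch of the rolling average $\mathbf{m}$ for a single scalar parameter, using only the Python standard library.  The function name `momentum_updates`, the noise model (a true gradient of 1.0 plus unit Gaussian noise), and the 100-step warm-up cutoff are assumptions chosen for illustration; they are not part of the assignment code.

```python
import random
import statistics


def momentum_updates(grads, beta1=0.9):
    """Return the rolling average m after each gradient (scalar case)."""
    m = 0.0
    history = []
    for g in grads:
        # Same rule as in the text: m = beta1 * m + (1 - beta1) * gradient
        m = beta1 * m + (1 - beta1) * g
        history.append(m)
    return history


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical noisy minibatch gradients: true gradient 1.0 plus Gaussian noise.
noisy_grads = [1.0 + rng.gauss(0.0, 1.0) for _ in range(1000)]
smoothed = momentum_updates(noisy_grads)

raw_var = statistics.pvariance(noisy_grads)
# Skip the first steps, where m is still warming up from its initial value 0.
smooth_var = statistics.pvariance(smoothed[100:])

print(f"variance of raw gradients:     {raw_var:.4f}")
print(f"variance of momentum averages: {smooth_var:.4f}")
```

Under these assumptions the averaged values should vary far less than the raw gradients while still centering on the true gradient, which is the intuition the question asks for.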

- (2 pts) Adam extends the idea of momentum with the trick of adaptive learning rates by keeping track of $\mathbf{v}$, a rolling average of the magnitudes of the gradients:

$$\mathbf{m} = \beta_1 \mathbf{m} + (1 - \beta_1)\triangledown_\theta J_{\text{minibatch}}(\theta)$$
$$\mathbf{v} = \beta_2 \mathbf{v} + (1 - \beta_2)(\triangledown_\theta J_{\text{minibatch}}(\theta) \circ \triangledown_\theta J_{\text{minibatch}}(\theta))$$
$$\theta = \theta - \alpha \mathbf{m} \mathbin{/} \sqrt{\mathbf{v}}$$

where $\circ$ and $\mathbin{/}$ denote elementwise multiplication and division (not dot product!).  $\beta_2$ is a hyperparameter between 0 and 1 (often set to 0.99).  Since Adam divides the update by $\sqrt{\mathbf{v}}$, which weights will receive larger updates and which will receive smaller ones?  Give a simple example.  Why might this help with learning?

#### <font color="red">Write your answer here.</font> 

#### <font color="blue">Solution</font> 

Weights whose gradients have been large on average (large entries of $\mathbf{v}$) have their effective learning rate reduced, while weights whose gradients have been small have it increased.  This scaling makes the step size more uniform across parameters, so the optimizer is less prone to overshooting along steep directions or making negligible progress along flat ones.  As a simple example, suppose $\mathbf{v} = [9, 2, 1]$, so that $\sqrt{\mathbf{v}} = [3, \sqrt{2}, 1]$.  Dividing $\mathbf{m}$ elementwise by $\sqrt{\mathbf{v}}$ shrinks the first component by a factor of 3 and leaves the third unchanged, so the weight with the largest gradient magnitudes gets a relatively smaller update and the weight with the smallest gets a relatively larger one.
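
The scaling effect can also be checked numerically.  The sketch below implements the three update equations above on plain Python lists.  Two things are assumptions for illustration: the function name `adam_step`, and the small constant `eps` added to the denominator to avoid division by zero when $\mathbf{v}$ is zero (it does not appear in the formula as written here).  Like the formula in this section, the sketch omits bias correction.

```python
import math


def adam_step(theta, m, v, grad, alpha=0.001, beta1=0.9, beta2=0.99, eps=1e-8):
    """One elementwise update following the equations in this section.

    eps is an assumption added for numerical safety; it is not in the text's formula.
    """
    new_m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    new_v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    new_theta = [
        t - alpha * mi / (math.sqrt(vi) + eps)
        for t, mi, vi in zip(theta, new_m, new_v)
    ]
    return new_theta, new_m, new_v


# The example from the solution: v = [9, 2, 1] with a hypothetical m = [1, 1, 1].
scaled = [mi / math.sqrt(vi) for mi, vi in zip([1.0, 1.0, 1.0], [9.0, 2.0, 1.0])]
print(scaled)  # first entry shrinks by a factor of 3, last entry is unchanged

# Two weights whose (constant, hypothetical) gradients differ by a factor of 100.
theta, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
for _ in range(50):
    theta, m, v = adam_step(theta, m, v, grad=[10.0, 0.1])
print(theta)  # both weights have moved by nearly the same amount
```

With plain SGD the first weight would move 100 times farther than the second.  Here the division by $\sqrt{\mathbf{v}}$ cancels the gradient scale, so both weights take steps of nearly the same size.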

### b) Dropout

- (4 pts) Dropout is a regularization technique.  During training, dropout randomly sets units in the hidden layer $\mathbf{h}$ to zero with probability $p_{\text{drop}}$