
Commit

Added Adagrad (#92)
hitanshu-mehta committed Feb 14, 2020
1 parent ded4da1 commit fd0f120
Showing 2 changed files with 30 additions and 2 deletions.
8 changes: 7 additions & 1 deletion code/optimizers.py
@@ -7,7 +7,13 @@ def Adadelta(data):


def Adagrad(data):
    # Per-parameter adaptive learning rate: each weight's step is scaled by the
    # inverse square root of the sum of its past squared gradients.
    # NOTE: weights, lr, num_iterations, epsilon and compute_gradients are
    # expected to be defined elsewhere in this module.
    gradient_sums = np.zeros(weights.shape)
    for t in range(num_iterations):
        gradients = compute_gradients(data, weights)
        gradient_sums += gradients ** 2
        gradient_update = gradients / np.sqrt(gradient_sums + epsilon)
        weights = weights - lr * gradient_update
    return weights


def Adam(data):
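Since the committed Adagrad snippet in code/optimizers.py above relies on names defined elsewhere in the module, here is a minimal, self-contained sketch of the same update rule on a toy least-squares problem. The helper name adagrad_demo, the toy data and the hyperparameter values are illustrative assumptions, not part of this commit.

import numpy as np

def adagrad_demo(X, y, lr=0.1, num_iterations=500, epsilon=1e-8):
    """Run Adagrad on mean squared error for a small linear model."""
    weights = np.zeros(X.shape[1])
    gradient_sums = np.zeros_like(weights)
    for _ in range(num_iterations):
        # Gradient of mean((X @ weights - y)**2) with respect to weights.
        gradients = 2.0 / len(y) * X.T @ (X @ weights - y)
        gradient_sums += gradients ** 2
        # Scale each parameter's step by its accumulated gradient history.
        weights -= lr * gradients / np.sqrt(gradient_sums + epsilon)
    return weights

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
print(adagrad_demo(X, y))  # moves toward the least-squares solution [1., 2.]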
24 changes: 23 additions & 1 deletion docs/optimizers.rst
@@ -15,7 +15,29 @@ Be the first to `contribute! <https://github.com/bfortuner/ml-cheatsheet>`__
Adagrad
-------

Adagrad (short for adaptive gradient) adapts the learning rate for each parameter individually, based on that parameter's gradient history.

- Parameters with large gradients or frequent updates receive a smaller learning rate, so that we do not overshoot the minimum.
- Parameters with small gradients or infrequent updates receive a larger learning rate, so that they still train quickly.
- To do this, Adagrad divides the learning rate by the square root of the sum of squares of all previous gradients of that parameter.
- When the accumulated squared gradients are large, the learning rate is divided by a large value, so the effective learning rate becomes small.
- When the accumulated squared gradients are small, the learning rate is divided by a small value, so the effective learning rate stays large.
- In other words, each parameter's effective learning rate is inversely proportional to the square root of the sum of its squared past gradients; a small numeric illustration of this scaling follows the formula below.

.. math::

  g_{t}^{i} = \frac{\partial \mathcal{J}(w_{t}^{i})}{\partial w^{i}} \\
  w_{t+1}^{i} = w_{t}^{i} - \frac{\alpha}{\sqrt{\sum_{r=1}^{t}\left ( g_{r}^{i} \right )^{2} + \varepsilon}} \, g_{t}^{i}

.. note::

  - :math:`g_{t}^{i}` - the gradient of parameter :math:`w^{i}` at iteration :math:`t`
  - :math:`\alpha` - the learning rate
  - :math:`\varepsilon` - a very small value that avoids division by zero
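
To make this scaling concrete, here is a small, hypothetical calculation (the gradient histories below are invented for illustration and are not taken from the code in this commit) comparing the effective step size of a parameter with large past gradients against one with a single small gradient.

.. code-block:: python

  import numpy as np

  alpha, epsilon = 0.1, 1e-8

  # Hypothetical gradient histories for two parameters (illustrative values only).
  busy_param_grads = np.array([3.0, 4.0, 5.0])   # large, frequent gradients
  quiet_param_grads = np.array([0.1])            # one small gradient

  def effective_lr(past_grads):
      # Adagrad divides the base learning rate by the root of the accumulated squares.
      return alpha / np.sqrt(np.sum(past_grads ** 2) + epsilon)

  print(effective_lr(busy_param_grads))   # ~0.014: the step shrinks sharply
  print(effective_lr(quiet_param_grads))  # ~1.0:   the step stays comparatively large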

.. literalinclude:: ../code/optimizers.py
  :language: python
  :pyobject: Adagrad


Adam

0 comments on commit fd0f120
