# SGD

Stochastic gradient descent (SGD) is a simple way to search for the optimal parameters using the gradient of the loss with respect to those parameters.
The update rule below moves the parameters a fixed fraction of the gradient's length in the downhill direction (the negative gradient).
$$W \leftarrow W - \eta \frac{\partial L}{\partial W}$$
($W$ = weights, $\eta$ = learning rate, $\frac{\partial L}{\partial W}$ = gradient of the loss function with respect to $W$)

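class SGD:
    """Stochastic gradient descent: W <- W - lr * dL/dW."""

    def __init__(self, lr=0.01):
        self.lr = lr  # learning rate (eta in the equation above)

    def update(self, params, grads):
        # params and grads are dicts with matching keys; params is updated in place
        for key in params.keys():
            params[key] -= self.lr * grads[key]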
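
As a quick check of the update rule, the sketch below applies it to a toy loss $L = \sum W^2$, whose gradient is $2W$. The loss, the starting weights, the learning rate of 0.1, and the helper `toy_grads` are assumptions chosen for illustration; they do not come from the original notes.

```python
import numpy as np

# Toy setup (assumed for illustration): L(W) = sum(W**2), so dL/dW = 2 * W
params = {"W": np.array([1.0, -2.0])}
lr = 0.1

def toy_grads(params):
    # Hypothetical helper: analytic gradient of the toy loss
    return {key: 2.0 * val for key, val in params.items()}

# One step of W <- W - lr * dL/dW, applied in place as in SGD.update
grads = toy_grads(params)
for key in params.keys():
    params[key] -= lr * grads[key]
after_one_step = params["W"].copy()

# Repeating the step keeps shrinking W toward the minimum at 0
for _ in range(49):
    grads = toy_grads(params)
    for key in params.keys():
        params[key] -= lr * grads[key]
after_fifty_steps = params["W"].copy()

print(after_one_step)
print(np.abs(after_fifty_steps).max())
```

Each step here multiplies $W$ by $1 - 2\eta = 0.8$, so the weights decay geometrically toward zero.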

## Shortcoming of SGD

SGD follows an inefficient search path on anisotropic functions, that is, functions whose shape differs by direction, so that the gradient at a given coordinate does not necessarily point toward the minimum. Momentum, AdaGrad, and Adam are three methods that address this shortcoming.
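
To make the problem concrete, here is a minimal sketch using $f(x, y) = \frac{x^2}{20} + y^2$ as an example of an anisotropic function. The function, the starting point $(-7, 2)$, and the learning rate of 0.95 are assumptions picked for illustration and are not taken from the notes above.

```python
import numpy as np

def grad_f(x, y):
    # Gradient of the assumed example f(x, y) = x**2 / 20 + y**2
    return np.array([x / 10.0, 2.0 * y])

# At (-7, 2) the descent direction is mostly vertical, but the minimum (0, 0)
# lies mostly to the right, so the two directions disagree.
point = np.array([-7.0, 2.0])
descent_dir = -grad_f(*point)
to_minimum = -point
cosine = descent_dir @ to_minimum / (
    np.linalg.norm(descent_dir) * np.linalg.norm(to_minimum)
)
angle_deg = np.degrees(np.arccos(cosine))

# Plain SGD with a large step overshoots in y on every step
lr = 0.95
pos = point.copy()
ys = [pos[1]]
for _ in range(10):
    pos -= lr * grad_f(*pos)
    ys.append(pos[1])

print(f"angle between -gradient and direction to minimum: {angle_deg:.1f} degrees")
print("y after each step:", np.round(ys, 3))
```

With these values the sign of `y` flips on every step, which is the zigzag path that makes plain SGD inefficient on this kind of surface.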

# Momentum

Momentum keeps a velocity $v$ that combines the current gradient with the velocity accumulated over previous steps, and updates the parameters by that velocity.
Simply put, the direction of earlier updates carries over into the current step and helps steer the parameters in a better direction.
Intuitively, the parameters move like a ball rolling along the bottom of a bowl.
$$v \leftarrow \alpha v - \eta \frac{\partial L}{\partial W}$$
$$W \leftarrow W + v$$
($v$ = velocity, $\alpha$ = momentum coefficient, e.g. 0.9)

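import numpy as np

class Momentum:
    """Momentum SGD: v <- momentum * v - lr * dL/dW, then W <- W + v."""

    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr              # learning rate (eta)
        self.momentum = momentum  # momentum coefficient
        self.v = None             # per-parameter velocity, created on first update

    def update(self, params, grads):
        # Lazily create one zero-filled velocity array per parameter
        if self.v is None:
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)

        for key in params.keys():
            # Decay the old velocity, then add the current gradient step
            self.v[key] = self.momentum*self.v[key] - self.lr*grads[key]
            params[key] += self.v[key]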
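
The following sketch compares the two update rules on an assumed anisotropic test function, $f(x, y) = \frac{x^2}{20} + y^2$, starting from $(-7, 2)$ and running 30 steps with a learning rate of 0.1. The function, the start point, the step count, and the helpers `run_sgd` and `run_momentum` are all illustrative choices, not part of the original notes.

```python
import numpy as np

def grad_f(pos):
    # Gradient of the assumed example f(x, y) = x**2 / 20 + y**2
    return np.array([pos[0] / 10.0, 2.0 * pos[1]])

def run_sgd(start, lr, steps):
    # W <- W - lr * dL/dW
    pos = np.array(start, dtype=float)
    for _ in range(steps):
        pos -= lr * grad_f(pos)
    return pos

def run_momentum(start, lr, momentum, steps):
    # v <- momentum * v - lr * dL/dW, then W <- W + v
    pos = np.array(start, dtype=float)
    v = np.zeros_like(pos)
    for _ in range(steps):
        v = momentum * v - lr * grad_f(pos)
        pos += v
    return pos

start = (-7.0, 2.0)
sgd_end = run_sgd(start, lr=0.1, steps=30)
momentum_end = run_momentum(start, lr=0.1, momentum=0.9, steps=30)

# Distance from the minimum at (0, 0) after 30 steps
print("SGD:     ", np.linalg.norm(sgd_end))
print("Momentum:", np.linalg.norm(momentum_end))
```

In this particular setup plain SGD barely moves along the shallow `x` direction, while the accumulated velocity carries the momentum run much closer to the minimum. Other functions or hyperparameters may behave differently.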