# Adam Optimizer

## Core Formulas

1. First-moment estimate

$$
m_t = \beta_{1} m_{t-1} + (1-\beta_{1}) g_t,
$$

where $\beta_1$ is the decay coefficient of the momentum term, typically set to 0.9; $m_t$ is the first-moment estimate, and $g_t$ is the current gradient.

2. Second-moment estimate

$$
v_t = \beta_{2} v_{t-1} + (1-\beta_{2}) g_t^2
$$

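where $\beta_{2}$ is the decay coefficient of the adaptive learning-rate term, typically set to 0.999; $v_t$ is the second-moment estimate, and $g_t^2$ is the element-wise square of the gradient.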

3. Bias correction

$$
\hat{m}_t = \frac{m_t}{1-\beta_{1}^t}
$$

$$
\hat{v}_t = \frac{v_t}{1-\beta_{2}^t}
$$
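The effect of the correction is easiest to see with a constant gradient. Because $m_0 = 0$, the raw estimate $m_t$ starts far below the true gradient, and dividing by $1-\beta_1^t$ restores it. The snippet below is a minimal sketch of that behavior; the helper name `bias_corrected_trace` and the constant gradient value are assumptions made for illustration only.

```python
def bias_corrected_trace(g: float, beta1: float = 0.9, steps: int = 5) -> list[tuple[float, float]]:
    """Return (m_t, m_hat_t) pairs for a constant gradient g (hypothetical helper)."""
    m = 0.0  # the first-moment estimate is initialized to zero
    trace = []
    for t in range(1, steps + 1):
        m = beta1 * m + (1 - beta1) * g   # raw moving average, biased toward 0
        m_hat = m / (1 - beta1 ** t)      # bias-corrected estimate
        trace.append((m, m_hat))
    return trace


for t, (m, m_hat) in enumerate(bias_corrected_trace(g=2.0), start=1):
    print(f"t={t}  m_t={m:.4f}  m_hat_t={m_hat:.4f}")
```

With a constant gradient, $m_t = (1-\beta_1^t)\,g$, so the corrected value $\hat{m}_t$ equals $g$ at every step, while the raw $m_t$ only approaches $g$ gradually.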

4. Parameter update

$$
\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

where $\alpha$ is the learning rate and $\epsilon$ is a smoothing term that prevents division by zero, typically set to 1e-8.
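The four steps can be chained into a single loop for a scalar parameter. The following is a minimal sketch, not a reference implementation: the function name `adam_minimize`, the test objective $f(\theta) = \theta^2$, and the choice of 500 steps with a learning rate of 0.01 are all assumptions made for this example.

```python
import math
from typing import Callable


def adam_minimize(
    grad_fn: Callable[[float], float],
    theta: float,
    steps: int,
    lr: float = 0.001,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
) -> float:
    """Run `steps` Adam updates on a scalar parameter (hypothetical helper)."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g        # 1. first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # 2. second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # 3. bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)  # 4. parameter update
    return theta


# Assumed objective for illustration: f(theta) = theta^2, so the gradient is 2 * theta.
theta_final = adam_minimize(lambda x: 2.0 * x, theta=1.0, steps=500, lr=0.01)
print(theta_final)
```

At $t = 1$ the bias-corrected estimates reduce to $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so the first update has magnitude close to $\alpha$ regardless of the gradient's scale.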

In [None]:
import math

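class Adam:
    """Minimal Adam optimizer over a flat list of float parameters."""

    def __init__(
        self,
        params: list[float],
        learning_rate: float = 0.001,
        beta1: float = 0.9,
        beta2: float = 0.999,
        epsilon: float = 1e-8,
    ) -> None:
        self.params = params
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon

        # Both moment estimates start at zero, which is why bias correction is needed.
        self.m = [0.0] * len(params)
        self.v = [0.0] * len(params)
        self.t = 0  # number of update steps taken so far

    def step(self, gradients: list[float]) -> None:
        """Apply one Adam update to self.params in place, given the current gradients."""
        self.t += 1
        for i, g in enumerate(gradients):
            # Exponential moving averages of the gradient and the squared gradient.
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g ** 2
            # Bias-corrected moment estimates.
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            # Update the parameter, not the gradient.
            self.params[i] -= self.learning_rate * m_hat / (math.sqrt(v_hat) + self.epsilon)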
