# ADAptive LInear NEuron -> Adaline

Adaline is an **improvement** on the Perceptron algorithm. It first introduced a **linear (differentiable) activation function** and **gradient descent** into the neuron model, laying a mathematical foundation for modern deep learning.

Adaline has four parts:

```mermaid
graph LR
    subgraph Input
        C["Input vector X"]
        A["Bias (x₀=1)"]
    end

    subgraph "Parameters(Weights)"
        D["Weight vector W"]
        B["Bias weight w₀"]
    end

    subgraph Computation
        E["Net input<br/>z = w₀ + W·X"]
        F["Activation function<br/>(e.g., linear)"]
        G["Threshold function<br/>(e.g., sign(z))"]
    end

    subgraph Output
        H["Classification result"]
    end

    %% Forward pass (solid lines)
    A --> B
    C --> D
    B --> E
    D -->|dot product| E
    E --> F
    E --> G
    G --> H

    %% Feedback update (dashed lines: error feedback during training)
    F -.->|Update w₀<br/>by error| B
    F -.->|Update W<br/>by error| D

    %% Styling (optional)
    classDef forward fill:#e6f3ff,stroke:#333;
    classDef feedback stroke-dasharray: 5 5,stroke:#999;
    class A,B,C,D,E,F,G,H forward
    class F,B,D feedback

```

## 1. Model structure and Mathematical principles
+ **Activation function**:
    $$
    \phi(z) = z = \vec{w}^T\vec{x} + b
    $$
+ **Decision function**:
  + Same as Perceptron
    $$
    \hat{y}=
    \begin{cases}
    +1 &\text{if } z\ge 0\\
    -1 &\text{otherwise}
    \end{cases}
    $$
+ **Cost function**
  + We use the **Mean Squared Error (MSE)** to measure how well the model fits the training set,
    + i.e. the mean over all samples of the squared difference between the label $y^{(i)}$ and the activation $\phi(z^{(i)})$.
  + For each sample $(\vec{x}^{(i)},\ y^{(i)}),y^{(i)}\in\set{+1,-1}$, we define:
    Squared Error for the i-th sample:
    $$
        l^{(i)} = \frac{1}{2}(y^{(i)}-\phi(z^{(i)}))^2
    $$
    Sum of Squared Errors, **SSE**
    $$
        J = \sum_i l^{(i)} = \frac{1}{2}\sum_i(y^{(i)}-\phi(z^{(i)}))^2
    $$
    Mean Squared Error, **MSE**
    $$
        \mathcal{L} = \frac{J}{n} = \frac{1}{2n}\sum_i(y^{(i)}-\phi(z^{(i)}))^2
    $$
  + Note that the MSE is **differentiable** and **convex** in the parameters $\vec{w}$ and $b$.
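
The definitions above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the toy data are assumptions, and the mean cost is taken as $J/n$ (keeping the factor $\frac{1}{2}$ from the per-sample loss).

```python
def net_input(w, b, x):
    """z = w^T x + b for one sample x (a list of floats)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def activation(z):
    """Linear (identity) activation: phi(z) = z."""
    return z

def predict(w, b, x):
    """Decision function: +1 if phi(z) >= 0, otherwise -1."""
    return 1 if activation(net_input(w, b, x)) >= 0.0 else -1

def sse(w, b, X, y):
    """J = 1/2 * sum_i (y_i - phi(z_i))^2."""
    return 0.5 * sum(
        (yi - activation(net_input(w, b, xi))) ** 2 for xi, yi in zip(X, y)
    )

def mse(w, b, X, y):
    """Mean cost L = J / n (assumption: the 1/2 factor is kept)."""
    return sse(w, b, X, y) / len(X)

# Toy data, made up for illustration only.
X = [[1.0, 2.0], [-1.0, -1.0]]
y = [1, -1]
w, b = [0.5, 0.25], -0.5

print([predict(w, b, xi) for xi in X])
print(sse(w, b, X, y), mse(w, b, X, y))
```

Note that the cost is computed from the continuous activation $\phi(z)$, not from the thresholded prediction; this is what makes it differentiable.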

With these definitions, we turn to the most important step of Adaline:

## 2. Parameter update
+ Gradient computation:
    $$
    \frac{\partial l^{(i)}}{\partial w_j} = -(y^{(i)} - \phi(z^{(i)}))x_j^{(i)},\qquad
    \frac{\partial l^{(i)}}{\partial b} = -(y^{(i)} - \phi(z^{(i)}))
    $$
+ So we only need to move the weights in the direction of the negative gradient. But there is a question: should we update the weights after every sample, or collect a batch of samples and update once? There are three ways to choose the samples:
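
Before looking at the update schemes, the per-sample gradient formula can be sanity-checked numerically. The sketch below compares the analytic gradient of $l^{(i)}$ with a central finite-difference estimate; the function names, the step `eps` and the sample values are assumptions chosen for illustration.

```python
def sample_loss(w, b, x, y):
    """l = 1/2 * (y - z)^2 with z = w^T x + b (linear activation)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 0.5 * (y - z) ** 2

def analytic_grad(w, b, x, y):
    """Gradient from the formula: dl/dw_j = -(y - z) * x_j, dl/db = -(y - z)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    err = y - z
    return [-err * xj for xj in x], -err

def numeric_grad(w, b, x, y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad_w = []
    for j in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        diff = sample_loss(w_plus, b, x, y) - sample_loss(w_minus, b, x, y)
        grad_w.append(diff / (2 * eps))
    diff_b = sample_loss(w, b + eps, x, y) - sample_loss(w, b - eps, x, y)
    return grad_w, diff_b / (2 * eps)

# One made-up sample and parameter setting.
w, b, x, y = [0.5, -0.25], 0.1, [1.0, 2.0], 1
print(analytic_grad(w, b, x, y))
print(numeric_grad(w, b, x, y))
```

The two results should agree up to floating-point error, since the loss is quadratic in the parameters.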

### 2.1 Stochastic Gradient Descent
We update the weights after each sample, i.e.:
$$
\begin{aligned}
&\vec{w}\leftarrow \vec{w} - \eta\cdot\nabla_w l^{(i)}(\vec{w}) = \vec{w} + \eta(y^{(i)} - \phi(z^{(i)}))\vec{x}^{(i)}\\
&b\leftarrow b + \eta(y^{(i)} - \phi(z^{(i)}))
\end{aligned}
$$
where $\eta$ is the **learning rate**, a hyperparameter.

+ Because SGD updates the weights after every sample, the iterates may bounce from side to side around the bottom of the cost surface instead of settling there. The same noise sometimes lets it escape a local optimum (this matters for non-convex costs; the Adaline MSE itself is convex).
+ So SGD performs well on some data sets, although pure per-sample updates are rarely used in practice today.
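
A minimal sketch of the per-sample update, in plain Python. The data set, the learning rate `eta=0.05`, the epoch count and the helper names are all assumptions for illustration; shuffling the sample order each epoch is a common convention rather than something the update rule requires.

```python
import random

def net_input(w, b, x):
    """z = w^T x + b for one sample."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def mse(w, b, X, y):
    """Mean cost J / n, with J = 1/2 * sum of squared errors."""
    total = sum(0.5 * (yi - net_input(w, b, xi)) ** 2 for xi, yi in zip(X, y))
    return total / len(X)

def sgd_epoch(w, b, X, y, eta, rng):
    """One pass over the data, updating w and b after every sample."""
    order = list(range(len(X)))
    rng.shuffle(order)  # visit the samples in a fresh random order
    for i in order:
        err = y[i] - net_input(w, b, X[i])  # y - phi(z), phi is the identity
        w = [wj + eta * err * xj for wj, xj in zip(w, X[i])]
        b = b + eta * err
    return w, b

# Toy linearly separable data, made up for illustration.
X = [[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]

rng = random.Random(0)
w, b = [0.0, 0.0], 0.0
for _ in range(50):
    w, b = sgd_epoch(w, b, X, y, eta=0.05, rng=rng)

print(mse(w, b, X, y))
```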

### 2.2 Batch Gradient Descent
We use the MSE over the whole training set as the objective function to update the parameters, i.e.
$$
\begin{aligned}
&\vec{w}\leftarrow \vec{w} - \eta\cdot\nabla_w\mathcal{L}(\vec{w}) = \vec{w} + \eta\frac{1}{n}\sum_i(y^{(i)}-\phi(z^{(i)}))\vec{x}^{(i)}\\
&b\leftarrow b + \eta\frac{1}{n}\sum_i(y^{(i)} - \phi(z^{(i)}))
\end{aligned}
$$

+ Pros:
  + Each step uses the exact gradient of the cost, so the descent direction is the most accurate.
+ Cons:
  + Inefficient: every single update needs a full pass over the data set, which costs a lot of computation.
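
The full-batch update can be sketched as follows, again in plain Python. The helper names, the toy data and `eta=0.1` are assumptions for illustration; the errors of all $n$ samples are averaged before a single update is applied.

```python
def net_input(w, b, x):
    """z = w^T x + b for one sample."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def mse(w, b, X, y):
    """Mean cost J / n, with J = 1/2 * sum of squared errors."""
    total = sum(0.5 * (yi - net_input(w, b, xi)) ** 2 for xi, yi in zip(X, y))
    return total / len(X)

def batch_gd_step(w, b, X, y, eta):
    """One update computed from the errors of the whole data set."""
    n = len(X)
    errors = [yi - net_input(w, b, xi) for xi, yi in zip(X, y)]
    # Negative gradient of the mean cost: (1/n) * sum_i err_i * x_i
    step_w = [sum(e * xi[j] for e, xi in zip(errors, X)) / n for j in range(len(w))]
    step_b = sum(errors) / n
    w = [wj + eta * sj for wj, sj in zip(w, step_w)]
    b = b + eta * step_b
    return w, b

# Toy linearly separable data, made up for illustration.
X = [[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]

w, b = [0.0, 0.0], 0.0
history = [mse(w, b, X, y)]
for _ in range(20):
    w, b = batch_gd_step(w, b, X, y, eta=0.1)
    history.append(mse(w, b, X, y))

print(history[0], history[-1])
```

With a sufficiently small learning rate, the recorded cost should not increase from one step to the next, which is the practical meaning of "most accurate direction".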

### 2.3 Mini-batch Gradient Descent
We use a small set of $m$ samples ($1<m<n$) as a batch to compute the gradient, balancing accuracy and efficiency. Here $m$ denotes the batch size, while $b$ remains the bias. Let $\mathcal{B}$ be the index set of the current batch, with $|\mathcal{B}| = m$:
$$
\begin{aligned}
&\vec{w}\leftarrow \vec{w} - \eta\cdot\nabla_w\mathcal{L}_{\mathcal{B}}(\vec{w}) = \vec{w} + \eta\frac{1}{m}\sum_{i\in \mathcal{B}}(y^{(i)}-\phi(z^{(i)}))\vec{x}^{(i)}\\
&b\leftarrow b + \eta\frac{1}{m}\sum_{i\in \mathcal{B}}(y^{(i)} - \phi(z^{(i)}))
\end{aligned}
$$

+ The choice of $m$:
  + Rule of thumb:
    + Start with 32
    + Then try powers of two: 32, 64, 128, 256, ...
+ Mini-batch gradient descent is the mainstream choice in modern deep learning (CNN/RNN/Transformer).
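
A minimal mini-batch sketch in plain Python. The tiny data set forces a batch size of 2 here, far below the sizes suggested above; that value, `eta=0.1`, the epoch count and the helper names are assumptions chosen only to keep the example small.

```python
import random

def net_input(w, b, x):
    """z = w^T x + b for one sample."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def mse(w, b, X, y):
    """Mean cost J / n, with J = 1/2 * sum of squared errors."""
    total = sum(0.5 * (yi - net_input(w, b, xi)) ** 2 for xi, yi in zip(X, y))
    return total / len(X)

def minibatch_epoch(w, b, X, y, eta, batch_size, rng):
    """One pass over the data, with one update per mini-batch."""
    order = list(range(len(X)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]  # last batch may be smaller
        errors = [y[i] - net_input(w, b, X[i]) for i in batch]
        # Average the per-sample directions over the batch only.
        step_w = [
            sum(e * X[i][j] for e, i in zip(errors, batch)) / len(batch)
            for j in range(len(w))
        ]
        step_b = sum(errors) / len(batch)
        w = [wj + eta * sj for wj, sj in zip(w, step_w)]
        b = b + eta * step_b
    return w, b

# Toy linearly separable data, made up for illustration.
X = [[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]

rng = random.Random(0)
w, b = [0.0, 0.0], 0.0
for _ in range(200):
    w, b = minibatch_epoch(w, b, X, y, eta=0.1, batch_size=2, rng=rng)

print(mse(w, b, X, y))
```

Setting the batch size to 1 reduces this to the per-sample scheme, and setting it to the full data set size reduces it to batch gradient descent.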