# 📘 Theoretical Reference: The Mathematics of Volatility

## 1\. The Benchmark: GARCH(1,1)

**Generalized Autoregressive Conditional Heteroskedasticity** (Bollerslev, 1986) is the industry standard for modeling volatility clustering ("large changes tend to be followed by large changes, and small changes by small changes").

### 1.1 The Core Process

We model the returns $r_t$ (assumed to have zero mean, so the residual is simply $\epsilon_t = r_t$) as:
$$r_t = \sigma_t z_t$$
where $z_t \sim N(0,1)$ is an i.i.d. standard normal innovation (white noise).

The conditional variance $\sigma^2_t$ evolves according to the **GARCH(1,1)** recursion:

$$
\sigma^2_t = \omega + \alpha \epsilon^2_{t-1} + \beta \sigma^2_{t-1}
$$

  * **$\omega$ (Omega):** The baseline variance constant.
  * **$\epsilon^2_{t-1}$ (Lagged Squared Residual):** Represents the "Shock" or "News" from yesterday. The coefficient $\alpha$ measures the **reaction speed** to market events.
  * **$\sigma^2_{t-1}$ (Lagged Variance):** Represents the "Memory" of the system. The coefficient $\beta$ measures the **persistence** of volatility.
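
The recursion above can be sketched in a few lines of plain Python. This is a minimal illustration, not a fitted model: the function name `garch11_variance`, the parameter values, the return series, and the choice of `initial_variance` are all assumptions made for the example.

```python
def garch11_variance(returns, omega, alpha, beta, initial_variance):
    """Return the conditional variances sigma^2_t implied by a return series."""
    variances = [initial_variance]          # sigma^2 for the first observation (assumed)
    for r_prev in returns[:-1]:
        # Zero-mean returns, so yesterday's residual is yesterday's return.
        shock = alpha * r_prev ** 2         # reaction to yesterday's news
        memory = beta * variances[-1]       # persistence of yesterday's variance
        variances.append(omega + shock + memory)
    return variances


# Hypothetical daily returns with one large shock on the third day (index 2).
returns = [0.01, -0.005, 0.06, 0.002, -0.003]
sigma2 = garch11_variance(returns, omega=1e-6, alpha=0.10, beta=0.85,
                          initial_variance=1e-4)
for t, (r, v) in enumerate(zip(returns, sigma2)):
    print(f"t={t}  r={r:+.3f}  sigma^2={v:.6f}")
```

Because $\sigma^2_t$ depends on the previous day's return, the shock at index 2 shows up in the variance at index 3 and then decays at rate $\beta$.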

### 1.2 Key Constraints (The "Sanity Checks")

For the model to be physically valid in finance, we must satisfy:

1.  **Positivity:** $\omega > 0, \alpha \ge 0, \beta \ge 0$. (Variance cannot be negative).
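2.  **Stationarity:** $\alpha + \beta < 1$ (covariance stationarity: the unconditional variance is finite).
      * If $\alpha + \beta = 1$, it is an **IGARCH** (Integrated GARCH) model: shocks to the variance forecast never decay (infinite memory), and the unconditional variance is not finite.
      * If $\alpha + \beta > 1$, variance forecasts grow without bound as the horizon increases.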
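
These checks are easy to automate before trusting a fitted parameter set. The helper below is a sketch: the function name `classify_garch11`, the returned labels, and the tolerance used to detect $\alpha + \beta = 1$ in floating point are assumptions made for illustration.

```python
def classify_garch11(omega, alpha, beta, tol=1e-12):
    """Label a GARCH(1,1) parameter set using the positivity and stationarity rules."""
    if omega <= 0 or alpha < 0 or beta < 0:
        return "invalid: positivity violated"
    persistence = alpha + beta
    if abs(persistence - 1.0) <= tol:       # exact equality is fragile with floats
        return "IGARCH"
    if persistence < 1.0:
        return "stationary"
    return "explosive"


print(classify_garch11(1e-6, 0.10, 0.85))
print(classify_garch11(1e-6, 0.10, 0.90))
print(classify_garch11(1e-6, 0.20, 0.90))
```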

### 1.3 Long-Run Variance

If the process is stationary, the conditional variance mean-reverts to the unconditional (long-run) variance $V_L$:

$$
V_L = \mathbb{E}[\sigma^2] = \frac{\omega}{1 - \alpha - \beta}
$$

*Why this matters:* In stress testing, we often check if the current volatility is significantly higher than $V_L$ to detect crisis regimes.
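
As a worked example of the formula and of the stress-test comparison, the sketch below computes $V_L$ and the ratio of a current variance estimate to it. The parameter values and `current_variance` are hypothetical, and the level at which the ratio counts as a "crisis" is a modelling choice that this document does not specify.

```python
import math


def long_run_variance(omega, alpha, beta):
    """Unconditional variance V_L; defined only for alpha + beta < 1."""
    persistence = alpha + beta
    if persistence >= 1:
        raise ValueError("V_L is not finite when alpha + beta >= 1")
    return omega / (1.0 - persistence)


# Hypothetical daily parameters and a hypothetical current variance estimate.
omega, alpha, beta = 2e-6, 0.10, 0.85
current_variance = 1.6e-4

v_l = long_run_variance(omega, alpha, beta)
ratio = current_variance / v_l
print(f"V_L = {v_l:.2e}, daily vol = {math.sqrt(v_l):.4f}, current/V_L = {ratio:.1f}")
```

With these numbers $V_L = 2 \times 10^{-6} / 0.05 = 4 \times 10^{-5}$, so the current variance is about four times its long-run level.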

### 1.4 Maximum Likelihood Estimation (MLE)

We find parameters $(\omega, \alpha, \beta)$ by maximizing the log-likelihood of the observed returns. Assuming normally distributed innovations $z_t$:

$$
\mathcal{L} = -\frac{1}{2} \sum_{t=1}^{T} \left( \ln(2\pi) + \ln(\sigma^2_t) + \frac{r_t^2}{\sigma^2_t} \right)
$$

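*Note:* The term $\frac{r_t^2}{\sigma^2_t}$ is crucial. It penalizes variance forecasts that are too low, while $\ln(\sigma^2_t)$ penalizes forecasts that are too high. The balance between the two pushes the ratio of "Actual Squared Return" to "Predicted Variance" toward 1 on average.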
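
The balancing act in the likelihood can be checked numerically in the simplest possible case: a constant variance. The sketch below evaluates the log-likelihood for three candidate variances. The function name `gaussian_log_likelihood` and the return series are assumptions made for illustration; a real fit would feed in the GARCH variance path and hand the function to a numerical optimizer.

```python
import math


def gaussian_log_likelihood(returns, variances):
    """Gaussian log-likelihood of zero-mean returns given per-period variances."""
    total = 0.0
    for r, v in zip(returns, variances):
        total += math.log(2 * math.pi) + math.log(v) + r ** 2 / v
    return -0.5 * total


# Hypothetical returns; mean_sq is the average squared return.
returns = [0.01, -0.02, 0.015, -0.005, 0.03]
mean_sq = sum(r ** 2 for r in returns) / len(returns)

# Compare a variance that is too low, matched, and too high.
for scale in (0.5, 1.0, 2.0):
    variances = [scale * mean_sq] * len(returns)
    print(f"variance = {scale:.1f} x mean(r^2): "
          f"logL = {gaussian_log_likelihood(returns, variances):.4f}")
```

For a constant variance the likelihood peaks exactly where the variance equals the average squared return, which is the "ratio close to 1" behaviour in its simplest form.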

-----

## 2\. The Challenger: Long Short-Term Memory (LSTM)

Standard Recurrent Neural Networks (RNNs) suffer from the **Vanishing Gradient Problem**: gradients shrink as they are propagated back through many time steps, so the network struggles to learn dependencies across long time lags (e.g., a shock 30 days ago affecting today). LSTMs mitigate this with a "Gating" mechanism.
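
To make the vanishing gradient concrete, here is a toy sketch using a one-dimensional RNN, $h_t = \tanh(w\,h_{t-1} + x_t)$. The scalar setup, the weight `0.5`, and the constant inputs are assumptions chosen only to keep the arithmetic visible; real networks involve products of Jacobian matrices rather than scalars.

```python
import math


def gradient_through_time(weight, inputs, h0=0.0):
    """d h_T / d h_0 for the scalar RNN h_t = tanh(weight * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(weight * h + x)
        # Chain rule: d h_t / d h_{t-1} = weight * (1 - h_t^2).
        grad *= weight * (1.0 - h ** 2)
    return grad


for lag in (1, 10, 30):
    g = gradient_through_time(weight=0.5, inputs=[0.1] * lag)
    print(f"lag = {lag:2d}  gradient = {g:.3e}")
```

Each step multiplies the gradient by a factor below 0.5 here, so after 30 steps the influence of the initial state is numerically negligible.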

### 2.1 The Gates

The LSTM cell maintains a **Cell State ($C_t$)** (Long-term memory) and a **Hidden State ($h_t$)** (Short-term output).

1.  **Forget Gate ($f_t$):** "What should I forget from the past?"
    $$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

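      * Here $\sigma$ denotes the Sigmoid function (not the volatility $\sigma_t$ of Section 1); its output lies between 0 and 1. A value of 0 erases the corresponding piece of memory, and a value of 1 keeps it in full.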

2.  **Input Gate ($i_t$):** "What new information matters?"

    $$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
    Where $W_i$ are the weights, $h_{t-1}$ is the previous hidden state, $x_t$ is the current input, and $b_i$ are the biases.
    $$ \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
    (Candidate new memory)

3.  **Cell State Update:** The core engineering.

    $$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$

      * *Interpretation:* Old memory scaled by the "forget factor", plus candidate memory scaled by the "input importance." Here $*$ denotes element-wise multiplication.

4.  **Output Gate ($o_t$):** "What should I reveal to the next step?"
    $$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
    $$h_t = o_t * \tanh(C_t)$$
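
The four steps above can be written out as a single forward step in NumPy. This is a sketch for reading alongside the equations, not a trainable implementation: the function name `lstm_step`, the dictionary layout of the weights, the hidden size, the random initial weights, and the use of squared returns as the input feature are all assumptions.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W[k]: (hidden, hidden + inputs); b[k]: (hidden,)."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])         # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])         # input gate
    c_tilde = np.tanh(W["C"] @ z + b["C"])     # candidate new memory
    c_t = f_t * c_prev + i_t * c_tilde         # cell state update (element-wise)
    o_t = sigmoid(W["o"] @ z + b["o"])         # output gate
    h_t = o_t * np.tanh(c_t)                   # hidden state
    return h_t, c_t


rng = np.random.default_rng(0)
hidden, inputs = 4, 1
W = {k: rng.normal(scale=0.1, size=(hidden, hidden + inputs)) for k in "fiCo"}
b = {k: np.zeros(hidden) for k in "fiCo"}

# Feed a short, hypothetical sequence of squared returns through the cell.
h, c = np.zeros(hidden), np.zeros(hidden)
for r in [0.01, -0.005, 0.06]:
    h, c = lstm_step(np.array([r ** 2]), h, c, W, b)
print(h.shape, c.shape)
```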

### 2.2 Why LSTMs for Volatility?

  * **Non-Linearity:** Unlike GARCH(1,1), whose conditional variance is a linear function of the lagged squared residual and the lagged variance, LSTMs can approximate complex non-linear functions (neural networks are universal approximators, although this does not guarantee that training finds a good fit).
  * **Regime Switching:** The gating mechanism allows the model to "switch" behavior. For example, during a crash, the **Input Gate** might open fully to react to new shocks, while in calm markets the **Forget Gate** might stay close to 1 so that the existing memory carries forward.

-----

## 3\. The Loss Function: QLIKE (Quasi-Likelihood)

This is the "Bridge" between AI and Econometrics. We do not use MSE. Instead, we use a likelihood-based loss derived from the Gaussian likelihood of returns given the predicted variance.

### 3.1 Derivation

If we assume returns follow $r_t \sim N(0, \sigma^2_t)$, the negative log-likelihood of one observation (up to an additive constant and an overall factor of $\frac{1}{2}$) is:

$$
\text{Loss} = \ln(\sigma^2_t) + \frac{r_t^2}{\sigma^2_t}
$$
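
For completeness, the intermediate steps can be written out as follows, starting from the Gaussian density of a single return. This only expands the assumption stated above.

$$
p(r_t \mid \sigma^2_t) = \frac{1}{\sqrt{2\pi\sigma^2_t}} \exp\left( -\frac{r_t^2}{2\sigma^2_t} \right)
$$

$$
-\ln p(r_t \mid \sigma^2_t) = \frac{1}{2}\ln(2\pi) + \frac{1}{2}\ln(\sigma^2_t) + \frac{r_t^2}{2\sigma^2_t}
$$

Dropping the constant $\frac{1}{2}\ln(2\pi)$ and multiplying by 2 changes neither the minimizer nor the ranking of forecasts, and leaves $\ln(\sigma^2_t) + \frac{r_t^2}{\sigma^2_t}$.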

In our code, we implemented this as:

```python
loss = torch.log(pred_var) + (target_sq_ret / pred_var)
```
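
A self-contained version of that line might look like the sketch below. The function name `qlike_loss`, the batch averaging, and the `eps` floor are assumptions added for illustration and are not part of the original one-liner. The floor guards against `log(0)` and division by zero when a network outputs a variance at or near zero.

```python
import torch


def qlike_loss(pred_var, target_sq_ret, eps=1e-8):
    """Mean QLIKE over a batch; eps is a hypothetical variance floor."""
    pred_var = torch.clamp(pred_var, min=eps)   # avoid log(0) and division by zero
    return (torch.log(pred_var) + target_sq_ret / pred_var).mean()


# Hypothetical predicted variances and realized squared returns.
pred_var = torch.tensor([1e-4, 2e-4, 4e-4])
target_sq_ret = torch.tensor([1e-4, 1e-4, 9e-4])
print(qlike_loss(pred_var, target_sq_ret).item())
```

In practice a model would more likely enforce positivity at the output layer (for example with a softplus or exponential activation); the clamp here is only a last-resort guard.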

### 3.2 Gradient Analysis (Why it's safer)

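Let's look at the gradient (derivative) of the per-observation loss $L$ with respect to the predicted variance $h = \sigma^2_t$ (not to be confused with the LSTM hidden state $h_t$), writing $y = r_t$ for the realized return: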

$$
\frac{\partial L}{\partial h} = \frac{1}{h} - \frac{y^2}{h^2} = \frac{1}{h} \left( 1 - \frac{y^2}{h} \right)
$$

  * **Case A: Over-estimation ($h > y^2$):** The model predicts high risk, but reality is calm. The gradient is positive but small (it is always below $\frac{1}{h}$), so the model gently corrects down.
  * **Case B: Under-estimation ($h < y^2$):** The model predicts calm, but reality crashes ($y^2$ is huge). The gradient is negative and the term $\frac{y^2}{h^2}$ becomes **massive**.
  * **Result:** In Case B the gradient magnitude grows without bound as $h$ shrinks relative to $y^2$. The model is "screamed at" by the optimizer to increase variance immediately. This asymmetry makes QLIKE inherently **risk-averse**: under-predicting risk is penalized far more heavily than over-predicting it, which is exactly what we want for a VaR model.
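
The asymmetry is easy to see with numbers. The sketch below evaluates the analytic gradient for a realized squared return of 1 and predictions that are off by the same factor in each direction. The function names and the chosen values are assumptions made for illustration.

```python
import math


def qlike(h, y_sq):
    """Per-observation QLIKE loss for predicted variance h and squared return y_sq."""
    return math.log(h) + y_sq / h


def qlike_grad(h, y_sq):
    """Analytic derivative of the loss with respect to h."""
    return 1.0 / h - y_sq / h ** 2


y_sq = 1.0
for h in (4.0, 2.0, 1.0, 0.5, 0.25):
    print(f"h = {h:<5} loss = {qlike(h, y_sq):.4f}  gradient = {qlike_grad(h, y_sq):+.4f}")
```

Over-estimating by a factor of 4 ($h = 4$) gives a gradient of $+0.1875$, while under-estimating by the same factor ($h = 0.25$) gives $-12$, a push 64 times stronger in the direction of higher variance.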