
LLM and LLM's laws lay hid in night:
Nature said, 'Let Lao Wang be!' and AI was light.

TL;DR: By applying SVD to the Q/K weight matrices of LLMs, we discovered two universal cross-model regularities — r = 1 (Spectral Linear Alignment) and SSR → 0 (Spectral Shape Fidelity). No inference, no benchmarks — just inspect the weights to assess a model's reasoning capability.

Mathematical Foundations of Large Language Models (MF-LLM)

"Die Mathematischen Grundlagen der Künstlichen Intelligenz" (To John von Neumann)


中文 | English | Read the Whitepaper


Overview

Wang's Five Laws provide a static, reproducible framework for evaluating the reasoning capability of LLMs from attention weights alone.

Wang's Five Laws

1️⃣ First Law — Spectral Linear Alignment

Statement:
Query (Q) and Key (K) singular-value spectra are linearly correlated:

$$ r(s_q, s_k) \to r_\text{Wang} = 1 $$

  • Wang's Constant = 1 (theoretical extreme)
  • Observed in practice: 0.94–0.99
  • High Pearson correlation ensures stable information propagation in deep layers.
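
A minimal sketch of how the First Law can be checked, assuming a Hugging Face-style checkpoint whose decoder layers expose `self_attn.q_proj` and `self_attn.k_proj` (module names vary across families; the repository's `check-*.py` scripts are the canonical versions):

```python
# Sketch: per-layer Pearson correlation between Q and K singular-value spectra.
# Assumes LLaMA/Qwen-style module names; adjust for other architectures.
import numpy as np
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/model", torch_dtype=torch.float32)

for idx, layer in enumerate(model.model.layers):
    s_q = torch.linalg.svdvals(layer.self_attn.q_proj.weight.detach())  # Q spectrum
    s_k = torch.linalg.svdvals(layer.self_attn.k_proj.weight.detach())  # K spectrum
    n = min(s_q.numel(), s_k.numel())  # GQA: K can have fewer output dims than Q
    r = np.corrcoef(s_q[:n].numpy(), s_k[:n].numpy())[0, 1]
    print(f"layer {idx:2d}: Pearson(s_q, s_k) = {r:.4f}")
```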

Empirical Evidence:

| Model | Median Pearson | Mean Pearson | Median SSR | Mean SSR | Layers |
|---|---|---|---|---|---|
| gemma-4-e2b | 0.9183 | 0.9242 | 0.015702 | 0.013537 | 35 |
| gemma-4-e4b | 0.9585 | 0.9411 | 0.009747 | 0.01008 | 42 |
| llama-3-8b | 0.9813 | 0.9737 | 0.006196 | 0.007009 | 32 |
| Qwen2.5-14B | 0.9795 | 0.9710 | 0.006077 | 0.00671 | 48 |
| DeepSeek-R1 | 0.9800 | 0.9714 | 0.005948 | 0.006585 | 48 |

Note: the stronger the reasoning model, the closer r → 1 and SSR → 0.


2️⃣ Second Law — Spectral Shape Fidelity

Statement:
Normalized spectral mismatch between Q and K decreases in deep layers:

$$ \text{SSR} = \frac{1}{d_h} \sum_i \left| \tilde s_{q,i} - \tilde s_{k,i} \right|, \quad \tilde s = \frac{s}{\lVert s \rVert_2} $$

  • Wang's Second Constant = 0 (theoretical extreme, ideal SSR)
  • Observed in practice: ~0.006–0.007
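
A matching sketch of the SSR computation, reusing spectra `s_q`, `s_k` extracted as in the First-Law snippet (the length-alignment step for grouped-query attention is one possible convention, not taken from the repository):

```python
# SSR per the Second Law: L2-normalize each spectrum, then mean absolute gap.
import torch

def ssr(s_q: torch.Tensor, s_k: torch.Tensor) -> float:
    n = min(s_q.numel(), s_k.numel())  # align lengths (one convention under GQA)
    sq_t = s_q[:n] / s_q[:n].norm()    # \tilde{s} = s / ||s||_2
    sk_t = s_k[:n] / s_k[:n].norm()
    return (sq_t - sk_t).abs().mean().item()
```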

Interpretation:

  • SSR measures shape alignment of Q/K spectra beyond linear correlation.
  • Lower SSR indicates higher reasoning fidelity.
  • RL-tuned models systematically reduce deep-layer SSR.

Empirical Evidence (Qwen2.5 vs DeepSeek-R1):

| Layer Group | Qwen2.5 SSR | DeepSeek-R1 SSR | Improvement |
|---|---|---|---|
| 0–11 | 0.006852 | 0.006818 | +0.48% |
| 12–23 | 0.006414 | 0.006338 | +1.17% |
| 24–35 | 0.006831 | 0.006704 | +1.87% |
| 36–47 | 0.006743 | 0.006479 | +3.92% |

Wang's Second Constant = 0 represents the ideal alignment of normalized Q/K spectra.

Figure: R1 vs Qwen2.5-14B SSR comparison.


3️⃣ Third Law — Precision-Depth-Logic Criterion

Statement:
Maximum trainable depth L_max is constrained by SSR, floating-point precision, and dynamic range:

$$ L_\text{max} = \min(L_\text{info}, L_\text{quant}, L_\text{dyn}) $$

Where:

  • Information decay limit:

$$ L_\text{info} = \frac{1}{\overline{\text{SSR}}} $$

  • Quantization noise limit:

$$ L_\text{quant} = 3 \cdot 2^{2m} \quad (m = \text{mantissa bits}) $$

  • Dynamic range limit:

$$ L_\text{dyn} = \frac{\log_2(\text{MaxFinite})}{\log_2 \kappa} $$

Example Table:

| Format | Mantissa bits $m$ | MaxFinite | $L_\text{dyn}$ |
|---|---|---|---|
| FP16 | 10 | 6.55e4 | 16 |
| BF16 | 7 | 3.39e38 | 128 |

This explains why ultra-deep models (>40 layers) adopt BF16 or mixed precision.
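
A worked example that reproduces the table above. Note the tabulated $L_\text{dyn}$ values correspond to taking $\log_2 \kappa = 1$ (i.e., $\kappa = 2$), an assumption read off from the numbers rather than stated here:

```python
# Worked example for the Third Law. kappa = 2 reproduces the table's L_dyn
# values (log2(65504) ~= 16, log2(3.39e38) ~= 128); the real kappa is
# model-dependent.
import math

def l_info(mean_ssr):
    return 1.0 / mean_ssr                             # information decay limit

def l_quant(m):
    return 3 * 2 ** (2 * m)                           # quantization noise limit

def l_dyn(max_finite, kappa=2.0):
    return math.log2(max_finite) / math.log2(kappa)   # dynamic range limit

mean_ssr = 0.0066  # deep-layer mean SSR of DeepSeek-R1 from the First-Law table
for fmt, m, max_finite in [("FP16", 10, 65504.0), ("BF16", 7, 3.39e38)]:
    l_max = min(l_info(mean_ssr), l_quant(m), l_dyn(max_finite))
    print(f"{fmt}: L_info={l_info(mean_ssr):.0f}, L_quant={l_quant(m):.1e}, "
          f"L_dyn={l_dyn(max_finite):.0f}, L_max={l_max:.0f}")
# FP16 is bound by L_dyn ~= 16; BF16 lifts that to ~128, approaching
# L_info ~= 152 at this SSR level.
```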


4️⃣ Fourth Law — Global Decoupling of Output Subspaces

Description:
The left singular vectors (output subspaces) of the Q, K, V matrices in attention exhibit a systematic geometric structure:

  • The output directions of Q and K are close to random orthogonality, ensuring functional separation between query and key.
  • The matched-column cosines between the output directions of Q and V, and of K and V, are significantly below the random-orthogonality baseline (super-orthogonality), which guarantees channel isolation between the retrieval path and the content-readout path, preventing information shortcuts.

Mathematical Expression:

Let $U_Q, U_K, U_V \in \mathbb{R}^{d_h \times d_h}$ be the left singular vector matrices of Q, K, V respectively. The mean absolute cosine similarity of matched columns is defined as:

$$ \overline{\cos}(U_A, U_B) = \frac{1}{d_h} \sum_{i=1}^{d_h} |\langle u_{A,i}, u_{B,i} \rangle| $$

Then we have:

$$ \overline{\cos}(U_Q, U_K) \sim \frac{1}{\sqrt{d_h}} \quad \text{(approximately random orthogonal)} $$

$$ \overline{\cos}(U_Q, U_V) < \frac{1}{\sqrt{d_h}}, \quad \overline{\cos}(U_K, U_V) < \frac{1}{\sqrt{d_h}} \quad \text{(super‑orthogonal)} $$
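
A sketch of this measurement, using random stand-in weights only to exercise the code path; real checks would slice per-head Q/K/V projection weights, whose layout varies across model families:

```python
# Mean |cos| between matched columns of left singular subspaces (Fourth Law).
import numpy as np

def left_singular(W):
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U                                   # (d_h, d_h): output subspace basis

def mean_abs_cos(A, B):
    # Columns are unit singular vectors, so matched-column dot products
    # are cosine similarities.
    return float(np.mean(np.abs(np.einsum("ij,ij->j", A, B))))

d_h, d_model = 128, 4096
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d_h, d_model)) for _ in range(3))  # stand-ins
Uq, Uk, Uv = (left_singular(W) for W in (Wq, Wk, Wv))

print("reference 1/sqrt(d_h):", 1 / np.sqrt(d_h))
print("cos(Uq,Uk):", mean_abs_cos(Uq, Uk))  # trained models: near the reference
print("cos(Uq,Uv):", mean_abs_cos(Uq, Uv))  # trained models: ~20% below it
print("cos(Uk,Uv):", mean_abs_cos(Uk, Uv))
```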

Empirical Results (cross‑model averages):

| Model | $d_h$ | Random Baseline | $\overline{\cos}(U_Q,U_K)$ | $\overline{\cos}(U_Q,U_V)$ | $\overline{\cos}(U_K,U_V)$ |
|---|---|---|---|---|---|
| Qwen2.5-14B-Instruct | 128 | 0.0884 | 0.0981 | 0.0704 | 0.0702 |
| DeepSeek-R1-Distill-Qwen-14B | 128 | 0.0884 | 0.0982 | 0.0705 | 0.0699 |
| LLaMA-3-8B | 128 | 0.0884 | 0.0949 | 0.0707 | 0.0705 |
| Gemma-4-31B (text) | 256 | 0.0625 | 0.0630 | 0.0497 | 0.0500 |
| Gemma-4-31B (vision) | 128 | 0.0884 | 0.1024 | 0.0714 | 0.0713 |

Super‑orthogonality is defined as cosine values systematically below the random expectation, typically by about 20%. This phenomenon appears consistently across all tested models and modalities, indicating that pretraining actively pushes the value output subspace away from the Q/K output subspaces.


5️⃣ Fifth Law — Global Random Orthogonality of Input Subspaces

Description:
The right singular vectors (input subspaces) of Q, K, V in the high‑dimensional token space exhibit near‑random alignment, with no structural coupling. This means the model freely and independently selects sensitive directions from the input embedding, without a mandatory shared basis.

Mathematical Expression:

Let $V_Q, V_K, V_V \in \mathbb{R}^{d_{\text{model}} \times d_h}$ be the right singular vector matrices of Q, K, V. The mean absolute cosine similarity of matched columns is:

$$ \overline{\cos}(V_A, V_B) = \frac{1}{d_h} \sum_{i=1}^{d_h} |\langle v_{A,i}, v_{B,i} \rangle| $$

Compared with the random baseline $ 1/\sqrt{d_{\text{model}}} $:

$$ \overline{\cos}(V_Q, V_K) \approx \overline{\cos}(V_Q, V_V) \approx \overline{\cos}(V_K, V_V) \approx \frac{1}{\sqrt{d_{\text{model}}}} $$
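
The same measurement on the input side, as a sketch assuming per-head weights of shape $(d_h, d_{\text{model}})$ so that the thin SVD yields $d_h$ right singular vectors in $\mathbb{R}^{d_{\text{model}}}$; the `mean_abs_cos` helper is the same as in the Fourth-Law sketch:

```python
# Mean |cos| between matched right singular vectors (Fifth Law).
import numpy as np

def right_singular(W):
    # W: (d_h, d_model); rows of Vt are the d_h right singular vectors.
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    return Vt.T                                # (d_model, d_h)

def mean_abs_cos(A, B):
    return float(np.mean(np.abs(np.einsum("ij,ij->j", A, B))))

d_h, d_model = 128, 4096
rng = np.random.default_rng(1)
Wq, Wk = rng.standard_normal((2, d_h, d_model))        # stand-in weights
print("baseline 1/sqrt(d_model):", 1 / np.sqrt(d_model))
print("cos(Vq,Vk):", mean_abs_cos(right_singular(Wq), right_singular(Wk)))
```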

Empirical Results (cross‑model averages):

| Model | $d_{\text{model}}$ | Random Baseline | $\overline{\cos}(V_Q,V_K)$ | $\overline{\cos}(V_Q,V_V)$ | $\overline{\cos}(V_K,V_V)$ |
|---|---|---|---|---|---|
| Qwen2.5-14B-Instruct | 5120 | 0.0140 | 0.0212 | 0.0142 | 0.0211 |
| DeepSeek-R1-Distill-Qwen-14B | 5120 | 0.0140 | 0.0211 | 0.0142 | 0.0210 |
| LLaMA-3-8B | 4096 | 0.0156 | 0.0258 | 0.0155 | 0.0234 |
| Gemma-4-31B (text) | 5376 | 0.0136 | 0.0167 | 0.0128 | 0.0153 |
| Gemma-4-31B (vision) | 1152 | 0.0295 | 0.0440 | 0.0306 | 0.0304 |

  • In standard text Transformers, the slight elevation of $\overline{\cos}(V_Q, V_K)$ (~1.5× baseline) represents an extremely weak input coupling, far from "structural alignment," and can be regarded as natural fluctuation around random orthogonality.
  • In vision encoders, $\overline{\cos}(V_Q, V_K)$ is more noticeably elevated (0.0440 vs the 0.0295 baseline), reflecting the special requirement of vision self-attention to share spatially sensitive directions. This does not alter the overall conclusion that the V space remains close to global random orthogonality.

Summary of the Laws

The five laws together describe the static spectral architecture of Transformer attention:

  • Σ space (singular values): full alignment (First & Second Laws)
  • U space (output): random orthogonal + super‑orthogonal, guaranteeing functional decoupling (Fourth Law)
  • V space (input): global random orthogonal, ensuring free feature selection (Fifth Law)

This structure remains highly consistent across multiple models (Qwen, DeepSeek‑R1, LLaMA‑3, Gemma‑4) and modalities (vision/text), indicating that it is a universal geometric fixed point reached at the end of pretraining.


Reproducibility Guide

Repository Structure

math-under-llm/proof/
├── 01-universal-spectral-constant/
│   ├── check-gemma.py
│   ├── check-qwen.py
│   ├── check-llama.py
│   └── check_*_v2.py
└── 02-ssr-why-RL-makes-models-smart/
    ├── qwen-vs-deepseek-all-layers.py  -- run this first
    ├── check_*_v3_full.py
    ├── check_r1_full.py
    ├── check_qwen2.5_14b_full.py
    └── check_r1_qkv.py

Model Downloads


Setup & Run

  1. Place all model folders in the same directory as the scripts.
  2. Run the verification scripts:
python check-gemma.py
python check-qwen.py
python check-llama.py
python check_r1_full.py
python check_*_v2.py
python qwen-vs-deepseek-all-layers.py
  3. Outputs include Pearson(Q,K), SSR, and deep-layer trends.

What These Scripts Verify

  • Q/K singular-value Pearson correlation
  • Layer-wise SSR (Spectral Shape Residual)
  • Deep-layer spectral alignment trends
  • Base vs RL-tuned model comparisons
  • Cross-model universality checks

Practical Applications

  • Benchmark-free reasoning assessment
  • Checkpoint selection based on SSR (see the sketch after this list)
  • RL progress monitoring
  • Spectral fine-tuning or micro-adjustments
  • Precision planning for ultra-deep training
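
As one concrete illustration of the checkpoint-selection idea, a hypothetical heuristic (not from the repository) that prefers the checkpoint with the lowest mean SSR over the deepest third of layers; per-layer SSR values would come from the sketches in the First and Second Law sections:

```python
# Hypothetical SSR-based checkpoint selection.
def deep_layer_ssr(ssr_per_layer):
    deep = ssr_per_layer[2 * len(ssr_per_layer) // 3:]   # deepest third of layers
    return sum(deep) / len(deep)

def select_checkpoint(candidates):
    # candidates: {checkpoint_name: [ssr_layer_0, ..., ssr_layer_L]}
    return min(candidates, key=lambda name: deep_layer_ssr(candidates[name]))
```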

🚀 Why This Matters

Our findings suggest that you can evaluate LLM reasoning quality without running any benchmark—just analyze static weight matrices. This has immediate implications for:

  • Training efficiency: Stop wasting compute—detect convergence via spectral metrics, not just loss curves
  • Model merging: Merge models with provable reasoning preservation using SSR as a fitness function
  • Quantization: Compress models 2-4× with zero reasoning degradation via SSR-aware mixed precision
  • Fine-tuning: Prevent catastrophic forgetting with SSR-regularization

For detailed applications and theoretical directions, see our Whitepaper → Section 7.


Before you close this README, a bonus:

  1. "Get to work, lads." — The Monkey King (WuKong).
    GitHub Issue #1: Verify r = 1
    GitHub Issue #2: Verify SSR = 0
    Pick a model. Run the script. Replicate the laws.

  2. If you've made it this far and verified the numbers, consider buying put options on NVIDIA.
    If Reasoning = Spectral Fidelity, the demand for brute-force training compute may not be what the market thinks it is.


Citation

CITATION.cff


"That's one small step for human intelligence, one giant leap for Artificial Intelligence."
— Lao Wang, EMIS-FRAMEWORK, Apr 29, 2026
