
LLM and LLM's laws lay hid in night:
Nature said, 'Let Lao Wang be!' and AI was light.

TL;DR: By applying SVD to the Q/K weight matrices of LLMs, we discovered two universal cross-model regularities — r = 1 (Spectral Linear Alignment) and SSR → 0 (Spectral Shape Fidelity). No inference, no benchmarks — just inspect the weights to assess a model's reasoning capability.

Mathematical Foundations of Large Language Models (MF-LLM)

"Die Mathematischen Grundlagen der Künstlichen Intelligenz" (To John von Neumann)


中文 | English | Read the Whitepaper


Overview

Wang's Five Laws provide a static, reproducible framework for evaluating the reasoning capability of LLMs from attention weights alone.

Wang's Five Laws

1️⃣ First Law — Spectral Linear Alignment

Statement:
Query (Q) and Key (K) singular-value spectra are linearly correlated:

$$ r(s_q, s_k) \to r_\text{Wang} = 1 $$

  • Wang's Constant = 1 (theoretical extreme)
  • Observed in practice: 0.94–0.99
  • High Pearson correlation ensures stable information propagation in deep layers.
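
A minimal sketch of how the First Law can be checked, assuming a Hugging Face-style checkpoint whose decoder layers expose `self_attn.q_proj` and `self_attn.k_proj` (module names vary across families; the repository's `check-*.py` scripts are the canonical versions):

```python
# Sketch: per-layer Pearson correlation between Q and K singular-value spectra.
# Assumes LLaMA/Qwen-style module names; adjust for other architectures.
import numpy as np
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/model", torch_dtype=torch.float32)

for idx, layer in enumerate(model.model.layers):
    s_q = torch.linalg.svdvals(layer.self_attn.q_proj.weight.detach())  # Q spectrum
    s_k = torch.linalg.svdvals(layer.self_attn.k_proj.weight.detach())  # K spectrum
    n = min(s_q.numel(), s_k.numel())  # GQA: K can have fewer output dims than Q
    r = np.corrcoef(s_q[:n].numpy(), s_k[:n].numpy())[0, 1]
    print(f"layer {idx:2d}: Pearson(s_q, s_k) = {r:.4f}")
```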

Empirical Evidence:

| Model | Median Pearson | Mean Pearson | Median SSR | Mean SSR | Layers |
|---|---|---|---|---|---|
| gemma-4-e2b | 0.9183 | 0.9242 | 0.015702 | 0.013537 | 35 |
| gemma-4-e4b | 0.9585 | 0.9411 | 0.009747 | 0.01008 | 42 |
| llama-3-8b | 0.9813 | 0.9737 | 0.006196 | 0.007009 | 32 |
| Qwen2.5-14B | 0.9795 | 0.9710 | 0.006077 | 0.00671 | 48 |
| DeepSeek-R1 | 0.9800 | 0.9714 | 0.005948 | 0.006585 | 48 |

Note: the stronger the reasoning model, the closer r → 1 and SSR → 0.


2️⃣ Second Law — Spectral Shape Fidelity

Statement:
Normalized spectral mismatch between Q and K decreases in deep layers:

$$ \text{SSR} = \frac{1}{d_h} \sum_i \left| \tilde s_{q,i} - \tilde s_{k,i} \right|, \quad \tilde s = \frac{s}{\lVert s \rVert_2} $$

  • Wang's Second Constant = 0 (theoretical extreme, ideal SSR)
  • Observed in practice: ~0.006–0.007
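
A matching sketch of the SSR computation, reusing spectra `s_q`, `s_k` extracted as in the First-Law snippet (the length-alignment step for grouped-query attention is one possible convention, not taken from the repository):

```python
# SSR per the Second Law: L2-normalize each spectrum, then mean absolute gap.
import torch

def ssr(s_q: torch.Tensor, s_k: torch.Tensor) -> float:
    n = min(s_q.numel(), s_k.numel())  # align lengths (one convention under GQA)
    sq_t = s_q[:n] / s_q[:n].norm()    # \tilde{s} = s / ||s||_2
    sk_t = s_k[:n] / s_k[:n].norm()
    return (sq_t - sk_t).abs().mean().item()
```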

Interpretation:

  • SSR measures shape alignment of Q/K spectra beyond linear correlation.
  • Lower SSR indicates higher reasoning fidelity.
  • RL-tuned models systematically reduce deep-layer SSR.

Empirical Evidence (Qwen2.5 vs DeepSeek-R1):

| Layer Group | Qwen2.5 SSR | DeepSeek-R1 SSR | Improvement |
|---|---|---|---|
| 0–11 | 0.006852 | 0.006818 | +0.48% |
| 12–23 | 0.006414 | 0.006338 | +1.17% |
| 24–35 | 0.006831 | 0.006704 | +1.87% |
| 36–47 | 0.006743 | 0.006479 | +3.92% |

Wang's Second Constant = 0 represents the ideal alignment of normalized Q/K spectra.

Figure: R1 vs Qwen2.5-14B SSR comparison.


3️⃣ Third Law — Precision-Depth-Logic Criterion

Statement:
Maximum trainable depth L_max is constrained by SSR, floating-point precision, and dynamic range:

$$ L_\text{max} = \min(L_\text{info}, L_\text{quant}, L_\text{dyn}) $$

Where:

  • Information decay limit:

$$ L_\text{info} = \frac{1}{\overline{\text{SSR}}} $$

  • Quantization noise limit:

$$ L_\text{quant} = 3 \cdot 2^{2m} \quad (m = \text{mantissa bits}) $$

  • Dynamic range limit:

$$ L_\text{dyn} = \frac{\log_2(\text{MaxFinite})}{\log_2 \kappa} $$

Example Table:

| Format | Mantissa bits $m$ | MaxFinite | $L_\text{dyn}$ |
|---|---|---|---|
| FP16 | 10 | 6.55e4 | 16 |
| BF16 | 7 | 3.39e38 | 128 |

This explains why ultra-deep models (>40 layers) adopt BF16 or mixed precision.
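
A worked example that reproduces the table above. Note the tabulated $L_\text{dyn}$ values correspond to taking $\log_2 \kappa = 1$ (i.e., $\kappa = 2$), an assumption read off from the numbers rather than stated here:

```python
# Worked example for the Third Law. kappa = 2 reproduces the table's L_dyn
# values (log2(65504) ~= 16, log2(3.39e38) ~= 128); the real kappa is
# model-dependent.
import math

def l_info(mean_ssr):
    return 1.0 / mean_ssr                             # information decay limit

def l_quant(m):
    return 3 * 2 ** (2 * m)                           # quantization noise limit

def l_dyn(max_finite, kappa=2.0):
    return math.log2(max_finite) / math.log2(kappa)   # dynamic range limit

mean_ssr = 0.0066  # deep-layer mean SSR of DeepSeek-R1 from the First-Law table
for fmt, m, max_finite in [("FP16", 10, 65504.0), ("BF16", 7, 3.39e38)]:
    l_max = min(l_info(mean_ssr), l_quant(m), l_dyn(max_finite))
    print(f"{fmt}: L_info={l_info(mean_ssr):.0f}, L_quant={l_quant(m):.1e}, "
          f"L_dyn={l_dyn(max_finite):.0f}, L_max={l_max:.0f}")
# FP16 is bound by L_dyn ~= 16; BF16 lifts that to ~128, approaching
# L_info ~= 152 at this SSR level.
```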


4️⃣ Fourth Law — Global Decoupling of Output Subspaces

Description:
The left singular vectors (output subspaces) of the Q, K, V matrices in attention exhibit a systematic geometric structure:

  • The output directions of Q and K are close to random orthogonality, ensuring functional separation between query and key.
  • The matched-column cosines between the output directions of Q and V, and of K and V, are significantly below the random-orthogonality baseline (super-orthogonality), which guarantees channel isolation between the retrieval path and the content-readout path, preventing information shortcuts.

Mathematical Expression:

Let $U_Q, U_K, U_V \in \mathbb{R}^{d_h \times d_h}$ be the left singular vector matrices of Q, K, V respectively. The mean absolute cosine similarity of matched columns is defined as:

$$ \overline{\cos}(U_A, U_B) = \frac{1}{d_h} \sum_{i=1}^{d_h} |\langle u_{A,i}, u_{B,i} \rangle| $$

Then we have:

$$ \overline{\cos}(U_Q, U_K) \sim \frac{1}{\sqrt{d_h}} \quad \text{(approximately random orthogonal)} $$

$$ \overline{\cos}(U_Q, U_V) < \frac{1}{\sqrt{d_h}}, \quad \overline{\cos}(U_K, U_V) < \frac{1}{\sqrt{d_h}} \quad \text{(super‑orthogonal)} $$
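
A sketch of this measurement, using random stand-in weights only to exercise the code path; real checks would slice per-head Q/K/V projection weights, whose layout varies across model families:

```python
# Mean |cos| between matched columns of left singular subspaces (Fourth Law).
import numpy as np

def left_singular(W):
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U                                   # (d_h, d_h): output subspace basis

def mean_abs_cos(A, B):
    # Columns are unit singular vectors, so matched-column dot products
    # are cosine similarities.
    return float(np.mean(np.abs(np.einsum("ij,ij->j", A, B))))

d_h, d_model = 128, 4096
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d_h, d_model)) for _ in range(3))  # stand-ins
Uq, Uk, Uv = (left_singular(W) for W in (Wq, Wk, Wv))

print("reference 1/sqrt(d_h):", 1 / np.sqrt(d_h))
print("cos(Uq,Uk):", mean_abs_cos(Uq, Uk))  # trained models: near the reference
print("cos(Uq,Uv):", mean_abs_cos(Uq, Uv))  # trained models: ~20% below it
print("cos(Uk,Uv):", mean_abs_cos(Uk, Uv))
```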

Empirical Results (cross‑model averages):

| Model | $d_h$ | Random Baseline | $\overline{\cos}(U_Q,U_K)$ | $\overline{\cos}(U_Q,U_V)$ | $\overline{\cos}(U_K,U_V)$ |
|---|---|---|---|---|---|
| Qwen2.5-14B-Instruct | 128 | 0.0884 | 0.0981 | 0.0704 | 0.0702 |
| DeepSeek-R1-Distill-Qwen-14B | 128 | 0.0884 | 0.0982 | 0.0705 | 0.0699 |
| LLaMA-3-8B | 128 | 0.0884 | 0.0949 | 0.0707 | 0.0705 |
| Gemma-4-31B (text) | 256 | 0.0625 | 0.0630 | 0.0497 | 0.0500 |
| Gemma-4-31B (vision) | 128 | 0.0884 | 0.1024 | 0.0714 | 0.0713 |

Super‑orthogonality is defined as cosine values systematically below the random expectation, typically by about 20%. This phenomenon appears consistently across all tested models and modalities, indicating that pretraining actively pushes the value output subspace away from the Q/K output subspaces.


5️⃣ Fifth Law — Global Random Orthogonality of Input Subspaces

Description:
The right singular vectors (input subspaces) of Q, K, V in the high‑dimensional token space exhibit near‑random alignment, with no structural coupling. This means the model freely and independently selects sensitive directions from the input embedding, without a mandatory shared basis.

Mathematical Expression:

Let $V_Q, V_K, V_V \in \mathbb{R}^{d_{\text{model}} \times d_h}$ be the right singular vector matrices of Q, K, V. The mean absolute cosine similarity of matched columns is:

$$ \overline{\cos}(V_A, V_B) = \frac{1}{d_h} \sum_{i=1}^{d_h} |\langle v_{A,i}, v_{B,i} \rangle| $$

Compared with the random baseline $ 1/\sqrt{d_{\text{model}}} $:

$$ \overline{\cos}(V_Q, V_K) \approx \overline{\cos}(V_Q, V_V) \approx \overline{\cos}(V_K, V_V) \approx \frac{1}{\sqrt{d_{\text{model}}}} $$
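
The same measurement on the input side, as a sketch assuming per-head weights of shape $(d_h, d_{\text{model}})$ so that the thin SVD yields $d_h$ right singular vectors in $\mathbb{R}^{d_{\text{model}}}$; the `mean_abs_cos` helper is the same as in the Fourth-Law sketch:

```python
# Mean |cos| between matched right singular vectors (Fifth Law).
import numpy as np

def right_singular(W):
    # W: (d_h, d_model); rows of Vt are the d_h right singular vectors.
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    return Vt.T                                # (d_model, d_h)

def mean_abs_cos(A, B):
    return float(np.mean(np.abs(np.einsum("ij,ij->j", A, B))))

d_h, d_model = 128, 4096
rng = np.random.default_rng(1)
Wq, Wk = rng.standard_normal((2, d_h, d_model))        # stand-in weights
print("baseline 1/sqrt(d_model):", 1 / np.sqrt(d_model))
print("cos(Vq,Vk):", mean_abs_cos(right_singular(Wq), right_singular(Wk)))
```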

Empirical Results (cross‑model averages):

| Model | $d_{\text{model}}$ | Random Baseline | $\overline{\cos}(V_Q,V_K)$ | $\overline{\cos}(V_Q,V_V)$ | $\overline{\cos}(V_K,V_V)$ |
|---|---|---|---|---|---|
| Qwen2.5-14B-Instruct | 5120 | 0.0140 | 0.0212 | 0.0142 | 0.0211 |
| DeepSeek-R1-Distill-Qwen-14B | 5120 | 0.0140 | 0.0211 | 0.0142 | 0.0210 |
| LLaMA-3-8B | 4096 | 0.0156 | 0.0258 | 0.0155 | 0.0234 |
| Gemma-4-31B (text) | 5376 | 0.0136 | 0.0167 | 0.0128 | 0.0153 |
| Gemma-4-31B (vision) | 1152 | 0.0295 | 0.0440 | 0.0306 | 0.0304 |

  • In standard text Transformers, the slight elevation of $\overline{\cos}(V_Q, V_K)$ (~1.5× baseline) represents an extremely weak input coupling, far from "structural alignment," and can be regarded as natural fluctuation around random orthogonality.
  • In vision encoders, $\overline{\cos}(V_Q, V_K)$ is more noticeably elevated (0.0440 vs the 0.0295 baseline), reflecting the special requirement of vision self-attention to share spatially sensitive directions. This does not alter the overall conclusion that the V space remains close to global random orthogonality.

Summary of the Laws

The five laws together describe the static spectral architecture of Transformer attention:

  • Σ space (singular values): full alignment (First & Second Laws)
  • U space (output): random orthogonal + super‑orthogonal, guaranteeing functional decoupling (Fourth Law)
  • V space (input): global random orthogonal, ensuring free feature selection (Fifth Law)

This structure remains highly consistent across multiple models (Qwen, DeepSeek‑R1, LLaMA‑3, Gemma‑4) and modalities (vision/text), indicating that it is a universal geometric fixed point reached at the end of pretraining.


Reproducibility Guide

Repository Structure

math-under-llm/proof/
├── 01-universal-spectral-constant/
│   ├── check-gemma.py
│   ├── check-qwen.py
│   ├── check-llama.py
│   └── check_*_v2.py
└── 02-ssr-why-RL-makes-models-smart/
    ├── qwen-vs-deepseek-all-layers.py  -- run this first
    ├── check_*_v3_full.py
    ├── check_r1_full.py
    ├── check_qwen2.5_14b_full.py
    └── check_r1_qkv.py

Model Downloads


Setup & Run

  1. Place all model folders in the same directory as the scripts.
  2. Run the verification scripts:
python check-gemma.py
python check-qwen.py
python check-llama.py
python check_r1_full.py
python check_*_v2.py
python qwen-vs-deepseek-all-layers.py
  3. Outputs include Pearson(Q,K), SSR, and deep-layer trends.

What These Scripts Verify

  • Q/K singular-value Pearson correlation
  • Layer-wise SSR (Spectral Shape Residual)
  • Deep-layer spectral alignment trends
  • Base vs RL-tuned model comparisons
  • Cross-model universality checks

Practical Applications

  • Benchmark-free reasoning assessment
  • Checkpoint selection based on SSR (see the sketch after this list)
  • RL progress monitoring
  • Spectral fine-tuning or micro-adjustments
  • Precision planning for ultra-deep training
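
As one concrete illustration of the checkpoint-selection idea, a hypothetical heuristic (not from the repository) that prefers the checkpoint with the lowest mean SSR over the deepest third of layers; per-layer SSR values would come from the sketches in the First and Second Law sections:

```python
# Hypothetical SSR-based checkpoint selection.
def deep_layer_ssr(ssr_per_layer):
    deep = ssr_per_layer[2 * len(ssr_per_layer) // 3:]   # deepest third of layers
    return sum(deep) / len(deep)

def select_checkpoint(candidates):
    # candidates: {checkpoint_name: [ssr_layer_0, ..., ssr_layer_L]}
    return min(candidates, key=lambda name: deep_layer_ssr(candidates[name]))
```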

🚀 Why This Matters

Our findings suggest that you can evaluate LLM reasoning quality without running any benchmark—just analyze static weight matrices. This has immediate implications for:

  • Training efficiency: Stop wasting compute—detect convergence via spectral metrics, not just loss curves
  • Model merging: Merge models with provable reasoning preservation using SSR as a fitness function
  • Quantization: Compress models 2-4× with zero reasoning degradation via SSR-aware mixed precision
  • Fine-tuning: Prevent catastrophic forgetting with SSR-regularization

For detailed applications and theoretical directions, see our Whitepaper → Section 7.


Before you close this README, a bonus:

  1. "Get to work, lads." — The Monkey King (WuKong).
    GitHub Issue #1: Verify r = 1
    GitHub Issue #2: Verify SSR = 0
    Pick a model. Run the script. Replicate the laws.

  2. If you've made it this far and verified the numbers, consider buying put options on NVIDIA.
    If Reasoning = Spectral Fidelity, the demand for brute-force training compute may not be what the market thinks it is.


Citation

CITATION.cff


"That's one small step for human intelligence, one giant leap for Artificial Intelligence."
— Lao Wang, EMIS-FRAMEWORK, Apr 29, 2026
