LLM and LLM's laws lay hid in night:
Nature said, 'Let Lao Wang be!' and AI was light.
TL;DR: By applying SVD to the Q/K weight matrices of LLMs, we discovered two universal cross-model regularities — r = 1 (Spectral Linear Alignment) and SSR → 0 (Spectral Shape Fidelity). No inference, no benchmarks — just inspect the weights to assess a model's reasoning capability.
"Die Mathematischen Grundlagen der Künstlichen Intelligenz" (To John von Neumann)
中文 | English | Read the Whitepaper
Wang's five Laws provide a static, reproducible framework to evaluate reasoning capability in LLMs from attention weights alone.
Statement:
Query (Q) and Key (K) singular-value spectra are linearly correlated:
- Wang's Constant = 1 (theoretical extreme)
- Observed in practice: 0.94–0.99
- High Pearson correlation ensures stable information propagation in deep layers.
Empirical Evidence:
| Model | Median Pearson | Mean Pearson | Median SSR | MeanSSR | Layers |
|---|---|---|---|---|---|
| gemma-4-e2b | 0.9183 | 0.9242 | 0.015702 | 0.013537 | 35 |
| gemma-4-e4b | 0.9585 | 0.9411 | 0.009747 | 0.01008 | 42 |
| llama-3-8b | 0.9813 | 0.9737 | 0.006196 | 0.007009 | 32 |
| Qwen2.5-14B | 0.9795 | 0.9710 | 0.006077 | 0.00671 | 48 |
| DeepSeek-R1 | 0.9800 | 0.9714 | 0.005948 | 0.006585 | 48 |
注:Better reasonning model,better r → 1,SSR → 0
Statement:
Normalized spectral mismatch between Q and K decreases in deep layers:
- Wang's Second Constant = 0 (theoretical extreme, ideal SSR)
- Observed in practice: ~0.006–0.007
Interpretation:
- SSR measures shape alignment of Q/K spectra beyond linear correlation.
- Lower SSR indicates higher reasoning fidelity.
- RL-tuned models systematically reduce deep-layer SSR.
Empirical Evidence (Qwen2.5 vs DeepSeek-R1):
| Layer Group | Qwen2.5 SSR | DeepSeek-R1 SSR | Improvement |
|---|---|---|---|
| 0-11 | 0.006852 | 0.006818 | +0.48% |
| 12-23 | 0.006414 | 0.006338 | +1.17% |
| 24-35 | 0.006831 | 0.006704 | +1.87% |
| 36-47 | 0.006743 | 0.006479 | +3.92% |
Wang's Second Constant = 0 represents the ideal alignment of normalized Q/K spectra.
Statement:
Maximum trainable depth L_max is constrained by SSR, floating-point precision, and dynamic range:
L_max = min(L_info, L_quant, L_dyn)
Where:
- Information decay limit:
- Quantization noise limit:
- Dynamic range limit:
Example Table:
| Format | Mantissa bits |
MaxFinite | |
|---|---|---|---|
| FP16 | 10 | 6.55e4 | 16 |
| BF16 | 7 | 3.39e38 | 128 |
Explains why ultra-deep models (>40 layers) adopt BF16/mixed precision.
Description:
The left singular vectors (output subspaces) of the Q, K, V matrices in attention exhibit a systematic geometric structure:
- The output directions of Q and K are close to random orthogonality, ensuring functional separation between query and key.
- The output directions of Q and V, and K and V, are significantly lower than random orthogonality (super‑orthogonality), which guarantees channel isolation between the retrieval path and the content readout path, preventing information shortcuts.
Mathematical Expression:
Let
Then we have:
Empirical Results (cross‑model averages):
| Model | Random Baseline | ||||
|---|---|---|---|---|---|
| Qwen2.5‑14B‑Instruct | 128 | 0.0884 | 0.0981 | 0.0704 | 0.0702 |
| DeepSeek‑R1‑Distill‑Qwen‑14B | 128 | 0.0884 | 0.0982 | 0.0705 | 0.0699 |
| LLaMA‑3‑8B | 128 | 0.0884 | 0.0949 | 0.0707 | 0.0705 |
| Gemma‑4‑31B (text) | 256 | 0.0625 | 0.0630 | 0.0497 | 0.0500 |
| Gemma‑4‑31B (vision) | 128 | 0.0884 | 0.1024 | 0.0714 | 0.0713 |
Super‑orthogonality is defined as cosine values systematically below the random expectation, typically by about 20%. This phenomenon appears consistently across all tested models and modalities, indicating that pretraining actively pushes the value output subspace away from the Q/K output subspaces.
Description:
The right singular vectors (input subspaces) of Q, K, V in the high‑dimensional token space exhibit near‑random alignment, with no structural coupling. This means the model freely and independently selects sensitive directions from the input embedding, without a mandatory shared basis.
Mathematical Expression:
Let
Compared with the random baseline $ 1/\sqrt{d_{\text{model}}} $:
Empirical Results (cross‑model averages):
| Model | Random Baseline | ||||
|---|---|---|---|---|---|
| Qwen2.5‑14B‑Instruct | 5120 | 0.0140 | 0.0212 | 0.0142 | 0.0211 |
| DeepSeek‑R1‑Distill‑Qwen‑14B | 5120 | 0.0140 | 0.0211 | 0.0142 | 0.0210 |
| LLaMA‑3‑8B | 4096 | 0.0156 | 0.0258 | 0.0155 | 0.0234 |
| Gemma‑4‑31B (text) | 5376 | 0.0136 | 0.0167 | 0.0128 | 0.0153 |
| Gemma‑4‑31B (vision) | 1152 | 0.0295 | 0.0440 | 0.0306 | 0.0304 |
- In standard text Transformers, the slight elevation of
cosV_QK(~1.5× baseline) represents an extremely weak input coupling, far from “structural alignment,” and can be regarded as natural fluctuation around random orthogonality.- In vision encoders,
cosV_QKis more noticeably elevated (0.044 vs 0.030), reflecting the special requirement of vision self‑attention to share spatially sensitive directions. This does not alter the overall conclusion that the V space remains close to global random orthogonality.
Summary of the Laws
The five laws together describe the static spectral architecture of Transformer attention:
- Σ space (singular values): full alignment (First & Second Laws)
- U space (output): random orthogonal + super‑orthogonal, guaranteeing functional decoupling (Fourth Law)
- V space (input): global random orthogonal, ensuring free feature selection (Fifth Law)
This structure remains highly consistent across multiple models (Qwen, DeepSeek‑R1, LLaMA‑3, Gemma‑4) and modalities (vision/text), indicating that it is a universal geometric fixed point reached at the end of pretraining.
math-under-llm/proof/
01-universal-spectral-constant
├── check-gemma.py
├── check-qwen.py
├── check-llama.py
├── check_*_v2.py
02-ssr-why-RL-makes-models-smart
├── qwen-vs-deepseek-all-layers.py -- run this first
├── check_*_v3_full.py
├── check_r1_full.py
├── check_qwen2.5_14b_full.py
├── qwen-vs-deepseek-all-layers.py
├── check_r1_qkv.py
- Place all model folders in the same directory as the scripts.
- Run verification scripts:
python check-gemma.py
python check-qwen.py
python check-llama.py
python check_r1_full.py
python check_*_v2.py
python qwen-vs-deepseek-all-layers.py- Outputs include Pearson(Q,K), SSR, and deep-layer trends.
- Q/K singular-value Pearson correlation
- Layer-wise SSR (Spectral Shape Residual)
- Deep-layer spectral alignment trends
- Base vs RL-tuned model comparisons
- Cross-model universality checks
- Benchmark-free reasoning assessment
- Checkpoint selection based on SSR
- RL progress monitoring
- Spectral fine-tuning or micro-adjustments
- Precision planning for ultra-deep training
Our findings suggest that you can evaluate LLM reasoning quality without running any benchmark—just analyze static weight matrices. This has immediate implications for:
- Training efficiency: Stop wasting compute—detect convergence via spectral metrics, not just loss curves
- Model merging: Merge models with provable reasoning preservation using SSR as a fitness function
- Quantization: Compress models 2-4× with zero reasoning degradation via SSR-aware mixed precision
- Fine-tuning: Prevent catastrophic forgetting with SSR-regularization
For detailed applications and theoretical directions, see our Whitepaper → Section 7.
-
"Get to work, lads." — The Monkey King (WuKong).
GitHub Issue #1: Verify r = 1
GitHub Issue #2: Verify SSR = 0
Pick a model. Run the script. Replicate the laws. -
If you've made it this far and verified the numbers——consider buying put options on NVIDIA.
If Reasoning = Spectral Fidelity, the demand for brute-force training compute may not be what the market thinks it is.
"That's one small step for human intelligence, one giant leap for Artificial Intelligence."
— Lao Wang, EMIS-FRAMEWORK, Apr 29, 2026
