- the updates produced by both SGD-momentum and Adam for the 2D parameters in transformer-based neural networks typically have very high **condition number**. That is, they are almost **low-rank matrices**, with the updates for all neurons being dominated by just a few directions.
    - https://kellerjordan.github.io/posts/muon/
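
To make "almost low-rank" concrete, here is a minimal NumPy sketch. It uses a synthetic matrix, not a real optimizer update: a rank-1 matrix plus small Gaussian noise. The size, the noise scale, and the variable names are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "update" matrix: one dominant direction plus small noise.
m, n = 64, 64
u = rng.standard_normal((m, 1))
v = rng.standard_normal((1, n))
update = u @ v + 1e-3 * rng.standard_normal((m, n))

# Singular values come back sorted in descending order.
s = np.linalg.svd(update, compute_uv=False)

# Share of the squared Frobenius norm carried by the top singular direction.
top_energy = s[0] ** 2 / np.sum(s ** 2)
cond = s[0] / s[-1]

print(f"top-1 energy share: {top_energy:.6f}")
print(f"condition number:   {cond:.2e}")
```

Almost all of the energy sits in one direction, and the ratio between the largest and smallest singular values is large, which is the pattern the quoted passage describes.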

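- The condition number of a matrix measures how much the output can change when the input changes slightly. For a linear system $Ax = b$, it bounds how much a small relative perturbation of the input can be amplified in the solution, so it serves as an indicator of numerical stability.
- For a square matrix $A$, the general definition of the condition number $\kappa(A)$ is
    - $\kappa(A) = \|A\| \cdot \|A^{-1}\|$
    - This definition already implies that only invertible (nonsingular) matrices have a finite condition number. If a matrix is singular, $A^{-1}$ does not exist and the condition number is taken to be infinite.
- In practice, the most common and most geometrically revealing way to compute it is through the singular value decomposition (SVD), which gives the condition number in the spectral norm (2-norm).
    - Suppose the singular values of $A$, sorted in descending order, are $\sigma_1, \sigma_2, \cdots, \sigma_n$, so that $\sigma_{\text{max}} = \sigma_1$ and $\sigma_{\text{min}} = \sigma_n$.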
    - $\kappa(A) = \frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}$
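- From a purely numerical standpoint, the weight matrix of a trained neural network, or its gradient matrix, will usually come out as full rank when measured with `np.linalg.matrix_rank`.
    - Random initialization: the weights start out as small random numbers (for example, sampled from a Gaussian or uniform distribution). A matrix whose entries are drawn from a continuous distribution has, with probability 1, no exact linear dependence among its rows or columns.
    - Stochastic gradient descent (SGD) updates: during training, each parameter update is a small, noisy floating-point perturbation. These continuous random perturbations keep breaking any exact linear dependence that might form, so every row and column stays slightly distinct.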
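    - Numerical rank vs. effective rank
        - `np.linalg.matrix_rank` returns the numerical rank. It computes the singular values of the matrix and counts how many exceed a very small tolerance. Because of floating-point rounding, these singular values are almost never exactly zero.
        - What matters more is the effective rank, also called the intrinsic rank: how many singular values are significant (large), such that the rest can be ignored?
        - Consider the singular value spectrum of a weight matrix, that is, all of its singular values plotted from largest to smallest. For neural-network weight matrices, this spectrum often shows the following features:
            - Sharp decay: a few very large singular values.
            - Long tail: followed by a large number of very small but nonzero singular values.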
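    - Why does this "effectively low-rank" behavior appear? (the low-rank hypothesis)
        - Over-parameterization: modern neural networks have far more parameters than the task appears to need; a GPT-style model has billions of them. This gives the network enormous expressive freedom, so it does not have to use the full capacity (that is, the full rank) of each matrix to fit the data. Within this huge parameter space it can find a "simpler" solution that is essentially low-rank.
        - Intrinsic dimension of information: the task a network learns usually has an intrinsic structure of much lower dimension than the input. For example, to recognize a cat in a 256x256-pixel image, the input dimension is 65536, but the intrinsic dimension of the concept "cat" is far lower. The layers learn to project the data onto a low-dimensional subspace (manifold) in which classification or regression becomes easier. Such a projection is low-rank by nature.
        - Implicit regularization of SGD: research suggests that stochastic gradient descent has a built-in preference for solutions that generalize better, and these solutions tend to lie in flatter regions of the loss surface and to correspond to simpler (for example, low-rank) parameter configurations.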
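    - A matrix with this effectively low-rank spectrum is necessarily ill-conditioned: $\sigma_{\text{min}}$ sits in the long tail of tiny singular values, so the condition number is very large.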
        - $\kappa(A) = \frac{\sigma_{\text{max}}}{\sigma_{\text{min}}}$
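
The gap between numerical rank and effective rank can be sketched on a synthetic matrix. The spectrum below (4 large singular values followed by a flat tail at `1e-4`) and the 99% energy threshold used to define `effective_rank` are both assumptions made for illustration; the notes above do not fix a particular definition of effective rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128

# Assumed spectrum: a few large singular values, then a long tail of tiny ones.
s_true = np.concatenate([np.array([10.0, 8.0, 6.0, 4.0]), np.full(n - 4, 1e-4)])

# Random orthogonal factors obtained from QR decompositions.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
W = (U * s_true) @ V.T  # same as U @ diag(s_true) @ V.T

s = np.linalg.svd(W, compute_uv=False)

# Numerical rank: counts singular values above a tiny default tolerance.
numerical_rank = np.linalg.matrix_rank(W)

# Effective rank (hypothetical definition): how many singular values are
# needed to capture 99% of the squared Frobenius norm.
energy = np.cumsum(s ** 2) / np.sum(s ** 2)
effective_rank = int(np.searchsorted(energy, 0.99) + 1)

cond = s[0] / s[-1]

print(f"numerical rank: {numerical_rank}")
print(f"effective rank: {effective_rank}")
print(f"condition number: {cond:.2e}")
```

The matrix is full rank numerically, yet only a handful of directions carry almost all of its energy, and the tiny tail is what drives the condition number up.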

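The two expressions for $\kappa(A)$ given earlier can be checked against each other numerically. The sketch below uses an illustrative, nearly singular 2x2 matrix (an assumption), computes the condition number both ways in the spectral norm, and then shows the sensitivity interpretation by solving $Ax = b$ for two nearby right-hand sides.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.000001]])

# Definition via norms: spectral norm (ord=2) of A times that of its inverse.
kappa_norm = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)

# Definition via singular values: sigma_max / sigma_min.
s = np.linalg.svd(A, compute_uv=False)
kappa_svd = s[0] / s[-1]

# Sensitivity: a tiny change in b produces a large change in the solution x.
b = np.array([2.0, 2.000001])
db = np.array([0.0, 1e-6])
x = np.linalg.solve(A, b)
x_perturbed = np.linalg.solve(A, b + db)

rel_in = np.linalg.norm(db) / np.linalg.norm(b)
rel_out = np.linalg.norm(x_perturbed - x) / np.linalg.norm(x)

print(f"kappa (norm definition): {kappa_norm:.4e}")
print(f"kappa (SVD definition):  {kappa_svd:.4e}")
print(f"relative input change:   {rel_in:.2e}")
print(f"relative output change:  {rel_out:.2e}")
```

The amplification factor `rel_out / rel_in` is large but stays below $\kappa(A)$, which is the worst-case bound.
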
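- The "effectively low-rank" observation is not just a theoretical curiosity; it underlies some of the most important techniques in AI:
    - Model compression: since the matrix is effectively low-rank, it can be approximated by a truly low-rank matrix, which cuts the parameter count substantially. For example, a large matrix $W_{m \times n}$ can be factored into the product of two small matrices $A_{m \times k}$ and $B_{k \times n}$ (with $k \ll m, n$), reducing the parameter count from $m\times n$ to $k\times (m+n)$.
    - LoRA (Low-Rank Adaptation): one of the most popular techniques for fine-tuning large language models. The core idea is that during fine-tuning the original large weight matrix $W_0$ stays frozen, and only a small low-rank update matrix $\Delta W = BA$ is trained. The final weight is $W = W_0 + BA$. This greatly reduces the compute and storage cost of fine-tuning. The success of LoRA is often cited as strong empirical support for the low-rank hypothesis, at least for fine-tuning updates.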
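
The compression idea can be sketched with a truncated SVD. The weight matrix here is synthetic (a rank-`k` signal plus small noise), and the sizes `m`, `n`, `k` are arbitrary assumptions; a real weight matrix may be much less compressible.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 256, 512, 16

# Synthetic "weight": exactly rank-k signal plus small full-rank noise.
W = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
W += 1e-3 * rng.standard_normal((m, n))

# Truncated SVD: keep only the k largest singular directions.
U, s, Vh = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * s[:k]   # shape (m, k), singular values folded into A
B = Vh[:k, :]          # shape (k, n)

rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
params_full = m * n
params_low_rank = k * (m + n)

print(f"relative approximation error: {rel_err:.2e}")
print(f"parameters: {params_full} -> {params_low_rank}")
```

With these sizes the factorization stores 12288 numbers instead of 131072, matching the $k\times(m+n)$ versus $m\times n$ count above.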

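The LoRA parameterization $W = W_0 + BA$ can be sketched in a few lines of NumPy. This is not a training loop and not the API of any LoRA library: the shapes, the rank `r`, and the function name `lora_forward` are hypothetical, and the scaling factor commonly applied to $BA$ is left out for brevity. The zero initialization of `B` is what makes the adapted model start out identical to the frozen one.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4

W0 = rng.standard_normal((d_out, d_in))     # frozen pretrained weight (synthetic here)
A = 0.01 * rng.standard_normal((r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init so BA = 0 at the start

def lora_forward(x):
    # Computes (W0 + B A) x without materializing the d_out x d_in update.
    return W0 @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
y_start = lora_forward(x)   # equals W0 @ x because B is still zero

# Stand-in for training: pretend B has been updated.
B = rng.standard_normal((d_out, r))
delta_W = B @ A             # the learned update has rank at most r

trainable = r * (d_in + d_out)
print(f"trainable parameters: {trainable} (full matrix: {d_in * d_out})")
print(f"rank of delta_W: {np.linalg.matrix_rank(delta_W)}")
```
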
In [1]:
import numpy as np

def analyze_matrix(name, A):
    """
    Analyzes a matrix to show its rank, singular values, and condition number.
    """
    print(f"--- Analyzing Matrix: {name} ---")
    print("Matrix A:\n", A)

    # Calculate rank
    rank = np.linalg.matrix_rank(A)
    print(f"\nRank of A: {rank}")

    # Calculate singular values
    # The svd function returns U, s, Vh. 's' contains the singular values.
    U, s, Vh = np.linalg.svd(A)
    print(f"Singular values (s): {s}")
    
    sigma_max = np.max(s)
    sigma_min = np.min(s)
    print(f"  - Max singular value (σ_max): {sigma_max:.6f}")
    print(f"  - Min singular value (σ_min): {sigma_min:.6f}")

    # Calculate condition number
    # Check if the matrix is singular before calculating condition number
    if sigma_min < 1e-15: # A small threshold to check for numerical zero
        cond_num = float('inf')
        cond_num_manual = float('inf')
    else:
        cond_num = np.linalg.cond(A)
        cond_num_manual = sigma_max / sigma_min

    print(f"\nCondition number (from np.linalg.cond): {cond_num:,.2f}")
    print(f"Condition number (from σ_max / σ_min): {cond_num_manual:,.2f}")
    
    # Check the singular case first: inf > 1000 is True, so testing the
    # threshold first would mislabel a singular matrix as merely ill-conditioned.
    if np.isinf(cond_num):
        print("\nConclusion: This is a singular matrix (rank-deficient). Its condition number is infinite.")
    elif cond_num > 1000:
        print("\nConclusion: This is an ill-conditioned (sick) matrix.")
        print("It is very close to being a singular (lower rank) matrix because its smallest singular value is tiny compared to its largest.")
    else:
        print("\nConclusion: This is a well-conditioned matrix.")
    print("-" * 40 + "\n")


# 1. A well-conditioned matrix (close to identity)
# The two column vectors are orthogonal.
A_well = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
analyze_matrix("Well-Conditioned Matrix", A_well)


# 2. An ill-conditioned matrix
# The second column is very close to being a multiple of the first.
# It's "almost" a rank-1 matrix.
A_ill = np.array([[1.0, 1.0],
                  [1.0, 1.000001]])
analyze_matrix("Ill-Conditioned Matrix", A_ill)


# 3. A singular matrix (rank-deficient)
# The second column is exactly a multiple of the first.
A_singular = np.array([[1.0, 2.0],
                       [2.0, 4.0]])
analyze_matrix("Singular Matrix", A_singular)

--- Analyzing Matrix: Well-Conditioned Matrix ---
Matrix A:
 [[1. 0.]
 [0. 1.]]

Rank of A: 2
Singular values (s): [1. 1.]
  - Max singular value (σ_max): 1.000000
  - Min singular value (σ_min): 1.000000

Condition number (from np.linalg.cond): 1.00
Condition number (from σ_max / σ_min): 1.00

Conclusion: This is a well-conditioned matrix.
----------------------------------------

--- Analyzing Matrix: Ill-Conditioned Matrix ---
Matrix A:
 [[1.       1.      ]
 [1.       1.000001]]

Rank of A: 2
Singular values (s): [2.00000050e+00 4.99999875e-07]
  - Max singular value (σ_max): 2.000001
  - Min singular value (σ_min): 0.000000

Condition number (from np.linalg.cond): 4,000,002.00
Condition number (from σ_max / σ_min): 4,000,002.00

Conclusion: This is an ill-conditioned (sick) matrix.
It is very close to being a singular (lower rank) matrix because its smallest singular value is tiny compared to its largest.
----------------------------------------

--- Analyzing Matrix: Singular Mat

### gpt2 analysis

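- One group of matrices is severely ill-conditioned: almost all Attention Proj layers have very high condition numbers, and a few are even flagged as numerically rank-deficient.
- The other group is well-conditioned: almost all Attention (Q,K,V Fused), MLP FC (Up-Proj), and MLP Proj (Down-Proj) layers have surprisingly small condition numbers (mostly between 10 and 200).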
    - `block.attn.c_attn.weight`: `[768, 2304]`
        - the fused $W_q$, $W_k$, $W_v$, each `[768, 768]`
- attention
    - $ \text{head}_i = \text{Attention}(Q_i, K_i, V_i) = \text{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right)V_i $
    - $ \text{MultiHeadOutput} = \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h) $
    - $ \text{Output}_{\text{Attn}} = \text{MultiHeadOutput} \cdot W_o + b_o $
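
The three formulas above, together with the fused `[768, 2304]` projection, can be traced in a small NumPy sketch. The weights are random stand-ins for `c_attn.weight` and `c_proj.weight`, not the GPT-2 parameters, and the causal mask and the `c_attn` bias are omitted to keep the example short, so this is a simplified illustration rather than the GPT-2 forward pass.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d_model, h = 5, 768, 12      # sequence length, model width, number of heads
d_k = d_model // h

x = rng.standard_normal((T, d_model))
W_attn = 0.02 * rng.standard_normal((d_model, 3 * d_model))  # stand-in for c_attn.weight
W_o = 0.02 * rng.standard_normal((d_model, d_model))         # stand-in for c_proj.weight
b_o = np.zeros(d_model)

# One fused matmul, then split the result into Q, K, V (each T x d_model).
Q, K, V = np.split(x @ W_attn, 3, axis=-1)

heads = []
for i in range(h):
    cols = slice(i * d_k, (i + 1) * d_k)
    Q_i, K_i, V_i = Q[:, cols], K[:, cols], V[:, cols]
    scores = Q_i @ K_i.T / np.sqrt(d_k)      # (T, T) attention logits
    heads.append(softmax(scores) @ V_i)      # head_i, shape (T, d_k)

multi_head = np.concatenate(heads, axis=-1)  # Concat(head_1, ..., head_h)
out = multi_head @ W_o + b_o                 # output projection

print(multi_head.shape, out.shape)
```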

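One caveat on the "rank-deficient" verdicts mentioned above for the Attention Proj layers: a numerical rank depends on the tolerance, and the default tolerance scales with the machine epsilon of the dtype. `np.linalg.matrix_rank` uses `sigma_max * max(m, n) * eps` by default, and `torch.linalg.matrix_rank` documents a comparable default. For a 768x768 float32 matrix that threshold is roughly `9.2e-5 * sigma_max`, so a condition number above about `1e4` can already be enough to lose rank. The sketch below illustrates this on a synthetic matrix (an assumption, not a GPT-2 weight) with condition number `1e5`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256

# Synthetic matrix with singular values from 1 down to 1e-5 (cond = 1e5).
s_true = np.logspace(0, -5, n)
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A64 = (U * s_true) @ V.T
A32 = A64.astype(np.float32)

# Default tolerance of np.linalg.matrix_rank: sigma_max * max(m, n) * eps.
tol32 = s_true[0] * n * np.finfo(np.float32).eps
tol64 = s_true[0] * n * np.finfo(np.float64).eps

rank32 = np.linalg.matrix_rank(A32)   # SVD and tolerance in float32
rank64 = np.linalg.matrix_rank(A64)   # SVD and tolerance in float64

print(f"float32: tol = {tol32:.2e}, rank = {rank32} of {n}")
print(f"float64: tol = {tol64:.2e}, rank = {rank64} of {n}")
```

The same matrix counts as rank-deficient in float32 and as full rank in float64, so a reported rank deficiency is best read together with the dtype and the condition number.
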
In [2]:
import torch
from transformers import GPT2Model
import numpy as np

def analyze_matrix(name: str, matrix: torch.Tensor):
    """
    Analyzes a given matrix (PyTorch tensor) to compute and print its
    shape, numerical rank, and condition number.
    """
    # Only 2D weight matrices are analyzed. Fused layers such as c_attn,
    # shaped (embedding_dim, 3 * embedding_dim), are already 2D and pass through.
    # Biases, LayerNorm vectors, and higher-rank tensors are skipped.
    if matrix.dim() != 2:
        print(f"--- Skipping {name}: Not a 2D matrix (shape: {matrix.shape}) ---\n")
        return

    # Move to float32 for accurate linalg calculations
    matrix = matrix.to(torch.float32)
    
    print(f"--- Analyzing: {name} ---")
    print(f"Shape: {matrix.shape}")

    # --- Rank Calculation ---
    # torch.linalg.matrix_rank uses SVD and counts singular values > tol
    rank = torch.linalg.matrix_rank(matrix)
    max_possible_rank = min(matrix.shape)
    print(f"Numerical Rank: {rank.item()} (Max possible: {max_possible_rank})")
    if rank.item() == max_possible_rank:
        print("-> The matrix is numerically FULL RANK.")
    else:
        print("-> The matrix is numerically RANK-DEFICIENT.")

    # --- Condition Number Calculation ---
    # torch.linalg.cond also uses SVD: sigma_max / sigma_min
    # This can be very slow for large matrices.
    print("Calculating condition number (this may take a moment)...")
    try:
        # For non-square matrices, cond is still calculated via singular values
        cond_num = torch.linalg.cond(matrix)
        print(f"Condition Number: {cond_num.item():.2e}") # Use scientific notation
        if cond_num > 1e5:
             print("-> The matrix is EXTREMELY ILL-CONDITIONED.")
        elif cond_num > 1e3:
             print("-> The matrix is ill-conditioned.")
        else:
             print("-> The matrix is well-conditioned.")
    except torch.linalg.LinAlgError as e:
        print(f"Could not compute condition number: {e}")
    
    print("-" * (len(name) + 18) + "\n")


In [4]:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}\n")

# Load pre-trained GPT-2 model from Hugging Face
print("Loading pre-trained gpt2 model...")
# We use GPT2Model to get the core transformer blocks without the language model head
model = GPT2Model.from_pretrained('gpt2')
model.to(device)
model.eval() # Set to evaluation mode

# We don't need to compute gradients for this analysis
with torch.no_grad():
    # Iterate through each transformer block (layer)
    for i, block in enumerate(model.h):
        print(f"=========================================")
        print(f" GPT-2 Transformer Block {i} ")
        print(f"=========================================\n")

        # 1. Attention Layer Analysis
        # In GPT-2, Q, K, V are combined into one large matrix `c_attn`
        attn_weights = block.attn.c_attn.weight
        analyze_matrix(f"Block {i} - Attention (Q,K,V Fused)", attn_weights)
        
        # Also analyze the output projection of the attention layer
        attn_proj_weights = block.attn.c_proj.weight
        analyze_matrix(f"Block {i} - Attention Proj", attn_proj_weights)

        # 2. MLP (Feed-Forward) Layer Analysis
        # First fully-connected layer (up-projection)
        mlp_fc_weights = block.mlp.c_fc.weight
        analyze_matrix(f"Block {i} - MLP FC (Up-Proj)", mlp_fc_weights)

        # Second fully-connected layer (down-projection)
        mlp_proj_weights = block.mlp.c_proj.weight
        analyze_matrix(f"Block {i} - MLP Proj (Down-Proj)", mlp_proj_weights)
        
        # Let's just analyze the first few layers to keep the output concise
        # if i == 1:
        #     print("... analysis shown for the first 2 blocks ...")
        #     break

Using device: cuda

Loading pre-trained gpt2 model...
 GPT-2 Transformer Block 0 

--- Analyzing: Block 0 - Attention (Q,K,V Fused) ---
Shape: torch.Size([768, 2304])
Numerical Rank: 768 (Max possible: 768)
-> The matrix is numerically FULL RANK.
Calculating condition number (this may take a moment)...
Condition Number: 4.55e+01
-> The matrix is well-conditioned.
---------------------------------------------------

--- Analyzing: Block 0 - Attention Proj ---
Shape: torch.Size([768, 768])
Numerical Rank: 765 (Max possible: 768)
-> The matrix is numerically RANK-DEFICIENT.
Calculating condition number (this may take a moment)...
Condition Number: 1.09e+05
-> The matrix is EXTREMELY ILL-CONDITIONED.
------------------------------------------

--- Analyzing: Block 0 - MLP FC (Up-Proj) ---
Shape: torch.Size([768, 3072])
Numerical Rank: 768 (Max possible: 768)
-> The matrix is numerically FULL RANK.
Calculating condition number (this may take a moment)...
Condition Number: 4.44e+01
-> The ma