We randomly generate many samples (tokens) to construct a series of random input prompts. For each layer of each of the 12 blocks in the transformer, we compute the Activation Covariance Matrix (ACM) in order to compare the orientation of its eigenvectors with the right singular vectors of the corresponding weight matrix.

Why? Because we want to assess how the weight matrix geometrically transforms the buffer within layers and within blocks. By computing the overlap between the eigenvectors of the ACM and the right singular vectors of the weight matrix, we expect to be able to visualise, for each layer in each block, whether the weight matrix acts directly on the buffer by amplifying or damping some directions or, conversely, leaves the geometry of the buffer essentially unaffected. What we see is that results are consistent across blocks for layers of the same type (attention $W_Q$, $W_K$, $W_V$ and projection matrices, and MLP up- and down-projection matrices).

## Overlap Analysis 

Consider the buffer $x^{(\ell)}\in\R^{n\times m}$, where $m$ is the number of tokens, $n$ is the size of the embedding space, and the superscript denotes the input to layer $\ell$. Each embedded token is an $n$-dimensional column vector $x_i\in\R^n$, with $i=1,\dots,m$ referring to the $i$-th token in the buffer.

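The Activation Covariance Matrix (ACM) is defined over $N$ sampled token activations $x_i\in\R^n$, with sample mean $\bar x=\frac1N\sum_{i=1}^N x_i$, as follows: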
$$
F^{(\ell)}
= \mathrm{Cov}(x)
= \frac{1}{N}\sum_{i=1}^N (x_i - \bar x)\,(x_i - \bar x)^T
\;\in\R^{n\times n}.
$$
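As a minimal sketch of the definition above, the ACM can be computed with NumPy from a matrix of activations stored one per row. The function name `activation_covariance` and the random stand-in data are assumptions for illustration only; the normalisation is $1/N$, as in the formula.

```python
import numpy as np

def activation_covariance(X):
    """Return the (n, n) covariance of the rows of X, normalised by 1/N.

    X is assumed to have shape (N, n): one n-dimensional activation per row.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0, keepdims=True)   # subtract the sample mean x_bar
    return centered.T @ centered / X.shape[0]      # (1/N) * sum_i (x_i - x_bar)(x_i - x_bar)^T

# Hypothetical stand-in for activations collected at the input of one layer
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 8))
F = activation_covariance(acts)
```

The result agrees with `np.cov(acts, rowvar=False, bias=True)`, which applies the same $1/N$ normalisation.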

The weight matrix SVD is the following:
$$
W^{(\ell)} \;=\; U^{(\ell)}\,S^{(\ell)}\,V^{(\ell)T},
\quad
U^{(\ell)}\in\R^{m\times m},\;
S^{(\ell)}\in\R^{m\times n},\;
V^{(\ell)}\in\R^{n\times n},
$$
where
+ $r=\min(m,n)$ is the maximal rank of $W^{(\ell)}$, attained when $W^{(\ell)}$ has full rank,
+ the columns of $V^{(\ell)}$, denoted $v_k^{(\ell)}$, of size $n$, are the $n$ right singular vectors in input space,
+ $S^{(\ell)}$ is rectangular diagonal with nonnegative entries $\sigma_1\ge\sigma_2\ge\cdots \ge\sigma_r$,
+ the columns of $U^{(\ell)}$ are the left singular vectors in output space.
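The shapes listed above can be checked numerically. The following is a small sketch using `np.linalg.svd` on a random stand-in matrix; the sizes `m = 4`, `n = 6` are arbitrary choices for illustration. Note that NumPy returns $V^T$ rather than $V$, so the right singular vectors are the rows of `Vt`.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 6                       # illustrative output and input sizes
W = rng.normal(size=(m, n))       # stand-in for a weight matrix W^(l)

# full_matrices=True yields U (m x m) and V^T (n x n);
# s holds the r = min(m, n) singular values in descending order
U, s, Vt = np.linalg.svd(W, full_matrices=True)
r = min(m, n)

# Rebuild the rectangular diagonal S (m x n) and reconstruct W = U S V^T
S = np.zeros((m, n))
S[:r, :r] = np.diag(s)
reconstruction = U @ S @ Vt

# Row k of Vt is the right singular vector v_k; W maps it to sigma_k * u_k
k = 0
image_of_vk = W @ Vt[k]
```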

The ACM is an $n \times n$ symmetric matrix whose eigendecomposition gives
$$
F^{(\ell)}\,f_j^{(\ell)} \;=\;\lambda_j\,f_j^{(\ell)},
\quad
f_j^{(\ell)}\in\R^n,\;\lambda_1\ge\lambda_2\ge\cdots\ge0,
$$

with $j = 1,...,n$.

We want to compute the overlap: for each singular-vector index $k=1,\dots,r$, the maximum inner product between $v_k^{(\ell)}$ and the leading eigenvectors of the ACM,

$$O_k^{(\ell)} = \max_{j=1,\dots,r} \bigl\langle v_k^{(\ell)},\;f_j^{(\ell)}\bigr\rangle,$$
where $\{f_j^{(\ell)}\}$ are the eigenvectors of $F^{(\ell)}$, sorted in descending order of eigenvalue and truncated to the first $r$, the rank of $W$.
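A possible way to compute $O_k^{(\ell)}$ is sketched below. The function name `overlap` is hypothetical. One assumption goes beyond the formula: because eigenvectors and singular vectors are only defined up to sign, the sketch takes the absolute value of each inner product before maximising, so that the result does not depend on the sign convention of the solver.

```python
import numpy as np

def overlap(W, F):
    """Overlap O_k between right singular vectors of W (m x n) and the
    top-r eigenvectors of the symmetric matrix F (n x n), r = min(m, n)."""
    m, n = W.shape
    r = min(m, n)
    # Rows of Vt are the first r right singular vectors v_k
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    # eigh returns eigenvalues in ascending order; reorder to descending, keep r
    evals, evecs = np.linalg.eigh(F)
    order = np.argsort(evals)[::-1][:r]
    top_evecs = evecs[:, order]            # columns are f_1, ..., f_r
    inner = Vt @ top_evecs                 # inner[k, j] = <v_k, f_j>
    # Assumption: absolute value, since the sign of each vector is arbitrary
    return np.abs(inner).max(axis=1)

# Sanity check on a case with known answer: the eigenvectors of W^T W are
# exactly the right singular vectors of W, so every overlap should be 1
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))
O_aligned = overlap(W, W.T @ W)
```

Since all vectors have unit norm, each overlap lies between 0 and 1.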

Informally, the objective is to answer the question:

    "Of all the directions in the layer’s input space that the weight matrix could amplify or attenuate, which ones does the data actually occupy?"

Since the transformation in layer $\ell$ is given by $W^{(\ell)} x^{(\ell)}+b^{(\ell)}$, we define the new coordinates
$
y \;=\; V^{(\ell)T}\,x^{(\ell)}.
$

In these coordinates, the $k$-th entry $y_k$ is exactly the projection
$\langle v_k^{(\ell)},\,x\rangle$ of $x$ onto the $k$-th right singular vector.

If we compute the covariance of the vector $y$ then, by definition,
$
\mathrm{Cov}(y)
= \mathbb{E}\bigl[\,(y - \mathbb{E}[y])\,(y - \mathbb{E}[y])^T\,\bigr].
$

Using the standard fact that for any fixed matrix $A$, $\mathrm{Cov}(A\,x)=A\,\mathrm{Cov}(x)\,A^T,$ we get
$$
\widetilde F
= \mathrm{Cov}(y)
= V^{(\ell)T}\,\mathrm{Cov}(x)\,V^{(\ell)}
= V^{(\ell)T}\,F^{(\ell)}\,V^{(\ell)}.
$$

Since $V^{(\ell)}$ is orthogonal ($V^{(\ell)T}V^{(\ell)} = V^{(\ell)}V^{(\ell)T} = I$), $\widetilde F$ is still an $n\times n$ symmetric matrix and is an orthogonal similarity transform of $F^{(\ell)}$. Such transforms preserve eigenvalues and simply rotate eigenvectors.

If
$$
F^{(\ell)}\,f_j^{(\ell)}
= \lambda_j\,f_j^{(\ell)},
$$
then
$$
\widetilde F\;
\bigl(V^{(\ell)T}f_j^{(\ell)}\bigr)
= V^{(\ell)T}\,F^{(\ell)}\,V^{(\ell)} V^{(\ell)T}f_j^{(\ell)}
= V^{(\ell)T}\,F^{(\ell)}\,f_j^{(\ell)}
= \lambda_j\,\bigl(V^{(\ell)T}f_j^{(\ell)}\bigr).
$$

Hence the eigenpairs of $\widetilde F$ are
$
\bigl(\lambda_j,\;V^{(\ell)T}f_j^{(\ell)}\bigr),$ where each eigenvector of $\widetilde F$ holds the coordinates of the corresponding eigenvector of $F$ in the basis of right singular vectors, i.e. the columns of $V$.

That is, each original principal direction $f_j^{(\ell)}$ is now expressed in the $V$-basis by the vector $V^Tf_j$. (Notice that the $k$-th entry of $V^Tf_j$ is the overlap $\langle v_k, f_j\rangle$ between the right singular vector $v_k$ and the $j$-th eigenvector $f_j$.)

In the $y$-coordinates, the right singular vectors $v_k^{(\ell)}$ map to the standard basis:

$
V^T\,v_k = e_k,
$
where 
$
e_k\in\R^n
$
has $1$ in position $k$ and $0$ elsewhere.

Thus, when we ask for the overlap
$
\bigl\langle v_k^{(\ell)},\,f_j^{(\ell)}\bigr\rangle
= v_k^{(\ell)T}\,f_j^{(\ell)},$
we are equivalently asking for the $k$-th coordinate, $e_k^T\,V^Tf_j$, of the $j$-th principal component expressed in the $V$-basis.
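The change of basis can be verified numerically. The sketch below uses random stand-ins for $W$ and for the data (all sizes and variable names are illustrative assumptions) and checks three statements from the derivation: the eigenvalues of $\widetilde F$ equal those of $F$, the vectors $V^Tf_j$ are eigenvectors of $\widetilde F$, and the $k$-th coordinate of $V^Tf_j$ is $\langle v_k, f_j\rangle$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
W = rng.normal(size=(4, n))                           # stand-in weight matrix
X = rng.normal(size=(500, n)) @ rng.normal(size=(n, n))  # correlated stand-in data
F = np.cov(X, rowvar=False, bias=True)                # ACM with 1/N normalisation

_, _, Vt = np.linalg.svd(W, full_matrices=True)       # Vt = V^T, shape (n, n)
F_tilde = Vt @ F @ Vt.T                               # Cov(y) for y = V^T x

lam, f = np.linalg.eigh(F)                            # columns of f are the f_j
lam_tilde = np.linalg.eigvalsh(F_tilde)

g = Vt @ f                                            # column j is V^T f_j
residual = F_tilde @ g - g * lam                      # zero if (lam_j, V^T f_j) are eigenpairs

k, j = 2, 3
coordinate = g[k, j]                                  # k-th coordinate of V^T f_j
inner_product = Vt[k] @ f[:, j]                       # <v_k, f_j>
```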

The overlap $O_k^{(\ell)}$ on each $k$-th new coordinate can be interpreted as follows:
+ The right singular vectors $v_k^{(\ell)}$ are the axes along which $W^{(\ell)}$ scales inputs by $\sigma_k$.
+ The covariance eigenvectors $f_j^{(\ell)}$ are the axes along which the data intrinsically varies.
+ Their overlap measures how much each potential amplification axis $v_k$ coincides with a real data axis $f_j$.
+ If a direction $v_k\in V$ has high overlap with one of the $f_j$, then whenever the input has a component along that feature axis, $W$ scales it by $\sigma_k$: a large singular value amplifies it, a small one suppresses it.
+ Conversely, if $v_k$ is orthogonal to all $f_j$, the model never sees data in that direction, so that axis of $W$ is essentially unused.

The overlap directly measures weight anisotropy relative to the data distribution, that is, $W$ does not treat every direction in its input space equally, but instead stretches some directions more than others:
+ A large $O_k$ means $v_k$ aligns with a direction of high data variance, so that axis of $W$ is in active use.
+ A small $O_k$ means $v_k$ points into a direction where the model rarely ventures.
+ Outlier singular values could point out where in the transformer new feature directions emerge.

## Derivation of the Marchenko–Pastur Law for Singular Values

Let $X\in\R^{m\times n}$ have i.i.d. entries with zero mean and variance $\sigma^2$. One can form the (scaled) sample covariance
$$
C \;=\;\frac1n\,X\,X^T
\;\in\;\R^{m\times m}.
$$

As $m,n\to\infty$ with the ratio $q = \frac{m}{n}$ ($0<q\le1$) fixed, the empirical eigenvalue distribution of $C$ converges to the Marchenko–Pastur law, whose support has edges $\lambda_\pm^{(\rm cov)}\;=\;\sigma^2\bigl(1\pm\sqrt{q}\bigr)^2,$ meaning that nearly all eigenvalues of $C$ lie in
$$
\bigl[\sigma^2(1-\sqrt q)^2,\;\sigma^2(1+\sqrt q)^2\bigr].
$$

The nonzero singular values of $X$ are the square roots of the nonzero eigenvalues of $XX^T$: if $\{\lambda_i\}$ are the eigenvalues of $C = \tfrac1n\,X X^T$, then the corresponding singular values of $X$ are $s_i(X)\;=\;\sqrt{\,n\,\lambda_i\,}\,$.
Thus the support of the singular-value distribution of $X$ has edges
$$
s_\pm
=\sqrt{\,n\,\lambda_\pm^{(\rm cov)}\,}
=\sqrt{\,n\,\sigma^2\bigl(1\pm\sqrt q\bigr)^2\,}
=\sigma\;\sqrt n\;\bigl(1\pm\sqrt q\bigr)
=\sigma\bigl(\sqrt n\pm\sqrt m\bigr),
$$
where the last step uses $\sqrt n\,\sqrt q=\sqrt m$.

Center $W$ and compute $\sigma^2=\tfrac1{mn}\sum_{i,j}\bigl(W_{centered}\bigr)_{ij}^2$, the empirical variance of the entries of $W_{centered}$. Set
$$
s_- = \sigma\bigl|\sqrt n - \sqrt m\bigr|,\quad
s_+ = \sigma\bigl(\sqrt n + \sqrt m\bigr).
$$
Any empirical singular value $s_k$ outside $[s_-,s_+]$ is then an outlier relative to the random baseline for that weight matrix $W$.
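The outlier test can be sketched as follows. The function names are hypothetical, and "centering" is assumed here to mean subtracting the global mean of all entries of $W$; the text does not specify whether a global, row-wise, or column-wise mean is intended. The usage example plants a strong rank-one component in a Gaussian matrix so that at least one singular value is expected to exceed $s_+$.

```python
import numpy as np

def mp_singular_value_edges(W):
    """Return the centered matrix and the edges (s_minus, s_plus)."""
    W = np.asarray(W, dtype=float)
    m, n = W.shape
    W_centered = W - W.mean()                  # assumption: global-mean centering
    sigma = np.sqrt(np.mean(W_centered ** 2))  # empirical std of the entries
    s_minus = sigma * abs(np.sqrt(n) - np.sqrt(m))
    s_plus = sigma * (np.sqrt(n) + np.sqrt(m))
    return W_centered, s_minus, s_plus

def outlier_singular_values(W):
    """Singular values of the centered W and a mask of those outside [s_minus, s_plus]."""
    W_centered, s_minus, s_plus = mp_singular_value_edges(W)
    s = np.linalg.svd(W_centered, compute_uv=False)   # descending order
    is_outlier = (s < s_minus) | (s > s_plus)
    return s, is_outlier, (s_minus, s_plus)

# Usage: Gaussian noise plus a planted rank-one direction of strength 80
rng = np.random.default_rng(0)
m, n = 200, 800
u = rng.normal(size=m)
u /= np.linalg.norm(u)
v = rng.normal(size=n)
v /= np.linalg.norm(v)
W = rng.normal(size=(m, n)) + 80.0 * np.outer(u, v)
s, is_outlier, (s_minus, s_plus) = outlier_singular_values(W)
```

Because $\sigma$ is estimated from the full matrix, a strong planted component slightly inflates both edges, so a few bulk singular values near $s_-$ may also be flagged in finite-size examples.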

Given that the (nonzero) eigenvalues $\{\lambda_i\}$ of $C$ in the large-$m,n$ limit have density
$$
p_C(\lambda)
=\frac{1}{2\pi\,\sigma^2\,q\,\lambda}
\sqrt{(\lambda_+^{(\rm cov)}-\lambda)\,(\lambda-\lambda_-^{(\rm cov)})},
$$
and that the nonzero singular values $s_i$ of $W_{centered}$ are related to them, as above, by
$$
s_i = \sqrt{n\,\lambda_i},
$$
the density $p_s(s)$ satisfies
$$
p_s(s)\,ds \;=\; p_C(\lambda)\,d\lambda
\quad\text{with}\quad
\lambda = \frac{s^2}{n},\quad d\lambda = \frac{2s}{n}\,ds.
$$
Then, using $n\,\lambda_\pm^{(\rm cov)} = s_\pm^2$,
$$
\begin{aligned}
p_s(s)
&= p_C\!\Bigl(\frac{s^2}{n}\Bigr)\;\Bigl|\frac{d\lambda}{ds}\Bigr|
= \frac{2s}{n}\;p_C\!\Bigl(\frac{s^2}{n}\Bigr) \\
&= \frac{2s}{n}\cdot\frac{n}{2\pi\,\sigma^2\,q\,s^2}
\sqrt{\Bigl(\lambda_+^{(\rm cov)}-\frac{s^2}{n}\Bigr)\,\Bigl(\frac{s^2}{n}-\lambda_-^{(\rm cov)}\Bigr)} \\
&= \frac{1}{\pi\,\sigma^2\,q\,n\,s}
\sqrt{\bigl(s_+^2 - s^2\bigr)\,\bigl(s^2 - s_-^2\bigr)}.
\end{aligned}
$$
Since $q\,n = m$, the Marchenko–Pastur distribution for the singular values of each weight matrix follows:
$$
\boxed{
p_s(s)
= \frac{1}{\pi\,\sigma^2\,m\,s}
\sqrt{\bigl(s_+^2 - s^2\bigr)\,\bigl(s^2 - s_-^2\bigr)}
}
$$
supported on $s\in[s_-,s_+]$.
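As a numerical sanity check, the singular-value density can be evaluated by changing variables from $p_C(\lambda)$ with the relation $s=\sqrt{n\,\lambda}$ stated earlier, which is the convention under which the support is $[\sigma|\sqrt n-\sqrt m|,\;\sigma(\sqrt n+\sqrt m)]$. The sketch below (hypothetical function name, illustrative sizes, and $q\le1$ assumed) checks that the resulting density integrates to approximately one over its support.

```python
import numpy as np

def mp_singular_value_density(s, m, n, sigma):
    """Density of singular values s for an m x n matrix with i.i.d. entries
    of variance sigma^2, obtained from p_C(lambda) via lambda = s^2 / n.
    Assumes q = m / n <= 1. Returns 0 outside the support."""
    s = np.atleast_1d(np.asarray(s, dtype=float))
    q = m / n
    lam = s ** 2 / n
    lam_minus = sigma ** 2 * (1 - np.sqrt(q)) ** 2
    lam_plus = sigma ** 2 * (1 + np.sqrt(q)) ** 2
    inside = (lam > lam_minus) & (lam < lam_plus)
    p = np.zeros_like(s)
    l = lam[inside]
    p_C = np.sqrt((lam_plus - l) * (l - lam_minus)) / (2 * np.pi * sigma ** 2 * q * l)
    p[inside] = p_C * 2 * s[inside] / n        # Jacobian d(lambda)/ds = 2 s / n
    return p

# Integrate over the support with the trapezoidal rule
m, n, sigma = 200, 800, 1.0
s_minus = sigma * abs(np.sqrt(n) - np.sqrt(m))
s_plus = sigma * (np.sqrt(n) + np.sqrt(m))
grid = np.linspace(s_minus, s_plus, 20001)
p = mp_singular_value_density(grid, m, n, sigma)
area = np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(grid))
```

A histogram of the singular values of a centered Gaussian matrix of the same shape can be overlaid on `p` to compare the empirical spectrum with this baseline.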

By checking for right singular vectors whose associated singular values lie outside these bounds, we can assess, in each layer, which directions significantly affect the orientation of the incoming buffer.