# Linear Algebra Essentials for LLMs from Scratch

This notebook is designed to bridge the gap between basic linear algebra and its practical application in Large Language Models (LLMs). We will use PyTorch exclusively to build an intuitive understanding.


## Why Linear Algebra Matters for LLMs

Large Language Models are fundamentally:

- Matrix multiplications
- Vector projections
- Linear transformations
- Tensor reshaping

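Transformers are stacks of:
- Linear layers (matrix multiplications plus a bias)
- Attention mechanisms (matrix multiplications combined with masking and a softmax)
- Normalization layers (element-wise rescaling using the mean and variance of each vector)

Nearly everything reduces to linear algebra, with a few nonlinearities in between.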

Let’s build intuition step by step.

# 1. Scalars, Vectors, and Matrices in PyTorch

In neural networks:

- Rank 0 → **Scalar** → single number (e.g., loss value)
- Rank 1 → **Vector** → 1D tensor (e.g., embedding of a token)
- Rank 2 → **Matrix** → 2D tensor (e.g., weight matrix in Linear layer)
- Rank 3+ → **Higher-rank tensor** → 3D or larger tensor (e.g., a batch with shape `(batch_size, seq_length, embedding_dim)`)

In Transformers:
- Embeddings are vectors
- Weight parameters are matrices
- Attention scores are matrices
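
As a small illustration of the ranks listed above, the following sketch builds one tensor of each rank and prints its `ndim` and `shape`. The sizes (`batch_size=2`, `seq_length=4`, `embedding_dim=8`) are arbitrary values chosen for the example, not values taken from any particular model.

```python
import torch

# Hypothetical sizes chosen only for illustration
batch_size, seq_length, embedding_dim = 2, 4, 8

loss = torch.tensor(0.25)                                    # rank 0: scalar
token_embedding = torch.randn(embedding_dim)                 # rank 1: vector
weight = torch.randn(embedding_dim, embedding_dim)           # rank 2: matrix
batch = torch.randn(batch_size, seq_length, embedding_dim)   # rank 3: batch of sequences

for name, t in [("loss", loss), ("token_embedding", token_embedding),
                ("weight", weight), ("batch", batch)]:
    print(f"{name}: ndim={t.ndim}, shape={tuple(t.shape)}")
```

The `ndim` attribute is the rank in the sense used above: the number of indices needed to address a single entry.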


## 1.1 Scalar

In [17]:
import torch

# Scalar
scalar = torch.tensor(3.14)
print("Scalar:", scalar)
print("Scalar shape:", scalar.shape)

Scalar: tensor(3.1400)
Scalar shape: torch.Size([])


## 1.2 Vectors

A vector of length $n$, $\mathbf{x}\in\mathbb{R}^{n}$, is a 1-dimensional (1-D) array of real numbers

$$
\mathbf{x} =
\begin{bmatrix}
x_1 \\
x_2 \\
\vdots \\
x_n
\end{bmatrix}.
$$

When discussing vectors we will only consider column vectors. A row vector can always be obtained from a column vector via transposition

$$
\mathbf{x}^{T} = [x_1, x_2, \ldots, x_n].
$$

In [13]:
# Vector
vector = torch.tensor([1.0, 2.0, 3.0])
print("\nVector:", vector)
print("Vector shape:", vector.shape)


Vector: tensor([1., 2., 3.])
Vector shape: torch.Size([3])
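
A 1-D PyTorch tensor has no row or column orientation, so the column/row distinction from the math above only appears once a second dimension is added. The sketch below is one possible way to make that explicit using `unsqueeze` and `.T`; the variable names are illustrative.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])   # shape (3,): no row/column orientation

col = x.unsqueeze(1)                 # shape (3, 1): column vector
row = col.T                          # shape (1, 3): row vector (transpose of the column)

print("1-D tensor:", x.shape)
print("Column vector:", col.shape)
print("Row vector:", row.shape)
```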


## 1.3 Matrix


A matrix $A\in\mathbb{R}^{m\times n}$ is a 2-D array of numbers

$$
A = 
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} &a_{m2} & \cdots & a_{mn} \\
\end{bmatrix},
$$

with $m$ rows and $n$ columns. The element at row $i$ and column $j$ is denoted $a_{ij}$. If $m=n$ we call it a square matrix.


In [16]:
# Matrix
matrix = torch.tensor([[1.0, 2.0],
                       [3.0, 4.0]])
print("\nMatrix:\n", matrix)
print("Matrix shape:", matrix.shape)


Matrix:
 tensor([[1., 2.],
        [3., 4.]])
Matrix shape: torch.Size([2, 2])


# 2. Dot Product vs Element-wise Multiplication

Understanding the difference between element-wise operations and matrix multiplication is crucial.

- **Element-wise:** `*` operator. Shapes must match (or be broadcastable to a common shape).
- **Dot Product:** Sum of element-wise products (for vectors). Measures similarity between two vectors.


In attention mechanisms, dot products compute similarity between queries and keys.

## 2.1 Element-wise Multiplication

In [21]:
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

# Element-wise
elementwise = a * b
print("Element-wise multiplication:", elementwise)

Element-wise multiplication: tensor([ 4., 10., 18.])


## 2.2 Dot Product

In [22]:
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

# Dot product
dot = torch.dot(a, b)
print("Dot product:", dot)

Dot product: tensor(32.)
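
The two operations are directly related: the dot product is the element-wise product followed by a sum. The sketch below checks this for the same `a` and `b`, and also computes the cosine similarity, which is the dot product after dividing out the vector lengths. Cosine similarity is added here only to illustrate the "similarity" interpretation.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

# Dot product as "multiply element-wise, then sum"
manual_dot = (a * b).sum()
print("Manual dot product:", manual_dot)   # tensor(32.)

# Cosine similarity: the dot product divided by the product of the vector lengths
cosine = torch.dot(a, b) / (a.norm() * b.norm())
print("Cosine similarity:", cosine)
```

A cosine similarity close to 1 means the two vectors point in nearly the same direction, regardless of their lengths.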


# 3. Matrix Multiplication

## 3.1 Matrix-scalar multiplication

A matrix can be multiplied by a scalar, and two matrices of the same shape can be added entry by entry. Scalar multiplication scales every entry:

Let $c\in\mathbb{R}$ and $A\in\mathbb{R}^{m\times n}$, then
$$
cA =
\begin{bmatrix}
ca_{11} & ca_{12} & \cdots & ca_{1n} \\
ca_{21} & ca_{22} & \cdots & ca_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
ca_{m1} & ca_{m2} & \cdots & ca_{mn} \\
\end{bmatrix}.
$$


In [30]:
c = 100
A = torch.randn(2, 3)

scaled_A = c*A
print("Matrix \n", A)
print("\n Scaled Matrix \n", scaled_A)

Matrix 
 tensor([[-0.6443,  0.5578,  1.3072],
        [-1.5412, -1.1672, -0.8862]])

 Scaled Matrix 
 tensor([[ -64.4300,   55.7777,  130.7181],
        [-154.1192, -116.7232,  -88.6225]])
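
Matrix addition is mentioned above but not shown, so here is a minimal sketch. It adds two matrices of the same shape and checks that scalar multiplication distributes over the sum. The names `P`, `Q` and `k` are chosen for this example so that the matrix `A` defined above is left untouched.

```python
import torch

P = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
Q = torch.tensor([[10.0, 20.0, 30.0],
                  [40.0, 50.0, 60.0]])
k = 2.0

print("P + Q:\n", P + Q)              # entry-by-entry sum; both shapes are (2, 3)
print("k * (P + Q):\n", k * (P + Q))

# Scalar multiplication distributes over matrix addition
print(torch.allclose(k * (P + Q), k * P + k * Q))  # True
```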



## 3.2 Matrix-vector multiplication


Let $A\in\mathbb{R}^{m\times n}$ and $\mathbf{x}\in\mathbb{R}^{n}$, then $A\mathbf{x}\in\mathbb{R}^{m}$
can be defined _row-wise_ as 

$$
A\mathbf{x} = 
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn} \\
\end{bmatrix}
\begin{bmatrix}
x_1 \\
x_2 \\
\vdots \\
x_n
\end{bmatrix}
=
\begin{bmatrix}
x_1a_{11} + x_2 a_{12} + \cdots + x_na_{1n} \\
x_1a_{21} + x_2 a_{22} + \cdots + x_na_{2n} \\
\vdots \\
x_1a_{m1} + x_2 a_{m2} + \cdots + x_na_{mn} \\
\end{bmatrix}.
$$

In [33]:
x = torch.tensor([1., 2., 3.])
result = A @ x 
print(f"Matrix-vector multiplication A @ x:\n{result}")

Matrix-vector multiplication A @ x:
tensor([ 4.3928, -6.5343])


Equivalently, this means that $A\mathbf{x}$ is a linear combination of the _columns_ of $A$, i.e.,

$$
A\mathbf{x} = 
x_1 \begin{bmatrix} a_{11} \\ a_{21} \\ \vdots \\ a_{m1}  \end{bmatrix} 
+ 
x_2  \begin{bmatrix} a_{12} \\ a_{22} \\ \vdots \\ a_{m2}  \end{bmatrix}
+
\cdots
+
x_n \begin{bmatrix} a_{1n} \\ a_{2n} \\ \vdots \\ a_{mn}  \end{bmatrix}.
$$

Observe that the matrix $A$ defines a linear transformation that maps vectors in $\mathbb{R}^{n}$ to vectors in $\mathbb{R}^{m}$.


In [34]:
# Interpretation as linear combination of columns
col1, col2, col3 = A[:, 0], A[:, 1], A[:, 2]
linear_combination = x[0]*col1 + x[1]*col2 + x[2]*col3
print(f"As linear combination of columns:\n{linear_combination}")

As linear combination of columns:
tensor([ 4.3928, -6.5343])
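
"Linear" has a precise meaning: transforming a weighted sum of vectors gives the same result as taking the weighted sum of the transformed vectors. The following sketch checks this numerically; the matrix `M`, the vectors `u` and `v`, the coefficients and the seed are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
M = torch.randn(2, 3)                     # maps vectors in R^3 to vectors in R^2
u, v = torch.randn(3), torch.randn(3)
alpha, beta = 2.0, -0.5

lhs = M @ (alpha * u + beta * v)          # transform the combination
rhs = alpha * (M @ u) + beta * (M @ v)    # combine the transformed vectors

print("Output shape:", lhs.shape)         # torch.Size([2])
print("Linearity holds:", torch.allclose(lhs, rhs, atol=1e-5))
```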


## 3.3 Matrix-matrix multiplication (is all you need)

![](https://raw.githubusercontent.com/mrdbourke/pytorch-deep-learning/main/images/00_matrix_multiplication_is_all_you_need.jpeg)

**When you start digging into neural network layers and building your own, you'll find matrix multiplications everywhere.** Source: https://marksaroufim.substack.com/p/working-class-deep-learner

Matrix multiplication is the core operation in any neural network (not just transformers).

PyTorch implements matrix multiplication in the `torch.matmul()` function.

The two main rules of matrix multiplication to remember are:

The inner dimensions must match:
- (3, 2) @ (3, 2) won't work
- (2, 3) @ (3, 2) will work
- (3, 2) @ (2, 3) will work
  
The resulting matrix has the shape of the outer dimensions:
- (2, 3) @ (3, 2) -> (2, 2)
- (3, 2) @ (2, 3) -> (3, 3)

Note: Shape mismatches are one of the most common errors in deep learning.
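
The two rules above can be checked directly. This sketch reproduces the listed shape combinations and catches the `RuntimeError` that PyTorch raises when the inner dimensions disagree; the tensor names are illustrative.

```python
import torch

A_23 = torch.randn(2, 3)
B_32 = torch.randn(3, 2)

print((A_23 @ B_32).shape)   # torch.Size([2, 2])
print((B_32 @ A_23).shape)   # torch.Size([3, 3])

# (3, 2) @ (3, 2): inner dimensions 2 and 3 do not match, so PyTorch raises a RuntimeError
try:
    B_32 @ B_32
except RuntimeError as err:
    print("Shape error:", err)

# Transposing one operand makes the inner dimensions agree: (3, 2) @ (2, 3)
print((B_32 @ B_32.T).shape)  # torch.Size([3, 3])
```

Printing `.shape` before a multiplication is usually the quickest way to track down this kind of error.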

---



In a Linear layer: `output = input @ weights`

Where:
- input shape: `(batch_size, input_dim)`
- weights shape: `(input_dim, output_dim)`

This is how embeddings are projected inside Transformers.

Note: "@" in Python is the symbol for matrix multiplication.

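In [1]:
import torch

# Assumed setup: the seed and the input values are not shown in the source;
# they are chosen to be consistent with the shapes in the recorded output below.
torch.manual_seed(42)
linear = torch.nn.Linear(in_features=2, out_features=6)  # 2 input features -> 6 output features
x = torch.tensor([[1., 2.], [3., 4.], [5., 6.]])          # batch of 3, 2 features each
output = linear(x)
print(f"Input shape: {x.shape}\n")
print(f"Output:\n{output}\n\nOutput shape: {output.shape}")

Input shape: torch.Size([3, 2])

Output:
tensor([[2.2368, 1.2292, 0.4714, 0.3864, 0.1309, 0.9838],
        [4.4919, 2.1970, 0.4469, 0.5285, 0.3401, 2.4777],
        [6.7469, 3.1648, 0.4224, 0.6705, 0.5493, 3.9716]],
       grad_fn=<AddmmBackward0>)

Output shape: torch.Size([3, 6])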


In [7]:
x = torch.randn(3, 2)  # batch of 3, 2 features
W = torch.randn(2, 4)  # project to 4 features

# torch.matmul(A, B) or torch.mm(A, B) or A @ B
Y = x @ W

print("Input shape:", x.shape)
print("Weight shape:", W.shape)
print("Output shape:", Y.shape)

Input shape: torch.Size([3, 2])
Weight shape: torch.Size([2, 4])
Output shape: torch.Size([3, 4])
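
Inside a Transformer the input usually has an extra sequence dimension, `(batch_size, seq_length, embedding_dim)`. The same `@` operator still works because the weight matrix is broadcast across the leading dimensions. The sizes below (2, 5, 8 and 16) are hypothetical and chosen only for this sketch.

```python
import torch

torch.manual_seed(0)
emb = torch.randn(2, 5, 8)    # (batch_size, seq_length, embedding_dim): hypothetical sizes
W_proj = torch.randn(8, 16)   # (embedding_dim, output_dim)

projected = emb @ W_proj      # W_proj is broadcast across the batch dimension
print(projected.shape)        # torch.Size([2, 5, 16])

# Each sequence in the batch is projected independently with the same weights
print(torch.allclose(projected[0], emb[0] @ W_proj, atol=1e-5))
```

Only the last dimension of the input has to match the first dimension of the weight matrix; the batch and sequence dimensions are carried through unchanged.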


The `torch.nn.Linear()` module, also known as a feed-forward layer or fully connected layer, implements a matrix multiplication between an input `x` and a weights matrix `A`:

$$
y = x A^{T} + b
$$

Where:

- `x` is the input to the layer (deep learning is a stack of layers like `torch.nn.Linear()` and others on top of each other).
- `A` is the weights matrix created by the layer. It starts out as random numbers that get adjusted as a neural network learns to better represent patterns in the data (notice the "T": the weights matrix gets transposed). Note: you will also often see `W` or another letter such as `X` used for the weights matrix.
- `b` is the bias term used to slightly offset the weights and inputs.
- `y` is the output.

This is a linear function (you may have seen something like $y = mx+b$ in high school or elsewhere), and can be used to draw a straight line!


Exercise: Try changing the values of `in_features` and `out_features` passed to `torch.nn.Linear()` and see how the output shape changes.

You can create your own matrix multiplication visuals at http://matrixmultiplication.xyz/.
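
To connect `torch.nn.Linear()` back to plain matrix multiplication, the sketch below recomputes a layer's output by hand. PyTorch stores the layer's `weight` with shape `(out_features, in_features)`, which is why a transpose is needed. The seed, the sizes and the names `layer` and `inp` are assumptions made for this example.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(in_features=2, out_features=6)
inp = torch.randn(3, 2)                       # batch of 3 samples, 2 features each

print("weight shape:", layer.weight.shape)    # torch.Size([6, 2]) -> (out_features, in_features)
print("bias shape:", layer.bias.shape)        # torch.Size([6])

manual = inp @ layer.weight.T + layer.bias    # matrix multiplication with the transposed weight, plus bias
print("Same as layer(inp):", torch.allclose(layer(inp), manual, atol=1e-6))
```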

# 4. Transpose and Backpropagation



Transposing (`.T` or `torch.t()`) swaps rows and columns.

Recall from Section 1.3 that a matrix $A\in\mathbb{R}^{m\times n}$ has $m$ rows and $n$ columns, and that $a_{ij}$ denotes the element at row $i$ and column $j$.

> Note: By convention, matrix entries are indexed as (row, column), i.e. vertical position first, which is the opposite of the (x, y) order used for points on a graph.





---

**Transpose:** The transpose $A^{T}$ is defined as

$$
A^{T} = 
\begin{bmatrix}
a_{11} & a_{21} & \cdots & a_{m1} \\
a_{12} & a_{22} & \cdots & a_{m2} \\
\vdots & \vdots & \ddots & \vdots \\
a_{1n} & a_{2n} & \cdots & a_{mn} \\
\end{bmatrix}.
$$

The transpose turns columns of the matrix into rows (equivalently rows into columns). A square matrix is called symmetric if $A=A^{T}$.
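
Two transpose facts come up repeatedly when aligning shapes: transposing a product reverses the order of the factors, $(PQ)^{T} = Q^{T}P^{T}$, and any matrix multiplied by its own transpose is symmetric. The sketch below checks both numerically; the matrices `P` and `Q` and the seed are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
P = torch.randn(2, 3)
Q = torch.randn(3, 4)

# Transposing a product reverses the order of the factors
print(torch.allclose((P @ Q).T, Q.T @ P.T, atol=1e-6))

# P @ P.T is square (2 x 2) and symmetric
S = P @ P.T
print(S.shape)                              # torch.Size([2, 2])
print(torch.allclose(S, S.T, atol=1e-6))
```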



**Why is this important?**
1. **Dimension Alignment:** To perform matrix multiplication, inner dimensions must match.
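2. **Backpropagation:** A gradient must have the same shape as the quantity it updates. For $Y = XW$, the weight gradient is $X^{T}\,\frac{\partial L}{\partial Y}$ and the input gradient is $\frac{\partial L}{\partial Y}\,W^{T}$, so transposes appear throughout the backward pass.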

If W has shape (in, out), then W.T has shape (out, in).

In [19]:
W = torch.randn(3, 4)
print("Original shape:", W.shape)

W_T = W.T
print("Transposed shape:", W_T.shape)

Original shape: torch.Size([3, 4])
Transposed shape: torch.Size([4, 3])


In [20]:
weights = torch.randn(3, 4) # 3 output features, 4 input features
inputs = torch.randn(2, 4)  # Batch of 2, 4 input features

# Forward Pass: (Batch, Input) @ (Input, Output) -> (Batch, Output)
# Note: weights usually needs to be transposed if defined as (Output, Input)
output = inputs @ weights.T 

print(f"Inputs Shape: {inputs.shape}")
print(f"Weights Shape: {weights.shape}")
print(f"Weights.T Shape: {weights.T.shape}")
print(f"Output Shape: {output.shape}")

Inputs Shape: torch.Size([2, 4])
Weights Shape: torch.Size([3, 4])
Weights.T Shape: torch.Size([4, 3])
Output Shape: torch.Size([2, 3])
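
To see where the transpose shows up in backpropagation, this minimal sketch compares the weight gradient computed by autograd with the manual expression $X^{T}\,\frac{\partial L}{\partial Y}$ for $Y = XW$. The toy loss (the sum of all outputs), the seed and the names `X` and `Wt` are assumptions made for the example.

```python
import torch

torch.manual_seed(0)
X = torch.randn(2, 4)                         # (batch, in)
Wt = torch.randn(4, 3, requires_grad=True)    # (in, out)

Y = X @ Wt                                    # (batch, out)
loss = Y.sum()                                # toy loss, so dL/dY is a matrix of ones
loss.backward()

grad_Y = torch.ones_like(Y)                   # dL/dY for this toy loss
manual_grad_W = X.T @ grad_Y                  # (in, batch) @ (batch, out) -> (in, out)

print("grad shape:", Wt.grad.shape)           # torch.Size([4, 3]), same shape as Wt
print("Matches autograd:", torch.allclose(Wt.grad, manual_grad_W, atol=1e-6))
```

The transpose of `X` is what makes the shapes line up: without it, `(batch, in) @ (batch, out)` would not be a valid multiplication.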


# 5. Special Matrices


## 5.1 Upper Triangular Matrix



An upper triangular matrix $U\in\mathbb{R}^{n\times n}$ is a matrix where all the entries below the main diagonal are zero

$$
U = 
\begin{bmatrix}
u_{11} & u_{12} & \cdots & u_{1n} \\
0 & u_{22} & \cdots & u_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & u_{nn} \\
\end{bmatrix}.
$$


**Why is this important?**
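1. A strictly upper triangular mask (ones above the main diagonal, as produced by `torch.triu(..., diagonal=1)`) marks the *future* positions for **causal masking** in self-attention.
2. Masking those positions prevents tokens from attending to future tokens.
3. This is essential in GPT-style autoregressive (decoder-only) models such as GPT and LLaMA.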


In [5]:
seq_len = 5

mask_upper = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
print(mask_upper)

tensor([[0., 1., 1., 1., 1.],
        [0., 0., 1., 1., 1.],
        [0., 0., 0., 1., 1.],
        [0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0.]])
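
One common way to apply such a mask is sketched here: positions marked by the upper triangular mask are set to negative infinity before the softmax, so they receive zero attention weight. The random `scores` tensor is a stand-in for real query-key scores, and the name `T` and the seed are assumptions made for this example.

```python
import torch

torch.manual_seed(0)
T = 5                                                       # sequence length (same value as seq_len above)
scores = torch.randn(T, T)                                  # stand-in for Q @ K.T attention scores

future = torch.triu(torch.ones(T, T), diagonal=1).bool()    # True where column index > row index
masked_scores = scores.masked_fill(future, float("-inf"))   # block attention to future tokens
attn = torch.softmax(masked_scores, dim=-1)                 # each row sums to 1

print(attn)
```

Row `i` of `attn` is nonzero only in columns `0..i`, so token `i` can only draw information from itself and earlier tokens.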


## 5.2 Lower Triangular Matrix



A lower triangular matrix $L\in\mathbb{R}^{n\times n}$ is a matrix where all the entries above the main diagonal are zero

$$
L =
\begin{bmatrix}
l_{11} & 0 & \cdots & 0 \\
l_{21} & l_{22} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
l_{n1} & l_{n2} & \cdots & l_{nn} \\
\end{bmatrix}.
$$




In [6]:
mask_lower = torch.tril(torch.ones(seq_len, seq_len))
print(mask_lower)

tensor([[1., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1.]])
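
A lower triangular matrix of ones, normalized so each row sums to 1, turns a matrix multiplication into a running average over "current and earlier" positions. This is a simplified, fixed-weight version of what causal attention does with learned weights. The token values and names below are made up for illustration.

```python
import torch

n_tokens = 4
tril = torch.tril(torch.ones(n_tokens, n_tokens))
avg_weights = tril / tril.sum(dim=1, keepdim=True)   # row i holds 1/(i+1) in columns 0..i

tokens = torch.tensor([[1.0, 10.0],
                       [2.0, 20.0],
                       [3.0, 30.0],
                       [4.0, 40.0]])                  # 4 tokens, 2 features each

running_mean = avg_weights @ tokens                  # row i = mean of tokens 0..i
print(running_mean)
# rows are approximately [1, 10], [1.5, 15], [2, 20], [2.5, 25]
```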


## 5.3 Identity Matrix

The identity matrix acts like multiplication by 1.


The $n \times n$ identity matrix is

$$
\mathbf{I} = 
\begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1 \\
\end{bmatrix}.
$$

For every $\mathbf{A}\in\mathbb{R}^{n\times n}$, we have $\mathbf{AI} = \mathbf{IA} = \mathbf{A}$.

In practice the identity matrix leaves any vector unchanged. For example,
$$
\begin{bmatrix}1 & 0 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix}3\\ -2 \end{bmatrix} = \begin{bmatrix}3\\ -2 \end{bmatrix}.
$$
Because of this property we sometimes call $\mathbf{I}$ the *do nothing* transform.



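The identity matrix is useful for reasoning about residual connections, where a layer outputs $\mathbf{x} + f(\mathbf{x})$ (the identity map plus a learned update), and for weight initialization schemes that start a layer as the identity.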

In [7]:
I = torch.eye(4)
print(I)

tensor([[1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 0., 1.]])
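
The properties above can be verified in a few lines. The sketch also shows the residual-connection view: when the sub-layer is a plain matrix `M4`, computing `vec + M4 @ vec` is the same as applying the single matrix `I + M4`. The names `M4`, `I4`, `vec` and the seed are assumptions made for this example, and real Transformer sub-layers are not purely linear.

```python
import torch

torch.manual_seed(0)
M4 = torch.randn(4, 4)
I4 = torch.eye(4)
vec = torch.randn(4)

print(torch.allclose(M4 @ I4, M4))   # True: AI = A
print(torch.allclose(I4 @ M4, M4))   # True: IA = A
print(torch.allclose(I4 @ vec, vec)) # True: the identity leaves a vector unchanged

# Residual connection with a linear sub-layer: vec + M4 @ vec == (I + M4) @ vec
residual = vec + M4 @ vec
print(torch.allclose(residual, (I4 + M4) @ vec, atol=1e-6))
```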


## 5.4 Diagonal Matrix

A diagonal matrix $D\in\mathbb{R}^{n\times n}$ has entries $d_{ij}=0$ if $i\neq j$, i.e.,

$$
D =
\begin{bmatrix}
d_{11} & 0 & \cdots & 0 \\
0 & d_{22} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & d_{nn} \\
\end{bmatrix}.
$$
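
Diagonal matrices are used in scaling operations: multiplying a vector by $D$ scales its $i$-th coordinate by $d_{ii}$.

In attention, dividing the scores by $\sqrt{d_k}$ is conceptually a diagonal scaling with the matrix $\frac{1}{\sqrt{d_k}}\mathbf{I}$.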

In [8]:
diag_values = torch.tensor([1.0, 2.0, 3.0])
D = torch.diag(diag_values)
print(D)

tensor([[1., 0., 0.],
        [0., 2., 0.],
        [0., 0., 3.]])
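
The scaling interpretation can be checked with a short sketch: multiplying by a diagonal matrix gives the same result as element-wise multiplication by its diagonal, and dividing by $\sqrt{d_k}$ equals multiplying by a scaled identity. The values of `d_k`, `v3` and `scores3` are made up for illustration.

```python
import torch

diag_vals = torch.tensor([1.0, 2.0, 3.0])
D3 = torch.diag(diag_vals)
v3 = torch.tensor([10.0, 10.0, 10.0])

print(D3 @ v3)            # tensor([10., 20., 30.]): each coordinate scaled by its diagonal entry
print(diag_vals * v3)     # same result, without building the full matrix

# Attention-style scaling: dividing by sqrt(d_k) equals multiplying by (1/sqrt(d_k)) * I
d_k = 4
scores3 = torch.tensor([[2.0, 4.0, 6.0]])
scaled = scores3 @ (torch.eye(3) / d_k ** 0.5)
print(scaled)             # tensor([[1., 2., 3.]])
```

In practice the element-wise form is preferred, since it avoids storing and multiplying by a matrix that is almost entirely zeros.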


# Final Summary

You now understand:

- Scalars, vectors, matrices
- Dot product vs element-wise multiplication
- Matrix multiplication (is all you need)
- Transpose in backpropagation
- Triangular masks for causal self-attention
- Identity and diagonal matrices