# Intro to using PyTorch for a simple LLM

This notebook has a narrow aim: learning to read the PyTorch code that appears in transformer implementations. It covers:

* Tensors (PyTorch’s main data structure)
* Shapes and dimensions
* `nn.Module` and `forward`
* Common layers: `Embedding`, `Linear`, `Softmax`
* Combining these elements into a toy mini-network

We will review more PyTorch examples in future weeks, and give some additional examples of more basic neural networks.  For now, we are laying out the key ideas for transformers.

For a deeper grounding in PyTorch and a more thorough introduction to building an LLM from scratch, see:
* https://github.com/rasbt/LLMs-from-scratch : the repo for the book "Build a Large Language Model (From Scratch)"
* Appendix A of that repo has a PyTorch intro, for example: https://github.com/rasbt/LLMs-from-scratch/blob/main/appendix-A/01_main-chapter-code/code-part1.ipynb
* Andrej Karpathy also has some excellent videos, including this one: ["Let's build GPT: from scratch, in code, spelled out."](https://youtu.be/kCc8FmEb1nY?si=v5wAKd8b83EzstyZ)

## What is PyTorch?

PyTorch combines three things: a numerical computing library (like NumPy), autograd (automatic differentiation), and neural network utilities.

Core idea: you work with tensors (multidimensional arrays) and modules (layers/models).

In [None]:
import torch
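
To make the "autograd" part concrete, here is a minimal sketch (the variable names `w` and `y` are just illustrative). It builds a small expression from a tensor that tracks gradients and asks PyTorch for the derivative.

```python
import torch

# A scalar tensor that PyTorch should track gradients for
w = torch.tensor(3.0, requires_grad=True)

# y = w^2 + 2w, built from ordinary tensor operations
y = w ** 2 + 2 * w

# Autograd computes dy/dw = 2w + 2 and stores it in w.grad
y.backward()

print(w.grad)  # tensor(8.) because 2 * 3 + 2 = 8
```

This is the mechanism that training relies on: the loss plays the role of `y`, and the model's parameters play the role of `w`.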

## Tensors: PyTorch’s arrays

### Creating tensors

In [None]:
# Scalar
a = torch.tensor(3.0)

# 1D tensor (vector)
v = torch.tensor([1.0, 2.0, 3.0])

# 2D tensor (matrix)
M = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])

print("a:", a)
print("v:", v)
print("M:", M)

### Tensor shapes 
##### (`.shape`)

A tensor's shape tells you how many dimensions it has and how many elements lie along each dimension.

In [None]:
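print("a.shape:", a.shape)  # torch.Size([])      a scalar has no dimensions
print("v.shape:", v.shape)  # torch.Size([3])     one dimension with 3 elements
print("M.shape:", M.shape)  # torch.Size([2, 2])  2 rows, 2 columns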

In the transformer code, we'll see 2D and 3D tensors, something like:

In [None]:
batch_size = 2     # number of text sequences
seq_len = 5        # number of words (tokens) in the sequence
emb_len = 16       # size of the embedding or language model vector

x = torch.randn(batch_size, seq_len, emb_len)
print(x.shape)

In [None]:
x
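
When reading transformer code, it helps to pull individual sizes out of a shape. The sketch below rebuilds a tensor with the same `(batch, seq, emb)` layout and shows a few common ways to inspect it.

```python
import torch

x = torch.randn(2, 5, 16)  # (batch_size, seq_len, emb_len)

# A shape can be unpacked like a tuple
batch_size, seq_len, emb_len = x.shape
print(batch_size, seq_len, emb_len)  # 2 5 16

print(x.ndim)     # 3   number of dimensions
print(x.numel())  # 160 total number of elements (2 * 5 * 16)

# Indexing removes a dimension each time
print(x[0].shape)     # torch.Size([5, 16])  one sequence
print(x[0, 0].shape)  # torch.Size([16])     one token's vector
```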

## Basic operations
### Indexing

In [None]:
x = torch.tensor([[10, 20, 30],
                  [40, 50, 60]])

print(x[0])      # first row
print(x[1, 2])   # row with index 1, col with index 2
                 # i.e. second row, third column
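
Slicing with `:` shows up constantly in model code, so here is a short sketch using the same small tensor. A bare `:` means "everything along this dimension", and negative indices count from the end.

```python
import torch

x = torch.tensor([[10, 20, 30],
                  [40, 50, 60]])

print(x[:, 0])   # tensor([10, 40])  first column of every row
print(x[:, -1])  # tensor([30, 60])  last column of every row
print(x[:, 1:])  # tensor([[20, 30], [50, 60]])  columns from index 1 onward
```

The `x[:, -1]` pattern is worth remembering: with a `(batch, seq_len, ...)` tensor it selects the last position of every sequence.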

### Simple math

In [None]:
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([10.0, 20.0, 30.0])

print(a + b)          # elementwise add
print(a * b)          # elementwise multiply
print(a @ b)          # dot product here (1D @ 1D); for 2D tensors, @ is matrix multiplication

Neural networks rely heavily on matrix multiplication: a fully-connected layer multiplies its input by a weight matrix.

<img src="https://ml-cheatsheet.readthedocs.io/en/latest/_images/dynamic_resizing_neural_network_4_obs.png" alt="NNMatrixMultiply" width="500">
* Image source: https://ml-cheatsheet.readthedocs.io/en/latest/forwardpropagation.html
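
As a rough sketch of how shapes behave under `@`, the example below multiplies a batch of observations by a weight matrix. The sizes and the names `X`, `W`, `xb`, and `Wb` are arbitrary choices for illustration.

```python
import torch

# 4 observations with 3 features each, times a (3, 2) weight matrix
X = torch.randn(4, 3)
W = torch.randn(3, 2)
out = X @ W
print(out.shape)  # torch.Size([4, 2])  inner dimensions (3 and 3) must match

# With a 3D input, the matrix multiply is applied to the last dimension
xb = torch.randn(2, 5, 16)   # (batch, seq_len, emb_len)
Wb = torch.randn(16, 8)
outb = xb @ Wb
print(outb.shape)  # torch.Size([2, 5, 8])
```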

## `nn.Module`: how PyTorch defines layers and models

In PyTorch, nearly every neural-network component (a layer or a whole model) is a subclass of `nn.Module`.
* If you haven't used Python classes, a Python class is just a way to bundle data and behavior together.  It's like a custom blueprint for making objects.
* For example, with a Python class called `Student`:
  * **Class**: a blueprint (e.g., `Student`).
  * **Object / instance**: a thing made from the blueprint (e.g., `student1`).
  * **Attribute**: a variable that belongs to an object (`student1.name`).
  * **Method**: a function inside a class that uses the object’s data (`student1.average_grade()`).
  * `self`: a reference to "this object right here".
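
The `Student` vocabulary above can be sketched as a small plain-Python class. The attribute `grades` and the sample values are made up for illustration.

```python
class Student:
    # __init__ runs whenever a new Student is created
    def __init__(self, name, grades):
        self.name = name      # attribute: data stored on this object
        self.grades = grades

    # method: a function that uses this object's data through `self`
    def average_grade(self):
        return sum(self.grades) / len(self.grades)


student1 = Student("Ada", [90, 80, 100])  # object / instance
print(student1.name)             # Ada
print(student1.average_grade())  # 90.0
```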

For classes with PyTorch that are subclasses of `nn.Module`:
* `super().__init__()` runs `nn.Module`'s own setup code. It must run before any layers are assigned, so that PyTorch can register them and track their parameters
* `__init__` is used to set up layers
  * This method is called whenever a new instance is created
* `forward(self, x)` describes the computation that turns an input `x` into an output

### A tiny linear model

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class TinyModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()  # important!

        # Layers (parameters are created here)
        # We'll get to this later, but nn.Linear is essentially implementing the matrix operations
        # that we need, such as y = x @ W^T + b (a fully-connected layer)
        self.linear1 = nn.Linear(input_dim, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # x: (batch, input_dim)
        h = self.linear1(x)       # h of size (batch, hidden_dim)
        h = F.relu(h)             # activation function
        out = self.linear2(h)     # out of size (batch, output_dim)
        return out

In [None]:
x = torch.randn(2, 4)  # 2 data records, each of dimension 4

In [None]:
x

In [None]:
model = TinyModel(input_dim=4, hidden_dim=8, output_dim=3)

In [None]:
logits = model(x)

In [None]:
print(logits.shape)    # should be model output for 2 records with 3 values each

In [None]:
logits

In [None]:
probs = F.softmax(logits, dim=-1)

print("logits:\n", logits)
print("probs:\n", probs)
print("sum of probs:\n", probs.sum(dim=-1))  # each row should sum to 1

### Key points:

* `TinyModel(...)` creates a model with parameters.
* Calling `model(x)` automatically calls `forward`.
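
To see both points in action, here is a small sketch. It uses `nn.Sequential` as a stand-in with the same layer sizes as `TinyModel(4, 8, 3)` so that the example is self-contained; that substitution is an assumption made for brevity.

```python
import torch
import torch.nn as nn

# Stand-in for TinyModel(input_dim=4, hidden_dim=8, output_dim=3)
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 3),
)

# Creating the model created its parameters (weights and biases)
for name, p in model.named_parameters():
    print(name, tuple(p.shape))

n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # 67 = (4*8 + 8) + (8*3 + 3)

# model(x) runs forward under the hood; prefer model(x) in real code
x = torch.randn(2, 4)
same = torch.equal(model(x), model.forward(x))
print(same)  # True
```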

## Layers for transformers
### Embeddings
##### `nn.Embedding`

Maps integer token IDs to vectors. The vectors are learnable parameters, so they are updated during training.

In [None]:
vocab_size = 1000
d_model = 32
embedding = nn.Embedding(vocab_size, d_model)

In [None]:
# Suppose we have 2 sequences of length 5:
input_ids = torch.randint(0, vocab_size, (2, 5))  # random token IDs
input_ids

In [None]:
print("input_ids.shape:", input_ids.shape)        # (2, 5)

In [None]:
emb = embedding(input_ids)
emb

In [None]:
print("emb.shape:", emb.shape)

In [None]:
x = torch.tensor([[10, 20, 30],
                  [10, 50, 60]])
emb = embedding(x)

In [None]:
x[0,0], emb[0,0]

In [None]:
x[1,0], emb[1,0]

In [None]:
x[1,1], emb[1,1]

#### Interpretation:

* Each integer ID -> row in an embedding matrix -> vector of size d_model.
* After embedding, we have continuous vectors that go into transformer layers.
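
The "row in an embedding matrix" idea can be checked directly. This sketch assumes the same sizes as above and compares the layer's output with plain indexing into its `weight` matrix.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(1000, 32)   # weight matrix of shape (1000, 32)
ids = torch.tensor([[10, 20, 30],
                    [10, 50, 60]])

emb = embedding(ids)
print(emb.shape)  # torch.Size([2, 3, 32])

# An embedding lookup is just row selection from the weight matrix
is_lookup = torch.equal(emb, embedding.weight[ids])
print(is_lookup)  # True

# The same token ID (10) always maps to the same vector
same_vector = torch.equal(emb[0, 0], emb[1, 0])
print(same_vector)  # True
```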

### Linear layers
##### `nn.Linear`

Implements: y = x @ W^T + b (a fully-connected layer).

In [None]:
linear = nn.Linear(in_features=32, out_features=10)

x = torch.randn(2, 5, 32)          # (batch, seq, d_model)
out = linear(x)                    # (batch, seq, 10)
print("out.shape:", out.shape)

In LLMs, the last layer is often `nn.Linear(d_model, vocab_size)`:
* it takes a hidden state and returns logits over the vocabulary (one score per token).
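
As a sanity check on the formula `y = x @ W^T + b`, the sketch below recomputes a linear layer's output by hand from its `weight` and `bias` attributes. The tolerance value is an arbitrary choice to absorb floating-point rounding.

```python
import torch
import torch.nn as nn

linear = nn.Linear(in_features=32, out_features=10)
print(linear.weight.shape)  # torch.Size([10, 32])  stored as (out, in)
print(linear.bias.shape)    # torch.Size([10])

x = torch.randn(2, 5, 32)   # (batch, seq, d_model)
out = linear(x)

# Same computation written out by hand
manual = x @ linear.weight.T + linear.bias

matches = torch.allclose(out, manual, atol=1e-5)
print(matches)
```

Because the weight is stored as `(out_features, in_features)`, the transpose in `W^T` is what makes the shapes line up.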

### Softmax
##### `F.softmax`

Turns logits (real numbers) into probabilities that sum to 1.

In [None]:
logits = torch.tensor([2.0, 1.0, 0.1])
probs = F.softmax(logits, dim=-1)

print("logits:", logits)
print("probs:", probs)
print("sum of probs:", probs.sum())
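
Softmax exponentiates each logit and divides by the sum of the exponentials. The sketch below writes that out by hand, and also shows what `dim=-1` means for a 2D batch of logits (the second row of `batch_logits` is made up).

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])

# softmax(z)_i = exp(z_i) / sum_j exp(z_j)
manual = torch.exp(logits) / torch.exp(logits).sum()
print(manual)  # approximately tensor([0.6590, 0.2424, 0.0986])

matches = torch.allclose(manual, F.softmax(logits, dim=-1))
print(matches)  # True

# For a batch, dim=-1 normalizes each row separately
batch_logits = torch.tensor([[2.0, 1.0, 0.1],
                             [0.0, 0.0, 0.0]])
batch_probs = F.softmax(batch_logits, dim=-1)
print(batch_probs.sum(dim=-1))  # each row sums to 1
```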

## A tiny end-to-end example (like a micro language model)

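This tiny network mimics the input-to-output flow of a transformer language model (token IDs in, vocabulary logits out), but without any attention layers.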

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyToyLM(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.linear = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        # input_ids: (batch, seq_len)
        x = self.embed(input_ids)          # (batch, seq_len, d_model)

        # Just use the last token's representation (like a very dumb LM)
        last_hidden = x[:, -1, :]          # (batch, d_model)

        logits = self.linear(last_hidden)  # (batch, vocab_size)
        return logits

**Example usage**

In [None]:
vocab_size = 20
d_model = 16
model = TinyToyLM(vocab_size, d_model)

batch_size = 2
seq_len = 4
# set up dummy tensor with random IDs from 0 to vocab_size - 1 (randint's upper bound is exclusive)
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

logits = model(input_ids)
probs = F.softmax(logits, dim=-1)

print("input_ids:\n", input_ids)
print("logits shape:", logits.shape)
print("probs[0] sums to:", probs[0].sum())

In [None]:
logits[1]

### Core pattern:

* Integers in (input_ids)
* Embeddings (via nn.Embedding)
* Some computation (here trivial, in transformers it’s self-attention layers)
* Linear layer to logits
* Optional softmax to get probabilities
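
The same pattern can be written step by step without a class. The sketch below is one possible illustration: the final `argmax` step, which picks the most probable token ID, is an addition here and not something the model above does. The model is untrained, so the chosen IDs are meaningless.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 20, 16
embed = nn.Embedding(vocab_size, d_model)
head = nn.Linear(d_model, vocab_size)

input_ids = torch.randint(0, vocab_size, (2, 4))  # 1. integers in
x = embed(input_ids)                              # 2. embeddings: (2, 4, 16)
last_hidden = x[:, -1, :]                         # 3. trivial "computation": (2, 16)
logits = head(last_hidden)                        # 4. linear layer to logits: (2, 20)
probs = F.softmax(logits, dim=-1)                 # 5. optional softmax

# Pick the highest-probability token ID for each sequence
next_ids = probs.argmax(dim=-1)
print(next_ids.shape)  # torch.Size([2])
```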

## How this maps to transformer examples

When you see something like:

In [None]:
class TinyLM(nn.Module):
    def __init__(self, vocab_size, d_model, n_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList([...])  # placeholder: the model's blocks would be listed here
        self.ln_f = nn.LayerNorm(d_model)
        self.out_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        x = self.embed(input_ids)
        for layer in self.layers:
            x = layer(x)
        x = self.ln_f(x)
        logits = self.out_head(x)
        return logits


We can interpret this as:
* `input_ids` is just a tensor of integers (token IDs).
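* `self.embed` is like the simple `nn.Embedding` example above.
* `self.layers` is a list of more complex blocks (self-attention + feedforward), applied one after another.
* `self.ln_f` is a final layer normalization applied before the output layer.
* `self.out_head` is a Linear mapping from d_model -> vocab_size.
* The output logits have shape (batch, seq_len, vocab_size), because `out_head` is applied at every position. They can be turned into probabilities with softmax.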
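
To check the shapes end to end, here is a runnable sketch of the same skeleton. `ToyBlock` and `RunnableTinyLM` are hypothetical names introduced for this example, and `ToyBlock` is only a feed-forward stand-in: it contains no self-attention, so this is not a real transformer.

```python
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Hypothetical stand-in for a transformer block (feed-forward only)."""

    def __init__(self, d_model):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Residual connection: output keeps the shape (batch, seq_len, d_model)
        return x + self.ff(x)


class RunnableTinyLM(nn.Module):
    def __init__(self, vocab_size, d_model, n_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList([ToyBlock(d_model) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.out_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        x = self.embed(input_ids)      # (batch, seq_len, d_model)
        for layer in self.layers:
            x = layer(x)               # shape unchanged
        x = self.ln_f(x)
        return self.out_head(x)        # (batch, seq_len, vocab_size)


model = RunnableTinyLM(vocab_size=20, d_model=16, n_layers=2)
input_ids = torch.randint(0, 20, (2, 4))
logits = model(input_ids)
print(logits.shape)  # torch.Size([2, 4, 20])
```

Using `nn.ModuleList` instead of a plain Python list matters: it registers each block so that its parameters appear in `model.parameters()`.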