# Simple Bigram Model

### A simple bigram model will help motivate more intricate transformer architectures.

## What is a Bigram?

From Wikipedia:

A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2.

The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, and speech recognition.

## Bigram Model

A bigram model predicts the next token given only the single preceding token.
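
Before moving to PyTorch, the idea can be sketched with plain counting. The snippet below is a minimal illustration, not the model built later in this notebook: `toy_text`, `count_bigrams`, and `most_likely_next` are hypothetical names introduced here, and the corpus is a made-up string.

```python
from collections import Counter

def count_bigrams(tokens):
    """Count every pair of adjacent tokens in a sequence."""
    return Counter(zip(tokens, tokens[1:]))

def most_likely_next(counts, token):
    """Return the token that most often follows `token`, or None if unseen."""
    followers = {b: n for (a, b), n in counts.items() if a == token}
    return max(followers, key=followers.get) if followers else None

toy_text = "hello hello help"  # hypothetical toy corpus, characters as tokens
bigram_counts = count_bigrams(toy_text)

print(bigram_counts[("h", "e")])             # how often "e" follows "h"
print(most_likely_next(bigram_counts, "h"))  # the most frequent follower of "h"
```

The neural version below learns a table that plays the same role as these counts, but stores unnormalized scores (logits) instead of raw frequencies.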

Let's spell this out in code.

First, install and import PyTorch, a machine learning framework for Python based on the Torch library. It is primarily used for building deep neural networks.

In [3]:
%pip install torch


In [4]:
import torch
import torch.nn as nn
from torch.nn import functional as F



## Hyperparameters
Now we must define some hyperparameters for our Bigram Model. 

***These parameters are taken from Andrej Karpathy's Let's Build GPT YouTube video***

In [5]:
batch_size = 32
block_size = 8 
max_iters = 3000
eval_interval = 300
learning_rate = 1e-2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200

### Batch Size
Batch size is the number of samples used in one forward and backward pass through the network. In principle, batch size determines the number of independent sequences we process in parallel. 

### Block Size
Block size is synonymous with context length or the context window. It determines the number of tokens considered when predicting a new token.

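With our block size of 8, at most 8 tokens are available as context when predicting the 9th token in the sequence. The bigram model below only ever looks at the last of those tokens; the full window matters for the transformer architectures this model motivates.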
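
To make the context window concrete, here is a small sketch of how one chunk of `block_size + 1` tokens yields `block_size` training examples. The token ids in `chunk` are made up for illustration; only `block_size` matches the hyperparameter cell.

```python
block_size = 8  # same value as in the hyperparameter cell

# Hypothetical token ids standing in for one chunk of encoded text
chunk = [18, 47, 56, 57, 58, 1, 15, 47, 58]  # block_size + 1 tokens

x = chunk[:block_size]       # inputs
y = chunk[1:block_size + 1]  # targets: the inputs shifted one position right

examples = []
for t in range(block_size):
    context = x[:t + 1]  # everything seen so far, up to block_size tokens
    target = y[t]        # the token that comes next
    examples.append((context, target))
    print(f"context {context} -> target {target}")
```

Each position in the chunk is a separate prediction problem, which is why the targets are simply the inputs offset by one.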

## Seeding the torch.Generator Object

We will be generating random numbers here. To make sure the same random numbers are produced every time the notebook is run, we seed PyTorch's default random number generator. `torch.manual_seed` sets the seed and returns the generator object, which is why the cell prints a `torch.Generator`.

In [6]:
torch.manual_seed(123)

<torch._C.Generator at 0x107616610>

## Read in training corpus

In [7]:
with open('speeches.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [8]:
# get unique characters
chars = sorted(list(set(text)))
vocab_len = len(chars)
# encode chars as integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }

# anonymous functions to encode a string into integers and decode integers back into a string
encode = lambda s: [stoi[c] for c in s] 
decode = lambda l: ''.join([itos[i] for i in l])
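
As a quick check of how this character-level tokenizer behaves, the sketch below rebuilds the same mappings on a made-up string. The `toy_` names are hypothetical and exist only so the example runs without `speeches.txt`.

```python
toy_text = "hello world"  # hypothetical stand-in for the training corpus

toy_chars = sorted(set(toy_text))                      # unique characters
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}   # char -> integer
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}   # integer -> char

def toy_encode(s):
    """Map each character of a string to its integer id."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("hello")
print(toy_chars)
print(ids)
print(toy_decode(ids))
```

Note that encoding a character that never appeared in the corpus raises a `KeyError`, since the vocabulary is fixed by the training text.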

### Split Data into training and validation splits

In [9]:
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

### Function to generate a batch of inputs and targets

In [11]:
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ints = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ints])
    y = torch.stack([data[i+1:i+block_size+1] for i in ints])
    x, y = x.to(device), y.to(device)
    return x, y

### Function to Estimate the Loss

In [14]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

## Define Bigram Model Class

In [13]:
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
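
The class above boils down to three steps: look up a row of logits for the current token, turn it into probabilities with softmax, and either score it with cross-entropy or sample from it. The sketch below reproduces those steps in plain Python on a hypothetical 3-token vocabulary. The `table` values are made up, and the helper functions are simplified stand-ins for `F.softmax`, `F.cross_entropy`, and `torch.multinomial`, not their implementations.

```python
import math
import random

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target])

# Hypothetical table: row i holds the logits for "what follows token i"
table = [
    [0.0, 0.0, 0.0],   # all-zero row: every next token equally likely
    [2.0, 0.0, -1.0],  # row that favours token 0
    [0.0, 3.0, 0.0],   # row that favours token 1
]

# A uniform guess costs ln(vocab_size); the two printed values agree
# up to floating-point rounding.
uniform_loss = cross_entropy(table[0], target=2)
print(uniform_loss, math.log(3))

rng = random.Random(123)  # seeded so the sampled sequence is repeatable
sequence = [1]
for _ in range(5):
    probs = softmax(table[sequence[-1]])  # only the last token is used
    next_token = rng.choices(range(3), weights=probs, k=1)[0]
    sequence.append(next_token)
print(sequence)
```

Because `nn.Embedding` starts from random weights rather than zeros, the loss at step 0 of training will generally sit somewhat above the uniform value of ln(vocab size).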


### Create Instance of Bigram Model

In [15]:
model = BigramLanguageModel(vocab_len)
m = model.to(device)

### Optimize Estimated Loss with Adam optimizer

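Documentation for the Adam optimizer: https://pytorch.org/docs/stable/generated/torch.optim.Adam.html

The training cell below uses `torch.optim.AdamW`, a variant of Adam that applies weight decay separately from the gradient update.

Adam combines two ideas: momentum and RMSprop.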


Momentum: Helps the optimizer to keep moving in the current direction, similar to the physical concept of momentum. It introduces a moving average of the gradients, and this moving average is then used to update the parameters of the model.

Here's a basic idea of how momentum works:

1. **Update Rule:** Instead of updating the parameters based solely on the current gradient, momentum introduces a moving average of the past gradients. The update rule for a parameter $w$ becomes:

   $ v_t = \beta \cdot v_{t-1} + (1 - \beta) \cdot \nabla J(w_t) $ \
   $ w_{t+1} = w_t - \alpha \cdot v_t $

   Where:
   - $ \alpha $ is the learning rate.
   - $ \beta$ is the momentum term (typically close to 1, e.g., 0.9 or 0.99).
   - $ \nabla J(w_t)$ is the gradient of the loss with respect to the parameters at the current iteration. 

2. **Benefits:** Momentum helps to smooth out oscillations and speed up convergence, especially in the presence of noisy gradients or if the optimization surface has long, shallow valleys. It helps the optimizer to continue moving in the current direction, even if the gradient changes direction frequently.

3. **Intuition:** Think of it as a ball rolling down a surface with valleys and hills. Momentum allows the ball to accumulate speed when rolling down a slope and carry that speed to overcome small bumps, helping it to converge faster.
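
The momentum update rule above can be sketched in a few lines. This is an illustration on an assumed toy loss $J(w) = w^2$, not PyTorch's implementation; `grad` and `momentum_descent` are hypothetical names, and the values of `alpha`, `beta`, and `steps` were chosen only for the example.

```python
def grad(w):
    """Gradient of the toy loss J(w) = w**2 (an assumption for illustration)."""
    return 2.0 * w

def momentum_descent(w, alpha=0.1, beta=0.9, steps=300):
    """Gradient descent where each step follows a moving average of gradients."""
    v = 0.0  # moving average of past gradients, starts at zero
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(w)  # v_t
        w = w - alpha * v                    # w_{t+1}
    return w

w_final = momentum_descent(5.0)
print(w_final)  # should end up very close to the minimum at w = 0
```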

RMSprop:

1. **Adaptive Learning Rates:** Similar to Adam, RMSprop adapts the learning rates for each parameter individually. It does so by dividing the learning rate for a parameter by the square root of the exponential moving average of the squared gradients.

   $ v_t = \beta \cdot v_{t-1} + (1 - \beta) \cdot (\nabla J(w_t))^2 $\
   $ w_{t+1} = w_t - \frac{\alpha}{\sqrt{v_t + \epsilon}} \cdot \nabla J(w_t) $

   Where:
   - $ \alpha $ is the learning rate.
   - $ \beta $ is a decay term (typically close to 1, e.g., 0.9).
   - $ \epsilon $ is a small constant added for numerical stability.
   - $ \nabla J(w_t) $ is the gradient of the loss with respect to the parameters at the current iteration. 

2. **Benefits:** RMSprop normalizes each update by the recent magnitude of that parameter's gradients, which keeps step sizes in a reasonable range when gradients are very large or very small. It is particularly effective when the scale of the gradients varies widely across different parameters or time steps.

3. **Exponential Moving Average:** The use of the exponential moving average for squared gradients helps RMSprop to adapt its learning rates dynamically. It focuses more on recent information and less on historical gradients.

4. **Intuition:** RMSprop can be thought of as adjusting the learning rates based on the historical magnitudes of the gradients. It scales down the learning rates for parameters with large and frequent updates, allowing for more stable and efficient convergence.
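
The RMSprop rule above can be sketched the same way. Again this assumes the toy loss $J(w) = w^2$ and uses hypothetical names (`grad`, `rmsprop_first_step`, `rmsprop_descent`) and example hyperparameters. It follows the formula as written in this notebook, with $\epsilon$ inside the square root.

```python
import math

def grad(w):
    """Gradient of the toy loss J(w) = w**2 (an assumption for illustration)."""
    return 2.0 * w

def rmsprop_first_step(g, alpha=0.01, beta=0.9, eps=1e-8):
    """Size of the very first update for a gradient g, starting from v = 0."""
    v = (1 - beta) * g * g
    return alpha / math.sqrt(v + eps) * g

def rmsprop_descent(w, alpha=0.01, beta=0.9, eps=1e-8, steps=1000):
    """Gradient descent with a per-parameter step scaled by recent gradient size."""
    v = 0.0  # moving average of squared gradients
    for _ in range(steps):
        g = grad(w)
        v = beta * v + (1 - beta) * g * g      # v_t
        w = w - alpha / math.sqrt(v + eps) * g  # w_{t+1}
    return w

# Gradients that differ by a factor of 10,000 produce almost the same step
small_step = rmsprop_first_step(0.01)
large_step = rmsprop_first_step(100.0)
print(small_step, large_step)

w_final = rmsprop_descent(5.0)
print(w_final)  # ends near the minimum at w = 0
```

With a constant learning rate the iterate does not settle exactly at zero; it hovers within a small distance of the minimum that is on the order of `alpha`.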

RMSprop is a key component in the family of adaptive learning rate optimization algorithms and is widely used in practice for training deep neural networks.
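
Putting the two ideas together gives an Adam-style update: a momentum average of the gradients divided by an RMSprop-style average of their squares. The sketch below assumes the toy loss $J(w) = w^2$ and uses hypothetical names and example hyperparameters. It includes Adam's bias correction, which the formulas above do not cover, and omits the weight decay that `AdamW` adds, so treat it as an illustration and not as PyTorch's implementation.

```python
import math

def grad(w):
    """Gradient of the toy loss J(w) = w**2 (an assumption for illustration)."""
    return 2.0 * w

def adam_descent(w, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam-style descent on a single parameter."""
    m = 0.0  # first moment: momentum-style average of gradients
    v = 0.0  # second moment: RMSprop-style average of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Both averages start at zero, so early values are biased toward zero;
        # dividing by (1 - beta**t) corrects for that.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_final = adam_descent(1.0)
print(w_final)  # should end up very close to the minimum at w = 0
```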

In [16]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 5.0593, val loss 5.0580
step 300: train loss 2.8387, val loss 3.7905
step 600: train loss 2.5027, val loss 3.7400
step 900: train loss 2.4345, val loss 3.7978
step 1200: train loss 2.4243, val loss 3.8851
step 1500: train loss 2.4086, val loss 3.9322
step 1800: train loss 2.3991, val loss 3.9723
step 2100: train loss 2.3997, val loss 4.0562
step 2400: train loss 2.3889, val loss 4.0725
step 2700: train loss 2.3995, val loss 4.1199


## Generate From the Model

In [17]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))


Son. he gotlam rechealkllil t, toulenu'serotan. Asave pathillleler o hevenve th ces t – y, wo I sebe’le (it. caunourgey’lisenes t ce Sonoby. menite soubancthilaugo ry't t knkine wabr Thig…$515%.


Angeathatrith, g ybauthacondoive ate tSo booyin d binny. waro pend fousybut y, I’s t I bond he. The 
NNNo e.


No ithioe Alernc enot, tred nve wa anoully oi!
Joud.
Thee "Oflayoupe asthind waindintine I’t batmeresoutht.


Whal d. me gigngay he angos. t gate be jowoillmeritimplerewe wis the wicat w, thim
