# Project: A Bigram Language Model

Import of required libraries for building a bigram language model using PyTorch.

In [9]:
import torch
import torch.nn as nn
from torch.nn import functional as F

Hyperparameter definitions for the bigram language model

In [10]:
# --- Hyperparameters ---
batch_size = 32 # How many independent sequences to process in parallel
block_size = 8  # Maximum length of context (irrelevant for Bigram, but important for later)
max_iters = 3000 # number of training iterations
eval_interval = 300 # how often to evaluate the loss
learning_rate = 1e-2 # learning rate for optimizer
eval_iters = 200 # number of iterations for loss estimation

# device configuration
device = 'mps' if torch.backends.mps.is_available() else 'cpu' # use Apple's Metal (MPS) backend if available, else CPU
print(f"Using device: {device}")


Using device: mps


## 1. Load data and tokenization

Loading data from a text file and creating character-level tokenization

**Tokenization & Encoding**
We use character-level tokenization here: every character is mapped to an integer ID (with the vocabulary built below, for example, a -> 39 and b -> 40).

More modern models such as GPT-4 use sub-word tokenization (for example via the tiktoken library), where frequent word fragments (e.g. "ing" or "Pre") form a single token. For our purposes, character-level tokenization is entirely sufficient and keeps the code leaner.

In [11]:
DATAPATH = 'data/tinyshakespeare.txt'

In [12]:
# !curl -o {DATAPATH} https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [13]:
# set random seed for reproducibility
torch.manual_seed(42)

# Load text data
with open(DATAPATH, 'r', encoding='utf-8') as f:
    text = f.read()
    print("Text data loaded.")
    print(f"Length of dataset in characters: {len(text)}")

Text data loaded.
Length of dataset in characters: 1115394


Sorting and Mapping of characters to indices and vice versa

In [14]:
# Sorting and Mapping of characters to indices and vice versa
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("All unique characters:", ''.join(chars))
print(f"Vocab size: {vocab_size}")

# Mapping: characters to integers (tokenization)
stoi = { ch:i for i,ch in enumerate(chars) } # string to int
itos = { i:ch for i,ch in enumerate(chars) } # int to string
encode = lambda s: [stoi[c] for c in s] # Encoder: string -> list of ints
decode = lambda l: ''.join([itos[i] for i in l]) # Decoder: list of ints -> string
print(encode("hello world"))
print(decode(encode("hello world")))

All unique characters: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocab size: 65
[46, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42]
hello world


Data preparation: splitting into training and validation sets

In [15]:
# Train/validation split
data = torch.tensor(encode(text), dtype=torch.long) # Encode the entire text as a 1-D tensor of token IDs
# Split into training and validation data
n = int(0.9*len(data)) # First 90% for training, remaining 10% for validation
train_data = data[:n]
val_data = data[n:]

Auxiliary functions for data batching and loss estimation

In [16]:
# --- Helper function: Data batching ---
def get_batch(split):
    # Generates a small batch of inputs (x) and targets (y)
    data = train_data if split == 'train' else val_data
    # We choose random starting points in the text
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # x is the context, y is the target (the next character)
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device) # Move to the selected device
    return x, y

# --- Helper function: Loss estimation (without backprop) ---
@torch.no_grad()
def estimate_loss(model):
    out = {}
    model.eval() # set model to evaluation mode
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # set model back to training mode
    return out
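
To make the relationship between `x` and `y` in `get_batch` concrete, here is a small sketch in plain Python (no PyTorch, so it runs without extra dependencies). The token IDs are made-up placeholders, not values taken from the dataset. It assumes the same `block_size` of 8 as above and shows that `y` is simply `x` shifted by one position, so a single window yields `block_size` training examples.

```python
# Toy stand-in for the encoded dataset (hypothetical token IDs).
data = [18, 47, 56, 57, 58, 1, 15, 47, 58]
block_size = 8  # same value as in the hyperparameter cell

x = data[:block_size]        # inputs
y = data[1:block_size + 1]   # targets: the inputs shifted by one position

# Each position t yields one training example: context x[:t+1] -> target y[t].
pairs = [(x[:t + 1], y[t]) for t in range(block_size)]
for context, target in pairs:
    # A bigram model only uses the last token of the context, context[-1].
    print(f"context {context} (last token {context[-1]}) -> target {target}")
```

For the bigram model only the last token of each context matters, which is why `block_size` has no effect on what the model can learn here.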

## 2. Bigram Language Model Definition

### What is a Bigram Model?

A bigram model is a simple statistical language model that predicts the next token from the single token that precedes it. In the classic formulation the tokens are words: the model counts pairs of consecutive words (bigrams) in a training corpus to learn which words tend to follow others, relying on the Markov assumption that only the immediately preceding token matters. This lets it generate new text that mimics the local patterns of the original, although it has no notion of meaning and never looks further back than one token. In this notebook the tokens are single characters, and the table of next-character scores is learned by gradient descent instead of by counting.

### How it works

1. **Data Training**: The model is trained on a large body of text (a corpus). 
2. **Counting Bigrams**: It counts how many times each pair of consecutive words appears in the corpus. 
3. **Probability Calculation**: It calculates the probability of a word appearing given the previous word. For example, the probability of "cat" following "the" is the count of "the cat" divided by the count of "the". 
4. **Text Generation**: When generating new text, it uses these probabilities to predict the next word. For instance, after generating the word "the," it will look at the learned probabilities to decide which word is most likely to come next. 
5. **Simplification**: It operates under the assumption that a word's probability only depends on the immediately preceding word, ignoring any words that came before that. 
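
The counting procedure described in the steps above can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions: the function name `bigram_probabilities` and the toy corpus are made up for this example, and the notebook itself learns its table by gradient descent instead of counting.

```python
from collections import Counter, defaultdict

def bigram_probabilities(tokens):
    """Estimate P(next | prev) from raw bigram counts."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count each token only where it is followed by another token,
    # so that every row of probabilities sums to 1.
    prev_counts = Counter(tokens[:-1])
    probs = defaultdict(dict)
    for (prev, nxt), count in pair_counts.items():
        probs[prev][nxt] = count / prev_counts[prev]
    return probs

# Hypothetical toy corpus; words are used as tokens here.
corpus = "the cat sat on the mat the cat ran".split()
probs = bigram_probabilities(corpus)
print(probs["the"])  # "cat" follows "the" in 2 of 3 cases, "mat" in 1 of 3
```

Passing `list(text)` instead of a list of words would give the character-level variant that matches the tokenization used in this notebook.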

### Key characteristics

1. **Simple baseline**: In its count-based form it needs no neural network at all, which makes it a useful baseline for more complex language models.
2. **Word dependency**: It is better than models that only consider individual words (unigrams) because it captures some local word dependencies. 
3. **Limited context**: Its main limitation is its narrow view, as it only considers one preceding word and has no memory of further context. 

In [17]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Each token directly reads off the logits for the next token from a lookup table
        # Embedding Dimension = Vocab Size, since we have no Hidden Layers
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both (B,T) tensors of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            # Reshape for loss computation
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B,T) array of indices in the current context
        for _ in range(max_new_tokens):
            # Get the predictions
            logits, _ = self(idx)
            # Focus only on the last time step
            logits = logits[:, -1, :] # (B,C)
            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B,C)
            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B,1)
            # Append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B,T+1)
        return idx
    

**Note on embeddings (the nn.Embedding layer is a lookup table):**

In this simple model, the embedding table does not yet act as a semantic vector space (as in "King - Man + Woman = Queen"). It is a plain lookup table: when the model sees the letter "a", it reads row "a" of the table. That row holds one unnormalized score (logit) for every character that could come next, and softmax turns these scores into probabilities.
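
The lookup described in this note can be imitated in plain Python. The sketch below is only an illustration: the three-character vocabulary and the numbers in `logit_table` are invented, and a nested list stands in for the weight matrix of `nn.Embedding`.

```python
import math

vocab = ["a", "b", "c"]  # hypothetical 3-character vocabulary
stoi = {ch: i for i, ch in enumerate(vocab)}

# Row i holds the logits for the character that follows vocab[i].
# The values are made up; a real model learns them during training.
logit_table = [
    [0.0, 2.0, -1.0],   # after "a"
    [1.0, 0.0, 1.0],    # after "b"
    [3.0, -2.0, 0.0],   # after "c"
]

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

row = logit_table[stoi["a"]]  # the "embedding lookup" is plain row selection
probs = softmax(row)
for ch, p in zip(vocab, probs):
    print(f"P({ch!r} after 'a') = {p:.3f}")
```

Because the row width equals the vocabulary size, no further layer is needed to turn the looked-up vector into next-character scores.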

## 3. Initialization and Training of Bigram Language Model

### Model initialization

Initialize the model and move to device

In [18]:
# initialize the model and move to device
model = BigramLanguageModel(vocab_size)
model = model.to(device) # Move model to the selected device

# Optimizer (AdamW is standard for LLMs)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

### Training Loop
Training the bigram language model using mini-batch gradient descent and periodic loss estimation.

In [19]:
print("Start training ...")
for step in range(max_iters):
    # Every eval_interval steps, estimate the loss on the train and val sets
    if step % eval_interval == 0:
        losses = estimate_loss(model)
        print(f"Step {step}: Train Loss {losses['train']:.4f}, Val Loss {losses['val']:.4f}")

    # Get a batch of data
    xb, yb = get_batch('train')

    # Forward pass
    logits, loss = model(xb, yb)

    # Backward pass and optimization step
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

Start training ...
Step 0: Train Loss 4.7627, Val Loss 4.7633
Step 300: Train Loss 2.8415, Val Loss 2.8635
Step 600: Train Loss 2.5515, Val Loss 2.5881
Step 900: Train Loss 2.4970, Val Loss 2.5305
Step 1200: Train Loss 2.4908, Val Loss 2.5049
Step 1500: Train Loss 2.4716, Val Loss 2.5028
Step 1800: Train Loss 2.4557, Val Loss 2.4971
Step 2100: Train Loss 2.4725, Val Loss 2.4942
Step 2400: Train Loss 2.4637, Val Loss 2.4956
Step 2700: Train Loss 2.4644, Val Loss 2.4894
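
As a sanity check on these numbers, it helps to know what loss a model with no knowledge would produce. The sketch below reimplements cross-entropy in plain Python (it is not the code path of `F.cross_entropy`, only the same formula for a single example) and evaluates it for a uniform guess over the 65 characters, which gives ln 65, roughly 4.17. The step-0 loss of 4.76 is somewhat higher, presumably because the embedding weights start as random values and the initial predictions are therefore not uniform.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

vocab_size = 65  # from the vocabulary cell above

# Equal logits mean equal probability for every character.
uniform_loss = cross_entropy([0.0] * vocab_size, target=0)
print(f"loss of a uniform guess: {uniform_loss:.4f}")  # equals ln(65)

final_val_loss = 2.4894  # validation loss at step 2700 in the log above
perplexity = math.exp(final_val_loss)
print(f"perplexity at step 2700: {perplexity:.2f}")
```

The perplexity can be read as the number of equally likely characters the model is effectively choosing between at each step.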


## 4. Save the trained model weights

In [20]:
import os

model_path = "models/bigram_shakespeare.pt"
os.makedirs(os.path.dirname(model_path), exist_ok=True) # create the target directory if it is missing
torch.save(model.state_dict(), model_path)
print(f"\nModel weights saved to: {model_path}")


Model weights saved to: models/bigram_shakespeare.pt


## 5. Deployment and Text Generation

In [21]:
print("Generation of text:")
context = torch.zeros((1, 1), dtype=torch.long, device=device) # start with a single token 0 (the newline character in this vocabulary)
generated_indices = model.generate(context, max_new_tokens=500)[0].tolist()
print(decode(generated_indices))

Generation of text:

LLELATI k coput bainuthas I' chie athotorde us m wh.
QUCO:
Fit hye my n wanofeaver nd blkerd FReps to mas or,


US:
Hof?
Manoorer h mene be e llpueangbavyoy, frmact te Nonthaixt frel amanl;
Thin CIUSH:

KENCI t me t wh,
ARS:
LAliavere t,
ING herer.

JUSThevin pine lir h ss:
ABethe, s wce misso tayo sourlimede agn ant f whatithis monorcupr t wis io theas yow, bes, the atssen
Thaeadead by, he, whe re,
arend d ha LAnt s CIAnofave, r ough e ss bu thais!

De unime itsery;

ouncl t wise be VO, wimame 
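
The `generate` method draws each new character at random from the predicted distribution instead of always picking the most likely one. The following plain-Python sketch mirrors that loop under assumptions: the probability table is invented, and `random.choices` stands in for `torch.multinomial`.

```python
import random

# Hypothetical next-character probabilities; each row sums to 1.
next_char_probs = {
    "a": {"a": 0.1, "b": 0.6, "c": 0.3},
    "b": {"a": 0.5, "b": 0.1, "c": 0.4},
    "c": {"a": 0.7, "b": 0.2, "c": 0.1},
}

def generate(start, max_new_tokens, rng):
    """Sample a sequence one character at a time from the bigram table."""
    out = [start]
    for _ in range(max_new_tokens):
        row = next_char_probs[out[-1]]  # only the last character matters
        nxt = rng.choices(list(row), weights=list(row.values()), k=1)[0]
        out.append(nxt)
    return "".join(out)

rng = random.Random(42)  # seeded so that repeated runs give the same text
sample = generate("a", 20, rng)
print(sample)
```

Sampling is what makes the generated Shakespeare-like text vary from run to run; always taking the highest-probability character would quickly fall into a repeating cycle.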
