---
title: "10.7_llm"
format:
  html: default
toc: false
---


# 10.7 • Large Language Models (LLMs): Mechanics, Limits, and Reliable Use

LLMs are **next-token predictors** trained on large corpora. They *feel* intelligent because language encodes a great deal of structure about the world, but the training objective is **likelihood**, not truth.

### Roadmap
1. Tokenisation and embeddings (how text becomes vectors)
2. Self-attention maths with shapes and masking
3. Training objective, sampling strategies, temperature, top-k/p
4. Why hallucinations happen; how RAG helps
5. Prompt design, schema-constrained outputs, and evaluation
6. Fine-tuning vs PEFT/LoRA vs RAG: when to use what

We build a **count-based bigram model** (transparent baseline) and a **NumPy attention demo** you can inspect token-by-token.

In [None]:
import numpy as np, pandas as pd
np.random.seed(11088)

## 1) Tokenisation: from text to IDs
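Real LLMs use **BPE**/**SentencePiece** subword tokenisers, which can represent unseen words as sequences of known pieces. We’ll use a trivial whitespace tokeniser for transparency; any word outside the corpus is simply out of vocabulary.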

In [None]:
corpus = [
 "patient reports chest pain and shortness of breath",
 "no chest pain today patient feels better",
 "shortness of breath worsened during exercise",
 "patient denies pain but notes dizziness",
 "exercise improved breath control patient better"
]
tokens = [w for line in corpus for w in line.split()]
vocab = sorted(set(tokens)); V=len(vocab)
stoi={w:i for i,w in enumerate(vocab)}; itos={i:w for w,i in stoi.items()}
ids = [stoi[w] for w in tokens]
V, list(vocab)[:10]
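
To give a feel for what a subword tokeniser does differently, here is a minimal sketch of BPE-style merging: repeatedly fuse the most frequent adjacent symbol pair. It is a toy under stated assumptions. The `words` list, `num_merges`, and both helper names are illustrative and do not come from any tokeniser library; production tokenisers add byte-level fallback, pre-tokenisation rules, and far larger merge tables.

```python
from collections import Counter

def most_frequent_pair(seqs):
    """Count adjacent symbol pairs across all sequences; return the commonest."""
    pairs = Counter()
    for s in seqs:
        pairs.update(zip(s, s[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with the merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

words = ["breath", "breathing", "breathless"]  # illustrative input, not the corpus above
seqs = [list(w) for w in words]                # start from single characters
merges = []
num_merges = 5                                 # illustrative merge budget
for _ in range(num_merges):
    pair = most_frequent_pair(seqs)
    if pair is None:
        break
    merges.append(pair)
    seqs = [merge_pair(s, pair) for s in seqs]

print(merges)  # learned merge rules, in order
print(seqs)    # each word as a list of subword pieces
```

After a handful of merges the shared stem becomes a single piece, while the rarer suffixes stay split into characters. That is the mechanism that lets a subword vocabulary cover words it has never seen whole.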

## 2) Bigram language model (counts → probabilities)
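This is the simplest LM: \(P(w_t|w_{t-1})\) estimated from counts with add-one smoothing. It shows that **local fluency appears** even without neural nets. Because the corpus lines are flattened into one token stream, a few bigrams span line boundaries (for example `breath no`).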

In [None]:
counts = np.zeros((V,V), dtype=np.int64)
for a,b in zip(ids, ids[1:]): counts[a,b]+=1
probs = (counts + 1) / (counts.sum(axis=1, keepdims=True) + V)  # add-one smoothing

def sample_bigram(prompt, n=10):
    words = prompt.split(); cur = stoi.get(words[-1], None)
    if cur is None: return prompt
    out = words.copy()
    for _ in range(n):
        nxt = np.random.choice(V, p=probs[cur])
        out.append(itos[nxt]); cur=nxt
    return ' '.join(out)

sample_bigram('patient', 12)
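
Sampling hides what the table actually contains, so it can help to read a row directly. The sketch below rebuilds the same smoothed bigram table so that it runs on its own; `top_next` is a helper name introduced here for illustration.

```python
import numpy as np

corpus = [
    "patient reports chest pain and shortness of breath",
    "no chest pain today patient feels better",
    "shortness of breath worsened during exercise",
    "patient denies pain but notes dizziness",
    "exercise improved breath control patient better",
]
tokens = [w for line in corpus for w in line.split()]
vocab = sorted(set(tokens))
V = len(vocab)
stoi = {w: i for i, w in enumerate(vocab)}
ids = [stoi[w] for w in tokens]

counts = np.zeros((V, V), dtype=np.int64)
for a, b in zip(ids, ids[1:]):
    counts[a, b] += 1
probs = (counts + 1) / (counts.sum(axis=1, keepdims=True) + V)  # add-one smoothing

def top_next(word, k=3):
    """Return the k most probable next words after `word`, with probabilities."""
    row = probs[stoi[word]]
    best = np.argsort(-row)[:k]
    return [(vocab[i], round(float(row[i]), 3)) for i in best]

print(top_next("chest"))
print(top_next("shortness"))
# Every row is a valid distribution, and unseen pairs keep non-zero mass.
print(np.allclose(probs.sum(axis=1), 1.0), probs.min() > 0)
```

With so little data the smoothing term dominates: a pair seen twice is only three times as likely as a pair never seen at all. That is why the samples drift quickly.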

## 3) Self-attention: Query/Key/Value with a causal mask
For a sequence matrix \(X\) (T×d), the model learns three projections:
- \(Q = X W_Q\), \(K = X W_K\), \(V = X W_V\)
- Attention weights \(A = \text{softmax}(QK^\top/\sqrt{d_k} + \text{mask})\)
- Output \(Z = AV\)

**Causal mask** ensures token *t* cannot attend to future tokens (>t).

In [None]:
def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return np.exp(x)/np.exp(x).sum(axis=-1, keepdims=True)

d_model=12; d_k=12
E = np.random.normal(0,0.4,(V,d_model))  # token embeddings
Wq = np.random.normal(0,0.3,(d_model,d_k))
Wk = np.random.normal(0,0.3,(d_model,d_k))
Wv = np.random.normal(0,0.3,(d_model,d_k))

seq = "patient reports chest pain".split()
X = E[[stoi[w] for w in seq]]  # T x d_model
Q = X @ Wq; K = X @ Wk; Vv = X @ Wv
scores = (Q @ K.T) / np.sqrt(d_k)  # T x T
T = scores.shape[0]
mask = np.triu(np.ones((T,T)), k=1) * 1e9  # large penalty on future positions (above the diagonal)
attn = softmax(scores - mask)  # subtracting the penalty drives future weights to ~0
Z = attn @ Vv
pd.DataFrame(attn, index=seq, columns=seq)

Inspect the matrix: row *t* shows which earlier tokens the model ‘looks at’ when producing the representation for position *t*.
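
One way to convince yourself the mask does its job is to perturb the last token and confirm that earlier outputs do not move. The following is a self-contained sketch with random inputs rather than the embeddings above; `causal_attention` is a name chosen here, and it uses a single head with no learned training.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention with a causal mask."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = (Q @ K.T) / np.sqrt(K.shape[-1])
    T = scores.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -1e9, scores)             # block attention to the future
    A = softmax(scores)
    return A @ V, A

T, d = 4, 12
X = rng.normal(0, 0.4, (T, d))
Wq, Wk, Wv = (rng.normal(0, 0.3, (d, d)) for _ in range(3))

Z, A = causal_attention(X, Wq, Wk, Wv)

# Perturb only the last token and recompute.
X2 = X.copy()
X2[-1] += 1.0
Z2, _ = causal_attention(X2, Wq, Wk, Wv)

print(np.allclose(A.sum(axis=1), 1.0))    # each row is a distribution
print(np.allclose(np.triu(A, k=1), 0.0))  # no weight on future positions
print(np.allclose(Z[:-1], Z2[:-1]))       # earlier outputs are unchanged
print(np.allclose(Z[-1], Z2[-1]))         # the last output does change
```

This independence from future tokens is what allows training on all positions of a sequence in parallel while still generating one token at a time at inference.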

### Positional encodings (why order matters)
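Self-attention on its own has no notion of token order: permuting the input rows simply permutes the output rows (it is permutation-equivariant). Models therefore add sinusoidal or learned positional embeddings to encode order.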
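
A minimal sketch of the sinusoidal variant follows. The function name and the `base=10000.0` default are choices made here (that base is the commonly used value), and the code assumes `d_model` is even. Learned positional embeddings would instead be one more trainable matrix indexed by position.

```python
import numpy as np

def sinusoidal_positions(T, d_model, base=10000.0):
    """Return a (T, d_model) matrix of sinusoidal positional encodings.

    Even columns hold sines and odd columns hold cosines, with wavelengths
    growing geometrically with the column index. Assumes d_model is even.
    """
    pos = np.arange(T)[:, None]            # (T, 1) positions
    i = np.arange(0, d_model, 2)[None, :]  # (1, d_model/2) even column indices
    angles = pos / base ** (i / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(T=4, d_model=12)
print(pe.shape)
print(np.round(pe[:, :4], 3))  # first few dimensions for positions 0..3
```

The encodings are added to the token embeddings (for example `X + pe`) before the attention layer, so two occurrences of the same word at different positions get different input vectors.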

## 4) Training objective & sampling strategies
**Objective**: minimise cross-entropy for next-token prediction.
- Training uses teacher forcing (each position is conditioned on the true prefix); inference is autoregressive (the model conditions on its own previous outputs).
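
The objective can be written out in a few lines. This is a sketch only: the token IDs and the uniform logits are hypothetical placeholders for a real model's inputs and outputs, and `next_token_cross_entropy` is a name introduced here.

```python
import numpy as np

def next_token_cross_entropy(logits, targets):
    """Mean negative log-likelihood of `targets` under softmax(logits).

    logits: (T, V) scores per position; targets: (T,) true next-token IDs.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

ids = np.array([3, 1, 4, 1, 5])      # hypothetical token IDs
inputs, targets = ids[:-1], ids[1:]  # teacher forcing: the target is the next true token

V = 6
uniform_logits = np.zeros((len(inputs), V))  # a model that knows nothing
loss = next_token_cross_entropy(uniform_logits, targets)
print(loss, np.log(V))  # a uniform model scores log(V)
print(np.exp(loss))     # perplexity: the effective number of equally likely choices
```

Nothing in this loss rewards being correct about the world. It only rewards assigning high probability to whatever token came next in the training text.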

**Sampling**:
- Greedy (argmax): deterministic but brittle.
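- Temperature: divides the logits before the softmax; T<1 sharpens the distribution (conservative), T>1 flattens it (diverse).
- Top-k: restrict sampling to the k highest-probability tokens.
- Nucleus (top-p): restrict sampling to the smallest set of tokens whose cumulative probability reaches p.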

In [None]:
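def sample_from_logits(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one token ID after temperature, top-k and top-p filtering."""
    z = logits / max(temperature, 1e-6)  # guard against division by zero
    # top-k: keep the k largest logits, push the rest to a very negative value
    if top_k is not None:
        k = max(1, min(top_k, len(z)))  # clamp so argpartition never fails
        idx = np.argpartition(z, -k)[-k:]
        masked = np.full_like(z, -1e9); masked[idx] = z[idx]; z = masked
    # top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p
    if top_p is not None:
        order = np.argsort(-z)
        cum = np.cumsum(softmax(z)[order])
        cutoff = np.searchsorted(cum, top_p) + 1  # include the token that crosses top_p
        keep = order[:cutoff]
        masked = np.full_like(z, -1e9); masked[keep] = z[keep]; z = masked
    p = softmax(z)
    return np.random.choice(len(z), p=p)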

# Demo with bigram probs as pseudo-logits
def demo_sampling(start='patient', steps=10, temperature=0.8, top_k=5):
    cur = stoi.get(start, 0); out=[start]
    for _ in range(steps):
        logits = np.log(probs[cur] + 1e-12)
        nxt = sample_from_logits(logits, temperature=temperature, top_k=top_k)
        out.append(itos[nxt]); cur=nxt
    return ' '.join(out)

demo_sampling('patient', 12, temperature=0.8, top_k=5)
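
To see what temperature does to a distribution before any sampling happens, it helps to print the probabilities directly. The logits below are hypothetical, and `entropy` is a small helper defined here.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats; higher means a flatter distribution."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

logits = np.array([2.0, 1.0, 0.5, 0.0, -1.0])  # hypothetical logits
for temp in (0.5, 1.0, 2.0):
    p = softmax(logits / temp)
    print(temp, np.round(p, 3), round(entropy(p), 3))
```

Lower temperatures concentrate mass on the top token and higher ones spread it out. Top-k and top-p then decide how much of the tail is allowed to be sampled at all.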

## 5) Why hallucinations happen — and how to mitigate
- Objective ≠ truth: it’s likelihood. The model will confidently produce plausible falsehoods when training data patterns point that way.
- **Mitigations**:
  - Retrieval-Augmented Generation (**RAG**): fetch relevant passages; condition the prompt.
  - Schema-constrained outputs: ask for JSON with required fields; validate programmatically.
  - Calibrate expectations: use confidence cues (self-consistency, citations), or instruct refusal when uncertain.
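
Programmatic validation can be as small as the sketch below, which uses only the standard library. The `REQUIRED` fields and the example strings are hypothetical; a real project would more likely describe the contract with a JSON Schema or a typed model.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED = {"trial_id": str, "n_participants": int, "primary_outcome": str}

def validate_output(raw):
    """Parse model output as JSON and check required fields and types.

    Returns (record, errors); record is None when validation fails.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]
    errors = []
    for field, typ in REQUIRED.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly
        elif isinstance(data[field], bool) or not isinstance(data[field], typ):
            errors.append(f"wrong type for {field}: expected {typ.__name__}")
    return (data if not errors else None), errors

good = '{"trial_id": "T1", "n_participants": 120, "primary_outcome": "LDL change"}'
bad = '{"trial_id": "T1", "n_participants": "many"}'
print(validate_output(good))
print(validate_output(bad))
print(validate_output("Sure! Here is the JSON you asked for..."))
```

A failed validation is a useful signal in its own right: the caller can retry with the error list appended to the prompt, or fall back to a refusal.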

### RAG architecture (conceptual)
1) Ingest docs → chunk → embed → vector store
2) At query time: embed query → retrieve top-k chunks
3) Compose prompt: system + user + retrieved context
4) Generate; optionally cite chunk IDs.
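
The four steps can be sketched end to end without any external service. Everything here is a stand-in: the `docs` chunks are hypothetical, `embed` is a toy bag-of-words vector rather than a learned encoder, and the "vector store" is a NumPy matrix. Step 4 is left as a printed prompt, since calling a model is outside the scope of this sketch.

```python
import numpy as np

docs = {  # hypothetical chunk store: chunk ID -> text
    "c1": "chest pain worsened during exercise",
    "c2": "patient reports dizziness after standing",
    "c3": "breath control improved with training",
}

def embed(text, vocab):
    """Toy bag-of-words embedding, L2-normalised (stand-in for a learned encoder)."""
    words = text.lower().split()
    v = np.array([words.count(w) for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

# Step 1: ingest, embed, and store.
vocab = sorted({w for t in docs.values() for w in t.lower().split()})
chunk_ids = list(docs)
index = np.stack([embed(docs[c], vocab) for c in chunk_ids])

def retrieve(query, k=2):
    """Step 2: return the k chunk IDs with highest cosine similarity to the query."""
    scores = index @ embed(query, vocab)
    top = np.argsort(-scores)[:k]
    return [(chunk_ids[i], float(scores[i])) for i in top]

def compose_prompt(query, hits):
    """Step 3: system instruction + retrieved context + user question."""
    context = "\n".join(f"[{cid}] {docs[cid]}" for cid, _ in hits)
    return (
        "System: Answer only from the context and cite chunk IDs.\n"
        f"Context:\n{context}\n"
        f"User: {query}"
    )

query = "does exercise affect chest pain"
hits = retrieve(query)
print(hits)
print(compose_prompt(query, hits))
```

Putting the chunk IDs inside the context is what makes citations checkable afterwards: any ID in the answer that is not in `hits` is a fabricated source.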

**Exercise 1**: Draft a retrieval prompt template for summarising a nutrition RCT; include strict output schema.

## 6) Prompt design and guardrails
- Give **role**, **task**, **constraints**, **examples** (few-shot), **format**.
- Ask for chain-of-thought *only if you need it for pedagogy*; otherwise request **concise reasoning** to reduce verbosity.
- For grading tasks, supply a rubric and require JSON with pass/fail + justifications.

**Exercise 2**: Write two prompts for the same task (extract trial arms from a CONSORT abstract): one free-text, one JSON-schema; compare failure modes.

## 7) Fine-tuning vs PEFT/LoRA vs RAG
- **Prompting**: fastest; no data or training; limited to model’s knowledge.
- **RAG**: best first step to ground responses on your corpus; cheap to update.
- **Full fine-tuning**: costly; risk of forgetting; requires eval harness.
- **PEFT/LoRA**: parameter-efficient; adapts behaviour by training a small set of added weights while the base model stays frozen.
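
The idea behind LoRA fits in a few lines of NumPy: keep the pretrained weight `W` frozen and learn a low-rank update `B @ A` scaled by `alpha / r`. The sizes below are illustrative, and no training loop is shown. This is a sketch of the forward pass and the parameter count only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8    # illustrative sizes

W = rng.normal(0, 0.02, (d_out, d_in))  # frozen pretrained weight
A = rng.normal(0, 0.01, (r, d_in))      # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init so the update starts at 0

def lora_forward(x, W, A, B, alpha, r):
    """Compute x @ (W + (alpha / r) * B @ A).T without forming the full update."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d_in))
print(np.allclose(lora_forward(x, W, A, B, alpha, r), x @ W.T))  # identical at init
print(W.size, A.size + B.size)  # full vs trainable parameter counts
```

Because only `A` and `B` are updated, the adapter is small enough to store per task and swap in and out of one shared base model.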

**Rule of thumb**: If the task is **formatting/grounding**, use RAG + prompts. If it’s **style or domain nuance**, consider PEFT. For **new capabilities**, full FT (plus careful eval).

## 8) Evaluating LLM outputs (you must measure)
- **Automatic**: exact match for tasks with a single correct answer; ROUGE for summarisation; BLEU for translation; factuality checks against a gold knowledge base.
- **Model-graded**: use a strong judge with a rubric; spot-check with humans.
- **Programmatic**: JSON schema validation, unit tests for tool-use traces.
- **Human**: double-blind rating for helpfulness, honesty, and safety.
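
A few of the automatic and programmatic checks are easy to sketch with the standard library. The metric definitions below are simplified, the example pairs are hypothetical, and `cites_allowed_source` assumes citations are written as `[chunk_id]`.

```python
import re
from collections import Counter

def normalise(text):
    """Lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())

def exact_match(pred, gold):
    return normalise(pred) == normalise(gold)

def token_f1(pred, gold):
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    p, g = normalise(pred).split(), normalise(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def cites_allowed_source(pred, allowed_ids):
    """True if at least one [id] is cited and every cited [id] is allowed."""
    cited = set(re.findall(r"\[(\w+)\]", pred))
    return bool(cited) and cited <= set(allowed_ids)

examples = [  # hypothetical (prediction, gold) pairs
    ("Two arms.", "two arms"),
    ("Three arms were compared", "two arms"),
]
for pred, gold in examples:
    print(exact_match(pred, gold), round(token_f1(pred, gold), 3))
print(cites_allowed_source("LDL fell [c1]", {"c1", "c2"}))
print(cites_allowed_source("LDL fell [c9]", {"c1", "c2"}))
```

Overlap metrics reward shared wording, not correctness: the second prediction earns partial credit while stating the wrong number of arms. That is why they should be paired with factuality or citation checks.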

**Exercise 3**: Design an eval set of 20 nutrition Qs with gold answers + citations. Define pass criteria (e.g., exact cite present, no hallucinated trial).

## 9) Safety and responsible use
- Privacy: never paste secrets; mask identifiers during logging.
- Bias: measure subgroup error rates; mitigate with counter-prompts or curated retrieval.
- Transparency: state limits and sources; provide citations where possible.
- Fallbacks: if retrieval is empty or confidence is low, return a **safe refusal** or ask for clarification.
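
The fallback rule can be made explicit in code instead of being left to the prompt. In this sketch the `min_score` threshold, the refusal wording, and `fake_llm` are all placeholders; a sensible threshold depends on the embedding model and should be tuned on an eval set.

```python
REFUSAL = "I can't answer that reliably from the available sources."

def answer_or_refuse(hits, generate, min_score=0.3):
    """Call `generate` only when retrieval returned a sufficiently similar chunk.

    hits: list of (chunk_id, similarity) pairs.
    generate: callable taking the retained hits and returning a string
              (a stand-in for the real LLM call).
    """
    strong = [(cid, s) for cid, s in hits if s >= min_score]
    if not strong:
        return {"status": "refused", "text": REFUSAL, "sources": []}
    return {
        "status": "answered",
        "text": generate(strong),
        "sources": [cid for cid, _ in strong],
    }

def fake_llm(hits):
    """Placeholder for a model call; returns a fixed-format string."""
    return f"Answer grounded in {len(hits)} chunk(s)."

print(answer_or_refuse([], fake_llm))
print(answer_or_refuse([("c1", 0.77), ("c2", 0.05)], fake_llm))
```

Returning a structured result with an explicit `status` keeps refusals visible to downstream code and to logs, which also makes refusal rates measurable.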

## Takeaways
- LLMs are **likelihood machines** with impressive priors; treat outputs as **claims to verify**.
- Retrieval and schema constraints turn them from storytellers into **tools**.
- Evaluation and guardrails are non-negotiable for responsible deployment.