### Overview

| Step | Function to Call | What to Inspect |
| :---- | :---- | :---- |
| **1\. Input** | forwardEmbedding | Shapes, verify vectors are non-zero. |
| **2\. Mask** | makeCausalMask | **Visualize:** Ensure it is a triangle. |
| **3\. Attention** | forwardMHADebug | **Visualize:** attnWeights. Are they focusing on relevant past words? |
| **4\. Norms** | forwardLayerNorm | Check Mean/StdDev (should be \~0 and \~1). |
| **5\. Prediction** | finalLayer | argmax of the last token. Is it a real word? |

### Initialization

In [2]:
{-# LANGUAGE OverloadedStrings #-}

import Torch (DType, Device, Parameter, Tensor, asTensor, defaultOpts, makeIndependent, ones, ones', toDependent, zeros, zeros')
import qualified Torch as Th
import qualified Torch.Autograd as TA
import qualified Torch.Functional as F
import qualified Torch.Functional.Internal as FI
import qualified Torch.NN as NN
import qualified Torch.Optim as Optim
import qualified Torch.Serialize as Serialize
import qualified Data.Set.Ordered as OSet
import qualified Control.Foldl as L
import qualified Data.Text as T
import Data.Maybe (fromMaybe)
import Text.Printf
import IHaskell.Display
import DecoderTransformer

-- 1. Define both files used during training
let trainFile = "rpg-training-tokenized.txt"
let evalFile = "rpg-evaluation-tokenized.txt"

-- 2. Load both files
vocabParts <- traverse buildVocabFromFile [trainFile, evalFile]

-- 3. Merge them AND add the [PAD] token (Crucial step!)
-- This matches the logic in your compiled program
let vocab = L.fold (L.Fold (OSet.|<>) (OSet.singleton "[PAD]") id) vocabParts
let vocabSize = OSet.size vocab

-- 4. Initialize an empty model structure
model <- initModel vocabSize

-- 5. Load the trained weights from the file created by 'stack run'
loadedTensors <- Serialize.load "rpg_model.pt"

-- 6. Hydrate the model: wrap each tensor as a Parameter and swap it into the model
loadedParams <- mapM makeIndependent loadedTensors
let trainedModel = Th.replaceParameters model loadedParams



In [51]:
{-# LANGUAGE RecordWildCards #-}

forwardEmbedding :: TransformerModel -> Th.Tensor -> Th.Tensor
forwardEmbedding TransformerModel {..} input = 
  let w = toDependent embedWeights
      emb = F.embedding False False w paddingIdx input
      
      -- Get current sequence length from input shape [Batch, SeqLen]
      currentSeqLen = Th.size 1 input
      
      -- Unwrap the positional encoding and slice it to the current sequence length,
      -- e.g. [1, 64, 64] -> [1, 5, 64] for a 5-token input
      fullPos = toDependent posEncoding
      slicedPos = Th.sliceDim 1 0 currentSeqLen 1 fullPos
      
  in emb + slicedPos

### **Phase 1: The Input Stage (Shape Shifting)**

Goal: Understand how text becomes geometry.  
Key Concept: \[Batch, Seq\] $\rightarrow$ \[Batch, Seq, EmbedDim\]

1. **Inspect the Vocabulary:** Pick 5 words and find their indices manually.  
2. **Run Embeddings:** Use your forwardEmbedding helper.  
   * **Investigation:** Print the shape. It should be \[1, 5, 64\].  
   * **Sanity Check:** Print embedding\[0\]\[0\] (the vector for the first word) and embedding\[0\]\[1\] (the second). They should be completely different sets of numbers.  
3. **Positional Encoding:**  
   * **Action:** Extract posEncoding from the model.  
   * **Experiment:** Verify that the vector at position 0 differs from the vector at position 1\. Without positional encoding, the model would treat "The dog bit the man" and "The man bit the dog" as the same "bag of words."
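
The "bag of words" claim can be checked on a toy example. The sketch below is deliberately independent of Hasktorch and uses plain lists so it can be verified by hand. The 2-dimensional `embedTable` and `posTable` values are made up for illustration and are not taken from the trained model; `sameBag` is a hypothetical helper that compares two sequences of vectors while ignoring order.

```haskell
import Data.List (sort)
import Data.Maybe (fromMaybe)

type Vec = [Double]

-- Toy 2-dimensional embedding table; the values are made up for illustration.
embedTable :: [(String, Vec)]
embedTable =
  [ ("the", [0.1, 0.2])
  , ("dog", [0.5, -0.3])
  , ("bit", [-0.4, 0.7])
  , ("man", [0.9, 0.1])
  ]

-- Toy positional vectors, one per sequence slot (also made up).
posTable :: [Vec]
posTable = [[0.01 * i, -0.02 * i] | i <- [0 ..]]

-- Token lookup only: [Seq] -> [Seq, EmbedDim]. Unknown words map to zeros.
embed :: [String] -> [Vec]
embed = map (\w -> fromMaybe [0, 0] (lookup w embedTable))

-- Token lookup plus the positional vector of each slot.
embedWithPos :: [String] -> [Vec]
embedWithPos ws = zipWith (zipWith (+)) (embed ws) posTable

-- True when two sequences contain the same vectors, ignoring their order.
sameBag :: [Vec] -> [Vec] -> Bool
sameBag xs ys = sort xs == sort ys

sentenceA, sentenceB :: [String]
sentenceA = words "the dog bit the man"
sentenceB = words "the man bit the dog"
```

With token embeddings alone, `sameBag (embed sentenceA) (embed sentenceB)` is `True`: both sentences contain exactly the same set of vectors. Once positions are added, `sameBag (embedWithPos sentenceA) (embedWithPos sentenceB)` is `False`, because "dog" in slot 1 and "dog" in slot 4 are no longer the same vector.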

In [70]:
{-# LANGUAGE OverloadedStrings #-}

-- Indices of 5 words
let fiveWords = ["if", "else", "endif", "eval", "callp"] :: [T.Text]
let wordIdxs = map (\w -> fromMaybe 0 (OSet.findIndex w vocab)) fiveWords
print wordIdxs

-- embedding shape
let wordsTensor = Th.reshape [1,5] $ asTensor wordIdxs
let weights = toDependent (embedWeights trainedModel)
let emb = F.embedding False False weights 0 wordsTensor 
print (Th.shape emb)

-- "if" embedding
let ifEmb = Th.select 0 0 $ Th.select 0 0 emb
print $ Th.sliceDim 0 0 5 1 ifEmb

-- "else" embedding
let elseEmb = Th.select 0 1 $ Th.select 0 0 emb
print $ Th.sliceDim 0 0 5 1 elseEmb

-- positional encoding
let fullPosEnc = toDependent $ posEncoding trainedModel
let posEnc = Th.sliceDim 1 0 5 1 fullPosEnc
print $ Th.shape posEnc

-- first word in sequence encoding
let pos1 = Th.select 0 0 $ Th.select 0 0 posEnc
print $ Th.sliceDim 0 0 5 1 pos1

-- second word in sequence encoding
let pos2 = Th.select 0 1 $ Th.select 0 0 posEnc
print $ Th.sliceDim 0 0 5 1 pos2

[5,128,13,126,139]

[1,5,64]

Tensor Float [5] [-7.3626e-3,  3.7367e-2, -3.4854e-2, -9.8547e-2, -6.7193e-2]

Tensor Float [5] [ 6.7673e-2, -2.2636e-4, -5.7980e-2,  1.9481e-3, -4.4925e-2]

[1,5,64]

Tensor Float [5] [-2.9523e-2, -1.7858e-2, -1.9255e-3, -1.4800e-2,  4.0148e-3]

Tensor Float [5] [-3.7810e-3, -1.0424e-2,  1.2837e-2,  9.9363e-5, -1.9342e-2]

### **Phase 2: The Heart (Multi-Head Attention)**

Goal: Understand how tokens "talk" to each other.  
Key Concept: $Q, K, V$ and the Causal Mask.

1. **Use forwardMHADebug:** Pass your embeddings from Phase 1 into this function.  
2. **Inspect Q, K, V:**  
   * We project the input \[64\] into three different spaces.  
   * **Action:** Check that $Q$ and $K$ have the same last dimension, so that the product $QK^\top$ is defined.  
3. **The "Raw" Scores (Affinity):**  
   * Modify forwardMHADebug to return scoresRaw (before Softmax).  
   * **Visual:** If you visualize this, the values are unnormalized: a mix of positive and negative numbers whose rows do not sum to 1.  
4. **The Causal Mask (The Triangle):**  
   * **Visual:** Use visualizeAttention on the mask tensor itself. You **must** see a solid triangle of \-inf (or very large negative numbers) in the upper right, strictly above the diagonal.  
   * **Why:** This proves the model cannot "cheat" by looking at future words.  
5. **The Probability Map (Softmax):**  
   * **Visual:** Look at attnWeights. Each row must sum to 1.0. This tells you: *For the word at position 3, how much does it attend to positions 0, 1, 2, and itself?*
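
The steps above (raw scores, causal mask, softmax) can be sketched for a single head with plain lists. This is a minimal illustration, not the code inside forwardMHADebug: it assumes the standard scaled dot-product form $\mathrm{softmax}(QK^\top/\sqrt{d_k} + M)$, and the names `rawScores`, `causalMask`, `attnWeights`, `toyQ`, and `toyK` are hypothetical. Whether the real model scales by $\sqrt{d_k}$ or uses \-inf rather than a large negative constant is an assumption here.

```haskell
type Vec = [Double]
type Matrix = [Vec]

-- Dot product of two equal-length vectors.
dot :: Vec -> Vec -> Double
dot a b = sum (zipWith (*) a b)

-- Raw affinity scores Q K^T / sqrt(d_k): one row per query position.
-- The scaling factor is the standard choice and is assumed here.
rawScores :: Matrix -> Matrix -> Matrix
rawScores q k = [[dot qi kj / scale | kj <- k] | qi <- q]
  where
    dK = case q of
      (row : _) -> length row
      [] -> 1
    scale = sqrt (fromIntegral dK)

-- Additive causal mask: 0 on and below the diagonal, -Infinity above it.
causalMask :: Int -> Matrix
causalMask n =
  [[if j > i then negInf else 0 | j <- [0 .. n - 1]] | i <- [0 .. n - 1]]
  where
    negInf = -1 / 0

-- Numerically stable softmax over one row.
softmax :: Vec -> Vec
softmax xs = map (/ total) exps
  where
    m = maximum xs
    exps = map (\x -> exp (x - m)) xs
    total = sum exps

-- Masked attention weights: softmax of (scores + mask), row by row.
attnWeights :: Matrix -> Matrix -> Matrix
attnWeights q k =
  map softmax (zipWith (zipWith (+)) (rawScores q k) (causalMask (length q)))

-- Toy queries and keys for three positions (made-up values).
toyQ, toyK :: Matrix
toyQ = [[1, 0], [0, 1], [1, 1]]
toyK = [[1, 0], [0, 1], [1, 1]]
```

Evaluating `attnWeights toyQ toyK` gives a lower-triangular matrix: every entry above the diagonal is exactly 0 because `exp` of \-inf is 0, and the first row is `[1.0, 0.0, 0.0]` because position 0 can only attend to itself.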

### **Phase 3: The Body (Feed Forward & Norms)**

Goal: Understand how the model "thinks" about what it just saw.  
Key Concept: Expansion (64 $\rightarrow$ 256) and contraction (256 $\rightarrow$ 64).

1. **Extract the FFN:** let ff \= feedForward firstBlock.  
2. **Run it:** Pass the output of Attention into forwardFF.  
3. **Shape Check:** Notice that the shape **does not change** (\[1, 5, 64\]). The FFN processes every token *independently* (unlike Attention, which mixes them).  
4. **LayerNorm:**  
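   * **Experiment:** Calculate the mean and variance of each token vector (along the last dimension) *before* and *after* forwardLayerNorm.  
   * **Observation:** After norm, each vector should have a mean of roughly 0 and a standard deviation of roughly 1 (before any learned scale and shift), so most values fall in the range of \-2 to \+2. This keeps the math stable.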
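
The before/after comparison for LayerNorm can be reproduced on a single vector with plain lists. This is a sketch of the textbook formula, not of forwardLayerNorm itself: it omits the learned scale and shift, and the use of the population variance and an `eps` of 1e-5 are assumptions that match the common convention rather than anything confirmed by this notebook.

```haskell
type Vec = [Double]

-- Arithmetic mean of a non-empty vector.
mean :: Vec -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Population variance (divide by n), the convention layer norm usually follows.
variance :: Vec -> Double
variance xs = mean [(x - m) ^ (2 :: Int) | x <- xs]
  where
    m = mean xs

-- Layer norm without the learned scale and shift.
-- eps guards against division by zero for constant vectors.
layerNorm :: Double -> Vec -> Vec
layerNorm eps xs = [(x - m) / sqrt (v + eps) | x <- xs]
  where
    m = mean xs
    v = variance xs

-- A made-up token vector with a clearly non-zero mean and large spread.
toyVector :: Vec
toyVector = [2, 4, 6, 8]
```

Here `mean toyVector` is 5.0 and `variance toyVector` is 5.0. After `layerNorm 1e-5 toyVector`, the mean is approximately 0 and the variance is just under 1; it is not exactly 1 because `eps` is added inside the square root.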

### **Phase 4: The Output (Logits to Predictions)**

Goal: Converting abstract vectors back to words.  
Key Concept: The Unembedding Layer (Linear).

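1. **The Final Projection:** Run NN.forward (finalLayer trainedModel) lastState, where lastState is the final hidden state. Use trainedModel rather than the freshly initialized model, otherwise the predictions come from random weights.  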
2. **Shape Change:** \[1, 5, 64\] $\rightarrow$ \[1, 5, VocabSize\].  
3. **The "Next Word" Game:**  
   * Take the vector for the **last** token in the sequence (index 4).  
   * Run F.softmax on it.  
   * **Action:** Find the argmax (the highest probability index). Look that index up in your vocab.  
   * **Verification:** Does the predicted word make grammatical sense following your prompt?
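
The "Next Word" game reduces to three list operations: take the last row of the logits, apply softmax, and look up the argmax in the vocabulary. The sketch below uses plain lists instead of tensors so each step can be followed by hand. `toyVocab` and `toyLogits` are made-up values, and `predictNext` is a hypothetical helper, not a function from DecoderTransformer.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Numerically stable softmax over one row of logits.
softmax :: [Double] -> [Double]
softmax xs = map (/ total) exps
  where
    m = maximum xs
    exps = map (\x -> exp (x - m)) xs
    total = sum exps

-- Index of the largest element of a non-empty list.
argmax :: [Double] -> Int
argmax xs = fst (maximumBy (comparing snd) (zip [0 ..] xs))

-- Predict the next token from logits of shape [Seq, VocabSize].
-- Only the last row (the last token's logits) is used.
-- Assumes logits is non-empty and each row has one entry per vocabulary item.
predictNext :: [String] -> [[Double]] -> String
predictNext vocabulary logits = vocabulary !! argmax (softmax (last logits))

-- Made-up four-word vocabulary and logits for a two-token prompt.
toyVocab :: [String]
toyVocab = ["[PAD]", "if", "else", "endif"]

toyLogits :: [[Double]]
toyLogits =
  [ [0.1, 2.0, 0.3, 0.0]
  , [0.0, 0.5, 0.2, 3.1]
  ]
```

`predictNext toyVocab toyLogits` returns `"endif"`, since index 3 holds the largest value in the last row. Because softmax preserves ordering, the argmax of the raw logits is the same index; the softmax is only needed when you also want to read off the probability.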

### **Phase 5: The Loop (Autoregression)**

**Goal:** Watch the sequence grow.

1. **Manual Step-by-Step:**  
   * Start with "The". Run the model. Get "wizard".  
   * **Manually** construct the new input "The wizard". Run the model. Get "cast".  
   * **Manually** construct "The wizard cast".  
2. **Why:** This tedious manual process forces you to understand exactly what the generation loop in Main.hs automates at every step.
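
The manual procedure above is greedy autoregressive decoding, which can be sketched in a few lines. This is an illustration of the idea, not the loop in Main.hs: `NextToken` stands in for a full forward pass plus argmax, and `toyStep` is a hypothetical lookup that only reproduces the "The wizard cast" example.

```haskell
-- A stand-in for the model: maps the tokens seen so far to the next token.
type NextToken = [String] -> String

-- Greedy decoding: predict one token, append it, and feed the longer
-- sequence back in, for a fixed number of steps.
generate :: NextToken -> Int -> [String] -> [String]
generate step n tokens
  | n <= 0 = tokens
  | otherwise = generate step (n - 1) (tokens ++ [step tokens])

-- Hypothetical toy "model" that looks only at the most recent token.
toyStep :: NextToken
toyStep tokens = case reverse tokens of
  ("The" : _) -> "wizard"
  ("wizard" : _) -> "cast"
  _ -> "[PAD]"
```

`generate toyStep 2 ["The"]` returns `["The", "wizard", "cast"]`. Note that each step re-runs the stand-in model on the entire sequence so far, which is exactly what the manual exercise has you do.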