### Overview

| Step | Function to Call | What to Inspect |
| :---- | :---- | :---- |
| **1\. Input** | forwardEmbedding | Shapes, verify vectors are non-zero. |
| **2\. Mask** | makeCausalMask | **Visualize:** Ensure it is a triangle. |
| **3\. Attention** | forwardMHADebug | **Visualize:** attnWeights. Are they focusing on relevant past words? |
| **4\. Norms** | forwardLayerNorm | Check Mean/StdDev (should be \~0 and \~1). |
| **5\. Prediction** | finalLayer | argmax of the last token. Is it a real word? |

### Initialization

In [1]:
{-# LANGUAGE OverloadedStrings #-}

import Torch (DType, Device, Parameter, Tensor, asTensor, defaultOpts, makeIndependent, ones, ones', toDependent, zeros, zeros')
import qualified Torch as Th
import qualified Torch.Autograd as TA
import qualified Torch.Functional as F
import qualified Torch.Functional.Internal as FI
import qualified Torch.NN as NN
import qualified Torch.Optim as Optim
import qualified Torch.Serialize as Serialize
import qualified Data.Set.Ordered as OSet
import qualified Control.Foldl as L
import qualified Data.Text as T
import Data.Maybe (fromMaybe)
import Text.Printf
import IHaskell.Display
import DecoderTransformer

-- 1. Define both files used during training
let trainFile = "rpg-training-tokenized.txt"
let evalFile = "rpg-evaluation-tokenized.txt"

-- 2. Load both files
vocabParts <- traverse buildVocabFromFile [trainFile, evalFile]

-- 3. Merge them AND add the [PAD] token (Crucial step!)
-- This matches the logic in your compiled program
let vocab = L.fold (L.Fold (OSet.|<>) (OSet.singleton "[PAD]") id) vocabParts
let vocabSize = OSet.size vocab

-- 4. Initialize an empty model structure
model <- initModel vocabSize

-- 5. Load the trained weights from the file created by 'stack run'
loadedTensors <- Serialize.load "rpg_model.pt"

-- 6. Hydrate the model: wrap each tensor as a Parameter and swap it into the model
loadedParams <- mapM makeIndependent loadedTensors
let trainedModel = Th.replaceParameters model loadedParams



In [2]:
{-# LANGUAGE RecordWildCards #-}

forwardEmbedding :: TransformerModel -> Th.Tensor -> Th.Tensor
forwardEmbedding TransformerModel {..} input = 
  let w = toDependent embedWeights
      emb = F.embedding False False w paddingIdx input
      
      -- Get current sequence length from input shape [Batch, SeqLen]
      currentSeqLen = Th.size 1 input
      
      -- Unwrap and slice the positional encoding to the current sequence length
      -- e.g. [1, 64, 64] -> [1, 5, 64] for a 5-token input
      fullPos = toDependent posEncoding
      slicedPos = Th.sliceDim 1 0 currentSeqLen 1 fullPos
      
  in emb + slicedPos

In [10]:
{-# LANGUAGE RecordWildCards #-}

-- Returns (Output, AttentionWeights)
forwardMHADebug :: MultiHeadAttention -> Th.Tensor -> (Th.Tensor, Th.Tensor)
forwardMHADebug MultiHeadAttention {..} x =
  let 
      -- 1. Linear Projections
      q = NN.forward mhaLinearQ x
      k = NN.forward mhaLinearK x
      v = NN.forward mhaLinearV x

      headDim = mhaEmbedDim `Prelude.div` mhaHeads
      batch = head (Th.shape x)
      seqLength = Th.shape x !! 1

      -- 2. Reshape
      viewShape = [batch, seqLength, mhaHeads, headDim]
      q' = F.transpose (F.Dim 1) (F.Dim 2) $ Th.reshape viewShape q
      k' = F.transpose (F.Dim 1) (F.Dim 2) $ Th.reshape viewShape k
      v' = F.transpose (F.Dim 1) (F.Dim 2) $ Th.reshape viewShape v

      -- 3. Scores
      kT = F.transpose (F.Dim 2) (F.Dim 3) k'
      scoresRaw = F.matmul q' kT
      dk = Th.asTensor (fromIntegral headDim :: Float)
      scoresScaled = scoresRaw / F.sqrt dk

      -- 4. Mask: intentionally omitted in this debug version, so the weights
      -- below are NOT causal (every token can also attend to later tokens).
      -- Use forwardMhaInspect for the masked version that shows the triangle.
      
      -- 5. Softmax -> THIS IS THE HEATMAP
      attnWeights = F.softmax (F.Dim 3) scoresScaled

      -- 6. Context
      context = F.matmul attnWeights v'
      contextT = F.transpose (F.Dim 1) (F.Dim 2) context
      contextReshaped = Th.reshape [batch, seqLength, mhaEmbedDim] contextT
      
      finalOut = NN.forward mhaLinearOut contextReshaped
   in (finalOut, attnWeights)

In [31]:
-- Returns: (Output, SoftmaxWeights, RawScores, Q, K, V)
forwardMhaInspect :: MultiHeadAttention -> Th.Tensor -> (Th.Tensor, Th.Tensor, Th.Tensor, Th.Tensor, Th.Tensor, Th.Tensor)
forwardMhaInspect MultiHeadAttention {..} x =
  let 
      -- 1. Linear Projections
      q = NN.forward mhaLinearQ x
      k = NN.forward mhaLinearK x
      v = NN.forward mhaLinearV x

      headDim = mhaEmbedDim `Prelude.div` mhaHeads
      batch = head (Th.shape x)
      seqLength = Th.shape x !! 1

      -- 2. Reshape for Heads
      viewShape = [batch, seqLength, mhaHeads, headDim]
      q' = F.transpose (F.Dim 1) (F.Dim 2) $ Th.reshape viewShape q
      k' = F.transpose (F.Dim 1) (F.Dim 2) $ Th.reshape viewShape k
      v' = F.transpose (F.Dim 1) (F.Dim 2) $ Th.reshape viewShape v

      -- 3. Raw Scores (Affinity)
      kT = F.transpose (F.Dim 2) (F.Dim 3) k'
      scoresRaw = F.matmul q' kT
      
      -- 4. Scaled Scores
      dk = Th.asTensor (fromIntegral headDim :: Float)
      scoresScaled = scoresRaw / F.sqrt dk

      -- 4b. Apply Causal Mask
      -- Shape: [Seq, Seq] broadcasted to [Batch, Heads, Seq, Seq]
      mask = makeCausalMask seqLength (Th.device x) (Th.dtype x)
      scoresMasked = scoresScaled + mask

      -- 5. Softmax (Probability)
      attnWeights = F.softmax (F.Dim 3) scoresMasked

      -- 6. Context
      context = F.matmul attnWeights v'
      contextT = F.transpose (F.Dim 1) (F.Dim 2) context
      contextReshaped = Th.reshape [batch, seqLength, mhaEmbedDim] contextT
      
      finalOut = NN.forward mhaLinearOut contextReshaped
   in (finalOut, attnWeights, scoresRaw, q', k', v')

In [29]:
-- Helper: create an additive causal mask (strict upper triangle = -1e9, elsewhere 0)
makeCausalMask :: Int -> Th.Device -> Th.DType -> Th.Tensor
makeCausalMask sz dev dtype =
  let -- Create a matrix of ones [sz, sz]
      opts = Th.withDevice dev (Th.withDType dtype defaultOpts)
      onesMat = ones [sz, sz] opts
      
      -- Create Upper Triangular mask (1s in upper triangle, 0s elsewhere)
      -- Diag 1 means "start one step above the main diagonal"
      upperTri = F.triu (F.Diag 1) onesMat

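      -- Convert 1s to -1e9, 0s to 0.0.
      -- A finite -1e9 stands in for -inf: multiplying by a true -inf
      -- would turn the 0 entries into NaN (0 * -inf).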
      negInf = -1e9 :: Float
   in upperTri * asTensor negInf

In [24]:
{-# LANGUAGE OverloadedStrings #-}

-- A helper to visualize a 2D attention matrix as an HTML Table
visualizeAttention :: Th.Tensor -> IO ()
visualizeAttention t = do
  -- Convert tensor to list of lists (assuming 2D [Seq, Seq])
  let rows = (Th.asValue t :: [[Float]])
  
  -- Helper to generate the HTML for a single cell
  let cell val = 
        let 
           -- Background: Blue with opacity matching the attention weight
           bgStyle = printf "background-color: rgba(0, 0, 255, %.2f)" val :: String
           
           -- Text Color: White if background is dark/intense (> 0.5), Black otherwise
           -- This prevents "White Text on White Background" issues for low values
           textColor = if val > 0.5 then "white" else "black" :: String
           
        in printf "<td style='%s; color: %s; width: 40px; height: 40px; border: 1px solid #ddd; text-align: center; font-size: 12px;'>%.2f</td>" bgStyle textColor val
  
  let rowHtml r = "<tr>" ++ concatMap cell r ++ "</tr>"
  let tableHtml = "<table style='border-collapse: collapse; font-family: sans-serif;'>" ++ concatMap rowHtml rows ++ "</table>"
  
  printDisplay $ Display [html tableHtml]

### **Phase 1: The Input Stage (Shape Shifting)**

Goal: Understand how text becomes geometry.  
Key Concept: \[Batch, Seq\] $\rightarrow$ \[Batch, Seq, EmbedDim\]

1. **Inspect the Vocabulary:** Pick 5 words and find their indices manually.  
2. **Run Embeddings:** Use your forwardEmbedding helper.  
   * **Investigation:** Print the shape. It should be \[1, 5, 64\].  
   * **Sanity Check:** Print embedding\[0\]\[0\] (the vector for the first word) and embedding\[0\]\[1\] (the second). They should be completely different sets of numbers.  
3. **Positional Encoding:**  
   * **Action:** Extract posEncoding from the model.  
   * **Experiment:** Verify that the vector at position 0 is different from the vector at position 1\. Without this, the model sees "The dog bit the man" and "The man bit the dog" as identical "bags of words."
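
The "bag of words" point can be checked on plain lists, without hasktorch. This is a minimal sketch under stated assumptions: `embed`, `pos`, and the 2-dimensional toy vectors are hypothetical and are not taken from the trained model (whose positional encoding is a stored parameter).

```haskell
import Data.List (sort)

type Vec = [Double]

-- Hypothetical 2-dimensional token embeddings (toy values, not from the model).
embed :: String -> Vec
embed "dog" = [1, 0]
embed "man" = [0, 1]
embed _     = [0, 0]

-- Hypothetical positional vectors: one distinct vector per position.
pos :: Int -> Vec
pos i = [0.1 * fromIntegral i, 0.01 * fromIntegral i]

-- Token embeddings only: word order leaves no trace in the vectors.
withoutPos :: [String] -> [Vec]
withoutPos = map embed

-- Token embedding plus the vector for the position it occupies.
withPos :: [String] -> [Vec]
withPos ws = zipWith (zipWith (+)) (map embed ws) (map pos [0 ..])

main :: IO ()
main = do
  let a = ["dog", "bit", "man"]
      b = ["man", "bit", "dog"]
  -- Same multiset of vectors: the two sentences are indistinguishable.
  print (sort (withoutPos a) == sort (withoutPos b))  -- True
  -- With positions added, the multisets differ.
  print (sort (withPos a) == sort (withPos b))        -- False
```

Sorting the vectors deliberately throws away order, which is what an order-blind model would effectively see; only the position-augmented inputs still tell the two sentences apart.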

In [9]:
{-# LANGUAGE OverloadedStrings #-}

-- Indices of 5 words (a word missing from the vocab silently falls back to index 0, where [PAD] sits)
let fiveWords = ["if", "else", "endif", "eval", "callp"] :: [T.Text]
let wordIdxs = map (\w -> fromMaybe 0 (OSet.findIndex w vocab)) fiveWords
print wordIdxs

-- embedding shape
let wordsTensor = Th.reshape [1,5] $ asTensor wordIdxs
let weights = toDependent (embedWeights trainedModel)
let emb = F.embedding False False weights 0 wordsTensor 
print (Th.shape emb)

putStrLn "Embedding"

-- "if" embedding
let ifEmb = Th.select 0 0 $ Th.select 0 0 emb
print $ Th.sliceDim 0 0 5 1 ifEmb

-- "else" embedding
let elseEmb = Th.select 0 1 $ Th.select 0 0 emb
print $ Th.sliceDim 0 0 5 1 elseEmb

putStrLn "Positional encoding"

-- positional encoding
let fullPosEnc = toDependent $ posEncoding trainedModel
let posEnc = Th.sliceDim 1 0 5 1 fullPosEnc
print $ Th.shape posEnc

-- first word in sequence encoding
let pos1 = Th.select 0 0 $ Th.select 0 0 posEnc
print $ Th.sliceDim 0 0 5 1 pos1

-- second word in sequence encoding
let pos2 = Th.select 0 1 $ Th.select 0 0 posEnc
print $ Th.sliceDim 0 0 5 1 pos2

putStrLn "Embedding + Positional encoding"

-- run forwardEmbedding
let embWithPos = forwardEmbedding trainedModel wordsTensor
print (Th.shape embWithPos)
let ifEmbWithPos = Th.select 0 0 $ Th.select 0 0 embWithPos
print $ Th.sliceDim 0 0 5 1 ifEmbWithPos
print $ Th.sliceDim 0 0 5 1 (ifEmb + pos1)
let elseEmbWithPos = Th.select 0 1 $ Th.select 0 0 embWithPos
print $ Th.sliceDim 0 0 5 1 elseEmbWithPos
print $ Th.sliceDim 0 0 5 1 (elseEmb + pos2)


[5,128,13,126,139]

[1,5,64]

Embedding

Tensor Float [5] [-7.3626e-3,  3.7367e-2, -3.4854e-2, -9.8547e-2, -6.7193e-2]

Tensor Float [5] [ 6.7673e-2, -2.2636e-4, -5.7980e-2,  1.9481e-3, -4.4925e-2]

Positional encoding

[1,5,64]

Tensor Float [5] [-2.9523e-2, -1.7858e-2, -1.9255e-3, -1.4800e-2,  4.0148e-3]

Tensor Float [5] [-3.7810e-3, -1.0424e-2,  1.2837e-2,  9.9363e-5, -1.9342e-2]

Embedding + Positional encoding

[1,5,64]

Tensor Float [5] [-3.6886e-2,  1.9509e-2, -3.6780e-2, -0.1133   , -6.3179e-2]

Tensor Float [5] [-3.6886e-2,  1.9509e-2, -3.6780e-2, -0.1133   , -6.3179e-2]

Tensor Float [5] [ 6.3892e-2, -1.0651e-2, -4.5144e-2,  2.0474e-3, -6.4267e-2]

Tensor Float [5] [ 6.3892e-2, -1.0651e-2, -4.5144e-2,  2.0474e-3, -6.4267e-2]

### **Phase 2: The Heart (Multi-Head Attention)**

Goal: Understand how tokens "talk" to each other.  
Key Concept: $Q, K, V$ and the Causal Mask.

1. **Run the attention pass:** Pass your embeddings from Phase 1 into forwardMHADebug, or into forwardMhaInspect, which also applies the causal mask and returns the intermediate tensors.  
2. **Inspect Q, K, V:**  
   * We project each 64-dimensional input vector into three different spaces.  
   * **Action:** Check that $Q$ and $K$ have the same shape, so that $QK^\top$ is defined.  
3. **The "Raw" Scores (Affinity):**  
   * forwardMhaInspect already returns scoresRaw (before scaling, masking, and Softmax); forwardMHADebug has to be modified to do so.  
   * **Visual:** These values are unbounded and can be positive or negative, so they do not yet read as probabilities.  
4. **The Causal Mask (The Triangle):**  
   * **Visual:** Use visualizeAttention on the mask tensor itself. You **must** see a solid triangle of very large negative numbers (\-1e9, standing in for \-inf) above the diagonal, and zeros on and below it.  
   * **Why:** This proves the model cannot "cheat" by looking at future words.  
5. **The Probability Map (Softmax):**  
   * **Visual:** Look at attnWeights. Each row must sum to 1.0. This tells you: *For the word at position 3, how much does it care about positions 0, 1, 2, and itself?*
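
Written out, the steps above correspond to the standard masked scaled dot-product attention. The symbols $d_k$ (the per-head dimension, 16 here because 64 is split across 4 heads) and $M$ (the additive mask built by makeCausalMask) are labels introduced for this sketch:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}} + M\right) V,
\qquad
M_{ij} = \begin{cases} 0 & j \le i \\ -10^9 & j > i \end{cases}
$$

Because $e^{-10^9}$ underflows to 0, every softmax entry above the diagonal is 0, while each row still sums to 1.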

What you are looking for here:
1. Q vs K: These vectors should be different. If $Q \approx K$, your model hasn't learned to separate "Asking" from "Answering".
2. Raw vs Softmax:
   * Raw: Can be positive or negative (e.g., 5.2, -9.1). High positive means "look here". High negative means "ignore this".
   * Softmax: Must be between 0.0 and 1.0.
3. The Map: Since you provided 5 specific words, does "endif" (index 2) look back at "if" (index 0)? If the model understands code structure, you might see a higher weight there!

Look at the attention row for "else". Does it have a high value (bright color) in the column for "if"? That would confirm "else" is paying attention to "if".
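
The mask-then-softmax behaviour described above can be reproduced on plain lists. This is a minimal sketch, not the notebook's hasktorch code: `applyCausalMask`, `softmaxRow`, `causalWeights`, and the toy score matrix are hypothetical names and values, and the scores are assumed to be already scaled.

```haskell
type Row = [Double]

-- Add the causal mask: entry (i, j) gets -1e9 whenever j > i (a future position).
applyCausalMask :: [Row] -> [Row]
applyCausalMask scores =
  [ [ if j > i then s - 1e9 else s | (j, s) <- zip [0 ..] row ]
  | (i, row) <- zip [0 :: Int ..] scores
  ]

-- Numerically stable softmax over one row.
softmaxRow :: Row -> Row
softmaxRow xs =
  let m  = maximum xs
      es = [exp (x - m) | x <- xs]
      z  = sum es
   in map (/ z) es

-- Mask first, then normalise each row into probabilities.
causalWeights :: [Row] -> [Row]
causalWeights = map softmaxRow . applyCausalMask

main :: IO ()
main = do
  -- Toy 3x3 score matrix (made-up values).
  let scores = [[0.5, 2.0, -1.0], [0.1, -0.4, 3.0], [1.0, 1.0, 1.0]]
      w = causalWeights scores
  mapM_ print w      -- the first row is [1.0,0.0,0.0]
  print (map sum w)  -- every row sums to (approximately) 1
```

The first row is always `[1.0, 0.0, ...]` because the first token can only attend to itself, which matches the top row of the attention map.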

In [50]:
let firstBlock = head (layers trainedModel)
let mhaLayer = attention firstBlock

putStrLn "What's going on with the Q, K, V transformation"
let numHeads = mhaHeads mhaLayer
    headDim = mhaEmbedDim mhaLayer `Prelude.div` numHeads
    batch = head (Th.shape embWithPos)
    seqLength = Th.shape embWithPos !! 1
let qRaw = NN.forward (mhaLinearQ mhaLayer) embWithPos
print $ Th.shape qRaw
let viewShape = [batch, seqLength, numHeads, headDim]
let qRawReshaped = Th.reshape viewShape qRaw
print $ Th.shape qRawReshaped
let qRawTransposed = F.transpose (F.Dim 1) (F.Dim 2) qRawReshaped
print $ Th.shape qRawTransposed

-- 1. Run the inspection pass
-- We use the 'mhaLayer' you extracted earlier and your 'embWithPos'
let (out, weights, raw, q, k, v) = forwardMhaInspect mhaLayer embWithPos

putStrLn "shapes of: q, k, v, raw, weights, out"
print $ Th.shape q
print $ Th.shape k
print $ Th.shape v
print $ Th.shape raw
print $ Th.shape weights
print $ Th.shape out

-- 2. Inspect Q vs K (The "Query" and "Key")
-- Pick Head 0, Batch 0
-- Shape: [SeqLen, HeadDim]
let q0 = Th.select 0 0 (Th.select 0 0 q)
let k0 = Th.select 0 0 (Th.select 0 0 k)

putStrLn "Query Vector for word 'else' (Position 1):"
print $ Th.sliceDim 0 0 5 1 (Th.select 0 1 q0) 

putStrLn "Key Vector for word 'if' (Position 0):"
print $ Th.sliceDim 0 0 5 1 (Th.select 0 0 k0)

-- 3. Visualizing the "Raw Affinity" (Pre-Softmax) vs "Probability" (Post-Softmax)
let raw0 = Th.select 0 0 (Th.select 0 0 raw)
let w0   = Th.select 0 0 (Th.select 0 0 weights)

putStrLn "Raw Score (How much 'else' likes 'if'):"
-- Row 1 ("else"), Column 0 ("if")
print $ Th.select 0 0 (Th.select 0 1 raw0)

putStrLn "Probability (After Softmax):"
print $ Th.select 0 0 (Th.select 0 1 w0)

-- 4. See the whole map
putStrLn "Attention Map (Head 0):"
visualizeAttention w0

What's going on with the Q, K, V transformation

[1,5,64]

[1,5,4,16]

[1,4,5,16]

shapes of: q, k, v, raw, weights, out

[1,4,5,16]

[1,4,5,16]

[1,4,5,16]

[1,4,5,5]

[1,4,5,5]

[1,5,64]

Query Vector for word 'else' (Position 1):

Tensor Float [5] [-0.4223   , -0.2898   ,  0.5669   ,  0.4373   , -0.5407   ]

Key Vector for word 'if' (Position 0):

Tensor Float [5] [-0.1322   ,  1.8142e-2, -1.8296e-2,  4.5183e-2, -8.9592e-2]

Raw Score (How much 'else' likes 'if'):

Tensor Float []  8.8453e-2

Probability (After Softmax):

Tensor Float []  0.6105

Attention Map (Head 0):

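| 0 | 1 | 2 | 3 | 4 |
| :---- | :---- | :---- | :---- | :---- |
| 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0.61 | 0.39 | 0.0 | 0.0 | 0.0 |
| 0.44 | 0.27 | 0.29 | 0.0 | 0.0 |
| 0.29 | 0.21 | 0.23 | 0.27 | 0.0 |
| 0.23 | 0.16 | 0.18 | 0.22 | 0.21 |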


### **Phase 3: The Body (Feed Forward & Norms)**

Goal: Understand how the model "thinks" about what it just saw.  
Key Concept: Expansion (64 \-\> 256\) and Contraction (256 \-\> 64).

1. **Extract the FFN:** let ff \= feedForward firstBlock.  
2. **Run it:** Pass the output of Attention into forwardFF.  
3. **Shape Check:** Notice that the shape **does not change** (\[1, 5, 64\]). The FFN processes every token *independently* (unlike Attention, which mixes them).  
4. **LayerNorm:**  
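   * **Experiment:** Calculate the mean and variance over the embedding dimension of each token *before* and *after* forwardLayerNorm.  
   * **Observation:** After norm, each token's mean should be \~0 and its standard deviation \~1 (up to any learned scale and shift), so most values fall roughly between \-2 and \+2. This keeps the math stable.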
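
The before/after comparison can be sketched on a single token vector with plain lists. This is a minimal sketch under stated assumptions: `layerNorm` here omits the learned scale and shift, the epsilon of 1e-5 is an assumed default, and the input values are made up.

```haskell
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Population variance (divide by n), as layer normalisation uses.
variance :: [Double] -> Double
variance xs =
  let m = mean xs
   in mean [(x - m) * (x - m) | x <- xs]

-- LayerNorm without learned scale/shift; eps = 1e-5 is an assumption.
layerNorm :: [Double] -> [Double]
layerNorm xs =
  let m = mean xs
      v = variance xs
   in [(x - m) / sqrt (v + 1e-5) | x <- xs]

main :: IO ()
main = do
  -- One hypothetical token vector (4 features instead of 64).
  let token  = [3.0, -1.5, 8.0, 0.25]
      normed = layerNorm token
  putStrLn "before: (mean, variance)"
  print (mean token, variance token)
  putStrLn "after: (mean, variance)"
  print (mean normed, variance normed)
```

With hasktorch tensors the same check is applied along the last dimension, once per token.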

### **Phase 4: The Output (Logits to Predictions)**

Goal: Converting abstract vectors back to words.  
Key Concept: The Unembedding Layer (Linear).

1. **The Final Projection:** Run NN.forward (finalLayer model) lastState.  
2. **Shape Change:** \[1, 5, 64\] $\rightarrow$ \[1, 5, VocabSize\].  
3. **The "Next Word" Game:**  
   * Take the vector for the **last** token in the sequence (index 4).  
   * Run F.softmax on it.  
   * **Action:** Find the argmax (the highest probability index). Look that index up in your vocab.  
   * **Verification:** Does the predicted word make grammatical sense following your prompt?
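
The "next word" step can be sketched on a plain list of logits. This is a minimal sketch: `toyVocab` and the logit values are hypothetical, and `softmax` and `argmax` are list-based stand-ins for the tensor operations.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Numerically stable softmax over a list of logits.
softmax :: [Double] -> [Double]
softmax xs =
  let m  = maximum xs
      es = [exp (x - m) | x <- xs]
   in map (/ sum es) es

-- Index of the largest element.
argmax :: [Double] -> Int
argmax xs = fst (maximumBy (comparing snd) (zip [0 ..] xs))

-- Hypothetical 4-word vocabulary.
toyVocab :: [String]
toyVocab = ["[PAD]", "if", "else", "endif"]

main :: IO ()
main = do
  -- Made-up logits for the last position in the sequence.
  let lastLogits = [0.1, 2.3, -1.0, 0.7]
      probs      = softmax lastLogits
      idx        = argmax probs
  print (idx, toyVocab !! idx)      -- (1,"if")
  print (argmax lastLogits == idx)  -- True: softmax preserves the ordering
```

Since softmax is monotonic, taking the argmax of the raw logits picks the same word; the softmax is only needed when you want to read the probabilities themselves.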

### **Phase 5: The Loop (Autoregression)**

Goal: Watch the sequence grow.

1. **Manual Step-by-Step:**  
   * Start with "The". Run the model. Get "wizard".  
   * **Manually** construct the new input "The wizard". Run the model. Get "cast".  
   * **Manually** construct "The wizard cast".  
2. **Why:** This tedious manual process forces you to understand exactly what the program loop in Main.hs does millions of times.
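
The manual procedure above is the following loop in miniature. This is a minimal sketch, not the code in Main.hs: `generate` and `toyNext` are hypothetical, and `toyNext` stands in for "run the model and take the argmax at the last position".

```haskell
-- Append n predicted tokens to the prompt, feeding the grown sequence back in each time.
generate :: ([Int] -> Int) -> Int -> [Int] -> [Int]
generate nextToken n prompt = iterate step prompt !! n
  where
    step toks = toks ++ [nextToken toks]

-- Toy "model": predicts (last token + 1) mod 5. Assumes a non-empty sequence.
toyNext :: [Int] -> Int
toyNext toks = (last toks + 1) `mod` 5

main :: IO ()
main = print (generate toyNext 3 [0])  -- [0,1,2,3]
```

Each step re-runs the predictor on the whole sequence so far, which is exactly what the manual "construct the new input and run the model again" steps do.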