You have moved from Batch Mode (compile $\rightarrow$ run $\rightarrow$ wait $\rightarrow$ read logs) to Interactive Mode (REPL with persistence).
For Deep Learning specifically, this is a massive shift. In your Main.hs, if you want to know "What is the shape of the Query tensor inside the attention head?", you have to insert a print statement, recompile the whole project, and run it. In Jupyter, you can inspect it instantly.

Below are four concrete things you can now do with your Transformer implementation that were impractical before.

In [7]:
:m +Prelude

In [24]:
import Torch (DType, Device, Parameter, Tensor, asTensor, defaultOpts, makeIndependent, ones, ones', toDependent, zeros, zeros')
import qualified Torch as Th
import qualified Torch.Autograd as TA
import qualified Torch.Functional as F
import qualified Torch.Functional.Internal as FI
import qualified Torch.NN as NN
import qualified Torch.Optim as Optim
import qualified Torch.Serialize as Serialize
import qualified Data.Set.Ordered as OSet
import qualified Control.Foldl as L
import qualified Data.Text as T
import Data.Maybe (fromMaybe)
import Text.Printf
import IHaskell.Display
import DecoderTransformer

In [12]:
t = ones' [2, 2]
print t

Tensor Float [2,2] [[ 1.0000   ,  1.0000   ],
                    [ 1.0000   ,  1.0000   ]]

In [15]:
let a = ones' [3, 3]
b <- Th.randIO' [3, 3]
-- This triggers the underlying C++ matrix multiplication
let c = Th.matmul a b
print c

Tensor Float [3,3] [[ 0.7668   ,  1.1573   ,  1.6517   ],
                    [ 0.7668   ,  1.1573   ,  1.6517   ],
                    [ 0.7668   ,  1.1573   ,  1.6517   ]]

## Four things that can be done now

### 1. "Surgical" Shape Debugging
Transformers are notorious for silent failures where dimensions are permuted (e.g., swapping Sequence Length and Embedding Dimension).
In Main.hs, forwardMHA is a black box. In Jupyter, you can copy the body of that function into a cell and inspect the intermediate tensors step-by-step.

In [17]:
-- 1. Initialize the full model structure
-- We use a small vocab size (e.g., 100) just for testing shapes
let vocabSize = 100
model <- initModel vocabSize

-- 2. Extract just the Attention layer from the first block
-- Your model has 'layers', which is a list of TransformerBlocks
let firstBlock = head (layers model)
let mhaLayer = attention firstBlock

-- 3. Create dummy input [Batch=2, Seq=10, Dim=64]
-- (Matches the embedDim=64 in your code)
let dummyInput = ones' [2, 10, 64]

-- 4. Run ONLY the Multi-Head Attention forward pass
-- This lets you see the output shape of just that specific component
let output = forwardMHA mhaLayer dummyInput

print (Th.shape output)
-- Should be [2, 10, 64]


[2,10,64]
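
When a shape check fails, it helps to know what every intermediate tensor should look like, not just the final output. The sketch below writes out the expected shapes for the standard multi-head attention steps (projection, reshape, transpose, scores, context). It is pure Haskell and does not call hasktorch. `mhaShapes` and its step labels are hypothetical names introduced here for illustration, and the step order assumes your forwardMHA follows the usual reshape-then-transpose layout.

```haskell
-- Expected tensor shapes at each step of multi-head attention.
-- Hypothetical helper: not part of hasktorch or DecoderTransformer.
-- Arguments: number of heads, then the input shape [batch, seqLen, embedDim].
mhaShapes :: Int -> [Int] -> Either String [(String, [Int])]
mhaShapes heads [batch, seqLen, embedDim]
  | heads <= 0 = Left "heads must be positive"
  | embedDim `mod` heads /= 0 = Left "embedDim must be divisible by heads"
  | otherwise =
      let headDim = embedDim `div` heads
       in Right
            [ ("q, k, v after linear projection", [batch, seqLen, embedDim])
            , ("after reshape", [batch, seqLen, heads, headDim])
            , ("after transpose of dims 1 and 2", [batch, heads, seqLen, headDim])
            , ("k transposed on dims 2 and 3", [batch, heads, headDim, seqLen])
            , ("scores and attention weights", [batch, heads, seqLen, seqLen])
            , ("context", [batch, heads, seqLen, headDim])
            , ("context transposed back", [batch, seqLen, heads, headDim])
            , ("final output", [batch, seqLen, embedDim])
            ]
mhaShapes _ other =
  Left ("expected [batch, seqLen, embedDim], got " ++ show other)
```

In the notebook, `either putStrLn (mapM_ print) (mhaShapes 4 [2, 10, 64])` prints the whole table (the 4 heads here are an assumed value). You can then compare each row against `Th.shape` of the matching intermediate tensor. A swapped Sequence Length and Embedding Dimension shows up as the first row that disagrees.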

### 2. Interactive Generation (The "Chatbot" Feel)
Currently, your generateSequence function runs at the end of training. If you want to see how the model responds to "The wizard cast a", you have to re-run the program.
In Jupyter, you can load the model once and then generate endless variations instantly.

In [11]:
{-# LANGUAGE OverloadedStrings #-}

-- 1. Define both files used during training
let trainFile = "rpg-training-tokenized.txt"
let evalFile = "rpg-evaluation-tokenized.txt"

-- 2. Load both files
vocabParts <- traverse buildVocabFromFile [trainFile, evalFile]

-- 3. Merge them AND add the [PAD] token (Crucial step!)
-- This matches the logic in your compiled program
let vocab = L.fold (L.Fold (OSet.|<>) (OSet.singleton "[PAD]") id) vocabParts
let vocabSize = OSet.size vocab

-- 4. Initialize an empty model structure
model <- initModel vocabSize

-- 5. Load the trained weights from the file created by 'stack run'
loadedTensors <- Serialize.load "rpg_model.pt"

-- 6. Hydrate the model
loadedParams <- mapM makeIndependent loadedTensors
let trainedModel = Th.replaceParameters model loadedParams

In [12]:
generateSequence trainedModel vocab 50 "The wizard"

Prompt: The wizard -> 
 = 4
 c eval BinaryData = " encoding
 c if BinaryData = 20
 P E
 R * *** Exported because , internally used by RPGUNIT tests ***
 P LIKE SENT TO YOUR E
 P HTTP_SetfileCCSID ...
 // Validate types
 P E

Done.

### 3. Visualizing Attention Weights
This is the "Killer Feature" of Transformers. Your current forwardMHA calculates attnWeights but discards them after the matrix multiplication.
In Jupyter, you can redefine a "Debug" version of forwardMHA that returns the weights, pass your data through it, and actually look at the probability distribution.

In [25]:
-- Verify your causal mask works
let mask = makeCausalMask 5 (Th.Device Th.CPU 0) Th.Float
print mask

Tensor Float [5,5] [[ -0.0000, -1.0000e9   , -1.0000e9   , -1.0000e9   , -1.0000e9   ],
                    [ -0.0000,  -0.0000, -1.0000e9   , -1.0000e9   , -1.0000e9   ],
                    [ -0.0000,  -0.0000,  -0.0000, -1.0000e9   , -1.0000e9   ],
                    [ -0.0000,  -0.0000,  -0.0000,  -0.0000, -1.0000e9   ],
                    [ -0.0000,  -0.0000,  -0.0000,  -0.0000,  -0.0000]]
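
The mask above is additive: it is meant to be added to the attention scores before the softmax. The sketch below shows on plain lists why a value of -1e9 turns into an attention weight of exactly zero. It is a pure Haskell illustration, not the hasktorch code path, and `softmaxRow`, `causalMask` and `maskedAttention` are hypothetical names introduced here.

```haskell
-- Numerically stable softmax over one row of scores.
softmaxRow :: [Double] -> [Double]
softmaxRow xs = map (/ total) exps
  where
    m = maximum xs
    exps = map (\x -> exp (x - m)) xs
    total = sum exps

-- Additive causal mask: 0 on and below the diagonal, -1e9 above it.
causalMask :: Int -> [[Double]]
causalMask n =
  [ [if j > i then -1.0e9 else 0 | j <- [0 .. n - 1]]
  | i <- [0 .. n - 1]
  ]

-- Add the mask to a square score matrix, then apply softmax row by row.
maskedAttention :: [[Double]] -> [[Double]]
maskedAttention scores =
  map softmaxRow (zipWith (zipWith (+)) scores (causalMask (length scores)))
```

With uniform scores, `maskedAttention (replicate 3 (replicate 3 0))` gives a first row of `[1.0, 0.0, 0.0]` and a second row of `[0.5, 0.5, 0.0]`: each position spreads its weight only over itself and earlier positions. The same lower-triangular pattern should appear in a heatmap of real attention weights whenever the mask is applied.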

In [17]:
{-# LANGUAGE OverloadedStrings #-}

-- A helper to visualize a 2D attention matrix as an HTML Table
visualizeAttention :: Th.Tensor -> IO ()
visualizeAttention t = do
  -- Convert tensor to list of lists (assuming 2D [Seq, Seq])
  let rows = (Th.asValue t :: [[Float]])
  
  -- Helper to generate the HTML for a single cell
  let cell val = 
        let 
           -- Background: Blue with opacity matching the attention weight
           bgStyle = printf "background-color: rgba(0, 0, 255, %.2f)" val :: String
           
           -- Text Color: White if background is dark/intense (> 0.5), Black otherwise
           -- This prevents "White Text on White Background" issues for low values
           textColor = if val > 0.5 then "white" else "black" :: String
           
        in printf "<td style='%s; color: %s; width: 40px; height: 40px; border: 1px solid #ddd; text-align: center; font-size: 12px;'>%.2f</td>" bgStyle textColor val
  
  let rowHtml r = "<tr>" ++ concatMap cell r ++ "</tr>"
  let tableHtml = "<table style='border-collapse: collapse; font-family: sans-serif;'>" ++ concatMap rowHtml rows ++ "</table>"
  
  printDisplay $ Display [html tableHtml]

In [18]:
{-# LANGUAGE RecordWildCards #-}

-- Returns (Output, AttentionWeights)
forwardMHADebug :: MultiHeadAttention -> Th.Tensor -> (Th.Tensor, Th.Tensor)
forwardMHADebug MultiHeadAttention {..} x =
  let 
      -- 1. Linear Projections
      q = NN.forward mhaLinearQ x
      k = NN.forward mhaLinearK x
      v = NN.forward mhaLinearV x

      headDim = mhaEmbedDim `Prelude.div` mhaHeads
      batch = head (Th.shape x)
      seqLength = Th.shape x !! 1

      -- 2. Reshape
      viewShape = [batch, seqLength, mhaHeads, headDim]
      q' = F.transpose (F.Dim 1) (F.Dim 2) $ Th.reshape viewShape q
      k' = F.transpose (F.Dim 1) (F.Dim 2) $ Th.reshape viewShape k
      v' = F.transpose (F.Dim 1) (F.Dim 2) $ Th.reshape viewShape v

      -- 3. Scores
      kT = F.transpose (F.Dim 2) (F.Dim 3) k'
      scoresRaw = F.matmul q' kT
      dk = Th.asTensor (fromIntegral headDim :: Float)
      scoresScaled = scoresRaw / F.sqrt dk

      -- 4. Mask (skipped in this debug version)
      -- Without the causal mask, the weights below include attention to future
      -- positions, so they can differ from what the trained model computes.
      -- To see the lower triangle, add the mask to scoresScaled before the softmax.
      
      -- 5. Softmax -> THIS IS THE HEATMAP
      attnWeights = F.softmax (F.Dim 3) scoresScaled

      -- 6. Context
      context = F.matmul attnWeights v'
      contextT = F.transpose (F.Dim 1) (F.Dim 2) context
      contextReshaped = Th.reshape [batch, seqLength, mhaEmbedDim] contextT
      
      finalOut = NN.forward mhaLinearOut contextReshaped
   in (finalOut, attnWeights)

In [27]:
forwardEmbedding :: TransformerModel -> Tensor -> Tensor
forwardEmbedding TransformerModel {..} input = 
  let w = toDependent embedWeights
      emb = F.embedding False False w paddingIdx input
      pos = toDependent posEncoding
      -- Slice pos to match input sequence length if needed, or broadcast
      -- Simple broadcasting works if seqLen matches
  in emb + pos

In [28]:
let firstBlock = head (layers trainedModel)
let mhaLayer = attention firstBlock

-- We need to tokenize a prompt manually here since we aren't using the full 'generate' loop
let prompt = "The wizard cast a spell"
let tokens = T.words (T.pack prompt)
let indices = map (\w -> fromMaybe 0 (OSet.findIndex w vocab)) tokens

-- Add batch dimension: [1, SeqLen]
let inputTensorLong = asTensor [indices]

-- Embed it: [1, SeqLen, 64]
let dummyInput = forwardEmbedding trainedModel inputTensorLong

-- Run the Debug Forward Pass
let (output, weights) = forwardMHADebug mhaLayer dummyInput

-- Check the shape of weights: should be [1, 4, 5, 5]
-- (Batch=1, Heads=4, Seq=5, Seq=5: one prompt of 5 tokens)
print (Th.shape weights)

-- 2. Slice out ONE attention map
-- Select Batch 0
let batch0 = Th.select 0 0 weights 
-- Select Head 0
let head0 = Th.select 0 0 batch0 

-- Shape should now be [5, 5]
print (Th.shape head0)

-- 3. Visualize!
visualizeAttention head0


### 4. Step-by-Step Gradient Watch
Your training loop prints loss every 50 iterations.
In Jupyter, you can run a single training step and inspect the gradients manually to check for "Exploding Gradients" or "Dead Neurons" (gradients of 0).
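
Once you have the gradient values for a parameter as a flat list, a small summary makes the check concrete. The sketch below is pure Haskell and independent of hasktorch. `GradStats`, `Verdict`, `gradStats` and `diagnose` are hypothetical names introduced here, and the explosion threshold is an assumption you should tune for your model.

```haskell
-- Summary of one parameter's gradient values.
data GradStats = GradStats
  { maxAbs :: Double        -- largest absolute value, ignoring NaNs
  , zeroFraction :: Double  -- share of entries that are exactly 0
  , hasNaN :: Bool          -- True if any entry is NaN
  } deriving (Show, Eq)

data Verdict = Invalid | Exploding | Dead | Healthy
  deriving (Show, Eq)

-- Compute the summary from a flat list of gradient values.
gradStats :: RealFloat a => [a] -> GradStats
gradStats gs = GradStats
  { maxAbs = realToFrac (maximum (0 : map abs (filter (not . isNaN) gs)))
  , zeroFraction = if null gs then 0 else zeros / total
  , hasNaN = any isNaN gs
  }
  where
    zeros = fromIntegral (length (filter (== 0) gs)) :: Double
    total = fromIntegral (length gs) :: Double

-- Classify a summary. The threshold is an assumed cut-off, not a standard value.
diagnose :: Double -> GradStats -> Verdict
diagnose explodeThreshold s
  | hasNaN s = Invalid
  | maxAbs s > explodeThreshold = Exploding
  | zeroFraction s == 1 = Dead
  | otherwise = Healthy
```

In the notebook, one way to get the lists is `TA.grad loss (NN.flattenParameters trainedModel)`, where `loss` is whatever scalar your training step computes. Each resulting tensor can then be flattened and converted, for example with `Th.asValue (Th.reshape [-1] g) :: [Float]`. Treat these calls as a sketch and check them against your hasktorch version. Running `diagnose 100 . gradStats` on each list gives a per-parameter verdict without waiting for the next loss printout.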