diff --git a/_pages/dat450/assignment2.md b/_pages/dat450/assignment2.md
index 1872f88044120..83c374c9b84df 100644
--- a/_pages/dat450/assignment2.md
+++ b/_pages/dat450/assignment2.md
@@ -64,7 +64,7 @@ OLMo 2 uses a type of normalization called [Root Mean Square layer normalization
You can either implement your own normalization layer, or use the built-in [`RMSNorm`](https://docs.pytorch.org/docs/stable/generated/torch.nn.RMSNorm.html) from PyTorch. In the PyTorch implementation, `eps` corresponds to `rms_norm_eps` from our model configuration, while `normalized_shape` should be equal to the hidden layer size. The hyperparameter `elementwise_affine` should be set to `True`, meaning that we include some learnable weights in this layer instead of a pure normalization.
-If you want to make your own layer, the PyTorch documentation shows the formula you should implement. (The $\gamma_i$ parameters are the learnable weights.)
+If you want to make your own layer, the PyTorch documentation shows the formula you should implement. (The $$\gamma_i$$ parameters are the learnable weights.)
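+
+If you implement your own layer, a minimal sketch along the following lines shows one way to write it and to compare it against the built-in layer. (The sizes below are made up, and `1e-6` stands in for `rms_norm_eps`; adapt both to your model configuration.)
+
+```python
+import torch
+import torch.nn as nn
+
+class MyRMSNorm(nn.Module):
+    """Hand-rolled RMSNorm implementing the formula from the PyTorch docs."""
+    def __init__(self, hidden_size, eps):
+        super().__init__()
+        self.eps = eps
+        self.weight = nn.Parameter(torch.ones(hidden_size))  # the learnable gamma_i
+
+    def forward(self, x):
+        # Divide by the root mean square over the hidden dimension, then rescale.
+        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
+        return self.weight * (x / rms)
+
+# Quick comparison of the two options (made-up sizes).
+hidden_size, eps = 512, 1e-6
+builtin = nn.RMSNorm(hidden_size, eps=eps, elementwise_affine=True)
+own = MyRMSNorm(hidden_size, eps)
+x = torch.randn(2, 7, hidden_size)
+print(torch.allclose(builtin(x), own(x), atol=1e-5))
+```
+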
**Sanity check.**
@@ -76,15 +76,17 @@ Now, let's turn to the tricky part!
The smaller versions of the OLMo 2 model, which we will follow here, use the same implementation of *multi-head attention* as the original Transformer, plus a couple of additional normalizers. (The bigger OLMo 2 models use [grouped-query attention](https://sebastianraschka.com/llms-from-scratch/ch04/04_gqa/) rather than standard MHA; GQA is also used in various Llama, Qwen and some other popular LLMs.)
-The figure below shows what we will have to implement.
+The figure below shows a high-level overview of the pieces we will have to put together. (In the figure, the four *W* blocks are `nn.Linear` layers, and RN denotes RMSNorm.)
+
+
**Hyperparameters:** The hyperparameters you will need to consider when implementing the MHA are
`hidden_size` which defines the input dimensionality as in the MLP and normalizer above, and
-`num_attention_heads` which defines the number of attention heads. **Note** that `hidden_size` has to be evenly divisible by `num_attention_heads`. (Below, we will refer to `hidden_size // num_attention_heads` as the head dimensionality $d_h$.)
+`num_attention_heads` which defines the number of attention heads. **Note** that `hidden_size` has to be evenly divisible by `num_attention_heads`. (Below, we will refer to `hidden_size // num_attention_heads` as the head dimensionality $$d_h$$.)
-**Defining MHA components.** In `__init__`, define the `nn.Linear` components (square matrices) that compute query, key, and value representations, and the final outputs. (They correspond to what we called $W_Q$, $W_K$, $W_V$, and $W_O$ in [the lecture on Transformers](https://www.cse.chalmers.se/~richajo/dat450/lectures/l4/m4_2.pdf).) OLMo 2 also applies layer normalizers after the query and key representations.
+**Defining MHA components.** In `__init__`, define the `nn.Linear` components (square matrices) that compute query, key, and value representations, and the final outputs. (They correspond to what we called $$W_Q$$, $$W_K$$, $$W_V$$, and $$W_O$$ in [the lecture on Transformers](https://www.cse.chalmers.se/~richajo/dat450/lectures/l4/m4_2.pdf).) OLMo 2 also applies layer normalizers after the query and key representations.
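+
+For concreteness, a possible shape of the constructor is sketched below. The attribute names (`q_proj`, `k_proj`, etc.) and the `config` field names are only illustrative, and `bias=False` is an assumption following the no-bias convention mentioned at the beginning; the sketch also applies the query/key normalizers to the full projected vectors, so check the figure for the exact placement.
+
+```python
+import torch.nn as nn
+
+class MultiHeadAttention(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.num_heads = config.num_attention_heads
+        self.head_dim = config.hidden_size // config.num_attention_heads
+
+        # The square projection matrices W_Q, W_K, W_V and W_O.
+        self.q_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
+        self.k_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
+        self.v_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
+        self.o_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
+
+        # Normalizers applied to the query and key representations.
+        self.q_norm = nn.RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.k_norm = nn.RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+```
+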
-**MHA computation, step 1.** The `forward` method takes two inputs `hidden_states` and `rope_rotations`. The latter contains the precomputed rotations required for RoPE; the last step of the transformer explains how to compute them.
+**MHA computation, step 1.** The `forward` method takes two inputs `hidden_states` and `rope_rotations`. The latter contains the precomputed rotations required for RoPE. (The section **The complete Transformer stack** below explains where they come from.)
Continuing to work in `forward`, now compute query, key, and value representations; don't forget the normalizers after the query and key representations.
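+
+Continuing the sketch from above, the first part of `forward` might look as follows (again, the names are illustrative):
+
+```python
+    def forward(self, hidden_states, rope_rotations):
+        # hidden_states: (batch_size, seq_len, hidden_size)
+        queries = self.q_norm(self.q_proj(hidden_states))
+        keys = self.k_norm(self.k_proj(hidden_states))
+        values = self.v_proj(hidden_states)
+        # ... then continue with the remaining steps: reshape into heads,
+        # apply the RoPE rotations, compute the attention, and apply W_O.
+```
+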
@@ -143,6 +145,9 @@ Once again create a MHA layer for testing and apply it to an input tensor of the
### The full Transformer decoder layer
After coding up the multi-head attention, everything else is just a simple assembly of pieces!
+The figure below shows the required components in a single Transformer decoder layer.
+
+
In the constructor `__init__`, create the components in this block, taking the model configuration into account.
As shown in the figure, a Transformer layer should include an attention layer and an MLP, with normalizers. In `forward`, connect the components to each other; remember to put residual connections at the right places.
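+
+A sketch of how the pieces could be assembled is shown below. Exactly where the normalizers go should follow the figure; this sketch assumes the OLMo 2 arrangement, where each sub-block's output is normalized before being added back to the residual stream. The class and attribute names (`SwiGLUMLP`, etc.) are placeholders for the components you built earlier.
+
+```python
+class TransformerDecoderLayer(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.attention = MultiHeadAttention(config)
+        self.mlp = SwiGLUMLP(config)  # the MLP block you implemented earlier
+        self.attn_norm = nn.RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.mlp_norm = nn.RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+    def forward(self, hidden_states, rope_rotations):
+        # Attention sub-block with a residual connection.
+        h_old = hidden_states
+        h_new = self.attn_norm(self.attention(h_old, rope_rotations))
+        h = h_new + h_old
+        # MLP sub-block with a residual connection.
+        out = self.mlp_norm(self.mlp(h)) + h
+        return out
+```
+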
@@ -162,7 +167,7 @@ out = h_new + h_old
### The complete Transformer stack
-Now, set up the complete Transformer stack including embedding, top-level normalizer, and unembedding layers.
+Now, set up the complete Transformer stack, including the embedding, top-level normalizer, and unembedding layers. (You may refer back to the figure shown earlier for the overall structure.)
The embedding and unembedding layers will be identical to what you had in Programming Assignment 1 (except that the unembedding layer should not use bias terms, as mentioned in the beginning).
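+
+A sketch of the top-level module is given below. The `config` field names (`vocab_size`, `num_hidden_layers`) are assumed, and `compute_rope_rotations` is a placeholder for however you precompute the RoPE rotations that are handed down to every layer.
+
+```python
+class TransformerLM(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.embedding = nn.Embedding(config.vocab_size, config.hidden_size)
+        self.layers = nn.ModuleList(
+            [TransformerDecoderLayer(config) for _ in range(config.num_hidden_layers)])
+        self.final_norm = nn.RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.unembedding = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+    def forward(self, token_ids):
+        hidden_states = self.embedding(token_ids)
+        # Precompute the RoPE rotations once and pass them to every layer.
+        rope_rotations = compute_rope_rotations(token_ids.shape[1])  # placeholder helper
+        for layer in self.layers:
+            hidden_states = layer(hidden_states, rope_rotations)
+        # Top-level normalizer followed by the unembedding (no bias terms).
+        return self.unembedding(self.final_norm(hidden_states))
+```
+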
diff --git a/_pages/dat450/fullblock.svg b/_pages/dat450/fullblock.svg
new file mode 100644
index 0000000000000..56f0cdf0593bc
--- /dev/null
+++ b/_pages/dat450/fullblock.svg
@@ -0,0 +1,484 @@
diff --git a/_pages/dat450/mha.svg b/_pages/dat450/mha.svg
new file mode 100644
index 0000000000000..6271a58c61352
--- /dev/null
+++ b/_pages/dat450/mha.svg
@@ -0,0 +1,529 @@
diff --git a/_pages/dat450/swiglu.svg b/_pages/dat450/swiglu.svg
index 063daa349bf81..0157b8464d08f 100644
--- a/_pages/dat450/swiglu.svg
+++ b/_pages/dat450/swiglu.svg
@@ -6,8 +6,8 @@
id="svg2"
xml:space="preserve"
width="314.66666"
- height="400"
- viewBox="0 0 314.66666 400"
+ height="321.33334"
+ viewBox="0 0 314.66666 321.33334"
sodipodi:docname="swiglu.svg"
inkscape:version="1.1.2 (0a00cf5339, 2022-02-04)"
xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
@@ -23,9 +23,9 @@
inkscape:pageopacity="0.0"
inkscape:pagecheckerboard="0"
showgrid="false"
- inkscape:zoom="1.8175"
- inkscape:cx="35.763411"
- inkscape:cy="200.2751"
+ inkscape:zoom="2.2624481"
+ inkscape:cx="59.669878"
+ inkscape:cy="160.44567"
inkscape:window-width="1920"
inkscape:window-height="1025"
inkscape:window-x="0"
@@ -35,366 +35,342 @@
id="g8"
inkscape:groupmode="layer"
inkscape:label="ink_ext_XXXXXX"
- transform="matrix(1.3333333,0,0,-1.3333333,0,400)">
+ id="path234" />