# ▂▂▂▂▂▂▂▂▂▂▂▂

# Initial Decoder Experiments

(Author: Chris McCormick)

I ran Decoder pre-training + fine-tuning experiments at a similar scale to what we've tried with the `SubspaceEncoder`.

For these:
* I did more thorough sweeps of the subspace sizes.
* Instead of implementing a custom class like the `SubspaceEncoder`, I patched the Hugging Face implementation in `transformers.DeepseekV3`.

The notes below document the experiment and results, but also some general project updates and insights.

## S1. General

### 1.1. Performance Improvements


The changes below would probably make sense to apply to the Encoder setup as well.

**torch.compile**

I found that `torch.compile` did give some easy performance gains.

In the JSON configuration files:

```json
  "pre_train": {
    ...
    "bf16": true,
    "fp16": false,
    "torch_compile": true,
    "torch_compile_backend": "inductor",
    "torch_compile_mode": "default"
  },
```

I didn't record the difference, but I think it was in the neighborhood of 15-25% faster.
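
As a rough sketch of how these settings might be consumed: the key names match fields of `transformers.TrainingArguments`, so one option is to pick them out of the `pre_train` section and forward them as keyword arguments. The helper name `perf_kwargs` and the inline config text are hypothetical, and whether `train.py` does it this way is an assumption.

```python
import json

# Hypothetical config text mirroring the "pre_train" section shown above.
config_text = """
{
  "pre_train": {
    "bf16": true,
    "fp16": false,
    "torch_compile": true,
    "torch_compile_backend": "inductor",
    "torch_compile_mode": "default"
  }
}
"""

# Keys that share a name with transformers.TrainingArguments fields.
PERF_KEYS = ("bf16", "fp16", "torch_compile",
             "torch_compile_backend", "torch_compile_mode")


def perf_kwargs(cfg: dict) -> dict:
    """Pick the performance-related settings out of the pre_train section."""
    section = cfg["pre_train"]
    kwargs = {k: section[k] for k in PERF_KEYS if k in section}
    # bf16 and fp16 are alternative mixed-precision modes; enable at most one.
    if kwargs.get("bf16") and kwargs.get("fp16"):
        raise ValueError("Enable only one of bf16 / fp16")
    return kwargs


kwargs = perf_kwargs(json.loads(config_text))
print(kwargs)
# These could then be forwarded, e.g.:
#   TrainingArguments(output_dir="out", **kwargs)
```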



**bf16**

I also added support for bf16, since it's generally recommended when available.


**Example Packing**

The basic training approach can end up training on a large number of padding tokens, especially as the sequence length grows. wikitext103 contains full articles, but many of them are stubs.

For the final training runs at sequence length 1,024, I switched to using concatenated samples. This definitely seemed to help on perplexity and benchmark accuracy--I think the models effectively got to see more (real) tokens.
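
To get a feel for how much padding is at stake, here is a small sketch that estimates the padding fraction when every document is truncated or padded to a fixed length. The document lengths are made up for illustration, and padding every example to the full sequence length is an assumption about the unpacked setup.

```python
def padding_fraction(doc_lengths, max_len):
    """Fraction of tokens that are padding when each doc is cut/padded to max_len."""
    real = sum(min(n, max_len) for n in doc_lengths)
    total = len(doc_lengths) * max_len
    return 1 - real / total


# Hypothetical token counts: a few stubs and one long article.
lengths = [40, 75, 120, 300, 2000]

for max_len in (128, 1024):
    print(max_len, round(padding_fraction(lengths, max_len), 3))
# 128 0.233
# 1024 0.696
```

With these toy numbers, most of each batch at length 1,024 would be padding, which is the waste that packing removes.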

Here's what the code looks like in `subspace_decoder/scripts/train.py`. I let GPT 5 write this and haven't done a fine-grained review, but here's what's generally going on.

First, you do a raw tokenization (no padding, no BOS or EOS):

```python

    block_size = ptrain_cfg["max_seq_length"]
    eos_id = tokenizer.eos_token_id
    
    # 1) Tokenize without truncation/padding
    def tokenize_function(examples):
        # add_special_tokens=False keeps things raw; we’ll insert EOS between docs
        return tokenizer(
            examples["text"],
            add_special_tokens=False,
        )

    # Tokenize
    tokenized = dataset.map(
        tokenize_function,
        batched=True,
        num_proc=8,
        remove_columns=dataset["train"].column_names,  # drop raw "text"
    )

```

Then,

1. Take a batch of samples and concatenate them all, putting EOS tokens in between samples.
2. Break it back apart into samples of length `max_seq_length` (i.e., 1024).
3. Drop the final block so that every sample is full length.

```python

    # 2) Group into contiguous 1024-token blocks (concat + chunk)
    def group_texts(examples):
        # Flatten and insert EOS between documents to avoid cross-article bleed
        input_ids = []
        for ids in examples["input_ids"]:
            if len(ids) > 0:
                input_ids.extend(ids)
            # add an EOS fencepost after every row (this runs even when the row is empty)
            input_ids.append(eos_id)
    
        # Drop the trailing partial block so every example is full length
        total_length = (len(input_ids) // block_size) * block_size
        input_ids = input_ids[:total_length]
    
        # Split into equal blocks
        result_input_ids = [input_ids[i:i + block_size] for i in range(0, total_length, block_size)]
        # Labels are next-token targets; Trainer/model will do the shift
        return {
            "input_ids": result_input_ids,
            "labels": [ids.copy() for ids in result_input_ids],
            # Optional attention masks (all ones because no padding)
            "attention_mask": [[1] * block_size for _ in result_input_ids],
        }
       
    # Concatenate + chunk
    tokenized = tokenized.map(
        group_texts,
        batched=True,
        num_proc=8,
    )
    
    # Use a simple collator; we already created labels and have no pads
    from transformers import default_data_collator
    data_collator = default_data_collator
```
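
To make the concatenate-and-chunk step concrete, here is a toy version of the same logic on tiny token lists, with a block size of 4 instead of 1,024. The function name `pack` and the token ids are hypothetical; it is a simplified stand-in, not the code from `train.py`.

```python
def pack(docs, block_size, eos_id):
    """Concatenate token lists (EOS after each one), then cut into full blocks."""
    stream = []
    for ids in docs:
        stream.extend(ids)
        stream.append(eos_id)
    usable = (len(stream) // block_size) * block_size
    blocks = [stream[i:i + block_size] for i in range(0, usable, block_size)]
    dropped = len(stream) - usable  # tokens in the trailing partial block
    return blocks, dropped


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
blocks, dropped = pack(docs, block_size=4, eos_id=0)
print(blocks)   # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
print(dropped)  # 2
```

Note how the second block spans two documents, separated only by the EOS id, and how the last document (`[10]` plus its EOS) is dropped because it doesn't fill a block.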


### 1.2. Talking to DeepSeek

I reached out to Huazuo Gao at DeepSeek, one of the main contributors to MLA.

He confirmed that they're working on the output projection themselves (to reduce flops), but haven't found an acceptable solution yet.

He connected me with Wangding Zeng, another author at DeepSeek, who's been doing the experiments. I've reached out to him as well to see if he's willing to share any insights.

**Implications**

* That was cool confirmation that this is worth investigating.
* It also suggests we might be able to publish something if we take an easier angle on it (e.g., maybe more efficiency or analysis centered).


### 1.3. Patching vs. Re-Implementing

Initially, I pulled in the code for the `DeepseekV3Attention` class and added the output decomposition, with the intention of patching the entire attention block. This code is still present in `subspace_decoder/layers/deepseek_mla_o.py`.

I learned, though, that you can use a DeepSeekV3 instance as-is, and simply patch the `o_proj` module (it's an `nn.Linear`) with an `nn.Sequential` which performs the two steps.

i.e., you can do something like:

```python
attn.o_proj = nn.Sequential(
    nn.Linear(in_features,  o_latent_dim,  bias=False),  # W^OA
    nn.Linear(o_latent_dim, out_features, bias=bias),    # W^OB
)
```

This means that it's actually pretty trivial to apply the output decomposition to any existing model implementation--you don't need to re-implement the attention block.
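
Here is a self-contained sketch of that patch applied to a toy module. `ToyAttention` and `decompose_o_proj` are hypothetical names, and the sizes (8 heads of 32, hidden size 256, output latent 96) are assumptions for illustration; the real helpers live in `patch_o_proj.py` and may differ.

```python
import torch
import torch.nn as nn


def decompose_o_proj(attn: nn.Module, o_latent_dim: int) -> None:
    """Swap attn.o_proj (an nn.Linear) for a two-step low-rank projection."""
    old = attn.o_proj
    attn.o_proj = nn.Sequential(
        nn.Linear(old.in_features, o_latent_dim, bias=False),                   # W^OA
        nn.Linear(o_latent_dim, old.out_features, bias=old.bias is not None),   # W^OB
    )


class ToyAttention(nn.Module):
    """Stand-in for an attention block; only o_proj matters here."""

    def __init__(self, num_heads=8, head_dim=32, hidden=256):
        super().__init__()
        self.o_proj = nn.Linear(num_heads * head_dim, hidden, bias=False)


attn = ToyAttention()
decompose_o_proj(attn, o_latent_dim=96)

x = torch.randn(2, 5, 8 * 32)   # (batch, seq, num_heads * head_dim)
y = attn.o_proj(x)
print(y.shape)                  # torch.Size([2, 5, 256])
```

For a full model, the same function would presumably be applied to each layer's attention module in a loop. The new layers are freshly initialized, so this only makes sense before pre-training (or before loading matching weights).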

_However_--re-loading from disk is a little uglier. You load the base model from disk, patch `o_proj`, and then you have to manually load the output weights in from the safetensors file.

The "patch o_proj" approach is what I took for these experiments. The functions for doing this are in `subspace_decoder/layers/patch_o_proj.py`.

It got more complex because I experimented with a number of different normalization strategies (discussed more further down).


### 1.4. RoPE and NoPE

For our head size of 32, I had intended to do 16 RoPE dims and 16 NoPE dims, but there's a Hugging Face utility function involved with the rotary embeddings that throws an error.

(I remember I encountered this and resolved it in the custom Encoder implementation. I think it was that the function tries to infer the RoPE head size from d_model // num_heads?)

Instead, I used 32 RoPE dims and 0 NoPE dims. Here's what that means:

1. The **key-value latent** space is only being used for the **value** heads.
2. DeepSeekV3 uses a single key head for RoPE. Because I set NoPE to 0, it means that the model is currently only using this **single key head** (but does still have 8 query heads).
   * So we have 8 query heads, 1 key head, and 8 value heads.
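
The effect of the RoPE / NoPE split on the KV projections can be sketched with a little arithmetic. The parameter names follow the `DeepseekV3Config` fields, and the formulas follow the layout of the Hugging Face MLA layer (`kv_a_proj_with_mqa` and `kv_b_proj`); the value head size of 32 is an assumption, and the KV latent of 64 is taken from the best-performing sizes reported later.

```python
def mla_kv_shapes(num_heads, kv_lora_rank, qk_rope_head_dim,
                  qk_nope_head_dim, v_head_dim):
    """Output widths (per token) of the two KV projections in an MLA layer."""
    return {
        # Down-projection: the shared KV latent plus one shared RoPE key head.
        "kv_a_out": kv_lora_rank + qk_rope_head_dim,
        # Up-projection from the latent: per-head NoPE keys and per-head values.
        "kv_b_out": num_heads * (qk_nope_head_dim + v_head_dim),
        # Total query/key head size used for the attention scores.
        "qk_head_dim": qk_nope_head_dim + qk_rope_head_dim,
    }


current = mla_kv_shapes(8, 64, qk_rope_head_dim=32, qk_nope_head_dim=0, v_head_dim=32)
intended = mla_kv_shapes(8, 64, qk_rope_head_dim=16, qk_nope_head_dim=16, v_head_dim=32)

print(current)   # {'kv_a_out': 96, 'kv_b_out': 256, 'qk_head_dim': 32}
print(intended)  # {'kv_a_out': 80, 'kv_b_out': 384, 'qk_head_dim': 32}
```

With NoPE set to 0, `kv_b_out` is exactly `num_heads * v_head_dim`, i.e. the latent only feeds the value heads, and the only key is the single shared RoPE head.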

I'd like to fix this and go with 16 RoPE dims and 16 NoPE dims. It might be trickier here than in the Encoder, since here we're just patching the existing DeepSeekV3 classes rather than overwriting any of them.

### 1.5. MoE

I had intended to just use all dense layers, just in case MoE training had issues.

Somehow that config change got lost, so the models were instead trained with one initial dense layer followed by MoE layers.

It seemed to work fine, though? It might be good to assess the router health in the final models to make sure.
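
One simple notion of "router health" is how evenly tokens are spread across experts. The sketch below computes per-expert load from a list of routing decisions; the function name, the toy assignments, and the choice of statistic are all assumptions, and collecting the real assignments from the model (e.g., via a hook on the gate) is left out.

```python
from collections import Counter


def expert_load(assignments, num_experts):
    """Per-expert token fractions, a max-load imbalance ratio, and unused experts."""
    counts = Counter(assignments)
    total = len(assignments)
    fractions = [counts.get(e, 0) / total for e in range(num_experts)]
    imbalance = max(fractions) * num_experts  # 1.0 means perfectly balanced
    dead = [e for e, f in enumerate(fractions) if f == 0.0]
    return fractions, imbalance, dead


# Hypothetical top-1 routing decisions for 8 tokens over 4 experts.
assignments = [0, 0, 1, 0, 2, 0, 1, 0]
fractions, imbalance, dead = expert_load(assignments, num_experts=4)

print(fractions)  # [0.625, 0.25, 0.125, 0.0]
print(imbalance)  # 2.5
print(dead)       # [3]
```

For top-k routing, the per-token expert lists could be flattened into one list before calling the function.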

## S2. Benchmarks

### 2.1. Setup & Runtimes

I ran everything on a Lambda instance with a GH200 with 96 GB. Pretty sweet!

Most of my experiments were sweeps over the latent space sizes and done at a sequence length of 128. Those pretraining runs took ~35 minutes each.

I did two final training runs (one for MLA, one for MLA-o) with the best settings and a sequence length of 1024. Those took about 5.5 hours each. Still not too bad!

### 2.2. Metrics

**Accuracy**

* Used perplexity on wikitext103 and accuracy on SST-2.
* For SST-2, I prompted the model with the task, gave a couple examples, and then looked at the logits for the predicted token.
* Somewhat surprisingly, the decoder actually outperformed the encoder on SST-2. My guess would be it's related to the Mixture of Experts?
* Overall, MLA-o underperforms MLA, similar to what we saw with the Encoder.
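
The SST-2 setup described above (task prompt, a couple of examples, then reading the logits) might look roughly like the following. The prompt template, example sentences, token ids, and logits are all hypothetical, and comparing the logits of the two label tokens is one assumed reading of "looked at the logits", not necessarily the exact scoring rule used.

```python
def build_prompt(few_shot, sentence):
    """Build a few-shot sentiment prompt that ends right before the label token."""
    lines = ["Classify each review as positive or negative.", ""]
    for text, label in few_shot:
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    lines += [f"Review: {sentence}", "Sentiment:"]
    return "\n".join(lines)


def predict_label(next_token_logits, label_token_ids):
    """Return the label whose token has the largest next-token logit."""
    return max(label_token_ids,
               key=lambda name: next_token_logits[label_token_ids[name]])


few_shot = [("a charming and often affecting journey", "positive"),
            ("unflinchingly bleak and desperate", "negative")]
prompt = build_prompt(few_shot, "the plot never quite comes together")

# Hypothetical values: in practice the logits come from the model's output at
# the last prompt position, and the ids come from the tokenizer.
label_token_ids = {"positive": 3, "negative": 7}
next_token_logits = [0.0, 0.1, -1.2, 1.5, 0.3, 0.0, -0.4, 2.2]

print(predict_label(next_token_logits, label_token_ids))  # negative
```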

**Throughput**
* Still not seeing a speed benefit from decomposition, even at sequence length 1024.
* Next step might be to use 12 heads instead of 8, since head count and size matter for performance benefits.

**Weights and Biases**

Decoder Pre-Training workspace [here](https://wandb.ai/chrismccormick/decoder-pretrain-wiki103) and SST-2 fine-tuning workspace [here](https://wandb.ai/chrismccormick/decoder-finetune-sst2), but note that they're raw / not organized for presenting.

### 2.3. Results

The rows compare MLA vs. MLA-o, and the columns compare them at a sequence length of 128 vs. 1024.


**Subspace Sizes**

These are the sizes that performed the best.

```
 Query   96
    KV   64
Output   96  (for MLA-o)
```

Some of the larger configurations did worse on perplexity but did do slightly better on SST-2. Since the difference wasn't large and we're going for throughput gains, I stuck with the above sizes.


**Comparing Sequence Lengths**

I included results both at length 128 and at 1024 to see how MLA and MLA-o compare to each other in the short vs. "long" context setting.

However, the length 128 results shouldn't be compared to the length 1024 results, because only the latter used "example packing".

In other words, compare the two rows within each column below; don't compare across the columns.

**Accuracy and Perplexity**



SST-2 Accuracy (higher is better)
```
         1,024                  128
         -----                 -----
MLA-o:   86.24                 85.78           
  MLA:   87.96  +1.72          87.04  +1.26
```


Wikitext103 Perplexity (lower is better)

```
         1,024                  128
         -----                 -----
MLA-o:   29.33                 40.94
  MLA:   28.89   -0.44         39.21   -1.73
```



**Model Size & Training Time**
```
                                   Time to Pre-Train
         Params                   1,024          128         
         ------                  ------          ---
MLA-o:   16.17M                  5h 31m          36m
  MLA:   16.26M   + ~90K         5h 31m          36m
```

I was expecting that MLA-o would start to be faster at the longer sequence length based on some earlier throughput benchmarks I did, but there's almost no speed difference so far.



**Training Curves**

The grad norm is interesting: for MLA-o it was roughly twice the magnitude of MLA's. I wonder what that means?

<img src='https://lh3.googleusercontent.com/d/1vgZAtuFv-kCyeBizMHGu6UyVtSUBRdKh' alt='Training curve comparisons' width='900' />

<img src='https://lh3.googleusercontent.com/d/1EMn613Ufj_YDzC1kkd2x9TdaqfPwr9WG' alt='Comparison of pretraining perplexity and eval throughput' width='900' />

## S3. Normalization Variants

(Some rough notes that GPT captured for me, I'll have to revisit this)

* Tried several normalization strategies:

  1. **Per-head RMSNorm with independent gain weights** → highest performing variant, but slowed things down.
  2. **RMSNorm after summing the heads** → nearly as good, no slowdown.
  3. **Per-head RMSNorm with a shared gain vector** → not as good as (1), also slow.
* For now I’m sticking with the **post-sum RMSNorm** as the best trade-off.
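
As a sketch of the post-sum variant: one way to read "RMSNorm after summing the heads" is a single RMSNorm on the output latent, sitting between the two projections. That placement is an assumption on my part, as are the sizes and the names `RMSNorm` (a minimal local implementation) and `make_o_proj_with_norm`.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Minimal RMSNorm with a learned per-dimension gain."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight


def make_o_proj_with_norm(in_features, o_latent_dim, out_features):
    """Decomposed o_proj with one RMSNorm on the output latent (assumed placement)."""
    return nn.Sequential(
        nn.Linear(in_features, o_latent_dim, bias=False),   # W^OA (sums over heads)
        RMSNorm(o_latent_dim),                              # post-sum norm
        nn.Linear(o_latent_dim, out_features, bias=False),  # W^OB
    )


o_proj = make_o_proj_with_norm(8 * 32, 96, 256)
x = torch.randn(2, 5, 8 * 32)

latent = o_proj[1](o_proj[0](x))
print(latent.pow(2).mean(-1).sqrt())  # approximately 1.0 at every position (at init)
print(o_proj(x).shape)                # torch.Size([2, 5, 256])
```

Because the norm is a single elementwise op on a small latent, it is plausible that it adds little overhead, whereas a per-head variant would need to split the first projection by head.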

## S4. Next Steps

**Throughput Experiments**

I'd like to try scaling the model further to find the point where MLA-o is faster than MLA.

We don't need to fully pre-train a model to do this, though.

* For **inference** throughput, we could just leave the model randomly initialized and send text through it.
* For **training** throughput, we could train for just a small number of steps (e.g., 1000?).
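
A throughput measurement along these lines only needs a small timing harness. In this sketch, `tokens_per_second` and `dummy_step` are hypothetical names; `step_fn` stands in for one forward pass (or one training step) on a fixed-size batch.

```python
import time


def tokens_per_second(step_fn, batch_size, seq_len, warmup=3, iters=10):
    """Time repeated calls to step_fn and convert to tokens per second."""
    # Warm-up iterations absorb one-time costs (e.g., torch.compile tracing).
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    elapsed = time.perf_counter() - start
    return batch_size * seq_len * iters / elapsed


def dummy_step():
    """Placeholder workload; replace with a model forward or training step."""
    sum(i * i for i in range(10_000))


tps = tokens_per_second(dummy_step, batch_size=8, seq_len=1024)
print(f"{tps:,.0f} tokens/sec")
```

On a GPU, `step_fn` should end with `torch.cuda.synchronize()`, since CUDA kernels launch asynchronously and the timer would otherwise stop before the work finishes.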

Increasing the number of heads and their size is the next thing I'd try.


**Analyzing / Evaluating the New Models**

(1) We could try testing these two new long-sequence models on more benchmarks beyond SST-2. 1,024 tokens is enough for many of them. For example, we could add IMDB movie reviews.

(2) We could try analyzing / characterizing their behavior.
* We now have two models trained identically, one with the output latent and one without. Maybe there are some interesting ways to compare them?
* The normalization strategy for the output is an interesting challenge--it could be interesting to compare magnitudes of the attention output between the two. Is MLA-o producing larger outputs?
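
One low-effort way to compare output magnitudes is a forward hook on each model's `o_proj` that records the RMS of its output. The sketch below uses toy stand-ins for the two variants (randomly initialized, assumed sizes), and the helper name `record_output_rms` is hypothetical; with the real models, the hooks would go on each layer's `o_proj`.

```python
import torch
import torch.nn as nn


def record_output_rms(module, store, name):
    """Register a forward hook that stores the RMS of the module's output."""
    def hook(_module, _inputs, output):
        store[name] = output.detach().float().pow(2).mean().sqrt().item()
    return module.register_forward_hook(hook)


# Toy stand-ins for the two o_proj variants (assumed sizes).
mla = nn.Linear(256, 256, bias=False)
mla_o = nn.Sequential(nn.Linear(256, 96, bias=False),
                      nn.Linear(96, 256, bias=False))

stats = {}
handles = [record_output_rms(mla, stats, "MLA"),
           record_output_rms(mla_o, stats, "MLA-o")]

x = torch.randn(4, 16, 256)
with torch.no_grad():
    mla(x)
    mla_o(x)

for h in handles:
    h.remove()  # detach the hooks once the statistics are collected

print(stats)
```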




**Pre-Training Experiments**

Here are some thoughts on potential issues to resolve and new things to try.

(1) Address the **RoPE issue** so we can run with split dims (16 RoPE / 16 non-RoPE) as originally intended.
* I think the options to weigh are:
    1. Continue trying to surgically patch the DeepseekV3 code.
    2. Do a more heavy-handed patch by overwriting the Attention class entirely.
    3. Make a Decoder variant of the SubspaceEncoder.
    
(2) Review training curves.
* What's going on with the high grad norm for MLA-o? Is that a problem?
* Does the data tell us whether we could have stopped earlier, or whether it might be beneficial to train on more tokens?

(3) Compare impact of Output subspace to impact of Query subspace
* To what degree does adding a Query subspace hurt performance?
   * It would be interesting to see if the loss in accuracy and change in throughput are similar or not.
* I noticed that in DeepSeek-V2-Lite, they don't use a latent space for the Query heads. Perhaps these decompositions are more harmful / less beneficial at smaller scales?

(4) Review example packing.
* This is a tricky topic that I'd like to understand better.
* It sounds great (and seemed to work!), but how it affects the model is still a little unclear to me.
* Some more specific questions for our case:
    * Is it ok to drop trailing blocks? How many tokens are being lost?
    * Do we need to define a separate padding token?
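
On the dropped-token question, one detail worth keeping in mind is that `datasets.map(batched=True)` calls the function once per batch of rows (1,000 rows by default), so a trailing partial block is dropped per batch, not once for the whole dataset. The sketch below estimates the loss from row lengths alone; the row lengths are made up, and `dropped_tokens` is a hypothetical helper.

```python
def dropped_tokens(row_lengths, block_size, batch_size=1000):
    """Count tokens lost when each map batch drops its trailing partial block."""
    dropped = 0
    total = 0
    for start in range(0, len(row_lengths), batch_size):
        batch = row_lengths[start:start + batch_size]
        # One extra token per row for the EOS appended after it.
        stream_len = sum(batch) + len(batch)
        total += stream_len
        dropped += stream_len % block_size
    return dropped, total


# Hypothetical dataset: 2,500 rows of 100 tokens each.
row_lengths = [100] * 2500
dropped, total = dropped_tokens(row_lengths, block_size=1024)

print(dropped, total)                # 1620 252500
print(f"{dropped / total:.2%} lost")  # 0.64% lost
```

Running this on the real tokenized lengths would give the actual number; the loss per batch is always less than one block.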



# ▂▂▂▂▂▂▂▂▂▂▂▂

# Appendix


## Workflow

Some lessons learned for me on the tooling and methodology side.



**Cursor**

I finally took the plunge and signed up for Cursor.

_Tab completions_

My favorite part so far is the tab completions.

I've appreciated the completions in Google Colab, but they're limited to continuing what you're currently writing. You have to be at the end of a line, and it will suggest what to write next.

Cursor broadens this to making suggestions for the surrounding code, and will also make suggestions for the entire file that you can quickly tab through and accept. It's incredibly helpful.

_Agent mode_

I've used OpenAI's Codex. What I appreciate in Cursor is how much easier it is to modify and accept the changes.

I've heard great things about Claude 4, but so far have felt disappointed.

I still find that I get the most intelligent / useful results by copying code into a GPT 5 conversation. Maybe it's because I do the work of identifying the correct context to give it, whereas the agent is filling up the context trying to make sense of the codebase on its own.

Of course, copying the code back out from a conversation and merging it in is such an obnoxious process.



**Lambda**

I can probably improve my workflow significantly--I was doing a lot of copy-pasting between Cursor, GPT, and the Lambda instance.

I bet there's a way to run the Jupyter Notebook in Cursor, and have it hooked in to a Python kernel running on the Lambda instance.

That still leaves file management. Can the code maybe live on the Lambda instance, but still be editable in Cursor?

Also, getting a good diff tool would be big. (Done! It's already there--right click the files and use the "select for compare" options).



**Google Drive**

I wanted to continue using Google Drive for storing checkpoints. Getting that working was a pain in the butt, but I got there.


**Git**

I'm still a noob at Git. I've really only ever done a simple `pull` / `commit` / `push` workflow.

For the experimenting that I was doing, creating a branch to work in, and then merging that branch back (after cleaning up / erasing the commit history) would be better.

**Weights and Biases**

wandb seems like it could be so much more useful if I understood it better.

Some things I want to figure out:

(1) Tables!
* These are what I really want for looking at results...
    * I went and dug around--I think a good initial answer is the "runs" tab. It's pretty much the table view I'm looking for.
    * Still don't know why it's so hard to define custom tables to include as panels, but oh well.

(2) Customizing the Dashboard
* This ought to be simple--creating a section that has just the panels I'm most interested in.
    * Maybe it is, but I feel like I've had trouble with it before.
* Even simple zooming is something I haven't figured out yet!

(3) Organizing Runs
* I always end up with a lot of runs, and have to sift through them to show and hide different ones to make comparisons.
* I think the solution may be to set up the "runs" view properly (to show just the columns I'm interested in), and then use it to handle selecting what I want to show or hide.
* Being able to backfill values would be really helpful--is that possible?
    * Doing it from code might be a workaround, or by using tags.

(4) Auto-delete cancelled runs
* While debugging or solving OOM issues, I'll often end up with a bunch of spurious runs in the project. I can probably put something in the code which will automatically delete the run under certain conditions.



# ▂▂▂▂▂▂▂▂▂▂▂▂