# Proteus Attention Walkthrough

This is a concise tour of Adaptive Sparse Proto Attention (ASPA). Each section runs a single command, prints its telemetry directly in the notebook, and stops. The tests take a while to run, so either run them and wait, or rely on the recorded output shown here. Make sure a T4 or better GPU is enabled.

Sections
0. Environment & prerequisites
1. Tiny Shakespeare duel (dense vs ASPA)
2. Context-length sweep (`tinytoy.py`)
3. Chunked shortlist stream

## 0. Environment & prerequisites
Run this to clone the repo and install the package. Safe to rerun; it becomes a no-op after the first install.

In [1]:
%%bash
set -euo pipefail
if [ ! -d Proteus-Attention ]; then
  git clone https://github.com/Zen-Sherbert/Proteus-Attention.git
fi
cd Proteus-Attention
pip install -q -e .


Cloning into 'Proteus-Attention'...


## 1. Tiny Shakespeare duel
Trains both the ASPA stack and the dense baseline using the default Tiny Shakespeare config. All metrics, router stats, and sample generations show up right below the cell. This takes a while, so run it when you have time to spare.

In [2]:
%%bash
set -euo pipefail
cd Proteus-Attention
python src/proteus_attention/examples/aspa_train.py   --epochs 2   --batch_size 1   --block_size 2048   --d_model 512   --n_layer 4   --lr 3e-4   --target_density 0.28   --density_tol 0.05   --min_active_heads 2   --max_active_heads 4   --token_sparse   --token_keep_ratio 0.85   --token_keep_min 8   --token_keep_guard 8   --gen_tokens 80

Process is terminated.


## 2. Context-length sweep
Runs the tinytoy sweep, which reports stats for the Shortlist kernels directly against standard attention.

Take the numbers with a grain of salt, as the sweep was mainly meant to showcase the structure of the architecture.

Even so, there is plenty to look at on the structural side. Alpha is our control knob for sparsity and computational complexity: pushing alpha higher shifts the cost from quadratic toward linear.

This test takes ~2-3 minutes. Shortlist chunking is currently bork'd in tinytoy due to a refactor. Repairs are on the way.


In [8]:
%%bash
set -euo pipefail
cd Proteus-Attention
python src/proteus_attention/kernels/tinytoy.py


[Config] SDPA fast-path disabled; benchmarking Shortlist kernels only.
--- Starting Attention Benchmark ---
Device: cuda
Sequence lengths: [128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576]

[Probe] Standard testing seq_len=128 batch=4 runs=3
[Probe] Standard testing seq_len=256 batch=4 runs=3
[Probe] Standard testing seq_len=512 batch=4 runs=3
[Probe] Standard testing seq_len=1024 batch=4 runs=3
[Probe] Standard testing seq_len=2048 batch=4 runs=3
[Probe] Standard testing seq_len=4096 batch=4 runs=3
[Probe] Standard testing seq_len=8192 batch=1 runs=1
[Probe] Standard testing seq_len=16384 batch=1 runs=1
[Probe] Standard testing seq_len=32768 batch=1 runs=1
[Probe] Standard failed at sequence length 32768: CUDA out of memory. Tried to allocate 16.00 GiB. GPU 0 has a total capacity of 14.74 GiB of which 14.44 GiB is free. Process 102090 has 308.00 MiB memory in use. Of the allocated memory 138.13 MiB is allocated by PyTorch, and 43.87 MiB is r

## 3. Chunked shortlist stream
Streams a million-token synthetic prompt through the shortlist pipeline. Diagnostics (chunk stats, Top-A agreement, backend info) appear inline.

I believe there is an error in Top-A, which is being worked on.

The main things to look at here are that chunking works and that it can identify storage limits. If overflow happens, it should shunt the excess to disk storage to prevent collapse. This is still highly experimental, and the quality of anything actually used from the chunks has yet to be tested.

The mechanics are sound for the most part, but until I test the true, live system, it is primarily speculation.

In [5]:
%%bash
set -euo pipefail
cd Proteus-Attention
python scripts/chunked_shortlist.py   --seq-len 1048576   --d-model 512   --chunk-len 65536   --buffer-tokens 32768   --per-chunk-budget 4096   --chunk-sparse-ratio 0.05   --final-sparse-ratio 0.5   --shortlist-alpha 1.0   --nucleus-top-p 0.9   --heads 8   --device cpu   --report-latency

2025-11-09 02:34:50,615 | INFO | proteus_attention.tools.chunked_shortlist | Host staging: cpu (limit=10,272,145,408B, required=2,147,483,648B)
2025-11-09 02:37:22,320 | INFO | proteus_attention.tools.chunked_shortlist | Chunked Shortlist retained 32768/1048576 tokens (0.031) across 16 chunks
2025-11-09 02:37:22,517 | INFO | root | Chunked Shortlist complete: retained 32768/1048576 tokens (3.12%) on cpu
2025-11-09 02:37:22,517 | INFO | root | Chunk streaming latency: 134209.5 ms
2025-11-09 02:37:22,517 | INFO | root | Chunk throughput: 7812.98 tok/s
2025-11-09 02:37:22,517 | INFO | root | Final pass latency: 8368.0 ms
2025-11-09 02:37:22,517 | INFO | root | Final pass throughput: 3915.88 tok/s
2025-11-09 02:37:22,517 | INFO | root | Overall throughput: 7349.89 tok/s
2025-11-09 02:37:22,517 | INFO | root | Storage mode: cpu
2025-11-09 02:37:22,517 | INFO | root | Storage reasoning: limit=10,272,145,408B, required=2,147,483,648B
2025-11-09 02:37:22,517 | INFO | root | Host requirement: 2
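
Most of the figures in this log can be cross-checked from the command-line arguments. The sketch below recomputes them. The 4 bytes per value (float32) is my assumption, made only because it reproduces the logged `required=` value; I have not confirmed it against the script.

```python
# Values copied from the command and log above.
seq_len = 1_048_576
d_model = 512
chunk_len = 65_536
buffer_tokens = 32_768
chunk_latency_ms = 134_209.5
final_latency_ms = 8_368.0

num_chunks = seq_len // chunk_len                 # 16
retained_fraction = buffer_tokens / seq_len       # 0.03125
chunk_tok_per_s = seq_len / (chunk_latency_ms / 1000)
final_tok_per_s = buffer_tokens / (final_latency_ms / 1000)

# Assumption: the host staging buffer holds seq_len x d_model float32 values.
bytes_per_value = 4
required_bytes = seq_len * d_model * bytes_per_value

print(num_chunks, f"{retained_fraction:.2%}")
print(f"{chunk_tok_per_s:.2f} tok/s, {final_tok_per_s:.2f} tok/s")
print(f"{required_bytes:,} B")
```

The chunk count, retained fraction, and both throughputs agree with the log to within rounding of the reported latencies. The overall throughput is left out because the log does not say which overheads it includes.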

# Breakdown

There is a lot going on here, and it is difficult to showcase some features without explanation, so I will do my best to describe what is happening.

There are some fundamental differences between what is done here and what is currently done in the field.

None of these concepts are new in any way. I have not discovered some novel system no one has thought of, and I reject any claim that I have.

What I am claiming is a novel architecture.

To help with the headache, I will describe some of the pieces here as a sort of glossary.

**Adaptive Sparse Proto Attention (ASPA)**

*Prototypes*

This is a confusing and clinical term, so I will describe what prototypes do in this system.

Each token is essentially fingerprinted by its semantic values, and that fingerprint is folded into a prototype. The prototype becomes a centroid: an average of all the tokens it has seen. More recent tokens are weighted more heavily than older ones, which makes it a moving average. When a token lands close to a prototype's value, the prototype acts as a lure for that token.
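
To make the moving-average idea concrete, here is a minimal NumPy sketch. It is an illustration rather than the repo's implementation: `update_prototype`, `decay`, and the synthetic `concept` vector are hypothetical names and values chosen for the example.

```python
import numpy as np

def update_prototype(prototype, token_vec, decay=0.9):
    """Exponential moving average: recent tokens count more than older ones."""
    return decay * prototype + (1.0 - decay) * token_vec

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

rng = np.random.default_rng(0)
concept = rng.normal(size=16)      # hypothetical direction standing in for one concept
prototype = np.zeros(16)

# Feed noisy tokens drawn around the concept; the prototype drifts toward it.
for _ in range(50):
    token = concept + 0.1 * rng.normal(size=16)
    prototype = update_prototype(prototype, token)

near = concept + 0.1 * rng.normal(size=16)   # a new token close to the concept
far = -concept                               # a token pointing the opposite way
print(cosine(prototype, near) > cosine(prototype, far))
```

A token near the concept scores much higher against the prototype than an unrelated one, which is the "lure" behaviour described above.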

*Decoupled Gating*

This was more of an accidental architectural choice than a strategic calculation, but it turned out to be pretty awesome once you look into it.

Decoupling the gates alone wasn't the awesome part. It became interesting when more than one gate was assigned to each head. That creates a buffer between the router and the heads, which lets you do some weird things. This is where my sparsity gets a little strange, as it doesn't rely on fixed patterns.

The gates hold the prototypes, and each head has more than one gate assigned to it, which gives a finer-grained effect on how heads take tokens.

Another interesting consequence is that the router no longer cares what is behind the gates. It only cares about the gates themselves. This means you can dynamically clone, prune, merge, and so on.

Anything behind these gates can be changed. When applied to the MLP, it gets even better.
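
The buffer between router and heads can be sketched as a simple lookup table. This is my own simplified reading, not code from the repo: `gate_prototypes`, `gate_to_head`, and `route` are hypothetical names, and the dot-product scoring is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_gates = 16, 8

gate_prototypes = rng.normal(size=(num_gates, d_model))   # one prototype per gate
gate_to_head = np.array([0, 0, 1, 1, 2, 2, 3, 3])         # many gates -> one head

def route(token_vec, top_k=2):
    """Score gates only; the router never looks at what sits behind them."""
    scores = gate_prototypes @ token_vec
    top_gates = np.argsort(scores)[-top_k:]
    active_heads = sorted({int(h) for h in gate_to_head[top_gates]})
    return top_gates, active_heads

token = rng.normal(size=d_model)
gates_before, heads_before = route(token)

# Change what sits behind the gates (here: merge head 3 into head 2).
gate_to_head[gate_to_head == 3] = 2
gates_after, heads_after = route(token)

# The router's gate choice is unchanged; only the gate -> head lookup moved.
print(np.array_equal(gates_before, gates_after))
```

Because the router's output is a set of gates, merging or swapping heads only touches the lookup table, not the routing decision.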

*Dynamic Heads*

I touched on this in the decoupled gating section. There isn't a focus on it yet, though the system naturally supports it.

You can choose the number of heads you want. Say you want 64: assign that and it will work just fine. Increase that to 2000 heads and it will run that too, not all at once but in an intelligent sparse mode. So you can have a really talented attention layer. That is not to say it comes without issues, because it raises the question of intelligent assignment: how do we know it is splitting tokens evenly across the different heads?

One interesting thing is that we can in fact add more heads later on. Since gates are assigned to heads in a many-to-one ratio, we can dynamically add more heads as long as there are gates to support them.
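
One way to picture "adding a head as long as there are gates to support it" is to hand an existing gate to the new head. The donor policy below (take a gate from whichever head owns the most) is purely my assumption for illustration, and `add_head` is a hypothetical helper, not a function from the repo.

```python
from collections import Counter

def add_head(gate_to_head, num_heads):
    """Create a new head by taking one gate from the head that owns the most gates.

    Returns the updated mapping and head count. Raises if no head can spare a gate.
    """
    counts = Counter(gate_to_head)
    donor, donor_gates = counts.most_common(1)[0]
    if donor_gates < 2:
        raise ValueError("no spare gates: every head owns exactly one gate")
    new_head = num_heads
    updated = list(gate_to_head)
    # Reassign the donor's last gate to the new head.
    last_idx = len(updated) - 1 - updated[::-1].index(donor)
    updated[last_idx] = new_head
    return updated, num_heads + 1

mapping = [0, 0, 0, 1, 1, 2]          # 6 gates shared by 3 heads
mapping, n_heads = add_head(mapping, 3)
print(mapping, n_heads)               # [0, 0, 3, 1, 1, 2] 4
```

The error branch is the "as long as there are gates to support them" condition: once every head is down to a single gate, no more heads can be added without adding gates first.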

*Double Sparsity*

By leaning on the prototypes, we can build a salience-based token pruning system. We can identify the tokens the model considers important according to its learned prototypes, and when that is coupled with head sparsity, we can push speed and long context while keeping the memory footprint low.

This isn't to say that it is a perfect system. It can easily miss tokens if not properly trained, and context quality can suffer because of this. So there are still many areas that can be improved or changed entirely.
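
The token half of the double sparsity can be sketched as follows. The parameter names echo the `--token_keep_ratio` and `--token_keep_min` flags from the training command, but how the real code interprets them is an assumption on my part, and `prune_tokens` is a hypothetical helper.

```python
import numpy as np

def prune_tokens(tokens, prototypes, keep_ratio=0.85, keep_min=8):
    """Keep the tokens whose best prototype match is strongest.

    tokens: (seq_len, d_model), prototypes: (num_protos, d_model).
    Returns sorted indices of kept tokens so the original order is preserved.
    """
    t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    salience = (t @ p.T).max(axis=1)            # best cosine match per token
    n_keep = min(len(tokens), max(keep_min, int(round(keep_ratio * len(tokens)))))
    kept = np.argsort(salience)[-n_keep:]
    return np.sort(kept)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(40, 16))
prototypes = rng.normal(size=(4, 16))
kept = prune_tokens(tokens, prototypes)
print(len(kept))                                # 34
```

The failure mode mentioned above shows up directly here: a token that matters but sits far from every prototype gets a low salience score and is dropped.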

*RoPE + Prototypes*

A unique synergy that has been observed is the interaction between RoPE for positional data and prototyping for salience data. If the model saw a concept much earlier in a document, it can bridge the two occurrences across a large distance as if they sat in the same area. This is powerful for long context, but it's not a catch-all.

*Top-A*

I didn't know what to name this, so if the name is fundamentally wrong, please say so.

Top-A is an agreement filter, used in conjunction with Top-P. This creates a double-filter effect where Top-P chooses how many tokens to keep, and Top-A uses cosine similarity between heads to decide which tokens to actually use. Theoretically it is a dynamic correction for uncertainty, but it is still dependent on training and quality. So the most I can say is that it's an optional experimental filter.
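
Here is one possible reading of that double filter, written as a sketch. The way agreement is measured (mean pairwise cosine similarity between the heads' vectors for each token) is my assumption, and `top_p_count`, `head_agreement`, and `top_a_select` are hypothetical names rather than the repo's API.

```python
import numpy as np

def top_p_count(probs, top_p=0.9):
    """Smallest number of tokens whose sorted probabilities sum to >= top_p."""
    sorted_probs = np.sort(probs)[::-1]
    count = int(np.searchsorted(np.cumsum(sorted_probs), top_p)) + 1
    return min(count, len(probs))

def head_agreement(head_vecs):
    """Mean pairwise cosine similarity between heads, per token.

    head_vecs: (num_heads, seq_len, d). Returns an array of shape (seq_len,).
    """
    unit = head_vecs / (np.linalg.norm(head_vecs, axis=-1, keepdims=True) + 1e-8)
    sims = np.einsum("hsd,gsd->shg", unit, unit)          # (seq_len, H, H)
    h = head_vecs.shape[0]
    off_diag = sims.sum(axis=(1, 2)) - np.trace(sims, axis1=1, axis2=2)
    return off_diag / (h * (h - 1))

def top_a_select(probs, head_vecs, top_p=0.9):
    """Top-P decides how many tokens to keep; head agreement decides which ones."""
    k = top_p_count(probs, top_p)
    agreement = head_agreement(head_vecs)
    return np.sort(np.argsort(agreement)[-k:])

rng = np.random.default_rng(0)
probs = np.array([0.4, 0.3, 0.2, 0.05, 0.05])
base = rng.normal(size=(5, 4))
head_vecs = np.stack([base, base.copy()])     # two heads that start out identical
head_vecs[1, [1, 3]] *= -1                    # heads disagree on tokens 1 and 3
selected = top_a_select(probs, head_vecs, top_p=0.8)
print(selected)                               # [0 2 4]
```

Token 1 has the second-highest probability but is dropped because the heads disagree on it, which is the kind of uncertainty correction described above.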

*KV Banking*

This is another concept whose naming I wasn't sure about. It's similar to a KV cache, but more robust in its use and ability.

The KV Bank keeps several caches, usually associated with specific heads. Each acts as a shortlist of concepts the model has seen and knows. It is dynamic in the sense that its contents are not read-only: if it finds a better representation, it replaces the old one. If a KV cache is a notebook, a KV Bank is closer to sticky notes.  
In short: a KV cache with fewer steps, more selective memory, and per-head storage.
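
The sticky-note behaviour can be sketched as a small fixed-size store per head. This is a toy under my own assumptions: the `KVBank` class, its `offer` method, and the use of a scalar `score` to decide what counts as a "better representation" are all hypothetical.

```python
import numpy as np

class KVBank:
    """Fixed-size, per-head store whose entries can be overwritten (hypothetical sketch)."""

    def __init__(self, num_heads, slots_per_head):
        self.slots_per_head = slots_per_head
        self.banks = [[] for _ in range(num_heads)]   # each entry: (score, key, value)

    def offer(self, head, key, value, score):
        """Store the entry if there is room or if it beats the weakest stored one."""
        bank = self.banks[head]
        if len(bank) < self.slots_per_head:
            bank.append((score, key, value))
            return True
        weakest = min(range(len(bank)), key=lambda i: bank[i][0])
        if score > bank[weakest][0]:
            bank[weakest] = (score, key, value)       # replace the old sticky note
            return True
        return False

bank = KVBank(num_heads=2, slots_per_head=2)
k = v = np.zeros(4)
bank.offer(0, k, v, score=0.2)
bank.offer(0, k, v, score=0.5)
replaced = bank.offer(0, k, v, score=0.9)   # True: replaces the 0.2 entry
rejected = bank.offer(0, k, v, score=0.1)   # False: weaker than everything stored
print(replaced, rejected, sorted(s for s, _, _ in bank.banks[0]))
```

Unlike an append-only KV cache, the bank's size stays constant per head, and what it holds changes as better entries arrive.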

*Shortlist Chunking*

This is what we use to extend our reach to massive distances.

Instead of processing an entire 10M-token document at once, we chunk it into sections and feed them to the model, which can then use some of its unique features to piece things together. It's not 100% effective and it won't be able to summarize entire sections, but for precise information it should work well.

As the model looks over each chunk, it collects data relating to the prompt. Say the user asks about the best way to cook chicken caprese, and the context is a mass of jumbled recipes, receipts, random books, automotive manuals, and dog food ingredients. The model would search each chunk and find the chicken caprese recipe, along with other bits of information relating to the concept, store them in the buffer, and send that back to the model.

You would think the result would be a jumbled mess of random excerpts, but it carries contextual meaning, semantic meaning, positional data, and so on that we as humans wouldn't pick up on. So the model can generally answer the question.

This is not better than anything else out there by any means; it is entirely untested outside of my own experiments. What it aims to be is a good alternative for extreme long context that a standalone model can perform on its own.
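
The chunk-then-shortlist flow can be sketched in a few lines. This is a toy stand-in, not the logic of `scripts/chunked_shortlist.py`: the dot-product relevance score and the `chunked_shortlist` function are assumptions, and only the argument names mirror the script's flags.

```python
import numpy as np

def chunked_shortlist(tokens, query, chunk_len, per_chunk_budget, buffer_tokens):
    """Stream `tokens` in chunks and keep only the positions most similar to `query`.

    Returns the kept positions in original order.
    """
    kept_pos, kept_score = [], []
    for start in range(0, len(tokens), chunk_len):
        chunk = tokens[start:start + chunk_len]
        scores = chunk @ query                        # relevance of each token to the prompt
        budget = min(per_chunk_budget, len(chunk))
        top = np.argsort(scores)[-budget:]
        kept_pos.extend((start + top).tolist())
        kept_score.extend(scores[top].tolist())
    kept_pos, kept_score = np.array(kept_pos), np.array(kept_score)
    # Final cap: if the shortlist outgrew the buffer, keep the strongest entries.
    if len(kept_pos) > buffer_tokens:
        best = np.argsort(kept_score)[-buffer_tokens:]
        kept_pos = kept_pos[best]
    return np.sort(kept_pos)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(1024, 16))
query = rng.normal(size=16)
tokens[700] = 5.0 * query                             # plant one highly relevant token
kept = chunked_shortlist(tokens, query, chunk_len=256,
                         per_chunk_budget=32, buffer_tokens=64)
print(len(kept), 700 in kept)                         # 64 True
```

Only one chunk is scored at a time, and the planted token at position 700 survives both the per-chunk budget and the final buffer cap. In the recorded run, 16 chunks with a per-chunk budget of 4096 would also exceed the 32768-token buffer, which is consistent with the log reporting exactly 32768 retained tokens.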

**Conclusion**

ASPA is a highly ambitious project aimed at making attention smarter and better at handling long context.

It is not a drop-in replacement for attention, and it still has a huge distance to cover.

There are trade-offs, massive ones, that come with this architecture. They can be engineered around, but not out of entirely. The overhead and complexity are a massive issue, and I'm doing my best to rein it all in and create per-system adaptation at a cheap upfront computational cost.

The net positives arguably can outweigh the negatives, especially at longer context.

Thank you for visiting and giving this a read.
If you have corrections, or if you think there are fundamental issues with the project that should be addressed, I would appreciate it if you could bring them up so I can address and fix them.

If there are bold assumptions in this and they make you upset, please bring them up and I will either explain why it is said that way or fix it.