<a href="https://colab.research.google.com/github/Zen-Sherbert/Proteus-Attention/blob/main/TinyPlayground.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to the Proteus Attention Playground!

This notebook was made because I'm bad at explaining things, and it's better to show, not tell.

A little Q&A before we start:

**What is this project?**

A: Proteus is an attention architecture that gives AI a new superpower: **Selective Memory.**

Instead of treating every word in a vast context equally, it learns the concept of **Salience**: identifying what is truly important. It then curates a compact, high-fidelity "highlight reel" of the context, which lets it reason over millions of tokens with a small, fixed-size memory buffer. It's not just a bigger context window; it's a smarter one.

**How is this different from RAG, Mamba, or anything else out there?**

A: Proteus isn't just one new idea; it's a synthesis of three systems that work together to create a new paradigm.

---

### **Sys 1: The CPU or Brain (DNA & Hybrid Routing)**

My philosophy is that tokens aren't just data. They have learnable values and metrics. They are the food of an AI and should be treated with more respect.

*   **The System:** I created a salience-gating system called **DNA**. Think of it as giving the AI "taste." As the model trains, its attention "gates" evolve and mutate, learning to recognize the "flavor" of important concepts. A gate might learn to prefer tokens related to legal language, while another learns to prefer tokens related to Python code.
*   **The Result:** This creates a **Hybrid Router**. A standard router acts as a backup, but the primary driver is the DNA's "instinct." It actively draws in tokens that match its learned preferences. This is the **"what"**: it tells the system what information is worth paying attention to.
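
To make the "taste" idea concrete, here is a minimal sketch of blending a gate's learned preference with a backup router score. It is not the Proteus implementation: `hybrid_scores`, `gate_prototype`, `dna_weight`, and the toy 2-D embeddings are all assumptions made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def hybrid_scores(tokens, gate_prototype, router_scores, dna_weight=0.7):
    """Blend the gate's preference (hypothetical "DNA" affinity) with a router score."""
    blended = []
    for embedding, router_score in zip(tokens, router_scores):
        affinity = cosine(embedding, gate_prototype)
        blended.append(dna_weight * affinity + (1.0 - dna_weight) * router_score)
    return blended

def top_k(scores, k):
    """Indices of the k highest scores, returned in original order."""
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(best)

# Toy embeddings: axis 0 ~ "legal flavor", axis 1 ~ "code flavor" (assumed).
tokens = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3], [0.2, 0.7]]
legal_gate = [1.0, 0.0]          # a gate that has learned to like legal tokens
router = [0.5, 0.5, 0.5, 0.5]    # an undecided backup router

scores = hybrid_scores(tokens, legal_gate, router)
picked = top_k(scores, 2)
print(picked)
```

With an undecided router, the gate's preference decides the outcome, so the two "legal" tokens win. Lowering `dna_weight` hands more of the decision back to the router.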

---

### **Sys 2: The Chassis (Radical Sparsity & The Alpha Slider)**

Why should a model have eight heads? Or sixteen? Why not 20,000? I am not joking.

*   **The System:** Proteus is designed to handle a massive number of specialized attention heads. Because you can't use all 20,000 at once, the system is **inherently sparse**: it only activates a tiny fraction of heads for any given token. This radical sparsity is what allowed me to build a single, custom kernel that is a master of all trades.
*   **The Result:** A single parameter, the **Alpha Slider**, acts as the "gear shift" for the entire architecture.
    *   At `alpha=0.0`, it uses a careful, high-fidelity sparse mode for short contexts.
    *   As you increase `alpha` towards `1.0`, it smoothly transitions into a linear-time "bullet train" for extreme contexts.
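
One way to picture the gear shift is as a cost model: how many keys does each query get to look at? The sketch below is only that, a toy cost model under assumed numbers. `keys_per_query`, `keep_ratio`, and `fixed_budget` are hypothetical names, and the linear blend is a stand-in, not the actual kernel logic.

```python
def keys_per_query(seq_len, alpha, keep_ratio=0.25, fixed_budget=64):
    """Toy model: blend a length-proportional budget with a constant one."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    sparse_budget = max(1, int(seq_len * keep_ratio))   # grows with context
    constant_budget = min(fixed_budget, seq_len)        # does not grow
    budget = (1.0 - alpha) * sparse_budget + alpha * constant_budget
    return max(1, round(budget))

def attention_cost(seq_len, alpha):
    """Total query-key pairs evaluated under the toy model."""
    return seq_len * keys_per_query(seq_len, alpha)

for alpha in (0.0, 0.5, 1.0):
    small = attention_cost(4096, alpha)
    large = attention_cost(8192, alpha)
    print(f"alpha={alpha:.1f}: doubling the context multiplies cost by {large / small:.2f}")
```

In this model, doubling the context quadruples the cost at `alpha=0.0` but only doubles it at `alpha=1.0`, which is the sense in which the far end of the slider is "linear-time."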

---

### **Sys 3: The Data (Chunking & Coherent Recall)**

Ever play Minecraft and watch the world load in, chunk by chunk?

*   **The System:** To handle a context of 100 million tokens, you don't load it all at once. You break it into even chunks, like the squares of a chocolate bar. The system streams these chunks one by one, using its DNA to identify the most salient "highlight" tokens from each. These champions are placed into a small, fixed-size buffer in VRAM.
*   **The Result: An "Impossible" Memory.** This is where the magic happens, and it's natural to be skeptical. You'd think the buffer would be a jumbled, incoherent mess. You'd be wrong for two reasons:
    1.  **DNA is the "What":** The salience system ensures that the *right* information (the recipe, the apples, the pie) makes it into the buffer.
    2.  **RoPE is the "Where":** Rotary Position Embeddings encode each token's position as a rotation, so a retained token keeps its original position index when it moves into the buffer. The system doesn't just know *what* was said; it knows *where* it was said in the original document.
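
The streaming step described above can be sketched in a few lines. This is a simplified stand-in, assuming salience scores are already computed per token. `stream_highlights` is a hypothetical helper, and its parameter names only echo the `chunk_len`, `per_chunk_budget`, and `buffer_tokens` labels printed by the test script further down.

```python
import heapq

def stream_highlights(scores, chunk_len, per_chunk_budget, buffer_tokens):
    """Stream chunks, keep the best tokens in a fixed-size buffer.

    Returns the retained tokens' ORIGINAL positions, in ascending order.
    """
    buffer = []  # min-heap of (score, original_position)
    for start in range(0, len(scores), chunk_len):
        stop = min(start + chunk_len, len(scores))
        chunk = [(scores[i], i) for i in range(start, stop)]
        # Each chunk nominates its own champions.
        for candidate in heapq.nlargest(per_chunk_budget, chunk):
            if len(buffer) < buffer_tokens:
                heapq.heappush(buffer, candidate)
            elif candidate > buffer[0]:
                # Buffer is full: evict the weakest retained token.
                heapq.heapreplace(buffer, candidate)
    return sorted(position for _, position in buffer)

# 32 boring tokens with two "needles" far apart.
scores = [0.1] * 32
scores[3] = 9.0
scores[29] = 5.0

kept = stream_highlights(scores, chunk_len=8, per_chunk_budget=2, buffer_tokens=4)
print(kept)
```

Memory stays bounded by `buffer_tokens` no matter how long `scores` is, and because original positions travel with the tokens, the buffer can always be put back in document order.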

---

### **The Synthesis: Teleportation**

When you combine these three systems, you get something that feels like science fiction.

The system can be at the end of a 10-million-token document and, through its salience-based memory, create a direct, instantaneous informational link to a single, critical sentence from the very beginning.

It's like an **Einstein-Rosen Bridge** for context. It connects two distant points in the sequence, making them appear right next to each other for the final reasoning pass. It is a weird, creepy, and unbelievably powerful mechanic.
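
Here is a tiny sketch of why position survives the trip into the buffer. It uses a single 2-D rotary pair with an assumed frequency `theta`; real RoPE uses many pairs at different frequencies, and nothing here is taken from the Proteus source. The point is only that the rotated dot product depends on the *offset* between two original positions, so tokens that sit next to each other in the buffer still "remember" how far apart they were.

```python
import math

def rope_rotate(vec, position, theta=0.01):
    """Rotate a 2-D vector by position * theta (one rotary pair)."""
    x, y = vec
    angle = position * theta
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return (x * cos_a - y * sin_a, x * sin_a + y * cos_a)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

query = (1.0, 0.5)
key = (0.3, 0.8)

# Same offset of 5 tokens, at the start and deep into the document.
near = dot(rope_rotate(query, 105), rope_rotate(key, 100))
far = dot(rope_rotate(query, 9_000_005), rope_rotate(key, 9_000_000))

# A different offset gives a different score for the same vectors.
other = dot(rope_rotate(query, 9_000_005), rope_rotate(key, 100))

print(abs(near - far) < 1e-6)
```

If retained tokens were re-indexed by their slot in the buffer instead, this offset information would be lost, which is why the sketch keeps the original position index.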

Wild, I know. Now, let's prove it.


In [1]:
# 1. Clone the repo. Probably a no-brainer.
!git clone https://github.com/Zen-Sherbert/Proteus-Attention.git

%cd Proteus-Attention

print("Repository files:")
!ls -F

Cloning into 'Proteus-Attention'...
remote: Enumerating objects: 136, done.
remote: Counting objects: 100% (136/136), done.
remote: Compressing objects: 100% (108/108), done.
remote: Total 136 (delta 43), reused 91 (delta 19), pack-reused 0 (from 0)
Receiving objects: 100% (136/136), 164.82 KiB | 1.28 MiB/s, done.
Resolving deltas: 100% (43/43), done.
/content/Proteus-Attention
Repository files:
examples/  MANIFEST.in	   README.md  src/    TinyPlayground.ipynb
LICENSE    pyproject.toml  scripts/   tests/  tinytoy_run.txt


In [2]:
# Install the proteus_attention package. (It'll kind of run without it, but that's like having no oil in a car.)
!pip install .

Processing /content/Proteus-Attention
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: proteus-attention
  Building wheel for proteus-attention (pyproject.toml) ... done
  Created wheel for proteus-attention: filename=proteus_attention-0.1.0-py3-none-any.whl size=77301 sha256=f373ff72df602fac07cb516c9b8d42d0a5640331b86c731601e15d96a457801c
  Stored in directory: /root/.cache/pip/wheels/e1/38/b5/a9e731a0381ba75a59307d7a0512b8a0ac1e70c603ff48dc4f
Successfully built proteus-attention
Installing collected packages: proteus-attention
Successfully installed proteus-attention-0.1.0


In [4]:
# 4. This is the TinyToy benchmark. Make fun of it, please. I hate it, and it's been the source of constant headaches.
# TinyToy compares a standard dense model against several modes of my system at different context lengths. There are some CLI arguments, but they are a headache right now.
!python src/proteus_attention/kernels/tinytoy.py --enable-flux-chunk

[Config] SDPA fast-path disabled; benchmarking Flux kernels only.
--- Starting Attention Benchmark ---
Device: cuda
Sequence lengths: [128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576]

[Probe] Standard testing seq_len=128 batch=4 runs=3
[Probe] Standard testing seq_len=256 batch=4 runs=3
[Probe] Standard testing seq_len=512 batch=4 runs=3
[Probe] Standard testing seq_len=1024 batch=4 runs=3
[Probe] Standard testing seq_len=2048 batch=4 runs=3
[Probe] Standard testing seq_len=4096 batch=4 runs=3
[Probe] Standard testing seq_len=8192 batch=1 runs=1
[Probe] Standard testing seq_len=16384 batch=1 runs=1
[Probe] Standard testing seq_len=32768 batch=1 runs=1
[Probe] Standard failed at sequence length 32768: CUDA out of memory. Tried to allocate 16.00 GiB. GPU 0 has a total capacity of 14.74 GiB of which 14.44 GiB is free. Process 10366 has 308.00 MiB memory in use. Of the allocated memory 138.13 MiB is allocated by PyTorch, and 43.87 MiB is reserve
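
A quick sanity check on that 16.00 GiB figure: it is exactly what a full `seq_len x seq_len` attention score matrix costs at this length if you assume 16 bytes per position pair, for example 4 heads in fp32. The head count and dtype here are guesses, since the benchmark's config isn't shown above, and `score_matrix_gib` is just a helper for the arithmetic.

```python
def score_matrix_gib(seq_len, batch=1, heads=4, bytes_per_element=4):
    """Memory for a dense attention score matrix, in GiB.

    heads=4 and fp32 (4 bytes) are assumptions, not read from tinytoy.py.
    """
    elements = batch * heads * seq_len * seq_len
    return elements * bytes_per_element / 2**30

for seq_len in (16384, 32768):
    print(f"seq_len={seq_len}: {score_matrix_gib(seq_len):.2f} GiB")
# seq_len=16384: 4.00 GiB
# seq_len=32768: 16.00 GiB
```

Under those assumptions, 16384 tokens fit in the 14.74 GiB card and 32768 do not, which matches where the dense baseline falls over.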

In [None]:
# This will show you a chunked 1-million-token context, with some interesting metrics. It's a neat proof of concept. Takes a second to load.
!python scripts/chunked_flux_smoke.py

Chunked Flux: 100% 16/16 [00:02<00:00,  7.71chunk/s]

=== Chunked Flux Smoke Test ===
device              : cuda
sequence length     : 1000000
retained tokens     : 32768 (3.28%)
chunks processed    : 16
stream latency (ms) : 2076.38
stream throughput   : 481,607.44 tok/s
final latency (ms)  : 112.54
final throughput    : 291,164.72 tok/s
final peak memory   : 1257.5 MB
overall throughput  : 439,936.04 tok/s



In [None]:
# Alright, this is some good stuff. Mechanical needle-in-a-haystack tests, to prove the actual mechanics are sound. They test teleportation across chunks.
# They also check that the buffer isn't just word salad with a very bad drizzle of confusion. Look it over, play with it, punch it in. It's pretty fun to watch it go brrr when you mess with it.
!python scripts/chunked_flux_tests.py

config: seq_len=8,192 chunk_len=1,024 buffer_tokens=8 per_chunk_budget=2 (keep ratio 1.000)
=== Needle recall (high router score) ===
retained sentinel? True | keep_indices=[1047, 2891, 3937, 4357, 4369, 5231, 7174, 7233]

=== DNA teleportation hypothesis ===
without DNA retained? False | with DNA retained? True

=== Ordering sanity (RoPE alignment) ===
keep_indices monotonic? True

=== Needle cluster stress test ===
[cluster] needles kept 1/10 (recall=0.10) with per_chunk_budget=1
[cluster-adapt] needles kept 8/10 (recall=0.80) margin=0.1 max_extra=8

=== Fading signal sweep ===
[fading] sensitivity sweep:
  boost=5.00 -> recall 100.00% (20/20)
  boost=2.00 -> recall 100.00% (20/20)
  boost=1.00 -> recall 100.00% (20/20)
  boost=0.50 -> recall 100.00% (20/20)
  boost=0.25 -> recall 55.00% (11/20)
  boost=0.10 -> recall 0.00% (0/20)

=== Jigsaw puzzle synthesis ===
[jigsaw] both clues retained? True | keep_indices=[512, 1287, 2313, 2641, 4823, 5179, 6712, 7851]

=== Sample text demo ==

In [None]:
# For the love of the machine god, do not try to go to 10 million tokens yet. My auto-shunt system, which pushes excess system RAM into a file on disk, isn't built into this to the degree it needs to be yet.
# You can actually watch the RAM go up as you test this. It's pretty cool.
!python scripts/run_fluxchunk_sweep.py

[FluxChunk] seq_len=1,000,000 buffer=1,000 alpha=1.000 device=cuda
  chunk: 1038.4 ms | 962,974 tok/s
  final: 2917.0 ms | 343 tok/s
  peak VRAM: 113.4 MB
  total: 3971.0 ms | 251,825 tok/s
  retained=1,000 (0.0010) | fallback=False
[FluxChunk] seq_len=2,000,000 buffer=10,000 alpha=1.000 device=cuda
  chunk: 2497.4 ms | 800,828 tok/s
  final: 17.3 ms | 578,983 tok/s
  peak VRAM: 225.7 MB
  total: 2525.4 ms | 791,942 tok/s
  retained=10,000 (0.0050) | fallback=False
[FluxChunk] seq_len=3,000,000 buffer=30,000 alpha=1.000 device=cuda
  chunk: 2860.2 ms | 1,048,866 tok/s
  final: 29.5 ms | 1,018,330 tok/s
  peak VRAM: 372.9 MB
  total: 2909.5 ms | 1,031,101 tok/s
  retained=30,000 (0.0100) | fallback=False
[FluxChunk] seq_len=4,000,000 buffer=80,000 alpha=1.000 device=cuda
  chunk: 4061.8 ms | 984,789 tok/s
  final: 71.3 ms | 1,121,594 tok/s
  peak VRAM: 694.3 MB
  total: 4190.2 ms | 954,599 tok/s
  retained=80,000 (0.0200) | fallback=False
[FluxChunk] seq_len=5,000,000 buffer=100,000 alp