# ♔♛♞ ZIP-RC: Zero-Overhead Introspection for Adaptive Compute ♝♙♜
### Running Qwen3-4B-Instruct-2507 with Self-Correction

This notebook demonstrates **Zero-Overhead Introspection (ZIP-RC)**, a technique from the paper *"Zero-Overhead Introspection for Adaptive Test-Time Compute"* (Manvi et al., 2025).

Standard reasoning models (like OpenAI's o1 or DeepSeek-R1) often rely on heavy "test-time compute": generating many solutions and grading them with a separate Reward Model. This is effective but expensive and slow.

**ZIP-RC takes a different approach.** By repurposing unused space in the model's vocabulary, this modified **Qwen3-4B** model predicts the next token **and** its own probability of success (Expected Reward) in the same forward pass, with **zero additional computational overhead** and no separate reward model.


- **Model:** `dataopsnick/Qwen3-4B-Instruct-2507-zip-rc`
- **Method:** The LM Head is fine-tuned to predict a joint distribution of *Reward* (Correctness) and *Cost* (Remaining Length) at every step.


In [8]:
import torch
import sys
import os
from huggingface_hub import hf_hub_download

# 1. Download the helper script dynamically
script_path = hf_hub_download(repo_id="dataopsnick/Qwen3-4B-Instruct-2507-zip-rc", filename="ziprc.py")
sys.path.append(os.path.dirname(script_path))

# 2. Import the downloaded module
import ziprc

# 3. Run Inference
model = ziprc.ZIPRCModel(ziprc.ZIPRCConfig())
sampler = ziprc.ZIPRCSampler(model)

prompt = "Solve the following logic puzzle: Five adults check into a hotel with three dogs. How many shoes are they all wearing?"
trajectories = sampler.generate(prompt, initial_samples=2)

best = sampler.select_best_trajectory(trajectories)
print(f"Confidence: {best['final_score']:.2%}")

Step 0: Meta-Action -> branch_top1 (Util: 0.8699) | Pool: 3
Step 14: Meta-Action -> branch_top2 (Util: 0.8984) | Pool: 5
Step 40: Meta-Action -> prune_bot2 (Util: 0.8699) | Pool: 3
Step 74: Meta-Action -> branch_top1 (Util: 0.8640) | Pool: 4
Step 76: Meta-Action -> swap (Util: 0.8660) | Pool: 4
Step 83: Meta-Action -> branch_top2 (Util: 0.8720) | Pool: 6
Step 96: Meta-Action -> swap (Util: 0.8726) | Pool: 6
Step 135: Meta-Action -> branch_top2 (Util: 0.8810) | Pool: 8
Step 154: Meta-Action -> prune_bot2 (Util: 0.8844) | Pool: 6
Step 166: Meta-Action -> prune_bot2 (Util: 0.8950) | Pool: 4
Step 169: Meta-Action -> swap (Util: 0.8947) | Pool: 4
Step 174: Meta-Action -> branch_top2 (Util: 0.8887) | Pool: 6
Step 183: Meta-Action -> branch_top2 (Util: 0.9074) | Pool: 8
Step 190: Meta-Action -> swap (Util: 0.9006) | Pool: 8
Step 191: Meta-Action -> swap (Util: 0.9068) | Pool: 8
Step 193: Meta-Action -> branch_top2 (Util: 0.8721) | Pool: 7
Step 212: Meta-Action -> branch_top1 (Util: 0.8838) | 

### What you will see in this notebook:

1.  **High-Level Adaptive Sampling (Meta-MDP):** We will run a `ZIPRCSampler` that uses these signals to dynamically **Branch** (explore promising thoughts), **Prune** (kill bad thoughts), and **Swap** (self-correct) in real-time.

2.  **Low-Level Introspection:** We will "crack open" the logits to see the raw self-confidence signals hidden in the vocabulary.

3.  **Drop-in OpenAI-SDK Replacement Streaming (Async):** For easy integration with frontends or evaluation pipelines, the sampler provides an `openai()` async generator that mimics the standard OpenAI API structure but injects a custom `zip_rc` field into every chunk.

4.  **OpenAI-Compatible Server:** We will spin up a local API server that streams these introspection insights to standard tools like OpenWebUI or ChatKit.


## High-Level: Adaptive Sampling (Meta-MDP)

While decoding logits manually reveals the *signal*, effectively acting on that signal requires complex logic (managing a tree of token trajectories, calculating utility scores, and deciding when to branch or prune).

To make this easy, we provide a helper library **`ziprc.py`** (downloaded dynamically below).

The `ZIPRCSampler` implements the **Meta-MDP** policy described in the paper. It automatically:
1.  **Monitors** the introspection signal at every step.
2.  **Branches** (duplicates) high-confidence thoughts to explore variations.
3.  **Prunes** (stops) low-confidence or high-cost thoughts to save compute.
4.  **Swaps** to better trajectories if the current one starts failing.
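
The four behaviours above can be pictured with a toy decision rule. The sketch below is not the policy implemented in `ziprc.py`. It assumes a utility of the form "expected reward minus `beta` times a normalized remaining-length estimate" and treats `alpha` as a pruning threshold, only because the configuration in this notebook describes the two parameters that way. `Trajectory`, `utility`, and `choose_meta_action` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    name: str
    expected_reward: float      # self-estimated probability of success, in [0, 1]
    expected_remaining: float   # estimated remaining length, normalized to [0, 1]

def utility(t: Trajectory, beta: float) -> float:
    """Assumed utility: reward minus a linear cost penalty."""
    return t.expected_reward - beta * t.expected_remaining

def choose_meta_action(pool, active, alpha=0.1, beta=0.05, max_pool=8):
    """Return (action, trajectory_name) for one decoding step (toy rule)."""
    ranked = sorted(pool, key=lambda t: utility(t, beta), reverse=True)
    best, worst = ranked[0], ranked[-1]
    # Prune: drop the weakest trajectory once it falls below the threshold.
    if len(pool) > 1 and utility(worst, beta) < alpha:
        return "prune", worst.name
    # Swap: follow another trajectory if it now looks better than the active one.
    if best.name != active:
        return "swap", best.name
    # Branch: duplicate the leader while there is still room in the pool.
    if len(pool) < max_pool:
        return "branch", best.name
    return "keep", best.name

pool = [
    Trajectory("a", expected_reward=0.87, expected_remaining=0.40),
    Trajectory("b", expected_reward=0.05, expected_remaining=0.90),
]
print(choose_meta_action(pool, active="a"))  # ('prune', 'b')
```

The order of the checks (prune, then swap, then branch) is a design choice of this sketch: it frees compute before spending more of it.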

Let's run a logic puzzle to see it in action:

In [1]:
import sys
import os
from huggingface_hub import hf_hub_download
from tqdm import tqdm

# 1. Download the helper script dynamically from the repo
script_path = hf_hub_download(repo_id="dataopsnick/Qwen3-4B-Instruct-2507-zip-rc", filename="ziprc.py")
sys.path.append(os.path.dirname(script_path))

# 2. Import the module
from ziprc import ZIPRCModel, ZIPRCConfig, ZIPRCSampler

# 3. Configure and Load Model
# Note: The model weights are downloaded automatically here
cfg = ZIPRCConfig(
    model_name="dataopsnick/Qwen3-4B-Instruct-2507-zip-rc",
    alpha=0.1,          # Threshold for pruning
    beta=0.05,          # Cost penalty
    smoothing_window=3  # For stable predictions
)

model = ZIPRCModel(cfg)
sampler = ZIPRCSampler(model)

# 4. Generate with Introspection
prompt = "Solve the following logic puzzle: Five adults check into a hotel with three dogs. How many shoes are they all wearing?"

# generate_stream produces trajectories with introspection data
trajectories = sampler.generate_stream(prompt, initial_samples=2)

# Select the best answer based on the introspection score
best = sampler.select_best_trajectory(trajectories)

print(f"Confidence: {best['final_score']:.2%}")
print(f"Answer: {model.tokenizer.decode(best['ids'][0], skip_special_tokens=True)}")


Best Trajectory (Top-1) Confidence: 86.94%

Let's break down the logic puzzle:

**Question:** *Five adults check into a hotel with three dogs. How many shoes are they all wearing?*

We are to determine how many **shoes** are being worn by the group — five adults and three dogs.

---

### Step 1: Consider what entities wear shoes

- **Adults**: Humans typically wear shoes. Most adults wear **2 shoes per person** (one pair).  
  So, 5 adults × 2 shoes = **10 shoes**

- **Dogs**: Dogs do **not** typically wear shoes. In the absence of any indication that the dogs are wearing shoes (e.g., in a puzzle about "dog shoes" or special circumstances), we assume **dogs do not wear shoes**.

So, 3 dogs × 0 shoes = **0 shoes**

---

### Step 2: Total shoes

Add them together:

10 (from adults) + 0 (from dogs) = **10 shoes**

---

### Final Answer:

**10 shoes**

✅ This is a common logic puzzle that tricks people into thinking about the dogs' shoes, but in reality, dogs don't wear shoes — so only th

## Low-Level: Decoding the Introspection Signal

Now that we have seen the high-level sampler in action, let's look "under the hood" to see how ZIP-RC actually works.

Unlike standard LLMs, this model uses its **Language Model Head** to predict two things simultaneously at every step:
1.  The next text token (standard generation).
2.  A **Joint Distribution** of Expected Reward (Correctness) and Remaining Generation Length.

These introspection predictions are encoded in **56 reserved logits** at the very end of the vocabulary. By slicing these logits, reshaping them into a grid (**8 Reward Bins × 7 Length Bins**), and calculating the expected value, we can extract the model's self-confidence with **zero computational overhead**.
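
To make the grid layout concrete, the sketch below maps a flat offset inside the reserved block to its (reward bin, length bin) cell and back. It assumes the row-major layout implied by reshaping 56 values to 8 × 7 with reward as the leading axis, which is how the extraction code in this notebook reshapes them. The vocabulary size is a placeholder, not the model's real value, and both helper names are hypothetical.

```python
REWARD_BINS = 8
LENGTH_BINS = 7
NUM_ZIP_TOKENS = REWARD_BINS * LENGTH_BINS  # 56 reserved logits

def zip_cell(offset: int) -> tuple[int, int]:
    """Flat offset within the reserved block -> (reward_bin, length_bin)."""
    if not 0 <= offset < NUM_ZIP_TOKENS:
        raise ValueError(f"offset must be in [0, {NUM_ZIP_TOKENS})")
    return divmod(offset, LENGTH_BINS)

def zip_token_id(reward_bin: int, length_bin: int, vocab_size: int) -> int:
    """(reward_bin, length_bin) -> absolute token id at the end of the vocabulary."""
    zip_start_id = vocab_size - NUM_ZIP_TOKENS
    return zip_start_id + reward_bin * LENGTH_BINS + length_bin

vocab_size = 1000  # placeholder; read model.config.vocab_size in practice
print(zip_cell(0))                      # (0, 0): first reserved logit
print(zip_cell(55))                     # (7, 6): last reserved logit
print(zip_token_id(7, 6, vocab_size))   # 999: the final id in the vocabulary
```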

The code below demonstrates how to manually extract this signal from a raw forward pass:

In [2]:
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dataopsnick/Qwen3-4B-Instruct-2507-zip-rc"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Configuration used during training
reward_bins = 8
length_bins = 7
total_zip_tokens = 56
zip_start_offset = 56
# ZIP tokens are located at the very end of the vocabulary
zip_start_id = model.config.vocab_size - zip_start_offset

def get_introspection_probs(logits):
    """
    Extracts the joint distribution P(Reward, Length) from the logits.
    """
    # Slice the reserved ZIP logits
    zip_logits = logits[:, zip_start_id : zip_start_id + total_zip_tokens]

    # Softmax over the flat ZIP tokens to get valid probabilities
    probs = F.softmax(zip_logits, dim=-1)

    # Reshape to [Batch, Reward_Bins, Length_Bins]
    return probs.view(-1, reward_bins, length_bins)

# Example Inference Step
prompt = "Solve this logic puzzle: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(inputs.input_ids)
    next_token_logits = outputs.logits[:, -1, :]

    # Get Introspection Signal (Zero Overhead)
    joint_dist = get_introspection_probs(next_token_logits)

    # 1. Marginalize over length to get P(Reward) distribution
    p_reward = joint_dist.sum(dim=2)  # Shape: [Batch, Reward_Bins]

    # 2. Calculate Expected Reward (Confidence)
    # The reward bins are linearly spaced [0, 1]. We use bin centers for the weighted sum.
    # centers = 0.0625, 0.1875, ..., 0.9375
    reward_grid = torch.linspace(0.0625, 0.9375, reward_bins).to(model.device)

    # E[R] = sum(P(r) * r)
    expected_reward = (p_reward * reward_grid).sum(dim=1).item()

    print(f"Model Confidence: {expected_reward:.2%}")

Model Confidence: 40.61%
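
The same 8 × 7 grid also carries the cost estimate: summing over the reward axis instead of the length axis gives a distribution over remaining length. The framework-free sketch below walks through both marginals on synthetic logits. The reward centers match the ones used above. The per-bin length values are invented for illustration, since this notebook does not state how the length bins are spaced, and `introspect` is a hypothetical helper.

```python
import math

REWARD_BINS, LENGTH_BINS = 8, 7
# Bin centers for rewards linearly spaced on [0, 1]: 0.0625, 0.1875, ..., 0.9375
REWARD_CENTERS = [(i + 0.5) / REWARD_BINS for i in range(REWARD_BINS)]
# Hypothetical representative remaining-token counts per length bin (assumption).
LENGTH_VALUES = [16, 32, 64, 128, 256, 512, 1024]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def introspect(zip_logits):
    """56 flat logits -> (expected reward, expected remaining length)."""
    probs = softmax(zip_logits)
    # Row-major reshape to [reward_bin][length_bin]
    grid = [probs[r * LENGTH_BINS:(r + 1) * LENGTH_BINS] for r in range(REWARD_BINS)]
    p_reward = [sum(row) for row in grid]                                  # sum out length
    p_length = [sum(row[l] for row in grid) for l in range(LENGTH_BINS)]   # sum out reward
    exp_reward = sum(p * c for p, c in zip(p_reward, REWARD_CENTERS))
    exp_length = sum(p * v for p, v in zip(p_length, LENGTH_VALUES))
    return exp_reward, exp_length

# Synthetic logits: all mass on the top reward bin and the first length bin.
logits = [-1e9] * (REWARD_BINS * LENGTH_BINS)
logits[7 * LENGTH_BINS + 0] = 0.0
print(introspect(logits))  # (0.9375, 16.0)
```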


### 3. OpenAI-Compatible Streaming (Async)

For easy integration with frontends or evaluation pipelines, the sampler provides an `openai()` async generator. This mimics the standard OpenAI API structure but injects a custom `zip_rc` field into every chunk.

This allows you to consume the **standard text stream** normally, while simultaneously monitoring the **introspection signals** (confidence scores, alternative branches, and meta-actions) to visualize the model's efficient compute usage.
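
As a rough picture of what a consumer has to do, the sketch below separates the two channels from a list of plain dictionaries shaped like streamed chunks. The `action`, `utility`, and `final_text` keys mirror the ones read by the cell that follows. The exact chunk layout emitted by `sampler.openai()` is an assumption here, and `split_channels` is a hypothetical helper.

```python
def split_channels(chunks):
    """Separate streamed text from ZIP-RC meta-actions.

    Returns (text, actions, final_text), where actions is a list of
    (action, utility) pairs for everything except 'keep' and 'finished'.
    """
    text_parts, actions, final_text = [], [], None
    for chunk in chunks:
        # Channel A: standard text delta (may be empty on the last chunk).
        delta = chunk["choices"][0].get("delta", {})
        text_parts.append(delta.get("content") or "")
        # Channel B: introspection side channel (may be absent entirely).
        info = chunk.get("zip_rc")
        if not info:
            continue
        action = info.get("action")
        if action == "finished":
            final_text = info.get("final_text", final_text)
        elif action and action != "keep":
            actions.append((action, info.get("utility")))
    return "".join(text_parts), actions, final_text

# Hand-written chunks for illustration; the values are made up.
fake_stream = [
    {"choices": [{"delta": {"content": "10"}}],
     "zip_rc": {"action": "branch_top1", "utility": 0.61}},
    {"choices": [{"delta": {"content": " shoes"}}],
     "zip_rc": {"action": "keep", "utility": 0.70}},
    {"choices": [{"delta": {}}],
     "zip_rc": {"action": "finished", "final_text": "10 shoes"}},
]
print(split_channels(fake_stream))
# ('10 shoes', [('branch_top1', 0.61)], '10 shoes')
```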


In [5]:
import asyncio
import nest_asyncio
from ziprc import ZIPRCModel, ZIPRCConfig, ZIPRCSampler

# 1. Setup (Run once)
# This patch is required for running async loops in Colab/Jupyter
nest_asyncio.apply()

# Load Model
cfg = ZIPRCConfig(model_name="dataopsnick/Qwen3-4B-Instruct-2507-zip-rc")
model = ZIPRCModel(cfg)
sampler = ZIPRCSampler(model)

async def consume_inference_stream():
    prompt = "Solve the following logic puzzle: Five adults check into a hotel with three dogs. How many shoes are they all wearing?"

    print(f"User: {prompt}\n" + "-"*60)
    print("Assistant (Streaming with Introspection):")

    # 2. Get the OpenAI-compatible stream
    # Returns an async generator yielding chunk objects
    stream = sampler.openai(prompt, max_tokens=256)

    final_clean_answer = ""

    async for chunk in stream:
        # --- Channel A: Standard Text (Compatible with standard UIs) ---
        # Use .get() to safely handle the final chunk where delta is empty
        delta = chunk.choices[0].delta
        content = delta.get("content", "")

        if content:
            print(content, end="", flush=True)

        # --- Channel B: Zero-Overhead Introspection (The "Pareto" Gain) ---
        # We access the side-channel data to see what the model is thinking
        # without running separate reward model inference.
        if hasattr(chunk, 'zip_rc'):
            info = chunk.zip_rc

            # If the model performs a meta-action (Branching/Pruning), log it
            # Filter out 'finished' to avoid accessing missing utility/score fields
            if info.action not in ['keep', 'finished']:
                print(f"\n[⚙️ META-ACTION: {info.action} | Utility: {info.utility:.4f}] ", end="")

            # Check for the Final Answer
            if info.get('action') == 'finished' and 'final_text' in info:
                final_clean_answer = info['final_text']

            # Optional: Peek at the "Confidence" (Expected Correctness) in real-time
            # if info.step % 10 == 0:
            #     print(f" (Conf: {info.lhs_score:.1%}) ", end="")

    print("\n" + "-" * 40)
    print("🏆 FINAL BEST ANSWER (Clean):")
    print("-" * 40)
    print(final_clean_answer)

# 3. Execution
loop = asyncio.get_event_loop()
loop.run_until_complete(consume_inference_stream())

User: Solve the following logic puzzle: Five adults check into a hotel with three dogs. How many shoes are they all wearing?
------------------------------------------------------------
Assistant (Streaming with Introspection):

[⚙️ META-ACTION: branch_top1 | Utility: 0.6149] 's
[⚙️ META-ACTION: branch_top1 | Utility: 0.6949]  break
[⚙️ META-ACTION: branch_top1 | Utility: 0.7781] 
[⚙️ META-ACTION: swap | Utility: 0.8147]  step by
[⚙️ META-ACTION: swap | Utility: 0.8592]  by WeNumber of Problem:
> Five adults check into aNumber dogs. How many shoes are they?

We are
[⚙️ META-ACTION: swap | Utility: 0.8544]  asked the total number of shoes** all the **five adults and three
[⚙️ META-ACTION: swap | Utility: 0.8546]  dogs**.

:
[⚙️ META-ACTION: swap | Utility: 0.8581]  Understand: ConsiderFive adults**
 adults
3 dogs

We when theysh
[⚙️ META-ACTION: swap | Utility: 0.8564] oes
[⚙️ META-ACTION: swap | Utility: 0.8729] ** are.

---


[⚙️ META-ACTION: branch_top1 | Utility: 0.8449]  Do adults 

### 4. Deploy as OpenAI-Compatible Server

To use this model with external tools (like **ChatKit**, **OpenWebUI**, or the **official `openai` Python client**), you can wrap the sampler in a simple FastAPI server.

This server exposes a `/v1/chat/completions` endpoint that streams the generated text **and** the introspection data (confidence, branching, etc.) in real-time.
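
On the wire, OpenAI-style streaming uses server-sent events: each chunk is a `data: <json>` line followed by a blank line, and the stream ends with `data: [DONE]`. The sketch below frames chunks that way with an extra `zip_rc` object attached. How `server.py` actually builds its payloads is not shown in this notebook, so the chunk fields and the `make_chunk` and `sse_frames` helpers are assumptions.

```python
import json

def make_chunk(content, zip_rc, model="zip-rc"):
    """Build a chat.completion.chunk-shaped dict with a zip_rc side channel."""
    return {
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": {"content": content}, "finish_reason": None}],
        "zip_rc": zip_rc,  # non-standard extra field carrying introspection data
    }

def sse_frames(chunks):
    """Yield server-sent-event frames for an iterable of chunk dicts."""
    for chunk in chunks:
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"  # end-of-stream sentinel used by OpenAI-style APIs

chunks = [
    make_chunk("10", {"action": "keep", "utility": 0.87}),
    make_chunk(" shoes", {"action": "branch_top1", "utility": 0.88}),
]
for frame in sse_frames(chunks):
    print(frame, end="")
```

Because `zip_rc` sits next to the standard fields rather than replacing them, clients that ignore unknown keys can still read the text stream unchanged.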

#### 1. Server Script (`server.py`)

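The cell below downloads `server.py` from the model repository and launches it from inside the notebook. Outside a notebook, save the script alongside `ziprc.py` and run it to launch the API.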

In [11]:
import sys
import os
import asyncio
import uvicorn
from huggingface_hub import hf_hub_download

# 1. Download server.py
script_path = hf_hub_download(repo_id="dataopsnick/Qwen3-4B-Instruct-2507-zip-rc", filename="server.py")
sys.path.append(os.path.dirname(script_path))

# 2. Import the app
# NOTE: This will load the model weights again if they aren't cached.
# If you are low on VRAM, restart your runtime before running this cell.
from server import app

# 3. Run the Server (Colab/Jupyter Compatible)
HOST = "0.0.0.0"
PORT = 8000
config = uvicorn.Config(app, host=HOST, port=PORT)
server = uvicorn.Server(config)

try:
    # Check if we are in an existing loop (Colab)
    loop = asyncio.get_running_loop()
    print(f"🚀 Server running in background on http://{HOST}:{PORT}")
    loop.create_task(server.serve())
except RuntimeError:
    # Standard script execution
    asyncio.run(server.serve())

🚀 Server running in background on http://0.0.0.0:8000


#### 2. Usage Examples

**Option A: Using the `openai` Python Client**

In [None]:
import asyncio
from openai import AsyncOpenAI

# Point to your local server
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="ziprc"  # Key is ignored by local server
)

async def main():
    print("User: Solve the logic puzzle...")

    stream = await client.chat.completions.create(
        model="zip-rc",
        messages=[{"role": "user", "content": "Solve the following logic puzzle: Five adults check into a hotel with three dogs. How many shoes are they all wearing?"}],
        stream=True
    )

    final_clean_answer = ""

    async for chunk in stream:
        # 1. Standard Content
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)

        # 2. Access ZIP-RC Introspection Data (via raw dictionary)
        # The 'zip_rc' field contains zero-overhead confidence scores & meta-actions
        # Use .get() throughout so a chunk without the field is skipped
        # instead of hiding real errors behind a bare `except`.
        info = chunk.model_dump().get('zip_rc') or {}
        action = info.get('action')
        if action and action not in ['keep', 'finished']:
            print(f"\n[⚙️ {action}] ", end="")
        # Check for the Final Answer
        if action == 'finished' and 'final_text' in info:
            final_clean_answer = info['final_text']

    print("\n" + "-" * 40)
    print("🏆 FINAL BEST ANSWER (Clean):")
    print("-" * 40)
    print(final_clean_answer)

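# In a notebook cell, use `await main()` instead: asyncio.run() raises a
# RuntimeError inside an already-running event loop unless nest_asyncio is applied.
asyncio.run(main())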

**Option B: Integrating with ChatKit (Self-Hosted)**

To use this with the **ChatKit Starter App**, configure the backend to point to your local ZIP-RC server.

1.  Navigate to your `chatkit/` directory.
2.  Set the environment variable before running the backend:

```bash
# Point agents/OpenAI to your local ZIP-RC server
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="ziprc"

# Run the ChatKit backend
npm run backend
```