Semantic prompt compression for LLM workflows. Reduce token usage by 40–60% without losing meaning.
Built by Base76 Research Lab — research into epistemic AI architecture.
token-compressor is a two-stage pipeline that compresses prompts before they reach an LLM:
- LLM compression — a local model (llama3.2:1b via Ollama) rewrites the prompt to its semantic minimum, preserving all conditionals and negations
- Embedding validation — cosine similarity between original and compressed embeddings must exceed a threshold (default: 0.90) — if not, the original is sent unchanged
The result: shorter prompts, lower costs, same intent.
```
Input prompt (300 tokens)
        ↓
   LLM compresses
        ↓
Embedding validates (cosine ≥ 0.90?)
        ↓
Pass → compressed (120 tokens)
Fail → original (300 tokens)
```
Key design principle: conditionality is never sacrificed. If your prompt says "only do X if Y", that constraint survives compression.
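The pass/fail gate in the diagram above can be sketched as a small function. This is an illustration with hypothetical helper names (`compress`, `similarity`), not the library's internals — the real entry point is `LLMCompressEmbedValidate.process`:

```python
def compress_with_fallback(prompt, compress, similarity, threshold=0.90):
    """Return compressed text only if it stays semantically close to the original."""
    candidate = compress(prompt)              # Stage 1: LLM rewrite
    coverage = similarity(prompt, candidate)  # Stage 2: embedding check
    if coverage >= threshold:
        return candidate, "compressed"
    return prompt, "raw_fallback"             # validation failed: send original unchanged
```

The fallback path is what makes the pipeline safe to run unattended: a bad compression costs you nothing but the local inference time.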
- Python 3.10+
- Ollama running locally
- Two models pulled:

  ```shell
  ollama pull llama3.2:1b
  ollama pull nomic-embed-text
  ```

- Python dependencies:

  ```shell
  pip install ollama numpy
  ```

Basic usage:

```python
from compressor import LLMCompressEmbedValidate

pipeline = LLMCompressEmbedValidate()
result = pipeline.process("Your prompt text here...")
print(result.output_text)  # compressed (or original if validation failed)
print(result.report())     # MODE / COVERAGE / TOKENS saved
```

Result object:
| Field | Description |
|---|---|
| `output_text` | Text to send to your LLM |
| `mode` | `compressed` / `raw_fallback` / `skipped` |
| `coverage` | Cosine similarity (0.0–1.0) |
| `tokens_in` | Estimated input tokens |
| `tokens_out` | Estimated output tokens |
| `tokens_saved` | Difference (`tokens_in` − `tokens_out`) |
```shell
echo "Your long prompt here..." | python3 cli.py
```

Output: compressed text on stdout, stats on stderr.
Add to your ~/.claude/settings.json under hooks → UserPromptSubmit:
```json
{
  "type": "command",
  "command": "echo \"${CLAUDE_USER_PROMPT:-}\" | python3 /path/to/token-compressor/cli.py > /tmp/compressed_prompt.txt 2>/tmp/compress.log || true"
}
```

This runs on every prompt submission and writes the compressed version to a temp file, which can be injected back into context via a second hook or MCP server.
The included MCP server exposes compression as a tool callable from any MCP-compatible client (Claude Code, etc.):
```shell
python3 mcp_server.py
```

Tool: `compress_prompt`
- Input: `text` (string)
- Output: compressed text + stats
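For reference, an MCP client invokes the tool with a standard `tools/call` JSON-RPC request over stdio. The request shape below follows the MCP specification; the argument name `text` comes from the tool definition above, but the exact response payload depends on the server implementation:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "compress_prompt",
    "arguments": { "text": "Your long prompt here..." }
  }
}
```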
Claude Code MCP config (~/.claude/settings.json):
```json
{
  "mcpServers": {
    "token-compressor": {
      "command": "python3",
      "args": ["/path/to/token-compressor/mcp_server.py"]
    }
  }
}
```

All pipeline parameters with their defaults:

```python
pipeline = LLMCompressEmbedValidate(
    threshold=0.90,                  # cosine similarity floor (lower = more aggressive)
    min_tokens=80,                   # skip pipeline below this (not worth compressing)
    compress_model="llama3.2:1b",
    embed_model="nomic-embed-text",
)
```

Stage 1 — LLM compression
The compression prompt instructs the model to:
- Preserve all conditionals (`if`, `only if`, `unless`, `when`, `but only`)
- Preserve all negations
- Remove filler, hedging, redundancy
- Target 40–60% of original length
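A minimal sketch of this stage, assuming the `ollama` Python package and a running local Ollama server. The instruction text here paraphrases the rules above for illustration — it is not the exact prompt the library ships with:

```python
COMPRESS_PROMPT = (
    "Rewrite the following prompt to 40-60% of its length. "
    "Preserve every conditional (if, only if, unless, when, but only) "
    "and every negation. Remove filler, hedging, and redundancy. "
    "Output only the rewritten prompt.\n\n{text}"
)

def llm_compress(text: str, model: str = "llama3.2:1b") -> str:
    import ollama  # pip install ollama; requires a running Ollama server
    resp = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": COMPRESS_PROMPT.format(text=text)}],
    )
    return resp["message"]["content"].strip()
```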
Stage 2 — Embedding validation
Computes cosine similarity between the original and compressed text using nomic-embed-text. If similarity falls below threshold, the original is returned unchanged. This prevents silent meaning loss.
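The validation step reduces to one dot product. A sketch, assuming the `ollama` package's `embeddings` call for the vectors (requires a local Ollama server):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product of the vectors over the product of their norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def coverage(original: str, compressed: str, model: str = "nomic-embed-text") -> float:
    import ollama  # pip install ollama; requires a running Ollama server
    e1 = np.array(ollama.embeddings(model=model, prompt=original)["embedding"])
    e2 = np.array(ollama.embeddings(model=model, prompt=compressed)["embedding"])
    return cosine(e1, e2)
```

Identical texts score 1.0, so a 0.90 floor permits moderate rewording while rejecting semantic drift.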
Tested across Swedish and English prompts, technical and natural language:
| Input | Tokens in | Tokens out | Saved |
|---|---|---|---|
| Research abstract (EN) | 89 | 38 | 57% |
| Session intent (SV) | 32 | 18 | 44% |
| Technical instruction | 47 | 22 | 53% |
| Short command (<80t) | — | — | skipped |
This tool implements the architecture from:
Wikström, B. (2026). When Alignment Reduces Uncertainty: Epistemic Variance Collapse and Its Implications for Metacognitive AI. DOI: 10.5281/zenodo.18731535
Part of the Base76 Research Lab toolchain for epistemic AI infrastructure.
MIT — Base76 Research Lab, Sweden