Skip to content

VDADev2022/token-diet

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🥗 Token-Diet v3.0: Production-Ready LLM Prompt Optimization Framework

Token-Diet v3.0 is a comprehensive, production-grade framework designed to minimize token consumption in Large Language Model (LLM) prompts without sacrificing output quality. It evolves from a collection of techniques into a structured pipeline with engineering guardrails and measurable outcomes. By integrating linguistic, algorithmic, and architectural patterns, Token-Diet enables up to 20x reduction in token usage, significantly lowering API costs and improving inference latency.


🚀 Key Innovations in v3.0

1. ⚙️ Optimized Execution Flow (Order Matters)

Token-Diet v3.0 introduces a strict, optimized pipeline for token processing:

Query → Retrieve → Prune → Cache → Route → Build Prompt → Compress → Call LLM → Measure → Update State

This order ensures maximum efficiency by eliminating waste early (Pruning) before costly decisions like caching or routing, and applying linguistic compression as a final, surgical step.

2. ✂️ Retrieval Pruning (Highest ROI)

Never send unnecessary tokens to your LLM. This new feature implements a robust pruning pipeline (dedupe → sentence-trim → topK) to filter irrelevant context from retrieval results.

  • Impact: Achieves 30–60% token reduction at the source, with zero quality loss if implemented correctly.
  • Benefit: Significantly reduces input context before it even reaches the LLM, maximizing the value of each token.

3. 🧠 Stateful Memory (Structured Context Management)

Replaces naive conversational summarization with structured JSON state. This allows the LLM to reason over precise facts and constraints, preventing context drift and ensuring accurate reasoning.

  • Mechanism: State is structured as JSON (facts, constraints, open_tasks) with clear regeneration triggers.

4. 🔪 Surgical Linguistic Compression (Last-Mile Only)

While retaining the power of "Caveman Ultra," v3.0 refines its application as a last-mile optimization. Crucially, it introduces strict guardrails:

  • DO NOT compress: Code, JSON schemas, legal/safety text, URLs, or identifiers.
  • Benefit: ~15% token reduction on prose prompts, applied safely and surgically to prevent syntax errors or semantic loss.

5. 🛡️ Engineering Guardrails & Quantitative Measurement

v3.0 treats token optimization as an engineering discipline. It mandates:

  • Tracking: tokens_in, tokens_out, latency_ms, cache_hit_rate, cost_usd per request.
  • Alerts: Proactive notifications for context bloat, low cache hit rates, or cost spikes.
  • Safety Guards: A framework to diagnose and recover from common issues like quality drops, hallucinations, or over-pruning, with a "Golden Rule": If unsure, disable and rerun to measure the difference.

📊 Framework Comparison

Feature Caveman Skill Token-Diet v2.0 Token-Diet v3.0 (Current)
Primary Focus Linguistic Compression Ad-hoc Optimization Techniques Production-Ready Pipeline
Execution Manual / Ad-hoc Manual / Ad-hoc Structured & Ordered
Context Mgmt None Rolling Summarization Retrieval Pruning + Stateful Memory
Compression Aggressive (Prose) Aggressive (Prose) Surgical (Last-Mile)
Safety None Basic Checklist Engineering Guardrails & Alerts
Measurement None Qualitative Quantitative (Metrics-Driven)
Typical Reduction ~75% (on prose) ~20x (ideal conditions) ~50-80% (realistic, stable)

📈 Effectiveness & ROI (v3.0 Realistic)

Token-Diet v3.0 delivers multiplicative benefits, leading to significant cost and latency improvements while maintaining output quality.

Metric Target Conditions
Token Reduction 50–80% All 9 steps applied in order
Latency ↓20–50% Caching + routing active
Cost ↓40–75% Routing to smaller models + cache hits >70%
Quality Stable Guards enforced; no compression on code/schema/legal

Typical Breakdown:

  • Pruning: 30–40% reduction (Highest impact on RAG).
  • Caching: 20–30% reduction (if reuse >50%).
  • Routing: 30–50% cost reduction (simple tasks to cheap model).
  • Compression: 10–15% reduction (prose only).

🛠️ Usage Checklist

  1. Retrieve & Prune: Filter irrelevant context before prompt assembly.
  2. Cache: Identify and cache static system prompts and reusable blocks.
  3. Route: Direct tasks to the most cost-effective LLM based on complexity.
  4. Batch: Group tool calls and data operations to reduce API round-trips.
  5. Stateful Memory: Use structured JSON for dynamic context, regenerating as needed.
  6. Compress: Apply linguistic compression only as a last-mile step on safe text.
  7. Measure: Implement comprehensive tracking for tokens, latency, and cost.
  8. Guard: Establish safety guardrails and alerts to prevent silent failures.

🚀 Integration

To leverage the Token-Diet skill in your AI workflows, follow these integration guidelines:

For Manus Agents

Manus agents can directly utilize the Token-Diet skill by referencing its name. When you want to apply token reduction techniques, simply invoke the skill with the /token-diet command, followed by your prompt or specific instructions. The agent will automatically apply the defined strategies (e.g., Caveman Ultra, caching, summarization) to optimize token usage.

Example Usage within Manus:

/token-diet Analyze the provided log data for anomalies and summarize key findings concisely.

For Claude and Other LLMs (via PCodec/Gateway)

For direct integration with Claude or other LLMs, the Token-Diet principles are best applied at the gateway or orchestration layer using the PCodec (Prompt-Codec) mechanism. This involves an intermediary service that intercepts and transforms prompts before sending them to the LLM.

  1. Implement a Gateway: Set up a proxy or gateway service that sits between your application and the LLM API.
  2. Apply PCodec Rules: Configure the gateway to apply the Token-Diet's PCodec rules. This means defining short codes (e.g., R1, S1) that expand into full, but optimized, instructions or schema definitions.
  3. Integrate Advanced Techniques: The gateway can also manage prompt caching, rolling summarization, tool call batching, and conditional model routing before the prompt reaches the LLM.

Example Prompt (after gateway processing):

User: R1: Analyze data. Summarize. S1: JSON output.

(Where R1 expands to a Caveman Ultra compressed instruction and S1 ensures a structured JSON output.)


🧑‍💻 Author


📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


#LLMOptimization #TokenReduction #PromptEngineering #AICostSavings #LLMEfficiency #AgenticAI #GenerativeAI #AI #MachineLearning #DeepLearning #NLP #PromptCompression #ContextManagement #AIDevelopment

Releases

No releases published

Packages

 
 
 

Contributors