
Context compaction agent — cheap-LLM summarize, expensive-LLM consume #44

@andrewmusselman


Carry-forward from #14 ("Refactor prompts"). That thread split into three
readings of "telescoping" and converged on context compaction with a
hierarchical reference tree as the right one. The ASVS-specific budget work
landed in 48a15e8 (see ASVS/audit-budget.md and
ASVS/optimization-plan.md), but the general compaction primitive
sbp asked for never shipped. Filing this issue to revisit it now.

Problem

The expensive part of any LLM-heavy pipeline is feeding the expensive model.
Today, an audit pass for a single ASVS section might hand Opus 200–300KB of
raw source code, most of which is irrelevant to the requirement being
audited. The same pattern shows up anywhere that a high-context, high-cost
reasoning call follows a data-gathering step: the gathering step assembled
the world; the reasoning step pays for the world even though it only needs
a fraction of it.

A cheaper model, asked to first summarize the gathered context with
respect to the question being asked, produces a fraction of the tokens at a
tiny fraction of the cost. The expensive model then reasons over the
summary, optionally drilling into specific parts of the raw via tool calls.

What sbp described in #14

"Use cheaper LLMs to gather the context, medium cost LLMs to collate and
do a small amount of reasoning about that gathered context (also
compactifying), and then high cost LLMs to work on the output of that."

"The idea was to telescope the compressed outputs, so you have a kind of
compressed tree of references."

So the shape isn't just summarization — it's a hierarchical index the
expensive model can navigate:

  • Level 0: raw source (full files, full data).
  • Level 1: per-unit abstracts (function signatures + key comments for
    code; per-file summaries for docs; per-row digests for data).
  • Level 2: group abstracts (per-directory or per-domain summaries).
  • Level N: the entry-point index Opus actually receives.

The expensive model reads the top of the tree, picks which subtree it
needs, and a tool call fetches the next level down. Most calls never need
to descend to Level 0.
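A minimal sketch of the tree shape, assuming each node holds its summary text plus links to the next level down. `Node`, `drill_down`, and the sample contents are hypothetical; in the real pipeline `drill_down` would sit behind a tool call rather than an in-memory lookup.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One entry in a compacted reference tree (hypothetical shape).

    level follows the issue's numbering: 0 is raw source, higher numbers
    are progressively more abstract summaries.
    """
    key: str
    level: int
    text: str
    children: list[Node] = field(default_factory=list)


def drill_down(node: Node, child_key: str) -> Node:
    """Fetch the child one level below `node` with the given key.

    In the pipeline this lookup would be a tool call issued by the
    expensive model; here it is a plain in-memory search.
    """
    for child in node.children:
        if child.key == child_key:
            return child
    raise KeyError(f"{child_key!r} is not a child of {node.key!r}")


# A tiny four-level tree for a single file (contents are illustrative).
raw = Node("airflow/dags/example.py", 0, "<full source of example.py>")
abstract = Node("airflow/dags/example.py", 1, "run(): reads DAG files from disk", [raw])
group = Node("airflow/dags", 2, "DAG definitions; one file does file IO", [abstract])
index = Node("index", 3, "airflow: dags/ (file IO)", [group])
```

Opus would receive only `index.text` up front; each `drill_down` costs one extra round trip but only pulls in the subtree it asked for.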

Lossless compaction (claw-compactor style, ~25% reduction) doesn't get
anywhere near the cost reduction needed. The win is lossy: discarding
everything not relevant to the question at hand, with the raw still
available behind a tool call if the expensive model asks for it.

Why this matters now

  • ASVS specifically. The pipeline today sends ~150–300KB of source per
    Opus call. Half of any file is irrelevant to whichever ASVS section is
    being audited. A 70–80% reduction is realistic; on input tokens that is
    roughly a 3–5x cost reduction on the hottest path.
  • Demo budget control. A user wants to demo an agent on a real
    codebase but doesn't want to pay full Opus rates for every iteration.
    Compaction-with-cache makes the first run expensive and every subsequent
    iteration cheap.
  • Quality-curve experiments. Comparing "Opus on raw" vs "Opus on
    compacted" is the kind of experiment that's hard to run today because
    the compaction step doesn't exist as a reusable thing.
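The cache argument in the demo-budget bullet can be made concrete with a back-of-envelope cost model. This is a sketch under stated assumptions: prices are placeholders per million input tokens (not real Bedrock rates), and output tokens are ignored.

```python
def pass_cost(raw_tokens: int, reduction: float, expensive_per_mtok: float,
              cheap_per_mtok: float, cached: bool) -> float:
    """Input-token cost of one expensive-model pass, in dollars.

    raw_tokens: size of the gathered context.
    reduction: fraction removed by compaction (0.8 means 80% smaller).
    *_per_mtok: price per million input tokens; placeholders, not real rates.
    cached: on a cache hit the cheap model is not called at all.
    Output tokens are ignored for simplicity.
    """
    compact_tokens = raw_tokens * (1 - reduction)
    cost = compact_tokens * expensive_per_mtok / 1_000_000
    if not cached:
        # Cache miss: the cheap model reads the full raw context once.
        cost += raw_tokens * cheap_per_mtok / 1_000_000
    return cost


# reduction=0.0 with cached=True models "no compaction step at all".
baseline = pass_cost(1_000_000, 0.0, 15.0, 1.0, cached=True)     # Opus on raw
first_run = pass_cost(1_000_000, 0.8, 15.0, 1.0, cached=False)   # compact, then Opus
later_runs = pass_cost(1_000_000, 0.8, 15.0, 1.0, cached=True)   # cache hit
```

With these placeholder prices the first compacted run costs about 4.0 against 15.0 uncompacted, and cached reruns about 3.0; the cheap-model pass only matters on the first iteration.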

Recommendation: build it as an agent first

The compaction strategy is domain-specific: code wants signature
extraction, audit reports want finding-extraction, narrative wants
narrative summarization. Different users will want different strategies.
Different agents in the same pipeline will want different strategies.
Putting strategy in user agent code keeps the platform simple and the
iteration fast — same argument the original #14 thread reached for
telescoping itself.

What the platform should provide is the substrate: caching with
content-addressed keys, so the same source → strategy → compact mapping is
computed once and reused. That substrate already exists, just not formally:
data_store with a disciplined naming convention does it today. A
platform-level promotion is small enough that it can wait until the agent
pattern has shaken out.

MVP — entirely as an agent

A compactor agent with this contract:

async def run(input_dict, tools):
    source_namespace = input_dict["source_namespace"]
    keys = input_dict.get("keys") or "*"           # specific keys or all
    strategy = input_dict["strategy"]              # named strategy
    level = input_dict.get("level", 1)             # which level to produce
    purpose_hint = input_dict.get("purpose", "")   # what's being audited/asked

Returns:

{
  "compact_namespace": "compact:files:apache/airflow:v1:security_skeleton:abc12345",
  "level": 1,
  "key_map": {                                   # original key → compact key
    "airflow/dags/example.py": "summary_of_example_py",
    # ...
  },
  "stats": {"input_tokens": 89432, "output_tokens": 18211, "saved_pct": 79.6}
}

The agent computes the content+strategy hash, checks data_store for
existing compacts under compact:{source}:{version}:{strategy}:{hash},
returns immediately if cached. Otherwise calls the cheap LLM with the
strategy's prompt template against each input chunk, stores the results,
and returns the namespace pointer.
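A minimal, runnable sketch of that flow, assuming a dict-backed `data_store` and an async `call_llm(model, prompt)`. The `Tools` shape, the inline `STRATEGIES` entry, and the whitespace token count are all hypothetical stand-ins for the real agent runtime.

```python
import asyncio
import hashlib
import json
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class Tools:
    """Stand-in for the agent's `tools` argument (hypothetical shape).

    data_store maps namespace -> {key: text}; call_llm(model, prompt) -> text.
    """
    data_store: dict
    call_llm: Callable[[str, str], Awaitable[str]]


# Hypothetical canned strategy; a real registry would live in data_store.
STRATEGIES = {
    "code_security": {
        "version": "v1",
        "model": "haiku",
        "template": "Keep only security-relevant code.\nPurpose: {purpose}\n\n{chunk}",
    },
}


def count_tokens(text: str) -> int:
    return len(text.split())  # crude whitespace proxy for a real tokenizer


async def run(input_dict, tools):
    source_ns = input_dict["source_namespace"]
    name = input_dict["strategy"]
    level = input_dict.get("level", 1)
    purpose = input_dict.get("purpose", "")
    strategy = STRATEGIES[name]

    source = tools.data_store[source_ns]
    wanted = input_dict.get("keys") or "*"
    keys = sorted(source) if wanted == "*" else list(wanted)

    # Content-addressed hash: selected inputs + strategy version/name + level.
    payload = json.dumps(
        {"content": {k: source[k] for k in keys}, "version": strategy["version"],
         "strategy": name, "level": level},
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]
    compact_ns = f"compact:{source_ns}:{strategy['version']}:{name}:{digest}"

    if compact_ns not in tools.data_store:
        # Cache miss: run the cheap model once per input chunk, then store.
        compacts = {}
        for k in keys:
            prompt = strategy["template"].format(purpose=purpose, chunk=source[k])
            compacts[k] = await tools.call_llm(strategy["model"], prompt)
        tools.data_store[compact_ns] = compacts

    compacts = tools.data_store[compact_ns]
    in_tok = sum(count_tokens(source[k]) for k in keys)
    out_tok = sum(count_tokens(compacts[k]) for k in keys)
    saved = round(100 * (1 - out_tok / in_tok), 1) if in_tok else 0.0
    return {
        "compact_namespace": compact_ns,
        "level": level,
        "key_map": {k: k for k in keys},  # compacts reuse the original keys
        "stats": {"input_tokens": in_tok, "output_tokens": out_tok, "saved_pct": saved},
    }


def run_sync(input_dict, tools):
    """Convenience wrapper for scripts and tests outside an event loop."""
    return asyncio.run(run(input_dict, tools))
```

As in the key scheme above, `purpose` is not part of the hash here, so two calls that differ only in purpose share a compact. If the purpose hint materially changes the cheap model's output, it probably belongs in the hash too.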

Other agents in the pipeline call this once at the start of a pass, get
back a namespace pointer, and feed that namespace to Opus instead of the
raw. No platform change needed.

Naming convention for cache keys

compact:{source_namespace}:{strategy_version}:{strategy_name}:{content_hash}
                                                              └─ first 8 chars
                                                                 of sha256 of
                                                                 input content
                                                                 + strategy
                                                                 + level

Hash includes the source content, strategy version, strategy name, and
level so that any of those changing produces a fresh cache key without
invalidating siblings. Same pattern the relevance-filter cache uses today
with its policy hash.
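A sketch of the key derivation, assuming the inputs are JSON-encoded as a list before hashing. That encoding is an assumption, not part of the convention; it keeps field boundaries unambiguous, which plain string concatenation would not.

```python
import hashlib
import json


def compact_key(source_namespace: str, strategy_version: str,
                strategy_name: str, level: int, content: str) -> str:
    """Build a cache key following the compact:... naming convention.

    The 8-char digest covers content, strategy version, strategy name, and
    level, so changing any one of them yields a fresh key. JSON-encoding the
    tuple means ("ab", "c") and ("a", "bc") cannot collide by concatenation.
    """
    material = json.dumps([content, strategy_version, strategy_name, level])
    digest = hashlib.sha256(material.encode("utf-8")).hexdigest()[:8]
    return f"compact:{source_namespace}:{strategy_version}:{strategy_name}:{digest}"
```

Source namespaces such as `files:apache/airflow` contain colons themselves, so anything parsing these keys should take the digest from the right (`key.rsplit(":", 1)`) rather than splitting from the left.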

Strategy registry — also as an agent, not platform

Each strategy is a named (prompt template, target model, target token
budget) tuple. Ship a few canned ones:

| Strategy name     | What it does                                               | Cheap LLM | Notes                                            |
|-------------------|------------------------------------------------------------|-----------|--------------------------------------------------|
| code_signature    | Functions + docstrings + critical patterns                 | Haiku     | Skips bodies of pure-utility functions           |
| code_security     | Security-relevant code only (auth, validation, crypto, IO) | Haiku     | Hot path for ASVS                                |
| audit_findings    | Extracts finding objects from audit reports                | Sonnet    | Generalize from what asvs_consolidate does today |
| narrative_summary | Standard text summarization                                | Haiku     | Generic fallback                                 |

Users can register their own strategies by adding entries to a small YAML
config, or by passing the strategy inline as a prompt template. The agent
reads from data_store:strategies or similar.
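A sketch of the registry, assuming the YAML has already been parsed into a plain mapping (parsing needs a third-party library such as PyYAML, so it is left to the caller). `Strategy`, `StrategyRegistry`, and the validation rules are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """A named (prompt template, target model, target token budget) tuple."""
    name: str
    version: str
    template: str      # must contain {chunk}; {purpose} is optional
    model: str         # placeholder model label, e.g. "haiku"
    token_budget: int  # target size of each compact, in tokens


def _validate(s: Strategy) -> Strategy:
    if "{chunk}" not in s.template:
        raise ValueError(f"strategy {s.name!r}: template lacks {{chunk}}")
    if s.token_budget <= 0:
        raise ValueError(f"strategy {s.name!r}: token_budget must be positive")
    return s


class StrategyRegistry:
    """In-memory registry (hypothetical); a real one would persist to
    data_store:strategies or similar."""

    def __init__(self) -> None:
        self._by_name: dict[str, Strategy] = {}

    def register(self, s: Strategy) -> None:
        self._by_name[s.name] = _validate(s)

    def load_config(self, config: dict) -> None:
        """Register every entry of an already-parsed YAML mapping
        (strategy name -> fields)."""
        for name, fields in config.items():
            self.register(Strategy(name=name, **fields))

    def resolve(self, ref) -> Strategy:
        """A string names a registered strategy; a dict is an inline one.
        Inline strategies are validated but not registered."""
        if isinstance(ref, str):
            return self._by_name[ref]
        return _validate(Strategy(**ref))
```

Keeping inline strategies out of the registry means a one-off prompt cannot shadow a canned strategy name that other agents rely on for cache sharing.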

When to promote to platform

After the agent has been in production use for one or two pipelines and
the pattern is clear, promote just the cache primitive to a platform
feature:

  • A compact: namespace prefix with first-class semantics
  • Content-addressed keys enforced by the platform (key must equal hash of
    (content, strategy_version, strategy_name, level))
  • Optional TTL so old compactions expire automatically
  • Optional metrics: per-strategy hit rate, average compression ratio,
    total tokens saved
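A sketch of the two optional features, assuming an injectable clock so expiry is testable. `TTLCompactCache` and `StrategyMetrics` are hypothetical names; the TTL value is the caller's choice (30 days is floated under Open questions).

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


class TTLCompactCache:
    """Compact store whose entries expire ttl_seconds after being written."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, str]] = {}

    def put(self, key: str, compact: str) -> None:
        self._entries[key] = (self._clock(), compact)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, compact = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: drop it so it gets recomputed
            return None
        return compact

    def purge(self) -> int:
        """Background-GC pass: remove every expired entry, return how many."""
        now = self._clock()
        expired = [k for k, (t, _) in self._entries.items() if now - t >= self._ttl]
        for k in expired:
            del self._entries[k]
        return len(expired)


@dataclass
class StrategyMetrics:
    """Per-strategy counters for hit rate, compression, and tokens saved."""
    hits: int = 0
    misses: int = 0
    input_tokens: int = 0
    output_tokens: int = 0

    def record(self, hit: bool, input_tokens: int, output_tokens: int) -> None:
        # Every lookup, hit or miss, spares the expensive model the difference.
        if hit:
            self.hits += 1
        else:
            self.misses += 1
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def compression_ratio(self) -> float:
        # Token-weighted output/input ratio; lower means more compression.
        return self.output_tokens / self.input_tokens if self.input_tokens else 0.0

    @property
    def tokens_saved(self) -> int:
        return self.input_tokens - self.output_tokens
```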

Then maybe a tools.compact() helper as sugar. Then maybe
auto-compaction in tools.call_llm() ("if context exceeds budget, swap in
the cached compact"). All optional polish on top of the substrate.
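One possible shape for the auto-compaction rule, sketched as a pure function that a `tools.call_llm()` wrapper could consult before sending a request. The name `select_context`, the callback, and the fail-loudly policy are assumptions, not an existing API.

```python
from typing import Callable, Optional


def word_count(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer


def select_context(raw: str, budget_tokens: int,
                   lookup_compact: Callable[[], Optional[str]],
                   count_tokens: Callable[[str], int] = word_count) -> str:
    """Pick what the expensive model should see (hypothetical policy).

    Raw context is used as-is when it fits the budget. Otherwise the cached
    compact is swapped in. If there is no compact, or it is still over
    budget, fail loudly instead of silently truncating.
    """
    if count_tokens(raw) <= budget_tokens:
        return raw
    compact = lookup_compact()
    if compact is None:
        raise LookupError("context exceeds budget and no cached compact exists")
    if count_tokens(compact) > budget_tokens:
        raise ValueError("cached compact is still over budget")
    return compact
```

Raising instead of truncating keeps the lossy step explicit: the only loss is the one a named strategy introduced.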

The premature version is the one where the platform owns the strategy.
Don't do that.

Three concrete uses driving this

  1. ASVS Opus calls. asvs_audit and asvs_bundle would compact-first
    via code_security, then call Opus on the compact namespace. Estimated
    win: 70–80% token reduction on the hot path, with the option for Opus
    to drill into raw via tool call when something looks suspicious.
  2. Consolidation. asvs_consolidate already does finding-extraction;
    that's structurally the same compaction pattern. Generalizing it gives
    consolidation the same tree-of-references behavior other agents would
    have.
  3. General reasoning over large corpora. Any agent that wants to
    "read this 5MB documentation and answer questions about it" becomes
    feasible. Today that's prohibitively expensive.

Phasing

Phase 1 — Agent + a few canned strategies. Build compactor agent.
Ship code_security and narrative_summary strategies. Wire ASVS audit
to call it. Measure cost reduction. ~400 LOC agent + strategy configs.

Phase 2 — Strategy registry as an agent. A compactor_admin agent
for adding/editing strategies. UI lives in data store configuration
screens. ~150 LOC.

Phase 3 — Tree-of-references navigation tool. A compact_drilldown
tool exposed to LLMs so they can ask "show me the raw for
airflow/dags/example.py" mid-reasoning. Requires LLM tool-calling support
in the call_llm path. ~250 LOC.
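A sketch of what the drill-down tool could look like. The definition is shaped like an Anthropic Messages API tool (`name`, `description`, `input_schema`); other providers wrap the same JSON Schema differently. The handler name, the `max_chars` cap, and the error format are hypothetical.

```python
import json

COMPACT_DRILLDOWN_TOOL = {
    "name": "compact_drilldown",
    "description": "Return the raw (Level 0) content behind a compacted entry.",
    "input_schema": {
        "type": "object",
        "properties": {
            "key": {"type": "string", "description": "Original source key, e.g. a file path"},
        },
        "required": ["key"],
    },
}


def handle_compact_drilldown(args: dict, source: dict[str, str],
                             key_map: dict[str, str], max_chars: int = 20_000) -> str:
    """Resolve a drill-down request against the source namespace.

    Only keys present in this compaction's key_map are served, so the model
    cannot wander into unrelated data. Large files are truncated and flagged.
    """
    key = args["key"]
    if key not in key_map:
        return json.dumps({"error": f"unknown key {key!r}"})
    raw = source[key]
    return json.dumps({"key": key, "truncated": len(raw) > max_chars,
                       "content": raw[:max_chars]})
```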

Phase 4 — Platform promotion (conditional). Only if Phases 1–3 have
shown the pattern is right. Promote cache + content-addressed keys to
first-class platform primitives. Roughly 300 LOC on the gofannon side.

Effort overall

Phase 1 alone is probably one focused two-week sprint with measurable cost
reduction on the ASVS pipeline. Phases 2–3 add another month. Phase 4 is
platform-team work after pattern validation.

Open questions

  1. Cache invalidation when source changes. Content hash handles it
    implicitly — different source = different hash = different cache key —
    but old compacts accumulate. TTL? Reference counting? Background GC?
    Recommend TTL (30 days?) as the simplest answer.
  2. Cross-agent cache sharing. A compact computed for agent A's
    strategy is reusable by agent B if they use the same strategy. The
    naming scheme already supports this; just make sure strategies have
    human-meaningful names so reuse is visible.
  3. Quality degradation. Compaction is lossy by design. Measuring
    whether the expensive model's final output got worse when fed compact
    vs raw is essential before relying on this in anything important. Each
    strategy should ship with an eval set that compares Opus-on-raw vs
    Opus-on-compact for representative inputs.
  4. Tool-call drill-down vs flat compact. Phase 1 ships flat (Opus
    gets the compact, never sees raw). Phase 3 adds the drill-down tool.
    Whether the drill-down is worth the complexity depends on Phase 1
    results: if flat compaction alone reaches 80% reduction with no quality
    loss, drill-down may be overkill.
  5. Drill-down or flat for the MVP? Worth deciding before any code:
    does sbp want compaction-with-drilldown specifically, or is flat
    compaction acceptable for Phase 1? If drilldown is essential, Phase 1
    scope grows. If flat is fine for now, Phase 1 ships faster and we
    learn from real use whether drilldown matters.
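For question 3, one simple parity measure is how many of the Opus-on-raw findings survive when Opus runs on the compact. This sketch assumes findings can be normalized to (requirement, location) pairs; that matching rule, and the `Finding`/`parity` names, are simplifying assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    requirement: str  # e.g. an ASVS requirement id such as "V2.1.1"
    location: str     # file path; line-level matching is deliberately ignored


def _order(f: Finding) -> tuple:
    return (f.requirement, f.location)


def parity(raw_findings: set[Finding], compact_findings: set[Finding]) -> dict:
    """Compare the expensive model's findings on raw vs compacted input.

    recall: share of raw-input findings the compacted run also reported,
    the main quality-loss signal. missed/extra list the disagreements.
    """
    matched = raw_findings & compact_findings
    recall = len(matched) / len(raw_findings) if raw_findings else 1.0
    return {
        "recall": recall,
        "missed": sorted(raw_findings - compact_findings, key=_order),
        "extra": sorted(compact_findings - raw_findings, key=_order),
    }
```

Recall on raw findings is the number that decides whether a strategy is safe; `extra` findings need human review, since they may be either noise or issues the raw run buried.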

Why this is high-value

Cost reduction on the hot path is the flywheel: every dollar saved on
existing pipelines funds more pipelines, more iteration, longer runs.
ASVS at full ASF coverage is constrained by Bedrock budget right now. A
70–80% reduction makes "audit every Apache project quarterly" tractable
instead of aspirational. Other LLM-heavy pipelines that can't get off the
ground today become feasible with the same change.

Proposed next steps

  1. Confirm the MVP shape (flat compact vs. drill-down) with sbp.
  2. Spec the code_security strategy prompt and the eval set for measuring
    quality preservation on representative ASVS inputs.
  3. Build the compactor agent and the two MVP strategies.
  4. Wire ASVS audit/bundle to compact-first, measure cost reduction and
    finding parity vs the current uncompacted runs.
  5. If the numbers hold up, ship Phases 2 and 3 as natural follow-ons.
