Context compaction agent — cheap-LLM summarize, expensive-LLM consume

Carry-forward from #14 ("Refactor prompts"). That thread split into three
readings of "telescoping" and converged on the right one being context
compaction with a hierarchical reference tree. The ASVS-specific budget work
landed in `48a15e8` (see [`ASVS/audit-budget.md`][budget] and
[`ASVS/optimization-plan.md`][plan]), but the *general* compaction primitive
sbp asked for never shipped. Filing this issue to revisit it now.

[budget]: https://github.com/apache/tooling-agents/blob/main/ASVS/audit-budget.md
[plan]: https://github.com/apache/tooling-agents/blob/main/ASVS/optimization-plan.md

## Problem

The expensive part of any LLM-heavy pipeline is feeding the expensive model.
Today, an audit pass for a single ASVS section might hand Opus 200–300KB of
raw source code, most of which is irrelevant to the requirement being
audited. The same pattern shows up anywhere that a high-context, high-cost
reasoning call follows a data-gathering step: the gathering step assembled
the world; the reasoning step pays for the world even though it only needs
a fraction of it.

A cheaper model, asked to *first* summarize the gathered context with
respect to the question being asked, produces a fraction of the tokens at a
tiny fraction of the cost. The expensive model then reasons over the
summary, optionally drilling into specific parts of the raw via tool calls.

## What sbp described in #14

> "Use cheaper LLMs to gather the context, medium cost LLMs to collate and
> do a small amount of reasoning about that gathered context (also
> compactifying), and then high cost LLMs to work on the output of that."

> "The idea was to telescope the compressed outputs, so you have a kind of
> compressed tree of references."

So the shape isn't just summarization — it's a *hierarchical* index the
expensive model can navigate:

- **Level 0:** raw source (full files, full data).
- **Level 1:** per-unit abstracts (function signatures + key comments for
  code; per-file summaries for docs; per-row digests for data).
- **Level 2:** group abstracts (per-directory or per-domain summaries).
- **Level N:** the entry-point index Opus actually receives.

The expensive model reads the top of the tree, picks which subtree it
needs, and a tool call fetches the next level down. Most calls never need
to descend to Level 0.

Lossless compaction (claw-compactor style, ~25% reduction) doesn't get
anywhere near the cost reduction needed. The win is lossy: discarding
everything not relevant to the question at hand, with the raw still
available behind a tool call if the expensive model asks for it.

## Why this matters now

- **ASVS specifically.** The pipeline today sends ~150–300KB of source per
  Opus call. Half of any file is irrelevant to whichever ASVS section is
  being audited. A 70–80% reduction is realistic; that's a several-X cost
  reduction on the hottest path.
- **Demo budget control.** A user wants to demo an agent on a real
  codebase but doesn't want to pay full Opus rates for every iteration.
  Compaction-with-cache makes the first run expensive and every subsequent
  iteration cheap.
- **Quality-curve experiments.** Comparing "Opus on raw" vs "Opus on
  compacted" is the kind of experiment that's hard to run today because
  the compaction step doesn't exist as a reusable thing.

## Recommendation: build it as an agent first

The compaction *strategy* is domain-specific: code wants signature
extraction, audit reports want finding-extraction, narrative wants
narrative summarization. Different users will want different strategies.
Different agents in the same pipeline will want different strategies.
Putting strategy in user agent code keeps the platform simple and the
iteration fast — same argument the original #14 thread reached for
telescoping itself.

What the platform should provide is the *substrate*: caching with
content-addressed keys, so the same source → strategy → compact mapping is
computed once and reused. That substrate already exists, just not formally
— `data_store` with a disciplined naming convention does it today. A
platform-level promotion is small enough that it can wait until the agent
pattern has shaken out.

## MVP — entirely as an agent

A `compactor` agent with this contract:

```python
async def run(input_dict, tools):
    source_namespace = input_dict["source_namespace"]
    keys = input_dict.get("keys") or "*"           # specific keys or all
    strategy = input_dict["strategy"]              # named strategy
    level = input_dict.get("level", 1)             # which level to produce
    purpose_hint = input_dict.get("purpose", "")   # what's being audited/asked
```

Returns:

```python
{
  "compact_namespace": "compact:files:apache/airflow:v1:security_skeleton:abc12345",
  "level": 1,
  "key_map": {                                   # original key → compact key
    "airflow/dags/example.py": "summary_of_example_py",
    # ...
  },
  "stats": {"input_tokens": 89432, "output_tokens": 18211, "saved_pct": 79.6}
}
```

The agent computes the content+strategy hash, checks `data_store` for
existing compacts under `compact:{source}:{version}:{strategy}:{hash}`,
returns immediately if cached. Otherwise calls the cheap LLM with the
strategy's prompt template against each input chunk, stores the results,
and returns the namespace pointer.

Other agents in the pipeline call this once at the start of a pass, get
back a namespace pointer, and feed *that namespace* to Opus instead of the
raw. No platform change needed.

## Naming convention for cache keys

```
compact:{source_namespace}:{strategy_version}:{strategy_name}:{content_hash}
                                                              └─ first 8 chars
                                                                 of sha256 of
                                                                 input content
                                                                 + strategy
                                                                 + level
```

Hash includes the source content, strategy version, strategy name, and
level so that any of those changing produces a fresh cache key without
invalidating siblings. Same pattern the relevance-filter cache uses today
with its policy hash.

## Strategy registry — also as an agent, not platform

Each strategy is a named (prompt template, target model, target token
budget) tuple. Ship a few canned ones:

| Strategy name | What it does | Cheap LLM | Notes |
|---|---|---|---|
| `code_signature` | Functions + docstrings + critical patterns | Haiku | Skips bodies of pure-utility functions |
| `code_security` | Security-relevant code only (auth, validation, crypto, IO) | Haiku | Hot path for ASVS |
| `audit_findings` | Extracts finding objects from audit reports | Sonnet | Generalize from what `asvs_consolidate` does today |
| `narrative_summary` | Standard text summarization | Haiku | Generic fallback |

Users can register their own strategies by adding entries to a small YAML
config, or by passing the strategy inline as a prompt template. The agent
reads from `data_store:strategies` or similar.

## When to promote to platform

After the agent has been in production use for one or two pipelines and
the pattern is clear, promote *just the cache primitive* to a platform
feature:

- A `compact:` namespace prefix with first-class semantics
- Content-addressed keys enforced by the platform (key must equal hash of
  `(content, strategy_version, strategy_name, level)`)
- Optional TTL so old compactions expire automatically
- Optional metrics: per-strategy hit rate, average compression ratio,
  total tokens saved

*Then* maybe a `tools.compact()` helper as sugar. *Then* maybe
auto-compaction in `tools.call_llm()` ("if context exceeds budget, swap in
the cached compact"). All optional polish on top of the substrate.

The premature version is the one where the platform owns the strategy.
Don't do that.

## Three concrete uses driving this

1. **ASVS Opus calls.** `asvs_audit` and `asvs_bundle` would compact-first
   via `code_security`, then call Opus on the compact namespace. Estimated
   win: 70–80% token reduction on the hot path, with the option for Opus
   to drill into raw via tool call when something looks suspicious.
2. **Consolidation.** `asvs_consolidate` already does finding-extraction;
   that's structurally the same compaction pattern. Generalizing it gives
   consolidation the same tree-of-references behavior other agents would
   have.
3. **General reasoning over large corpora.** Any agent that wants to
   "read this 5MB documentation and answer questions about it" becomes
   feasible. Today that's prohibitively expensive.

## Phasing

**Phase 1 — Agent + a few canned strategies.** Build `compactor` agent.
Ship `code_security` and `narrative_summary` strategies. Wire ASVS audit
to call it. Measure cost reduction. ~400 LOC agent + strategy configs.

**Phase 2 — Strategy registry as an agent.** A `compactor_admin` agent
for adding/editing strategies. UI lives in data store configuration
screens. ~150 LOC.

**Phase 3 — Tree-of-references navigation tool.** A `compact_drilldown`
tool exposed to LLMs so they can ask "show me the raw for
`airflow/dags/example.py`" mid-reasoning. Requires LLM tool-calling support
in the call_llm path. ~250 LOC.

**Phase 4 — Platform promotion (conditional).** Only if Phases 1–3 have
shown the pattern is right. Promote cache + content-addressed keys to
first-class platform primitives. Roughly ~300 LOC on the gofannon side.

## Effort overall

Phase 1 alone is probably one focused two-week sprint with measurable cost
reduction on the ASVS pipeline. Phases 2–3 add another month. Phase 4 is
platform-team work after pattern validation.

## Open questions

1. **Cache invalidation when source changes.** Content hash handles it
   implicitly — different source = different hash = different cache key —
   but old compacts accumulate. TTL? Reference counting? Background GC?
   Recommend TTL (30 days?) as the simplest answer.
2. **Cross-agent cache sharing.** A compact computed for agent A's
   strategy is reusable by agent B if they use the same strategy. The
   naming scheme already supports this; just make sure strategies have
   human-meaningful names so reuse is visible.
3. **Quality degradation.** Compaction is lossy by design. Measuring
   whether the expensive model's final output got worse when fed compact
   vs raw is essential before relying on this in anything important. Each
   strategy should ship with an eval set that compares Opus-on-raw vs
   Opus-on-compact for representative inputs.
4. **Tool-call drill-down vs flat compact.** Phase 1 ships flat (Opus
   gets the compact, never sees raw). Phase 3 adds the drill-down tool.
   Whether the drill-down is worth the complexity depends on Phase 1
   results — if 80% reduction with no quality loss, drill-down may be
   overkill.
5. **Drill-down or flat for the MVP?** Worth deciding before any code:
   does sbp want compaction-with-drilldown specifically, or is flat
   compaction acceptable for Phase 1? If drilldown is essential, Phase 1
   scope grows. If flat is fine for now, Phase 1 ships faster and we
   learn from real use whether drilldown matters.

## Why this is high-value

Cost reduction on the hot path is the flywheel: every dollar saved on
existing pipelines funds more pipelines, more iteration, longer runs.
ASVS at full ASF coverage is constrained by Bedrock budget right now. A
70–80% reduction makes "audit every Apache project quarterly" tractable
instead of aspirational. Other LLM-heavy pipelines that can't get off the
ground today become feasible with the same change.

## Proposed next steps

1. Confirm the MVP shape (flat compact vs. drill-down) with sbp.
2. Spec the `code_security` strategy prompt and the eval set for measuring
   quality preservation on representative ASVS inputs.
3. Build the `compactor` agent and the two MVP strategies.
4. Wire ASVS audit/bundle to compact-first, measure cost reduction and
   finding parity vs the current uncompacted runs.
5. If the numbers hold up, ship strategy 2 and 3 as natural follow-ons.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Context compaction agent — cheap-LLM summarize, expensive-LLM consume #44

Problem

What sbp described in #14

Why this matters now

Recommendation: build it as an agent first

MVP — entirely as an agent

Naming convention for cache keys

Strategy registry — also as an agent, not platform

When to promote to platform

Three concrete uses driving this

Phasing

Effort overall

Open questions

Why this is high-value

Proposed next steps

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Strategy name	What it does	Cheap LLM	Notes
`code_signature`	Functions + docstrings + critical patterns	Haiku	Skips bodies of pure-utility functions
`code_security`	Security-relevant code only (auth, validation, crypto, IO)	Haiku	Hot path for ASVS
`audit_findings`	Extracts finding objects from audit reports	Sonnet	Generalize from what `asvs_consolidate` does today
`narrative_summary`	Standard text summarization	Haiku	Generic fallback

Context compaction agent — cheap-LLM summarize, expensive-LLM consume #44

Description

Problem

What sbp described in #14

Why this matters now

Recommendation: build it as an agent first

MVP — entirely as an agent

Naming convention for cache keys

Strategy registry — also as an agent, not platform

When to promote to platform

Three concrete uses driving this

Phasing

Effort overall

Open questions

Why this is high-value

Proposed next steps

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions