Skip to content

feat(llm-access): proactive Kiro auto-compaction gate using real token usage#24

Merged
acking-you merged 2 commits into
masterfrom
feat/kiro-proactive-compaction-gate
Jun 2, 2026
Merged

feat(llm-access): proactive Kiro auto-compaction gate using real token usage#24
acking-you merged 2 commits into
masterfrom
feat/kiro-proactive-compaction-gate

Conversation

@acking-you
Copy link
Copy Markdown
Owner

Summary

Anthropic clients (Claude Code) only compact a conversation when the model's context window is nearly full. Against Kiro's 1M-context models that trigger lands at ~80–92% (~800k–920k tokens), and at that point the client's reactive-compaction summary request — which itself carries the whole history — is so close to the real ceiling that it is slow and frequently fails with input + max_tokens > limit, surfacing the terminal Context limit reached instead of a clean compaction.

This adds a proxy-side gate that returns a Prompt is too long: N > M response before dispatching upstream once a request crosses a configurable token trigger. The client then compacts while there's still real headroom, so its summary request fits the window and succeeds quickly. No client changes needed.

Root cause

Prompt is too long: <actual> tokens > <limit> tokens is the error shape that drives the client's reactive compaction. Today it only appears once Kiro itself rejects the request near 1M — too late for the summary to fit. Firing it early, at a soft trigger below the real window, is the whole fix.

What changed

  • Configurable trigger: LLM_ACCESS_KIRO_COMPACT_TRIGGER_TOKENS (default 780000; 0/negative disables the gate — the model's real window still applies).
  • Accuracy via the anchor cache (the key design point): the gate must not threshold on the local count_all_tokens estimate alone — it drifts from Kiro's true contextUsage (bridge scaffolding, tokenizer differences, the 15k/ratio rule in resolve_input_tokens_with_threshold). So the real resolved input-token count is stored per conversation-prefix in the anchor index at record_success time and recovered pre-dispatch next turn; the gate uses max(local_estimate, recovered_real_prev).
  • No compaction deadlock: the anchor index is keyed by conversation prefix — after a compaction the prefix changes, recovery misses, and the gate falls back to the (now small) local count. Compaction-summary requests are additionally exempt (detected via the summary-instruction signature) so the reactive loop can never gate its own summary request.

Files

  • llm-access-kiro/cache_sim/anchor_index: ConversationAnchorEntry.real_input_tokens + get_real_input_tokens (peek, no recency bump).
  • llm-access-kiro/cache_sim/simulator: store on record_success_from_runtime_projection, read via recover_real_input_tokens_from_runtime_projection.
  • llm-access/provider/errors: trigger parser + kiro_proactive_compact_response builder + limit-parameterized message core.
  • llm-access/provider/kiro_dispatch: websearch-path gate (local estimate) + main-path gate (max(local, recovered real)) after anchor recovery, before the upstream call; request_is_compaction_summary detector.
  • llm-access/provider/stream_guards: compute + store resolved real input tokens at both record-success sites.

Verification

  • cargo clippy -p llm-access-kiro -p llm-access --all-targets -- -D warnings → clean
  • cargo test → llm-access-kiro 168, llm-access 287 (incl. 15 new: 4 anchor-index, 6 errors-config, 5 compaction-detector)
  • rustfmt on the 5 changed files only; deps/lance & deps/lancedb untouched

Notes for reviewers

  • LLM_ACCESS_KIRO_COMPACT_TRIGGER_TOKENS is read once via LazyLock; documented on the DEFAULT_KIRO_COMPACT_TRIGGER_TOKENS constant (matching how sibling provider tunables are documented — code-local, not in the runbook).
  • The recovered value is the previous turn's real consumption; max(...) with the current local estimate keeps the newest delta in play. First turn (no stored value) falls back to local — correct, since early turns are small.

🤖 Generated with Claude Code

…n usage

Claude Code (and other Anthropic clients) only compact a conversation
when the model's context window is nearly full. Against Kiro's 1M-context
models that trigger lands at ~80-92% (~800k-920k tokens), at which point
the reactive-compaction summary request — which itself carries the whole
history — is so close to the real ceiling that it is slow and frequently
fails with `input + max_tokens > limit`, surfacing the terminal
"Context limit reached" instead of a clean compaction.

This adds a proxy-side gate that returns a `Prompt is too long: N > M`
response *before* dispatching upstream once a request crosses a
configurable token trigger (default 780k via
`LLM_ACCESS_KIRO_COMPACT_TRIGGER_TOKENS`; 0/negative disables). The
client then compacts while there is still real headroom, so its summary
request fits the window and succeeds quickly. No client changes needed.

Accuracy: the gate must not threshold on the local `count_all_tokens`
estimate alone — it drifts from Kiro's true contextUsage (bridge
scaffolding, tokenizer differences, the 15k/ratio rule in
`resolve_input_tokens_with_threshold`). So the real resolved input-token
count is now stored per conversation-prefix in the anchor index at
`record_success` time and recovered pre-dispatch next turn; the gate uses
`max(local_estimate, recovered_real_prev)`. The anchor index is keyed by
conversation prefix, so after a compaction the prefix changes, recovery
misses, and the gate falls back to the (now small) local count — no
deadlock. Compaction-summary requests are always exempt (detected via the
summary-instruction signature) so the reactive loop can never gate its
own summary request.

Changes:
- llm-access-kiro/cache_sim/anchor_index: ConversationAnchorEntry gains
  `real_input_tokens: Option<i32>`; `insert` takes it; `get_real_input_tokens`
  peeks it without bumping recency (+ 4 tests).
- llm-access-kiro/cache_sim/simulator: `record_success_from_runtime_projection`
  stores the real value; `recover_real_input_tokens_from_runtime_projection`
  reads it. Test-only `record_success` passes None.
- llm-access/provider/errors: `LLM_ACCESS_KIRO_COMPACT_TRIGGER_TOKENS`
  parser (default 780k), `kiro_proactive_compact_response` builder,
  limit-parameterized message core (+ 6 tests).
- llm-access/provider/kiro_dispatch: websearch-path gate (local estimate)
  + main-path gate (max(local, recovered real)) after anchor recovery,
  before the upstream call; `request_is_compaction_summary` detector
  (+ 5 tests).
- llm-access/provider/stream_guards: compute + store the resolved real
  input tokens at both (stream + non-stream) record-success sites.

Verified: cargo clippy -D warnings on llm-access-kiro + llm-access;
168 + 287 tests pass; rustfmt on the 5 changed files; deps submodules
untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a proactive auto-compaction gate for Kiro requests to trigger client-side conversation compaction before reaching the model's context window limit. It tracks and retrieves the previous turn's real input-token count using the conversation anchor index, configures a default trigger threshold of 780,000 tokens, and exempts compaction summary requests to prevent deadlocks. The review feedback highlights two key improvement opportunities: optimizing the compaction summary detection to avoid string allocations on large payloads by using an allocation-free case-insensitive search, and replacing duration_since with saturating_duration_since in the token retrieval logic to prevent potential panics from clock drift while also refactoring cache borrows.

Comment on lines +1440 to +1443
fn text_is_compaction_summary_prompt(text: &str) -> bool {
let lower = text.to_ascii_lowercase();
lower.contains("detailed summary of") && lower.contains("conversation")
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The text_is_compaction_summary_prompt function is called recursively on every string value in the request payload via json_value_contains_compaction_summary. Calling text.to_ascii_lowercase() on every string block allocates a new String of the same size and copies the data. For large payloads (which can be up to 1M tokens / several megabytes of text), this causes significant CPU overhead and memory churn.

Since the search patterns (detailed summary of and conversation) are purely ASCII, we can perform a case-insensitive substring search without any allocations by using slice::eq_ignore_ascii_case on byte windows.

fn contains_ignore_ascii_case(haystack: &str, needle: &str) -> bool {
    if needle.is_empty() {
        return true;
    }
    haystack
        .as_bytes()
        .windows(needle.len())
        .any(|window| window.eq_ignore_ascii_case(needle.as_bytes()))
}

fn text_is_compaction_summary_prompt(text: &str) -> bool {
    contains_ignore_ascii_case(text, "detailed summary of")
        && contains_ignore_ascii_case(text, "conversation")
}

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 3d35b79. Replaced the per-string to_ascii_lowercase() allocation with an allocation-free contains_ignore_ascii_case (byte-window eq_ignore_ascii_case), exactly as suggested. Also tightened the scan to only the active request (last message + system) per the codex P2 below, so the haystack is bounded anyway.

Comment on lines +71 to +84
pub fn get_real_input_tokens(
&mut self,
anchor: &str,
now: Instant,
ttl: Duration,
) -> Option<i32> {
let cache = self.cache.as_mut()?;
let entry = cache.peek(anchor)?;
if now.duration_since(entry.last_touched_at) > ttl {
cache.pop(anchor);
return None;
}
entry.real_input_tokens
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation of get_real_input_tokens performs a peek lookup, borrows the entry, and then conditionally calls cache.pop which requires a mutable borrow of the cache. While NLL (Non-Lexical Lifetimes) can sometimes resolve this, it is safer and more idiomatic to extract the required values first to completely release the borrow on the cache before performing any mutation.

Additionally, using now.duration_since(entry.last_touched_at) can panic if now is slightly earlier than entry.last_touched_at due to subtle monotonic clock drifts or virtualization/testing environments. Using saturating_duration_since is a safer alternative to prevent potential panics.

    pub fn get_real_input_tokens(
        &mut self,
        anchor: &str,
        now: Instant,
        ttl: Duration,
    ) -> Option<i32> {
        let cache = self.cache.as_mut()?;
        let (expired, real_tokens) = match cache.peek(anchor) {
            Some(entry) => (
                now.saturating_duration_since(entry.last_touched_at) > ttl,
                entry.real_input_tokens,
            ),
            None => return None,
        };
        if expired {
            cache.pop(anchor);
            None
        } else {
            real_tokens
        }
    }

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 3d35b79. recover_token_counts (renamed from get_real_input_tokens) now extracts (expired, token_counts) from the peek borrow first, releasing it before the conditional pop, and uses saturating_duration_since to avoid the monotonic-clock panic. (The entry now carries both real+local counts for the current-turn-delta fix below.)

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab4eb27ddd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

trigger,
"proactively returning prompt-too-long to trigger client compaction"
);
return kiro_proactive_compact_response(effective_input_tokens, trigger);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Record proactive 413s before returning

When a normal Kiro request crosses the proactive compaction trigger, this branch returns a gateway-generated 413 before updating usage_meta or calling any usage recorder. That makes these intentional Prompt is too long responses disappear from the usage/error audit trail, unlike the nearby conversion preflight failures that call record_kiro_preflight_failure and upstream 413s that call record_kiro_usage; the websearch gate at line 246 has the same bypass. Capture the error metadata and record the rejection before returning so operators can see when the new gate fires.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 3d35b79. Both gates (websearch + main path) now capture the error message/body and call record_kiro_preflight_failure before returning the 413, so the proactive rejection appears in the usage/error audit trail like the sibling conversion preflight failures. That recorder is audit-only (zero usage summary + status record, no cooldown/failover), which is the right semantics for an intentional gate.

Comment on lines +411 to +412
let effective_input_tokens = request_input_tokens.max(recovered_real_tokens);
if effective_input_tokens >= trigger && !request_is_compaction_summary(&payload) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include the current turn when using recovered real tokens

For conversations where Kiro's real contextUsage is higher than the local estimate, max(request_input_tokens, recovered_real_tokens) can miss the turn that actually crosses the proactive threshold: e.g. a cached previous turn at 760k real tokens plus a new 30k+ user paste may still have a local total below 780k, so this branch dispatches upstream instead of triggering compaction. The stored value represents the previous turn, so the gate needs to add the current-turn delta (or otherwise compare a current real-token estimate), not just take the max, otherwise large follow-up turns can still hit Kiro's hard limit before the proxy fires.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 3d35b79. The anchor now caches both the real and local counts per turn (AnchorTokenCounts), and the gate estimates the current real consumption as real_prev + max(0, local_now - local_prev), gated against max(local_now, estimated_real). So your 760k-real-prev + 30k-new-paste case now estimates ~790k and trips the 780k trigger even though local_now alone is below it. Covered by effective_tokens_adds_current_delta_to_recovered_real.

Comment on lines +1464 to +1467
payload
.messages
.iter()
.any(|message| json_value_contains_compaction_summary(&message.content))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Limit compaction-summary detection to the active request

Because this scans every historical message, any earlier user/assistant text containing phrases like “detailed summary of ... conversation” marks all subsequent turns as compaction summaries and permanently exempts that conversation from the proactive gate. In that scenario a later large normal turn near the trigger will be sent upstream and can still hit Kiro's hard context failure; the exemption should be based on the current summary instruction (or another precise marker), not arbitrary old history.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 3d35b79. request_is_compaction_summary now scans only the system prompt and the last message, not the whole history, so an earlier turn that quoted "detailed summary of ... conversation" no longer permanently exempts the conversation. Regression test earlier_history_summary_phrase_does_not_exempt_later_turn asserts a later normal turn after such history is NOT exempted.

…onfig

Addresses the PR #24 review and migrates the proactive-compaction trigger
from an env var to the Postgres-backed AdminRuntimeConfig so it is editable
from the frontend admin UI (no redeploy).

CR fixes:
- (gemini, high) compaction-summary detection no longer allocates a
  lowercased copy of each string block; `contains_ignore_ascii_case` does an
  allocation-free `eq_ignore_ascii_case` byte-window search. MB-scale payloads
  no longer churn memory.
- (gemini, med) `recover_token_counts` (was `get_real_input_tokens`) extracts
  values before the conditional `pop` (clean borrow release) and uses
  `saturating_duration_since` to avoid a monotonic-clock panic.
- (codex, P2) proactive 413s are now recorded via
  `record_kiro_preflight_failure` (audit-only, no cooldown) on both gates, so
  the rejection shows in the usage/error trail like sibling preflight failures.
- (codex, P2) the gate no longer uses `max(local, recovered_real)` which
  missed the current turn. The anchor now caches both real and local counts
  per turn; the gate estimates `real_prev + max(0, local_now - local_prev)` so
  a large follow-up turn still trips the trigger.
- (codex, P2) compaction-summary detection scans only the active request (the
  last message + system), not all history, so an earlier turn that merely
  quoted the instruction can no longer permanently exempt the conversation.

Config migration (env -> PG -> frontend):
- `LLM_ACCESS_KIRO_COMPACT_TRIGGER_TOKENS` env var removed; replaced by
  `AdminRuntimeConfig.kiro_compact_trigger_tokens` (default 780000, 0 = off),
  propagated to `ProviderKiroRoute.compact_trigger_tokens` and the Valkey
  RouteSnapshot. Migration 0019 + 0001_init column (CHECK >= 0). Admin surface
  validates with `validate_max` (0 allowed, cap 1,000,000). Frontend api.rs +
  /admin/kiro-gateway gain a "Compaction Trigger Tokens (0 disables)" field.
- The gate reads `route.compact_trigger_tokens` at both gate points.

Verified: clippy -D warnings on core/store/migrations/kiro/llm-access;
frontend wasm check; kiro 168 + core 10 + llm-access 291 tests pass (incl. new
anchor/gate/admin tests); rustfmt on the 24 changed files; deps submodules
untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@acking-you
Copy link
Copy Markdown
Owner Author

CR addressed + trigger moved to PG runtime config (3d35b79)

All 5 review findings fixed (replied inline): allocation-free compaction detection (gemini high), recover_token_counts borrow + saturating_duration_since (gemini med), proactive-413 audit recording on both gates (codex P2), current-turn delta in the gate estimate (codex P2), and last-message+system-only detection (codex P2).

Beyond the CR — config now lives in Postgres, editable from the frontend. The LLM_ACCESS_KIRO_COMPACT_TRIGGER_TOKENS env var is gone; the trigger is now AdminRuntimeConfig.kiro_compact_trigger_tokens (default 780000, 0 disables), persisted in llm_runtime_config (migration 0019 + 0001_init column, CHECK >= 0), propagated to ProviderKiroRoute.compact_trigger_tokens + the Valkey RouteSnapshot, and exposed as a "Compaction Trigger Tokens (0 disables)" field on /admin/kiro-gateway. The gate reads route.compact_trigger_tokens at both gate points, so changes take effect without redeploy.

The current-turn-delta fix (codex P2) is what made the accuracy work concrete: the anchor caches both the real (contextUsage-derived) and local (count_all_tokens) counts per turn, and the gate estimates real_prev + max(0, local_now - local_prev).

Verified: clippy -D warnings on core/store/migrations/kiro/llm-access; frontend wasm check; kiro 168 + core 10 + llm-access 291 tests (incl. new anchor/gate/admin tests); rustfmt on the 24 changed files; deps submodules untouched.

@acking-you acking-you merged commit a14ccd1 into master Jun 2, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant