feat: [ENG-2518] batch abstract generation across queued files #580
danhdoan merged 4 commits into proj/curation-enhancement
Conversation
Collapse the abstract queue's per-file LLM calls (2N today: one L0 .abstract.md + one L1 .overview.md per file) into batched calls of up to BATCH_SIZE_CAP=5 files per cycle, with two parallel LLM calls per cycle (1 batched L0 + 1 batched L1).
- generateFileAbstractsBatch(items, generator) in abstract-generator.ts: XML-tagged input/output, tolerant regex parser, missing paths get empty strings.
- AbstractGenerationQueue: buffer-until-cap-or-drain — enqueue only triggers scheduleNext when pending >= cap or drainRequested. drain() sets the flag to flush partial batches at curate-end.
- Per-batch failure → re-enqueue each item individually with its own attempts counter; existing 3x exponential backoff preserved.
- Per-file content cap (20 KB) matches the non-batched path.

Real-LLM verification on ByteRover paid-tier:
- 5-pack (5 files): 1 batch, 2 LLM calls (vs 10 non-batched, −80%)
- 6-distinct (6 files): 2 batches (5+1), 4 LLM calls (vs 12, −67%)
- 10-pack: per-batch failure + retry path exercised end-to-end
Claude finished @RyanNg1403's task in 3m 53s — PR Review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e8fa123258
- Dedup duplicate contextPath inputs in generateFileAbstractsBatch (codex P1): when a curate touches the same file twice before drain, both queue items had the same path; the prompt emitted two `<file>` blocks the model could answer in either order, and the tag parser's Map collapsed them non-deterministically. Now collapses by path, keeping the LATEST content (the FIFO queue's last-seen item) so abstracts reflect the on-disk state, not an intermediate one.
- CDATA-wrap file content in batched prompts (Claude review): docs that describe XML/HTML/JSX (perfectly normal for technical knowledge) can contain `</document>` or `</file>` markers that would terminate the envelope and conflate files. Wrap in CDATA with the standard `]]]]><![CDATA[>` in-section escape for any nested terminators.
- Capture this.generator in a local const at the top of processNext so type narrowing survives the await boundary, and document the batched-retry semantics inline (batch failure consumes one attempt per item, not one per call).

Tests:
- abstract-generator-batch: duplicate-path dedup keeps latest content
- abstract-generator-batch: CDATA wrap survives `</document>` markers
- abstract-generator-batch: nested `]]>` is escaped to keep CDATA valid
Claude finished @RyanNg1403's task in 4m 11s — PR Review
Pushed c17a0fa addressing the Claude review-agent feedback, plus the codex P1 (duplicate contextPath dedup).
- Re-anchor parseBatchedTags on `<file path>` openers instead of `</file>` closers (real correctness): the model output is plain text, not CDATA-wrapped, so an overview that mentions `</file>` literally in prose (perfectly normal for docs about XML, JSX, or build systems) would prematurely terminate the outer match and orphan the inner tag. The new strategy collects all opener positions, then slices each opener's body up to the next opener (or end-of-string) — a literal `</file>` in prose has no special meaning to the parser.
- Improve the CDATA test to capture BOTH L0 and L1 prompts and assert the wrap independently on each builder. The previous test only validated the last call's prompt, so a future refactor that forgot wrapCdata in one builder would slip past.
- Add a parser test for literal `</file>` inside model overview prose to pin the new robustness invariant.
- Move the "buffers below cap", "processes up to BATCH_SIZE_CAP", and "splits oversized backlogs" tests out of the `status file` describe block into a sibling `batching behaviour` block — these verify scheduling, not status I/O, and the misgrouping made the suite harder to navigate.
Claude finished @RyanNg1403's task in 2m 29s — PR Review

Claude finished @danhdoan's task in 2m 29s — PR Review
* feat: [ENG-2518] batch abstract generation across queued files
Collapse the abstract queue's per-file LLM calls (2N today: one L0
.abstract.md + one L1 .overview.md per file) into batched calls of up
to BATCH_SIZE_CAP=5 files per cycle, with two parallel LLM calls per
cycle (1 batched L0 + 1 batched L1).
- generateFileAbstractsBatch(items, generator) in abstract-generator.ts:
XML-tagged input/output, tolerant regex parser, missing paths get
empty strings.
- AbstractGenerationQueue: buffer-until-cap-or-drain — enqueue only
triggers scheduleNext when pending >= cap or drainRequested. drain()
sets the flag to flush partial batches at curate-end.
- Per-batch failure → re-enqueue each item individually with its own
attempts counter; existing 3x exponential backoff preserved.
- Per-file content cap (20 KB) matches the non-batched path.
Real-LLM verification on ByteRover paid-tier:
5-pack (5 files): 1 batch, 2 LLM calls (vs 10 non-batched, -80%)
6-distinct (6 files): 2 batches (5+1), 4 LLM calls (vs 12, -67%)
10-pack: per-batch failure + retry path exercised end-to-end
* feat: [ENG-2485] defer summary cascade to dream
Move the LLM-bound parent-summary cascade off curate's hot path. Curate
now enqueues changed paths to a stale-summary queue persisted in
.brv/dream-state.json; the next dream cycle drains the queue, merges
with its own snapshot diff (A ∪ B), and runs propagateStaleness once
per dirty directory.
- DreamState schema gains staleSummaryPaths[]; new service methods
enqueueStaleSummaryPaths (RMW with path-level dedup, oldest enqueuedAt
preserved) and drainStaleSummaryPaths (atomic snapshot-and-clear).
- curate-executor swaps the inline propagateStaleness call for an
enqueueStaleSummaryPaths call. Manifest rebuild stays inline (pure
filesystem, no LLM).
- dream-executor's step 5 drains the queue, unions with the diff-set,
and runs propagateStaleness once per unique path. On propagation
failure the catch re-enqueues the drained snapshot so atomic drain
doesn't lose work.
- runStaleSummaryPropagation extracted as a protected seam; preserves
the ENG-2100 parentTaskId threading so summary regenerations still
share one billing session with the parent dream task.
Validation:
- validate.mjs: 15/15 BM25 invariants hold under stale _index.md
- Real-LLM E2E: 8 curates / 2 dreams on ByteRover paid-tier
- dedup, multi-curate accumulation, drain, re-enqueue on failure
- Full unit suite: 6981 passing
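The enqueue/drain semantics above can be sketched as pure functions. This is an in-memory stand-in (the real service persists to .brv/dream-state.json via an RMW update); the entry shape mirrors the staleSummaryPaths[] schema, everything else is illustrative.

```typescript
interface StaleSummaryEntry {
  path: string;
  enqueuedAt: number;
}

interface DreamStateSlice {
  staleSummaryPaths: StaleSummaryEntry[];
}

function enqueueStaleSummaryPaths(
  state: DreamStateSlice,
  paths: string[],
  now: number,
): DreamStateSlice {
  const byPath = new Map<string, StaleSummaryEntry>();
  for (const e of state.staleSummaryPaths) byPath.set(e.path, e);
  for (const path of new Set(paths)) {
    // Path-level dedup: an already-queued path keeps its ORIGINAL enqueuedAt.
    if (!byPath.has(path)) byPath.set(path, { path, enqueuedAt: now });
  }
  return { staleSummaryPaths: [...byPath.values()] };
}

// Atomic snapshot-and-clear: the caller gets the drained entries and the
// queue is emptied in the same step, so no entry can be drained twice.
function drainStaleSummaryPaths(
  state: DreamStateSlice,
): { drained: StaleSummaryEntry[]; state: DreamStateSlice } {
  return { drained: state.staleSummaryPaths, state: { staleSummaryPaths: [] } };
}
```

On propagation failure, re-enqueueing the drained snapshot (as the dream-executor catch does) simply feeds `drained` back through `enqueueStaleSummaryPaths`, so atomic drain loses no work.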
* refactor: [ENG-2485] address review-agent feedback on PR #579
- Bypass dream activity gate when the stale-summary queue has work
(codex P1): low-activity projects with 1-2 curates would otherwise
strand queued paths indefinitely. dream-trigger.ts now treats a
populated queue as enough reason to drain.
- Dedup within-batch input in enqueueStaleSummaryPaths: a caller passing
a non-unique array no longer produces duplicate queue entries.
- Split the curate-executor post-processing try block: snapshot diff,
enqueue, and manifest rebuild now have independent fail-open scopes
so a transient queue write error cannot skip the (no-LLM) manifest
scan that keeps newly-curated leaves immediately discoverable.
- Restore the runOperations JSDoc that the cherry-pick orphaned in
front of runStaleSummaryPropagation.
- Tighten StaleSummaryEntrySchema to z.string().min(1) so empty paths
cannot persist into dream-state.json.
Tests:
- dream-trigger: bypass-when-queue-populated + still-block-when-empty
- dream-state-service: within-batch duplicate dedup
* refactor: [ENG-2518] address review-agent feedback on PR #580
- Dedup duplicate contextPath inputs in generateFileAbstractsBatch
(codex P1): when a curate touches the same file twice before drain,
both queue items had the same path; the prompt emitted two `<file>`
blocks the model could answer in either order, and the tag parser's
Map collapsed them non-deterministically. Now collapses by path
keeping the LATEST content (the FIFO queue's last-seen item) so
abstracts reflect the on-disk state, not an intermediate one.
- CDATA-wrap file content in batched prompts (Claude review): docs that
describe XML/HTML/JSX (perfectly normal for technical knowledge) can
contain `</document>` or `</file>` markers that would terminate the
envelope and conflate files. Wrap in CDATA with the standard
`]]]]><![CDATA[>` in-section escape for any nested terminators.
- Capture this.generator in a local const at the top of processNext
so type narrowing survives the await boundary, and document the
batched-retry semantics inline (batch failure consumes one attempt
per item, not one per call).
Tests:
- abstract-generator-batch: duplicate-path dedup keeps latest content
- abstract-generator-batch: CDATA wrap survives `</document>` markers
- abstract-generator-batch: nested `]]>` is escaped to keep CDATA valid
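The CDATA wrap and latest-wins dedup can be sketched as two small helpers. The escape string is the standard XML in-section escape quoted above; the item shape is an assumption for illustration.

```typescript
// Wrap prompt bodies in CDATA; split any literal "]]>" across two CDATA
// sections so it cannot terminate the envelope early.
function wrapCdata(content: string): string {
  const escaped = content.split("]]>").join("]]]]><![CDATA[>");
  return `<![CDATA[${escaped}]]>`;
}

// Dedup-by-path keeping the LATEST content: later items overwrite earlier
// ones, mirroring the FIFO queue's last-seen item winning.
function dedupByPath(
  items: { contextPath: string; content: string }[],
): { contextPath: string; content: string }[] {
  const byPath = new Map<string, { contextPath: string; content: string }>();
  for (const item of items) byPath.set(item.contextPath, item);
  return [...byPath.values()];
}
```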
* refactor: [ENG-2485] address second-pass review feedback on PR #579
- Reword the Step 1 snapshot-failure catch in dream-executor: the old
comment claimed "atomic drain already removed entries", but drain
doesn't run in this code path. Clarify that skipping drain here is
safer than draining-and-failing because the atomic-drain RMW clears
entries synchronously.
- Add a checkEligibility test for the stale-summary queue bypass so
both tryStartDream and checkEligibility pin the invariant. They
delegate to the same checkGates1to3 path; without this, a future
refactor of the shared path could regress only one call site.
* refactor: [ENG-2518] address second-pass review feedback on PR #580
- Re-anchor parseBatchedTags on `<file path>` openers instead of
`</file>` closers (real correctness): the model output is plain text,
not CDATA-wrapped, so an overview that mentions `</file>` literally
in prose (perfectly normal for docs about XML, JSX, or build systems)
would prematurely terminate the outer match and orphan the inner
tag. The new strategy collects all opener positions, then slices
each opener's body up to the next opener (or end-of-string) — a
literal `</file>` in prose has no special meaning to the parser.
- Improve the CDATA test to capture BOTH L0 and L1 prompts and assert
the wrap independently on each builder. The previous test only
validated the last-call's prompt, so a future refactor that forgot
wrapCdata in one builder would slip past.
- Add a parser test for literal `</file>` inside model overview prose
to pin the new robustness invariant.
- Move "buffers below cap", "processes up to BATCH_SIZE_CAP", and
"splits oversized backlogs" tests out of the `status file` describe
block into a sibling `batching behaviour` block — these verify
scheduling, not status I/O, and the misgrouping made the suite
harder to navigate.
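The opener-anchored strategy can be sketched like this. The tag attribute syntax and trailing-closer handling are assumptions about the model-output format; the slicing logic is the invariant the parser test pins.

```typescript
// Parse batched model output by collecting every <file path="..."> opener,
// then slicing each body up to the next opener (or end-of-string). A literal
// "</file>" inside the prose has no special meaning to the parser.
function parseBatchedTags(output: string): Map<string, string> {
  const openerRe = /<file path="([^"]+)">/g;
  const openers: { path: string; openStart: number; bodyStart: number }[] = [];
  for (let m = openerRe.exec(output); m; m = openerRe.exec(output)) {
    openers.push({ path: m[1], openStart: m.index, bodyStart: m.index + m[0].length });
  }
  const result = new Map<string, string>();
  openers.forEach((o, i) => {
    const end = i + 1 < openers.length ? openers[i + 1].openStart : output.length;
    let body = output.slice(o.bodyStart, end);
    // Drop only a TRAILING closer; mid-prose closers are left intact.
    body = body.replace(/<\/file>\s*$/, "").trim();
    result.set(o.path, body);
  });
  return result;
}
```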
* feat: [ENG-2519] enable prefix caching for providers
Ports the caching change from experiment/03-enable-caching to
proj/curation-enhancement. Two coupled behavior changes (must ship
together) that engage Anthropic prompt caching cleanly and make
OpenAI/Google auto-caches see a longer stable prefix:
- Attach providerOptions.anthropic.cacheControl: ephemeral to the
system message and the last tool definition. The system message
now enters as a SystemModelMessage in the messages array (not via
AI SDK's top-level `system` parameter, which strips providerOptions).
- Relocate <dateTime> out of the system prompt into the first user
message. DateTimeContributor was rebuilding the timestamp every
iteration, and Anthropic prefix caching does token-level matching —
one byte of churn ~1,800 tokens in poisoned every cached token past
it. Moving dateTime to a single iteration-0 user-message prefix
keeps the system prefix byte-stable across iterations.
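The system-message relocation can be sketched as a pure message-building step. The providerOptions shape follows the AI SDK's Anthropic convention; treat the exact field names here as assumptions, not the PR's verbatim code.

```typescript
type ModelMessage = {
  role: "system" | "user";
  content: string;
  providerOptions?: { anthropic?: { cacheControl?: { type: "ephemeral" } } };
};

function prependCachedSystemMessage(
  messages: ModelMessage[],
  systemPrompt?: string,
): ModelMessage[] {
  if (!systemPrompt) return messages; // empty/undefined: pass through untouched
  return [
    {
      role: "system",
      content: systemPrompt,
      // Enter via the messages array (not the top-level `system` parameter,
      // which strips providerOptions) so the cache breakpoint survives.
      providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
    },
    ...messages, // original array is not mutated
  ];
}
```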
Verified across a 60-curate fixture-driven A/B and a 30-curate
progression run on proj/curation-enhancement (see
notes/token-usage-reduction/eng-2519-prefix-caching/REPORT.md):
- Anthropic claude-haiku-4-5 12-fixture: -21.5% normalized cost
- Anthropic claude-haiku-4-5 30-curate progression: -29.7%
- OpenAI gpt-4.1: -8.0% (auto-cache hits a longer stable prefix)
- Google gemini-3-flash-preview: ~flat (model-specific noise)
Tests:
- New unit tests for prependCachedSystemMessage edge cases (empty/
undefined systemPrompt, non-mutation invariant) and toAiSdkTools'
last-tool cache_control attachment (1, 3 tools).
- Updated internal-llm-service test for completeTask: assertion now
matches the iter-0 user-message dateTime prefix.
Out of scope (deferred):
- Cache-token field extraction in response.usage (cacheReadTokens,
cacheCreationTokens). Without telemetry to consume them they would
be dead code.
- Telemetry / UsageLogger plumbing — separate ticket.
- Anthropic daemon-stability mitigations (TPM retry-loop hangs,
daemon-readiness races on large fixtures) — operational, separate.
Builds: clean. Typecheck: clean. Lint: clean on touched files.
Tests: 7004 passing, 16 pending, 0 failing.
* refactor: [ENG-2519] address review-agent feedback on dateTime cache hazards
- Drop the unused `datetime` template variable from FileContributor's
renderTemplateVariables. There is no current YAML consumer; leaving it
in place was a landmine — any future `{{datetime}}` reference would
silently re-introduce per-iteration cache poisoning.
- Inject the dateTime block into the rolling-checkpoint continuation
prompt. clearHistory() wipes the iter-0 user message that originally
carried it, and the iter-0 guard upstream prevents re-injection, so
long curate/query runs that hit the checkpoint were losing time
context for every subsequent iteration.
- Extract `buildDateTimePrefix(now?)` as a small shared helper used by
both call sites; cover with unit tests.
- Expand the cache_control "last tool" test comment to clarify that the
contract is insertion-order, not alphabetical.
* feat: [ENG-2530] pre-pipeline recon to skip first agent iteration
Run tools.curation.recon synchronously in CurateExecutor before
spawning the agent session, inject the result both as a sandbox
variable (__recon_result_<taskIdSafe>) and as inline fields in the
agent's first prompt. Agent's first iteration then proceeds directly
to extraction instead of paying a full LLM call to invoke recon.
Recon is pure JS — no LLM, no decision — but was previously surfaced
as an agent-tool, costing a full ~13K-token iteration on every curate
just to invoke a deterministic helper.
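The wiring can be sketched as follows. Only the `__recon_result_<taskIdSafe>` naming scheme comes from the PR; the sanitization rule, result fields, and prompt wording are illustrative assumptions.

```typescript
// Derive the sandbox variable name from the task id, replacing characters
// that are not identifier-safe (sanitization rule assumed for illustration).
function reconVariableName(taskId: string): string {
  const taskIdSafe = taskId.replace(/[^a-zA-Z0-9_]/g, "_");
  return `__recon_result_${taskIdSafe}`;
}

interface ReconResult {
  charCount: number;
  lineCount: number;
  messageCount: number;
}

// Inline the already-computed recon fields into the agent's first prompt so
// iteration 0 proceeds straight to extraction instead of invoking the tool.
function buildFirstPrompt(
  basePrompt: string,
  recon: ReconResult,
  varName: string,
): string {
  return [
    basePrompt,
    `Recon has already run; its result is available as ${varName}:`,
    `charCount=${recon.charCount} lineCount=${recon.lineCount} messageCount=${recon.messageCount}`,
    `Do not invoke the recon tool.`,
  ].join("\n");
}
```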
Measurements (12-fixture A/B vs proj/curation-enhancement + ENG-2519
caching baseline, full results in notes/token-usage-reduction/
eng-2530-prepipeline-recon/REPORT.md):
OpenAI gpt-4.1 -16.3% calls -20.3% normalized cost
Google gemini-3-flash-preview -1.9% calls -17.8% normalized cost
Anthropic claude-haiku-4-5 -2.0% calls -0.9% normalized cost (flat)
ByteRover gemini-3-flash-preview 15-curate progression: -6.5% normalized
cost / -3.4% $ cost.
Quality risk is bounded: recon itself is unchanged; only the caller
moves. Per-fixture variance follows the team's exp 04 pattern (some
fixtures regress as the agent compensates for missing recon signal in
later iterations); aggregate nets positive on OpenAI and Google.
Anthropic claude-haiku-4-5 lands flat — different model-class behavior
from the team's claude-sonnet-4-5 measurement; not a regression.
* test: [ENG-2530] cover recon pre-pipeline wiring on curate-executor
Pin the central correctness invariant of the change: recon's result is
injected as `__recon_result_<taskIdSafe>` on the task session, and the
agent prompt instructs the loop to skip the now-deterministic call.
Without this test a refactor could silently remove the synchronous
`recon()` call or the prompt instruction, defeating the whole change.
* refactor: [ENG-2530] address review-agent feedback on PR #593
- curate-executor: add comment near reconHelper(...) noting that the
`metadata` argument is currently a no-op — `recon` recomputes counts
from `effectiveContext` directly. Prevents future readers from
assuming changes to `metadata` would alter `reconResult`. Upstream
signature fix is out of scope for this PR.
- test: add stubs for FileContextTreeSnapshotService.getCurrentState
and DreamStateService.incrementCurationCount so the unit test stays
in-memory (mirrors the ENG-2485 test pattern); keeps to CLAUDE.md's
"unit tests must run completely in memory" rule.
- test: assert meta.messageCount alongside charCount and lineCount
(the prompt surfaces all three).
* merge: sync main (3.10.1) into curation-enhancement
Resolves three conflicts created by the overlap of ENG-2485 (defer summary
cascade to dream, on this branch) and ENG-2522 (detach post-curate Phase 4,
shipped on main, PR #582). The two changes touch the same lines but address
different costs — LLM cost vs user-visible latency — and compose cleanly.
curate-executor.ts
- Adopt main's {finalize, response} architecture so the daemon fires
task:completed before Phase 4 runs.
- Inside finalize, keep cascade-defer: snapshot diff →
enqueueStaleSummaryPaths + buildManifest. Drop main's
propagateSummariesUnderLock — the dream lock existed to coordinate
inline _index.md writes, which cascade-defer eliminates from curate.
- Preserve all exp 03 (cache_control) and exp 04 (recon pre-pipelining)
changes in Phases 1-3.
- Extract propagateAndRebuild private helper alongside main's
incrementDreamCounter.
dream-executor.ts
- Keep this branch's runStaleSummaryPropagation extracted seam (main did
not change dream-executor; nothing lost).
curate-executor.test.ts
- Rewrite runAgentBody/finalize split assertions to expect cascade-defer
(enqueueStaleSummaryPaths + buildManifest) instead of propagateStaleness.
- Drop the dream-lock coordination describe block — the lock is no longer
taken in curate's Phase 4 (folder-pack still uses it).
folder-pack-executor still calls propagateSummariesUnderLock; the helper
file stays.
Verified: typecheck, lint (0 errors on resolved files), 7284 tests pass.
* refactor: address review-agent feedback on PR #601
Four small reviewer-flagged issues with focused fixes:
- abstract-queue.ts: drop non-absolute contextPath at enqueue with a
trace breadcrumb. Background batch writes derive .abstract.md /
.overview.md via raw writeFile and errors are catch-suppressed, so a
relative path would silently land under process.cwd().
- tool-manager.ts: document the insertion-order contract for
CURATE_TOOL_NAMES and QUERY_TOOL_NAMES at the registration site.
toAiSdkTools attaches Anthropic cacheControl: ephemeral to the LAST
tool, so reordering this list silently shifts the cache breakpoint.
- dream-state-service.ts: skip the disk write in update() when the
updater returns the same state reference. drainStaleSummaryPaths and
enqueueStaleSummaryPaths already follow the no-op-by-identity
convention; this makes the contract observable at the disk level so
empty drains don't tmpfile + rename for nothing. New test pins it.
- curate-executor.ts: correct the catch-block comment under the
cascade-defer enqueue. The previous wording implied diffStates would
re-capture the lost paths automatically; in fact paths uniquely
changed by the failing curate are dropped until the same files are
modified again or dream's own snapshot diff touches them.
---------
Co-authored-by: Nguyễn Thuận Phát <nguyenthuanphatvl@gmail.com>
Co-authored-by: Nguyễn Thuận Phát <143022215+RyanNg1403@users.noreply.github.com>
Summary
Collapse the abstract queue's per-file LLM calls (2N today: one L0 `.abstract.md` + one L1 `.overview.md` per file). On a 30-file pack, that's 60 independent calls, each paying its own per-call fixed-overhead tax (system prompt + framing). `generateFileAbstractsBatch(items, generator)` in `abstract-generator.ts` fires two parallel `streamToText` calls (1 batched L0 + 1 batched L1) with XML-tagged input/output. `AbstractGenerationQueue.processNext` drains up to `BATCH_SIZE_CAP=5` items per cycle. `enqueue` only triggers `scheduleNext` when `pending >= cap || drainRequested`; `drain()` (called from `curate-executor` via `drainBackgroundWork` at curate-end) sets the flag for partial-batch flush. The `onAfterWrite` enqueue trigger and per-file 20K-char content cap match the non-batched path, so the per-file view is identical regardless of mode.

Type of change
Scope (select all touched areas)
Linked issues
Hypothesis: research/token-optimization/token-usage-reduction/research/06-hypotheses.md; A/B validated by notes/token-usage-reduction/h3-batch-abstracts/RESULTS.md

Root cause (bug fixes only, otherwise write N/A)
N/A

Test plan
- test/unit/agent/map/abstract-generator-batch.test.ts (new, 5 tests covering correct response parsing, input order preservation, missing paths → empty strings, malformed-response tolerance, exactly 2 LLM calls per batch)
- test/unit/agent/map/abstract-queue.test.ts (added "buffers below cap", "processes up to BATCH_SIZE_CAP in single cycle", "splits oversized backlogs", and adapted retry-backoff test for batched semantics)

User-visible changes
None directly. Side effect: faster abstract generation on multi-file curates; ByteRover free-tier users no longer hit 50/day quota on 30-file packs.
Evidence
- process:start batchSize=5 → 2 LLM calls (vs 10 non-batched, −80%)
- notes/token-usage-reduction/h3-batch-abstracts/RESULTS.md (3-fixture A/B on OpenAI + ByteRover)