perf(chat): batch and pack SQLite writes for chat persistence#1686
Merged
Reduces SQL statements, rows written, and rows scanned on replay across the chat-persistence paths in agents, @cloudflare/ai-chat, and @cloudflare/think. No user-facing behaviour changes beyond an internal, forward-compatible storage-format change for stream chunks.

ResumableStream: pack stream chunks (agents)
- flushBuffer() now writes ONE row per flush instead of one row per chunk. A multi-chunk segment is stored as a JSON array of chunk bodies; a single-chunk segment is stored unwrapped (legacy object shape) so large chunks avoid array-escaping inflation.
- storeChunk() gains a per-segment byte cap (SEGMENT_MAX_BYTES = 512 KB): if adding a chunk would push the buffered segment over the cap, it flushes first, so a large chunk lands alone (unwrapped) and packed segments stay well under the 2 MB SQLite row limit even after JSON re-escaping. The existing >1.8 MB per-chunk skip is unchanged.
- All reads transparently unpack both packed segments and legacy per-chunk rows via unpackSegmentBody(): replayChunks, replayCompletedChunksByRequestId, and getStreamChunks (which now returns a running per-chunk index, stable across calls because rows are append-only).
- chunk_index is now a per-segment ordering index; restore() resumes it past max(chunk_index).

Removed the now-unused multi-row INSERT machinery (buildMultiRowInsertStrings) from sql-batch.ts and the agents/chat barrel.

Net effect: ~10x fewer chunk INSERT statements AND ~10x fewer chunk rows written per turn; replay/reconstruction scan ~10x fewer rows.

agent-as-tool forwarding fix (ai-chat)
- _getAgentToolStoredChunks() previously read the chunk table raw and filtered on chunk_index. With packing, body became a packed array and chunk_index became a segment index, breaking tailing. It now delegates to the unpacking getStreamChunks(), preserving the exact per-chunk sequence semantics that align with the in-memory live counter (_agentToolLiveSequences) so a tailing parent transitions from stored replay to live without gaps or duplicates.

Batched deletes
- ai-chat: stale-row pruning and maxPersistedMessages enforcement now delete via batched DELETE ... WHERE id IN (...) (capped at 100 bound params).
- think: deleteSubmissions() cleanup now uses batched DELETE ... WHERE submission_id IN (...).
- ai-chat & think: chat-recovery incident TTL sweep now deletes via batched storage.delete(keys) (<=128 keys/call), re-enabling DO write-coalescing.

Shared helpers (agents/chat)
- Export MAX_BOUND_PARAMS and buildInClauseStrings from sql-batch.ts.

Tests
- resumable-streaming: packing into fewer rows, single-flush packing, single unwrapped row, byte-cap splitting large chunks into their own row, and backward-compat reads of legacy per-chunk rows (+ getStreamChunkRowCount and insertLegacyChunkRows test helpers).
- agent-tools: assert forwarded chunks are individual (non-array) events with contiguous per-chunk sequences and a correct afterSequence cursor.
- think-session / worker fixtures updated to read via the unpacking getStreamChunks.

Compatibility
- Forward-compatible: new code reads existing legacy rows. Rollback is not backward-compatible for streams in-flight at rollback time (old code cannot interpret packed rows); chunks are ephemeral (24h TTL) and recovery papers over it. Accepted by design.

Verification: npm run check (sherif, export checks, oxfmt, oxlint, typecheck across 92 projects) green; tests pass: ai-chat 633, think 549, agents chat 231, plus recovery suites (ai-chat 59, agents 5).
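To illustrate the batched-delete pattern described above, here is a minimal sketch. It is not the code from this PR: `toBatches`, `buildBatchedDeletes`, and the `messages` table are hypothetical names, and the local `MAX_BOUND_PARAMS` constant only mirrors the 100-parameter cap mentioned in the text.

```typescript
// Hypothetical sketch of batched DELETE ... WHERE id IN (...).
// Local stand-in for the cap described in the PR (100 bound params per query).
const MAX_BOUND_PARAMS = 100;

/** Split a list into groups of at most `size` items. */
function toBatches<T>(items: T[], size: number = MAX_BOUND_PARAMS): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

/**
 * Build one parameterised DELETE per batch.
 * `table` and `column` are interpolated directly, so they must be trusted
 * identifiers; only the id values are bound as parameters.
 */
function buildBatchedDeletes(
  table: string,
  column: string,
  ids: string[]
): { sql: string; params: string[] }[] {
  return toBatches(ids).map((batch) => ({
    sql: `DELETE FROM ${table} WHERE ${column} IN (${batch.map(() => "?").join(", ")})`,
    params: batch,
  }));
}

// 250 stale ids become 3 statements (100 + 100 + 50) instead of 250.
const staleIds: string[] = [];
for (let i = 0; i < 250; i++) staleIds.push(`msg-${i}`);
const deleteStatements = buildBatchedDeletes("messages", "id", staleIds);
console.log(deleteStatements.length);
```

Each returned statement would then be executed with its own `params`; how that is done depends on the storage API in use, which this sketch deliberately leaves out.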
🦋 Changeset detected. Latest commit: 9e8dc8b. The changes in this PR will be included in the next version bump. This PR includes changesets to release 3 packages.
agents
@cloudflare/ai-chat
@cloudflare/codemode
hono-agents
@cloudflare/shell
@cloudflare/think
@cloudflare/voice
@cloudflare/worker-bundler
Summary
Reduces the number of SQLite statements executed, rows written, and rows scanned on replay across the chat-persistence paths in agents, @cloudflare/ai-chat, and @cloudflare/think.
The headline change is chunk packing in ResumableStream: instead of writing one SQLite row per streamed chunk, each buffer flush now writes a single packed row. Combined with batched deletes elsewhere, a normal turn drops from hundreds of single-row writes to a handful.
There are no user-facing behaviour changes beyond an internal, forward-compatible storage-format change for stream chunks (details below).
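As a rough illustration of the packing idea, the sketch below buffers chunks, writes one row body per flush, and unpacks either row shape on read. It is a simplified model under stated assumptions, not the ResumableStream implementation: `SegmentBuffer`, `packSegment`, and `unpackSegment` are hypothetical names, chunk bodies are assumed to be JSON objects (never arrays), an in-memory array stands in for the SQLite table, and string length is used as a crude proxy for byte size.

```typescript
// Hypothetical model of "one packed row per flush"; not the PR's code.
type ChunkBody = Record<string, unknown>;

const FLUSH_EVERY = 10; // flush cadence described in this PR
const SEGMENT_MAX_BYTES = 512 * 1024; // per-segment cap described in this PR

/** Serialise one buffered segment into a single row body. */
function packSegment(chunks: ChunkBody[]): string {
  // A lone chunk keeps the legacy object shape; several chunks become an array.
  return chunks.length === 1 ? JSON.stringify(chunks[0]) : JSON.stringify(chunks);
}

/** Turn a row body of either shape back into individual chunks. */
function unpackSegment(body: string): ChunkBody[] {
  const parsed: unknown = JSON.parse(body);
  return Array.isArray(parsed) ? (parsed as ChunkBody[]) : [parsed as ChunkBody];
}

class SegmentBuffer {
  private pending: ChunkBody[] = [];
  private pendingSize = 0;
  readonly rows: string[] = []; // stands in for the chunk table

  store(chunk: ChunkBody): void {
    // Assumption: serialized string length approximates the byte size.
    const size = JSON.stringify(chunk).length;
    // Flush first if this chunk would push the segment over the cap,
    // so an oversized chunk ends up alone in its own row.
    if (this.pending.length > 0 && this.pendingSize + size > SEGMENT_MAX_BYTES) {
      this.flush();
    }
    this.pending.push(chunk);
    this.pendingSize += size;
    if (this.pending.length >= FLUSH_EVERY) this.flush();
  }

  flush(): void {
    if (this.pending.length === 0) return;
    this.rows.push(packSegment(this.pending));
    this.pending = [];
    this.pendingSize = 0;
  }
}

const buf = new SegmentBuffer();
for (let i = 0; i < 45; i++) buf.store({ type: "text-delta", delta: `t${i}` });
buf.flush(); // final partial segment

const replayed: ChunkBody[] = [];
for (const row of buf.rows) replayed.push(...unpackSegment(row));
console.log(buf.rows.length, replayed.length); // 5 rows, 45 chunks
```

Because `unpackSegment` accepts both an array and a bare object, the same read path also covers rows written in the older one-chunk-per-row shape.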
Motivation
Stream chunks were persisted with one INSERT and one row per chunk. For a medium assistant reply (~250 chunks) that is ~250 statements and ~250 rows per turn, repeated for every turn. Rows written and rows scanned are the meaningful SQLite cost/perf metrics, so collapsing them is a direct win for cost, latency, and replay/reconstruction time.
What changed
1. Pack stream chunks into one row per flush (agents: ResumableStream)
- flushBuffer() now writes one row per flush:
  - A multi-chunk segment is stored as a JSON array of chunk bodies.
  - A single-chunk segment is stored unwrapped (legacy object shape), so large chunks avoid array-escaping inflation.
- storeChunk() gains a per-segment byte cap (SEGMENT_MAX_BYTES = 512 KB): if appending a chunk would push the buffered segment over the cap, it flushes first. A large chunk therefore lands alone (unwrapped), and packed multi-chunk segments stay well under the 2 MB SQLite row limit even after re-escaping. The existing >1.8 MB per-chunk skip is unchanged.
- All reads transparently unpack both packed segments and legacy per-chunk rows via unpackSegmentBody():
  - replayChunks, replayCompletedChunksByRequestId
  - getStreamChunks (now returns a running per-chunk index, stable across calls because rows are append-only)
- chunk_index is now a per-segment ordering index; restore() resumes it past max(chunk_index).
- Removed the now-unused multi-row INSERT helper (buildMultiRowInsertStrings); packing supersedes it. Kept MAX_BOUND_PARAMS and buildInClauseStrings (used by the batched deletes), exported from agents/chat.
2. Fix agent-as-tool forwarding for the packed format (@cloudflare/ai-chat)
_getAgentToolStoredChunks() previously read the chunk table raw and filtered on chunk_index. Under packing, body became a packed array and chunk_index became a segment index, which broke tailing. It now delegates to the unpacking getStreamChunks(), preserving the exact per-chunk sequence semantics that align with the in-memory live counter (_agentToolLiveSequences), so a tailing parent transitions from stored replay to live forwarding without gaps or duplicates.
3. Batched deletes
- @cloudflare/ai-chat: stale-row pruning and maxPersistedMessages enforcement now delete via batched DELETE ... WHERE id IN (...) (capped at 100 bound params/query) instead of one DELETE per row.
- @cloudflare/think: deleteSubmissions() cleanup now uses batched DELETE ... WHERE submission_id IN (...).
- @cloudflare/ai-chat & @cloudflare/think: the chat-recovery incident TTL sweep now deletes via batched storage.delete(keys) (≤128 keys/call), which also re-enables Durable Object write-coalescing (previously defeated by per-key awaited deletes).
Estimated savings (per turn)
Flush cadence is every 10 chunks (or the byte cap), so rows ≈ ⌈chunks / 10⌉.
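The row estimate above can be checked with a few lines of arithmetic. This is only a sketch: `rowsForTurn` is a hypothetical helper, and it ignores the byte cap, which can add extra rows when individual chunks are large.

```typescript
/** Rows written for a turn when every `flushEvery` chunks are packed into one row. */
function rowsForTurn(chunks: number, flushEvery: number = 10): number {
  return Math.ceil(chunks / flushEvery);
}

const mediumTurnChunks = 250; // the "medium reply" size used in this PR
const packedRows = rowsForTurn(mediumTurnChunks); // 25
const fractionSaved = 1 - packedRows / mediumTurnChunks; // 0.9
console.log(packedRows, fractionSaved);
```

The same helper gives 5 rows for 45 chunks, which matches the figure asserted in the resumable-streaming tests.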
≈ 90% fewer chunk INSERT statements and ~90% fewer chunk rows written per turn (≈ 0.9 × chunks saved on each). Replay and orphan reconstruction scan correspondingly fewer rows (~10×). A clean turn does no chunk reads; read savings apply per reconnect/resume and during agent-as-tool tailing.
Compatibility
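- Forward-compatible: new code reads existing legacy per-chunk rows (unpackSegmentBody handles both shapes). Verified by tests that seed legacy rows.
- Rollback is not backward-compatible for streams in flight at rollback time (old code cannot interpret packed rows); chunks are ephemeral (24h TTL) and recovery papers over it. Accepted by design.
- Each flush is a single INSERT (marginally more atomic than the previous multi-row write).
Tests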
- resumable-streaming: packing into fewer rows (45 chunks → 5 rows), single-flush packing, single unwrapped row, byte-cap splitting a large chunk into its own row, and backward-compat reads of legacy per-chunk rows. Added getStreamChunkRowCount + insertLegacyChunkRows test helpers.
- agent-tools: assert forwarded chunks are individual (non-array) events with contiguous per-chunk sequences and a correct afterSequence cursor (laterChunks === chunks.slice(1)).
- think-session / worker fixtures updated to read via the unpacking getStreamChunks.
Verification
- pnpm run check (sherif, export checks, oxfmt, oxlint, and typecheck across 92 projects): green.
- Tests pass: @cloudflare/ai-chat 633, @cloudflare/think 549, agents chat 231; recovery-focused: ai-chat 59, agents 5.
Changeset
.changeset/batch-stream-chunk-writes.md: patch bumps for agents, @cloudflare/ai-chat, @cloudflare/think.
Cost impact (Durable Objects pricing)
This change targets rows written (chunk packing ~90% fewer, plus batched deletes) and rows read (~90% fewer rows scanned on replay/reconstruction). It does not materially change duration or request billing.
Why it matters: LLM streaming emits hundreds of tiny chunks per turn, each previously its own row write at $1.00 / M rows, whereas duration is $12.50 / M GB-s but only a fraction of a GB-s per turn — so for chat agents, row writes dominate the bill (~10× duration).
Worked "medium" turn (~250-chunk reply, ~15 s active streaming, hibernates between turns):
≈ 79% lower per-turn cost for a streaming-dominant turn (the dominant write term drops ~85%; unchanged duration is the floor).
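The per-turn figure can be reproduced approximately with the rates quoted above. The sketch below rests on two assumptions that are not stated in this PR: duration is billed at 128 MB (0.128 GB) per active Durable Object, and a medium turn writes about 280 rows before and about 40 rows after the change (back-derived from the monthly totals quoted under Scale "cliff"). `turnCostUsd` is a hypothetical helper.

```typescript
// Rates quoted in this PR (Workers Paid, Durable Objects).
const WRITE_USD_PER_ROW = 1.0 / 1e6; // $1.00 per million rows written
const DURATION_USD_PER_GB_S = 12.5 / 1e6; // $12.50 per million GB-s

// Assumption: duration is billed at 128 MB per active Durable Object.
const BILLED_GB = 0.128;

/** Rough per-turn cost: row writes plus active duration. */
function turnCostUsd(rowsWritten: number, activeSeconds: number): number {
  const writes = rowsWritten * WRITE_USD_PER_ROW;
  const duration = activeSeconds * BILLED_GB * DURATION_USD_PER_GB_S;
  return writes + duration;
}

// Assumption: ~280 rows per turn before, ~40 after; ~15 s of active streaming.
const costBefore = turnCostUsd(280, 15);
const costAfter = turnCostUsd(40, 15);
const reduction = 1 - costAfter / costBefore;
console.log(`${(reduction * 100).toFixed(0)}% lower per turn`);
```

Under these assumptions the write term falls from 280 to 40 rows (about 86%), while the duration term of roughly $0.000024 per turn stays fixed, which is why the total reduction lands slightly below the write reduction.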
Overall savings are workload-dependent:
- Chat agents (AIChatAgent / think chat): ~60–80% off DO compute + storage-write cost.
Scale "cliff": the included allowance is 50 M rows written/month. At ~1 M turns/month, chunk writes drop from ~280 M rows (≈ $230/mo billable) to ~40 M rows, which is below the included allowance, i.e. potentially ~$0.
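The "cliff" arithmetic follows directly from the included allowance. A minimal sketch, using only the figures quoted above (`billableWriteUsd` is a hypothetical helper, and it assumes overage is billed linearly at $1.00 per million rows):

```typescript
const INCLUDED_ROWS_PER_MONTH = 50e6; // included rows written per month
const USD_PER_MILLION_ROWS = 1.0; // overage rate for rows written

/** Monthly charge for rows written beyond the included allowance. */
function billableWriteUsd(rowsWritten: number): number {
  const overage = Math.max(0, rowsWritten - INCLUDED_ROWS_PER_MONTH);
  return (overage / 1e6) * USD_PER_MILLION_ROWS;
}

console.log(billableWriteUsd(280e6)); // 230
console.log(billableWriteUsd(40e6)); // 0
```

The saving is therefore larger than the raw row reduction suggests whenever the packed workload drops below the allowance and the unpacked one does not.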
Caveats: hibernation behaviour is unchanged (idle-between-turns is already free); the duration win from 260→26 INSERTs/turn is negligible (<1%); rows-read savings are real but financially tiny at $0.001/M (the win there is latency, not cost). All figures are order-of-magnitude estimates anchored to published Workers Paid rates; SQLite storage billing has been in effect since Jan 7, 2026.