fix: Complete incremental indexing on Standard S3 #1245
Merged
t. Under pre-fix builds, the indexer Lambda burned its full 15 min execution budget and fell back to a backpressure-triggered full rebuild on every subsequent commit.

`BinaryIndexStore` lookups from the per-property ref-class merge loop, the hot path that was triggering on-demand dictionary/index artifact loads on Standard S3.

Details
Root cause
Hidden store-backed class `Sid` resolution inside incremental stats merging. The per-property ref-class merge loop was reaching into `BinaryIndexStore` to resolve class IDs, which could demand-load dictionary and index artifacts mid-loop. On S3 Express that cost was masked by ~1–5 ms per-object latency; on Standard S3, with ~50–200 ms per-object latency and many small reads per merge entry, the loop effectively never made forward progress on larger ledgers.

The fix moves all required resolution out of the inner merge loop so the per-property attribution merge is fully in-memory.
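The shape of the change can be illustrated with a minimal sketch. This is not the project's code: `CountingStore`, `merge_in_loop`, `merge_hoisted`, and the workload sizes are all hypothetical, and the model assumes the worst case where every store lookup costs one sequential object read.

```python
from dataclasses import dataclass


@dataclass
class CountingStore:
    """Hypothetical stand-in for a store whose lookups may hit object storage."""
    class_ids: dict
    reads: int = 0

    def resolve(self, class_iri):
        # Each call models one demand-loaded artifact read (worst case).
        self.reads += 1
        return self.class_ids[class_iri]


def merge_in_loop(store, entries):
    """Pre-fix shape: resolve through the store for every merge entry."""
    counts = {}
    for prop, class_iri, n in entries:
        key = (prop, store.resolve(class_iri))
        counts[key] = counts.get(key, 0) + n
    return counts


def merge_hoisted(store, entries):
    """Post-fix shape: resolve each distinct class once, then merge in memory."""
    resolved = {iri: store.resolve(iri) for iri in {e[1] for e in entries}}
    counts = {}
    for prop, class_iri, n in entries:
        key = (prop, resolved[class_iri])
        counts[key] = counts.get(key, 0) + n
    return counts


def estimated_wait_s(reads, per_read_ms):
    """Sequential read cost, counting nothing but per-object latency."""
    return reads * per_read_ms / 1000.0


# 10,000 merge entries over 20 properties and 50 classes (made-up sizes).
entries = [(f"p{i % 20}", f"ex:Class{i % 50}", 1) for i in range(10_000)]
ids = {f"ex:Class{i}": i for i in range(50)}

naive_store = CountingStore(dict(ids))
hoisted_store = CountingStore(dict(ids))
naive = merge_in_loop(naive_store, entries)
hoisted = merge_hoisted(hoisted_store, entries)

for label, store in (("in-loop", naive_store), ("hoisted", hoisted_store)):
    print(label, store.reads, "reads,",
          estimated_wait_s(store.reads, 5), "s at 5 ms,",
          estimated_wait_s(store.reads, 200), "s at 200 ms")
```

Both functions produce the same merged counts; only the number of store reads differs. Under these assumed sizes, 10,000 sequential reads at 200 ms is 2,000 s of waiting, which already exceeds a 15 min (900 s) budget, while the same reads at 5 ms cost 50 s.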
Before / after on a medium staged workload
Workload: ~8.5 MB JSON-LD, ~10K subjects, 5 commits (~1.5 MB chunk each). Indexer wall time measured from the indexer Lambda's own `processing_time_ms` (CloudWatch), not client-side polling. Identical compiled Lambdas on both stacks; only the index bucket type differs.

Pre-fix indexing on Standard S3
Post-fix indexing on Standard S3
Standard S3 indexing is now a clean ~1.7–2.5× slower than Express across all chunks, with no degradation over time (where pre-fix it was unusable past the first incremental).
Transactions
No meaningful difference between backends. Transaction wall time on both stacks lands in a ~4–7 s/commit range, dominated by the synchronous commit path through the transactor / SQS FIFO queue — index storage isn't on that critical path.
Queries
With a caught-up indexer, hot/warm query latency on Standard S3 is statistically indistinguishable from Express. Server-side median across 5 iterations after warmup, sampled after each of the 5 chunks:
Standard is occasionally faster, sometimes slower, all within normal runtime noise. The 8 GB Lambda `/tmp` disk artifact cache absorbs per-request S3 latency once warm.

Cold/simple query latency on Standard generally ranges from no measurable difference up to ~30% slower. On a tiny synthetic dataset (10 iters after warmup), Standard medians were 11–33% slower than Express; the penalty scales with the number of index segments touched per query.
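The warm-versus-cold behaviour is what a read-through disk cache would produce. The following is a minimal sketch of that pattern, not the project's cache: `DiskArtifactCache`, `fake_fetch`, and the key `index/segment-0001` are hypothetical, and a temporary directory stands in for the Lambda `/tmp` disk.

```python
import hashlib
import tempfile
from pathlib import Path


class DiskArtifactCache:
    """Read-through cache: serve from local disk, fall back to a remote fetch."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch  # callable(key) -> bytes; stands in for an S3 GET
        self.remote_fetches = 0

    def _path(self, key):
        # Hash the key so any object name maps to a safe file name.
        return self.cache_dir / hashlib.sha256(key.encode()).hexdigest()

    def get(self, key):
        path = self._path(key)
        if path.exists():
            return path.read_bytes()  # warm: no remote latency
        data = self.fetch(key)        # cold: pay per-object latency once
        self.remote_fetches += 1
        path.write_bytes(data)
        return data


def fake_fetch(key):
    """Hypothetical remote read; a real one would pay S3 latency here."""
    return f"artifact:{key}".encode()


cache = DiskArtifactCache(tempfile.mkdtemp(), fake_fetch)
cold = cache.get("index/segment-0001")  # goes to the remote
warm = cache.get("index/segment-0001")  # served from disk
print(cache.remote_fetches)
```

Only the first read of each artifact pays bucket latency, which is consistent with hot/warm queries looking the same on both backends while cold queries scale with the number of segments touched.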
indexing gap. Pre-fix on Standard, a multi-hop join median climbed from 160 ms → 578 ms (3.6×) across 4 chunks as the indexer fell further behind. Post-fix, with the indexer keeping up, the same query stays flat at 144–174 ms across all 5 chunks.
When S3 Express One Zone still matters
For larger ledgers and sustained indexing throughput, S3 Express One Zone remains a meaningful optimization: observed indexing speedups of 30%+ versus Standard S3, and the gap is expected to widen with:

For workloads that are mostly hot queries, modest indexing volume, or cost-sensitive, Standard S3 is a viable index backend.
Docs
- `docs/operations/serverless-storage.md`: Standard S3 vs S3 Express One Zone guidance, expected ranges, tuning notes.
- `docs/operations/storage.md`, `docs/operations/README.md`, `docs/reference/connection-config-jsonld.md`, `docs/getting-started/rust-api.md`, `docs/SUMMARY.md` (cross-links + new `s3MaxConcurrentRequests` field).
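The name `s3MaxConcurrentRequests` suggests a cap on in-flight S3 requests; its exact semantics are defined in the docs listed above, not here. As an illustration of the general technique only, a concurrency cap can be sketched with a semaphore. `fetch_all` and `FakeS3` are hypothetical names, and the sleep stands in for request latency.

```python
import asyncio


async def fetch_all(keys, fetch, max_concurrent):
    """Fetch every key while keeping at most max_concurrent requests in flight."""
    gate = asyncio.Semaphore(max_concurrent)

    async def one(key):
        async with gate:
            return await fetch(key)

    # gather preserves the order of the input keys in its result list.
    return await asyncio.gather(*(one(k) for k in keys))


class FakeS3:
    """Hypothetical client that records the peak number of concurrent calls."""

    def __init__(self):
        self.in_flight = 0
        self.peak = 0

    async def get(self, key):
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(0.01)  # stands in for per-object latency
        self.in_flight -= 1
        return key.upper()


client = FakeS3()
keys = [f"seg-{i}" for i in range(20)]
results = asyncio.run(fetch_all(keys, client.get, max_concurrent=4))
print(client.peak)
```

A higher cap hides more per-object latency behind parallel reads, at the cost of more simultaneous connections; the right value for a given deployment is a tuning question this sketch does not answer.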