Importer perf: chunk per-comic phases to bound peak memory #634
Merged
ajslater merged 4 commits into v1.11-performance on Apr 28, 2026
Conversation
The metadata keys consumed AFTER the per-comic phases — FIS,
FTS_UPDATE, FTS_CREATE, FTS_CREATED_M2MS, FTS_UPDATED_M2MS,
FTS_EXISTING_M2MS — are populated during read/query/create/link and
read by fail_imports / full_text_search later. Each was being
re-initialized to {} on entry to its writer function, which is fine
in the current single-pass pipeline but would silently erase prior
state if the writer ran twice over different path subsets.
Switch those resets to ``setdefault`` so a second pass populates the
same dict instead of replacing it. Per-pass-only keys
(EXTRACTED, SKIPPED, CREATE_COMICS, LINK_FKS, LINK_M2MS, QUERY_MODELS,
UPDATE_COMICS, CREATE_FKS, UPDATE_FKS, DELETE_M2MS) keep their direct
``= {}`` reset since they're created and consumed within a single
pass through the per-comic phases.
LINK_COVER_PKS already uses the equivalent ``if KEY not in
self.metadata`` pattern, so no change there.
No behavior change in the current single-pass pipeline; this just
prepares the metadata machinery for a chunked apply() loop.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
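For reference, a minimal sketch of the two reset styles (the key and method names here are illustrative stand-ins, not the actual importer code):

```python
# Sketch only: why setdefault matters for keys that must survive a second pass.
class ImporterSketch:
    def __init__(self):
        self.metadata = {}

    def write_fis_single_pass_style(self, chunk_fis):
        # Old style: direct reset. A second call over another chunk
        # throws away the first chunk's entries.
        self.metadata["fis"] = {}
        self.metadata["fis"].update(chunk_fis)

    def write_fis_chunk_safe(self, chunk_fis):
        # New style: setdefault creates the dict only on the first pass,
        # so later chunks accumulate into the same dict.
        fis = self.metadata.setdefault("fis", {})
        fis.update(chunk_fis)


imp = ImporterSketch()
imp.write_fis_chunk_safe({"a.cbz": "bad zip"})
imp.write_fis_chunk_safe({"b.cbz": "bad xml"})
assert imp.metadata["fis"] == {"a.cbz": "bad zip", "b.cbz": "bad xml"}
```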
The three counts that create_and_update sets (tags, covers, comic) were written as ``self.counts.X = local_count`` — fine for a single pass, but a chunked apply() that calls create_and_update once per chunk would have each call replace the prior chunk's count. Switch to ``+=`` so chunks accumulate. ``Counts.__init__`` leaves all attrs at 0, so the single-pass semantics are unchanged (0 + N == N). Other count sites (link, moved, delete, failed) already use ``+=`` or run once outside the per-chunk loop. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
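A small illustration of the accumulation semantics, assuming a Counts object whose attributes all start at 0 (the class and function below are stand-ins, not the importer's real code):

```python
from dataclasses import dataclass


# Stand-in for the importer's Counts; all attrs start at 0, so "+=" over a
# single chunk behaves exactly like "=" did in the single-pass case.
@dataclass
class Counts:
    comic: int = 0
    tags: int = 0
    covers: int = 0


def create_and_update_chunk(counts: Counts, created_comics: int, linked_tags: int) -> None:
    # Chunk-safe: accumulate instead of overwrite.
    counts.comic += created_comics
    counts.tags += linked_tags


counts = Counts()
create_and_update_chunk(counts, created_comics=50_000, linked_tags=120_000)
create_and_update_chunk(counts, created_comics=30_000, linked_tags=70_000)
assert counts.comic == 80_000  # plain "=" would have reported only 30_000
```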
Wraps read → query → create_and_update → link in a chunk loop sized to a fraction of the available memory budget. The pre-import phases (init_apply, move_and_modify_dirs) and post-import phases (fail_imports, delete, full_text_search) still run once.

Why a simple chunked loop works (vs the elaborate two-pass spill-file design from the original sub-plan):

- All FK bulk_creates already use ``update_conflicts=True``, so a Publisher / Imprint / Series / Volume created by chunk N is visible to chunk N+1 as an idempotent UPSERT — no risk of duplicate-key errors and no need for a separate "reference data first" pass.
- Per-chunk metadata keys (CREATE_COMICS, LINK_FKS, LINK_M2MS, QUERY_MODELS, UPDATE_COMICS, CREATE_FKS, UPDATE_FKS, DELETE_M2MS) are written and consumed inside one pass through the per-comic phases, so they reset cleanly each chunk.
- Cross-chunk metadata (FIS, FTS_*) was switched to ``setdefault`` in the prep commit, so values populated by earlier chunks survive.
- Counts use ``+=``, so each chunk's comic / tag / cover totals accumulate to a correct grand total at finish.

Chunk size: derived from ``codex.librarian.memory.get_mem_limit()`` (handles cgroups + psutil fallback) times ``IMPORTER_CHUNK_MEM_FRACTION`` (default 0.25, configurable in codex.toml). Floor 1000 paths, ceiling 50000 paths. The indicative table in sub-plan 04 shows ~50k on a 4 GiB host, ~16k on a 1 GiB host, and the floor on anything under ~256 MiB.

Path partitioning: the combined files_created | files_modified set is sorted deterministically and sliced into chunks. Per chunk, the original created vs modified split is restored via set intersection, so the DB-side "exists or new" detection in extract works the same per chunk as it did in the single-pass case.

Implements tasks/importer-perf/04-streaming-pipeline.md (the simpler "chunk the per-comic phases" design that fell out of further code inspection — the plan's two-pass + spill-file architecture is no longer required because the existing UPSERT semantics already make parent FK creation cross-chunk-safe).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
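A condensed sketch of the loop shape this commit describes (not the actual codex code; the phase functions are stand-ins for importer methods, and the parameter names are assumptions):

```python
# Sketch: chunk the per-comic phases, run pre/post phases once.
def chunked_apply(files_created: set[str], files_modified: set[str], chunk_size: int, phases) -> None:
    # Pre-import phases run once.
    phases.init_apply()
    phases.move_and_modify_dirs()

    # Deterministic ordering -> deterministic chunk boundaries.
    all_paths = sorted(files_created | files_modified)

    for start in range(0, len(all_paths), chunk_size):
        chunk = set(all_paths[start : start + chunk_size])
        # Restore the original created vs modified split for this chunk only.
        created = chunk & files_created
        modified = chunk & files_modified

        phases.read(created, modified)
        phases.query()
        phases.create_and_update()  # counts accumulate via +=
        phases.link()
        # Per-chunk metadata keys are consumed above and reset next iteration;
        # cross-chunk keys (FIS, FTS_*) keep accumulating via setdefault.

    # Post-import phases run once, over metadata accumulated by every chunk.
    phases.fail_imports()
    phases.delete()
    phases.full_text_search()
```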
The 50k ceiling was throttling big-RAM hosts. With the default fraction (0.25) and a 16 KiB-per-comic estimate:

- 16 GiB host: budget 4 GiB / 16 KiB = ~262k raw -> capped at 50k
- 32 GiB host: budget 8 GiB / 16 KiB = ~524k raw -> capped at 50k
- 64 GiB host: budget 16 GiB / 16 KiB = ~1M raw -> capped at 50k

A 600k-comic library on a 32 GiB host would have run as 12 chunks of 50k when reality permits 2 chunks of 300k (or one). Each extra chunk re-fires status init, rebuilds pk_maps, and opens new transactions — wasted work.

The ceiling's job is to be a safety net against get_mem_limit() returning a pathological value (misconfigured cgroup, broken sysconf), not to throttle normal operation. 500k is well above any realistic library size, so a 32 / 64 GiB host stays in a single chunk for typical runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
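The sizing arithmetic, as a hedged sketch: the constants mirror the values quoted in this commit, and ``chunk_size`` here is a placeholder for whatever helper wraps ``codex.librarian.memory.get_mem_limit()`` in the real code.

```python
# Assumed constants, mirroring the commit message.
IMPORTER_CHUNK_MEM_FRACTION = 0.25
BYTES_PER_PATH_ESTIMATE = 16 * 1024   # ~16 KiB of metadata per comic
CHUNK_FLOOR = 1_000
CHUNK_CEILING = 500_000               # safety net, not a throttle


def chunk_size(mem_limit_bytes: int) -> int:
    budget = mem_limit_bytes * IMPORTER_CHUNK_MEM_FRACTION
    raw = int(budget // BYTES_PER_PATH_ESTIMATE)
    return max(CHUNK_FLOOR, min(CHUNK_CEILING, raw))


# 32 GiB host: 8 GiB budget / 16 KiB is ~524k raw, no longer capped at 50k.
assert chunk_size(32 * 1024**3) == 500_000
# 1 GiB host: 256 MiB budget / 16 KiB = 16_384 paths per chunk.
assert chunk_size(1 * 1024**3) == 16_384
# Pathologically small limit: the floor keeps the importer making progress.
assert chunk_size(32 * 1024**2) == CHUNK_FLOOR
```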
Summary
Implements `tasks/importer-perf/04-streaming-pipeline.md` from PR #627. Final entry in the importer perf series after #628 / #629 / #630 / #631 / #632 / #633.

After looking at the actual code post-#628–#633, the plan's elaborate "two-pass spill-file" design turned out to be unnecessary. A simple chunk loop around the per-comic phases (read → query → create_and_update → link) works because every FK `bulk_create` already uses `update_conflicts=True` — a Publisher / Imprint / Series / Volume created by chunk N is visible to chunk N+1 as an idempotent UPSERT, not a duplicate-key error. No reference-data preflight pass needed; no spill file needed.

Three commits, each independently reviewable:

55518dc1 — setdefault for cross-chunk metadata accumulators

Switches `FIS`, `FTS_UPDATE`, `FTS_CREATE`, `FTS_CREATED_M2MS`, `FTS_UPDATED_M2MS`, `FTS_EXISTING_M2MS` from `self.metadata[KEY] = {}` to `self.metadata.setdefault(KEY, {})` so a second pass over a different chunk doesn't erase the first chunk's contents. Per-pass-only keys keep their direct `= {}` reset. No behavior change in the single-pass pipeline.

6bae6955 — counts use += not =

The three counts `create_and_update` writes (`tags`, `covers`, `comic`) were assigned with `=`, which would overwrite each chunk's running total. Switched to `+=`. `Counts.__init__` zeros all attrs so single-pass semantics are unchanged (0 + N == N).

9fd3b039 — chunked apply()

Wraps read → query → create_and_update → link in a chunk loop. Pre-phases (`init_apply`, `move_and_modify_dirs`) and post-phases (`fail_imports`, `delete`, `full_text_search`) still run once. Chunk size: `codex.librarian.memory.get_mem_limit()` (cgroups + psutil fallback) times `IMPORTER_CHUNK_MEM_FRACTION` (default 0.25, configurable). Floor 1000, ceiling 50000. Path partitioning: `files_created | files_modified` is sorted (deterministic chunk boundaries for log diffing + future resume work), sliced into chunks of `chunk_size`, then the original created vs modified split is restored per chunk via set intersection.

Why this design beats the original plan
The plan proposed a two-pass architecture with a JSONL spill file (~5 PRs A→E). After the surgical N+1 wins from #628–#633, the actual remaining work is small enough that:
- `LINK_FKS` / `LINK_M2MS` / `CREATE_COMICS` holding all 600k comics' data — chunking those is the actual fix

This is one PR, ~120 LOC across 6 files, instead of 5 PRs and ~1500 LOC.
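The `update_conflicts=True` UPSERT behavior is the crux of the cross-chunk safety claim. A hedged Django sketch of it (the `Publisher` model and its fields are illustrative, not the codex schema, and this assumes a configured Django project):

```python
from django.db import models


class Publisher(models.Model):
    # Illustrative model, not the real codex one.
    name = models.CharField(max_length=128, unique=True)
    sort_name = models.CharField(max_length=128, default="")

    class Meta:
        app_label = "sketch"


def upsert_publishers(names: list[str]) -> None:
    # update_conflicts=True turns a duplicate key into an UPDATE instead of
    # an IntegrityError, so re-submitting "Marvel" in a later chunk is
    # idempotent rather than fatal.
    Publisher.objects.bulk_create(
        [Publisher(name=n, sort_name=n.lower()) for n in names],
        update_conflicts=True,
        unique_fields=["name"],
        update_fields=["sort_name"],
    )


# upsert_publishers(["Marvel", "DC"])     # chunk 1
# upsert_publishers(["Marvel", "Image"])  # chunk 2 -- no duplicate-key error
```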
Test plan
- `make fix` clean
- `make lint-python` clean (0 errors, 0 warnings)
- `pytest tests/importer/ tests/test_search_fts.py` — 7 passed
- tracemalloc; expect peak RSS < 1.5 GiB

🤖 Generated with Claude Code
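A hedged sketch of that memory check, not the project's actual harness: tracemalloc reports Python-heap peaks, while peak RSS comes from `resource.getrusage` on Unix.

```python
import resource
import tracemalloc


def run_with_memory_report(import_callable) -> None:
    # Python-heap peak via tracemalloc.
    tracemalloc.start()
    import_callable()
    _current, py_peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    # ru_maxrss is KiB on Linux (bytes on macOS); peak RSS of the process.
    peak_rss_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"python heap peak: {py_peak / 2**30:.2f} GiB")
    print(f"process peak RSS: {peak_rss_kib / 2**20:.2f} GiB")


# Example (hypothetical call): run_with_memory_report(lambda: importer.apply())
```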