
Importer perf: chunk per-comic phases to bound peak memory#634

Merged
ajslater merged 4 commits into v1.11-performance from importer-streaming-pipeline
Apr 28, 2026

Conversation

@ajslater
Owner

Summary

Implements tasks/importer-perf/04-streaming-pipeline.md from PR #627. Final entry in the importer perf series after #628 / #629 / #630 / #631 / #632 / #633.

After looking at the actual code post-#628–#633, the plan's elaborate "two-pass spill-file" design turned out to be unnecessary. A simple chunk loop around the per-comic phases (read → query → create_and_update → link) works because every FK bulk_create already uses update_conflicts=True: a Publisher / Imprint / Series / Volume created by chunk N is visible to chunk N+1 as an idempotent UPSERT, not a duplicate-key error. No reference-data preflight pass is needed, and no spill file is needed.
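
Why the UPSERT claim holds: Django's bulk_create with update_conflicts=True turns a duplicate-key INSERT into an UPDATE of the named fields. A minimal sketch, assuming a Publisher model with a unique name and a non-key sort_name field (the real codex models and import path may differ):

```python
from django.db import transaction

from codex.models import Publisher  # assumed import path


def upsert_publishers(names: list[str]) -> None:
    """Names re-inserted by a later chunk update the existing rows in place."""
    with transaction.atomic():
        Publisher.objects.bulk_create(
            [Publisher(name=n) for n in names],
            update_conflicts=True,        # duplicate keys become UPDATEs, not errors
            unique_fields=["name"],       # conflict target shared across chunks
            update_fields=["sort_name"],  # illustrative non-key field to refresh
        )
```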

Three commits, each independently reviewable:

55518dc1 — setdefault for cross-chunk metadata accumulators

Switches FIS, FTS_UPDATE, FTS_CREATE, FTS_CREATED_M2MS, FTS_UPDATED_M2MS, FTS_EXISTING_M2MS from self.metadata[KEY] = {} to self.metadata.setdefault(KEY, {}) so a second pass over a different chunk doesn't erase the first chunk's contents. Per-pass-only keys keep their direct = {} reset. No behavior change in the single-pass pipeline.
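
A before/after illustration of the reset change; the key name follows the PR text, while the surrounding function is hypothetical:

```python
# Stand-in for self.metadata on the importer.
metadata: dict[str, dict] = {}

FTS_UPDATE = "fts_update"


def record_fts_updates(chunk_updates: dict) -> None:
    # Old style: metadata[FTS_UPDATE] = {} wiped whatever earlier chunks recorded.
    # New style: reuse the dict a previous chunk may already have created.
    fts_update = metadata.setdefault(FTS_UPDATE, {})
    fts_update.update(chunk_updates)
```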

6bae6955 — counts use += not =

The three counts that create_and_update writes (tags, covers, comic) were assigned with =, which would overwrite each chunk's running total. Switched to +=. Counts.__init__ zeros all attrs, so single-pass semantics are unchanged (0 + N == N).
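
A tiny illustration of the accumulation over two chunks; the Counts class here is a stand-in for the real one:

```python
class Counts:
    """Stand-in for the real Counts; __init__ zeros every attribute."""

    def __init__(self):
        self.comic = 0
        self.tags = 0
        self.covers = 0


counts = Counts()
for comic, tags, covers in ((120, 800, 115), (80, 510, 74)):  # two chunks' local totals
    counts.comic += comic   # `=` would keep only the last chunk's 80
    counts.tags += tags
    counts.covers += covers
assert counts.comic == 200  # grand total across chunks
```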

9fd3b039 — chunked apply()

Wraps read → query → create_and_update → link in a chunk loop. Pre-phases (init_apply, move_and_modify_dirs) and post-phases (fail_imports, delete, full_text_search) still run once.

  • Chunk size is derived from codex.librarian.memory.get_mem_limit() (cgroups + psutil fallback) times IMPORTER_CHUNK_MEM_FRACTION (default 0.25, configurable). Floor 1000 paths, ceiling 50000 paths.
    • 4 GiB host: ~50k paths per chunk (capped)
    • 1 GiB host: ~16k paths per chunk
    • 256 MiB container: ~4k paths per chunk
    • 64 MiB cgroup: floor of ~1000
  • Path partitioning: files_created | files_modified is sorted (deterministic chunk boundaries for log diffing and future resume work), sliced into chunks of chunk_size, then the original created vs modified split is restored per chunk via set intersection. A sketch of the loop follows this list.
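
A minimal sketch of the loop those bullets describe. get_mem_limit(), IMPORTER_CHUNK_MEM_FRACTION, the floor/ceiling, and the phase names come from this PR; the 16 KiB-per-comic estimate is from the follow-up commit below; the method signatures, the task.files_* attribute access, and the assumption that get_mem_limit() returns bytes are illustrative guesses:

```python
from codex.librarian.memory import get_mem_limit  # cgroups + psutil fallback

IMPORTER_CHUNK_MEM_FRACTION = 0.25  # configurable; default from this PR
_BYTES_PER_COMIC = 16 * 1024        # rough working-set estimate per path
_CHUNK_FLOOR = 1_000
_CHUNK_CEILING = 50_000             # raised to 500_000 by the follow-up commit


def _chunk_size() -> int:
    # Assumes get_mem_limit() returns a byte count.
    budget = get_mem_limit() * IMPORTER_CHUNK_MEM_FRACTION
    return max(_CHUNK_FLOOR, min(_CHUNK_CEILING, int(budget // _BYTES_PER_COMIC)))


def apply(self) -> None:
    # Pre-phases run once.
    self.init_apply()
    self.move_and_modify_dirs()

    created = set(self.task.files_created)
    modified = set(self.task.files_modified)
    all_paths = sorted(created | modified)  # deterministic chunk boundaries
    size = _chunk_size()

    for start in range(0, len(all_paths), size):
        chunk = set(all_paths[start : start + size])
        # Restore the original created vs modified split for this chunk only.
        self.read(chunk & created, chunk & modified)
        self.query()
        self.create_and_update()
        self.link()

    # Post-phases run once.
    self.fail_imports()
    self.delete()
    self.full_text_search()
```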

Why this design beats the original plan

The plan proposed a two-pass architecture with a JSONL spill file (~5 PRs A→E). After the surgical N+1 wins from #628–#633, the actual remaining work is small enough that:

  • Reference-data accumulation across chunks is safe via UPSERT (no preflight pass needed)
  • Memory pressure comes from LINK_FKS / LINK_M2MS / CREATE_COMICS holding all 600k comics' data — chunking those is the actual fix
  • A spill file would just route the same data through disk; it isn't needed when chunks fit in RAM by construction

This is one PR, ~120 LOC across 6 files, instead of 5 PRs and ~1500 LOC.

Test plan

  • make fix clean
  • make lint-python clean (0 errors, 0 warnings)
  • pytest tests/importer/ tests/test_search_fts.py — 7 passed
  • Round-trip test: import a 1k-comic library; assert resulting Comic / FK / M2M row state is identical to a pre-PR baseline
  • Memory ceiling test: import a 100k-comic library under tracemalloc; expect peak RSS < 1.5 GiB (sketched after this list)
  • Chunk boundary test: fixture with paths sharing FKs across chunks; verify the FKs are deduplicated correctly via UPSERT
  • Wall clock: 100k-comic dev fixture; chunked pipeline should be within 10% of all-at-once wall clock (memory savings shouldn't cost time)
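
A hypothetical shape for that memory-ceiling check. tracemalloc reports peak Python heap allocations rather than full RSS, so this treats the traced peak as a proxy; the fixture name and importer entry point are invented:

```python
import tracemalloc


def test_import_peak_memory(large_library_importer):  # assumed 100k-comic fixture
    tracemalloc.start()
    try:
        large_library_importer.apply()
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    assert peak < 1.5 * 1024**3  # 1.5 GiB ceiling from the test plan
```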

🤖 Generated with Claude Code

ajslater and others added 4 commits April 27, 2026 22:19
The metadata keys consumed AFTER the per-comic phases — FIS,
FTS_UPDATE, FTS_CREATE, FTS_CREATED_M2MS, FTS_UPDATED_M2MS,
FTS_EXISTING_M2MS — are populated during read/query/create/link and
read by fail_imports / full_text_search later. Each was being
re-initialized to {} on entry to its writer function, which is fine
in the current single-pass pipeline but would silently erase prior
state if the writer ran twice over different path subsets.

Switch those resets to ``setdefault`` so a second pass populates the
same dict instead of replacing it. Per-pass-only keys
(EXTRACTED, SKIPPED, CREATE_COMICS, LINK_FKS, LINK_M2MS, QUERY_MODELS,
UPDATE_COMICS, CREATE_FKS, UPDATE_FKS, DELETE_M2MS) keep their direct
``= {}`` reset since they're created and consumed within a single
pass through the per-comic phases.

LINK_COVER_PKS already uses the equivalent ``if KEY not in
self.metadata`` pattern, so no change there.

No behavior change in the current single-pass pipeline; this just
prepares the metadata machinery for a chunked apply() loop.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The three counts create_and_update sets (tags, covers, comic) were
written as ``self.counts.X = local_count`` — fine for a single pass,
but a chunked apply() that calls create_and_update once per chunk
would have each call replace the prior chunk's count. Switch to ``+=``
so chunks accumulate. Counts.__init__ leaves all attrs at 0 so the
single-pass semantics are unchanged (0 + N == N).

Other count sites (link, moved, delete, failed) already use ``+=``
or run once outside the per-chunk loop.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wraps read → query → create_and_update → link in a chunk loop sized
to a fraction of the available memory budget. The pre-import phases
(init_apply, move_and_modify_dirs) and post-import phases
(fail_imports, delete, full_text_search) still run once.

Why a simple chunked loop works (vs the elaborate two-pass
spill-file design from the original sub-plan):

- All FK bulk_creates already use ``update_conflicts=True``, so a
  Publisher / Imprint / Series / Volume created by chunk N is visible
  to chunk N+1 as an idempotent UPSERT — no risk of duplicate-key
  errors and no need for a separate "reference data first" pass.
- Per-chunk metadata keys (CREATE_COMICS, LINK_FKS, LINK_M2MS,
  QUERY_MODELS, UPDATE_COMICS, CREATE_FKS, UPDATE_FKS, DELETE_M2MS)
  are written and consumed inside one pass through the per-comic
  phases, so they reset cleanly each chunk.
- Cross-chunk metadata (FIS, FTS_*) was switched to ``setdefault`` in
  the prep commit, so data populated by earlier chunks survives.
- Counts use ``+=`` so each chunk's comic / tag / cover totals
  accumulate to a correct grand total at finish.

Chunk size: derived from ``codex.librarian.memory.get_mem_limit()``
(handles cgroups + psutil fallback) times ``IMPORTER_CHUNK_MEM_FRACTION``
(default 0.25, configurable in codex.toml). Floor 1000 paths,
ceiling 50000 paths. Indicative table at sub-plan 04 shows ~50k on
a 4 GiB host, ~16k on a 1 GiB host, floor on anything < ~256 MiB.

Path partitioning: combined files_created | files_modified is sorted
deterministically and sliced into chunks. Per chunk, the original
created vs modified split is restored via set-intersection so the
DB-side "exists or new" detection in extract works the same per
chunk as it did in the single-pass case.

Implements tasks/importer-perf/04-streaming-pipeline.md (the simpler
"chunk the per-comic phases" design that fell out of further code
inspection — the plan's two-pass + spill-file architecture is no
longer required because the existing UPSERT semantics already make
parent FK creation cross-chunk-safe).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 50k ceiling was throttling big-RAM hosts. With the default
fraction (0.25) and a 16 KiB-per-comic estimate:

  16 GiB host: budget 4 GiB / 16 KiB = ~262k raw -> capped at 50k
  32 GiB host: budget 8 GiB / 16 KiB = ~524k raw -> capped at 50k
  64 GiB host: budget 16 GiB / 16 KiB = ~1M raw  -> capped at 50k

A 600k-comic library on a 32 GiB host would have run as 12 chunks of
50k when reality permits 2 chunks of 300k (or one). Each extra chunk
re-fires status init, rebuilds pk_maps, opens new transactions —
wasted work.
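
Back-of-the-envelope check of that claim (illustration only, not code
from this PR):

```python
import math

budget = int(32 * 1024**3 * 0.25)   # 32 GiB host -> 8 GiB chunk budget
raw = budget // (16 * 1024)         # ~524k paths fit the budget
library = 600_000

old_chunks = math.ceil(library / min(raw, 50_000))   # 12 chunks under the 50k cap
new_chunks = math.ceil(library / min(raw, 500_000))  # 2 chunks under the 500k cap
```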

The ceiling's job is to be a safety net against get_mem_limit()
returning a pathological value (misconfigured cgroup, broken
sysconf), not to throttle normal operation. 500k is well above any
realistic library size, so a 32 / 64 GiB host stays in a single
chunk for typical runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ajslater ajslater merged commit d4bec52 into v1.11-performance Apr 28, 2026
1 check failed
@ajslater ajslater deleted the importer-streaming-pipeline branch May 2, 2026 22:39