
Importer perf: chunk per-comic phases to bound peak memory#634

Merged
ajslater merged 4 commits into v1.11-performance from importer-streaming-pipeline
Apr 28, 2026

Conversation

@ajslater
Owner

Summary

Implements tasks/importer-perf/04-streaming-pipeline.md from PR #627. Final entry in the importer perf series after #628 / #629 / #630 / #631 / #632 / #633.

After looking at the actual code post-#628–#633, the plan's elaborate "two-pass spill-file" design turned out to be unnecessary. A simple chunk loop around the per-comic phases (read → query → create_and_update → link) works because every FK bulk_create already uses update_conflicts=True: a Publisher / Imprint / Series / Volume created by chunk N is visible to chunk N+1 as an idempotent UPSERT, not a duplicate-key error. No reference-data preflight pass is needed, and no spill file is needed.
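
Why the UPSERT claim holds: Django's bulk_create with update_conflicts=True turns a duplicate-key INSERT into an UPDATE of the named fields. A minimal sketch, assuming a Publisher model with a unique name and a non-key sort_name field (the real codex models and import path may differ):

```python
from django.db import transaction

from codex.models import Publisher  # assumed import path


def upsert_publishers(names: list[str]) -> None:
    """Names re-inserted by a later chunk update the existing rows in place."""
    with transaction.atomic():
        Publisher.objects.bulk_create(
            [Publisher(name=n) for n in names],
            update_conflicts=True,        # duplicate keys become UPDATEs, not errors
            unique_fields=["name"],       # conflict target shared across chunks
            update_fields=["sort_name"],  # illustrative non-key field to refresh
        )
```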

Three commits, each independently reviewable:

55518dc1 — setdefault for cross-chunk metadata accumulators

Switches FIS, FTS_UPDATE, FTS_CREATE, FTS_CREATED_M2MS, FTS_UPDATED_M2MS, FTS_EXISTING_M2MS from self.metadata[KEY] = {} to self.metadata.setdefault(KEY, {}) so a second pass over a different chunk doesn't erase the first chunk's contents. Per-pass-only keys keep their direct = {} reset. No behavior change in the single-pass pipeline.
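
A before/after illustration of the reset change; the key name follows the PR text, while the surrounding function is hypothetical:

```python
# Stand-in for self.metadata on the importer.
metadata: dict[str, dict] = {}

FTS_UPDATE = "fts_update"


def record_fts_updates(chunk_updates: dict) -> None:
    # Old style: metadata[FTS_UPDATE] = {} wiped whatever earlier chunks recorded.
    # New style: reuse the dict a previous chunk may already have created.
    fts_update = metadata.setdefault(FTS_UPDATE, {})
    fts_update.update(chunk_updates)
```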

6bae6955 — counts use += not =

The three counts that create_and_update writes (tags, covers, comic) were assigned with =, which would overwrite each chunk's running total. Switched to +=. Counts.__init__ zeros all attrs, so single-pass semantics are unchanged (0 + N == N).
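
A tiny illustration of the accumulation over two chunks; the Counts class here is a stand-in for the real one:

```python
class Counts:
    """Stand-in for the real Counts; __init__ zeros every attribute."""

    def __init__(self):
        self.comic = 0
        self.tags = 0
        self.covers = 0


counts = Counts()
for comic, tags, covers in ((120, 800, 115), (80, 510, 74)):  # two chunks' local totals
    counts.comic += comic   # `=` would keep only the last chunk's 80
    counts.tags += tags
    counts.covers += covers
assert counts.comic == 200  # grand total across chunks
```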

9fd3b039 — chunked apply()

Wraps read → query → create_and_update → link in a chunk loop. Pre-phases (init_apply, move_and_modify_dirs) and post-phases (fail_imports, delete, full_text_search) still run once.

  • Chunk size is derived from codex.librarian.memory.get_mem_limit() (cgroups + psutil fallback) times IMPORTER_CHUNK_MEM_FRACTION (default 0.25, configurable). Floor 1000 paths, ceiling 50000 paths.
    • 4 GiB host: ~50k paths per chunk (capped)
    • 1 GiB host: ~16k paths per chunk
    • 256 MiB container: ~4k paths per chunk
    • 64 MiB cgroup: floor of ~1000
  • Path partitioning: files_created | files_modified is sorted (deterministic chunk boundaries for log diffing and future resume work), sliced into chunks of chunk_size, then the original created vs modified split is restored per chunk via set intersection. A sketch of the loop follows this list.
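
A minimal sketch of the loop those bullets describe. get_mem_limit(), IMPORTER_CHUNK_MEM_FRACTION, the floor/ceiling, and the phase names come from this PR; the 16 KiB-per-comic estimate is from the follow-up commit below; the method signatures, the task.files_* attribute access, and the assumption that get_mem_limit() returns bytes are illustrative guesses:

```python
from codex.librarian.memory import get_mem_limit  # cgroups + psutil fallback

IMPORTER_CHUNK_MEM_FRACTION = 0.25  # configurable; default from this PR
_BYTES_PER_COMIC = 16 * 1024        # rough working-set estimate per path
_CHUNK_FLOOR = 1_000
_CHUNK_CEILING = 50_000             # raised to 500_000 by the follow-up commit


def _chunk_size() -> int:
    # Assumes get_mem_limit() returns a byte count.
    budget = get_mem_limit() * IMPORTER_CHUNK_MEM_FRACTION
    return max(_CHUNK_FLOOR, min(_CHUNK_CEILING, int(budget // _BYTES_PER_COMIC)))


def apply(self) -> None:
    # Pre-phases run once.
    self.init_apply()
    self.move_and_modify_dirs()

    created = set(self.task.files_created)
    modified = set(self.task.files_modified)
    all_paths = sorted(created | modified)  # deterministic chunk boundaries
    size = _chunk_size()

    for start in range(0, len(all_paths), size):
        chunk = set(all_paths[start : start + size])
        # Restore the original created vs modified split for this chunk only.
        self.read(chunk & created, chunk & modified)
        self.query()
        self.create_and_update()
        self.link()

    # Post-phases run once.
    self.fail_imports()
    self.delete()
    self.full_text_search()
```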

Why this design beats the original plan

The plan proposed a two-pass architecture with a JSONL spill file (~5 PRs A→E). After the surgical N+1 wins from #628–#633, the actual remaining work is small enough that:

  • Reference-data accumulation across chunks is safe via UPSERT (no preflight pass needed)
  • Memory pressure comes from LINK_FKS / LINK_M2MS / CREATE_COMICS holding all 600k comics' data — chunking those is the actual fix
  • A spill file would just route the same data through disk; it isn't needed when chunks fit in RAM by construction

This is one PR, ~120 LOC across 6 files, instead of 5 PRs and ~1500 LOC.

Test plan

  • make fix clean
  • make lint-python clean (0 errors, 0 warnings)
  • pytest tests/importer/ tests/test_search_fts.py — 7 passed
  • Round-trip test: import a 1k-comic library; assert resulting Comic / FK / M2M row state is identical to a pre-PR baseline
  • Memory ceiling test: import a 100k-comic library under tracemalloc; expect peak RSS < 1.5 GiB (sketched after this list)
  • Chunk boundary test: fixture with paths sharing FKs across chunks; verify the FKs are deduplicated correctly via UPSERT
  • Wall clock: 100k-comic dev fixture; chunked pipeline should be within 10% of all-at-once wall clock (memory savings shouldn't cost time)
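
A hypothetical shape for that memory-ceiling check. tracemalloc reports peak Python heap allocations rather than full RSS, so this treats the traced peak as a proxy; the fixture name and importer entry point are invented:

```python
import tracemalloc


def test_import_peak_memory(large_library_importer):  # assumed 100k-comic fixture
    tracemalloc.start()
    try:
        large_library_importer.apply()
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    assert peak < 1.5 * 1024**3  # 1.5 GiB ceiling from the test plan
```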

🤖 Generated with Claude Code

ajslater and others added 4 commits April 27, 2026 22:19
The metadata keys consumed AFTER the per-comic phases — FIS,
FTS_UPDATE, FTS_CREATE, FTS_CREATED_M2MS, FTS_UPDATED_M2MS,
FTS_EXISTING_M2MS — are populated during read/query/create/link and
read by fail_imports / full_text_search later. Each was being
re-initialized to {} on entry to its writer function, which is fine
in the current single-pass pipeline but would silently erase prior
state if the writer ran twice over different path subsets.

Switch those resets to ``setdefault`` so a second pass populates the
same dict instead of replacing it. Per-pass-only keys
(EXTRACTED, SKIPPED, CREATE_COMICS, LINK_FKS, LINK_M2MS, QUERY_MODELS,
UPDATE_COMICS, CREATE_FKS, UPDATE_FKS, DELETE_M2MS) keep their direct
``= {}`` reset since they're created and consumed within a single
pass through the per-comic phases.

LINK_COVER_PKS already uses the equivalent ``if KEY not in
self.metadata`` pattern, so no change there.

No behavior change in the current single-pass pipeline; this just
prepares the metadata machinery for a chunked apply() loop.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The three counts create_and_update sets (tags, covers, comic) were
written as ``self.counts.X = local_count`` — fine for a single pass,
but a chunked apply() that calls create_and_update once per chunk
would have each call replace the prior chunk's count. Switch to ``+=``
so chunks accumulate. Counts.__init__ leaves all attrs at 0 so the
single-pass semantics are unchanged (0 + N == N).

Other count sites (link, moved, delete, failed) already use ``+=``
or run once outside the per-chunk loop.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wraps read → query → create_and_update → link in a chunk loop sized
to a fraction of the available memory budget. The pre-import phases
(init_apply, move_and_modify_dirs) and post-import phases
(fail_imports, delete, full_text_search) still run once.

Why a simple chunked loop works (vs the elaborate two-pass
spill-file design from the original sub-plan):

- All FK bulk_creates already use ``update_conflicts=True``, so a
  Publisher / Imprint / Series / Volume created by chunk N is visible
  to chunk N+1 as an idempotent UPSERT — no risk of duplicate-key
  errors and no need for a separate "reference data first" pass.
- Per-chunk metadata keys (CREATE_COMICS, LINK_FKS, LINK_M2MS,
  QUERY_MODELS, UPDATE_COMICS, CREATE_FKS, UPDATE_FKS, DELETE_M2MS)
  are written and consumed inside one pass through the per-comic
  phases, so they reset cleanly each chunk.
- Cross-chunk metadata (FIS, FTS_*) was switched to ``setdefault`` in
  the prep commit, so data populated by earlier chunks survives.
- Counts use ``+=`` so each chunk's comic / tag / cover totals
  accumulate to a correct grand total at finish.

Chunk size: derived from ``codex.librarian.memory.get_mem_limit()``
(handles cgroups + psutil fallback) times ``IMPORTER_CHUNK_MEM_FRACTION``
(default 0.25, configurable in codex.toml). Floor 1000 paths,
ceiling 50000 paths. Indicative table at sub-plan 04 shows ~50k on
a 4 GiB host, ~16k on a 1 GiB host, floor on anything < ~256 MiB.

Path partitioning: combined files_created | files_modified is sorted
deterministically and sliced into chunks. Per chunk, the original
created vs modified split is restored via set-intersection so the
DB-side "exists or new" detection in extract works the same per
chunk as it did in the single-pass case.

Implements tasks/importer-perf/04-streaming-pipeline.md (the simpler
"chunk the per-comic phases" design that fell out of further code
inspection — the plan's two-pass + spill-file architecture is no
longer required because the existing UPSERT semantics already make
parent FK creation cross-chunk-safe).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 50k ceiling was throttling big-RAM hosts. With the default
fraction (0.25) and a 16 KiB-per-comic estimate:

  16 GiB host: budget 4 GiB / 16 KiB = ~262k raw -> capped at 50k
  32 GiB host: budget 8 GiB / 16 KiB = ~524k raw -> capped at 50k
  64 GiB host: budget 16 GiB / 16 KiB = ~1M raw  -> capped at 50k

A 600k-comic library on a 32 GiB host would have run as 12 chunks of
50k when reality permits 2 chunks of 300k (or one). Each extra chunk
re-fires status init, rebuilds pk_maps, opens new transactions —
wasted work.
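
Back-of-the-envelope check of that claim (illustration only, not code
from this PR):

```python
import math

budget = int(32 * 1024**3 * 0.25)   # 32 GiB host -> 8 GiB chunk budget
raw = budget // (16 * 1024)         # ~524k paths fit the budget
library = 600_000

old_chunks = math.ceil(library / min(raw, 50_000))   # 12 chunks under the 50k cap
new_chunks = math.ceil(library / min(raw, 500_000))  # 2 chunks under the 500k cap
```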

The ceiling's job is to be a safety net against get_mem_limit()
returning a pathological value (misconfigured cgroup, broken
sysconf), not to throttle normal operation. 500k is well above any
realistic library size, so a 32 / 64 GiB host stays in a single
chunk for typical runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ajslater ajslater merged commit d4bec52 into v1.11-performance Apr 28, 2026
1 check failed
@ajslater ajslater deleted the importer-streaming-pipeline branch May 2, 2026 22:39