
Importer perf: capture plan in tasks/#627

Merged
ajslater merged 2 commits into v1.11-performance from importer-perf-plan
Apr 28, 2026

Conversation

@ajslater
Owner

Summary

Meta-plan + 6 sub-plans for making a 600k-comic bulk import faster without sacrificing correctness. Plan-only — no code changes. Mirrors the #621 / #625 plan-capture pattern.

  • Surface map (62 files, ~5k LOC across 8 phase subdirs)
  • Confirmed perf hot spots with file:line references
  • Sub-plan ordering by impact-per-LOC, surgical wins first, structural refactor last

Sub-plans (in suggested ship order)

  1. 01-link-batching.md — link/prepare.py:105-135 fires ~6.6M SELECTs for a 600k import. Fix: name→pk dict per model. Drops to ~14 SELECTs total.
  2. 02-create-fk-batching.md — create/foreign_keys.py:46-75 fires ~2M SELECTs via per-row field_model.objects.get(). Fix: pre-fetch parent FK pk-maps, pass via the <field>_id=<pk> shortcut.
  3. 03-query-prune.md — moved/comics.py:60-68 has a per-comic Folder N+1 (1.2M SELECTs → 2); narrows prefetch_related/select_related to fields actually referenced; hoists status_controller.update() out of inner loops.
  4. 05-sqlite-tuning.md — phase-level transaction.atomic (no atomic exists anywhere in scribe currently), importer-scoped PRAGMAs (cache_size=512MB, wal_autocheckpoint=0), connection_created signal handler, post-import PRAGMA optimize + wal_checkpoint(TRUNCATE).
  5. 06-comicbox-side.md — comicbox-repo PRs: to_dict(keys=...) field projection, CIX-first probe order, filesystem-mtime pre-filter to skip archive opens entirely on unchanged files.
  6. 04-streaming-pipeline.md — split into reference-data phase (all-at-once, ~800MB) + chunked comic ingestion (5-10k per chunk, ~30-60MB peak). JSONL spill file. Resume-from-watermark. Memory cap drops from ~10GB peak to <1.5GB. Biggest refactor; ships last.
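The batching pattern behind sub-plans 01 and 02 can be sketched without Django: replace a per-row lookup query with one query that builds a name→pk map, then link rows via dict lookups. This is a minimal illustration using the stdlib sqlite3 module; the `publisher` table and column names are invented for the sketch, not codex's actual schema.

```python
import sqlite3

def build_pk_map(conn: sqlite3.Connection, table: str) -> dict[str, int]:
    """One SELECT per model instead of one per row: name -> pk."""
    # Table name is interpolated only because this is a sketch with a
    # hard-coded, trusted table name.
    return dict(conn.execute(f"SELECT name, id FROM {table}"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE publisher (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
conn.executemany("INSERT INTO publisher (name) VALUES (?)",
                 [("Marvel",), ("DC",), ("Image",)])

pk_map = build_pk_map(conn, "publisher")

# Linking incoming rows now costs a dict lookup each, not a SELECT per comic.
incoming = ["DC", "Image", "DC"]
publisher_ids = [pk_map[name] for name in incoming]
print(publisher_ids)  # -> [2, 3, 2]
```

The same map also enables the sub-plan 02 shortcut: with the parent pk in hand, a child row can be created by assigning the raw FK column value directly instead of fetching the parent object first.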

Test plan

  • Review 00-meta.md for accuracy of surface map and findings
  • Confirm sub-plan ordering matches your priorities
  • Flag any sub-plan that needs more detail or has a wrong premise
  • Approve as planning artifact (no code changes); implementation lands in follow-up PRs per sub-plan

🤖 Generated with Claude Code

ajslater and others added 2 commits April 27, 2026 19:51
Meta-plan + 6 sub-plans for making a 600k-comic bulk import faster
without sacrificing correctness. Surface map, ranked findings with
line numbers, methodology, correctness invariants, sub-plan
ordering by impact-per-LOC.

Sub-plans:
- 01 link-batching: ~6.6M SELECTs in link/prepare.py -> ~14 via
  per-model name->pk batching
- 02 create-fk-batching: ~2M SELECTs in create/foreign_keys.py
  -> ~10 via pre-fetched parent pk-maps + <field>_id= shortcut
- 03 query-prune: moved/comics.py per-comic Folder N+1 (1.2M
  -> 2 SELECTs), narrowed prefetch/select_related, status update
  hoisting
- 05 sqlite-tuning: phase-level transaction.atomic, importer-
  scoped cache_size + wal_autocheckpoint=0, post-import optimize
  + checkpoint
- 06 comicbox-side: to_dict(keys=...) field projection, CIX-first
  probe order, filesystem-mtime pre-filter
- 04 streaming-pipeline: reference-data phase + chunked comic
  ingestion, JSONL spill, resume-from-watermark; biggest refactor,
  ships LAST

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
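The sqlite-tuning PRAGMAs named in the 05 sub-plan can be demonstrated with the stdlib sqlite3 module. The values mirror the plan (cache_size of ~512MB expressed as negative KiB, wal_autocheckpoint=0, post-import optimize and checkpoint), but the connection below is a throwaway in-memory database, not codex's Django connection, where this would run in a connection_created signal handler.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Importer-scoped tuning (per the plan, applied via a connection_created
# handler in Django; shown inline here).
conn.execute("PRAGMA cache_size = -524288")    # negative value = KiB, so ~512MB
conn.execute("PRAGMA wal_autocheckpoint = 0")  # defer WAL checkpoints until asked

# ... bulk import would happen here ...

# Post-import cleanup.
conn.execute("PRAGMA optimize")
conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")  # no-op on non-WAL databases

cache_size = conn.execute("PRAGMA cache_size").fetchone()[0]
print(cache_size)  # -> -524288
```

A side note on the negative convention: a positive cache_size is a page count, while a negative one is a size in KiB, which is why 512MB appears as -524288.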
Two changes from PR #627 review:

04-streaming-pipeline.md:
- Replace fixed CHUNK_SIZE=5000 with dynamic sizing via the
  existing get_mem_limit() helper (codex/librarian/memory.py),
  which already handles cgroups + psutil fallback. Floor 1000,
  ceiling 50k, default 25% of mem budget per chunk.
- Move spill file from CONFIG_PATH to ROOT_CACHE_PATH, alongside
  other codex cache artifacts. Document /tmp rejection reasons.

06-comicbox-side.md:
- Withdraw Improvement C (CIX-first probe ordering). Comicbox
  already does the optimal cheap path: box/sources.py:155 walks
  namelist() and matches against FILENAME_FORMAT_MAP, then each
  matched format is loaded with its specific schema -- no
  try-every-format loop. Adding probe ordering would risk dropping
  data from MetronInfo/CoMet/etc. which codex officially supports.
- Reword the worker-compute-waste paragraph to clarify that the
  waste is per-format-keys, not multiple-formats.
- Drop probe-related entries from suggested commit shape, test
  plan, and correctness invariants.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
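The spill-file mechanics from the 04 streaming plan (JSONL spill plus resume-from-watermark) reduce to a small generator. The record shape and file name below are hypothetical stand-ins, not codex's actual spill schema, and a temporary directory stands in for ROOT_CACHE_PATH.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path

# Hypothetical record shape; codex's actual spill schema may differ.
records = [{"path": f"comic_{i}.cbz", "pk": i} for i in range(10)]

spill = Path(tempfile.mkdtemp()) / "import_cache.jsonl"
with spill.open("w") as fp:
    for rec in records:
        fp.write(json.dumps(rec) + "\n")

def chunks(path: Path, size: int, watermark: int = 0):
    """Yield lists of records, skipping the first `watermark` lines on resume."""
    with path.open() as fp:
        it = (json.loads(line) for line in islice(fp, watermark, None))
        while chunk := list(islice(it, size)):
            yield chunk

# Resume after a crash that got through the first 4 records.
resumed = list(chunks(spill, size=3, watermark=4))
print([len(c) for c in resumed])  # -> [3, 3]
```

Because JSONL is line-delimited, the watermark is just a line count, so resuming never requires re-parsing or rewriting the spill file.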
@ajslater
Owner Author

Revisions in 331860aa from review feedback:

04-streaming-pipeline.md

  • Dynamic chunk size via codex/librarian/memory.py:get_mem_limit() (already handles cgroups2/cgroups1 + psutil.virtual_memory() fallback). Floor 1000, ceiling 50k, default 25% mem budget. Indicative table: ~50k on a 4GB Pi, ~16k on a 1GB Pi, ~4k on a 256MB container, floor on anything smaller.
  • Spill file moves to ROOT_CACHE_PATH / "import_cache.jsonl" (settings/__init__.py:523). Documented why /tmp is wrong (tmpfs OOM risk on Pi) and why ROOT_CACHE_PATH is right (persistent across reboots, same volume as DB, already managed).
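The dynamic sizing rule above (floor 1000, ceiling 50k, 25% of the memory budget per chunk) reduces to a clamp. In this sketch, PER_COMIC_BYTES is an assumed average in-memory cost chosen so the output matches the indicative table; the real value would come from measurement, and get_mem_limit() would supply the memory budget in codex.

```python
CHUNK_FLOOR = 1_000
CHUNK_CEILING = 50_000
PER_COMIC_BYTES = 16 * 1024  # assumed average cost per in-flight comic record

def chunk_size(mem_limit_bytes: int) -> int:
    """25% of the memory budget, clamped to [CHUNK_FLOOR, CHUNK_CEILING]."""
    budget = mem_limit_bytes // 4
    return max(CHUNK_FLOOR, min(CHUNK_CEILING, budget // PER_COMIC_BYTES))

GIB = 1024 ** 3
print(chunk_size(4 * GIB))         # 4GB Pi         -> 50000 (hits ceiling)
print(chunk_size(1 * GIB))         # 1GB Pi         -> 16384
print(chunk_size(256 * 1024 ** 2)) # 256MB container -> 4096
print(chunk_size(16 * 1024 ** 2))  # tiny host       -> 1000 (hits floor)
```

Under these assumed constants the results line up with the table in the plan: ~50k on a 4GB Pi, ~16k on a 1GB Pi, ~4k on a 256MB container, and the floor below that.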

06-comicbox-side.md

  • Withdrew Improvement C entirely. After reading comicbox/box/sources.py:155-174 and comicbox/box/load.py:118-141, comicbox already does the optimal thing: walks namelist(), matches filenames against FILENAME_FORMAT_MAP, and only loads/parses formats that are actually present — each SourceData carries the pre-selected fmt so _load_metadata skips the try-every-format fallback. Probe ordering on top of that would risk dropping MetronInfo/CoMet/CBI data that codex officially supports.
  • Reworded the worker-compute-waste framing to clarify the remaining waste is within each format (parsing keys codex doesn't read), not across formats. Improvement A (to_dict(keys=...) field projection) still stands.
  • Cleaned up cross-references to the dropped improvement in commit shape, test plan, and correctness invariants.
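The surviving comicbox-side win that pairs with Improvement A is the filesystem-mtime pre-filter: skip opening an archive at all when its mtime has not changed since the last import. This sketch keeps the cache in a plain dict; codex would persist the per-path mtime in its database, and the cache name and function are illustrative.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical in-memory mtime cache; codex would persist this per-path.
seen_mtimes: dict[str, float] = {}

def needs_read(path: Path) -> bool:
    """Return False (skip the archive open) when the file's mtime is unchanged."""
    mtime = os.stat(path).st_mtime
    if seen_mtimes.get(str(path)) == mtime:
        return False
    seen_mtimes[str(path)] = mtime
    return True

cbz = Path(tempfile.mkdtemp()) / "issue_001.cbz"
cbz.write_bytes(b"fake archive")

first = needs_read(cbz)   # unseen path -> True, mtime recorded
second = needs_read(cbz)  # unchanged   -> False, archive open skipped
print(first, second)  # -> True False
```

A stat call is orders of magnitude cheaper than opening a zip and parsing metadata, so on a mostly unchanged library this filter removes almost all archive work.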

