
Importer perf: capture plan in tasks/#627

Merged
ajslater merged 2 commits into v1.11-performance from importer-perf-plan
Apr 28, 2026

Conversation

@ajslater
Owner

Summary

Meta-plan + 6 sub-plans for making a 600k-comic bulk import faster without sacrificing correctness. Plan-only — no code changes. Mirrors the #621 / #625 plan-capture pattern.

  • Surface map (62 files, ~5k LOC across 8 phase subdirs)
  • Confirmed perf hot spots with file:line references
  • Sub-plan ordering by impact-per-LOC, surgical wins first, structural refactor last

Sub-plans (in suggested ship order)

  1. 01-link-batching.md — link/prepare.py:105-135 fires ~6.6M SELECTs for a 600k import. Fix: name→pk dict per model. Drops to ~14 SELECTs total.
  2. 02-create-fk-batching.md — create/foreign_keys.py:46-75 fires ~2M SELECTs via per-row field_model.objects.get(). Fix: pre-fetch parent FK pk-maps, pass via the <field>_id=<pk> shortcut.
  3. 03-query-prune.md — moved/comics.py:60-68 has a per-comic Folder N+1 (1.2M SELECTs → 2); narrows prefetch_related/select_related to fields actually referenced; hoists status_controller.update() out of inner loops.
  4. 05-sqlite-tuning.md — phase-level transaction.atomic (no atomic exists anywhere in scribe currently), importer-scoped PRAGMAs (cache_size=512MB, wal_autocheckpoint=0), connection_created signal handler, post-import PRAGMA optimize + wal_checkpoint(TRUNCATE).
  5. 06-comicbox-side.md — comicbox-repo PRs: to_dict(keys=...) field projection, CIX-first probe order, filesystem-mtime pre-filter to skip archive opens entirely on unchanged files.
  6. 04-streaming-pipeline.md — split into reference-data phase (all-at-once, ~800MB) + chunked comic ingestion (5-10k per chunk, ~30-60MB peak). JSONL spill file. Resume-from-watermark. Memory cap drops from ~10GB peak to <1.5GB. Biggest refactor; ships last.
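The batching pattern behind sub-plans 01 and 02 can be sketched without Django: replace a per-row lookup query with one query that builds a name→pk map, then link rows via dict lookups. This is a minimal illustration using the stdlib sqlite3 module; the `publisher` table and column names are invented for the sketch, not codex's actual schema.

```python
import sqlite3

def build_pk_map(conn: sqlite3.Connection, table: str) -> dict[str, int]:
    """One SELECT per model instead of one per row: name -> pk."""
    # Table name is interpolated only because this is a sketch with a
    # hard-coded, trusted table name.
    return dict(conn.execute(f"SELECT name, id FROM {table}"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE publisher (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
conn.executemany("INSERT INTO publisher (name) VALUES (?)",
                 [("Marvel",), ("DC",), ("Image",)])

pk_map = build_pk_map(conn, "publisher")

# Linking incoming rows now costs a dict lookup each, not a SELECT per comic.
incoming = ["DC", "Image", "DC"]
publisher_ids = [pk_map[name] for name in incoming]
print(publisher_ids)  # -> [2, 3, 2]
```

The same map also enables the sub-plan 02 shortcut: with the parent pk in hand, a child row can be created by assigning the raw FK column value directly instead of fetching the parent object first.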

Test plan

  • Review 00-meta.md for accuracy of surface map and findings
  • Confirm sub-plan ordering matches your priorities
  • Flag any sub-plan that needs more detail or has a wrong premise
  • Approve as planning artifact (no code changes); implementation lands in follow-up PRs per sub-plan

🤖 Generated with Claude Code

ajslater and others added 2 commits April 27, 2026 19:51
Meta-plan + 6 sub-plans for making a 600k-comic bulk import faster
without sacrificing correctness. Surface map, ranked findings with
line numbers, methodology, correctness invariants, sub-plan
ordering by impact-per-LOC.

Sub-plans:
- 01 link-batching: ~6.6M SELECTs in link/prepare.py -> ~14 via
  per-model name->pk batching
- 02 create-fk-batching: ~2M SELECTs in create/foreign_keys.py
  -> ~10 via pre-fetched parent pk-maps + <field>_id= shortcut
- 03 query-prune: moved/comics.py per-comic Folder N+1 (1.2M
  -> 2 SELECTs), narrowed prefetch/select_related, status update
  hoisting
- 05 sqlite-tuning: phase-level transaction.atomic, importer-
  scoped cache_size + wal_autocheckpoint=0, post-import optimize
  + checkpoint
- 06 comicbox-side: to_dict(keys=...) field projection, CIX-first
  probe order, filesystem-mtime pre-filter
- 04 streaming-pipeline: reference-data phase + chunked comic
  ingestion, JSONL spill, resume-from-watermark; biggest refactor,
  ships LAST

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
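The sqlite-tuning PRAGMAs named in the 05 sub-plan can be demonstrated with the stdlib sqlite3 module. The values mirror the plan (cache_size of ~512MB expressed as negative KiB, wal_autocheckpoint=0, post-import optimize and checkpoint), but the connection below is a throwaway in-memory database, not codex's Django connection, where this would run in a connection_created signal handler.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Importer-scoped tuning (per the plan, applied via a connection_created
# handler in Django; shown inline here).
conn.execute("PRAGMA cache_size = -524288")    # negative value = KiB, so ~512MB
conn.execute("PRAGMA wal_autocheckpoint = 0")  # defer WAL checkpoints until asked

# ... bulk import would happen here ...

# Post-import cleanup.
conn.execute("PRAGMA optimize")
conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")  # no-op on non-WAL databases

cache_size = conn.execute("PRAGMA cache_size").fetchone()[0]
print(cache_size)  # -> -524288
```

A side note on the negative convention: a positive cache_size is a page count, while a negative one is a size in KiB, which is why 512MB appears as -524288.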
Two changes from PR #627 review:

04-streaming-pipeline.md:
- Replace fixed CHUNK_SIZE=5000 with dynamic sizing via the
  existing get_mem_limit() helper (codex/librarian/memory.py),
  which already handles cgroups + psutil fallback. Floor 1000,
  ceiling 50k, default 25% of mem budget per chunk.
- Move spill file from CONFIG_PATH to ROOT_CACHE_PATH, alongside
  other codex cache artifacts. Document /tmp rejection reasons.

06-comicbox-side.md:
- Withdraw Improvement C (CIX-first probe ordering). Comicbox
  already does the optimal cheap path: box/sources.py:155 walks
  namelist() and matches against FILENAME_FORMAT_MAP, then each
  matched format is loaded with its specific schema -- no
  try-every-format loop. Adding probe ordering would risk dropping
  data from MetronInfo/CoMet/etc. which codex officially supports.
- Reword the worker-compute-waste paragraph to clarify that the
  waste is per-format-keys, not multiple-formats.
- Drop probe-related entries from suggested commit shape, test
  plan, and correctness invariants.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
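The spill-file mechanics from the 04 streaming plan (JSONL spill plus resume-from-watermark) reduce to a small generator. The record shape and file name below are hypothetical stand-ins, not codex's actual spill schema, and a temporary directory stands in for ROOT_CACHE_PATH.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path

# Hypothetical record shape; codex's actual spill schema may differ.
records = [{"path": f"comic_{i}.cbz", "pk": i} for i in range(10)]

spill = Path(tempfile.mkdtemp()) / "import_cache.jsonl"
with spill.open("w") as fp:
    for rec in records:
        fp.write(json.dumps(rec) + "\n")

def chunks(path: Path, size: int, watermark: int = 0):
    """Yield lists of records, skipping the first `watermark` lines on resume."""
    with path.open() as fp:
        it = (json.loads(line) for line in islice(fp, watermark, None))
        while chunk := list(islice(it, size)):
            yield chunk

# Resume after a crash that got through the first 4 records.
resumed = list(chunks(spill, size=3, watermark=4))
print([len(c) for c in resumed])  # -> [3, 3]
```

Because JSONL is line-delimited, the watermark is just a line count, so resuming never requires re-parsing or rewriting the spill file.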
@ajslater
Owner Author

Revisions in 331860aa from review feedback:

04-streaming-pipeline.md

  • Dynamic chunk size via codex/librarian/memory.py:get_mem_limit() (already handles cgroups2/cgroups1 + psutil.virtual_memory() fallback). Floor 1000, ceiling 50k, default 25% mem budget. Indicative table: ~50k on a 4GB Pi, ~16k on a 1GB Pi, ~4k on a 256MB container, floor on anything smaller.
  • Spill file moves to ROOT_CACHE_PATH / "import_cache.jsonl" (settings/__init__.py:523). Documented why /tmp is wrong (tmpfs OOM risk on Pi) and why ROOT_CACHE_PATH is right (persistent across reboots, same volume as DB, already managed).
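The dynamic sizing rule above (floor 1000, ceiling 50k, 25% of the memory budget per chunk) reduces to a clamp. In this sketch, PER_COMIC_BYTES is an assumed average in-memory cost chosen so the output matches the indicative table; the real value would come from measurement, and get_mem_limit() would supply the memory budget in codex.

```python
CHUNK_FLOOR = 1_000
CHUNK_CEILING = 50_000
PER_COMIC_BYTES = 16 * 1024  # assumed average cost per in-flight comic record

def chunk_size(mem_limit_bytes: int) -> int:
    """25% of the memory budget, clamped to [CHUNK_FLOOR, CHUNK_CEILING]."""
    budget = mem_limit_bytes // 4
    return max(CHUNK_FLOOR, min(CHUNK_CEILING, budget // PER_COMIC_BYTES))

GIB = 1024 ** 3
print(chunk_size(4 * GIB))         # 4GB Pi         -> 50000 (hits ceiling)
print(chunk_size(1 * GIB))         # 1GB Pi         -> 16384
print(chunk_size(256 * 1024 ** 2)) # 256MB container -> 4096
print(chunk_size(16 * 1024 ** 2))  # tiny host       -> 1000 (hits floor)
```

Under these assumed constants the results line up with the table in the plan: ~50k on a 4GB Pi, ~16k on a 1GB Pi, ~4k on a 256MB container, and the floor below that.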

06-comicbox-side.md

  • Withdrew Improvement C entirely. After reading comicbox/box/sources.py:155-174 and comicbox/box/load.py:118-141, comicbox already does the optimal thing: walks namelist(), matches filenames against FILENAME_FORMAT_MAP, and only loads/parses formats that are actually present — each SourceData carries the pre-selected fmt so _load_metadata skips the try-every-format fallback. Probe ordering on top of that would risk dropping MetronInfo/CoMet/CBI data that codex officially supports.
  • Reworded the worker-compute-waste framing to clarify the remaining waste is within each format (parsing keys codex doesn't read), not across formats. Improvement A (to_dict(keys=...) field projection) still stands.
  • Cleaned up cross-references to the dropped improvement in commit shape, test plan, and correctness invariants.
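The surviving comicbox-side win that pairs with Improvement A is the filesystem-mtime pre-filter: skip opening an archive at all when its mtime has not changed since the last import. This sketch keeps the cache in a plain dict; codex would persist the per-path mtime in its database, and the cache name and function are illustrative.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical in-memory mtime cache; codex would persist this per-path.
seen_mtimes: dict[str, float] = {}

def needs_read(path: Path) -> bool:
    """Return False (skip the archive open) when the file's mtime is unchanged."""
    mtime = os.stat(path).st_mtime
    if seen_mtimes.get(str(path)) == mtime:
        return False
    seen_mtimes[str(path)] = mtime
    return True

cbz = Path(tempfile.mkdtemp()) / "issue_001.cbz"
cbz.write_bytes(b"fake archive")

first = needs_read(cbz)   # unseen path -> True, mtime recorded
second = needs_read(cbz)  # unchanged   -> False, archive open skipped
print(first, second)  # -> True False
```

A stat call is orders of magnitude cheaper than opening a zip and parsing metadata, so on a mostly unchanged library this filter removes almost all archive work.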

