Importer perf: capture plan in tasks/#627
Merged
ajslater merged 2 commits into v1.11-performance on Apr 28, 2026
Conversation
Meta-plan + 6 sub-plans for making a 600k-comic bulk import faster without sacrificing correctness. Surface map, ranked findings with line numbers, methodology, correctness invariants, sub-plan ordering by impact-per-LOC.

Sub-plans:
- 01 link-batching: ~6.6M SELECTs in link/prepare.py -> ~14 via per-model name->pk batching
- 02 create-fk-batching: ~2M SELECTs in create/foreign_keys.py -> ~10 via pre-fetched parent pk-maps + <field>_id= shortcut
- 03 query-prune: moved/comics.py per-comic Folder N+1 (1.2M -> 2 SELECTs), narrowed prefetch/select_related, status update hoisting
- 05 sqlite-tuning: phase-level transaction.atomic, importer-scoped cache_size + wal_autocheckpoint=0, post-import optimize + checkpoint
- 06 comicbox-side: to_dict(keys=...) field projection, CIX-first probe order, filesystem-mtime pre-filter
- 04 streaming-pipeline: reference-data phase + chunked comic ingestion, JSONL spill, resume-from-watermark; biggest refactor, ships LAST

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
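The 01 link-batching idea above can be sketched in miniature with the stdlib `sqlite3` module: instead of one SELECT per link row, fetch every `(name, pk)` pair for a model once and resolve links with dict lookups. The `publisher` table and the sample names are illustrative, not codex's actual schema.

```python
import sqlite3

# Illustrative schema: one parent model whose rows get linked by name.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE publisher (pk INTEGER PRIMARY KEY, name TEXT UNIQUE);
    INSERT INTO publisher (name) VALUES ('DC'), ('Marvel'), ('Image');
    """
)

# Names to resolve during link preparation. In the slow shape, each one
# would cost its own SELECT -- millions of queries over a 600k import.
names_to_link = ["Marvel", "DC", "Marvel", "Image", "DC"]


def build_pk_map(con: sqlite3.Connection, table: str) -> dict[str, int]:
    """One SELECT fetches every (name, pk) pair for the model up front."""
    return dict(con.execute(f"SELECT name, pk FROM {table}"))  # demo only


# One query per model instead of one per link row; the rest is dict lookups.
pk_map = build_pk_map(con, "publisher")
linked_pks = [pk_map[name] for name in names_to_link]
print(linked_pks)  # resolved pks in input order
```

The same pk-map shape also serves the 02 create-fk-batching plan, where the resolved pk is assigned directly to the FK column rather than fetching the parent object.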
Two changes from PR #627 review:

04-streaming-pipeline.md:
- Replace fixed CHUNK_SIZE=5000 with dynamic sizing via the existing get_mem_limit() helper (codex/librarian/memory.py), which already handles cgroups + psutil fallback. Floor 1000, ceiling 50k, default 25% of the memory budget per chunk.
- Move the spill file from CONFIG_PATH to ROOT_CACHE_PATH, alongside other codex cache artifacts. Document the reasons /tmp was rejected.

06-comicbox-side.md:
- Withdraw Improvement C (CIX-first probe ordering). Comicbox already does the optimal cheap path: box/sources.py:155 walks namelist() and matches against FILENAME_FORMAT_MAP, then each matched format is loaded with its specific schema -- no try-every-format loop. Adding probe ordering would risk dropping data from MetronInfo/CoMet/etc., which codex officially supports.
- Reword the worker-compute-waste paragraph to clarify that the waste is per-format-keys, not multiple-formats.
- Drop probe-related entries from the suggested commit shape, test plan, and correctness invariants.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
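The dynamic chunk sizing described above might look like the following sketch. It assumes get_mem_limit() returns the memory budget in bytes; a stand-in is included here since the real helper lives in codex/librarian/memory.py, and the per-comic byte estimate is a made-up illustration, not a measured value. The floor, ceiling, and 25% fraction come from the review.

```python
CHUNK_FLOOR = 1_000
CHUNK_CEILING = 50_000
MEM_FRACTION = 0.25          # 25% of the budget per chunk, per the review
EST_BYTES_PER_COMIC = 6_000  # hypothetical average in-memory record size


def get_mem_limit() -> int:
    """Stand-in for codex's helper (the real one handles cgroups + psutil)."""
    return 2 * 1024**3  # pretend the budget is 2 GiB


def dynamic_chunk_size() -> int:
    """Size a chunk to ~25% of the memory budget, clamped to [1k, 50k]."""
    budget = get_mem_limit() * MEM_FRACTION
    return int(max(CHUNK_FLOOR, min(CHUNK_CEILING, budget // EST_BYTES_PER_COMIC)))


print(dynamic_chunk_size())
```

With a 2 GiB budget the raw estimate (~89k records) exceeds the ceiling, so the clamp returns 50,000; a tight cgroup limit would instead drive the result toward the 1,000-record floor.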
Revisions in 04-streaming-pipeline.md and 06-comicbox-side.md
Summary
Meta-plan + 6 sub-plans for making a 600k-comic bulk import faster without sacrificing correctness. Plan-only; no code changes. Mirrors the #621 / #625 plan-capture pattern.
Sub-plans (in suggested ship order)
- 01-link-batching.md — link/prepare.py:105-135 fires ~6.6M SELECTs for a 600k import. Fix: name→pk dict per model. Drops to ~14 SELECTs total.
- 02-create-fk-batching.md — create/foreign_keys.py:46-75 fires ~2M SELECTs via per-row field_model.objects.get(). Fix: pre-fetch parent FK pk-maps, pass via <field>_id=<pk> shortcut.
- 03-query-prune.md — moved/comics.py:60-68 per-comic Folder N+1 (1.2M → 2), narrowed prefetch_related/select_related to fields actually referenced, hoisted status_controller.update() out of inner loops.
- 05-sqlite-tuning.md — phase-level transaction.atomic (no atomic exists anywhere in scribe currently), importer-scoped PRAGMAs (cache_size=512MB, wal_autocheckpoint=0), connection_created signal handler, post-import PRAGMA optimize + wal_checkpoint(TRUNCATE).
- 06-comicbox-side.md — comicbox-repo PRs: to_dict(keys=...) field projection, CIX-first probe order, filesystem-mtime pre-filter to skip archive opens entirely on unchanged files.
- 04-streaming-pipeline.md — split into reference-data phase (all-at-once, ~800MB) + chunked comic ingestion (5-10k per chunk, ~30-60MB peak). JSONL spill file. Resume-from-watermark. Memory cap drops from ~10GB peak to <1.5GB. Biggest refactor; ships last.

Test plan
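The importer-scoped PRAGMA pattern from 05-sqlite-tuning can be sketched with the stdlib `sqlite3` module. The cache_size and wal_autocheckpoint values are the ones named in the plan (a negative cache_size is interpreted as KiB, so -524288 KiB is 512 MiB); the function names and the bare-connection setup here are illustrative, not codex's code.

```python
import sqlite3


def tune_for_bulk_import(con: sqlite3.Connection) -> None:
    """Importer-scoped settings: big page cache, no WAL checkpoints mid-import."""
    # Negative cache_size means KiB: -524288 KiB == 512 MiB, per the plan.
    con.execute("PRAGMA cache_size = -524288")
    # Defer WAL checkpointing until the import finishes.
    con.execute("PRAGMA wal_autocheckpoint = 0")


def finish_bulk_import(con: sqlite3.Connection) -> None:
    """Post-import cleanup: refresh planner stats, then truncate the WAL."""
    con.execute("PRAGMA optimize")
    con.execute("PRAGMA wal_checkpoint(TRUNCATE)")


con = sqlite3.connect(":memory:")
tune_for_bulk_import(con)
print(con.execute("PRAGMA cache_size").fetchone()[0])
finish_bulk_import(con)
```

In the Django setting the plan describes, the tuning half would presumably run from a connection_created signal handler so every importer connection gets the same PRAGMAs.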
- 00-meta.md for accuracy of surface map and findings

🤖 Generated with Claude Code