Importer perf: filesystem mtime pre-filter ahead of comicbox workers #632
Merged
ajslater merged 1 commit into v1.11-performance on Apr 28, 2026
Comicbox's worker already short-circuits on the embedded-metadata mtime check, but only after opening the archive. For CBR that's an unrar spawn + header parse; for CBZ it's a central-directory read. On a re-import where most archives are unchanged on disk, that's ~600k wasted archive opens.

Pre-filter all_paths in the parent process by comparing each archive's stat mtime against the recorded metadata_mtime in the DB. A path whose filesystem mtime hasn't advanced since the last import is guaranteed to have an unchanged embedded mtime, because any write that changes the embedded mtime also advances the filesystem mtime. Dropping such a path therefore never yields a false negative (a changed comic wrongly skipped). Only false positives slip through: files whose mtime advanced via ``touch`` without a content change. Those still flow into the worker's existing embedded-mtime check and short-circuit there.

Skipped paths land in self.metadata[SKIPPED] so the existing "skipped N comics because metadata appears unchanged" log line covers both pre-filter and worker-level skips, and the status counter still reaches total_paths at finish.

Gated on ``all_old_comic_mtimes`` being non-empty (which is the ``task.check_metadata_mtime`` flag's existing gating). When the flag is off, every path goes through to the worker exactly as before.

Implements Improvement D from tasks/importer-perf/06-comicbox-side.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
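As a rough illustration of the pre-filter described in the commit message, here is a minimal, self-contained sketch. It is not the merged code: the function name `prefilter_unchanged`, the plain `dict` of path to recorded mtime, and the choice to pass unreadable paths through to the worker are all assumptions made for the example. Only `os.stat`, `os.utime`, and `tempfile` from the standard library are used.

```python
import os
import tempfile
from pathlib import Path


def prefilter_unchanged(all_paths, old_mtimes):
    """Split paths into (to_submit, skipped) using only os.stat().

    old_mtimes is a hypothetical dict of path -> metadata_mtime recorded at
    the last import, as a POSIX timestamp (the real storage may differ).
    """
    if not old_mtimes:
        # Fresh import, or the mtime check is disabled: nothing to compare
        # against, so every path goes to the workers unchanged.
        return list(all_paths), []

    to_submit, skipped = [], []
    for path in all_paths:
        old_mtime = old_mtimes.get(path)
        try:
            fs_mtime = os.stat(path).st_mtime
        except OSError:
            # Assumption: let the worker surface missing/unreadable files.
            to_submit.append(path)
            continue
        if old_mtime is not None and fs_mtime <= old_mtime:
            # Filesystem mtime has not advanced, so the embedded mtime
            # cannot be newer either. Skip without opening the archive.
            skipped.append(path)
        else:
            # New file, rewritten file, or a bare `touch` (false positive).
            to_submit.append(path)
    return to_submit, skipped


def demo():
    """Build three placeholder files and run the pre-filter over them."""
    with tempfile.TemporaryDirectory() as tmp:
        unchanged = os.path.join(tmp, "unchanged.cbz")
        touched = os.path.join(tmp, "touched.cbz")
        new = os.path.join(tmp, "new.cbz")
        for path in (unchanged, touched, new):
            Path(path).write_bytes(b"placeholder")

        recorded = 1_700_000_000.0
        os.utime(unchanged, (recorded, recorded))
        # Emulate `touch`: mtime moves forward, content stays the same.
        os.utime(touched, (recorded + 60, recorded + 60))

        old_mtimes = {unchanged: recorded, touched: recorded}
        return prefilter_unchanged([unchanged, touched, new], old_mtimes)
```

In this sketch, `demo()` skips only `unchanged.cbz`. The touched file and the file with no recorded mtime are both submitted, which matches the "false positives only" behavior: the worker's embedded-mtime check is still the final arbiter for anything let through.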
Summary
Implements Improvement D from `tasks/importer-perf/06-comicbox-side.md` (planning PR #627). First slice of sub-plan 06; codex-only, no comicbox change required.

Comicbox's worker already short-circuits on the embedded-metadata mtime check, but only after opening the archive. For CBR that's an unrar spawn + header parse; for CBZ it's a central-directory read. On a re-import where most archives are unchanged on disk, that's ~600k wasted archive opens.

The pre-filter compares each archive's `stat().st_mtime` against the recorded `metadata_mtime` in the DB before submitting to `iter_process_files`. A path whose filesystem mtime hasn't advanced since last import is guaranteed to have an unchanged embedded mtime (a changed embedded mtime implies an advanced filesystem mtime), so dropping it is safe. The filter produces only false positives (let through, then no-op'd by the worker check), never false negatives.

Skipped paths land in `self.metadata[SKIPPED]` so:

- the existing "skipped N comics because metadata appears unchanged" log line covers both pre-filter and worker-level skips;
- the status counter still reaches `total_paths` at finish (visible progress to the user).

Gated on `all_old_comic_mtimes` being non-empty, which honors the existing `task.check_metadata_mtime` flag. When that flag is off, every path goes through to the worker exactly as before.

Expected speedup
For a re-import where 99% of files are unchanged (the common case for daily polling), this drops the worker submit count from ~600k to ~6k. Worker startup amortizes over fewer files, and the bulk of comicbox's work (archive open, namelist scan, format probing) disappears for the unchanged files. Order of magnitude: minutes saved per re-import.
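The submit-count figure above is simple arithmetic, sketched here for other library sizes. The helper name `estimate_worker_submits` is hypothetical, and the changed percentage has to include files that were merely touched, since those also pass the pre-filter.

```python
def estimate_worker_submits(total_paths: int, changed_percent: int) -> tuple[int, int]:
    """Return (submitted, avoided) path counts for a re-import.

    changed_percent is the share of files whose filesystem mtime advanced
    since the last import, including files that were only touched.
    """
    submitted = total_paths * changed_percent // 100
    return submitted, total_paths - submitted


# The scenario from the text: 600k archives, 99% unchanged on disk.
submitted, avoided = estimate_worker_submits(600_000, 1)
print(submitted, avoided)  # 6000 594000
```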
For a fresh import (no DB rows), `all_old_comic_mtimes` is empty so the pre-filter is a no-op.

Correctness
Any write that updates the embedded `metadata_mtime` also bumps the archive file's mtime, so a file dropped by the pre-filter (`fs_mtime <= old_mtime`) cannot have a newer embedded mtime. The reverse is not true: `touch`-ing the file bumps `fs_mtime` without touching the embedded metadata. Such files slip through to the worker's existing embedded-mtime check.

Test plan
- `make fix` clean
- `make lint-python` clean (0 errors, 0 warnings)
- `pytest tests/importer/ tests/test_search_fts.py`: 7 passed
- On a re-import of an unchanged library, `len(metadata[SKIPPED])` ≈ all paths and the log shows the pre-filter dropping most of them

🤖 Generated with Claude Code