Importer perf: filesystem mtime pre-filter ahead of comicbox workers #632

Merged
ajslater merged 1 commit into v1.11-performance from importer-comicbox-mtime-prefilter
Apr 28, 2026

@ajslater
Owner

Summary

Implements Improvement D from tasks/importer-perf/06-comicbox-side.md (planning PR #627). First slice of sub-plan 06; codex-only, no comicbox change required.

Comicbox's worker already short-circuits on the embedded-metadata mtime check, but only after opening the archive. For CBR that's an unrar spawn + header parse; for CBZ it's a central-directory read. On a re-import where most archives are unchanged on disk, that's ~600k wasted archive opens.

The pre-filter compares each archive's stat().st_mtime against the recorded metadata_mtime in the DB before submitting to iter_process_files. Any change to the embedded metadata mtime also advances the file's filesystem mtime, so a path whose filesystem mtime hasn't advanced since the last import is guaranteed to have an unchanged embedded mtime. The pre-filter therefore errs only toward false positives (unchanged files let through, then no-op'd by the worker check), never false negatives (changed files dropped).

Skipped paths land in self.metadata[SKIPPED] so:

  • The existing "Skipped N comics because metadata appears unchanged" log line covers both pre-filter and worker-level skips.
  • The status counter still reaches total_paths at finish (visible progress to the user).

Gated on all_old_comic_mtimes being non-empty, which honors the existing task.check_metadata_mtime flag. When that flag is off, every path goes through to the worker exactly as before.
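
The pre-filter, its gate, and the SKIPPED bookkeeping described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the function name prefilter_paths, the string value of SKIPPED, and the representation of recorded mtimes as POSIX timestamps keyed by path are all hypothetical.

```python
from __future__ import annotations

import os
from collections.abc import Iterable, Mapping

SKIPPED = "skipped"  # hypothetical key; the real constant lives in codex


def prefilter_paths(
    all_paths: Iterable[str],
    all_old_comic_mtimes: Mapping[str, float],
    metadata: dict,
) -> list[str]:
    """Return the paths that still need a comicbox worker."""
    paths = list(all_paths)
    # Gate: an empty mapping means the mtime check is off (or this is a
    # fresh import), so every path goes to the workers exactly as before.
    if not all_old_comic_mtimes:
        return paths

    skipped = metadata.setdefault(SKIPPED, set())
    to_submit = []
    for path in paths:
        old_mtime = all_old_comic_mtimes.get(path)
        try:
            # Assumption: recorded mtimes are POSIX timestamps (floats).
            unchanged = (
                old_mtime is not None and os.stat(path).st_mtime <= old_mtime
            )
        except OSError:
            unchanged = False  # let the worker report the real error
        if unchanged:
            skipped.add(path)  # counted by the existing "Skipped N" log line
        else:
            to_submit.append(path)
    return to_submit
```

In this sketch every input path ends up either in the returned list or in metadata[SKIPPED], which is why a progress counter driven by both can still reach total_paths.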

Expected speedup

For a re-import where 99% of files are unchanged (the common case for daily polling), this drops the worker submit count from 600k to ~6k. Worker startup amortizes over fewer files, and the bulk of comicbox's work — archive open, namelist scan, format probing — disappears for the unchanged files. Order-of-magnitude: minutes saved per re-import.
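
As a quick check of the arithmetic above, taking the PR's own figures (600k paths, 99% unchanged) as the inputs; the helper name expected_submits is made up for this illustration:

```python
def expected_submits(total_paths: int, unchanged_percent: int) -> int:
    """Paths still submitted to workers after the pre-filter drops the rest."""
    # Integer arithmetic avoids floating-point rounding noise.
    return total_paths * (100 - unchanged_percent) // 100


print(expected_submits(600_000, 99))  # 6000
```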

For a fresh import (no DB rows), all_old_comic_mtimes is empty so the pre-filter is a no-op.

Correctness

  • False-positive only: any modification that updates the embedded metadata_mtime also bumps the archive file's mtime, so a file the pre-filter drops (fs_mtime <= old_mtime) cannot have a newer embedded mtime. The reverse is not true: running touch on the file bumps fs_mtime without changing the embedded metadata. Such files pass the pre-filter and are caught by the worker's existing embedded-mtime check.
  • OSError: a vanished file or filesystem hiccup at stat time falls through to the worker, which produces a FailedImport with the actual error.
  • No-record paths: paths without a DB entry (new comics) skip the pre-filter check and go straight to the worker.
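
The three cases above can be spelled out as a per-path decision. The sketch below is hypothetical (the Decision enum and classify function do not come from codex) and again assumes recorded mtimes are POSIX timestamps; it only shows which branch each case takes.

```python
from __future__ import annotations

import os
from enum import Enum


class Decision(Enum):
    SKIP = "skip"               # unchanged on disk; never reaches a worker
    SUBMIT_NEW = "new"          # no DB record, so nothing to compare against
    SUBMIT_CHANGED = "changed"  # fs mtime advanced (real edit or plain touch)
    SUBMIT_ERROR = "error"      # stat failed; the worker surfaces the error


def classify(path: str, old_mtime: float | None) -> Decision:
    """Decide whether a path can be dropped before reaching a worker."""
    if old_mtime is None:
        return Decision.SUBMIT_NEW
    try:
        fs_mtime = os.stat(path).st_mtime
    except OSError:
        return Decision.SUBMIT_ERROR
    if fs_mtime <= old_mtime:
        return Decision.SKIP
    # May be a false positive (e.g. touch); the worker's embedded-mtime
    # check short-circuits it later.
    return Decision.SUBMIT_CHANGED
```

Only Decision.SKIP removes a path from the worker queue; every uncertain case errs toward submitting.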

Test plan

  • make fix clean
  • make lint-python clean (0 errors, 0 warnings)
  • pytest tests/importer/ tests/test_search_fts.py — 7 passed
  • Field check: re-import a 1k-comic library (no changes); assert len(metadata[SKIPPED]) ≈ all paths and the log shows the pre-filter dropping most of them
  • Wall-clock: cold import then warm re-import; expect warm to be 10x+ faster

🤖 Generated with Claude Code

Comicbox's worker already short-circuits on the embedded-metadata
mtime check, but only after opening the archive. For CBR that's an
unrar spawn + header parse; for CBZ it's a central-directory read.
On a re-import where most archives are unchanged on disk, that's
~600k wasted archive opens.

Pre-filter all_paths in the parent process by comparing the
archive's stat mtime against the recorded metadata_mtime in the DB.
Any change to the embedded mtime also advances the filesystem mtime,
so a path whose filesystem mtime hasn't advanced since last import is
guaranteed to have an unchanged embedded mtime. Dropping it never
yields a false negative; only false positives slip through, where
the file's mtime advanced via ``touch`` without a content change.
Those still flow into the worker's existing embedded-mtime check and
short-circuit there.

Skipped paths land in self.metadata[SKIPPED] so the existing
"skipped N comics because metadata appears unchanged" log line
covers both pre-filter and worker-level skips, and the status
counter still reaches total_paths at finish.

Gated on ``all_old_comic_mtimes`` being non-empty (which is the
``task.check_metadata_mtime`` flag's existing gating). When the
flag is off, every path goes through to the worker exactly as
before.

Implements Improvement D from tasks/importer-perf/06-comicbox-side.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ajslater merged commit 2fcccef into v1.11-performance Apr 28, 2026
1 check failed
ajslater deleted the importer-comicbox-mtime-prefilter branch May 2, 2026 22:39