Fix catalog import checkpoint state #17

Merged: luca-ctx merged 1 commit into main from ctx/catalog-index-checkpoint-fix on Jul 1, 2026
@luca-ctx luca-ctx commented Jul 1, 2026

Summary

Fixes #16.

This separates catalog completion metadata from incremental import checkpoint metadata for cataloged Codex transcript files.

Problem

When a cataloged transcript changes, catalog_sessions.indexed_status correctly becomes pending, but old indexed_* fields could remain populated. That is misleading because those fields describe the previous file version, not the current cataloged file.

External PR #15 identified the real stale-metadata bug. However, its approach of clearing all old indexed_* values on any file change does not work for ctx on its own, because the append-only refresh for Codex session JSONL files was using indexed_file_size_bytes as the tail-import checkpoint. Clearing that value makes an append refresh reprocess old events instead of importing only the appended tail.
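
To illustrate why the stored size matters, here is a minimal sketch of a tail read that starts at a byte offset. `read_appended_tail` is a hypothetical helper, not the code in ctx; it assumes only that the checkpoint is a byte size, as described above.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical helper: returns only the bytes after `checkpoint_bytes`.
/// With no checkpoint (`None`), reading starts at offset 0, so every
/// previously imported event is read again.
fn read_appended_tail<R: Read + Seek>(
    source: &mut R,
    checkpoint_bytes: Option<u64>,
) -> io::Result<Vec<u8>> {
    let start = checkpoint_bytes.unwrap_or(0);
    source.seek(SeekFrom::Start(start))?;
    let mut tail = Vec::new();
    source.read_to_end(&mut tail)?;
    Ok(tail)
}
```

With a checkpoint of 8 bytes on a 16-byte file, the helper returns the last 8 bytes; with the checkpoint cleared, it returns all 16.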

Data model change

  • Keeps indexed_* fields as current-file completion metadata only.
  • Adds explicit checkpoint columns on catalog_sessions:
    • last_imported_at_ms
    • last_imported_file_size_bytes
    • last_imported_file_modified_at_ms
    • last_imported_file_sha256
    • last_imported_event_count
  • Bumps the history store schema to v14.
  • Backfills existing databases from old indexed_* metadata where possible.
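
As a rough picture of the new columns, the sketch below models them as an in-memory struct. The field names mirror the column names listed above; the struct name `ImportCheckpoint`, the Rust types, and the two helper methods are assumptions made for illustration, not the types used in the history store.

```rust
/// Hypothetical in-memory view of the checkpoint columns on `catalog_sessions`.
/// Every field is optional because the v14 columns are nullable.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct ImportCheckpoint {
    pub last_imported_at_ms: Option<i64>,
    pub last_imported_file_size_bytes: Option<u64>,
    pub last_imported_file_modified_at_ms: Option<i64>,
    pub last_imported_file_sha256: Option<String>,
    pub last_imported_event_count: Option<u64>,
}

impl ImportCheckpoint {
    /// Byte offset a tail import could resume from, if one was recorded.
    pub fn tail_offset(&self) -> Option<u64> {
        self.last_imported_file_size_bytes
    }

    /// Backfilled legacy checkpoints carry no hash; new imports do.
    pub fn is_hash_backed(&self) -> bool {
        self.last_imported_file_sha256.is_some()
    }
}
```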

Behavior

  • Unchanged files preserve completion and checkpoint state.
  • Changed files clear stale current-version indexed_* completion metadata.
  • Append-only growth preserves a checkpoint only when the previous row was fully indexed to the prior EOF.
  • Codex tail refresh reads the explicit checkpoint fields, validates the checkpoint prefix hash when available, and then imports only appended bytes.
  • Migrated legacy checkpoints without hashes remain usable for compatibility; the next successful import writes a hash-backed checkpoint.
  • A file that shrinks, or whose metadata changes while its size stays the same, clears the checkpoint so old offsets are not trusted.
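
The rules above can be sketched as a small classification function. All names here (`PreviousRow`, `ObservedFile`, `CheckpointAction`, `classify_change`) are hypothetical, and detecting "unchanged" by comparing size and modification time is an assumption; the actual catalog code may compare other fields.

```rust
/// What the catalog recorded for the file at the previous scan (assumed shape).
#[derive(Debug, Clone, PartialEq, Eq)]
struct PreviousRow {
    file_size_bytes: u64,
    file_modified_at_ms: i64,
    /// True when the previous file version was fully indexed to its EOF.
    fully_indexed: bool,
}

/// File metadata observed during the current scan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ObservedFile {
    size_bytes: u64,
    modified_at_ms: i64,
}

#[derive(Debug, PartialEq, Eq)]
enum CheckpointAction {
    /// Unchanged file: keep completion and checkpoint state.
    KeepAll,
    /// File grew: clear indexed_* but keep the checkpoint for a tail import.
    ClearCompletionKeepCheckpoint,
    /// Shrink, same-size change, or untrusted growth: clear both.
    ClearAll,
}

fn classify_change(prev: &PreviousRow, now: ObservedFile) -> CheckpointAction {
    let same_size = now.size_bytes == prev.file_size_bytes;
    let same_mtime = now.modified_at_ms == prev.file_modified_at_ms;
    if same_size && same_mtime {
        return CheckpointAction::KeepAll;
    }
    // Growth alone does not prove the prefix is intact; that is left to the
    // prefix-hash validation performed by the tail refresh.
    if now.size_bytes > prev.file_size_bytes && prev.fully_indexed {
        CheckpointAction::ClearCompletionKeepCheckpoint
    } else {
        CheckpointAction::ClearAll
    }
}
```

In this sketch a grown file keeps its checkpoint only if the previous version was fully indexed; every other kind of change falls through to `ClearAll`.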

Migration notes

The v14 migration adds nullable checkpoint columns and backfills last_imported_* from existing indexed metadata. New successful imports store last_imported_file_sha256 so future tail refreshes can reject rewritten prefixes before using a checkpoint.
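
A minimal sketch of the backfill and the prefix check follows, under stated assumptions. `Checkpoint`, `backfill_checkpoint`, and `validated_tail_offset` are hypothetical names, only two of the checkpoint columns are modeled, and the hash function is passed in as a parameter because the Rust standard library has no SHA-256; real code would supply a SHA-256 implementation there.

```rust
/// Hypothetical two-column subset of the checkpoint state.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
struct Checkpoint {
    last_imported_file_size_bytes: Option<u64>,
    last_imported_file_sha256: Option<String>,
}

/// Backfill from legacy metadata: the old indexed size becomes the
/// checkpoint size, and the hash stays empty until the next import.
fn backfill_checkpoint(indexed_file_size_bytes: Option<u64>) -> Checkpoint {
    Checkpoint {
        last_imported_file_size_bytes: indexed_file_size_bytes,
        last_imported_file_sha256: None,
    }
}

/// Returns the offset to resume from, or `None` if the checkpoint
/// cannot be trusted for this file content.
fn validated_tail_offset<H: Fn(&[u8]) -> String>(
    file_bytes: &[u8],
    checkpoint: &Checkpoint,
    sha256_hex: H, // assumed to return a hex digest of its input
) -> Option<usize> {
    let size = checkpoint.last_imported_file_size_bytes?;
    if size > file_bytes.len() as u64 {
        return None; // file is shorter than the checkpoint
    }
    let end = size as usize; // safe: size <= file_bytes.len()
    let prefix = &file_bytes[..end];
    match &checkpoint.last_imported_file_sha256 {
        // Hash-backed checkpoint: reject a rewritten prefix.
        Some(expected) if sha256_hex(prefix) != *expected => None,
        // Matching hash, or a legacy checkpoint without a hash.
        _ => Some(end),
    }
}
```

Accepting a hashless checkpoint mirrors the compatibility behavior described above: a migrated row stays usable, and the prefix check only becomes effective once a later import has stored a hash.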

Tests

  • cargo fmt --all --check
  • cargo test -p ctx-history-store
  • cargo test -p ctx --test cli search_refresh_auto_tail_imports_appended_codex_session_event
  • cargo test -p ctx catalog_import_checkpoint_requires_matching_hash
  • cargo test -p ctx-history-capture
  • cargo test --workspace

@luca-ctx luca-ctx merged commit 22c2fde into main Jul 1, 2026
@luca-ctx luca-ctx deleted the ctx/catalog-index-checkpoint-fix branch July 1, 2026 15:41
Development

Successfully merging this pull request may close these issues.

Fix catalog import checkpoint semantics for changed transcripts
