documents index <collection> wipes chunks for all other collections in the same index #100

@Neikan-BSN

Summary

codanna documents index (with --collection X or --all) wipes the chunks
belonging to every other configured collection in the same .codanna/index/.
As a result, only one collection per index has searchable chunks at any time.
--all does not actually persist all collections: it indexes the collections
one at a time, and each iteration deletes the previous one's chunks, so only
the last-iterated collection survives.

Environment

  • codanna 0.9.19 (latest as of 2026-04-25)
  • Linux 6.6.87.2-microsoft-standard-WSL2 (x86_64)

Reproduction

.codanna/settings.toml:

version = 1
index_path = ".codanna/index"
workspace_root = "/home/user/repo"

[documents]
enabled = true

[documents.collections.alpha]
paths = ["/path/to/alpha-docs/"]
patterns = ["**/*.md"]

[documents.collections.beta]
paths = ["/path/to/beta-docs/"]
patterns = ["**/*.md"]

Direct demonstration of the wipe:

$ codanna documents stats alpha
  Chunks: 2236
$ codanna documents stats beta
  Chunks: 0

$ codanna documents index --collection beta --no-progress
Indexing collection: beta
  Files processed: 228
  Chunks created: 6084
  Chunks removed: 2236        ← removes alpha's chunks

$ codanna documents stats alpha
  Chunks: 0                   ← wiped
$ codanna documents stats beta
  Chunks: 6084

--all exhibits the same pattern: each iteration wipes the previous, and only
the last-iterated collection retains chunks.

Expected

  • documents index --collection X should only operate on chunks tagged
    collection_name == X. Chunks from other collections should be untouched.
  • documents index --all should leave every configured collection's chunks
    intact.
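
The expected behaviour can be sketched with a small in-memory model. This is not codanna's code: the Chunk struct and the prune_collection and prune_global functions are hypothetical names made up for illustration, and the assumption that staleness is decided by comparing against the current run's file list is a guess based on the output above.

```rust
// Hypothetical in-memory model of indexed chunks; not codanna's actual types.
#[derive(Debug, Clone, PartialEq)]
struct Chunk {
    id: u64,
    collection_name: String,
    file_path: String,
}

/// Expected behaviour: remove a chunk only when it belongs to `target` and
/// its file is no longer among `current_files`. Chunks tagged with any other
/// collection are kept untouched. Returns the number of chunks removed.
fn prune_collection(chunks: &mut Vec<Chunk>, target: &str, current_files: &[&str]) -> usize {
    let before = chunks.len();
    chunks.retain(|c| {
        c.collection_name != target || current_files.contains(&c.file_path.as_str())
    });
    before - chunks.len()
}

/// For contrast, the behaviour the report suggests: staleness is judged
/// against the current run's file list without looking at the collection,
/// so every chunk from other collections counts as stale.
fn prune_global(chunks: &mut Vec<Chunk>, current_files: &[&str]) -> usize {
    let before = chunks.len();
    chunks.retain(|c| current_files.contains(&c.file_path.as_str()));
    before - chunks.len()
}
```

With two alpha chunks and two beta chunks, re-indexing beta with one surviving file removes one chunk under prune_collection and three under prune_global, which mirrors the "Chunks removed: 2236" line in the reproduction.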

Diagnostic data

After documents index --all, .codanna/index/documents/state.json:

  • file_states contains entries for only the last-iterated collection
  • collection_ids correctly registers all configured collection names
  • next_chunk_id is much larger than the surviving chunk count, suggesting
    that chunk IDs were allocated for every collection but only the most
    recently indexed collection's chunks were persisted

The tantivy schema has a collection_name field, but the chunk-pruning step
appears to delete by global state rather than scoping to the target collection.

Secondary issue

documents stats <coll> reports the same Files: count for every
collection. It appears to show the total state.json.file_states size, not
the per-collection count. Only Chunks: reflects the per-collection value.
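
A per-collection Files: count could be derived roughly as follows. This is only a sketch: the real layout of file_states is not shown in this report, so the FileState struct, its collection_name field, and the files_per_collection function are all assumptions made for illustration.

```rust
use std::collections::BTreeMap;

// Hypothetical shape of one file_states entry. The assumption that each
// entry records its owning collection is not confirmed by the report.
struct FileState {
    path: String,
    collection_name: String,
}

/// Count tracked files per collection instead of reporting the total
/// number of file_states entries for every collection.
fn files_per_collection(file_states: &[FileState]) -> BTreeMap<&str, usize> {
    let mut counts = BTreeMap::new();
    for state in file_states {
        *counts.entry(state.collection_name.as_str()).or_insert(0) += 1;
    }
    counts
}
```

Under this assumption, the per-collection counts sum to the total that documents stats currently prints for every collection.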

Workaround (verified)

Configure a single collection per index with all source paths combined:

[documents.collections.docs]
paths = [
  "/path/to/alpha-docs/",
  "/path/to/beta-docs/",
]
patterns = ["**/*.md"]

Empirically verified: 278 files across two source paths produce 8320 chunks in
one collection, all searchable, with Chunks removed: 0. This loses the ability
to scope searches by collection name, but it keeps every chunk indexed until
the bug is fixed.
