Skip to content

fix: harden rag index thread safety#746

Open
grtninja wants to merge 2 commits intoamd:mainfrom
grtninja:codex/issue-455-rag-thread-safety
Open

fix: harden rag index thread safety#746
grtninja wants to merge 2 commits intoamd:mainfrom
grtninja:codex/issue-455-rag-thread-safety

Conversation

@grtninja
Copy link
Copy Markdown

What

  • add RLock protection around RAG index mutation paths in src/gaia/rag/sdk.py
  • rebuild chunk/index state before publishing it so failed rebuilds do not leak partial state
  • query against consistent snapshots so concurrent reads do not mix old indexes with new chunk maps
  • add targeted regressions for concurrent query-vs-state-swap and failed rebuild rollback

Why

Closes #455 by tightening the shared-state contract on the RAG index surface used by concurrent chat requests.

Scope

  • src/gaia/rag/sdk.py
  • tests/test_rag.py

Validation

  • python -m py_compile src/gaia/rag/sdk.py tests/test_rag.py
  • PYTHONPATH=src pytest tests/test_rag.py -k "consistent_snapshot or rebuild_fails or document_indexing or document_querying or status_reporting or cache_functionality" -q

Notes

  • kept the lane focused to the issue surface only
  • no UI or broader agent-loop changes

@github-actions github-actions bot added rag RAG system changes tests Test changes performance Performance-critical changes labels Apr 11, 2026
@grtninja
Copy link
Copy Markdown
Author

@coderabbitai review

@kovtcharov
Copy link
Copy Markdown
Collaborator

@claude review this PR.

@kovtcharov kovtcharov assigned kovtcharov and unassigned kovtcharov Apr 11, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 11, 2026

Claude finished @kovtcharov's task —— View job


PR Review: RAG Index Thread Safety

  • Read PR diff and changed files
  • Review src/gaia/rag/sdk.py changes
  • Review tests/test_rag.py changes
  • Provide comprehensive review feedback

Summary

This PR adds thread safety to the RAG SDK by introducing RLock protection around index mutation paths and ensuring queries operate on consistent snapshots. The implementation is well-structured and addresses the core issue of concurrent access to shared state.

Review Feedback

🟢 Strengths

  1. Snapshot-based query isolation (sdk.py:1454-1464): The _snapshot_query_state() method correctly captures a consistent read snapshot under the lock, allowing expensive embedding/search work to proceed without holding the lock.

  2. Atomic state publishing (sdk.py:1526-1530): The remove_document() method now builds new state structures first, then publishes them atomically only after _create_vector_index() succeeds. This prevents partial state from being visible.

  3. Good use of RLock: Using threading.RLock() allows nested locking (e.g., reindex_document() calling remove_document() and index_document() within the same lock).

  4. Well-targeted tests: The two new tests (test_query_uses_consistent_snapshot_during_state_swap and test_remove_document_preserves_state_when_rebuild_fails) directly validate the race conditions being fixed.

🟡 Suggestions

  1. Deadlock risk in reindex_document() (sdk.py:1579-1595): This method holds _state_lock while calling index_document(), which also acquires _state_lock. While RLock handles this correctly, the current structure holds the lock across both the remove and index operations. Consider whether this is intentional, as it prevents concurrent reads during reindexing.

  2. Lock scope in _retrieve_chunks_from_file() (sdk.py:2192-2206): The method correctly snapshots state under the lock, but then the rebuilding of a missing per-file index (sdk.py:2216-2223) doesn't cache the result back. The old code cached it:

    # Old code:
    self.file_indices[file_path] = file_index
    self.file_embeddings[file_path] = file_embeddings

    This was intentionally removed to avoid mutating shared state mid-query, but it means repeated queries for the same uncached file will rebuild the index each time. Consider documenting this tradeoff.

  3. Bounds check in _search_chunks() (sdk.py:2320-2326): Good addition of bounds validation:

    if idx < 0 or idx >= len(chunks_snapshot):
        continue

    This handles edge cases where FAISS returns invalid indices.

  4. Double capacity check (sdk.py:1888-1398 and 2028-2052): The capacity check happens twice in the cache-load path (once before checking cache, once after loading cache). This is intentional for TOCTOU safety but could benefit from a brief comment explaining why.

🟢 Minor Notes

  • The clear_cache() method now acquires the lock (sdk.py:2491-2495) - good for consistency.
  • The _retrieve_chunks_with_metadata() signature change to accept an optional search_snapshot parameter (sdk.py:2254-2255) is a clean refactor that enables snapshot reuse.

Verdict

Approve - This is a solid thread-safety improvement. The changes are focused, well-tested, and follow the correct pattern of snapshot-then-process for read operations and build-then-publish for write operations.


@kovtcharov
Copy link
Copy Markdown
Collaborator

Can you address the 4 suggestions by claude and i'll approve.

@grtninja
Copy link
Copy Markdown
Author

yeah I'm working on this now, thanks!

@grtninja
Copy link
Copy Markdown
Author

Addressing Claude's 4 suggestions now on the latest head.

What changed:

  • documented the intentional reindex_document() lock scope so readers do not observe a remove/readd gap during reindexing
  • documented the query-local uncached per-file index rebuild tradeoff in _retrieve_chunks_from_file()
  • documented the FAISS bounds guard in _search_chunks()
  • documented the double capacity check as intentional TOCTOU protection
  • cleaned the two touched-surface style nits in the same files

Validation:

  • PYTHONPATH=src python -m pytest -q tests/test_rag.py -k "consistent_snapshot or preserves_state_when_rebuild_fails"
  • ruff check src/gaia/rag/sdk.py tests/test_rag.py
  • python -m black --check src/gaia/rag/sdk.py tests/test_rag.py
  • git diff --check

kovtcharov-amd pushed a commit that referenced this pull request Apr 13, 2026
- Move v0.17.2 to Shipped section with release summary
- Add v0.17.3 as current release (RAG reliability, security, website)
- Update v0.18.x tables to reflect issue triage (moved 10 issues out of v0.17.3)
- Add v0.19.0 model fine-tuning deliverables (was missing from table)
- Add v0.22.0 App Consolidation section and Mermaid node (was missing)
- Remove duplicate #455 from v0.18.2 (addressed by PR #746 in v0.17.3)
- Add Detailed Plan links to v0.17.3
- Set sequential weekly deadlines: Apr 17 through Jun 16
- Update timestamp to April 13, 2026

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Performance-critical changes rag RAG system changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add thread safety to RAG index operations

2 participants