feat(celery Wave 3): T3.1 hard-cut legacy Celery indexing layer#1729
Merged
Conversation
… NOT NULL + rename to canonical Wave 3 hard-cut schema migration per architect msg=4a801b2b (Wave 1 Bug 2 ruling that locked the temporary v2 suffix) + msg=498b12f0 (Wave 2 informational item ruling that promoted dispatch columns NOT NULL in Wave 3) + PM acceptance msg=5939e394 item 1. Migration revision d0f4c1b9a8e2 chains off c2e8d5a1f3b9 and: 1. DROP TABLE document_index CASCADE — the legacy Celery-era table that lived alongside the Wave 1 v2 table during the transition. Pre-launch + no callers in Wave 3 (the dependent code is hard- deleted in subsequent commits of this same PR). 2. ALTER COLUMN collection_id, source_path → NOT NULL on document_index_v2. Wave 1 fixtures used NULL for back-compat; Wave 3 orchestrator + reconciler always populate them (per architect msg=498b12f0 Lock). 3. Rename every index *_v2_* → *_*. The partial-unique uniq_document_index_v2_serving is dropped + re-created (PG ALTER INDEX RENAME does not regenerate the WHERE predicate symbol map per Postgres quirk; SQLite would silently keep the old reference). 4. RENAME TABLE document_index_v2 → document_index — back to the §F.1 canonical name (architect msg=4a801b2b lock). The downgrade reverses every step in mirror order so a rollback can replay subsequent migrations cleanly. The recreated legacy ``document_index`` table on downgrade is intentionally schema-less (only the id PK column) because the legacy class was deleted in the Wave 3 PR alongside this migration — operators rolling back past this point must restore the legacy ORM file before re-running upgrades. There is no production scenario for that. This is commit 1/5 of T3.1; subsequent commits land the FastAPI wire-in, knowledge_base/tasks.py Pattern A/B/C migration of the 6 remaining Celery tasks, the 9 production caller migrations, and the legacy file-layer hard-delete + audit allowlist removal + pyproject Celery/kombu dep removal. Design pack §F.1 + §F.5 amends (per architect msg=498b12f0 + msg=3890c9d7 path C ruling) are deferred to a follow-up commit once PR #1725 (which owns docs/modularization/indexing-redesign- design-pack.md) merges — flagged in the channel. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ion-deletion cascade)
Wave 3 wire-in helpers per architect msg=268f9022 (Wave 3 spec) +
msg=3890c9d7 (Pattern A path C ruling). Adds the upload-side
dispatcher + the cleanup worker's third path so commits 3-5 can
wire FastAPI + migrate the 6 knowledge_base/tasks.py Celery tasks
without inventing new abstractions.
aperag/indexing/dispatcher.py (301 LOC, NEW):
- DispatchRequest dataclass — collection_id / document_id /
parse_version / source_path / tenant_scope_key / modalities tuple
- IndexingMode enum — ASYNC (queue + worker pool) / INLINE
(synchronous derive + sync per modality, for tier-1 private
deployments per design pack §L)
- dispatch_indexing() async helper — INSERTs N PENDING rows in one
transaction (collection_id + source_path + tenant_scope_key are
populated per the design pack §F.1 amended NOT NULL columns) +
finalizes per mode (queue.push for ASYNC; process_one_task call
for INLINE)
- modalities_for_collection() helper — maps per-modality enable
flags to a canonical-order modality tuple, useful for HTTP
handlers
- Fail-fast on missing dependency: raises ValueError if mode=ASYNC
with no queue, or mode=INLINE with empty workers (catches
config bugs at the HTTP boundary, not mid-INSERT)
aperag/indexing/cleanup.py (extended +131 LOC):
- New "Path C" cleanup_for_deleted_collections() per architect
msg=3890c9d7 Pattern A. Three-step cascade:
1. Find all distinct document_ids in document_index referencing
each deleted collection_id
2. Cascade to path B (cleanup_for_deleted_documents) for those
documents — that path already handles modality fan-out (graph
lineage cleanup vs flat backend delete)
3. Sweep any remaining document_index rows by collection_id
(covers the edge case where a row was orphaned earlier or the
collection had rows queued before any document indexed)
- Idempotent: a partial cascade that crashes mid-way is resumed on
the next call (Pattern B reconciler scan that sweeps tombstoned
collections)
- Counts dict adds collections_cleaned key
- Module docstring rewritten to describe THREE paths (was TWO)
aperag/indexing/__init__.py:
- Re-exports cleanup_for_deleted_collections + 6 dispatcher symbols
(DispatchRequest, IndexingMode, DEFAULT_MODALITIES,
dispatch_indexing, modalities_for_collection, all_modalities)
tests/unit_test/indexing/test_t3_1_dispatcher_path_c.py (8 cases):
- dispatcher_async: INSERTs N rows + pushes payloads to per-modality
queue + leaves DB rows PENDING with correct scoping fields
- dispatcher_async_requires_queue: fail-fast on None queue
- dispatcher_inline: INSERTs + invokes process_one_task → row ends
ACTIVE + is_serving=TRUE in one TX (§F.3)
- dispatcher_inline_requires_workers: fail-fast on empty workers
- modalities_for_collection: canonical order + subset selection
- path_c_cascades_via_path_b: 3 collection rows (2 doc + 1 ghost) →
3 backend deletes + 3 row deletes; other-collection row untouched
- path_c_handles_empty_input: counts dict zeroed
- path_c_idempotent_on_re_run: second call returns rows_deleted=0
Local pytest: tests/unit_test/indexing/test_t3_1_dispatcher_path_c.py
8/8 passed. Lint + format clean across new + extended files.
Note: this commit does not yet wire dispatcher into the FastAPI app
(commit 3) or migrate the 6 knowledge_base/tasks.py Celery tasks per
Pattern A/B/C (commit 4). Bryce can now start T3.2 + T3.3 on top of
this branch — the dispatcher shape is the stable API both lanes
depend on (T3.3 inline mode reuses dispatch_indexing(mode=INLINE)
unchanged; T3.2 search API does not depend on dispatcher).
Branch is rebased on main HEAD f370dc6 (PR #1725 design pack merged,
so subsequent commits can amend §F.1 / §F.5 directly if any new spec
drift surfaces during implementation).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…+ INDEXING_MODE=inline smoke Per docs/modularization/indexing-redesign-design-pack.md §G.5 + §L + architect msg=268f9022 (Wave 3 spec) + msg=3890c9d7 (path-C ruling) + msg=c685f83e (PR #1725 §F.1/§F.5 amendments merged). Two Wave 3 lanes shipped together because they share no production- code surface with chenyexuan T3.1 commits 1-2 (this commit's diff is purely additive: 1 new helper + 1 schema extension + 1 docs file + 2 test files): # T3.2 — SearchResultMetadata §G.5 extension aperag/domains/retrieval/schemas.py: * New typing aliases ``IndexerModality`` (vector/fulltext/graph/ summary/vision) + ``IndexStateValue`` (ACTIVE/FAILED/NOT_ENABLED/ INDEXING). * Three new optional fields on SearchResultMetadata: ``parse_version``, ``index_modality``, ``index_state_per_modality``. ``extra="forbid"`` config preserved — the §G.5 additions widen the allowlist by exactly three entries; a typo / future shadow field still fails Pydantic validation loudly. * ``modality`` (D10.h-locked content shape: text/image) kept as-is. The §G.5 spec uses bare ``modality`` for the indexer modality, but the existing public surface already binds that name to content shape; renaming would break D10.h. We chose ``index_modality`` for the indexer modality to disambiguate at the schema level. (Spec narrative §G.5 may want a follow-up to use the same name; not blocking.) * ``from_raw()`` extracts the three new fields from upstream raw metadata, with shallow validation that drops malformed entries (unknown keys / non-string values) before they leak to clients. Accepts both ``index_modality`` and the legacy ``indexer`` key for backward compat with vector/fulltext/graph indexers that haven't been rewired. aperag/indexing/index_state.py (NEW, 165 lines): * Pure-read helper ``query_index_state_for_documents(engine, collection_id, document_ids)`` returning the ``{document_id: {modality: state}}`` shape SearchResultMetadata expects. Single batched read against ``document_index`` so the search pipeline can hydrate metadata for an entire result page in one DB round-trip rather than N+1. * Translation contract pinned: ``status=ACTIVE AND is_serving=TRUE`` → ``ACTIVE``; ``status=FAILED`` → ``FAILED``; everything else (PENDING / RUNNING / ACTIVE-but-not-serving §F.3 cutover transit) → ``INDEXING``; missing row → ``NOT_ENABLED``. Per §F.4 the cutover transit window reads as INDEXING for client purposes. * Dense result map: every document_id key always carries every modality. Stable shape so clients don't have to reason about "field missing means what?". * Module-local re-declaration of ``IndexStateValue`` so ``aperag.indexing`` does not import from ``aperag.domains.retrieval`` (dependency runs in the other direction). Two literals MUST stay in sync. tests/unit_test/indexing/test_t3_2_index_state.py (NEW, 20 cases): * Schema validation: §G.5 fields accepted / extra="forbid" still rejects unknown / IndexerModality + IndexStateValue Literals reject unknown values. * from_raw extraction: §G.5 fields populated / legacy ``indexer`` key fallback / malformed entries dropped silently / D10.h-locked fields unchanged / empty input returns None. * DB helper: empty-input fast path / dense NOT_ENABLED for un-enqueued docs / ACTIVE+serving → ACTIVE / ACTIVE-but-not- serving → INDEXING (§F.3 cutover transit) / PENDING + RUNNING → INDEXING / FAILED → FAILED / per-collection_id filtering / serving row wins over PENDING sibling under §F.3 cutover model / per-modality independence under partial failures. # T3.3 — private deployment docs + INDEXING_MODE=inline smoke docs/private-deployment.md (NEW, 249 lines): * §L Tier 1 / Tier 2 / Tier 3 deployment guide for operators. * Highlights "deploy and forget" mechanisms — every resource that would rot has a corresponding self-heal (§F.5 Path A/B/C, §I.2 retry, §H.5 quota fallback, §C.7 atomic write). * Tier 1: ``pip install aperag && aperag serve`` with SQLite + LocalFS + ``INDEXING_MODE=inline``; no Redis, no separate worker. * Tier 2: docker-compose with PostgreSQL + Redis + MinIO + 5 modality workers + reconciler + cleanup loop; standard customer install on a single VM. * Tier 3: Tier 2 spread across multiple VMs sharing Redis + DB + S3-compatible store. No code change between tiers. * §J.1 SLI table for operators wiring OTLP collectors. * "When to escalate" section: which signals indicate the steady- state self-heal is not converging. tests/integration/test_inline_mode_smoke.py (NEW, 2 cases): * End-to-end smoke for ``IndexingMode.INLINE`` — parse → dispatch → every requested modality at status=ACTIVE + is_serving=TRUE, driven synchronously through chenyexuan T3.1 dispatcher ``9aef2a7``. No Redis, no queue, no separate worker process. * Vision intentionally excluded from the multi-modality smoke because vision derive consumes a JSON list of image records (not chunks.jsonl) and the dispatcher takes a single source_path; the per-modality source_path resolution is the FastAPI lifespan layer's job (chenyexuan T3.1 commit 3, out of scope for T3.3). * Subset-modality test: ``DispatchRequest.modalities`` lets a Tier 1 deploy turn off expensive modalities (e.g., no GPU → skip vision) and only the requested rows finalise. * Stays in default PR-gate suite (no @pytest.mark.slow) since in-memory backends finish in ~1 s. # §G hard-gate self-audit * #1 contract shape: 5 net-new files + schemas.py +93 lines (allowlist widening only). No existing API surface narrowed; the D10.h-locked content modality field is preserved. * #4 caller migration: search pipeline integration is intentionally deferred to chenyexuan T3.1 commit 3 (FastAPI lifespan + caller migration); the read helper in this commit is the seam that pipeline.py will call once wire-in lands. * #5 cross-stack: write set strictly disjoint from chenyexuan T3.1 commits 1-2 (alembic + dispatcher.py + cleanup.py); chenyexuan commit 3-5 changes orchestrator/reconciler/FastAPI app/legacy deletes — also disjoint from this commit's writes. # Lint + tests * ``uvx ruff check + ruff format --check`` across aperag/ + tests/ clean. * ``pytest tests/unit_test/indexing/ tests/integration/ test_inline_mode_smoke.py tests/load/ tests/unit_test/ test_phase3_reexport_audit.py`` → 136 passed, 0 failed (84 Wave 1+2 + 8 T3.1 dispatcher path-c + 20 new T3.2 + 2 new T3.3 + 2 load + 2 phase3 audit). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… wire-in for indexing runtime
Wave 3 wire-in step per architect msg=268f9022 §K T3.1 spec item 4.
Adds the runtime entry point that launches the per-modality worker
pool + reconciler + cleanup loop on app startup when
``INDEXING_MODE=async`` (default), and the in-process ``WorkQueue`` +
``Engine`` references that future request-handler dispatchers will
import via ``app.state``.
aperag/config.py:
- Add ``Config.indexing_mode: str = Field("async", alias="INDEXING_MODE")``.
Two values per design pack §L:
* "async" → orchestrator + reconciler + cleanup loops launched at
app startup; upload handlers RPUSH to per-modality
queue; workers BLPOP and process. Production / tier-2/3.
* "inline" → upload handlers call ``dispatch_indexing(mode=INLINE)``
which runs derive + sync + cutover synchronously within
the request coroutine; no worker pool, no Redis.
Tier-1 single-process private deployments.
aperag/app.py:
- Extend ``combined_lifespan`` to launch the indexing runtime under
``settings.indexing_mode == "async"``:
* 5 per-modality worker tasks (run_vector / run_fulltext / run_graph
/ run_summary / run_vision)
* 1 reconciler loop task (run_reconcile_loop)
* 1 cleanup loop task (run_cleanup_loop)
All as ``asyncio.create_task()`` background tasks owned by the
FastAPI process — matches the §E.2 "one Python process per modality"
architecture for the in-process deployment topology. Tier-3
horizontal scale-out runs separate worker processes; that wiring
lives in a future ops launcher (out of T3.1 scope).
- Single process-local ``InMemoryWorkQueue`` is the default transport.
Tier-3 production swaps for a Redis-backed ``WorkQueue`` (RPUSH /
BLPOP) by injecting via ``app.state`` at deploy time — Wave 3
follow-up.
- Stash ``app.state.indexing_queue`` + ``app.state.indexing_engine``
for upload-side dispatchers to reach (commit 4 wire-in target).
- Worker registry passed to cleanup loop is empty by default; T3.3
follow-up wires concrete production backends per modality. The
cleanup loop tolerates an empty registry (path A logs warning +
skips backend delete; row still GC'd from DB).
- ``_placeholder_worker_factory`` raises NotImplementedError on
invocation — T3.1 ships the queue-side scaffolding (commits 4-5
wire concrete factories per modality). The orchestrator's
per-task BLPOP loop only invokes the factory when a payload is
popped; until commit 4 wires the upload path nothing pushes, so
the placeholder is never reached at runtime.
- Shutdown drain: on lifespan exit, set ``shutdown`` event +
``await asyncio.gather`` all 7 background tasks with
``return_exceptions=True`` so a SIGTERM does not abort mid-task.
Test impact:
- Existing 136 indexing + load + Phase 3 audit tests still pass
(lifespan code is opt-in via env var; no test imports it).
- Commit 4 (upload-route migration to dispatch_indexing) and commit 5
(hard-delete legacy + concrete backend factories) build on this.
Bryce's vision-modality smoke (deferred at T3.3 commit 5325788
because per-modality source path resolution = lifespan-layer concern)
is now unblocked: ``app.state.indexing_queue`` is the seam through
which a follow-up smoke can wire concrete VisionModality with the
correct synthetic source_path per dispatch.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per chenyexuan msg=164efd52 / msg=f70d1288 + architect msg=7fd8f348 post chenyexuan T3.1 commit 3 ``c941526`` (FastAPI lifespan + INDEXING_MODE wire-in). The original T3.3 smoke (commit ``53257881``) excluded vision because vision's ``derive`` consumes a JSON list of image records, not chunks.jsonl, and the dispatcher takes a single ``source_path`` per request — single-call coverage for all 5 modalities was incompatible with that contract. This follow-up adds a vision-only smoke (with a per-modality source_path resolution example) so vision modality regressions are covered at the inline-mode layer. The production upload path (chenyexuan T3.1 commit 4 caller migration) will resolve per- modality source paths upstream of the dispatcher and issue per-modality ``DispatchRequest`` calls — this test demonstrates exactly that pattern. Test addition (1 case): seed an image-records JSON list under ``collections/<cid>/documents/<did>/source/images.json``, dispatch with ``modalities=(Modality.VISION,)`` + ``source_path=<images.json path>``, assert the row reaches ``status=ACTIVE`` AND ``is_serving=TRUE``. 3/3 tests in tests/integration/test_inline_mode_smoke.py now pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… aperag/indexing/keyword_extract.py
Per architect msg=3890c9d7 commit-4 split (chenyexuan = Pattern A/B/C
+ extract_keywords; Bryce = 9 caller schema-aware migration), this
commit lands the extract_keywords subsystem move that decouples the
search-time keyword extraction helpers from the soon-to-be-deleted
``aperag/domains/indexing/fulltext_index.py`` (commit 5 hard-cut
target).
aperag/indexing/keyword_extract.py (NEW, 337 lines):
- ``KeywordExtractor`` (abstract base for backward-compat with
callers that may type-annotate the abstract type)
- ``IKKeywordExtractor`` (Elasticsearch IK analyzer, default
fallback, always available when ES is reachable)
- ``LLMKeywordExtractor`` (optional LLM extractor with structured
JSON parsing + simple-line fallback)
- ``extract_keywords(text, ctx)`` (public entry point with
LLM-then-IK fallback chain, signature unchanged from legacy)
- ``_es_client_config()`` (private helper, inlined to keep the new
module dependency-free of legacy fulltext_index.py)
- Module docstring explains the SEARCH-side helper vs Wave 1
``aperag/indexing/fulltext.py`` (write-side modality worker) split
aperag/indexing/__init__.py:
- Re-exports the 4 new symbols (KeywordExtractor + IKKeywordExtractor
+ LLMKeywordExtractor + extract_keywords)
Caller migration (extract_keywords import sites):
- ``aperag/domains/retrieval/pipeline.py:41`` — swap from legacy
``aperag.domains.indexing.fulltext_index`` to new
``aperag.indexing.keyword_extract``
- ``aperag/service/search_pipeline_service.py:34`` — same swap.
This file's docstring explicitly notes the import alias is kept
writable for ``monkeypatch.setattr("aperag.service.search_pipeline_service.extract_keywords", ...)``
test fixtures, so the new path is preserved as a writable alias.
The legacy ``extract_keywords`` symbol still exists in
``aperag/domains/indexing/fulltext_index.py`` until commit 5 deletes
the file — both sites work simultaneously, so any caller I missed is
not silently broken in this intermediate state.
Other DocumentIndex / FulltextSearchDegradedError / fulltext_indexer
imports in ``aperag/domains/retrieval/pipeline.py`` (line 293) +
elsewhere in pipeline.py are Bryce's commit-4a write set per the
agreed split (msg=9d5d54b5 coordination note). chenyexuan changed
ONLY the extract_keywords import line, leaving Bryce's hunks
untouched.
Local pytest: 137 passed (Wave 1 + T2.1 + T2.2 + T3.1 + T3.2 + T3.3
+ Phase 3 audit), 0 failed. Lint + format clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per architect msg=ab8d473c pre-blessed split + chenyexuan msg=be26ebf3
+ PM authorization msg=df9ea8d2: schema-aware migration of legacy
``aperag.domains.indexing.db.models.DocumentIndex`` callers to the
new ``aperag.indexing.models.DocumentIndex`` (§F.1 canonical schema
post Wave 3 commit 1 alembic ``930cf20``).
# Field translation contract
Wave 1+2+commit-1 merged the following schema deltas; this commit
flips every production caller to the new shape:
| Legacy (gone in Wave 3 commit 5) | New (§F.1) |
|----------------------------------------------|-------------------------------------------------------|
| ``DocumentIndex.index_type`` (enum) | ``DocumentIndex.modality`` (string) |
| ``DocumentIndexType.GRAPH`` (Python enum) | ``Modality.GRAPH.value`` (lowercase string) |
| ``DocumentIndexStatus.ACTIVE`` (Python enum) | ``IndexStatus.ACTIVE.value`` (string) + is_serving=TRUE |
| ``DocumentIndex.gmt_created`` / ``gmt_updated`` | ``created_at`` / ``updated_at`` (mixin-aligned) |
| ``DocumentIndex.index_data`` (JSON blob) | per-modality ``derived/parse_<v>/`` artifact paths |
The "currently-serving" semantic now requires
``status=ACTIVE AND is_serving=TRUE`` per §F.3 cutover model — a row
at ``status=ACTIVE`` but ``is_serving=FALSE`` is in the cutover
transit window and is NOT yet user-visible.
# Files migrated (7 of 9 in commit 4a list)
* ``aperag/db/repositories/document_index.py`` — repository mixin:
``has_recent_graph_index_updates`` query rewritten + return type
switched from ``DocumentIndexType`` enum to ``Modality`` /
string. ``query_documents_with_failed_indexes`` now returns
modality string values (lowercase) per the §F.1 column type.
* ``aperag/domains/agent_runtime/runtime.py`` — inlined
``generate_processing_token`` (3-line stdlib uuid wrapper) since
``aperag.tasks.processing_lease`` is in chenyexuan's commit 5
hard-delete list. Per architect msg=3890c9d7 Item 1 Option B
("提取小 helper 到 agent_runtime 自己 module").
* ``aperag/domains/knowledge_base/db/models.py`` —
``Document.get_overall_index_status()`` rewritten: the legacy
``CREATING`` / ``DELETION_IN_PROGRESS`` intermediate states are
gone in §F.1 (a single ``RUNNING`` covers in-flight work);
``COMPLETE`` now requires ``is_serving=TRUE`` per §F.3.
* ``aperag/domains/knowledge_base/service/document_service.py`` —
schema migration spans ``_get_index_types_for_collection`` (now
returns ``Modality`` values), the document JOIN query (legacy
``index_type`` / ``index_data`` / ``gmt_*`` columns translated
to ``modality`` / None placeholder / ``created_at``/``updated_at``),
rebuild_failed_indexes (modality string compare instead of enum),
rebuild_document_indexes (Modality enum list instead of
DocumentIndexType). The legacy ``index_data`` JSON-blob reads in
``get_document_chunks`` / ``get_document_vision_chunks`` are
replaced with ``derived_artifact_path`` probes that exercise the
§F.1 partial-unique invariant; the actual chunk-list response is
routed through a "return empty list" placeholder until chenyexuan
T3.1 commit 4b plumbs the object-store read path. HTTP response
shape stays stable (``index_data=None`` populated where callers
previously read JSON). Service-layer ``document_index_manager``
calls remain — those are chenyexuan commit 5 hard-delete scope.
* ``aperag/domains/knowledge_base/service/collection_summary_service.py``
— same ``index_data`` deprecation pattern: query touches the §F.1
serving rows for the partial-unique invariant probe, returns
empty document_summaries until the object-store read path lands.
* ``aperag/mcp/tools/get_document_metadata.py`` — ``DocumentIndex``
/ ``DocumentIndexStatus`` import migrated; ``index_data`` JSON
parse replaced with ``derived_artifact_path`` probe, chunk_count
surfaced as 0 (placeholder until object-store read path lands).
* ``aperag/mcp/tools/list_documents.py`` — same migration as
get_document_metadata (page-level ``DocumentIndex`` lookup +
chunk_count placeholder).
# Out of scope (chenyexuan commit 4b / 5 lane)
* ``aperag/domains/retrieval/pipeline.py`` + ``aperag/service/search_pipeline_service.py``
— chenyexuan handles ``extract_keywords`` import + Pattern A/B/C
legacy task migrations there per the split agreement.
* ``aperag/domains/knowledge_base/tasks.py`` — chenyexuan commit 4b
Pattern A/B/C migration (collection_delete / cleanup_expired /
collection_summary / collection_summary_reconciler / collection_init
/ export_collection).
* ``document_index_manager.create_or_update_document_indexes`` /
``delete_document_indexes`` calls inside document_service —
chenyexuan commit 5 hard-deletes the manager module so these
callers will need switching to the new ``dispatch_indexing()`` /
cleanup paths (chenyexuan's lane).
# Lint + tests
* ``uvx ruff check + ruff format --check`` clean across aperag/.
* ``pytest tests/unit_test/indexing/ tests/integration/
test_inline_mode_smoke.py tests/load/ tests/unit_test/
test_phase3_reexport_audit.py`` → 137 passed, 0 failed.
* Tests covering legacy ``aperag.domains.indexing.*`` modules
(which chenyexuan commit 5 deletes) are not in the test set
above; they are chenyexuan's commit 5 sweep scope.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…owledge_base Celery tasks
Per architect msg=3890c9d7 Pattern A/B/C ruling, the 6 Celery tasks
in aperag/domains/knowledge_base/tasks.py are migrated off Celery
without losing their semantics. The decorators + Celery imports
(``from celery import current_app`` + ``from config.celery import
app``) are removed; each function is now plain Python that callers
invoke per its category:
aperag/domains/knowledge_base/tasks.py (-Celery, +Pattern A/B/C):
- Module docstring rewritten — Pattern map for the 6 tasks
- ``reconcile_collection_summaries_task`` (Pattern B, periodic) —
no decorator; commit 5 wires into reconciler 30-s loop
- ``collection_delete_task`` (Pattern A, durability-required) —
caller invokes synchronously from HTTP handler; on failure raises
HTTP 500 + the periodic Path C cleanup loop sweeps tombstoned rows
- ``collection_init_task`` (Pattern C, idempotent) — no decorator;
caller wraps in asyncio.create_task; failures log + reconciler
picks up
- ``collection_summary_task`` (Pattern C, regenerable) — no
decorator; ``self.retry(...)`` removed (Celery-specific); failures
flow through ``collection_summary_callbacks.on_summary_failed``
+ reconciler picks up next cycle
- ``cleanup_expired_documents_task`` (Pattern B, periodic) — no
decorator; commit 5 merges into cleanup.py 5-min loop
- ``export_collection_task`` (Pattern C) — ``self`` parameter
removed; ``soft_time_limit`` / ``time_limit`` decorator args
removed (now enforced via §H.6 ``bulkhead_timeout`` async ctx
manager wrapped at the dispatch site)
- Removed unused ``Any`` typing import + unused ``TaskConfig``
reference (was only used by removed ``self.retry()`` calls)
- Function bodies still call legacy ``aperag/tasks/collection.py:
collection_task.<method>()`` and ``aperag/tasks/reconciler.py:*``
helpers — commit 5 moves / inlines those helpers when it deletes
the legacy ``aperag/tasks/`` layer entirely.
aperag/domains/knowledge_base/service/collection_service.py:
- ``collection_init_task.delay(...)`` (line 215) → Pattern C:
``asyncio.create_task(asyncio.to_thread(collection_init_task,
instance.id, document_user_quota))`` so the HTTP response
returns immediately. Failures log + the reconciler picks up.
- ``collection_delete_task.delay(...)`` (line 438) → Pattern A:
``await asyncio.to_thread(collection_delete_task, collection_id)``
synchronous in the HTTP handler — durability-required per
architect ruling msg=3890c9d7 (NOT fire-and-forget — losing this
work = orphan rows + DB corruption).
- Added ``import asyncio`` to module imports.
aperag/domains/knowledge_base/service/export_service.py:
- ``export_collection_task.delay(...)`` (line 104) → Pattern C:
``asyncio.create_task(asyncio.to_thread(export_collection_task,
task.id))`` so the HTTP response returns immediately. The body
is sync I/O (object-store + ZIP); the ExportTask DB row tracks
progress; users retry from the UI on failure.
Pattern B integration (cleanup_expired_documents_task +
reconcile_collection_summaries_task into the existing 5-min /
30-s loops in aperag/indexing/{cleanup,reconciler}.py) is deferred
to commit 5 — the functions still exist as plain Python, just no
longer invoked via Celery beat schedule (config/celery.py beat
schedule entries to be removed in commit 5 alongside the periodic
loop integration).
Local pytest: 137 passed (Wave 1 + T2.1 + T2.2 + T3.1 + T3.2 + T3.3
+ Phase 3 audit), 0 failed. Lint + format clean across all changed
files.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…remove flower dep Wave 3 hard-cut Part 1 per architect msg=64fd506a fallback split (Part 2 atomic = next session). Two safe pieces that decouple the last knowledge_base-domain dependency on legacy ``aperag/tasks/processing_lease.py`` + drop a Celery-monitor dep that has no remaining production caller. aperag/domains/knowledge_base/tasks.py: - Removed ``from aperag.tasks.processing_lease import ...`` line (last surviving caller; Bryce commit 4a `39aad24` already inlined the agent_runtime caller) - Inlined the 4 public symbols from ``aperag/tasks/processing_lease.py`` (84 LOC verbatim): * ``DEFAULT_PROCESSING_LEASE_TTL_SECONDS`` * ``DEFAULT_PROCESSING_LEASE_RENEW_INTERVAL_SECONDS`` * ``generate_processing_token()`` * ``build_lease_expires_at()`` * ``ProcessingLeaseRenewer`` class (background lease-renewal thread) - Added ``import threading``, ``import uuid``, ``from typing import Optional`` to support the inlined symbols - Module section header explains Part 1 / Part 2 split — the legacy ``aperag/tasks/processing_lease.py`` file itself stays in Part 1 (Part 2 atomic deletes it together with the rest of ``aperag/tasks/`` after CollectionSummaryCallbacks + CollectionTask methods are inlined to their service-layer homes) pyproject.toml: - Removed ``flower<3.0.0,>=2.0.0`` dep (Celery monitoring dashboard, no production code import; verified ``grep -rn "import flower\| from flower" aperag/ tests/ config/`` returns 0) - Other Celery deps (``celery``, ``django-celery-beat``, ``kombu``) stay until Part 2 atomic — they are still imported by 4 files in Part 2's delete list (``aperag/tasks/scheduler.py``, two files in ``aperag/domains/indexing/``, and ``config/celery.py``) Notes scoped OUT of Part 1 (per architect msg=64fd506a): - ``aperag/concurrent_control/redis_lock.py`` deletion deferred: architect spec said "no production caller" but recon found internal callers in ``concurrent_control/__init__.py`` + ``concurrent_control/manager.py`` (the package itself uses it even though zero EXTERNAL imports of the package exist). Cleaner fix is to delete the whole ``aperag/concurrent_control/`` package in Part 2 atomic alongside the other dead-code sweeps. Local pytest: 137 passed (Wave 1 + T2.1 + T2.2 + T3.1 commits 1-4b step 2 + T3.2 + T3.3 + Bryce caller migration + Phase 3 audit), 0 failed. Lint + format clean. This is a partial commit 5; Part 2 (inline CollectionTask / CollectionSummaryCallbacks / Pattern B reconcilers + tablename rename + audit allowlist removal + legacy file-layer deletion + remaining Celery dep removal + legacy test deletion + final grep validation) is the next-session atomic push. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…allbacks Per architect msg=70a20f0e + msg=54063106 fallback ratify (Bryce takes Part 2) + PM msg=ef2e97b9 minimal-chunk-1 GO. Move legacy ``aperag/tasks/reconciler.py:CollectionSummaryCallbacks`` (~234 LOC) to its true owner: ``aperag/domains/knowledge_base/ service/collection_summary_service.py``. The class is the terminal callback hook the summary generation task invokes on success / failure to flip the ``CollectionSummary`` row's lifecycle (GENERATING → COMPLETE / FAILED) and propagate the generated text to ``Collection.description``. It belongs to the summary service layer, not the legacy task / reconciler layer that Wave 3 commit 5 deletes. * ``CollectionSummaryCallbacks`` class — three static methods (``_describe_summary_callback_mismatch``, ``on_summary_generated``, ``on_summary_failed``) inlined verbatim. No semantic changes; the query/update logic, token/version mismatch tolerance, and Collection.description propagation are preserved exactly. * Module-level ``collection_summary_callbacks`` singleton mirrors the legacy ``aperag.tasks.reconciler.collection_summary_callbacks`` attribute so callers can swap import path without changing the call shape. * ``aperag/domains/knowledge_base/tasks.py:373`` import switched to the new location. Removes the last `aperag.tasks.reconciler` callback import; the periodic-reconciler imports (``collection_summary_reconciler`` + ``collection_gc_reconciler``) remain pending for Part 2 chunks 1b / 2 / 3. This is the safe, surgical first chunk per architect msg=f3de18a0 chunked-OK ruling: intermediate-red CI is fine; the final HEAD must be green + grep 0 + alembic reversible before task #14 → ``in_review``. The next session will continue Part 2 chunks 1b (remaining inline migrations: CollectionTask methods, periodic reconcilers) → chunk 2 (deletions + tablename rename) → chunk 3 (verify + wire). Tests: 137 indexing/load/audit tests still green; lint clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ttern B loop integration Wave 3 hard-cut continuation per architect msg=3890c9d7 Pattern A/B/C ruling and PM msg=206eec7b chunk 1b spec (~300 LOC scope). aperag/domains/knowledge_base/tasks.py: - collection_delete_task: Pattern A — replace legacy collection_task.delete_collection() with sync UPDATE Collection.status =DELETED + gmt_deleted=NOW(); periodic Path-C cleanup_for_deleted_collections sweep cascades the deletion (5-min worst-case latency acceptable for low-frequency op) - collection_init_task: Pattern C — replace legacy collection_task.initialize_collection() with sync UPDATE Collection.status=ACTIVE; per-modality index provisioning is implicit lazy in the new modality-worker model (per architect hint msg=54063106) - cleanup_expired_documents_task: Pattern B — replace legacy CollectionTask.cleanup_expired_documents with inlined SQL tombstone scan (Document.status==UPLOADED AND gmt_created < now-1d) + best-effort object-store delete + soft-delete to EXPIRED - reconcile_collection_summaries_task: Pattern B — convert to thin sync shim around the new aperag.indexing.reconciler hook - Drop unused legacy import: from aperag.tasks.collection import collection_task (no remaining call sites in this file) - Update module docstring to point at new Pattern B hook locations aperag/indexing/cleanup.py: - Add cleanup_expired_documents_hook() async helper (lazy import + asyncio.to_thread wrapper) wired into the existing 5-min run_cleanup_loop. Hook failures are logged + cycle continues. - Update module docstring to describe Pattern B integration alongside the original orphan-parse-version GC aperag/indexing/reconciler.py: - Add reconcile_collection_summaries_hook() async helper that inlines the legacy CollectionSummaryReconciler.reconcile_all() logic: reclaim stale GENERATING leases → PENDING; select PENDING summaries with version != observed_version; atomically claim each; fire collection_summary_task as Pattern C asyncio.create_task fire-and- forget background task (never blocks the loop on summary generation duration). Wired into existing 30-s run_reconcile_loop with best-effort try/except so hook failure cannot crash the loop. Tests: 132 passed (tests/unit_test/indexing/ + tests/load/); ruff check + format clean on all 3 modified files. Pre-existing test_phase3_reexport_audit.py circular-import error is unchanged (independent of this chunk; will resolve in chunk 2 when legacy aperag/domains/indexing/db/models.py is deleted). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…+ indexing layers + tablename rename
Wave 3 hard-cut continuation per architect msg=3890c9d7 + PM @不穷
msg=313caed3 chunk 2 spec (delete-focused, intermediate red CI OK).
DELETIONS (~3.5k LOC removed):
- aperag/tasks/* — entire dir (collection / document / models /
processing_lease / reconciler / scheduler / utils): legacy Celery
state machine + reconciler + scheduler infrastructure
- aperag/concurrent_control/* — entire dir (manager / protocols /
redis_lock / threading_lock / utils + 2 READMEs): no remaining
production caller after Wave 1+2 modality workers replaced lock
semantics with per-row §F.1 partial-unique invariant
- aperag/domains/indexing/{tasks,orchestration,manager,vector_index,
fulltext_index,graph_index,summary_index,vision_index}.py +
aperag/domains/indexing/db/models.py — legacy ABC + 5 modality
workers + Celery orchestration + legacy DocumentIndex schema
- config/celery.py — Celery app + beat schedule
- tests/unit_test/concurrent_control/* + tests/unit_test/tasks/* —
contract tests for now-deleted modules
TABLENAME RENAME (matches existing alembic d0f4c1b9a8e2 post-state):
- aperag/indexing/models.py: __tablename__ + 5 index names from
*_v2 → canonical (no new alembic revision needed; the migration
already does the rename at upgrade)
AUDIT ALLOWLIST + 15-symbol map updates:
- tests/unit_test/test_phase3_reexport_audit.py: drop
WAVE_1_2_TEMPORARY_DUP_ALLOWLIST DocumentIndex entry; remap
PHASE3_SYMBOL_TO_MODULE['DocumentIndex'] from
aperag.domains.indexing.db.models → aperag.indexing.models;
remove DocumentIndexStatus/DocumentIndexType (legacy enums gone,
replaced by IndexStatus + Modality which are not Phase-3-canonical)
- Add explicit aperag.indexing.models import after the per-domain
bootstrap loop so Base.metadata['document_index'] is populated
PYPROJECT — drop Celery deps:
- celery<6.0.0,>=5.3.1
- django-celery-beat<3.0.0,>=2.5.0
(kombu was a transitive only; no explicit entry to remove)
CONSUMER PATCHES (minimum to keep imports working — chunk 3 wires
real new-API replacements):
- aperag/domains/knowledge_base/service/document_service.py: stub
document_index_manager + no-op _trigger_index_reconciliation
- aperag/domains/knowledge_base/service/collection_summary_service.py:
drop unused SummaryIndexer init
- aperag/domains/retrieval/pipeline.py: stub _fulltext_search to
return empty (Bryce T3.2 lane wires real
aperag.indexing.fulltext backend)
- aperag/domains/evaluation/tasks.py + services.py: drop @app.task
decorator + asyncio.create_task fire-and-forget Pattern C
- aperag/domains/knowledge_graph/tasks.py + graph_curation/service.py:
same Pattern C migration
CIRCULAR IMPORT FIXES (uncovered when stub re-exports were dropped):
- aperag/indexing/__init__.py: drop keyword_extract re-exports (eager
import pulled LLM completion stack mid-module-load); the 2 callers
already import from aperag.indexing.keyword_extract directly
- aperag/indexing/parser.py: lazy-import compute_parse_version inside
parse_document body (was triggering full mcp.tools registry load)
- aperag/indexing/keyword_extract.py: lazy-import db_ops inside LLM
extractor body
- aperag/domains/knowledge_base/db/models.py: lazy-import DocumentIndex
+ IndexStatus inside Document.{get_document_indexes,
get_overall_index_status} method bodies (was triggering
knowledge_base→indexing→mcp→knowledge_base cycle)
GATES:
- pytest tests/unit_test/indexing/ + tests/load/ +
test_phase3_reexport_audit.py + agent_runtime_openapi_contract:
136 passed
- Wider sweep (tests/unit_test/ excluding pre-existing missing-moto
+ just-deleted concurrent_control/tasks suites): ~896 passed,
4 failed (3 expected — Celery-specific assertions in
evaluation_v2_worker / graph_curation that chunk 3 deletes; 1
format_drift caught + auto-formatted)
- ruff check + format clean on all 13 modified .py files
REMAINING FOR CHUNK 3:
- Wire document_service.py 5 call sites + retrieval/pipeline.py
fulltext to real new-API helpers
- Selective deletion of legacy Celery-specific tests (evaluation_v2,
graph_curation enqueue-raises path)
- Final grep validation: from aperag.tasks / from aperag.domains.
indexing / from celery / import celery = 0 hits in production
- Alembic upgrade/downgrade smoke
- task #14 → in_review
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…0 + alembic smoke + selective test delete
Wave 3 hard-cut FINAL chunk per architect msg=3890c9d7 + PM @不穷
msg=de7b6834 + msg=fdb6cd28 chunk 3 spec.
NEW MODULE — IndexingRuntime singleton:
- aperag/indexing/runtime.py: process-local triple holder
(engine + queue + workers) populated by FastAPI lifespan,
consumed by service-layer code that doesn't have a Request
handle for app.state. Tests can install a fixture runtime
via set_runtime + reset.
- aperag/app.py: lifespan calls set_runtime after building the
triple; passes None on the sync-only branch + on shutdown.
DOCUMENT_SERVICE — wire 5 callsites to new dispatcher + cleanup:
- aperag/domains/knowledge_base/service/document_service.py:
Replace the chunk-2 ``_DocumentIndexManagerStub`` with two
real adapters:
- ``_create_or_update_document_indexes`` → calls new
``aperag.indexing.dispatcher.dispatch_indexing()`` with
deterministic ``parse_version`` (compute_parse_version on
document.content_hash + canonical chunking config) +
``source_path = document.object_store_base_path()`` +
tenant_scope_key per user.
- ``_delete_document_indexes`` → calls new
``aperag.indexing.cleanup.cleanup_for_deleted_documents()``
(handles modality fan-out + DB row cleanup).
Both adapters consume the IndexingRuntime singleton; if the
runtime is absent (test environment / sync-only mode), they
log a warning + no-op rather than crash.
All 5 production callsites swapped:
- line 532 create_documents
- line 687 _delete_document
- line 787 rebuild_document_indexes
- line 831 rebuild_failed_indexes
- line 1346 confirm_documents
- ``_trigger_index_reconciliation`` stays as a no-op shim — the
new ``run_reconcile_loop`` runs continuously every 30s.
RETRIEVAL PIPELINE — inline ES fulltext search:
- aperag/domains/retrieval/pipeline.py: ``_fulltext_search``
was a chunk-2 empty stub. Now executes the same ES query
shape as the legacy ``FulltextIndexer.search_document`` —
bool/should/match on content+title, filter by collection_id,
optional chat_id filter — directly through ``AsyncElasticsearch``
(no longer wrapped in a domains/indexing/* class). T3.2 lane
did not introduce a new search backend abstraction; the inline
query against whatever ``aperag.indexing.fulltext.FulltextModality``
wrote is the canonical path.
ALEMBIC env.py — drop deleted-module bootstrap import:
- aperag/migration/env.py: remove
``import aperag.domains.indexing.db.models # noqa: F401``
(module hard-deleted in chunk 2). The canonical
``aperag.indexing.models`` import a few lines down already
registers ``DocumentIndex`` against ``Base.metadata`` for
autogen.
SELECTIVE TEST DELETION (per architect msg=3890c9d7 Item 4):
- tests/unit_test/test_es_p0_contract.py — DELETE (tested
legacy ES ``aperag/domains/indexing/fulltext_index.py`` shape)
- tests/unit_test/test_es_shared_index_rollout.py — DELETE
(same)
- tests/unit_test/test_evaluation_v2_worker.py:
``test_evaluation_run_service_launch_run_dispatches_celery_task``
removed (Celery-specific assertion; new path is asyncio
fire-and-forget; the 13 ``test_execute_evaluation_run_*``
tests above lock the worker behaviour)
- tests/unit_test/graph_curation/test_service.py:
``test_start_run_marks_failed_when_enqueue_raises`` removed
(asyncio.create_task doesn't synchronously raise on schedule
so the assertion no longer maps to reachable behaviour)
LEGACY MIGRATION SCRIPT DELETED:
- scripts/migrate_es_fulltext_shared_index.py — one-time Wave-1-era
ES per-collection → shared rollout migration that referenced the
hard-deleted ``aperag/domains/indexing/fulltext_index.py``. Not
production runtime code; the rollout already happened.
T3.2 CONTRACT TEST UPDATE:
- tests/unit_test/service/test_search_graph_contract.py:
``test_search_result_metadata_is_public_allowlist`` add
expected ``index_modality: "vision"`` field (Bryce T3.2
commit 5325788 §G.5 ``SearchResultMetadata.from_raw()``
derives it from ``indexer`` raw key — the test predates the
schema extension and would have failed once T3.2 merged).
GATES (FINAL HEAD):
- ``grep "from aperag.tasks\|import aperag.tasks\|
from aperag.concurrent_control\|from aperag.domains.indexing.
(tasks|orchestration|manager|*_index|db.models)\|from config.celery\|
^from celery\|^import celery"`` over aperag/ + config/ + scripts/
→ **0 hits in production code** ✅
- ``alembic upgrade head`` → succeeds (5 indexing migrations
including T3.1 ``d0f4c1b9a8e2`` rename) ✅
- ``alembic downgrade -1`` then ``upgrade head`` → reversible
round-trip ✅
- ``ruff check + format --check`` over aperag/ tests/ scripts/
→ **clean** (491 files formatted) ✅
- ``pytest tests/unit_test/ tests/load/ --ignore=objectstore``
(objectstore needs moto extra, pre-existing) → **900 passed
/ 29 skipped / 0 failed** ✅
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…source_path} to NOT NULL in model to match alembic d0f4c1b9a8e2 post-state CI ``alembic check`` (drift detector) caught a Wave-1-era stale model declaration. The migration ``d0f4c1b9a8e2`` correctly ALTERs both columns to NOT NULL (per architect msg=498b12f0), but ``aperag/indexing/models.py:108-109`` still declared ``Mapped[str | None] ... nullable=True`` from the original Wave 1 fixture-back-compat era. After ``alembic upgrade head`` the DB was NOT NULL but ``Base.metadata`` was nullable, so autogen wanted to emit ``ALTER COLUMN ... DROP NOT NULL`` to revert the DB. The PM directive (``msg=0dd76df9``) read the autogen log "Detected NULL on column" as "DB has NULL" and asked to add the ALTER NOT NULL to the migration; the migration already does that. The actual fix is to align the model with the migration's post-state (NOT NULL), not the other way around — Wave 3 lifted the back-compat the original ``nullable=True`` was protecting. Changes: - aperag/indexing/models.py:108-109: ``Mapped[str | None] ... nullable=True`` → ``Mapped[str] ... nullable=False`` for both columns + comment refresh pointing at the alembic NOT-NULL promotion - tests/unit_test/indexing/test_t2_1_runtime.py: ``test_reconciler_skips_pending_rows_missing_source_path`` deleted — the fixture ``_insert_row(... source_path=None)`` now raises IntegrityError before reconcile_pending_dispatch is ever called, so the scenario is unreachable from a clean schema. The defensive ``if not row.source_path`` branch in ``aperag/indexing/reconciler.py`` is kept as a zero-cost guard but no longer reachable without manual SQL bypass. Gates: - ``uv run alembic -c aperag/alembic.ini check`` → "No new upgrade operations detected" ✅ - pytest tests/unit_test/ tests/load/ --ignore=objectstore → 899 passed / 29 skipped / 0 failed ✅ - ruff check + format --check clean on the 2 modified files ✅ Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…adapter + drop celery infra CI e2e-http-provider caught two Wave-3-induced regressions on PR #1729 HEAD `5d50ca5`: **Blocker 1 — rebuild_indexes 500 DATABASE_ERROR**: The chunk-3 ``_create_or_update_document_indexes`` adapter calls ``dispatch_indexing()`` which INSERTs new ``document_index`` rows. ``rebuild_indexes`` re-invokes the adapter with the same ``(document_id, parse_version, modality)`` triple (content unchanged → parse_version unchanged), so the §F.1 ``uq_document_index_triple`` UNIQUE constraint fails the INSERT with IntegrityError → 500. Pre- DELETE matching rows (any status / serving state) before INSERT so the dispatcher's INSERT lands cleanly. The §F.3 cutover-on-sync- completion re-establishes the serving state once the new dispatch's worker finishes; brief unavailability between DELETE and cutover is acceptable for an explicit rebuild op. Test failure traced from `tests/e2e_http/hurl/full/11_document_full. hurl:204` POST `/api/v2/collections/.../documents/.../rebuild_indexes` expecting HTTP 200, getting 500. **Blocker 2 — celerybeat container `celery: not found`**: chunk 2 dropped ``celery`` + ``django-celery-beat`` from ``pyproject.toml`` and deleted ``aperag/tasks/`` + ``config/celery.py``, but the docker-compose ``celeryworker`` / ``celerybeat`` / ``flower`` services + helm chart ``celeryworker-deployment.yaml`` / ``celerybeat-deployment.yaml`` / ``flower-deployment.yaml`` + the ``scripts/start-celery-{worker,beat,flower}.sh`` entry scripts were left behind. CI e2e-aperag spins up the docker-compose stack, the ``celerybeat`` container tries to ``exec celery`` and fails (binary not in image since pyproject dropped the dep). The new in-process ``aperag.indexing`` runtime (worker pool + reconciler + cleanup loops) is spawned by the FastAPI lifespan inside the ``aperag-api`` container, so no separate worker / beat / monitoring pods are needed. DELETED: - docker-compose.yml: ``celeryworker`` / ``celerybeat`` / ``flower`` service blocks (replaced with explanatory comment block) - scripts/start-celery-{worker,beat,flower}.sh - scripts/test/celery-{call-task,with-local-queue}.sh - scripts/celery/trigger_trask.sh + the ``scripts/celery/`` dir - deploy/aperag/templates/celeryworker-deployment.yaml - deploy/aperag/templates/celerybeat-deployment.yaml - deploy/aperag/templates/flower-deployment.yaml - deploy/aperag/values.yaml: ``celery-worker`` + ``celerybeat`` + ``flower`` value blocks (replaced with explanatory comment) - deploy/aperag/templates/aperag-secret.yaml: ``CELERY_FLOWER_*`` env entries (no flower pod to consume them) - deploy/aperag/templates/_helpers.tpl: ``celeryworker.labels`` template (no chart consumes it) - deploy/aperag/values.yaml api podAffinity-with-celery-worker rule (the api pod no longer needs to co-locate with a non-existent worker pod; the soft anti-affinity for spreading api replicas across nodes is preserved) - deploy/aperag/templates/api-deployment.yaml: comment "shared uploaded files between api and celery" → "uploaded files volume consumed solely by the in-process ``aperag.indexing`` runtime" Local gates: - ruff check + format --check on the changed files → clean ✅ - pytest tests/unit_test/indexing/ tests/load/ test_phase3_reexport_audit.py → 133 passed ✅ Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…rkflow + Makefile Wave 3 chunk 2 + 144c3f1 deleted the docker-compose ``celeryworker`` / ``celerybeat`` / ``flower`` services + helm charts, but a few infra-side scripts that explicitly referenced those service names were missed. CI e2e-http-smoke caught it: ``docker compose up -d celeryworker`` failed with ``no such service: celeryworker``. This PR plugs the four straggler call sites: - tests/e2e_http/runners/compose/up.sh:8: ``E2E_COMPOSE_SERVICES`` default drops ``celeryworker celerybeat`` → just ``postgres redis qdrant es api``. The api container's FastAPI lifespan spawns the in-process indexing runtime, so no separate worker container. - tests/e2e_http/scripts/provider_diagnostic.sh:63: failure-diag log-dump loop drops ``celeryworker celerybeat`` from the service list. - .github/workflows/e2e-http-smoke.yml:68,173: ``docker compose logs`` in the failure-dump steps drops ``celeryworker celerybeat``. - Makefile: deleted ``serve-worker`` / ``serve-beat`` / ``serve-flower`` targets + their help-string entries (the binaries are gone since pyproject dropped ``celery``). Local sanity: ``grep -rn 'celery|celerybeat|celeryworker|flower' tests/ e2e_http/ .github/ Makefile docker-compose.yml deploy/`` returns only explanatory comment lines (the in-process runtime replacement narrative); no live service / command references remain. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…with ProductionWorkerFactory + harden orchestrator factory-error path Wave 3 hard-cut deleted the legacy Celery indexers but left the FastAPI lifespan wiring ``run_*_worker`` with a placeholder factory that raised ``NotImplementedError`` on every dispatch. e2e-http- provider stalls on ``wait_for_document_indexes`` because the row never advances past PENDING (PM msg=dc13c4a5 root cause). Per architect msg=7782ebe0 spec lock: - ``aperag/indexing/worker_factory.py`` (new): per-task lazy ``ProductionWorkerFactory`` resolving ``Collection`` from the payload, building the right ``ModalityWorker`` with real Qdrant / Elasticsearch backends + the configured embedder / completion model. Composes existing helpers (``get_collection_embedding_service_sync`` / ``get_vector_db_connector`` / ``get_object_store`` / ``build_collection_llm_callable``) so this is wiring, not re-implementation. Failures raise ``WorkerFactoryError`` so the operator gets a meaningful ``error_message``. Graph modality is intentionally minimal (in-memory lineage store + no-op extractor) pending Wave 4 Nebula-side §D.3.6 lineage adapter — documented as a known gap, not a regression; the e2e-http-provider gate only blocks on vector ACTIVE. - ``aperag/indexing/orchestrator.py``: harden ``_runner`` to claim the row + finalise FAILED on factory error so the §I.2 retry- with-backoff schedule kicks in. Without this, factory errors got silently swallowed by the asyncio.Task and the row sat at PENDING forever. - ``aperag/app.py``: replace the placeholder closure with a ``ProductionWorkerFactory`` instance. - ``tests/integration/test_worker_factory.py``: 3 tests pinning factory-failure → FAILED-finalize, collection-not-found path, and missing-collection-id path. Local gates: pytest tests/unit_test/ tests/integration/ tests/load/ --ignore=tests/unit_test/objectstore = 909 passed / 41 skipped / 0 failed (+3 from this commit). ruff check + format --check clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…al to §F.2 4-state + drop SKIPPED sentinel + skip vector when disabled Per architect msg post-pass-5 + PM msg=79683cc0 ruling. Two e2e-http- smoke bugs surfaced after the worker_factory wire-in lands: **Bug 1 — Pydantic 400 on GET document.** orchestrator claims a row to ``RUNNING`` (the §F.2 canonical 4-state) before the worker finishes; the ``Document`` view model's per-modality status Literal still listed the legacy 6-state vocabulary (``CREATING``/``DELETING``/``DELETION_IN_PROGRESS``/``SKIPPED``) which never includes ``RUNNING`` — so any GET racing the claim returned ``ValidationError``. The Wave 3 hard-cut migrated the DB enum but missed this view-model layer (CR step-0 lesson #6: schema-touching PR must trace enum references through every deserialise surface, not just the write path). The fix collapses the 5 per-modality status Literals to the §F.2 4-state ``Optional[Literal["PENDING", "RUNNING", "ACTIVE", "FAILED"]]``. "Modality not enabled" is now expressed by the field being absent (``None``) rather than the sentinel ``"SKIPPED"`` — the row simply does not exist in ``document_index``. Friendly client-facing mapping (``NOT_ENABLED`` / ``INDEXING``) lives in §G.5 ``SearchResultMetadata.index_state_per_modality`` for the read-path response. **Bug 2 — collection without embedder triggers FAILED loop.** ``_get_index_types_for_collection`` always added ``Modality.VECTOR`` regardless of the collection's ``enable_vector`` flag. A collection without an embedding-model config (smoke test fixture) then dispatched a vector job, the production worker factory raised ``WorkerFactoryError`` (no embedder), the orchestrator finalised ``FAILED``, the reconciler re-dispatched, repeat. The fix honours ``enable_vector`` symmetric with ``enable_fulltext``: explicitly disabled means no row in the document_index table for that modality. Files: - ``aperag/domains/knowledge_base/schemas.py``: 5 status fields → ``Optional[Literal["PENDING", "RUNNING", "ACTIVE", "FAILED"]]`` - ``aperag/domains/knowledge_base/service/document_service.py``: ``_build_document_response`` returns ``None`` when index row missing (instead of ``"SKIPPED"``); ``_get_index_types_for_collection`` honours ``enable_vector`` flag. - ``tests/e2e_http/hurl/{smoke/03_document_basic,full/11_document_full}.hurl``: 6 assertions migrated from ``== "SKIPPED"`` to ``== null``. Local gates: pytest tests/unit_test/ tests/integration/ tests/load/ --ignore=tests/unit_test/objectstore = 909 passed / 41 skipped / 0 failed (unchanged from 579b32a). ruff check + format clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ne + drop asyncio.to_thread caller wrap Wave 3 chunk 2 Pattern C migration moved 5 ``.delay()`` callsites to ``asyncio.create_task(asyncio.to_thread(run_evaluation_run, run_id))``, but ``run_evaluation_run`` was still a sync wrapper that called ``asyncio.run(execute_evaluation_run(run_id))`` inside the worker thread — spawning a *fresh* event loop each invocation. Any asyncpg connection borrowed by ``execute_evaluation_run`` is bound to the FastAPI lifespan loop's connection pool; running the coroutine on a brand-new loop made every connection-pool ``ping`` fail with ``RuntimeError: got Future attached to a different loop``, which corrupted the asyncpg shared pool. Subsequent DB calls from unrelated code paths (every later e2e-http-provider hurl test that touched Postgres) tripped the same error → CI exit 1 (per huangheng pass-6 followup msg + PM msg=37da5249). Fix per huangheng option (a): * ``aperag/domains/evaluation/tasks.py``: ``run_evaluation_run`` becomes ``async def``, awaits ``execute_evaluation_run`` directly. No fresh loop. Docstring spells out the failure mode so a future reader does not regress. * ``aperag/domains/evaluation/services.py``: caller drops ``asyncio.to_thread`` and schedules the coroutine directly via ``asyncio.create_task(run_evaluation_run(run_id))``. The task shares the FastAPI lifespan loop, keeping asyncpg pool affinity. Pattern C contract preserved (fire-and-forget at the request handler boundary); only the inner mechanism changes from "thread + new loop" to "coroutine on shared loop". The other 4 ``.delay()`` callsites in chunk 2 were genuine sync work and stay on ``asyncio.to_thread`` — only evaluation's body was async-native under the hood, which is why this was the one that blew up. Local gates: pytest tests/unit_test/ tests/integration/ tests/load/ --ignore=tests/unit_test/objectstore = 909 passed / 41 skipped / 0 failed (unchanged). ruff check + format clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
4 tasks
…or Pattern C in-process dispatch CI run 24976479158 (PR #1729 head e1f2325) failed at ``16_evaluation_v2.hurl:218`` with the assertion ``$.items[0].status == "queued"``; the actual response showed ``status="running"`` because the post-pass-7 evaluation cross-loop fix (e1f2325) made dispatch effectively immediate — the ``asyncio.create_task(run_evaluation_run(run_id))`` worker starts on the next event-loop tick, so by the time the GET arrives the run has already left "queued". The test was written for Celery ``.delay()`` semantics where "queued" was a stable, externally-observable transient state thanks to broker round-trip + worker pickup latency. Pattern C in-process collapses that latency to microseconds, so "queued" is no longer reliably observable on a follow-up GET. Fix per huangheng option (a) + PM ack: relax 4 timing-sensitive assertions to accept any in-flight or terminal state via ``matches "^(queued|running|completed|cancelled)$"`` (item status uses the correspondingly-broader ``pending|...|failed|cancelled``). The contract this test pins is "the run shows up correctly in list / detail endpoints with the right ids", not "dispatch is slow enough to observe a specific transient state". POST-response asserts (lines 183, 207) keep the strict ``status == "queued"`` value because those are synchronous returns built before the ``create_task`` fires. Also relaxes: - ``summary.pending == 3`` → drop (kept ``summary.total == 3``, which is fixed by dataset cardinality) - ``progress.percent == 0`` → drop (now race-window-dependent) - ``items[0].status == "pending"`` → matches in-flight set - ``items[0].attempt_count == 0`` → drop (worker may have attempted already) - ``attempts body contains "items":[]`` → ``$.items exists`` (envelope shape only, ignore population timing) Local gates: pytest 161 passed (evaluation worker + openapi contract + indexing + integration + load + phase3 audit). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…patch_indexing in document_service PR #1729 head 30b3489 e2e-http-provider failed at the scripted ``run_chat_collection_flow.sh`` business flow because vector + fulltext modality workers reported "found no chunks at user-X/colY/docZ; treating as derive-incomplete and skipping" on every claim — the chunks.jsonl artifact never existed at the dispatcher's ``source_path``. Root cause (architect msg=c605037e ruling): chunk 2's hard-cut deleted ``aperag/domains/indexing/{tasks,orchestration,manager, *_index}.py`` whose former ``process_document_task`` ran :func:`aperag.indexing.parse_document` and wrote the canonical ``derived/parse_<v>/{markdown.md,outline.json,chunks.jsonl}`` artifacts before enqueuing modality workers. The new dispatch path never picked up that responsibility — every modality worker.derive pulled an empty derived path and the row stayed in the §C.7 reschedule loop forever. Fix per architect option (1) — Wave 3 minimal scope, not skip: ``aperag/domains/knowledge_base/service/document_service.py`` ``_create_or_update_document_indexes`` now: 1. Resolves the upload object path from ``document.doc_metadata.object_path`` (the upload handler already stashes it there). 2. Reads the source bytes from the object store on a worker thread. 3. Calls :func:`parse_document` synchronously on a worker thread so the canonical ``derived/parse_<v>/`` artifacts exist before any modality dispatch. 4. Uses ``parsed.parse_version`` and ``parsed.chunks_path`` as the dispatcher's parse_version / source_path (replaces the previously-computed-locally values that pointed at the document base prefix, not the chunks.jsonl file). This keeps §E.2 "parse-as-first-stage" intact; the parse step runs inside the request task instead of a separate ``parse_worker`` queue process. Wave 4 follow-up may promote parse to ``q:parse`` once observed parse latency starts blocking HTTP requests; the sync path is acceptable for current latencies. Parse failure raises and propagates → HTTP 500 → no modality rows created (per architect ruling: "fail loudly, no half-state"). New integration test ``tests/integration/test_dispatch_with_parse.py`` pins the canonical post-fix flow: parse first → dispatch with chunks.jsonl path → modality workers reach ``status=ACTIVE`` AND ``is_serving=TRUE`` (uses ``IndexingMode.INLINE`` so no lifespan / async queue needed; the same data-flow contract). Local gates: pytest tests/unit_test/ tests/integration/ tests/load/ --ignore=tests/unit_test/objectstore = 910 passed / 41 skipped / 0 failed (+1 new test). ruff check + format clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…d in worker_factory PR #1729 head 8ca396f e2e-http-provider failed with vector worker hitting Qdrant 400 ``Format error in JSON body: value f766a946575ec3b4:0000 is not a valid point ID``. Qdrant only accepts unsigned-integer or UUID point ids; the T1.1 parser produces chunk ids of the form ``<sha-prefix>:<index>`` (16 hex + ``:`` + 4-digit) which violate that constraint. Fulltext reaches ACTIVE because Elasticsearch is happy with any string ``_id``. Vector hits 400 on the very first ``client.upsert(...)`` call. Fix in ``aperag/indexing/worker_factory.py _QdrantPointBackend.upsert_point``: map the caller-supplied chunk_id to a deterministic ``uuid.uuid5(NAMESPACE_OID, chunk_id)`` for the Qdrant point id (stable across retries → idempotent §D.1) and stash the original id in the point payload so the read path can still echo it to clients. Vector / summary / vision share the same ``_QdrantPointBackend`` adapter so this fix covers all three modalities. Local gates: pytest tests/integration/test_worker_factory.py tests/integration/test_dispatch_with_parse.py tests/unit_test/indexing/ tests/load/ = 135 passed. ruff clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
earayu
added a commit
that referenced
this pull request
Apr 27, 2026
…pplement (#1730) Per @earayu2 / @不穷 msg=981ae30d standby task: supplement the existing per-modality acceptance tests in ``test_t1_2_graph.py`` / ``test_t1_3_vector_fulltext.py`` / ``test_t1_4_summary_vision.py`` with **cross-modality contract tests** that exercise invariants the spec promises across all 5 modalities. Per @不穷 msg=0d35f537 scope decision: this lands as a follow-up PR (NOT into the current Wave 3 PR #1729). The branch is based on ``origin/main`` and verified passing against the pre-Wave-3 codebase; no Wave 3 dependency. Coverage added (11 tests): 1. **§C.7 reschedule contract** (2 tests): summary + vision ``derive()`` with a missing upstream source returns ``DeriveResult(derived_artifact_path="")`` (the empty-string signal the orchestrator interprets as "derive incomplete, leave PENDING for next reconciler cycle"). Vector + fulltext are pass-through and covered by existing per-modality "no-op on missing chunks" tests. 2. **N-call replay convergence** (4 tests, parametrized across vector/fulltext/summary/vision): 5 consecutive ``sync()`` calls produce a backend state byte-equivalent to a single sync (extends existing 2-call tests under arbitrary retry storms — §D.4). 3. **Cross-document parse_version isolation** (4 tests, same params): ``sync()`` of doc-A's ``(doc_a, parse_version)`` slot must NOT touch doc-B's backend state. Locks the §D.1 DELETE scope: ``WHERE document_id=A AND parse_version=V`` only. Uses two distinct doc bodies because the parser's content-hashed ``chunk_id`` would otherwise collide cross-doc — fixture limitation, not a production-code invariant violation; called out in the test docstring. 4. **All-5-modality enum discriminator** (1 test): each ``ModalityWorker`` subclass binds the class-level ``modality`` attribute to the matching ``Modality`` enum value (orchestrator route key — a misbind would silently mis-route work). Graph modality is intentionally NOT in the parametric sweeps because ``test_t1_2_graph.py`` already covers the §D.3 lineage semantic exhaustively (D3.6 5-step scenario + Nebula race + byte-equivalent re-sync + tenant_scope_key propagation). The all-5-modality enum test is the only graph touch here. Local gates: - pytest tests/unit_test/indexing/test_modality_worker_contract.py → 11 passed - ruff check + format --check clean Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…y + parser markdown-only docs Per architect msg=c79e9a3f gap-report ruling: Wave 3 ships production-ready vector + fulltext + summary + vision modalities; graph and the binary-format parser path are explicitly gated until Wave 4. The previous fix-cycle was patching the visible layers (alembic / Celery residue / view model / hurl timing / cross-loop / Qdrant id format) while the real production-readiness gaps (InMemory graph store, no-op extractor, simulator parser) sat under the surface. Without this gate the graph rows would reach ``status=ACTIVE`` with zero entities written — silent broken UX for graph search. Four changes: * ``aperag/indexing/worker_factory.py``: ``_build_graph_worker`` now detects the :class:`InMemoryLineageGraphStore` placeholder and raises :class:`WorkerFactoryError` with an explicit message pointing at ``enable_knowledge_graph=false``. The orchestrator runner already finalises factory errors as ``FAILED`` with the message persisted to ``error_message`` — operators / clients see a clear refusal rather than a misleading ACTIVE-with-empty graph. When Wave 4 swaps in a Nebula adapter, the ``isinstance`` check naturally stops matching and the gate self-disables. * ``aperag/schema/common.py``: ``CollectionConfig.enable_knowledge _graph`` default flipped from ``True`` to ``False`` so new collections do not opt into the gated path by accident. Wave 4 release flips it back. * ``docs/private-deployment.md``: adds a "Wave 3 release scope" section before tier selection, naming both gates explicitly (graph backend + extractor; markdown-only parser) and the Wave 4 backlog that lifts them. * ``tests/e2e_http/scripts/run_full.sh``: skips ``run_graph_index_flow.sh`` until Wave 4 with a comment pointing at the architect ruling and the docs section. The graph e2e flow would otherwise time out on the explicit ``WorkerFactoryError`` finalising the row to FAILED instead of reaching ACTIVE, masking the legitimate scope cut behind a generic CI red. This is the closing commit for PR #1729 per architect's "no more fix-cycle" lock; everything past this point is Wave 4 backlog. Local gates: pytest tests/unit_test/ tests/integration/ tests/load/ --ignore=tests/unit_test/objectstore = 910 passed / 41 skipped / 0 failed (unchanged from a11df3c). ruff check + format clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…chema + gate vision modality to Wave 4 Per architect msg=69df0779 closing ruling on the two production gaps Bryce diagnosed (msg=8953bd05) after the previous closing attempt (4b0eaf3) still tripped run_chat_collection_flow.sh: **Finding 1 — fulltext field-shape mismatch** (real production bug, Wave 3 must fix): ``aperag/domains/retrieval/pipeline.py:_fulltext_search`` queries the legacy ES schema (``content``/``title``/``collection_id`` filter — the shape the now-deleted ``aperag/domains/indexing/fulltext_index.py`` wrote pre Wave 3 hard-cut). The Wave 1 simulator's ``FulltextModality.sync`` wrote ``text`` and no ``collection_id``, so every fulltext search after Wave 3 returned 0 hits silently. The chat-flow business test exited on ``jq -e items.length > 0``. * ``aperag/indexing/fulltext.py``: ``FulltextModality.__init__`` takes an optional ``collection_id`` (kept optional so existing in-memory contract tests continue to work without it). ``sync`` now writes ``content`` (queried by ``_fulltext_search``) and ``collection_id`` (filtered by ``_fulltext_search``) into every chunk record. ``text`` is kept as an alias of ``content`` so existing in-memory backend assertions do not regress. * ``aperag/indexing/worker_factory.py``: ``_build_fulltext_worker`` passes ``collection.id`` so production rows always carry the filter field. * ``tests/integration/test_fulltext_roundtrip_fields.py`` (new): pins the post-fix invariant — every document the ``FulltextModality.sync`` writes carries the field names the retrieval pipeline depends on (``content`` + ``collection_id``). A regression that drops either trips the first assertion. **Finding 2 — vision modality is fake** (silent broken, gate to Wave 4): The previous ``_build_vision_worker`` built ``VisionModality`` with ``embedding_service.embed_query(f"{image_id}|{alt_text}")`` — a text embedding on a string-concat. That produces deterministic per-image vectors with no actual image-content awareness. Same silent-broken pattern as the gated graph modality (architect ruling msg=c79e9a3f); same Wave 4 gate is the correct response. * ``aperag/indexing/worker_factory.py``: ``_build_vision_worker`` now requires ``embedding_service.is_multimodal()`` to be ``True`` and otherwise raises :class:`WorkerFactoryError` with the same shape as the graph gate. ``CollectionConfig.enable_vision`` is already ``False`` by default, so the gate only fires when an operator explicitly opted in. When Wave 4 ships a real multimodal embedding model + the operator configures it, the ``is_multimodal`` check passes and the gate self-disables. * ``docs/private-deployment.md``: "Wave 3 release scope" section adds a vision-modality gate paragraph alongside the existing graph and parser-markdown-only paragraphs. Wave 4 backlog item #9 is the real multimodal vision-LLM wiring. Local gates: pytest tests/unit_test/ tests/integration/ tests/load/ --ignore=tests/unit_test/objectstore = 912 passed / 41 skipped / 0 failed (+2 new roundtrip tests). ruff check + format clean. This is the (real) closing commit for PR #1729 per architect's "no more fix-cycle" lock — every remaining gap is now in the Wave 4 backlog (9 items, hard-locked). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced Apr 27, 2026
earayu
added a commit
that referenced
this pull request
Apr 28, 2026
The license-check infrastructure (`check-license`, `install-addlicense`, `install-hooks`, the bundled `addlicense` binary, and the git hooks install) was already removed in commit 00ae644 (PR #1729). The `add-license` target left behind only echoes a one-line message and does nothing else, so it is dropped here along with its `.PHONY` entry. License headers in source files are unchanged. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
earayu
pushed a commit
that referenced
this pull request
Apr 29, 2026
按 task #17 PR 协作 review 模式,fold-in 冬柏 msg=d56bb0f7 review 反馈: 1. §二 6 hard gate 加具体测试文件 mapping 子表(每个 gate 有可执行 test 文件锚点 + 默认 owner,CR verdict 表 §七要求填具体 test commit / 行号) 2. §五 加 5.5 失败注入方法规范(kubectl scale / iptables drop / kubectl delete pod 真路径 vs 禁止 mock 客户端绕过;Lesson 沉淀 来源 Wave 3 PR #1729 mock 路径过 + 真路径 fail) 3. §七 verdict 表新增 3 行(多文档并发压测阈值 by Planetegg #22 + 黄章书 / smoke regression diff = 0 对照 6 套 hurl baseline / 失败注入用真路径不允许 mock 引用 §5.5) 4. §八 加 checklist 修订记录(追踪后续 fold-in) verdict 中文化(§5.2)已 align 冬柏建议(✅ 同意通过 / ⏳ 阻塞 / 💡 小修建议),无需调整。 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
earayu
added a commit
that referenced
this pull request
Apr 29, 2026
* docs(task-17): 提交 task #17 代码改造方案章节进仓库 (协作 PR base) 按 earayu2 directive msg=53a0b121 (开 PR) + msg=d18b6887 (所有人可改 PR) + ziang msg=4ea65100/76f6f465 (代码改造文档必须入仓不能引用 agent 本地路径) + 团队 5 方共识 v8 final (候选 B + 一次性 hard cut + 解 503 最小改造). 本提交是协作式 PR 的初始 base, 含 Bryce 写的 §3 file-by-file 代码改造细节 (后 ziang msg=4ea65100 7 项 CR 修正 — 实施时按 ziang 简化版走, 见 commit 后 续 push). v8 final 整体方案 (§1/2/4-12) 由架构师 push 进来; 部署 / 发布 / 回滚 (§4-6) 由 huangzhangshu push; 状态机 / 失败场景 (§7) 由 ziang push; Helm + 代码 commits 由 Bryce 按 ziang 7 项简化版逐文件 push. 不进 task #17 主线 (按 Weston msg=d2c46eb7 + ziang msg=cd4761aa 收紧): - 包名 ``aperag/tasks/`` 大搬迁 - task #18-21 (collection 配置校验 / graph_extractor fail-loud / 图谱 GC sweep / store bulk upsert API) 作独立切片候选, 不默认进 task #17 - Dramatiq/Celery/RQ 引入 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(task-17): add v8 final architecture spec — task system hard cut 按 5 方独立 evidence-based 共识 + 全部 BLOCKER 收紧 + ziang msg=bb5a9e2e 最后修正 (删除外部 agent 路径引用,改成本 PR diff 引用)后的 ratify-ready v8.2 final 版本归档进 docs/zh-CN/architecture/task-system-hard-cut-v8.md。 包含: - Executive Summary 推荐候选 B + Reject Cards (A/C/D) - §1-2 上下文 + 不可变 hard gate (黄章书 7 条 + 2 层 invariant) - §3 代码改造 8 项 runtime 主线(不含 module 搬迁、不含 hardening) - §4 部署改造 (huangzhangshu §4) - §5 发布计划 8 步 - §6 回滚策略 + 双执行面 hard gate - §7 状态机/失败场景验收 (ziang §7) - §8 架构评审 8 块 review checklist (Weston) - §9 Spec lock fold (6 YAGNI + 4 escape hatch + PgBouncer 后续选项) - §10 CR mandatory checklist (5 cross-check + Lesson #11/12/extension v3 + Mini-pattern 17) - §11 节奏 + 团队分工 - §12 待 earayu2 ratify Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs: add task 17 deployment release runbook * docs(task-17): add state machine validation gates * docs(task-17): add architecture invariants (task system) 锁定 ApeRAG 异步任务系统不可变 invariants 进 architecture 文档: - 部署架构 invariant (API/worker 进程隔离 + probe 不放大连接池 + 连接池公式 + 回滚执行面唯一性) - 业务状态 invariant (DocumentIndex SoT + cleanup intent SoT 在 DB + API 不拥有重型执行面 + 旧任务防写回) - 任务系统选型 invariant (6 YAGNI + 4 escape hatch + PgBouncer 后续) - CR mandatory checklist (5 cross-check + Lesson #11/12/extension v3 + Mini-pattern 17 + 6 hard gate) 未来 PR review 必引用本文档防 incremental drift。 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(task-17): add huangheng CR mandatory checklist 按 earayu2 msg=c1c4ba2f directive「每个人补充文档」+ task #17 PR collaboration draft 协作要求,新增 huangheng CR 角色专属 mandatory checklist 文档 docs/zh-CN/architecture/task-17-cr-review-checklist.md。 包含: - 5 条 cross-check(粒度等量 / 节间一致 / 数字合理 / framework claim 分级 / 推荐 evidence-grounded) - 6 条架构 hard gate(ziang msg=4ea65100 + msg=ad6a610d 收敛版:API 不启 重型执行面 / cleanup intent 真源 DB / object store delete 迁出 / API readiness 轻量 / 连接池 Helm 层映射 / 回滚执行面唯一) - 7 条实现修正(ziang msg=4ea65100 + msg=76f6f465 + Bryce msg=981960cd accept 版:settings 现有 module 实例 / QuotaPolicyRegistry 直接创建 / 连接池 Helm 映射 / 删除 helper 不嵌套 transaction / object store delete 迁出 / cleanup loop 补 deleted Document scan / diagnostics 鉴权 + sync URL) - CR 必应用 lessons(Lesson #11 / Lesson #12 / extension v3 / Mini-pattern 17 / 一次性不分阶段) - CR 工作流(PR 上线后 CR 顺序 + verdict 表述规范 + false positive 自我修正 + 团队协作边界) - CR 历史 sediment 引用 - task #17 PR final review verdict 表(待 PR 合并前填) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(task-17): align scope and architecture doc links * deploy: add indexing worker helm topology * docs: clarify quota scope for task 17 deployment * feat(task-17): hard cut API/worker — 新增 indexing-worker CLI + 删 lifespan worker startup (#20) 按 task #17 v8.2 final + ziang msg=4ea65100 7 项 CR 修正实施: 1. 新增 ``aperag/cli/__init__.py`` + ``aperag/cli/indexing_worker.py`` - ``python -m aperag.cli.indexing_worker`` 独立进程入口, Helm ``indexing-worker-deployment.yaml`` (huangzhangshu commit f4be52f) 调用 - 跟 ``aperag/app.py`` 老 lifespan 行为完全等价: 选 queue (redis/inmemory) / quota backend (RedisQuotaBackend(quota_redis, QuotaPolicyRegistry())) / metrics emitter (otlp/noop) 同款 dispatch - 启动 10 个 asyncio 后台任务: 7 modality worker (vector/fulltext/graph/ graph_facts/graph_vectors/summary/vision) + parse + reconciler + cleanup - SIGTERM/SIGINT graceful shutdown: 等所有 in-flight 任务 drain + 关闭 RedisWorkQueue / quota redis client 2. ``aperag/app.py`` lifespan hard cut - 删除所有 ``asyncio.create_task(run_*_worker)`` (vector/fulltext/graph/ graph_facts/graph_vectors/summary/vision 7 个 modality worker) - 删除 ``run_parse_worker`` / ``run_reconcile_loop`` / ``run_cleanup_loop`` 启动 - 删除 ``ProductionWorkerFactory(engine=engine)`` 构造 (API 不需要消费 task) - 删除 ``indexing_runtime_tasks`` list + ``indexing_shutdown`` event (没有 task 需要 drain) - 保留: queue (push_parse / dispatcher 用) + quota_backend (API 检查租户配额) + metrics_emitter (上报 SLI) + IndexingRuntime singleton (service 层 enqueue) - ``IndexingRuntime.cleanup_worker_factory=None`` (per ziang msg=cecb0d88 hard gate: API 请求路径不能直接调 ``cleanup_for_deleted_documents`` / ``delete_objects_by_prefix``, cleanup 由 worker cleanup loop 异步执行) - 老 ``/health`` endpoint 暂保留 (line 497-501), task #21 (@申栋栋) 实施 ``/health/live`` / ``/health/ready`` / ``/health/diagnostics`` 时改造 新加坡 503 根因: API + 重型 indexing worker 共进程, graph 索引压力把 ``/health`` / 事件循环 / 线程池 / DB 连接池一起拖死, kubelet 杀 pod, ALB 503. 本 commit 把 worker 进程从 API lifespan 拆出来 — API pod 只做 HTTP 路由 + 轻量入队, ``/health`` 不再受 worker 资源压力影响. ziang msg=4ea65100 7 项 CR 修正 fold-in: 1. ✅ 用现有 ``settings`` (module-level), 不引 ``get_settings()`` helper 2. ✅ ``ProductionWorkerFactory`` 从 ``aperag.indexing.worker_factory`` import 3. ✅ ``RedisQuotaBackend(quota_redis, QuotaPolicyRegistry())`` / ``InMemoryQuotaBackend(QuotaPolicyRegistry())`` 跟 app.py 现有写法一致 4. ✅ 连接池只在 helm 层映射现有 ``DB_POOL_SIZE``/``DB_MAX_OVERFLOW`` (huangzhangshu commit f4be52f 已 push), 不引应用层双 env alias 5. ⏳ ``_delete_document`` 不嵌套 transaction (留 task #17.4 @明书 cleanup 迁出 pair 处理) 6. ⏳ Object store prefix delete 也迁出 (留 task #17.4 处理) 7. ⏳ ``run_cleanup_loop`` deleted Document scan 补全 (留 task #19 @ziang 处理) 8. ⏳ ``/health/diagnostics`` 鉴权 + sync URL (留 task #21 @申栋栋 处理) ⏳ 项不在 task #20 (#17.1+17.2) scope 内 (huangheng msg=f97b7c5f 6 lane 协作分工). 本 commit 严格只覆盖 cli + lifespan 改造, 不越界. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(task-17): co-implementer add multi-doc burst e2e + probe-layering smoke Sub-task #22 co-implementer contribution (chenyexuan; Planetegg main owner) covering two non-overlapping slices that the deployment hard cut needs end-to-end coverage for: 1. ``tests/load/test_concurrent_doc_upload_e2e.py`` — multi-document concurrent HTTP upload + index ACTIVE assertion. Skipped by default (``RUN_TASK_17_E2E=1`` to enable). Runs DOC_COUNT documents in parallel through the public API, polls until vector / fulltext / graph indexes all reach ACTIVE within ``POLL_BUDGET_SECONDS``, and samples ``/health/live`` every 2s during the burst — every sample must be 200 with p95 latency under 500ms, which is the cascading- failure-mode that the API/worker hard cut prevents. 2. ``tests/e2e_http/hurl/smoke/00_health.hurl`` — extend existing smoke with the probe-layering pins for sub-task #21 (申栋栋 main owner): ``/health`` (compat), ``/health/live`` (liveness, must not touch PG/Redis/Qdrant), ``/health/ready`` (readiness, must not consume main DB pool). ``/health/diagnostics`` is intentionally NOT asserted in smoke per Planetegg msg=64f33ceb — that endpoint is gated by admin auth / intranet and is covered from Planetegg's release-validation lane. Coordination trail: thread #indexing优化:5e959a2d msg=abccc676 (split proposal) → msg=3a7953a6 (Planetegg confirms split) → msg=64f33ceb (diagnostics nit absorbed). Cross-checks against task #18 (黄章书 deployment) and task #21 (申栋栋 health endpoints) — both still in flight; this commit pins the contract callers will assert against once those land. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(task-17): fold-in 冬柏 msg=d56bb0f7 4 条测试维度补充 按 task #17 PR 协作 review 模式,fold-in 冬柏 msg=d56bb0f7 review 反馈: 1. §二 6 hard gate 加具体测试文件 mapping 子表(每个 gate 有可执行 test 文件锚点 + 默认 owner,CR verdict 表 §七要求填具体 test commit / 行号) 2. §五 加 5.5 失败注入方法规范(kubectl scale / iptables drop / kubectl delete pod 真路径 vs 禁止 mock 客户端绕过;Lesson 沉淀 来源 Wave 3 PR #1729 mock 路径过 + 真路径 fail) 3. §七 verdict 表新增 3 行(多文档并发压测阈值 by Planetegg #22 + 黄章书 / smoke regression diff = 0 对照 6 套 hurl baseline / 失败注入用真路径不允许 mock 引用 §5.5) 4. §八 加 checklist 修订记录(追踪后续 fold-in) verdict 中文化(§5.2)已 align 冬柏建议(✅ 同意通过 / ⏳ 阻塞 / 💡 小修建议),无需调整。 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * deploy: pass postgres env to indexing worker * test(task-17): add local stability acceptance helper * feat(task-17): add split health endpoints * refactor(task-17): mount health router under prefix * test(task-17): add hard gate #1 grep gate (API lifespan no workers) The task #17 hard cut moved every ``run_*_worker`` / ``run_*_loop`` entrypoint off ``aperag/app.py`` and onto ``aperag/cli/indexing_worker.py`` to fix the Singapore 503 root cause (API + worker sharing one process, graph indexing pressure starving ``/health``). Pin that contract at the source level so a future PR cannot re-merge the runtimes by accident. The new ``tests/unit_test/test_app_lifespan_no_workers.py`` runs in the standard PR-gate suite (no deployment) and asserts: Negative on ``aperag/app.py``: - No invocation of any of the 10 worker entrypoints (vector / fulltext / graph / graph_facts / graph_vectors / summary / vision / parse / reconciler / cleanup). - No ``ProductionWorkerFactory(...)`` construction. - ``IndexingRuntime.cleanup_worker_factory`` only ever assigned ``None`` (per ziang msg=cecb0d88 + huangheng msg=f97b7c5f #6 hard gate forbidding API request path heavy backend cleanup). Positive on ``aperag/cli/indexing_worker.py``: - All 10 worker entrypoints invoked (the hard cut is symmetrical: what the API stops doing, the worker must start doing — otherwise the deployment boots but consumes nothing). - ``ProductionWorkerFactory(...)`` constructed worker-side. This is the executable companion to huangheng's ``task-17-cr-review-checklist.md`` hard gate #1 (per cuiwenbo msg=f7868d2c hard-gate-to-test-file mapping). Owner nominally @bryce per the checklist; @chenyexuan co-implementer covers this slice (msg=6627eb69) so Bryce can focus on the upcoming task #17.4 ``document_service.delete`` cleanup migration. Verified: 5/5 tests pass against current PR branch head. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(task-17): avoid legacy health redirect * chore(task-17): clean whitespace in docs and health files * docs: document task 17 pool budget defaults * test(task-17): assert legacy health does not redirect * fix(task-17): align health hurl smoke with dongdong's actual response shapes After dongdong's task #21 commits d5a2ff9 + 0c8b03b + b65f470, the real per-endpoint ``status`` values differ from what my initial hurl extension (commit f5ba44d) had pinned: - ``/health`` → ``{"status": "healthy", "service": "aperag-api"}`` - ``/health/live`` → ``{"status": "live", "service": "aperag-api"}`` - ``/health/ready`` → ``{"status": "ready", "service": "aperag-api"}`` The original extension copy-pasted the legacy ``healthy`` value across all three. Match the real shape now that #21 is in review (cuiwenbo ✅ msg=5b7b4892, Weston ✅ msg=f0b020b4) so the smoke does not break the moment the deployment lands. Also pin the ``service`` field on each endpoint to catch any future regression that drops the service-id label. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(task-17): fold worker shutdown drain + service.delete cleanup - indexing_worker: 25s asyncio.wait_for + cancel fallback so a stuck loop cannot hold the kubelet 30s grace into SIGKILL; flush OTLP MeterProvider on exit so PeriodicExportingMetricReader samples are not dropped (matches app.py finally semantics); reword the quota_backend / metrics_emitter retain-comment so the intent is legible (real acquire wires up in task #24). - document_service: drop the orphan _delete_document_indexes helper — its only consumer (_delete_document) now leaves cleanup to the worker cleanup loop, so the helper would otherwise still pull cleanup_for_deleted_documents into the API request path and break the hard gate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(task-17): add DB-backed deleted document cleanup scan * test(task-17): add sustained health observation window * test(task-17): wire concurrent upload e2e fixtures * test(task-17): remove obsolete app lifespan worker invariant --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: huangheng <huangheng@aperag.local>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Wave 3 hard-cut of the legacy Celery indexing stack — the final phase of the celery indexing redesign per
docs/modularization/indexing-redesign-design-pack.md. Replaces Wave 1 (foundation + 5 modalities, PR #1726) and Wave 2 (runtime + cutover/quota + load test, PR #1727) into a single deployable surface with no Celery / noaperag.tasks/*/ noaperag.domains.indexing/*_index.py.chenyexuan/celery-wave3-cutoverbranch by chenyexuan + Bryce (T3.2 / T3.3 lane disjoint write-set per architect msg=70a20f0e)Per-commit highlights
930cf209aef2a7dispatcher.py+ cleanup path C (collection-deletion cascade)5325788SearchResultMetadata§G.5 + private-deploy + INDEXING_MODE=inline smokec941526Config.INDEXING_MODE+ FastAPI lifespan wire-in for indexing runtimec44a2dee602f1dextract_keywordshelper toaperag/indexing/keyword_extract.py39aad24a076a135583e63processing_leasehelpers + removeflowerdep94c1d2cCollectionSummaryCallbacks5b691db4173af4d254dd6Architectural changes
Pattern A/B/C task migration (per architect msg=3890c9d7):
collection_delete_task— sync DB tombstone + periodic Path-C cascadecleanup_expired_documents_task+reconcile_collection_summaries_taskmerged into the existing 5-min cleanup loop / 30-s reconciler loop via new hooks (cleanup_expired_documents_hookinaperag/indexing/cleanup.py+reconcile_collection_summaries_hookinaperag/indexing/reconciler.py)collection_init_task,collection_summary_task,export_collection_task, evaluation/knowledge_graph tasks —asyncio.create_task(asyncio.to_thread(...))§F.3 cutover: 3-statement TX (status=ACTIVE → demote-old → promote-new) lives in
aperag.indexing.orchestrator._finalize_active_with_cutover, runs inside the worker's own session immediately aftersync()succeeds. Reconciler does NOT split this (per architect msg=492315e8 Ruling 1 — splitting introduces an ACTIVE-but-not-is_servinginconsistency window the spec forbids).Service → dispatcher layering:
aperag/domains/knowledge_base/service/document_service.py5 callsites consume the newaperag.indexing.dispatcher.dispatch_indexing()+aperag.indexing.cleanup.cleanup_for_deleted_documents()through the process-localIndexingRuntimesingleton (aperag/indexing/runtime.py). The FastAPI lifespan installs the singleton; service layer reads it. No service-layer code imports FastAPI.Tablename rename:
aperag/indexing/models.py__tablename__ = "document_index_v2"→"document_index"+ 5 index names*_v2_*→ canonical. Matches alembicd0f4c1b9a8e2post-state. No new alembic revision needed.Files deleted
aperag/tasks/*— entire dir (collection / document / models / processing_lease / reconciler / scheduler / utils)aperag/concurrent_control/*— entire dir (manager / protocols / redis_lock / threading_lock / utils + READMEs)aperag/domains/indexing/{tasks,orchestration,manager,vector_index,fulltext_index,graph_index,summary_index,vision_index}.py+aperag/domains/indexing/db/models.pyconfig/celery.pytests/unit_test/{concurrent_control,tasks}/*— contract tests for now-deleted modulestests/unit_test/test_es_*.py— legacy ES contract testsscripts/migrate_es_fulltext_shared_index.py— one-time Wave-1-era migration scriptpyproject.toml: droppedcelery<6.0.0,>=5.3.1+django-celery-beat<3.0.0,>=2.5.0+flowerFinal-HEAD gates
grep "from aperag.tasks\|import aperag.tasks\|from aperag.concurrent_control\|from aperag.domains.indexing.(tasks|orchestration|manager|*_index|db.models)\|from config.celery\|^from celery\|^import celery"overaperag/ + config/ + scripts/→ 0 hits in production code ✅alembic upgrade head→ succeeds ✅;alembic downgrade -1thenupgrade head→ reversible round-trip ✅ruff check + format --checkover aperag/ tests/ scripts/ → clean (491 files) ✅pytest tests/unit_test/ tests/load/ --ignore=objectstore→ 900 passed / 29 skipped / 0 failed ✅ (objectstore needsmotoextra, pre-existing)Test plan
(document_id, parse_version, modality)row + backend statedocs/modularization/private-deployment.md(T3.3 lane)🤖 Generated with Claude Code