
feat: add OTLP-first observability foundation #1702

Merged
earayu merged 8 commits into main from cursor-cloud/observability-design-dc2b
Apr 25, 2026

Conversation

@earayu
Collaborator

@earayu earayu commented Apr 25, 2026

Summary

  • Add a future-facing observability design document for ApeRAG and a concise AGENTS.md pointer for future agents.
  • Introduce aperag.observability as the new OTLP-first observability foundation.
  • Configure API and Celery processes with unified JSON logging, trace/span correlation, process-wide tracing, and optional OTLP export.
  • Add Celery task publish/run trace context propagation and a retrieval pipeline span.
  • Update env templates and Helm values to default to APERAG_OBSERVABILITY_MODE=local with optional OTLP endpoint configuration.
  • Remove the obsolete Jaeger path: docker-compose service/profile, Makefile toggles, JAEGER envs, Jaeger exporter dependency, and the legacy aperag/trace package.
  • Link the new Chinese observability design from README-zh.md.
  • Merge latest origin/main and resolve retrieval pipeline conflicts against the new LLM runtime rerank flow.
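
As an illustration of the "unified JSON logging, trace/span correlation" item above, here is a minimal standard-library sketch. It is not the code in aperag.observability: the names `JsonFormatter`, `start_span`, `build_logger`, and the `ContextVar` holder are hypothetical, and a real setup would read the active trace context from the tracing SDK instead of generating IDs by hand.

```python
import io
import json
import logging
import secrets
from contextvars import ContextVar

# Hypothetical holder for the active (trace_id, span_id) pair.
_current_span = ContextVar("current_span", default=None)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, adding trace/span IDs if set."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        span = _current_span.get()
        if span is not None:
            payload["trace_id"], payload["span_id"] = span
        return json.dumps(payload)


def start_span():
    """Install a new span using W3C-sized IDs (16-byte trace, 8-byte span)."""
    ids = (secrets.token_hex(16), secrets.token_hex(8))
    _current_span.set(ids)
    return ids


def build_logger(stream):
    """Return a logger whose only handler writes JSON lines to `stream`."""
    logger = logging.getLogger("aperag.sketch")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
trace_id, span_id = start_span()
log.info("retrieval pipeline started")
record = json.loads(buffer.getvalue())
```

Every line logged while the span is active carries the same `trace_id`, which is what lets a log backend join log lines to the corresponding trace.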

Testing

  • python3 -m compileall aperag/observability aperag/app.py config/celery.py
  • PATH="$HOME/.local/bin:$PATH" make lint
  • PATH="$HOME/.local/bin:$PATH" uv run python - <<'PY'\nimport aperag.app\nimport config.celery\nfrom aperag.observability import build_observability_config\nprint(aperag.app.app.title)\nprint(config.celery.app.main)\nprint(build_observability_config().mode)\nPY
  • PATH="$HOME/.local/bin:$PATH" uv run pytest tests/unit_test/tasks/test_document_graph_curation_contract.py tests/unit_test/test_es_p0_contract.py tests/unit_test/vectorstore/test_qdrant_filter_translation.py -q
  • GitHub CI is being monitored.

cursoragent and others added 7 commits April 25, 2026 14:38
Co-authored-by: earayu <earayu@163.com>
Co-authored-by: earayu <earayu@163.com>
Co-authored-by: earayu <earayu@163.com>
Co-authored-by: earayu <earayu@163.com>
Co-authored-by: earayu <earayu@163.com>
Co-authored-by: earayu <earayu@163.com>
…ility-design-dc2b

# Conflicts:
#	aperag/domains/retrieval/pipeline.py

Co-authored-by: earayu <earayu@163.com>
@cursor cursor Bot changed the title from "docs: add future observability design" to "feat: add OTLP-first observability foundation" Apr 25, 2026
Co-authored-by: earayu <earayu@163.com>
@earayu earayu marked this pull request as ready for review April 25, 2026 16:02
@earayu earayu merged commit 7dbc709 into main Apr 25, 2026
5 checks passed
@earayu earayu deleted the cursor-cloud/observability-design-dc2b branch April 25, 2026 16:43
earayu added a commit that referenced this pull request Apr 26, 2026
…ite proposal (#1725)

* docs(indexing): add indexing redesign design pack — first-principles rewrite proposal

Per earayu2 directive (#celery msg=56812dd6 + msg=d8080c08): full redesign of
the document indexing system, prioritizing simplicity and reliability over
feature breadth, targeting 100 concurrent docs, with hard-cut authorization
(pre-launch / no users / no migration).

Design pack contents (1049 lines, 11 sections):

- §A — Current system analysis with file:line evidence (3-layer ownership skew,
  Python lease thread tied to worker process, graph index NOT replace-idempotent
  per nebula.py:354 upsert_entities, ~995 lines in tasks.py mixing infra +
  business)

- §B — First principles (single SoT in DB, idempotent convergence, source/
  derived/index three-layer separation, concurrency bounded by external
  capacity, simple > complex)

- §C — Three-layer document model (collections/<id>/documents/<id>/source/ +
  derived/parse_<v>/{markdown.md, chunks.jsonl, kg.jsonl, summary.json,
  vision/} + backend index stores)

- §D — Idempotency contract per modality (DELETE-by-(document_id, parse_version)
  before INSERT for all 5 modalities; fixes graph index append bug)

- §E — Concurrency model decision matrix (HTTP-only / lightweight task /
  Celery refactor); recommends lightweight Redis-backed asyncio worker pool
  per modality (5 worker processes, ~80-line reconciler, no Celery / no chord /
  no Python lease thread)

- §F — State machine + atomic flip (4 status values vs current 6;
  document.active_parse_version + pending_parse_version with transactional
  flip; deletion via async cleanup worker)

- §G — Multi-modal unified pipeline (Modality ABC with derive() + sync()
  contract; collapses earayu2's "Celery tasks go round in circles and end up
  back at graph index" complaint into 2 functions in 1 file)

- §H — Multi-tenant isolation (recommend simple — required tenant context +
  bulkheads, defer fairness machinery until observability shows real
  noisy-neighbor signal)

- §I — Failure recovery (3 modes: worker crash, transient backend, permanent
  failure; exponential backoff retry; Redis token bucket for LLM rate-limit
  backpressure)

- §J — Observability (4 SLI: index_lag_seconds, index_failure_rate,
  queue_depth, worker_utilization; OTLP wire; aligns with PR #1702)

- §K — Migration plan (7 PRs: observability primitives → idempotent indexers →
  object store layout → worker pool → atomic flip → cutover → availability
  discriminator; feature-flagged dual-stack during PR-D/E; cutover deletes
  ~3000 lines of Celery infrastructure)

Net delta: roughly +4150 / -4850 lines across 7 PRs — net subtraction despite
adding functionality. Indexing layer drops from ~2500 lines to ~1500.
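
The §D idempotency contract (DELETE by (document_id, parse_version) before INSERT) can be sketched as follows. This is an assumption-laden illustration using an in-memory SQLite table; the `chunk_index` table and `sync_chunks` function are hypothetical names, not part of the design pack.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunk_index ("
    " document_id TEXT NOT NULL,"
    " parse_version TEXT NOT NULL,"
    " chunk_id TEXT NOT NULL,"
    " body TEXT NOT NULL)"
)


def sync_chunks(conn, document_id, parse_version, chunks):
    """Replace all rows for (document_id, parse_version) in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "DELETE FROM chunk_index WHERE document_id = ? AND parse_version = ?",
            (document_id, parse_version),
        )
        conn.executemany(
            "INSERT INTO chunk_index VALUES (?, ?, ?, ?)",
            [(document_id, parse_version, cid, body) for cid, body in chunks],
        )


chunks = [("c1", "hello"), ("c2", "world")]
sync_chunks(conn, "doc_A", "v1", chunks)
sync_chunks(conn, "doc_A", "v1", chunks)  # a retry must not duplicate rows
row_count = conn.execute("SELECT COUNT(*) FROM chunk_index").fetchone()[0]
```

Because the delete and the inserts share one transaction, running the sync twice converges to the same two rows, which is the property an append-only upsert lacks.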

Three open decisions deferred to earayu2:
1. Concurrency model: lightweight Redis-asyncio (recommended), HTTP-only, or
   Celery refactor
2. Atomic flip contract: all-modalities-ACTIVE-required (recommended) vs
   per-modality independent
3. PR sequence: 7-PR cut (recommended) vs combined

Sibling reference: Bryce msg=791082a4 + msg=38fbf962 first-principles analysis
+ architect msg=19f283d5 + msg=2ee66c89 4-blind-spot synthesis. This design
pack is the single canonical deliverable per earayu2's owner directive
(@符炫炜 sole author of the final design).

* docs(indexing): redesign pack v2, incorporating earayu2's decisions and answering the derived/MinIO questions

Driven by earayu2 msg=cc0a00d7 + PM consolidation msg=32463d64.

- Drop Celery → lock Redis + asyncio (§E, decision matrix removed)
- Drop atomic flip → per-modality independent is_serving cutover (§F)
  Accept short eventual-consistency window per earayu2 directive.
- Answer derived/parse_<v>/ contents per modality (§C.6) — chunks.jsonl
  shared by vector+fulltext, kg.jsonl, summary.json, vision/manifest.jsonl
  + images/, markdown.md + outline.json.
- Answer MinIO/object-store suitability (§C.7) — ~150 MB / 100-doc burst,
  trivial; LocalFS / MinIO / S3-compatible all work; small-file + LIST
  caveats addressed.
- Add §L private/on-premise "deploy-and-forget":
  Tier 1 inline (SQLite + LocalFS, ~10 docs/hour),
  Tier 2 docker-compose (~100 concurrent),
  Tier 3 horizontal scale-out — same code.
- §H tenant_scope_key forward-compat hook for future organization concept;
  simple Redis token-bucket quota that won't lock future fairness.
- §K restructure: 7 PRs → 3 waves (Foundation / Runtime / Cutover) with
  per-wave parallel-writability map.
- §G.5 SearchResultMetadata extended with parse_version + index_state_per_modality
  (becomes structurally required under per-modality independent flip).
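
The token-bucket quota keyed by tenant_scope_key (§H) can be illustrated with a small in-process sketch. The design calls for a Redis-backed bucket; this version keeps state in a Python dict and takes the clock as a parameter so the behaviour is easy to follow. `TokenBucket`, `allow`, and the capacity and refill values are all hypothetical.

```python
class TokenBucket:
    """Classic token bucket: refills continuously, capped at `capacity`."""

    def __init__(self, capacity, refill_per_second, now):
        self.capacity = float(capacity)
        self.refill_per_second = float(refill_per_second)
        self.tokens = float(capacity)
        self.updated_at = now

    def try_acquire(self, now, cost=1.0):
        # Credit tokens for the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


buckets = {}


def allow(tenant_scope_key, now):
    """One bucket per tenant scope, so tenants cannot drain each other."""
    if tenant_scope_key not in buckets:
        buckets[tenant_scope_key] = TokenBucket(capacity=2, refill_per_second=1.0, now=now)
    return buckets[tenant_scope_key].try_acquire(now)


results = [
    allow("tenant-a", 0.0),  # first token
    allow("tenant-a", 0.0),  # second token
    allow("tenant-a", 0.0),  # bucket empty, rejected
    allow("tenant-a", 1.0),  # one token refilled after a second
    allow("tenant-b", 0.0),  # separate bucket, unaffected by tenant-a
]
```

Keying the bucket on an opaque scope string is what keeps the hook forward-compatible: the key can later map to an organization without changing the bucket logic.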

PR #1725 v2; awaiting earayu2's final sign-off on Wave 1 kickoff.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(indexing): v2 amendment — Bryce review deltas (chunking trade-off / serving invariant / graph entity lineage)

Bryce v2 review msg=7ccb176f surfaced 3 substantive technical deltas all
agreed-as-must-address by PM msg=fc307bbf. Folded into v2:

§C.6 — chunks.jsonl shared-by-vector-and-fulltext is now framed as conscious
trade-off (vector wants larger chunks, fulltext wants smaller) with explicit
shadow-file extension hook (chunks.fulltext.jsonl + namespaced sub-IDs)
preserved so future split is unblocked.

§F.1 — partial unique index added at the schema layer:
  CREATE UNIQUE INDEX uniq_document_index_serving
      ON document_index (document_id, modality)
      WHERE is_serving = TRUE;
This makes the "at most one serving row per (doc, modality)" invariant DB-
enforced, not orchestrator-enforced. SQLite 3.8+ supports the same syntax
(Tier 1 deploy stays consistent).
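
A quick way to see the invariant hold is to try it against in-memory SQLite. The sketch below assumes a reduced, hypothetical `document_index` table (only four columns) and writes the predicate as `is_serving = 1` so it does not depend on the TRUE keyword being available in the SQLite build at hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document_index ("
    " document_id TEXT NOT NULL,"
    " modality TEXT NOT NULL,"
    " parse_version TEXT NOT NULL,"
    " is_serving INTEGER NOT NULL DEFAULT 0)"
)
conn.execute(
    "CREATE UNIQUE INDEX uniq_document_index_serving"
    " ON document_index (document_id, modality)"
    " WHERE is_serving = 1"
)

conn.execute("INSERT INTO document_index VALUES ('doc_A', 'vector', 'v1', 1)")
# Non-serving rows are outside the partial index, so any number may coexist.
conn.execute("INSERT INTO document_index VALUES ('doc_A', 'vector', 'v2', 0)")

try:
    conn.execute("INSERT INTO document_index VALUES ('doc_A', 'vector', 'v3', 1)")
    second_serving_rejected = False
except sqlite3.IntegrityError:
    second_serving_rejected = True

# Cutover: retire the old serving row, then promote v2, in one transaction.
with conn:
    conn.execute(
        "UPDATE document_index SET is_serving = 0"
        " WHERE document_id = 'doc_A' AND modality = 'vector' AND is_serving = 1"
    )
    conn.execute(
        "UPDATE document_index SET is_serving = 1"
        " WHERE document_id = 'doc_A' AND modality = 'vector' AND parse_version = 'v2'"
    )

serving = conn.execute(
    "SELECT parse_version FROM document_index WHERE is_serving = 1"
).fetchall()
```

The second serving insert fails at the database layer, so a buggy orchestrator cannot leave two serving rows for the same (document, modality) pair.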

§D.3 — graph entity lineage model rewritten. Cross-document shared entities
("Linus" mentioned in 100 docs) cannot be cleared by simple DELETE-by-doc
without losing other docs' contributions. New model:
  - source_lineage: SET<{document_id, parse_version, chunk_ids[]}>
  - description_parts: SET<{document_id, parse_version, text}>
  - sync = lineage-level DELETE+INSERT, entity GC when lineage empty
Includes per-entity serialization invariant for Nebula (read-modify-write
without native list ops). 5-step idempotency self-test extension specified.

PR #1725 v2; ready for earayu2 / Bryce final ack.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(indexing): §D.3.2 amendment — lineage cleanup by document_id only

Bryce v3 implementation review (msg=464d5b70) caught a spec bug in
§D.3.2 step 1b: the pseudocode used `(document_id, parse_version)`
exact-match for lineage filter, which contradicts §D.3.6 narrative
step 3 ("doc_A v2 write (overwrites doc_A's old lineage)"). Strict exact-match
would leave lineage[A,v_old] + lineage[A,v_new] coexisting after a
re-parse, violating the expected supersede semantics.

Architect ruling (msg=80c5dc06) is to amend §D.3.2 step 1b to filter
by `document_id` only (not parse_version). This makes sync(doc, v_new)
self-contained for supersede; orchestrator does not need to do
explicit clear-then-sync.

§D.3.6 narrative remains canonical. PR #1726 Wave 1 graph implementation
follows the corrected algorithm.
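
The amended supersede behaviour can be sketched with plain dictionaries standing in for the graph store. This is an illustration under assumptions, not the Wave 1 implementation: the in-memory `graph` dict, the `sync_entity` helper, and the signature given here to `remove_entity_lineage_member` are hypothetical, and the per-entity serialization needed for Nebula is not modelled.

```python
def _without_document(members, document_id):
    """Drop every member contributed by `document_id`, whatever its parse_version."""
    return [m for m in members if m["document_id"] != document_id]


def remove_entity_lineage_member(graph, entity, document_id):
    """Remove one document's contribution; GC the entity when lineage is empty."""
    node = graph.get(entity)
    if node is None:
        return
    node["source_lineage"] = _without_document(node["source_lineage"], document_id)
    node["description_parts"] = _without_document(node["description_parts"], document_id)
    if not node["source_lineage"]:
        del graph[entity]


def sync_entity(graph, entity, document_id, parse_version, chunk_ids, text):
    """Lineage-level DELETE + INSERT, filtering by document_id only."""
    remove_entity_lineage_member(graph, entity, document_id)
    node = graph.setdefault(entity, {"source_lineage": [], "description_parts": []})
    node["source_lineage"].append(
        {"document_id": document_id, "parse_version": parse_version, "chunk_ids": list(chunk_ids)}
    )
    node["description_parts"].append(
        {"document_id": document_id, "parse_version": parse_version, "text": text}
    )


graph = {}
sync_entity(graph, "Linus", "doc_A", "v1", ["a1"], "from doc_A v1")
sync_entity(graph, "Linus", "doc_B", "v1", ["b1"], "from doc_B v1")
sync_entity(graph, "Linus", "doc_A", "v2", ["a7"], "from doc_A v2")  # re-parse

lineage_after_reparse = sorted(
    (m["document_id"], m["parse_version"]) for m in graph["Linus"]["source_lineage"]
)

remove_entity_lineage_member(graph, "Linus", "doc_A")
remove_entity_lineage_member(graph, "Linus", "doc_B")
entity_collected = "Linus" not in graph
```

After the re-parse only doc_A's v2 member remains next to doc_B's untouched contribution, and the entity disappears once the last contributing document is removed.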

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(indexing): §J.1 amendment — failure_total + success_total counter pair (was single rate gauge)

huangheng Wave 1 CR (msg=8e67bf0e) flagged §J.1 spec drift: T1.5
implementation emits index_failure_total + index_success_total counter
pair, not the single index_failure_rate gauge the spec called for. Architect
ruling: amend spec to match implementation. Counter pair is OTLP-
idiomatic, preserves raw events, re-aggregates across workers without
sliding-window state, and the rate is trivially computable downstream.

§J.1 spec amended; §K Wave 1 acceptance bullet updated.
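
The re-aggregation argument can be checked with a few lines of arithmetic. The sketch below uses made-up per-worker numbers and hypothetical helper names (`merge`, `failure_rate`); only the two counter names come from the spec.

```python
from collections import Counter


def merge(worker_counters):
    """Sum raw counters from all workers; counters add without extra state."""
    total = Counter()
    for counters in worker_counters:
        total.update(counters)
    return total


def failure_rate(counters):
    """Derive the rate downstream from the counter pair."""
    attempts = counters["index_failure_total"] + counters["index_success_total"]
    return counters["index_failure_total"] / attempts if attempts else 0.0


# Illustrative numbers only: two workers with very different throughput.
worker_a = Counter(index_success_total=90, index_failure_total=10)
worker_b = Counter(index_success_total=290, index_failure_total=10)

total = merge([worker_a, worker_b])
rate = failure_rate(total)  # 20 failures out of 400 attempts

# Averaging per-worker rate gauges would weight both workers equally.
naive_mean = (failure_rate(worker_a) + failure_rate(worker_b)) / 2
```

Summed counters give the true fleet-wide rate, while the mean of per-worker gauges overstates it here because the low-volume worker counts as much as the busy one.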

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(indexing): §F.1 + §F.5 amendments — collection_id/source_path/tenant_scope_key columns + cleanup Path C

Three Wave 3 amendments folded in to unblock chenyexuan T3.1 (msg=afe345a9):

§F.1 schema:
- Add collection_id VARCHAR(64) NOT NULL (denormalized from
  document.collection_id, populated by orchestrator at INSERT for
  self-contained dispatch payload — per huangheng Wave 2 CR finding
  msg=c94b57fe + architect ruling msg=498b12f0).
- Add source_path TEXT NOT NULL (pointer to source/ artifact, worker
  derive reads directly without JOIN).
- Add tenant_scope_key VARCHAR(64) NOT NULL (forward-compat for future
  organization concept per §H.2; required key for §H.5 quota bucket).
  Was implicit-but-not-listed in the schema before; now explicit.
- Add idx_document_index_collection + idx_document_index_tenant_scope
  indexes for cleanup / quota scans.

§F.5 cleanup worker:
- Restructure to three paths (A/B/C) with explicit semantics.
- Path A: orphan parse_version GC (existing); now notes graph backend
  no-op via §D.3 lineage supersede + graph_noop counter for telemetry.
- Path B: single-document deletion cascade — explicit graph dispatch
  via remove_entity_lineage_member(document_id) per §D.3 amended canonical
  (by document_id only).
- Path C (NEW): collection deletion cascade — Collection.deleted_at
  scan + Path B per child document + final Collection row + storage
  tree cleanup. Replaces legacy Celery collection_delete task with
  state-driven recovery (no asyncio.create_task() durability gap).

These amendments unblock T3.1 commit 1+ since chenyexuan needs the
spec head to reference for the audit allowlist removal and the
caller-migration patterns. Wave 3 task #14 acceptance criteria
(per PM msg=5939e394) now reference this head.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: 符炫炜 <fuxuanwei@apecloud.io>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>


2 participants