refactor(task-31-a3): description-free graph_curation 7 call sites#1941
Conversation
Wave 5 description-NULL invariant (task #31 spec § 3.1.5): graph extractor stopped emitting `description` text post Wave 5 task #5 (facts/vectors split). The dedup detection / scoring / snapshot / accept-apply paths still read `entity.description` / `compacted_description` / `description_parts` and would either silently degrade scoring (always-empty bag-of-tokens) or leak stale fragments from pre-Wave-5 rows into reviewer-facing suggestions. Fix the 6 detector / snapshot call sites + 1 apply path enumerated in the spec, plus 1 service-layer helper surfaced by the boundary test grep gate: 1. candidate_generation.py:38 entity_snapshot — drop description 2. candidate_generation.py:179 _lexical_signals — drop description Jaccard token overlap 3. candidate_generation.py:196 _pair_score — drop description scoring weight (signal no longer emitted; branch is dead) 4. dto.py CurationEntity.from_lineage — set description="" instead of deriving from compacted / description_parts; keep field on the dataclass for back-compat with callers that still pass it 5. merge_candidate_detector._description_text_for_scoring → _embedding_query_text — embed `<name> (<entity_type>)` (mirror of how the graph_vectors worker writes the entity vector, Wave 5 task #5 / #7); the legacy method always short-circuited to "" post Wave 5 so detection produced zero candidates 6. merge_candidate_detector._to_legacy_entity — pass description="" instead of reading from entity 7. merge_candidate_detector._snapshot — drop description key from persisted entity_snapshots payload +1 lineage_merge.py — add merge_entities_apply_description_free variant for the async accept-apply worker (task #31 § 3.1.5). Skips LLM unified description / Compactor pass / __curation_merge__ sentinel description write / vector embed write per the spec «不调» list. Legacy merge_entities path is preserved for manual sync API back-compat (Lesson #14 multi-iteration cleanup follow-up). +1 service._fetch_shadow_neighbors — replace `entity.description or entity.name` with `entity.name`; post Wave 5 the description is always "" so the fallback was a no-op, and reading description here violates the boundary gate. Boundary gate (tests/boundaries/test_graph_curation_description_free.py, 4 AST-level assertions per spec § 5.2.a): - graph_curation_modules_do_not_read_entity_description - merge_candidate_detector_does_not_read_entity_description - lineage_merge_apply_description_free_does_not_read_entity_description - lineage_merge_apply_description_free_does_not_call_llm_or_compactor Allowlist: - lineage_merge.merge_entities (legacy back-compat) excluded by file - dto.py field declaration excluded (annotation, not a read) - LineageMergeResult.compacted_description (non-entity result shape used by legacy sync handle_action API) excluded by base name Wave-5 invariant codify pattern (Lesson #18 candidate, per huangheng PR #1932 + chenyexuan PR #1933 first-application demo): lesson sediment (cr-checklist § 四 Wave 5 description-NULL family) + mechanical gate (this boundary test) — paired so future regressions fail at CI not at review time. Tests: 1466 unit + 104 boundary all green. Risk: 0 production behavior change for legacy sync handle_action API (merge_entities preserved); new accept-apply async path uses the description-free variant exclusively. Spec: docs/zh-CN/architecture/task-31-graph-node-merge-spec-v1.md § 3.1.5 Task: task #77 (Phase A3) under task #31 umbrella Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Testing CR on PR #1941: implementation direction looks right, but one boundary gate gap blocks my LGTM. Local targeted verification passed:
Result: 19 passed. BLOCKER — the description-free boundary gate excludes Spec §3.1.5 explicitly includes if path.name in {"lineage_merge.py", "dto.py"}:
continueThat means a future regression like reintroducing The whole-file Once |
Per @huangheng cr-checklist Lesson #14 + #18 候选 cross-link verify (msg=be330423) — 2 non-blocker NITs on PR #1941 fix-forwarded: NIT 1 (service.py:244 deprecation marker): Add deprecation comment on the legacy sync ``handle_action()`` API return-shape line that reads ``merge_result.compacted_description``. Aligns with Lesson #14 «老 path 保留 + 标 deprecation» pattern (matches the ``lineage_merge.merge_entities`` deprecation marker added by the main commit), and explicitly cross-links the boundary test allowlist mechanism (``NON_ENTITY_BASE_NAMES``) so future grep-based audits don't dispatch on the read. NIT 2 (boundary test docstring bonus catch cross-link): Add explicit Lesson #18 候选 second-application demo trail in ``tests/boundaries/test_graph_curation_description_free.py`` module docstring — cite the ``service.py:845`` bonus catch (``text = entity.description or entity.name`` inside ``GraphCurationService._fetch_shadow_neighbors``) as canonical proof of the «lesson sediment + mechanical gate 双 layer codification» value. The spec § 3.1.5 ratify (符炫炜 + Bryce + ziang + huangzhangshu + Weston multi-source review) listed exactly 6+1 sites and every reviewer + spec author missed this 7th hidden read; the boundary gate caught it on first run, turning ``reviewer-as-detector`` into ``CI-as-detector`` per the Lesson #18 thesis. 0 production code change beyond comment / docstring text. Tests: 4/4 boundary test pass + ruff format / check clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
earayu
left a comment
There was a problem hiding this comment.
ziang graph_curation domain verify ✅ LGTM.
Verified:
CurationEntity.from_lineagenow hard-setsdescription="", andcandidate_generationno longer emits/reads description in snapshot, lexical signals, or scoring.MergeCandidateDetectorno longer depends on description-derived text;_embedding_query_text(name + type)fixes the post-Wave-5 zero-recall path, and_snapshotno longer persists description into suggestion payloads.LineageEntityMerger.merge_entities_apply_description_free()is correctly isolated as the async apply variant: alias rows + lineage re-anchor + source L1 delete only, no LLM, no compactor, no__curation_merge__, no vector write/delete. Legacymerge_entities()remains back-compat only.- The AST boundary gate is useful and already proved value by catching
service.pyhidden read; targeted tests passed locally:
uv run --extra test pytest tests/boundaries/test_graph_curation_description_free.py tests/unit_test/indexing/test_merge_candidate_detector.py -q→ 15 passed.
Non-blocking NITs:
merge_candidate_detector.pymodule docstring still says it embeds “the description”;service._fetch_shadow_neighbors()docstring still says vector-recall “by entity description”. The implementation is fixed, but those docstrings should be updated when convenient so future readers do not reintroduce the old path.- Sync detector query text is
name + entity_type, while full-sweep_fetch_shadow_neighbors()usesentity.nameonly. This is safe for A3 and not a merge blocker, but Phase B should either align the two recall query shapes or explicitly document why full sweep intentionally stays name-only.
|
Update to my previous domain-verify comment: I agree with @huangzhangshu's boundary-gate blocker. The implementation sites are fine, but After that fix-forward, my graph_curation domain LGTM can stand. |
Per @huangzhangshu BLOCKER (PR #1941 testing-lane CR, msg=2deb5407) + @ziang second-source ratify (msg=f485803c) + @不穷 PM dispatch (msg=a6cd42c9): the boundary gate ``test_graph_curation_modules_do_not_read_entity_description`` was whole-file excluding ``aperag/graph_curation/dto.py`` to avoid flagging the dataclass field declaration. But spec § 3.1.5 item 4 explicitly lists ``CurationEntity.from_lineage`` as one of the 6 description-free call sites, so the gate must catch future regressions that re-introduce ``entity.compacted_description`` / ``entity.description_parts`` reads inside ``from_lineage``. The whole-file exclusion was a false-positive prevention that turned out to be unnecessary: the AST walker matches ``ast.Attribute`` reads only, and dataclass field annotations (``description: str = ""``) are ``ast.AnnAssign`` nodes with ``target=ast.Name``, while constructor keyword args (``cls(description="")``) are ``ast.keyword`` nodes — neither is an ``ast.Attribute`` access on an entity object. Drops the whole-file exclusion and adds two reinforcing sister-tests so future maintainers do not regress this: * ``test_dto_module_is_in_boundary_scope`` — synthetic-AST positive control: feeds a fake ``from_lineage`` body that reads ``entity.compacted_description`` through the same offender detector and asserts the offender is surfaced. If a future refactor breaks the AST walker, this test catches the silent protection-loss. * ``test_dto_field_declaration_is_not_a_false_positive`` — live negative control: confirms the production ``dto.py`` produces zero offenders, with a docstring directing future maintainers to fix the walker (NOT re-allowlist the file) if a false- positive is ever observed. 6/6 boundary tests pass + ruff format / check clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
earayu
left a comment
There was a problem hiding this comment.
Final pass after 8116639 ✅
The boundary blocker is fixed: dto.py is now in scope, lineage_merge.py remains the only whole-file legacy exclude, and the new positive/negative controls correctly protect the gate itself:
- synthetic
entity.compacted_descriptionread is detected; - live
dto.pyfield declaration /description=""constructor write do not false-positive.
Targeted verification:
uv run --extra test pytest tests/boundaries/test_graph_curation_description_free.py -q → 6 passed.
My earlier graph_curation domain LGTM now stands. The remaining two NITs from my first comment are non-blocking doc/alignment follow-ups.
|
Testing final pass on PR #1941 at The blocker is fixed:
Local targeted verification:
Result: 21 passed.
|
task #31 Phase A4 implementation (dongdong) - FE merge suggestion UI + dismiss action. ## Scope - backend: dismiss action support (schemas.py + service.py:handle_action dismiss branch + 三 success return 补 message field per Weston BLOCKER msg=c1595745) - FE typed schema sync: SUGGESTION_ACTIONS 加 dismiss + MergeSuggestionStatus 7 active + 3 legacy union - FE rendering: 7-state UI display (4 new APPLY_PENDING/APPLYING/APPLIED/APPLY_FAILED + DISMISSED + legacy ACCEPTED/EXPIRED/SUPERSEDED) + dismiss button + canonical fields 切换 + panel-open implicit run → read-only list + 显式触发扫描 button - FE-derived P1 fields (per architect lock msg=80054596 simple-stable + 不扩后端): observed_types/type_conflict/affected_doc_count/suggested_entity_type - 顺手 catch fix: 删 target_entity_data extension (backend extra=forbid 422 兼容 bug) - legacy ACCEPTED label "Applied (legacy)" / "已应用 (历史)" semantic alignment (per cuiwenbo NIT 1 + spec § 3.1.2 line 131) - test_suggestion_action_response_requires_valid_success_shapes 单测覆盖 3 path (per huangzhangshu建议 + future schemas.py field add 漂移防护) ## CR collected (4/4 LGTM) - @符炫炜 architect ratify ✅ msg=36a0bbe4 + BLOCKER condition met msg=44813b59 - @cuiwenbo CR pair final final pass ✅ msg=9e503bda - @huangzhangshu testing final pass ✅ msg=bf233776 - @weston architect cross-CR final pass ✅ msg=0fe380bf ## Architect own-up sediment - SuggestionActionResponse.message required field gap = 第二个 architect ratify trust-framing miss (Weston catch via first-principles trace) - Lesson #12 v9 fifth-application demo - mini-pattern 19 升级范围: spec → impl 边界 + impl → response_model contract 边界 + impl catch path → upstream raise points 边界 ## CI - lint-and-unit ✅ - e2e-http-smoke 3/3 ✅ - provider-preflight 3/3 ✅ - e2e-http-provider 3/3 ✅ (re-run after JSON error injected into SSE stream flake confirmed - cross-PR same signature with PR #1941 A3 = systematic flake per ci-flake-policy § 2.2 single-shape signature waiver) 🤖 Architect ratify by Claude Code
#1943) * docs(cr-checklist): task #31 Phase A 全闭环后 sediment fold-in 子 PR 2 § 四 加 6 lesson sediment(task #31 Phase A 4 PR + task #33 P3 PR #1933 codify 累计实证 + multi-PR same-hour multi-source first-principles catch trust-framing miss)+ § 六 sediment 引用追加 5 PR commit cross-link + § 八 修订记录追加本 PR fold-in 完整 trail。 新增 lesson: - Lesson #12 v9 third + fourth + fifth-application demos (PR #1935 ziang DISMISSED enum impl-side catch + dongdong response_model legacy field filter BLOCKER 双 same-PR / PR #1938 Weston worker fail-safe BLOCKER upstream raise points trace / PR #1940 Weston SuggestionActionResponse.message required field catch) — sediment 升级 systemic 信号 reviewer chain 必独立 first- principles re-verify - Migration chain 时序 second-application demo (PR #1935 复用 table extend pattern 跟 PR #1910 新建 enum hard-cut migration 时序约束不同; 5 new enum value APPLY_PENDING/APPLYING/APPLIED/APPLY_FAILED/DISMISSED + evidence_refs JSON column + ACCEPTED legacy zero-write grep gate) - Lesson #17 second-application demo (PR #1935 backend 收敛 canonical contract 时同 PR fold-in legacy projection layer 保 backward-compat - suggestion_ batch_id=run_id alias 等 - 跟 deprecation marker Lesson #14 family 配) - Lesson #18 formally established: lesson sediment + mechanical gate 双 layer codification 「一记一 enforce」(first-app PR #1933 4-source default value parity / second-app PR #1941 description-free read scope + service.py:845 bonus catch / third-app PR #1941 fix-forward sister tests 防 whole-file exclude 静默削弱 gate) - mini-pattern 19: spec lock pre-check grep main 实证 enum/contract assumption (architect own-up 升级版三层: spec→impl / impl→response_model / impl catch path→upstream raise points) - mini-pattern 20: PR adds response_model wire-up 必跑 model_validate(actual_ handler_return_shape) boundary gate (PR #1940 first-application demo) per architect dispatch msg=b6726ac9 + msg=420ca548 sediment trigger A 满足 (task #31 Phase A 4/4 done) 启动 + Phase B B1 lane huangheng owner. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(cr-checklist): fix cite accuracy NIT per Weston msg=7690b723 2 cite accuracy fixes (Weston framing CR catch): 1. response_model validation failure 状态码: 422 -> 500 - response_model validation fails 抛 FastAPI ResponseValidationError - 通常映射到 HTTP 500,不是 request body 校验的 422 - 影响 line 745 + line 850 描述 PR #1940 BLOCKER 时的状态码引用 2. GraphMergeSuggestionItem canonical schema 字段实证修正 - 原写: ... / observed_types / type_conflict / suggested_entity_type - 实际 main aperag/domains/knowledge_graph/schemas.py::GraphMergeSuggestionItem 不含这三字段 - A4 (PR #1940) 这些字段是 FE-derived display (FE 从 entities / suggested_target_entity / evidence_refs 推导),不是 PR #1935 backend projection - 影响 line 781 sect 4 Lesson #17 second-application demo 描述 per Weston PR #1943 framing CR (msg=7690b723) - sediment cite accuracy 要求把事实漂移修干净,避免 future onboarding reference 时 confuse 422/500 状态码语义 + backend/FE field source attribution。 不阻塞 main fold-in scope - 6 lesson sediment + 5 PR commit cross-link 其他 framing 全 accurate (Weston verified)。 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…ty matrix (#1949) * feat(collection): task #61 P1-D3 — vector backend identity + capability matrix Project the deployment-wide ``settings.vector_db_type`` onto every collection detail read so the FE can render a "what does this vector backend actually support" panel without per-collection migration or runtime probe. Backend (output-only projection): - ``aperag/schema/common.py``: ``VectorBackendCapabilities`` + ``VectorBackendInfo`` + ``_STATIC_VECTOR_BACKEND_CAPABILITIES`` dict + ``project_vector_backend_info()`` helper. - ``aperag/domains/knowledge_base/schemas.py:Collection``: add ``vector_backend: Optional[VectorBackendInfo]``. **Intentionally NOT on ``CollectionConfig``** so the OpenAPI ``CollectionCreate`` / ``CollectionUpdate`` input shapes do not let callers mistake a deployment-wide setting for a per-collection editable knob (per dongdong msg=c2593fdd + PM msg=caf7e4df + architect msg=0044261f read-only projection lock). - ``aperag/domains/knowledge_base/service/collection_service.py``: populate ``vector_backend`` in ``build_collection_response`` from ``settings.vector_db_type``; ``None`` for unknown backends so the FE can render a placeholder without a hard failure. Cross-PR consistency with task #83 / PR #1948 (Bryce, vector adapter behavior fixes): - Bryce's connector-layer ``BACKEND_CAPABILITIES`` ClassVar declares 2 truth flags (``supports_atomic_batch_upsert`` + ``supports_legacy_mode``); this PR's schema-layer Pydantic model mirrors those values plus a 3rd schema-layer-only flag ``supports_filter_or_with_empty_parts`` which is uniformly False across adapters after task #83 P1-V3 (translator-level defense-in-depth rejects empty Or parts). - The 3rd flag stays in the schema so the FE can declare the uniform reject explicitly per spec § 2.3 P1-D3 「显示『允许差异但显式』」 — Lesson #17 backend 收敛 contract simple-stable family pattern (cite PR #1930 SearchHit normalize, PR #1935 GraphMergeSuggestionItem projection layer). Mechanical gate (per Lesson #18 lesson-sediment + mechanical-gate 双 layer codification — first established by chenyexuan PR #1933 / PR #1941, then PR #1940 ``model_validate`` boundary): 13-case unit suite in ``tests/unit_test/contracts/test_vector_backend_capability_matrix.py`` pins each capability flag, normalizes inputs, and round-trips Pydantic ``model_dump`` so future drift between schema, projection helper, and FE-consumed shape fails fast at unit-test time. FE (read-only display): - ``web/src/features/collection/types.ts``: typed mirrors ``VectorBackendInfo`` / ``VectorBackendCapabilities`` / ``VectorBackendType``. - ``web/src/app/workspace/collections/[collectionId]/settings/collection-vector-backend-card.tsx``: new component that surfaces backend identity + capability matrix in the collection settings page (above the edit form). dongdong picks up rendering polish (responsive + dark mode + final copy) on the same PR per the joint A4-style split (cuiwenbo contract layer + dongdong rendering polish + CR pair). - ``web/src/i18n/{en-US,zh-CN}/page_collections.json``: copy strings. - ``web/src/api-v2/schema.d.ts`` regenerated via ``yarn api:v2:types``. Local verification: - ``uv run --extra test pytest tests/unit_test/contracts/test_vector_backend_capability_matrix.py tests/unit_test/contracts/test_collection_v2_openapi_contract.py -q`` → 23 passed - ``make openapi-check`` → ok - ``yarn type-check --pretty false`` → 0 new errors on this PR's files (pre-existing graph-lab cosmograph + agent-runtime errors unchanged) - ``yarn lint --quiet`` → 0 warnings/errors - ``yarn i18n:check`` → ok - ``git diff --check`` → ok Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(collection): task #87 P1-D3 — convert vector_backend to computed_field Per dongdong msg=fa88e97b BLOCKER + huangzhangshu msg=5b7cba0f / msg=ee6e7af2 + Weston msg=057f642c re-final framing verify gate + PM msg=03c821b0 fix-forward direction lock: the previous regular-field ``Optional[VectorBackendInfo]`` implementation leaked the deployment projection onto every input shape that referenced ``Collection``, including ``Collection-Input`` itself, ``Agent-Input.collections``, and ``CreateTurnRequest.collections``. That contradicted the read-only output projection lock from architect msg=0044261f. Move ``Collection.vector_backend`` to a Pydantic v2 ``@computed_field`` property so OpenAPI input/output schemas auto-split: - ``Collection-Output`` now lists ``vector_backend`` with ``readonly: true`` (verified in regenerated ``web/src/api-v2/schema.d.ts``). - ``Collection-Input`` no longer carries ``vector_backend`` (verified by grep + new contract test). - ``CollectionCreate`` / ``CollectionUpdate`` / ``Agent-Input.collections`` / ``CreateTurnRequest.collections`` all inherit the cleaned ``Collection-Input``, so the deployment-wide setting can no longer be passed as a per-collection override on agent / chat-turn requests. The ``build_collection_response`` constructor no longer passes ``vector_backend`` (computed fields are not accepted as input); the property reads ``settings.vector_db_type`` lazily on each serialization. Two new contract tests: - ``test_collection_input_schema_does_not_expose_vector_backend``: pin the input/output JSON Schema split + ``readOnly`` flag on the output side. Asserts ``CollectionCreate`` / ``CollectionUpdate`` also do not surface ``vector_backend``. - ``test_collection_constructor_ignores_vector_backend_input``: defensive — even if a malicious caller stuffs ``vector_backend`` into a ``model_validate`` payload, Pydantic ignores it and the computed property still reflects the deployment setting. Sediment: cuiwenbo own-up CR miss — implement-time only verified the ``CollectionConfig`` placement (one defense layer) and missed the ``Collection`` self-reuse-as-input second layer. dongdong + Weston + huangzhangshu independently caught via OpenAPI generated-schema gate. mini-pattern 19 layer 5 candidate: "Pydantic schema placement verify must grep ``references Collection`` to catch input/output reuse risk, not only direct form-input shape" (continuing the trust-framing-miss family from PR #1935 / #1938 / #1940). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(test): consolidate vector_backend_capability_matrix imports for ruff Combine the two from aperag.schema.common import ... statements into a single block so ruff's import organization rule is satisfied. No code-behavior change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(test): apply ruff format to vector_backend test + common.py Run `uv run ruff format` on ApeRAG/aperag/schema/common.py and ApeRAG/tests/unit_test/contracts/test_vector_backend_capability_matrix.py so `make lint` (`ruff format --check`) passes. Pure formatting; no behavior change. Other unrelated files reverted to keep this PR scope clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Wave 5 description-NULL invariant fix for graph_curation: the dedup
detection / scoring / snapshot / accept-apply paths still read
entity.description/compacted_description/description_partseven though Wave 5 task #5 (facts/vectors split) stopped emitting
description text. Reading these fields would either silently degrade
scoring (always-empty bag-of-tokens) or leak stale fragments from
pre-Wave-5 rows into reviewer-facing suggestions.
This is task #77 (Phase A3) under the task #31 umbrella, per
spec
docs/zh-CN/architecture/task-31-graph-node-merge-spec-v1.md§ 3.1.5.Scope
6 detector / snapshot call sites + 1 apply-path variant (spec
§ 3.1.5 enumeration) + 1 service-layer helper surfaced by the
boundary test grep gate:
<name> (<entity_type>)mirror of how graph_vectors worker writes the entity vector (Wave 5 task #5 / #7)entity.description or entity.name→entity.name(description always "" post Wave 5, fallback was a no-op)CurationEntity.descriptionfield is kept on the dataclass (default\"\") for back-compat with existing callers; the boundary test gates the read surface, not the field declaration.lineage_merge.merge_entities(legacy description-bearing variant) is preserved unchanged for manual sync handle_action API back-compat per spec § 3.1.5 («老 lineage_merge.py 路径保留作 manual API 兼容,标 deprecation Lesson #14 multi-iteration cleanup follow-up»).Boundary gate
tests/boundaries/test_graph_curation_description_free.py(new) — 4 AST-level assertions per spec § 5.2.a:test_graph_curation_modules_do_not_read_entity_description— full scopeaperag/graph_curation/**(excl.lineage_merge.pylegacy variant +dto.pyfield declaration)test_merge_candidate_detector_does_not_read_entity_description—aperag/indexing/merge_candidate_detector.pytest_lineage_merge_apply_description_free_does_not_read_entity_description— the new variant body must not read descriptiontest_lineage_merge_apply_description_free_does_not_call_llm_or_compactor— the new variant body must not invoke LLM unified description / Compactor pass / vector embed write per spec «不调» listAllowlist:
Lesson family applied
Test plan
Risk
0 production behavior change for legacy sync API:
0 schema change — CurationEntity.description field kept with default ""; no migrations.
CR
Not blocking
🤖 Generated with Claude Code