feat(graph-extractor): prompt v2 — 7 hard requirements + window-aware parser (task #30-A3) by earayu · Pull Request #1920 · apecloud/ApeRAG

earayu · 2026-04-29T20:56:03Z

Summary

Task #30-A3 / sub-task #53. Implements the prompt v2 contract from docs/zh-CN/architecture/task-30-graph-chunk-window-spec-v1.md §3.1.3 — every entity / relation now carries a non-empty source_chunk_ids list, the prompt prepends a [[chunk_id=<id> index=<n>]] boundary marker per chunk, and the parser validates returned ids against the window allowlist.

7 hard requirements landed

Per-chunk boundary markers in the rendered prompt.
source_chunk_ids REQUIRED on every entity + relation in the JSON schema.
Cross-chunk relations encouraged in the prompt rules.
Dedup / canonical-name guidance in the prompt rules.
Fail-safe / no-fabrication clause.
Cap rules now talk about per-window output (the max_entities and max_relations numbers are still rendered as-is from caller args; @huangheng's task #30-A2 / task #54 fills in the co-scale formula).
Graph extractor still passes response_format={"type": "json_object"} to the LLM callable (kept verbatim from task #14 / PR #1877).
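
To make requirement 1 concrete, here is a minimal sketch of how a window could be rendered with one boundary marker per chunk. It is not the code in aperag/indexing/llm.py: the helper name render_window_block, the 0-based index, and raising ValueError on an empty window are all assumptions for illustration. Only the marker format and the {chunk_id, text} chunk shape come from this PR and its follow-ups.

```python
from typing import Mapping, Sequence


def render_window_block(window_chunks: Sequence[Mapping[str, str]]) -> str:
    """Render one boundary-marked block per chunk, in window order."""
    if not window_chunks:
        # Assumed behaviour for the empty-window guard: fail loudly.
        raise ValueError("window_chunks must contain at least one chunk")
    blocks = []
    for index, chunk in enumerate(window_chunks):  # 0-based index is an assumption
        marker = f"[[chunk_id={chunk['chunk_id']} index={index}]]"
        blocks.append(f"{marker}\n{chunk['text']}")
    return "\n\n".join(blocks)


window = [
    {"chunk_id": "c-1", "text": "Alice founded Acme."},
    {"chunk_id": "c-2", "text": "Acme is based in Berlin."},
]
rendered = render_window_block(window)
print(rendered)
```

With markers like these in the prompt, the model can cite c-1 and c-2 in source_chunk_ids, and a relation spanning both chunks can list both ids.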

Implementation

aperag/indexing/llm.py — ENTITY_RELATION_EXTRACTION rewritten with the v2 rules; render_extraction_prompt now takes window_chunks instead of a single input_text and renders one [[chunk_id=...]] block per chunk. New few_shot_locale opt-in (default None) pulls from a zh / cross_chunk example library for collection configs that want a non-English or cross-chunk hint without paying the prompt-token tax by default.
aperag/indexing/graph_extractor.py: new _extract_one_window taking a sequence of chunks; the legacy _extract_one_chunk is kept as a thin compat shim that wraps a single-chunk window, so @ziang's task #30-A1 dispatcher (PR #1918) and existing callers stay green until A1 lands. Parser rewired: _parse_extraction_response now takes allowed_chunk_ids, and the new _resolve_source_chunk_ids enforces the spec invariant. A single-chunk window falls back to the lone allowed id when the field is missing. A multi-chunk window requires the LLM to populate the field; otherwise the record is skipped with a warning. Any out-of-allowlist id is silently dropped before the non-empty check, so hallucinated chunk_ids do not poison provenance.
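
The provenance rules above can be sketched roughly as follows. This is an illustrative reimplementation, not the actual _resolve_source_chunk_ids: the function name, the None return for skipped records, the logger name, and treating an empty list the same as a missing field are assumptions.

```python
import logging
from typing import Any, List, Mapping, Optional, Sequence

logger = logging.getLogger("graph_extractor_sketch")  # hypothetical logger name


def resolve_source_chunk_ids(
    record: Mapping[str, Any], allowed_chunk_ids: Sequence[str]
) -> Optional[List[str]]:
    """Return validated provenance ids, or None when the record should be skipped."""
    raw = record.get("source_chunk_ids")
    if not isinstance(raw, list) or not raw:
        # Field missing (assumption: an empty list counts as missing too).
        # Only a single-chunk window has an unambiguous fallback.
        if len(allowed_chunk_ids) == 1:
            return [allowed_chunk_ids[0]]
        logger.warning("skipping record without source_chunk_ids: %r", record)
        return None
    allowed = set(allowed_chunk_ids)
    kept: List[str] = []
    for chunk_id in raw:
        # Drop ids outside the window allowlist before the non-empty check.
        if isinstance(chunk_id, str) and chunk_id in allowed and chunk_id not in kept:
            kept.append(chunk_id)
    if not kept:
        logger.warning("skipping record with no in-window chunk ids: %r", record)
        return None
    return kept


print(resolve_source_chunk_ids({"name": "Acme"}, ["c-1"]))
print(resolve_source_chunk_ids({"source_chunk_ids": ["c-1", "c-9"]}, ["c-1", "c-2"]))
```

In this sketch a record whose ids are all outside the allowlist is skipped even in a single-chunk window; whether the real parser falls back in that case is not stated in the PR text.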

Tests

Rewire every _parse_extraction_response call to the new allowed_chunk_ids keyword.
Update the rendered-prompt smoke test to the new window_chunks signature.
New tests: per-chunk markers, few-shot opt-in, empty-window guard, single-chunk fallback, multi-chunk required-provenance branch, out-of-allowlist drop, cross-chunk provenance preservation.
The build_collection_llm_callable monkeypatch stubs now accept **_kwargs so the patched lambdas tolerate the response_format keyword that build_collection_graph_extractor has been passing since task #14 (these stubs were silently accumulating an arity mismatch).
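
The stub arity issue is easy to reproduce in isolation. The snippet below is a hedged illustration only: call_with_json_mode is a hypothetical stand-in for however the extractor invokes its LLM callable, and the stub return values are made up.

```python
def call_with_json_mode(llm_callable, prompt):
    # Assumed call shape: the extractor forwards response_format as a keyword.
    return llm_callable(prompt, response_format={"type": "json_object"})


# Old-style stub: accepts only the prompt, so the extra keyword is rejected.
strict_stub = lambda prompt: '{"entities": [], "relations": []}'

# New-style stub: swallows any extra keywords such as response_format.
tolerant_stub = lambda prompt, **_kwargs: '{"entities": [], "relations": []}'

try:
    call_with_json_mode(strict_stub, "extract")
    strict_failed = False
except TypeError:
    strict_failed = True

print("strict stub raised TypeError:", strict_failed)
print("tolerant stub returned:", call_with_json_mode(tolerant_stub, "extract"))
```

Accepting **_kwargs keeps the stubs forward-compatible if the extractor later passes further keywords.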

Out of scope

Window assembler + graph_extraction_window_size config: task #30-A1 / @ziang PR #1918.
MAX_ENTITIES / MAX_RELATIONS / TIMEOUT / BOOTSTRAP / MAX_PROMPT_TOKENS co-scale: task #30-A2 / @huangheng task #54.
Frontend / generated schema for the new collection config: covered by A1.

Merge order

PM @不穷 dispatched A1 → A3 → A2 (msg=fadcd64c). This PR (A3) will be rebased onto A1 after PR #1918 lands; A2 then folds the co-scale formula into the prompt's max_entities / max_relations placeholders.

Test plan

uvx ruff@0.15.12 check aperag/ tests/ — pass
uvx ruff@0.15.12 format --check aperag/ tests/ — pass
uv run --extra test python -m pytest tests/integration/test_graph_extractor.py -q — 36 passed
Full CI lint-and-unit + e2e-http-smoke + e2e-http-provider (after rebase onto A1)

🤖 Generated with Claude Code

@ziang

… parser

Task #30-A3 / sub-task #53. Implements the prompt v2 contract from ``docs/zh-CN/architecture/task-30-graph-chunk-window-spec-v1.md`` §3.1.3: every entity / relation now carries a non-empty ``source_chunk_ids`` list, the prompt prepends a ``[[chunk_id=<id> index=<n>]]`` boundary marker per chunk, and the parser validates returned ids against the window allowlist.

The 7 hard requirements all land in this PR:

1. Per-chunk boundary markers in the rendered prompt.
2. ``source_chunk_ids`` REQUIRED on every entity + relation.
3. Cross-chunk relations encouraged in the prompt rules.
4. Dedup / canonical-name guidance in the prompt rules.
5. Fail-safe / no-fabrication clause.
6. Cap rules now talk about per-window output (the ``max_entities`` and ``max_relations`` placeholders are filled by huangheng's A2 co-scale change once that PR lands; this PR just stops calling them ``per chunk`` in the rule text).
7. Graph extractor still passes ``response_format={"type": "json_object"}`` to the LLM callable (kept verbatim from the task #14 / PR #1877 setup).

Implementation

- ``aperag/indexing/llm.py``: rewrite ``ENTITY_RELATION_EXTRACTION`` with the v2 rules + the ``source_chunk_ids`` schema; rename the former single-chunk ``input_text`` parameter to a ``window_chunks`` sequence and emit one ``[[chunk_id=...]]`` block per chunk; render ``allowed_chunk_ids`` so the LLM has the explicit allowlist; expose a ``few_shot_locale`` opt-in (default ``None``) that pulls from a small ``zh`` / ``cross_chunk`` example library for collection configs that want a non-English / cross-chunk hint without paying the prompt-token tax by default.
- ``aperag/indexing/graph_extractor.py``: introduce ``_extract_one_window`` taking a sequence of chunks; keep ``_extract_one_chunk`` as a thin compat shim that wraps a single-chunk window so the ziang #1918 dispatcher and existing callers stay green until A1 lands. Rework the parser: ``_parse_extraction_response`` now takes ``allowed_chunk_ids`` instead of a single ``chunk_id``, and the new ``_resolve_source_chunk_ids`` enforces the spec invariant: single-chunk window falls back to the lone allowed id when the field is missing, multi-chunk window requires the LLM to populate it (otherwise the record is skipped + warned), and any out-of-allowlist id is silently dropped before the non-empty check so hallucinated chunk_ids do not poison provenance.

Tests

- ``tests/integration/test_graph_extractor.py``: rewire every ``_parse_extraction_response`` call to the new ``allowed_chunk_ids`` keyword, update the rendered-prompt smoke test to the new ``window_chunks`` signature, add new tests for per-chunk markers, the few-shot opt-in, the empty-window guard, the single-chunk fallback, the multi-chunk required-provenance branch, the out-of-allowlist drop, and the cross-chunk provenance preservation. The ``build_collection_llm_callable`` monkeypatch stubs now accept ``**_kwargs`` so the patched lambdas tolerate the ``response_format`` keyword that ``build_collection_graph_extractor`` has been passing since task #14 (these stubs were silently accumulating an arity mismatch).

Out of scope (matches the spec dispatch)

- Window assembler + ``graph_extraction_window_size`` config: task #30-A1 / @ziang PR #1918.
- ``MAX_ENTITIES`` / ``MAX_RELATIONS`` / ``TIMEOUT`` / ``BOOTSTRAP`` / ``MAX_PROMPT_TOKENS`` co-scale: task #30-A2 / @huangheng task #54.
- Frontend / generated schema changes for the new collection config: task #30-A1 covers them.

Test plan

- ``uvx ruff@0.15.12 check aperag/ tests/``: pass
- ``uvx ruff@0.15.12 format --check aperag/ tests/``: pass
- ``uv run --extra test python -m pytest tests/integration/test_graph_extractor.py -q``: 36 passed

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

#1921)

Pre-existing CI flake waived per ci-flake-policy.md § 2.2:

- Run id: 25134494472 / Job id: 73669566493
- Shape: E2E HTTP Qdrant + Nebula (single shape; Lite ✅ + Neo4j ✅)
- Signature exact match: aperag/domains/agent_runtime/runtime.py:1056 ValidationException: Model specification is required for agent runtime v3
- PR diff has zero intersection with the §2.2 forbidden list: changes touch only graph_extractor.py + KnowledgeGraphConfig schema + boundary tests + regenerated schema.d.ts; agent_runtime / model_platform / mcp / indexing-worker DI / e2e-http workflow / deploy/aperag are all untouched

All other gates green: lint-and-unit ✅, smoke 3/3 ✅, preflight 3/3 ✅. All 5-lane code reviews collected (Bryce A3 fold-in / Weston arch re-CR / ziang impl re-CR / 符炫炜 ratify / huangheng author). Task #30 Phase A is fully closed (A1 #1918 / A3 #1920 / A2 this PR).

@ziang

…ode default ON

Two BLOCKER fixes per @ziang msg=56912dae + @huangzhangshu msg=cda4dc75:

* **BLOCKER 1**: `source_chunk_ids` validity is now strictly window-scoped: `run_window` computes `source_chunk_ids_valid` / `source_chunk_ids_total` against that single window's `allowed_chunk_ids`, and `aggregate_sample` only sums per-window counters. Previously the per-document union check let a record produced in window-0 reference a chunk_id from window-1 and pass, violating the A3 parser invariant (source_chunk_ids ⊆ current window's chunk_ids, not document union). New unit test `test_aggregate_sample_source_chunk_ids_is_window_scoped_not_union` pins the cross-window pollution case.
* **BLOCKER 2**: `--response-format-json` is now ON by default with `--no-response-format-json` as the explicit opt-out. A3 PR #1920 (`01b45196`) made `response_format=json_object` a graph extractor production invariant, but the legacy benchmark default was off, so B2 baseline would have measured `json_ok_rate` / parse failure / cost on the pre-A3 path. README updated to match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
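
The difference between the window-scoped check and the old document-union check in BLOCKER 1 can be shown with a small sketch. This is not the harness's `run_window` / `aggregate_sample` code: the helper name, the record shape, and counting one unit per referenced id are assumptions.

```python
from typing import Any, Iterable, List, Mapping, Sequence, Tuple


def count_source_chunk_ids(
    records: Iterable[Mapping[str, Any]], allowed_chunk_ids: Sequence[str]
) -> Tuple[int, int]:
    """Return (valid, total) counts of referenced chunk ids against one allowlist."""
    allowed = set(allowed_chunk_ids)
    valid = total = 0
    for record in records:
        for chunk_id in record.get("source_chunk_ids", []):
            total += 1
            valid += chunk_id in allowed  # bool adds as 0 or 1
    return valid, total


# Two windows of one document: (allowed ids, records extracted from that window).
windows: List[Tuple[List[str], List[Mapping[str, Any]]]] = [
    (["c-0", "c-1"], [{"source_chunk_ids": ["c-0", "c-2"]}]),  # c-2 is from window 1
    (["c-2", "c-3"], [{"source_chunk_ids": ["c-3"]}]),
]

# Window-scoped: validate each window against its own allowlist, then sum.
per_window = [count_source_chunk_ids(recs, ids) for ids, recs in windows]
scoped_valid = sum(v for v, _ in per_window)
scoped_total = sum(t for _, t in per_window)

# Document union (the old behaviour): the cross-window reference slips through.
union_ids = [cid for ids, _ in windows for cid in ids]
all_records = [rec for _, recs in windows for rec in recs]
union_valid, union_total = count_source_chunk_ids(all_records, union_ids)

print(scoped_valid, scoped_total)
print(union_valid, union_total)
```

Here the union check reports every id as valid, while the window-scoped check flags the window-0 record that cites a window-1 chunk.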

@ziang

…rics (#1923)

* feat(benchmark): task #30 B1 — chunk-window matrix + per-document metrics

  Extend tests/benchmarks/graph_extraction harness for the task #30 graph chunk window benchmark (spec § 6.3, dispatched in msg=cecae5ed):

  * Fix `render_extraction_prompt` API drift: the harness was still on the legacy `input_text=...` signature; PR #1918/#1920/#1921 (task #30 Phase A) moved it to `window_chunks=[{chunk_id, text}]`. Without this fix B2 (Planetegg msg=9489efdb) cannot run.
  * Add `--chunk-window-size N` for single-shape runs and `--matrix N1,N2,...` for batch sweeps (the two are mutually exclusive). Sample text is split into `--pseudo-chunks-per-doc` (default 4) pseudo-chunks and grouped into non-overlapping windows of size N, mirroring the production `_GraphChunkWindow` shape (PR #1918).
  * Aggregate **per-document** the 7 metrics required by spec § 6.3 + Planetegg msg=ea7efa7b: `llm_call_count`, `input_tokens_total`, `output_tokens_total`, `wall_time_s`, `timeout_or_failure_count`, entity+relation totals + duplicate counts, and the new `source_chunk_ids_valid` / `source_chunk_ids_total` provenance check (task #30 §3.1.3 hard requirement #2).
  * `--dry-run` produces a placeholder schema for B2 to verify ingestion before paying provider cost (per Planetegg msg=cbe84223).
  * `test_runner_units.py` pins the harness structural pieces (chunking, windowing, validity counting, per-document aggregation) so the matrix output schema stays contract-stable even though the benchmark itself runs out-of-CI.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(benchmark): task #30 B1 — window-scoped source_chunk_ids + JSON-mode default ON

  Two BLOCKER fixes per @ziang msg=56912dae + @huangzhangshu msg=cda4dc75:

  * **BLOCKER 1**: `source_chunk_ids` validity is now strictly window-scoped: `run_window` computes `source_chunk_ids_valid` / `source_chunk_ids_total` against that single window's `allowed_chunk_ids`, and `aggregate_sample` only sums per-window counters. Previously the per-document union check let a record produced in window-0 reference a chunk_id from window-1 and pass, violating the A3 parser invariant (source_chunk_ids ⊆ current window's chunk_ids, not document union). New unit test `test_aggregate_sample_source_chunk_ids_is_window_scoped_not_union` pins the cross-window pollution case.
  * **BLOCKER 2**: `--response-format-json` is now ON by default with `--no-response-format-json` as the explicit opt-out. A3 PR #1920 (`01b45196`) made `response_format=json_object` a graph extractor production invariant, but the legacy benchmark default was off, so B2 baseline would have measured `json_ok_rate` / parse failure / cost on the pre-A3 path. README updated to match.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
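
The CLI shape and windowing described in the B1 commit could look roughly like the sketch below. It uses only standard argparse features (a mutually exclusive group, and `BooleanOptionalAction` from Python 3.9+ for the paired `--response-format-json` / `--no-response-format-json` flags). The function names, the parser layout, and keeping a shorter trailing window are assumptions; the real harness may differ.

```python
import argparse
from typing import List, Sequence


def parse_matrix(value: str) -> List[int]:
    """Parse a comma-separated list such as '1,2,4' into window sizes."""
    return [int(part) for part in value.split(",")]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="graph extraction benchmark (sketch)")
    shape = parser.add_mutually_exclusive_group()  # single shape XOR sweep
    shape.add_argument("--chunk-window-size", type=int)
    shape.add_argument("--matrix", type=parse_matrix)
    parser.add_argument("--pseudo-chunks-per-doc", type=int, default=4)
    # Generates both --response-format-json and --no-response-format-json.
    parser.add_argument(
        "--response-format-json", action=argparse.BooleanOptionalAction, default=True
    )
    return parser


def group_into_windows(chunks: Sequence[str], size: int) -> List[List[str]]:
    """Group chunks into non-overlapping windows; the last one may be shorter."""
    if size < 1:
        raise ValueError("window size must be >= 1")
    return [list(chunks[i : i + size]) for i in range(0, len(chunks), size)]


args = build_parser().parse_args(["--matrix", "1,2,4"])
chunks = ["c-0", "c-1", "c-2", "c-3"]
for size in args.matrix:
    print(size, group_into_windows(chunks, size))
```

Passing both `--chunk-window-size` and `--matrix` makes argparse exit with a usage error, which matches the mutual exclusion the commit describes.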

earayu force-pushed the bryce/task-30-a3-prompt-v2 branch from 58f2480 to 444fb88 Compare April 29, 2026 21:02

earayu force-pushed the bryce/task-30-a3-prompt-v2 branch from 444fb88 to 70a9b25 Compare April 29, 2026 21:03

earayu merged commit 01b4519 into main Apr 29, 2026
7 checks passed

earayu deleted the bryce/task-30-a3-prompt-v2 branch April 29, 2026 21:03

earayu mentioned this pull request Apr 29, 2026

feat(graph-extraction): task #30 A2 — 5 const co-scale + boundary test #1921

Merged

4 tasks

earayu mentioned this pull request Apr 29, 2026

feat(benchmark): task #30 B1 — chunk-window matrix + per-document metrics #1923

Merged

6 tasks

earayu merged 1 commit into main from bryce/task-30-a3-prompt-v2
