feat(graph-extractor): prompt v2 — 7 hard requirements + window-aware parser (task #30-A3) #1920

Merged: earayu merged 1 commit into main from bryce/task-30-a3-prompt-v2 on Apr 29, 2026

@earayu earayu commented Apr 29, 2026

Summary

Task #30-A3 / sub-task #53. Implements the prompt v2 contract from docs/zh-CN/architecture/task-30-graph-chunk-window-spec-v1.md §3.1.3 — every entity / relation now carries a non-empty source_chunk_ids list, the prompt prepends a [[chunk_id=<id> index=<n>]] boundary marker per chunk, and the parser validates returned ids against the window allowlist.
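
To make the marker format concrete, here is a minimal sketch of how one marker-prefixed block per chunk might be rendered. The helper name `render_window_block` is hypothetical and is not the repository's `render_extraction_prompt`; the `{chunk_id, text}` chunk shape follows the later benchmark commit, while the zero-based `index` and the choice to raise on an empty window are assumptions.

```python
from typing import Mapping, Sequence


def render_window_block(window_chunks: Sequence[Mapping[str, str]]) -> str:
    """Render one marker-prefixed block per chunk (hypothetical helper)."""
    if not window_chunks:
        # Assumption: an empty window has no provenance targets, so refuse it.
        raise ValueError("window_chunks must not be empty")
    blocks = []
    for index, chunk in enumerate(window_chunks):
        # Assumption: index is zero-based and follows window order.
        marker = f"[[chunk_id={chunk['chunk_id']} index={index}]]"
        blocks.append(f"{marker}\n{chunk['text']}")
    return "\n\n".join(blocks)


window = [
    {"chunk_id": "c-1", "text": "Alice founded Acme."},
    {"chunk_id": "c-2", "text": "Acme is based in Berlin."},
]
print(render_window_block(window))
```

Because every chunk carries its own marker, the model can cite the ids it actually saw, and the parser can check those ids against the same list.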

7 hard requirements landed

  1. Per-chunk boundary markers in the rendered prompt.
  2. source_chunk_ids REQUIRED on every entity + relation in the JSON schema.
  3. Cross-chunk relations encouraged in the prompt rules.
  4. Dedup / canonical-name guidance in the prompt rules.
  5. Fail-safe / no-fabrication clause.
  6. Cap rules now talk about per-window output (the max_entities and max_relations numbers are still rendered as-is from caller args; @huangheng's task #30-A2 / task #54 fills in the co-scale formula).
  7. Graph extractor still passes response_format={"type": "json_object"} to the LLM callable (kept verbatim from task #14 / PR #1877).

Implementation

  • aperag/indexing/llm.py: ENTITY_RELATION_EXTRACTION rewritten with the v2 rules; render_extraction_prompt now takes window_chunks instead of a single input_text and renders one [[chunk_id=...]] block per chunk. New few_shot_locale opt-in (default None) pulls from a zh / cross_chunk example library for collection configs that want a non-English or cross-chunk hint without paying the prompt-token tax by default.
  • aperag/indexing/graph_extractor.py: new _extract_one_window taking a sequence of chunks; the legacy _extract_one_chunk is kept as a thin compat shim that wraps a single-chunk window, so @ziang's task #30-A1 dispatcher (PR #1918) and existing callers stay green until A1 lands. Parser rewired: _parse_extraction_response now takes allowed_chunk_ids, and the new _resolve_source_chunk_ids enforces the spec invariant: a single-chunk window falls back to the lone allowed id when the field is missing, a multi-chunk window requires the LLM to populate it (otherwise the record is skipped with a warning), and any out-of-allowlist id is silently dropped before the non-empty check so hallucinated chunk_ids do not poison provenance.
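
The provenance rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `resolve_source_chunk_ids` and its signature are hypothetical, the record is assumed to be a plain dict, and the fallback for a single-chunk window is also applied when every returned id is filtered out (the description only covers the missing-field case).

```python
import logging
from typing import Any, Optional, Sequence

logger = logging.getLogger(__name__)


def resolve_source_chunk_ids(
    record: dict[str, Any], allowed_chunk_ids: Sequence[str]
) -> Optional[list[str]]:
    """Return validated provenance ids, or None when the record is skipped."""
    allowed = set(allowed_chunk_ids)
    raw = record.get("source_chunk_ids")
    if not isinstance(raw, list):
        raw = []
    # Drop out-of-allowlist ids before the non-empty check, keeping order
    # and removing repeats.
    kept: list[str] = []
    for cid in raw:
        if isinstance(cid, str) and cid in allowed and cid not in kept:
            kept.append(cid)
    if kept:
        return kept
    if len(allowed_chunk_ids) == 1:
        # Single-chunk window: provenance is unambiguous, so fall back.
        return [allowed_chunk_ids[0]]
    # Multi-chunk window with no usable ids: skip the record and warn.
    logger.warning("skipping record without valid source_chunk_ids: %r",
                   record.get("name"))
    return None


print(resolve_source_chunk_ids({"name": "Acme"}, ["c-1"]))
print(resolve_source_chunk_ids(
    {"name": "Acme", "source_chunk_ids": ["c-2", "c-9"]}, ["c-1", "c-2"]))
```

Filtering before the emptiness check matters: a record whose only ids are hallucinated is treated like a record with no ids at all, rather than being kept with bad provenance.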

Tests

  • Rewire every _parse_extraction_response call to the new allowed_chunk_ids keyword.
  • Update the rendered-prompt smoke test to the new window_chunks signature.
  • New tests: per-chunk markers, few-shot opt-in, empty-window guard, single-chunk fallback, multi-chunk required-provenance branch, out-of-allowlist drop, cross-chunk provenance preservation.
  • The build_collection_llm_callable monkeypatch stubs now accept **_kwargs so the patched lambdas tolerate the response_format keyword that build_collection_graph_extractor has been passing since task #14 (these stubs were silently accumulating an arity mismatch).
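
The stub change in the last bullet relies on ordinary Python keyword handling. The sketch below shows the general mechanism only; `call_llm_like_extractor` and both stubs are hypothetical stand-ins, not the test suite's actual fixtures.

```python
def call_llm_like_extractor(llm_callable, prompt: str) -> str:
    """Mimic a caller that always forwards response_format (hypothetical)."""
    return llm_callable(prompt, response_format={"type": "json_object"})


# A stub written before the keyword existed: it rejects the extra argument.
strict_stub = lambda prompt: '{"entities": [], "relations": []}'

# The tolerant form: unknown keywords are absorbed by **_kwargs.
tolerant_stub = lambda prompt, **_kwargs: '{"entities": [], "relations": []}'

try:
    call_llm_like_extractor(strict_stub, "hello")
except TypeError as exc:
    print(f"strict stub failed: {exc}")

print(call_llm_like_extractor(tolerant_stub, "hello"))
```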

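Out of scope

  • Window assembler + graph_extraction_window_size config: task #30-A1 / @ziang PR #1918.
  • MAX_ENTITIES / MAX_RELATIONS / TIMEOUT / BOOTSTRAP / MAX_PROMPT_TOKENS co-scale: task #30-A2 / @huangheng task #54.
  • Frontend / generated schema changes for the new collection config: covered by task #30-A1.
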
Merge order

PM @不穷 dispatched A1 → A3 → A2 (msg=fadcd64c). This PR (A3) will be rebased onto A1 after PR #1918 lands; A2 then folds the co-scale formula into the prompt's max_entities / max_relations placeholders.

Test plan

  • uvx ruff@0.15.12 check aperag/ tests/ — pass
  • uvx ruff@0.15.12 format --check aperag/ tests/ — pass
  • uv run --extra test python -m pytest tests/integration/test_graph_extractor.py -q — 36 passed
  • Full CI lint-and-unit + e2e-http-smoke + e2e-http-provider (after rebase onto A1)

🤖 Generated with Claude Code

@earayu earayu force-pushed the bryce/task-30-a3-prompt-v2 branch from 58f2480 to 444fb88 Compare April 29, 2026 21:02
… parser

Task #30-A3 / sub-task #53. Implements the prompt v2 contract from
``docs/zh-CN/architecture/task-30-graph-chunk-window-spec-v1.md``
§3.1.3 — every entity / relation now carries a non-empty
``source_chunk_ids`` list, the prompt prepends a
``[[chunk_id=<id> index=<n>]]`` boundary marker per chunk, and the
parser validates returned ids against the window allowlist.

The 7 hard requirements all land in this PR:
1. Per-chunk boundary markers in the rendered prompt.
2. ``source_chunk_ids`` REQUIRED on every entity + relation.
3. Cross-chunk relations encouraged in the prompt rules.
4. Dedup / canonical-name guidance in the prompt rules.
5. Fail-safe / no-fabrication clause.
6. Cap rules now talk about per-window output (the ``max_entities``
   and ``max_relations`` placeholders are filled by huangheng's A2 co-
   scale change once that PR lands; this PR just stops calling them
   ``per chunk`` in the rule text).
7. Graph extractor still passes ``response_format={"type":
   "json_object"}`` to the LLM callable (kept verbatim from the task
   #14 / PR #1877 setup).

Implementation
- ``aperag/indexing/llm.py``: rewrite ``ENTITY_RELATION_EXTRACTION``
  with the v2 rules + the ``source_chunk_ids`` schema; rename the
  former single-chunk ``input_text`` parameter to a ``window_chunks``
  sequence and emit one ``[[chunk_id=...]]`` block per chunk; render
  ``allowed_chunk_ids`` so the LLM has the explicit allowlist;
  expose a ``few_shot_locale`` opt-in (default ``None``) that pulls
  from a small ``zh`` / ``cross_chunk`` example library for collection
  configs that want a non-English / cross-chunk hint without paying
  the prompt-token tax by default.
- ``aperag/indexing/graph_extractor.py``: introduce
  ``_extract_one_window`` taking a sequence of chunks; keep
  ``_extract_one_chunk`` as a thin compat shim that wraps a
  single-chunk window so the ziang #1918 dispatcher and existing
  callers stay green until A1 lands. Rework the parser:
  ``_parse_extraction_response`` now takes ``allowed_chunk_ids``
  instead of a single ``chunk_id``, and the new
  ``_resolve_source_chunk_ids`` enforces the spec invariant —
  single-chunk window falls back to the lone allowed id when the
  field is missing, multi-chunk window requires the LLM to populate
  it (otherwise the record is skipped + warned), and any out-of-
  allowlist id is silently dropped before the non-empty check so
  hallucinated chunk_ids do not poison provenance.
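
The compat-shim idea can be illustrated with a few lines. This is a sketch with hypothetical names and types (`make_single_chunk_shim`, `Chunk`, the fake extractor), assuming the window extractor accepts a sequence of chunk mappings; the real methods live on the graph extractor and have their own signatures.

```python
from typing import Callable, Mapping, Sequence

Chunk = Mapping[str, str]
WindowExtractor = Callable[[Sequence[Chunk]], dict]


def make_single_chunk_shim(extract_one_window: WindowExtractor) -> Callable[[Chunk], dict]:
    """Build a legacy-style single-chunk entry point on top of the window API."""

    def extract_one_chunk(chunk: Chunk) -> dict:
        # Wrap the lone chunk in a one-element window and delegate.
        return extract_one_window([chunk])

    return extract_one_chunk


def fake_extract_one_window(window_chunks: Sequence[Chunk]) -> dict:
    # Stand-in extractor: only reports which chunk ids it was given.
    return {"allowed_chunk_ids": [c["chunk_id"] for c in window_chunks]}


extract_one_chunk = make_single_chunk_shim(fake_extract_one_window)
print(extract_one_chunk({"chunk_id": "c-1", "text": "Alice founded Acme."}))
```

Keeping the old entry point as a wrapper means callers that still pass one chunk get the single-chunk fallback path without any change on their side.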

Tests
- ``tests/integration/test_graph_extractor.py``: rewire every
  ``_parse_extraction_response`` call to the new
  ``allowed_chunk_ids`` keyword, update the rendered-prompt smoke
  test to the new ``window_chunks`` signature, add new tests for
  per-chunk markers, the few-shot opt-in, the empty-window guard,
  the single-chunk fallback, the multi-chunk required-provenance
  branch, the out-of-allowlist drop, and the cross-chunk provenance
  preservation. The ``build_collection_llm_callable`` monkeypatch
  stubs now accept ``**_kwargs`` so the patched lambdas tolerate the
  ``response_format`` keyword that ``build_collection_graph_extractor``
  has been passing since task #14 (these stubs were silently
  accumulating an arity mismatch).

Out of scope (matches the spec dispatch)
- Window assembler + ``graph_extraction_window_size`` config — task
  #30-A1 / @ziang PR #1918.
- ``MAX_ENTITIES`` / ``MAX_RELATIONS`` / ``TIMEOUT`` / ``BOOTSTRAP``
  / ``MAX_PROMPT_TOKENS`` co-scale — task #30-A2 / @huangheng task
  #54.
- Frontend / generated schema changes for the new collection config
  — task #30-A1 covers them.

Test plan
- ``uvx ruff@0.15.12 check aperag/ tests/`` — pass
- ``uvx ruff@0.15.12 format --check aperag/ tests/`` — pass
- ``uv run --extra test python -m pytest tests/integration/test_graph_extractor.py -q`` — 36 passed

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@earayu earayu force-pushed the bryce/task-30-a3-prompt-v2 branch from 444fb88 to 70a9b25 Compare April 29, 2026 21:03
@earayu earayu merged commit 01b4519 into main Apr 29, 2026
7 checks passed
@earayu earayu deleted the bryce/task-30-a3-prompt-v2 branch April 29, 2026 21:03
earayu added a commit that referenced this pull request Apr 29, 2026
#1921)

Pre-existing CI flake waived per ci-flake-policy.md § 2.2:

- Run id: 25134494472 / Job id: 73669566493
- Shape: E2E HTTP Qdrant + Nebula (single shape; Lite ✅ + Neo4j ✅)
- Signature exact match: aperag/domains/agent_runtime/runtime.py:1056 ValidationException: Model specification is required for agent runtime v3
- PR diff zero intersection with §2.2 forbidden list: changes only graph_extractor.py + KnowledgeGraphConfig schema + boundary tests + regenerated schema.d.ts; agent_runtime / model_platform / mcp / indexing-worker DI / e2e-http workflow / deploy/aperag all untouched

All other gates green: lint-and-unit ✅, smoke 3/3 ✅, preflight 3/3 ✅. All five CR lanes are in (Bryce A3 fold-in / Weston arch re-CR / ziang impl re-CR / 符炫炜 ratify / huangheng author).

task #30 Phase A is fully closed out (A1 #1918 / A3 #1920 / A2 this PR).
earayu added a commit that referenced this pull request Apr 29, 2026
…rics (#1923)

* feat(benchmark): task #30 B1 — chunk-window matrix + per-document metrics

Extend tests/benchmarks/graph_extraction harness for the task #30 graph
chunk window benchmark (spec § 6.3, dispatched in msg=cecae5ed):

* Fix `render_extraction_prompt` API drift — the harness was still on
  the legacy `input_text=...` signature; PR #1918/#1920/#1921 (task #30
  Phase A) moved it to `window_chunks=[{chunk_id, text}]`. Without this
  fix B2 (Planetegg msg=9489efdb) cannot run.

* Add `--chunk-window-size N` for single-shape runs and `--matrix
  N1,N2,...` for batch sweeps (the two are mutually exclusive). Sample
  text is split into `--pseudo-chunks-per-doc` (default 4) pseudo-chunks
  and grouped into non-overlapping windows of size N, mirroring the
  production `_GraphChunkWindow` shape (PR #1918).

* Aggregate **per-document** the 7 metrics required by spec § 6.3 +
  Planetegg msg=ea7efa7b: `llm_call_count`, `input_tokens_total`,
  `output_tokens_total`, `wall_time_s`, `timeout_or_failure_count`,
  entity+relation totals + duplicate counts, and the new
  `source_chunk_ids_valid` / `source_chunk_ids_total` provenance check
  (task #30 §3.1.3 hard requirement #2).

* `--dry-run` produces a placeholder schema for B2 to verify ingestion
  before paying provider cost (per Planetegg msg=cbe84223).

* `test_runner_units.py` pins the harness structural pieces (chunking,
  windowing, validity counting, per-document aggregation) so the
  matrix output schema stays contract-stable even though the benchmark
  itself runs out-of-CI.
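
The chunking and windowing step described above might look like the following. It is a sketch under assumptions: the function names are hypothetical, and splitting by character count is a guess, since the commit message does not say how the harness cuts sample text into pseudo-chunks.

```python
def split_into_pseudo_chunks(text: str, chunks_per_doc: int = 4) -> list[str]:
    """Split text into at most chunks_per_doc roughly equal pieces."""
    if chunks_per_doc < 1:
        raise ValueError("chunks_per_doc must be >= 1")
    if not text:
        return []
    size = -(-len(text) // chunks_per_doc)  # ceiling division
    return [text[i:i + size] for i in range(0, len(text), size)]


def group_into_windows(chunks: list[str], window_size: int) -> list[list[str]]:
    """Group chunks into consecutive, non-overlapping windows."""
    if window_size < 1:
        raise ValueError("window_size must be >= 1")
    return [chunks[i:i + window_size] for i in range(0, len(chunks), window_size)]


chunks = split_into_pseudo_chunks("abcdefghij", chunks_per_doc=4)
print(chunks)
print(group_into_windows(chunks, window_size=3))
```

With four pseudo-chunks, a window size of 1 yields four LLM calls per document, a window size of 4 yields one, and sizes that do not divide evenly leave a shorter final window.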

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(benchmark): task #30 B1 — window-scoped source_chunk_ids + JSON-mode default ON

Two BLOCKER fixes per @ziang msg=56912dae + @huangzhangshu msg=cda4dc75:

* **BLOCKER 1** — `source_chunk_ids` validity is now strictly
  window-scoped: `run_window` computes `source_chunk_ids_valid` /
  `source_chunk_ids_total` against that single window's
  `allowed_chunk_ids`, and `aggregate_sample` only sums per-window
  counters. Previously the per-document union check let a record
  produced in window-0 reference a chunk_id from window-1 and pass —
  violating the A3 parser invariant (source_chunk_ids ⊆ current
  window's chunk_ids, not document union). New unit test
  `test_aggregate_sample_source_chunk_ids_is_window_scoped_not_union`
  pins the cross-window pollution case.

* **BLOCKER 2** — `--response-format-json` is now ON by default with
  `--no-response-format-json` as the explicit opt-out. A3 PR #1920
  (`01b45196`) made `response_format=json_object` a graph extractor
  production invariant, but the legacy benchmark default was off, so
  B2 baseline would have measured `json_ok_rate` / parse failure /
  cost on the pre-A3 path. README updated to match.
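
The difference between window-scoped and document-union validity in BLOCKER 1 is easiest to see on a small example. The sketch below uses hypothetical names and data, and assumes the counters are per record (a record is valid when its ids are non-empty and all belong to its own window); the harness may count differently.

```python
from typing import Any, Mapping, Sequence


def count_valid_records(
    records: Sequence[Mapping[str, Any]], allowed_chunk_ids: Sequence[str]
) -> tuple[int, int]:
    """Return (valid, total) for one set of records against one allowlist."""
    allowed = set(allowed_chunk_ids)
    valid = 0
    for record in records:
        ids = record.get("source_chunk_ids") or []
        # Valid means: non-empty and fully contained in the allowlist.
        if ids and set(ids) <= allowed:
            valid += 1
    return valid, len(records)


windows = [
    {"allowed": ["c-0", "c-1"],
     "records": [{"source_chunk_ids": ["c-0"]}, {"source_chunk_ids": ["c-2"]}]},
    {"allowed": ["c-2", "c-3"],
     "records": [{"source_chunk_ids": ["c-3"]}]},
]

# Window-scoped: check each window against its own allowlist, then sum.
per_window = [count_valid_records(w["records"], w["allowed"]) for w in windows]
scoped_valid = sum(valid for valid, _ in per_window)
scoped_total = sum(total for _, total in per_window)

# Document union: the window-0 record citing c-2 slips through.
union_ids = [cid for w in windows for cid in w["allowed"]]
all_records = [r for w in windows for r in w["records"]]
union_valid, union_total = count_valid_records(all_records, union_ids)

print(scoped_valid, scoped_total)
print(union_valid, union_total)
```

Here the window-scoped check flags the record in window 0 that cites `c-2`, while the union check counts it as valid, which is the cross-window pollution case the new unit test pins.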

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>