Wave 11: dynamic graph entity types by earayu · Pull Request #1789 · apecloud/ApeRAG

earayu · 2026-04-28T09:34:02Z

Summary

Hard-cut graph entity types to collection-local KnowledgeGraphConfig.entity_types: list[str] with default [].
Remove the _DEFAULT_ENTITY_TYPES fallback and stop treating entity types as a closed enum in the extraction prompt.
Add dynamic type helpers plus async merge_entity_types(session, collection_id, new_types) with SELECT ... FOR UPDATE and first-write-wins dedupe.
Wire a per-document entity-type merge after kg.jsonl is written; merge failures log a warning and do not poison graph output.
Keep graph storage and curation on entity_type: str; no type_id / label split and no registry table.
Add the locked Chinese Wave 11 design doc under docs/modularization/.
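
The first-write-wins dedupe mentioned above can be illustrated with a small pure function. This is a sketch only: `merge_type_lists` and `ascii_fold` are hypothetical names, not the helpers in `aperag/indexing/entity_types.py`, and the trimming rule is an assumption.

```python
from typing import Iterable


def ascii_fold(name: str) -> str:
    """Lowercase ASCII letters only; non-ASCII text (e.g. CJK) is left as-is."""
    return "".join(c.lower() if c.isascii() else c for c in name)


def merge_type_lists(existing: Iterable[str], new_types: Iterable[str]) -> list[str]:
    """Append unseen types; the first spelling of an ASCII-case match wins."""
    merged: list[str] = []
    seen: set[str] = set()
    for raw in [*existing, *new_types]:
        name = raw.strip()  # assumption: surrounding whitespace is not significant
        key = ascii_fold(name)
        if not name or key in seen:
            continue  # skip blanks and later spellings of a known type
        seen.add(key)
        merged.append(name)
    return merged
```

With this sketch, an existing `Person` is kept when a later document proposes `person` or `PERSON`, while genuinely new strings are appended in arrival order.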

CR Fixes

Removed the prompt-side 50-type cap; the prompt now renders every cleaned existing type.
Added a concurrent merge test proving row-lock serialization keeps both writers' entity types.
Expanded this PR body with the §16 hard-gate sections below.

Spec Gates

Hard cut only: no old-data migration, no lazy seed, no backfill.
No DB schema change / alembic migration.
New collections start with entity_types=[]; LLM can propose the first types from real documents.
Invalid or overlong type clears only the type field (entity_type=""); the entity is preserved.
ASCII case dedupe is first-write-wins.
Prompt language comes from collection.config.language; entity names preserve source text.
Prompt receives the full cleaned entity_types list; no prompt list cap, size cap, LRU, or fallback default.
Per-document merge happens after graph derive writes kg.jsonl; no per-chunk or per-run merge.
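
The "clear only the type field" gate can be sketched as follows. The length limit, the `name` field, and the helper names are assumptions for illustration; the PR only states that `EntityRecord.entity_type` is a string and that a bad type becomes `""` while the entity survives.

```python
from dataclasses import dataclass, replace

MAX_TYPE_LENGTH = 64  # hypothetical limit; the PR does not state the real value


@dataclass(frozen=True)
class EntityRecord:
    name: str  # assumed field, for illustration only
    entity_type: str


def clean_entity_type(raw: object) -> str:
    """Return a normalized type string, or "" when the value is unusable."""
    if not isinstance(raw, str):
        return ""
    value = " ".join(raw.split())  # collapse runs of whitespace and newlines
    if not value or len(value) > MAX_TYPE_LENGTH:
        return ""
    return value


def sanitize(record: EntityRecord) -> EntityRecord:
    # Only the type field is cleared; the entity itself is never dropped.
    return replace(record, entity_type=clean_entity_type(record.entity_type))
```

Under these assumptions, an entity whose type is a 200-character string keeps its name and gets `entity_type=""`, so it still reaches graph output but contributes nothing to the collection's type list.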

12-Invariant Gate

EntityRecord.entity_type remains a string; no type_id field is introduced.
kg.jsonl entity/relation record shape remains compatible.
Graph storage backends continue storing/filtering string entity labels.
Graph curation continues using CurationEntity.entity_type string paths.
Wave 7 compacted entity/relation descriptions are untouched.
Wave 7 merge-candidate/vector sync hooks are untouched.
Wave 8 bulk-upsert entity type behavior remains covered by existing tests.
Tenant scope and parse-version artifact paths are unchanged.
Graph derive still writes kg.jsonl even if config merge fails.
Config merge never deletes or rewrites existing graph entities.
Empty or bad entity type values do not pollute collection config.
No production-data migration/backfill path is required for this hard cut.
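
The invariant that graph derive still writes kg.jsonl when the config merge fails can be pictured roughly like this. The function names, the record shape, and the injected `merge` callable are hypothetical; this is not the code in `aperag/indexing/graph.py`.

```python
import asyncio
import json
import logging
import tempfile
from pathlib import Path
from typing import Awaitable, Callable

logger = logging.getLogger("graph_derive_sketch")


async def derive_document(
    out_path: Path,
    records: list[dict],
    merge: Callable[[list[str]], Awaitable[None]],
) -> None:
    # 1. Write the graph artifact first so it never depends on the merge.
    with out_path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")

    # 2. Merge once per document; a failure is logged, not raised.
    new_types = [r["entity_type"] for r in records if r.get("entity_type")]
    try:
        await merge(new_types)
    except Exception:
        logger.warning("entity-type merge failed; kg.jsonl kept", exc_info=True)


async def failing_merge(new_types: list[str]) -> None:
    raise RuntimeError("simulated database outage")


def run_demo() -> list[str]:
    """Run one derive with a failing merge and return the written lines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "kg.jsonl"
        records = [{"name": "Aspirin", "entity_type": "Drug"}]
        asyncio.run(derive_document(path, records, failing_merge))
        return path.read_text(encoding="utf-8").splitlines()
```

In this sketch the artifact is flushed before the merge is attempted, which is one straightforward way to satisfy both "still writes kg.jsonl" and "never deletes or rewrites existing graph entities".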

Pattern Gate

Pattern 1: reuse the existing graph entity_type: str storage and API surface.
Pattern 2: bind graph output language to collection.config.language; no auto language mode.
Pattern 3: reuse existing KnowledgeGraphConfig.entity_types; no new registry table.
Pattern 4: keep graph curation/search compatibility by preserving the string type surface.

11 Mini-Pattern Gate

merge_entity_types() is async and takes AsyncSession.
Atomic merge path uses SELECT ... FOR UPDATE.
No sync wrapper and no optimistic-retry fallback.
Merge call granularity is per-document after kg.jsonl flush.
_DEFAULT_ENTITY_TYPES is removed and no longer a fallback.
Unknown/new LLM type strings are accepted.
Overlong/bad type clears only the type field; entity row is preserved.
ASCII case dedupe is first-write-wins.
Merge failure logs warning and is non-fatal/self-healing.
Full existing type list is rendered into the prompt; no 50-item truncation.
Concurrent merge behavior is covered by a serialization test.

Simple-Stable Guardrails

No scope expansion: no lifecycle/status/source/audit registry model.
Fastest safe path: one existing JSON config field and one merge helper.
Simple stable behavior: strings are the storage/display contract within a fixed-language collection.
Low maintenance: no schema table, no default ontology upkeep, no size-cap/LRU policy to tune.

Validation

uv run ruff check aperag/indexing/entity_types.py aperag/indexing/graph.py aperag/indexing/graph_extractor.py aperag/indexing/llm.py aperag/indexing/worker_factory.py aperag/schema/common.py tests/unit_test/indexing/test_entity_types.py tests/integration/test_graph_extractor.py tests/unit_test/indexing/test_t1_2_graph.py
uv run --extra test pytest tests/unit_test/indexing/test_entity_types.py tests/integration/test_graph_extractor.py tests/unit_test/indexing/test_t1_2_graph.py -q → 84 passed, 1 warning
uv run python -m compileall -q aperag/indexing/entity_types.py aperag/indexing/graph.py aperag/indexing/graph_extractor.py aperag/indexing/llm.py aperag/indexing/worker_factory.py aperag/schema/common.py tests/unit_test/indexing/test_entity_types.py tests/integration/test_graph_extractor.py tests/unit_test/indexing/test_t1_2_graph.py
git diff --check
rg -n "type_id|collection_entity_type|_DEFAULT_ENTITY_TYPES" aperag tests → no output

earayu · 2026-04-28T09:41:36Z

Findings:

BLOCKER — format_entity_types_for_prompt() silently caps the prompt list to 50 types, but the v3 spec and PR body explicitly say Wave 11 does not do a type-list size cap / LRU. More importantly, the runtime contract says the LLM should first reuse the current collection entity_types list; hiding types after the first 50 makes reuse impossible for those types and can reintroduce type drift. Please either remove MAX_PROMPT_ENTITY_TYPES truncation or get an explicit architect/user spec amendment that prompt-snapshot truncation is allowed.
BLOCKER — §15 requires atomic merge coverage for “concurrent merge does not lose type”. Current coverage only asserts the SQL statement has FOR UPDATE and preserves config in a fake single-session path. Please add a targeted race/serialization test that proves two merges for the same collection both survive, e.g. existing types [], concurrent additions ["疾病"] and ["药物"] result in both values, or an equivalent deterministic fake session/lock test.
BLOCKER — PR body does not include the required Wave hard-gate sections from the design doc §16: 12-invariant cross-check, 4-pattern + 11 mini-pattern pre-check, and simple-stable 4-guardrail. The existing Spec gates section covers part of the PR-specific gates, but not the required three hard-gate sections. Please update the PR body before merge so architect ratify has the agreed checklist.
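
One deterministic way to express the requested race test, without a real database, is to let an `asyncio.Lock` stand in for the row lock. Everything here is an assumption for illustration (`FakeCollectionRow`, `merge_types_locked`, the in-memory list); the actual test in the PR may be structured differently.

```python
import asyncio


class FakeCollectionRow:
    """In-memory stand-in for a collection row guarded by a row lock."""

    def __init__(self) -> None:
        self.entity_types: list[str] = []
        self.lock = asyncio.Lock()  # plays the role of SELECT ... FOR UPDATE


async def merge_types_locked(row: FakeCollectionRow, new_types: list[str]) -> None:
    async with row.lock:
        snapshot = list(row.entity_types)  # read while holding the lock
        await asyncio.sleep(0)  # yield so the other writer gets a chance to run
        for name in new_types:
            if name not in snapshot:
                snapshot.append(name)
        row.entity_types = snapshot  # write back before releasing the lock


async def race() -> list[str]:
    row = FakeCollectionRow()
    await asyncio.gather(
        merge_types_locked(row, ["疾病"]),
        merge_types_locked(row, ["药物"]),
    )
    return row.entity_types
```

Without the lock, the `sleep(0)` between read and write would let both writers start from the same empty snapshot and the later write would overwrite the earlier one. With it, the second writer reads only after the first has written, so both types survive.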

Checked:

gh pr checks 1789 is green: 10/10 passing.
Local targeted gate passed with uv run --extra test pytest tests/unit_test/indexing/test_entity_types.py tests/integration/test_graph_extractor.py tests/unit_test/indexing/test_t1_2_graph.py -q: 82 passed, 1 warning.
ruff check passed for the changed backend/test files.
rg -n "type_id|collection_entity_type|_DEFAULT_ENTITY_TYPES" aperag tests returned no hits.

After these are fixed, the main dataflow shape looks aligned: collection language reaches prompt rendering, entity types stay as strings, _DEFAULT_ENTITY_TYPES is removed from runtime, and graph storage/curation remains on entity_type: str.

earayu · 2026-04-28T09:52:15Z

LGTM.

I rechecked the three previous blockers against head f8210c64:

Prompt type-list cap removed: format_entity_types_for_prompt() now renders every cleaned type, and test_format_entity_types_for_prompt_does_not_cap_type_list covers 75 entries.
Concurrent merge coverage added: test_merge_entity_types_serializes_concurrent_updates proves the row-lock path serializes two writers and preserves both new type sets.
PR body now includes the required §16 gates: 12-invariant, 4-pattern + 11 mini-pattern, and simple-stable 4-guardrail.
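
As a rough picture of the uncapped rendering, a formatter along these lines would pass a 75-entry check. The function name, bullet layout, and empty-list placeholder are assumptions; this is not the body of `format_entity_types_for_prompt()`.

```python
def format_types_for_prompt(entity_types: list[str]) -> str:
    """Render every cleaned type, one per line, with no truncation."""
    cleaned = [t.strip() for t in entity_types if isinstance(t, str) and t.strip()]
    if not cleaned:
        # Hypothetical placeholder for a new collection with entity_types=[].
        return "(none yet; propose types from the document)"
    return "\n".join(f"- {t}" for t in cleaned)
```

The point of the sketch is only that the output length tracks the input length: 75 cleaned types produce 75 lines.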

Additional checks:

gh pr checks 1789: 10/10 passing.
Local targeted test: uv run --extra test pytest tests/unit_test/indexing/test_entity_types.py tests/integration/test_graph_extractor.py tests/unit_test/indexing/test_t1_2_graph.py -q → 82 passed before the amend; Weston reports 84 passed after the amend and the new tests cover the changed areas.
Local ruff check and compileall passed for the changed files.
Runtime grep gate clean: rg -n "type_id|collection_entity_type|_DEFAULT_ENTITY_TYPES" aperag tests returned no hits.

The implementation now matches the v3 design lock: hard-cut entity_types: list[str], no registry table, no type_id, no default fallback, collection language reaches prompt, per-document async merge uses SELECT ... FOR UPDATE, and graph storage/curation stays on entity_type: str.

earayu · 2026-04-28T09:54:32Z

Architect ratify ✅: all three hard-gate sections pass (12-invariant + 4-pattern/11 mini-pattern + simple-stable 4-guardrail). Spot-check confirms async signature, SELECT FOR UPDATE single path, per-document granularity, prompt no cap. CI 10/10 + 84 tests pass + grep zero. Proceeding to merge.

@ziang

Per @earayu2 ask in #Graph可视化 (msg=4a8309e4): one menu page where the team can flip between visualisation experiments while leaving the production graph routes alone. The first two embedded modes are both Cosmograph (`@cosmograph/react` 2.x): a topology renderer that consumes the live `/api/v2/collections/{id}/graphs` payload, and a semantic-map renderer that visualises a deterministic synthetic fixture at 1k / 5k / 10k point scales for FPS benchmarking. The nav also links out to the existing `/graph` and `/graph-showcase` production pages so all four can be compared in adjacent tabs.

Why a PoC sibling instead of overwriting `graph-showcase`: `graph-showcase/` already carries 745 LOC of an earlier `react-force-graph-2d` exploration (see `collection-graph-showcase.tsx`); replacing it would lose the prior visual probe. PM ratified the new sibling path in DM (msg=30e30aab).

License caveat surfaced inline in the nav: `@cosmograph/react` is CC-BY-NC-4.0, explicitly fine for a PoC per @earayu2 (msg=26e5530d) but flagged as a productionisation gate. The MIT-licensed `@cosmos.gl/graph` engine is the intended swap target if the visual direction sticks.

Files:

- `web/src/app/workspace/collections/[collectionId]/graph-lab/page.tsx`
  Server component, breadcrumbs + container layout (mirror of `graph-showcase/page.tsx`).
- `…/graph-lab/collection-graph-lab.tsx`
  Client shell, wires the nav to one of two embedded renderers.
- `…/graph-lab/graph-lab-nav.tsx`
  Left rail with mode tabs + external page links + license note.
- `…/graph-lab/cosmograph-topology.tsx`
  Live KG → Cosmograph Graph mode. `dynamic({ ssr: false })` keeps the WebGL renderer out of the server pass. Detail panel reuses cluster legend + selected-node metadata.
- `…/graph-lab/cosmograph-semantic-map.tsx`
  Synthetic fixture → Cosmograph embedding mode (`pointXBy`/`pointYBy` set, simulation off). Scale toggle bench panel exposes generate-time + first-frame latency for the 1k/5k/10k presets.
- `…/graph-lab/lab-modes.ts`
  Tiny shared module so the server-side page.tsx stays import-light.
- `web/src/features/knowledge-graph/cosmograph-adapter.ts`
  Pure data transform, `KnowledgeGraph → {points, links, clusterLabels}`. Two entry points (`toTopologyDataset` for live data, `buildSyntheticSemanticDataset` for deterministic fixture) so the PoC can rebuild reproducibly under different scales.

§K.12 invariant cross-check
============================

Largely n/a (frontend exploratory PoC). Direct hits:

- #11 (read-only / not auto-action): the lab page is read-only; no merge/curate surface added.
- #12 (grep-zero LightRAG): `rg -i lightrag web/src` returns zero hits both before and after.

4-pattern pre-check matrix
==========================

- Pattern 1 v1 (existing graph routes): the production `/graph` page (`collection-graph.tsx`) and the prior `/graph-showcase` page are not modified; no imports moved or renamed. `rg "from '\\..*collection-graph(-showcase)?'"` returns only the original page bindings.
- Pattern 1 v2 (data-shape consumers): the new adapter consumes `KnowledgeGraph`/`GraphNode`/`GraphEdge` from `@/features/knowledge-graph/types`, the same source the existing graph page uses. No type duplication.
- Pattern 2 (response-shape change list): n/a, no API or schema touched.
- Pattern 3 (additive helpers): the adapter and synthetic fixture generator live in `features/knowledge-graph/`; both are pure functions so the production graph page can adopt them later without circular imports.

simple-stable directive 4-guardrail
===================================

- #1 No unbounded scope expansion: explicit PoC scope per @ziang acceptance (msg=4db79b66). No backend changes; no `/api/v2/...` additions; semantic mode uses front-end fixture data only.
- #2 Ship as soon as possible: single-PR ship, ~0.5 day work envelope.
- #3 Simple and stable: production routes (`/graph`, `/graph-showcase`) are byte-identical before/after this commit; PoC lives in its own directory.
- #4 Maintenance-free for private deployments: no operator config introduced. The Cosmograph dependency is bundled with the rest of the web app; productionisation will need a license decision (called out in the nav UI itself so reviewers see it).

Test plan
=========

- `next dev --turbopack` reaches "Ready in 1476ms" cleanly post-add (verified locally; SSR pass does not load `@cosmograph/react`).
- `yarn tsc --noEmit` reports zero errors in the new files (`graph-lab/**`, `cosmograph-adapter.ts`). Five pre-existing errors remain in `graph/collection-graph*.tsx` from PR #1789 and are out of scope.
- Manual smoke deferred to reviewer-side; the bench panel logs generate-time + first-frame latency for the three scale presets on render so a screenshot at each scale is enough to satisfy the PoC acceptance items.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

feat(graph): support dynamic entity types

f8210c6

earayu force-pushed the weston/wave11-graph-entity-i18n branch from 1a09909 to f8210c6 on April 28, 2026 09:44

earayu merged commit 7a0edd8 into main Apr 28, 2026
10 checks passed

earayu deleted the weston/wave11-graph-entity-i18n branch April 28, 2026 09:54

earayu mentioned this pull request Apr 28, 2026

feat(web): graph-lab Cosmograph PoC menu page #1791

Merged

10 tasks

This was referenced Apr 28, 2026

feat(collection): Wave 10 Chunks C+D — regen service + 2 OpenAPI endpoints #1786

Merged

fix(web_search): drop misnamed X-Return-Format header on Jina providers #1799

Merged
