Wave 11: dynamic graph entity types#1789

Merged
earayu merged 1 commit into main from weston/wave11-graph-entity-i18n
Apr 28, 2026

Conversation

Collaborator

@earayu earayu commented Apr 28, 2026

Summary

  • Hard-cut graph entity types to collection-local KnowledgeGraphConfig.entity_types: list[str] with default [].
  • Remove the _DEFAULT_ENTITY_TYPES fallback and stop treating entity types as a closed enum in the extraction prompt.
  • Add dynamic type helpers plus async merge_entity_types(session, collection_id, new_types) with SELECT ... FOR UPDATE and first-write-wins dedupe.
  • Wire a per-document entity-type merge after kg.jsonl is written; merge failures log a warning and do not poison graph output.
  • Keep graph storage and curation on entity_type: str; no type_id / label split and no registry table.
  • Add the locked Chinese Wave 11 design doc under docs/modularization/.
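
To make the first-write-wins dedupe concrete, here is a minimal sketch of the list-merging step only, without the database session or row lock. The helper names `_ascii_fold` and `merge_type_lists` are hypothetical, and trimming whitespace before comparison is an assumption; the PR states only that dedupe is ASCII-case-insensitive and that the first spelling written is kept.

```python
from typing import Iterable


def _ascii_fold(value: str) -> str:
    """Lower-case only ASCII letters; non-ASCII text (e.g. CJK) is left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in value)


def merge_type_lists(existing: Iterable[str], new_types: Iterable[str]) -> list[str]:
    """Append unseen types to the existing list; the first-written spelling wins."""
    merged: list[str] = []
    seen: set[str] = set()
    for raw in [*existing, *new_types]:
        name = raw.strip()  # assumption: surrounding whitespace is not significant
        key = _ascii_fold(name)
        if not name or key in seen:
            continue  # empty value, or a case variant of an earlier entry
        seen.add(key)
        merged.append(name)
    return merged
```

With existing types `["Person"]` and new types `["person", "疾病", "PERSON", "Drug"]`, this sketch keeps `Person` as originally written and appends only `疾病` and `Drug`.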

CR Fixes

  • Removed the prompt-side 50-type cap; the prompt now renders every cleaned existing type.
  • Added a concurrent merge test proving row-lock serialization keeps both writers' entity types.
  • Expanded this PR body with the §16 hard-gate sections below.

Spec Gates

  • Hard cut only: no old-data migration, no lazy seed, no backfill.
  • No DB schema change / alembic migration.
  • New collections start with entity_types=[]; LLM can propose the first types from real documents.
  • Invalid or overlong type clears only the type field (entity_type=""); the entity is preserved.
  • ASCII case dedupe is first-write-wins.
  • Prompt language comes from collection.config.language; entity names preserve source text.
  • Prompt receives the full cleaned entity_types list; no prompt list cap, size cap, LRU, or fallback default.
  • Per-document merge happens after graph derive writes kg.jsonl; no per-chunk or per-run merge.
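
The "clear only the type field" gate could look roughly like the sketch below. The `Entity` dataclass, `clean_entity_type`, `sanitize`, and the `MAX_TYPE_LENGTH` value are all illustrative assumptions; the PR does not state the real length limit or what else counts as invalid.

```python
from dataclasses import dataclass, replace

MAX_TYPE_LENGTH = 64  # hypothetical limit; the real value is not given in the PR


@dataclass(frozen=True)
class Entity:
    name: str
    entity_type: str


def clean_entity_type(raw: object) -> str:
    """Return a trimmed type string, or "" when the value is unusable."""
    if not isinstance(raw, str):
        return ""
    cleaned = " ".join(raw.split())  # collapse runs of whitespace and newlines
    if not cleaned or len(cleaned) > MAX_TYPE_LENGTH:
        return ""
    return cleaned


def sanitize(entity: Entity) -> Entity:
    """Clear only the type field; the entity itself is always preserved."""
    return replace(entity, entity_type=clean_entity_type(entity.entity_type))
```

An entity whose type is 200 characters long comes back with `entity_type=""` and its name intact, so a bad type never causes the entity to be dropped.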

12-Invariant Gate

  1. EntityRecord.entity_type remains a string; no type_id field is introduced.
  2. kg.jsonl entity/relation record shape remains compatible.
  3. Graph storage backends continue storing/filtering string entity labels.
  4. Graph curation continues using CurationEntity.entity_type string paths.
  5. Wave 7 compacted entity/relation descriptions are untouched.
  6. Wave 7 merge-candidate/vector sync hooks are untouched.
  7. Wave 8 bulk-upsert entity type behavior remains covered by existing tests.
  8. Tenant scope and parse-version artifact paths are unchanged.
  9. Graph derive still writes kg.jsonl even if config merge fails.
  10. Config merge never deletes or rewrites existing graph entities.
  11. Empty or bad entity type values do not pollute collection config.
  12. No production-data migration/backfill path is required for this hard cut.
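
Invariants 9 and 11 describe an ordering: flush `kg.jsonl` first, then attempt the config merge, and treat a merge failure as a warning. A minimal sketch of that ordering follows, assuming a hypothetical `derive_document` function and a `merge` callback that stands in for the real `merge_entity_types(session, collection_id, new_types)` call.

```python
import json
import logging
from pathlib import Path
from typing import Awaitable, Callable

logger = logging.getLogger("graph_derive_sketch")


async def derive_document(
    out_path: Path,
    records: list[dict],
    merge: Callable[[list[str]], Awaitable[None]],
) -> None:
    # 1. Flush graph output first so it never depends on the config merge.
    with out_path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")

    # 2. Merge once per document; empty types are skipped so they cannot
    #    pollute the collection config.
    new_types = [r["entity_type"] for r in records if r.get("entity_type")]
    try:
        await merge(new_types)
    except Exception:
        # Non-fatal: the next document's merge can add the missing types.
        logger.warning("entity type merge failed; kg.jsonl kept", exc_info=True)
```

Even when `merge` raises, the file on disk is complete, which is the behavior the invariant asks for.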

Pattern Gate

  • Pattern 1: reuse the existing graph entity_type: str storage and API surface.
  • Pattern 2: bind graph output language to collection.config.language; no auto language mode.
  • Pattern 3: reuse existing KnowledgeGraphConfig.entity_types; no new registry table.
  • Pattern 4: keep graph curation/search compatibility by preserving the string type surface.

11 Mini-Pattern Gate

  1. merge_entity_types() is async and takes AsyncSession.
  2. Atomic merge path uses SELECT ... FOR UPDATE.
  3. No sync wrapper and no optimistic-retry fallback.
  4. Merge call granularity is per-document after kg.jsonl flush.
  5. _DEFAULT_ENTITY_TYPES is removed and no longer a fallback.
  6. Unknown/new LLM type strings are accepted.
  7. Overlong/bad type clears only the type field; entity row is preserved.
  8. ASCII case dedupe is first-write-wins.
  9. Merge failure logs warning and is non-fatal/self-healing.
  10. Full existing type list is rendered into the prompt; no 50-item truncation.
  11. Concurrent merge behavior is covered by a serialization test.

Simple-Stable Guardrails

  • No scope expansion: no lifecycle/status/source/audit registry model.
  • Fastest safe path: one existing JSON config field and one merge helper.
  • Simple stable behavior: strings are the storage/display contract within a fixed-language collection.
  • Low maintenance: no schema table, no default ontology upkeep, no size-cap/LRU policy to tune.

Validation

  • uv run ruff check aperag/indexing/entity_types.py aperag/indexing/graph.py aperag/indexing/graph_extractor.py aperag/indexing/llm.py aperag/indexing/worker_factory.py aperag/schema/common.py tests/unit_test/indexing/test_entity_types.py tests/integration/test_graph_extractor.py tests/unit_test/indexing/test_t1_2_graph.py
  • uv run --extra test pytest tests/unit_test/indexing/test_entity_types.py tests/integration/test_graph_extractor.py tests/unit_test/indexing/test_t1_2_graph.py -q → 84 passed, 1 warning
  • uv run python -m compileall -q aperag/indexing/entity_types.py aperag/indexing/graph.py aperag/indexing/graph_extractor.py aperag/indexing/llm.py aperag/indexing/worker_factory.py aperag/schema/common.py tests/unit_test/indexing/test_entity_types.py tests/integration/test_graph_extractor.py tests/unit_test/indexing/test_t1_2_graph.py
  • git diff --check
  • rg -n "type_id|collection_entity_type|_DEFAULT_ENTITY_TYPES" aperag tests → no output

Collaborator Author

earayu commented Apr 28, 2026

Findings:

  1. BLOCKER — format_entity_types_for_prompt() silently caps the prompt list to 50 types, but the v3 spec and PR body explicitly say Wave 11 does not do a type-list size cap / LRU. More importantly, the runtime contract says the LLM should first reuse the current collection entity_types list; hiding types after the first 50 makes reuse impossible for those types and can reintroduce type drift. Please either remove MAX_PROMPT_ENTITY_TYPES truncation or get an explicit architect/user spec amendment that prompt-snapshot truncation is allowed.

  2. BLOCKER — §15 requires atomic merge coverage for “concurrent merge does not lose type”. Current coverage only asserts the SQL statement has FOR UPDATE and preserves config in a fake single-session path. Please add a targeted race/serialization test that proves two merges for the same collection both survive, e.g. existing types [], concurrent additions ["疾病"] and ["药物"] result in both values, or an equivalent deterministic fake session/lock test.

  3. BLOCKER — PR body does not include the required Wave hard-gate sections from the design doc §16: 12-invariant cross-check, 4-pattern + 11 mini-pattern pre-check, and simple-stable 4-guardrail. The existing Spec gates section covers part of the PR-specific gates, but not the required three hard-gate sections. Please update the PR body before merge so architect ratify has the agreed checklist.
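
The lost-update race behind finding 2 can be illustrated with a deterministic in-memory stand-in. In this sketch an `asyncio.Lock` plays the role of the `SELECT ... FOR UPDATE` row lock, and `FakeCollectionRow`, the `locked` flag, and `race` are hypothetical names for illustration; this is not the test that was added to the PR.

```python
import asyncio


class FakeCollectionRow:
    """In-memory stand-in for one collection row and its row lock."""

    def __init__(self) -> None:
        self.entity_types: list[str] = []
        self.row_lock = asyncio.Lock()  # plays the role of SELECT ... FOR UPDATE


async def merge_entity_types(row: FakeCollectionRow, new_types: list[str], *, locked: bool) -> None:
    async def read_modify_write() -> None:
        snapshot = list(row.entity_types)  # read
        await asyncio.sleep(0)             # yield so the other writer can interleave
        for name in new_types:             # modify
            if name not in snapshot:
                snapshot.append(name)
        row.entity_types = snapshot        # write

    if locked:
        async with row.row_lock:
            await read_modify_write()
    else:
        await read_modify_write()


async def race(locked: bool) -> list[str]:
    row = FakeCollectionRow()
    await asyncio.gather(
        merge_entity_types(row, ["疾病"], locked=locked),
        merge_entity_types(row, ["药物"], locked=locked),
    )
    return row.entity_types
```

With `locked=True` both `疾病` and `药物` survive; with `locked=False` both writers read the same empty snapshot and one addition is overwritten.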

Checked:

  • gh pr checks 1789 is green: 10/10 passing.
  • Local targeted gate passed with uv run --extra test pytest tests/unit_test/indexing/test_entity_types.py tests/integration/test_graph_extractor.py tests/unit_test/indexing/test_t1_2_graph.py -q: 82 passed, 1 warning.
  • ruff check passed for the changed backend/test files.
  • rg -n "type_id|collection_entity_type|_DEFAULT_ENTITY_TYPES" aperag tests returned no hits.

After these are fixed, the main dataflow shape looks aligned: collection language reaches prompt rendering, entity types stay as strings, _DEFAULT_ENTITY_TYPES is removed from runtime, and graph storage/curation remains on entity_type: str.

@earayu earayu force-pushed the weston/wave11-graph-entity-i18n branch from 1a09909 to f8210c6 April 28, 2026 09:44
Collaborator Author

earayu commented Apr 28, 2026

LGTM.

I rechecked the three previous blockers against head f8210c64:

  • Prompt type-list cap removed: format_entity_types_for_prompt() now renders every cleaned type, and test_format_entity_types_for_prompt_does_not_cap_type_list covers 75 entries.
  • Concurrent merge coverage added: test_merge_entity_types_serializes_concurrent_updates proves the row-lock path serializes two writers and preserves both new type sets.
  • PR body now includes the required §16 gates: 12-invariant, 4-pattern + 11 mini-pattern, and simple-stable 4-guardrail.
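
For the first item, the uncapped rendering behavior might be sketched as follows. The function name `render_types_for_prompt`, the comma-separated output format, and the empty-list placeholder text are assumptions; the PR confirms only that every cleaned type is rendered and that a 75-entry list is covered by a test.

```python
def render_types_for_prompt(entity_types: list[str]) -> str:
    """Render every cleaned type, in order, with no truncation."""
    cleaned = [t.strip() for t in entity_types if isinstance(t, str) and t.strip()]
    if not cleaned:
        # Hypothetical placeholder for a new collection with entity_types=[].
        return "(none yet; propose types from the document)"
    return ", ".join(cleaned)


# A 75-entry list is rendered in full, mirroring the size used in the PR's test.
seventy_five = [f"Type{i}" for i in range(75)]
rendered = render_types_for_prompt(seventy_five)
```

Because nothing is dropped, the LLM can reuse any existing type, which is the reuse contract the original cap would have broken for entries past the 50th.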

Additional checks:

  • gh pr checks 1789: 10/10 passing.
  • Local targeted test: uv run --extra test pytest tests/unit_test/indexing/test_entity_types.py tests/integration/test_graph_extractor.py tests/unit_test/indexing/test_t1_2_graph.py -q → 82 passed before the amend; Weston reports 84 passed after the amend and the new tests cover the changed areas.
  • Local ruff check and compileall passed for the changed files.
  • Runtime grep gate clean: rg -n "type_id|collection_entity_type|_DEFAULT_ENTITY_TYPES" aperag tests returned no hits.

The implementation now matches the v3 design lock: hard-cut entity_types: list[str], no registry table, no type_id, no default fallback, collection language reaches prompt, per-document async merge uses SELECT ... FOR UPDATE, and graph storage/curation stays on entity_type: str.

Collaborator Author

earayu commented Apr 28, 2026

Architect ratify ✅: all three hard-gate sections pass (12-invariant + 4-pattern/11 mini-pattern + simple-stable 4-guardrail). Spot-check confirms the async signature, the single SELECT FOR UPDATE path, per-document granularity, and no prompt cap. CI 10/10, 84 tests passing, grep gate returns zero hits. Proceeding to merge.

@earayu earayu merged commit 7a0edd8 into main Apr 28, 2026
10 checks passed
@earayu earayu deleted the weston/wave11-graph-entity-i18n branch April 28, 2026 09:54
earayu added a commit that referenced this pull request Apr 28, 2026
Per @earayu2 ask in #Graph可视化 (msg=4a8309e4): one menu page where
the team can flip between visualisation experiments while leaving
the production graph routes alone. The first two embedded modes are
both Cosmograph (`@cosmograph/react` 2.x): a topology renderer that
consumes the live `/api/v2/collections/{id}/graphs` payload, and a
semantic-map renderer that visualises a deterministic synthetic
fixture at 1k / 5k / 10k point scales for FPS benchmarking. The nav
also links out to the existing `/graph` and `/graph-showcase`
production pages so all four can be compared in adjacent tabs.

Why a PoC sibling instead of overwriting `graph-showcase`:
`graph-showcase/` already carries 745 LOC of an earlier
`react-force-graph-2d` exploration (see
`collection-graph-showcase.tsx`); replacing it would lose the prior
visual probe. PM ratified the new sibling path in DM (msg=30e30aab).

License caveat surfaced inline in the nav: `@cosmograph/react` is
CC-BY-NC-4.0 — explicitly fine for a PoC per @earayu2 (msg=26e5530d)
but flagged as a productionisation gate. The MIT-licensed
`@cosmos.gl/graph` engine is the intended swap target if the
visual direction sticks.

Files:
- `web/src/app/workspace/collections/[collectionId]/graph-lab/page.tsx`
  Server component, breadcrumbs + container layout (mirror of
  `graph-showcase/page.tsx`).
- `…/graph-lab/collection-graph-lab.tsx` Client shell, wires the nav
  to one of two embedded renderers.
- `…/graph-lab/graph-lab-nav.tsx` Left rail with mode tabs +
  external page links + license note.
- `…/graph-lab/cosmograph-topology.tsx` Live KG → Cosmograph Graph
  mode. `dynamic({ ssr: false })` keeps the WebGL renderer out of the
  server pass. Detail panel reuses cluster legend + selected-node
  metadata.
- `…/graph-lab/cosmograph-semantic-map.tsx` Synthetic fixture →
  Cosmograph embedding mode (`pointXBy`/`pointYBy` set, simulation
  off). Scale toggle bench panel exposes generate-time + first-frame
  latency for the 1k/5k/10k presets.
- `…/graph-lab/lab-modes.ts` Tiny shared module so the server-side
  page.tsx stays import-light.
- `web/src/features/knowledge-graph/cosmograph-adapter.ts` Pure data
  transform, `KnowledgeGraph → {points, links, clusterLabels}`. Two
  entry points (`toTopologyDataset` for live data,
  `buildSyntheticSemanticDataset` for deterministic fixture) so the
  PoC can rebuild reproducibly under different scales.
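
The idea of a deterministic synthetic fixture can be sketched as below. This is written in Python for brevity, while the adapter itself is a TypeScript module; the function name, the cluster layout, the link rule, and the default seed are all illustrative assumptions rather than the behavior of `buildSyntheticSemanticDataset`.

```python
import random


def build_synthetic_semantic_dataset(point_count: int, cluster_count: int = 8, seed: int = 42) -> dict:
    """Build a reproducible {points, links, clusterLabels} fixture."""
    rng = random.Random(seed)  # local RNG: same seed and scale give the same fixture
    centers = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(cluster_count)]
    points = []
    for i in range(point_count):
        cluster = i % cluster_count
        cx, cy = centers[cluster]
        points.append(
            {
                "id": f"p{i}",
                "cluster": cluster,
                "x": cx + rng.gauss(0, 0.05),  # precomputed position, no simulation needed
                "y": cy + rng.gauss(0, 0.05),
            }
        )
    # Link each point to the previous point in the same cluster.
    links = [
        {"source": f"p{i}", "target": f"p{i - cluster_count}"}
        for i in range(cluster_count, point_count)
    ]
    labels = [f"cluster-{c}" for c in range(cluster_count)]
    return {"points": points, "links": links, "clusterLabels": labels}
```

Calling it twice with the same scale and seed yields identical output, which is what makes benchmark runs at 1k, 5k, and 10k points comparable.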

§K.12 invariant cross-check
============================
Largely n/a (frontend exploratory PoC). Direct hits:
- #11 (read-only / not auto-action) — the lab page is read-only;
  no merge/curate surface added.
- #12 (grep-zero LightRAG) — `rg -i lightrag web/src` returns zero
  hits both before and after.

4-pattern pre-check matrix
==========================
- Pattern 1 v1 (existing graph routes): the production
  `/graph` page (`collection-graph.tsx`) and the prior
  `/graph-showcase` page are not modified; no imports moved or
  renamed. `rg "from '\\..*collection-graph(-showcase)?'"` returns
  only the original page bindings.
- Pattern 1 v2 (data-shape consumers): the new adapter consumes
  `KnowledgeGraph`/`GraphNode`/`GraphEdge` from
  `@/features/knowledge-graph/types` — same source the existing
  graph page uses. No type duplication.
- Pattern 2 (response-shape change list): n/a, no API or schema
  touched.
- Pattern 3 (additive helpers): the adapter and synthetic fixture
  generator live in `features/knowledge-graph/`; both are pure
  functions so the production graph page can adopt them later
  without circular imports.

simple-stable directive 4-guardrail
===================================
- #1 No unbounded scope expansion: explicit PoC scope per @ziang acceptance
  (msg=4db79b66). No backend changes; no `/api/v2/...` additions;
  semantic mode uses front-end fixture data only.
- #2 Ship as soon as possible: single-PR ship, ~0.5 day work envelope.
- #3 Simple and stable: production routes (`/graph`, `/graph-showcase`)
  are byte-identical before/after this commit; PoC lives in its
  own directory.
- #4 Maintenance-free private deployment: no operator config introduced. The
  Cosmograph dependency is bundled with the rest of the web app;
  productionisation will need a license decision (called out in
  the nav UI itself so reviewers see it).

Test plan
=========
- `next dev --turbopack` reaches "Ready in 1476ms" cleanly post-add
  (verified locally; SSR pass does not load `@cosmograph/react`).
- `yarn tsc --noEmit` reports zero errors in the new files
  (`graph-lab/**`, `cosmograph-adapter.ts`). Five pre-existing
  errors remain in `graph/collection-graph*.tsx` from PR #1789
  and are out of scope.
- Manual smoke deferred to reviewer-side; the bench panel logs
  generate-time + first-frame latency for the three scale presets
  on render so a screenshot at each scale is enough to satisfy the
  PoC acceptance items.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>