feat(sutta-studio): provider abstraction + Citation extension (Tier-1 commit A of 5) #38
…ion fields
Tier-1 commit A per ADR SUTTA-008 §Build order. Lands the data-layer
plumbing that subsequent commits attach DPD, VRI, bilara, and
suttaplex providers to. No behaviour change for the live compiler in
this commit; hand-curation tooling can already call the registry.
New: services/providers/
- types.ts — LexiconProvider, MorphologyProvider, CommentaryProvider,
EditionProvider, WitnessProvider, ParallelProvider interfaces.
Every response carries `sourceId` + deterministic `citationId`
(`cite:{providerId}:{sourceId}` or `cite:{providerId}:q:{query}`)
so citation materialisation is mechanical, not hand-glued.
- citationHelpers.ts — citationIdFor + materializeCitation.
- lexiconRegistry.ts — LexiconProviderRegistry runs providers in
parallel, preserves per-provider entries in `entriesBySource`
(powers the source-disagreement inspector in ADR UI vision §7),
isolates one provider's failure from the others.
- suttaCentralDictionary.ts — SuttaCentralDictionaryProvider wraps
`/api/dictionary_full/{lemma}`; per-session cache; preserves raw
payload as `rawExcerpt` so the LLM prompt + UI see unmodified
attestations. First citizen of the registry.
- index.ts — barrel + defaultLexiconRegistry singleton (SC only for now).
- Tests: 29 across the three modules (citationHelpers 11, lexiconRegistry
8, suttaCentralDictionary 10).
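A minimal sketch of the deterministic citationId scheme described for types.ts and citationHelpers.ts above. The `CitationKey` input shape is an assumption made for illustration; the commit names `citationIdFor` but does not show its signature.

```typescript
// Hypothetical input shape: either a concrete source record or a bare query.
type CitationKey =
  | { providerId: string; sourceId: string }
  | { providerId: string; query: string };

// Builds `cite:{providerId}:{sourceId}` or `cite:{providerId}:q:{query}`.
// The same input always yields the same id, so no hand-assembly is needed.
function citationIdFor(key: CitationKey): string {
  if ("sourceId" in key) {
    return `cite:${key.providerId}:${key.sourceId}`;
  }
  return `cite:${key.providerId}:q:${key.query}`;
}

// Matches the id format shown elsewhere in this PR for DPD records.
console.log(citationIdFor({ providerId: "dpd", sourceId: "dpd:18051" }));
```

Because the id is a pure function of provider and source, a curator can paste it into a packet and the Citation entry can be materialised later without a lookup table.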
Modified: types/suttaStudio.ts
- Citation extended with provenance / query / excerpt / license / fetchedAt.
Excerpt is baked into the packet so the renderer doesn't re-fetch.
- CitationProvenance enum: sc-dictionary-full, dpd, ms-dpd, ped-dsal, cpd,
vri-attha, vri-cscd, sc-bilara, sc-suttaplex, buddhanexus, bdrc, cbeta,
gretil, 84000, manual.
- 3 new round-trip tests + 1 disagreement-grouping test in
types/suttaStudio.test.ts.
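The extended Citation could be typed roughly as follows. This is a sketch only: the provenance values are the ones listed above, but the string-literal union (rather than a TypeScript `enum`), the optionality of each field, the `id` field, and the ISO-8601 `fetchedAt` are assumptions, and the sample values are placeholders.

```typescript
// Provenance values as listed in the commit message.
type CitationProvenance =
  | "sc-dictionary-full" | "dpd" | "ms-dpd" | "ped-dsal" | "cpd"
  | "vri-attha" | "vri-cscd" | "sc-bilara" | "sc-suttaplex"
  | "buddhanexus" | "bdrc" | "cbeta" | "gretil" | "84000" | "manual";

// Assumed shape: `id` stands in for whatever fields Citation already had.
interface Citation {
  id: string;
  provenance?: CitationProvenance;
  query?: string;
  excerpt?: string;    // baked into the packet so the renderer need not re-fetch
  license?: string;
  fetchedAt?: string;  // assumed to be an ISO-8601 timestamp
}

// Placeholder values for illustration only.
const sample: Citation = {
  id: "cite:dpd:dpd:53164",
  provenance: "dpd",
  excerpt: "by me",
  license: "CC BY-NC-SA 4.0",
  fetchedAt: "2026-05-01T00:00:00Z",
};

// Round-trip through JSON, the property the new tests are said to check.
const roundTripped: Citation = JSON.parse(JSON.stringify(sample));
```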
Compiler path (services/compiler/index.ts:387) is intentionally
unchanged. The existing fetchDictionaryEntry callsite keeps working;
provider wiring will land alongside DpdProvider in commit B as a
single coherent unit.
Verified: 42 new tests pass; 65 existing sutta-studio tests still
pass; typecheck shows only the 5 pre-existing baseline errors
unchanged.
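The failure isolation described for lexiconRegistry.ts could look roughly like the sketch below. The interfaces are simplified stand-ins, and `errorsBySource` is a hypothetical field added here to make the isolation visible; only `entriesBySource` is named in the commit.

```typescript
// Simplified stand-ins for the real provider types.
interface LexiconEntry { sourceId: string; gloss: string }
interface LexiconProvider {
  id: string;
  lookup(lemma: string): Promise<LexiconEntry[]>;
}

interface RegistryResult {
  entriesBySource: Record<string, LexiconEntry[]>;
  errorsBySource: Record<string, string>; // hypothetical, for illustration
}

// Runs every provider concurrently. Each provider gets its own try/catch,
// so one rejection is recorded and never aborts the others.
async function lookupAll(
  providers: LexiconProvider[],
  lemma: string,
): Promise<RegistryResult> {
  const result: RegistryResult = { entriesBySource: {}, errorsBySource: {} };
  await Promise.all(
    providers.map(async (p) => {
      try {
        result.entriesBySource[p.id] = await p.lookup(lemma);
      } catch (err) {
        result.errorsBySource[p.id] =
          err instanceof Error ? err.message : String(err);
      }
    }),
  );
  return result;
}
```

Keeping results keyed by provider id, instead of flattening them, is what lets a later UI show where sources disagree.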
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…1 commit B.1)
Per ADR SUTTA-008 §Build order step 3, lands the DPD data layer for
MN10. Provider impl follows in B.2; compiler wiring in B.3.
ms-dpd vs full dpd-db decision:
ms-dpd is verb-blind — its inflection table has zero verb conjugation
rows, only declensions. For the assasati verb family central to MN10
this kills it. Using full dpd-db.
Storage strategy (resolved during this spike — ADR §Open Questions #2):
Per-sutta subsets, not full corpus. Full DPD export is 80-120MB JSON;
committing that for one sutta is disproportionate. The script extracts
only headwords referenced by surface forms in the target sutta. Total
committed for MN10: ~656KB. Each new sutta adds its own subset
directory.
Surface→lemma resolution:
Heuristic stem-stripping over dpd.txt (the 4MB human-readable release;
no SQLite required). Initial pass: 34%. After parser fix for
single-digit homonyms (DPD uses both "me 1" and "a 1.1" styles): 81.6%
coverage on MN10 (436/534 surface forms). Remaining 18% are mostly
compounds (sammāsambuddhassa, ajjhattikabāhiresu) and inflected verb
forms that live in DPD's SQLite inflection table. Documented as
unmatchedSurfaces in manifest.json; SQLite escalation is a future
commit if curation needs higher coverage.
Files:
- scripts/build-dpd.ts — Node TS, no native deps. Downloads pinned DPD
release dpd-txt.zip (4MB) on first run, caches in data/_raw/
(gitignored), parses to structured DpdRecord, fetches bilara MN10
Pāli root, extracts surface forms, resolves via stem-stripping +
quotative marker handling + locative→stem restoration. Projects to
LexiconEntry shape per the providers types added in commit A.
- data/dpd/mn10/headwords.json (618KB) — lemma → LexiconEntry[]
- data/dpd/mn10/forms.json (20KB) — surface → lemma candidates
- data/dpd/mn10/manifest.json — coverage stats + unmatched surfaces
- data/LICENSE-DATA.md — CC BY-NC-SA 4.0 with DPD attribution +
placeholders for VRI / bilara / future providers
- .gitignore — data/_raw/ added (upstream zip + extracted txt)
- package.json — `npm run build:dpd` script entry
Pinned release: dpd-db v0.4.20260501 (May 2026). Re-ingest with
`npm run build:dpd -- --force` after bumping DPD_RELEASE_TAG in the
script. Monthly cadence upstream.
Verified: 42 pre-existing provider tests still pass. No app code
changed in this commit; the DpdProvider that consumes this data lands
in B.2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the MN10 DPD dataset from B.1 into a LexiconProvider that
hand-curation scripts and (after B.3) the live LLM compiler will
both call.
services/providers/dpd.ts (isomorphic — no Node imports)
- DpdProvider implements LexiconProvider
- Lookup strategy:
1. Direct lemma match in headwords map
2. Surface-form match via forms map → resolve candidate lemmas,
merge entries deduped by sourceId
- Direct match is preferred over surface-form match
- Lemma normalisation (trim + lowercase) at lookup time
- mergeDpdData helper for combining multiple per-sutta subsets
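The lookup strategy above can be sketched as a pure function. The `DpdData` shape (two maps) and the reduced `LexiconEntry` are assumptions for illustration, and the sourceIds in any sample data would be synthetic; the real DpdProvider is not shown in this PR page.

```typescript
// Reduced stand-ins for the real types.
interface LexiconEntry { sourceId: string; lemma: string; gloss: string }
interface DpdData {
  headwords: Map<string, LexiconEntry[]>; // lemma -> entries
  forms: Map<string, string[]>;           // surface form -> candidate lemmas
}

// Normalisation applied at lookup time: trim + lowercase.
function normalise(input: string): string {
  return input.trim().toLowerCase();
}

function lookupDpd(data: DpdData, input: string): LexiconEntry[] {
  const key = normalise(input);
  if (key === "") return [];

  // 1. A direct lemma match wins over any surface-form route.
  const direct = data.headwords.get(key);
  if (direct && direct.length > 0) return direct;

  // 2. Otherwise resolve surface form -> candidate lemmas and merge
  //    their entries, dropping duplicates by sourceId.
  const seen = new Set<string>();
  const merged: LexiconEntry[] = [];
  for (const lemma of data.forms.get(key) ?? []) {
    for (const entry of data.headwords.get(lemma) ?? []) {
      if (!seen.has(entry.sourceId)) {
        seen.add(entry.sourceId);
        merged.push(entry);
      }
    }
  }
  return merged;
}
```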
services/providers/dpd-loader-fs.ts (Node-only, separate from dpd.ts
so browser bundles don't accidentally pull node:fs)
- loadDpdSubsetFromFs(suttaUid, dataRoot?) — single subset
- loadAllDpdSubsetsFromFs(dataRoot?) — merge every subset under
data/dpd/, silently skipping siblings without headwords.json
services/providers/dpd.test.ts (20 tests)
- Synthetic data: direct match, surface→lemma, normalisation,
multi-candidate, missing-lemma, empty-input, direct-vs-surface
preference, provider contract (id/label/license)
- mergeDpdData: conflict resolution, empty sources, sources without
forms
- Real MN10 integration: 5 tests that load data/dpd/mn10 and verify
common lemmas (sati, viharati, bhikkhu) resolve; locative surface
(kāye) routes via forms to kāya; DPD POS=fem projects to
MorphHint.gender='f'; absent lemmas return empty
services/providers/index.ts
- Re-exports DpdProvider + types
- Does NOT register DPD into defaultLexiconRegistry — registration
requires environment-specific loaders (Vite glob in browser,
fs.readFileSync in Node) and is wired in commit B.3. Hand-curation
scripts construct directly via the FS loader.
Verified: 62 provider tests pass (42 from commit A + 20 new); 65
existing sutta-studio tests still pass. No app code touched —
compiler wiring lands in B.3.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mit E)
Per ADR SUTTA-008 §Build order step 7. Unblocks phase-by-phase hand
re-curation of demoPacket.json (task #14) by making every lemma lookup
grounded in real attestation rather than memory.
Usage:
npm run sutta:lookup -- --phase phase-a
npm run sutta:lookup -- --lemmas evaṁ,me,sutaṁ
npm run sutta:lookup -- --phase phase-a --json > /tmp/out.json
For a phase, the script reads demoPacket.json, extracts every paliWord
surface form (concatenated segment text), calls every registered
provider in parallel, and prints per-source blocks:
[ 1] evaṁ (function) — wordId=a1
✓ SC dictionary_full (3 entries): ...
✓ DPD (5 entries):
• eva [ind]: only; just; merely; exclusively
citationId: cite:dpd:dpd:18051
...
The citationIds shown are deterministic — the curator pastes them into
Sense.sourceCitationIds and the Citation entry materialises via the
helpers from commit A.
Implementation notes:
- Constructs a fresh LexiconProviderRegistry (never mutates the
default). Registers SC + DPD (the DPD subset for the sutta is loaded
via dpd-loader-fs.ts).
- --json mode emits a structured blob for programmatic consumption.
- Errors in one provider don't poison the others (registry isolation
behaviour from commit A).
- Network calls happen via SC's existing fetchJsonViaProxies path.
Smoke-tested against phase-a (evaṁ, me, sutaṁ):
- DPD returns 5 senses for evaṁ, 6 case-by-case senses for me, 6
entries for sutaṁ (past participle, neuter noun, masc/fem homonyms
suta/sutā) — each with deterministic citationIds and structured
MorphHint where the POS maps.
- SC dictionary_full returns 3 entries per lemma; their first-sense
extraction shows "(no sense)" for shapes my SC parser doesn't fully
handle (raw payload is still preserved in rawExcerpt for the curator
to read). Known refinement.
Verified: 62 provider tests still pass; no app code changed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… commit B.3)
Closes the loop on ADR SUTTA-008's keystone principle — hand-curation
and the LLM compiler now share the same data layer. Every Sutta Studio
compilation at /sutta/<uid> consults DPD alongside SC dictionary_full,
giving the lexicographer LLM structured morphology + multiple
attested senses per Pāli word with deterministic citationIds.
Additive, low-risk wiring:
- DictionaryCache (IndexedDB) unchanged; cached SC payloads still load
- SC fetchDictionaryEntry path unchanged
- DPD is consulted *in parallel* as a separate data source
- Old callers of buildLexicographerPrompt without dpdLookups get
exactly the prior prompt (optional parameter)
- If DPD loading or lookup fails for any reason, the lexicographer
pass logs a warning and falls through to SC-only behaviour
New: services/providers/dpd-loader-vite.ts
- Uses Vite's `import.meta.glob` to eager-load every bundled
data/dpd/<sutta>/{headwords,forms}.json at build time
- Merges into one DpdData singleton via getBundledDpdData()
- Wrapped in try/catch so any glob resolution issue degrades
gracefully rather than breaking the compiler
Modified: services/compiler/prompts.ts
- buildLexicographerPrompt accepts optional dpdLookups param
- When present, renders a structured "DPD attestations" block:
• lemma [pos]: first-sense gloss {morphHint} cite=cite:dpd:dpd:N
- Caps at 5 entries per word so compound-heavy phases stay bounded
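One way the "DPD attestations" block might be rendered, shown as a sketch. The `DpdAttestation` shape, the function name `renderDpdAttestations`, the header line, and the JSON rendering of the morph hint are assumptions; only the bullet layout and the cap of 5 come from the commit message.

```typescript
// Hypothetical projection of a DPD entry for prompt rendering.
interface DpdAttestation {
  lemma: string;
  pos: string;
  gloss: string;                       // first-sense gloss
  morphHint?: Record<string, string>;  // e.g. { gender: "f" }
  citationId: string;
}

const MAX_ENTRIES_PER_WORD = 5; // keeps compound-heavy phases bounded

function renderDpdAttestations(
  lookups: Record<string, DpdAttestation[]>,
): string {
  const lines: string[] = ["DPD attestations:"];
  for (const [word, entries] of Object.entries(lookups)) {
    lines.push(`${word}:`);
    for (const e of entries.slice(0, MAX_ENTRIES_PER_WORD)) {
      const morph = e.morphHint ? ` ${JSON.stringify(e.morphHint)}` : "";
      lines.push(`• ${e.lemma} [${e.pos}]: ${e.gloss}${morph} cite=${e.citationId}`);
    }
  }
  return lines.join("\n");
}
```

When `dpdLookups` is absent the caller would simply skip this block, which is how older callers keep getting the prior prompt.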
Modified: services/compiler/index.ts
- After SC dictionary fetch + cache work, runs DpdProvider.lookup
for every content word in parallel against the bundled subset
- Logs DPD hit rate: "DPD attestations: 14/15 words matched"
- Passes dpdLookups to buildLexicographerPrompt
Verified:
- 157 tests pass across 9 suites (no regressions in provider, compiler,
sutta-studio-utils, or rehydrator tests)
- Typecheck clean for changed files (5 pre-existing baseline errors
elsewhere unchanged)
- `npx vite build` succeeds; DPD JSON shards bundle in (~656KB
contribution from data/dpd/mn10/, well under chunk-size limits
already triggered by pre-existing chunks)
This is the final commit of Tier-1 B. Tier-1 sequence so far:
A 9168b5a provider abstraction + Citation extension
B.1 82fae37 DPD ingestion + MN10 subset (81.6% coverage)
B.2 49d3eba DpdProvider impl + tests
E 5ff46c0 curation helper (npm run sutta:lookup)
B.3 (this) compiler wired to DPD via Vite-bundled subsets
Tier-1 C (VRI providers) and D (SC bilara + suttaplex providers)
remain. Task #14 (MN10 phase-a re-curation) is now fully unblocked —
both manual lookup (E) and automated compilation (B.3) consult DPD.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s (Tier-1 commit D)
Closes Tier-1 of ADR SUTTA-008 §Build order alongside the deferred C
(Aṭṭhakathā provider — to land as a follow-up). Phase-a re-curation
now has every Tier-1 data source feeding the curation helper.
services/providers/scBilaraVariants.ts
- SuttaCentralBilaraVariantsProvider fetches
raw.githubusercontent.com/suttacentral/bilara-data/published/
variant/pli/ms/sutta/<basket>/<uid>_variant-pli-ms.json
- Parses bilara's free-form "original → reading (witness1, witness2)"
notation; multi-variant entries split on `|`
- Returns VariantReading[] per canonical segment id, compatible with
DeepLoomPacket.provenance.segmentVariants
- Per-sutta cache; missing variant files (common for stable openings
like mn10:1.1) cached as empty so we don't refetch
- License: CC BY-SA 4.0
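A hedged sketch of parsing the "original → reading (witness1, witness2)" notation. Because the notation is free-form, the regular expression below is only one plausible reading of it; the `VariantReading` fields and the sample strings are assumptions, not the provider's actual types or real bilara data.

```typescript
// Assumed result shape; the real VariantReading may carry more fields.
interface VariantReading {
  original: string;
  reading: string;
  witnesses: string[];
  raw: string; // unparsed text kept for the curator
}

// "a → b (w1, w2)" with the parenthesised witness list optional.
const VARIANT_PATTERN = /^(.+?)\s*→\s*(.+?)(?:\s*\(([^()]*)\))?$/;

function parseVariantNote(note: string): VariantReading[] {
  const readings: VariantReading[] = [];
  // Multi-variant entries are separated by "|".
  for (const part of note.split("|")) {
    const raw = part.trim();
    if (raw === "") continue;
    const match = VARIANT_PATTERN.exec(raw);
    if (!match) continue; // malformed entry: skip rather than throw
    const [, original = "", reading = "", witnessList = ""] = match;
    readings.push({
      original: original.trim(),
      reading: reading.trim(),
      witnesses: witnessList
        .split(",")
        .map((w) => w.trim())
        .filter((w) => w !== ""),
      raw,
    });
  }
  return readings;
}
```

Skipping malformed parts instead of throwing keeps one odd note from hiding the other variants of a segment.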
services/providers/scSuttaplex.ts
- SuttaCentralSuttaplexParallelProvider implements ParallelProvider
- Calls https://suttacentral.net/api/parallels/{uid}
- Handles work-level keys (top-level "mn10") and segment-level keys
("mn10#44.1", with # separator) — the latter projected to
canonicalSegmentId form ("mn10:44.1") for consistency with bilara
- Returns ParallelRef[] with deterministic citationIds + sourceIds
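The key handling described above amounts to a small projection. In this sketch the function name `projectParallelKey` and the return shape are hypothetical; only the `#` to `:` conversion is taken from the commit message.

```typescript
// Splits a parallels key into its work id and, when present, a
// canonicalSegmentId using ":" instead of "#".
function projectParallelKey(
  key: string,
): { workId: string; canonicalSegmentId: string | null } {
  const hashAt = key.indexOf("#");
  if (hashAt === -1) {
    // Work-level key such as "mn10".
    return { workId: key, canonicalSegmentId: null };
  }
  const workId = key.slice(0, hashAt);
  const segment = key.slice(hashAt + 1);
  return { workId, canonicalSegmentId: `${workId}:${segment}` };
}

console.log(projectParallelKey("mn10#44.1").canonicalSegmentId); // "mn10:44.1"
```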
services/providers/index.ts
- Re-exports both providers + their result types
scripts/sutta-studio/lookup-phase.ts
- At phase start, prints "Phase-level evidence":
• Parallels for the sutta (top 8 of N) — workId, type, title, acronym
• Variant readings for the phase's canonicalSegmentIds, or "(none)"
when stable across witnesses
- --json mode includes the parallels + variants in the structured blob
Verified end-to-end against phase-a / mn10:1.1:
## Phase-level evidence
Parallels for mn10 (top 8 of 16):
→ dn22: full · The Long Discourse about the Ways of Attending to Mindfulness · DN 22
→ ea12.1: full · The One Way In Sūtra · EA 12.1
→ ma98: full · 念處 · MA 98
→ mn119: full · Mindfulness of the Body · MN 119
...
Variant readings: (none for mn10:1.1 — stable across witnesses)
Tests:
- 7 tests for SuttaCentralBilaraVariantsProvider (parse, cache,
network failure, malformed entries)
- 9 tests for SuttaCentralSuttaplexParallelProvider (work-level,
segment-level, # → : conversion, merging, fallbacks)
- 173 total tests pass across 11 suites; no regressions
Tier-1 status:
A 9168b5a provider abstraction + Citation extension
B.1 82fae37 DPD ingestion + MN10 subset (81.6%)
B.2 49d3eba DpdProvider impl + tests
E 5ff46c0 curation helper script
B.3 bc46e47 compiler wired to DPD via Vite bundle
D (this) SC bilara variants + suttaplex parallels providers
C pending VRI edition + Aṭṭhakathā commentary (deferred per ADR
§Open Questions #4; alignment unknown; not blocking
phase-a re-curation)
Task #14 MN10 phase-a re-curation is now fully unblocked with every
Tier-1 data source flowing through the curation helper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…log skeleton
Locks the curation process before phase-a re-curation begins.
Operationalises task #14 (MN10 phase-by-phase re-curation) with
explicit gates, artifact shapes, and human-approval moments.
The discipline this protocol enforces: Schema and UI insights are
extracted *after* the packet diff, not allowed to hijack the packet
diff.
Two new docs:
docs/sutta-studio/CURATION_PROTOCOL.md (327 lines)
- §0 Five invariant questions (always-load-bearing)
- §1 The 12-step loop (brief → evidence → … → commit → issue
extraction)
- §2 Artifact shapes: phaseBrief (with required `tension` field),
evidenceBundle (with inline excerpts — gate-check must be semantic,
not syntactic), alignment scaffold, epistemic classification table,
curation log entry
- §3 Three gates: Evidence Gate, Ghost Gate, Affordance Gate.
`required` GhostKind is discouraged as default; curator must name a
specific kind from the expanded set.
- §4 Role locks for future multi-agent runs (curator can edit packet +
curation/, builder can edit services/types/tests, observer reads
.runs/ ledger only, human at semantic gates)
- §5 Eight specific human-approval moments
- §6 Batching recommendation for MN10 first 15 phases (a alone, then
b-d, e-h, i-o; re-evaluate before 16-51)
- §7 Machine-observability deferred (L1 logs → L2 tmux → L3
events.jsonl), with sutta_curation_conductor angel sketched but not
built; per the "earn-the-externalization" principle, protocol
stabilises first
- §8 Where each kind of content lives (curation log vs commit message
vs FEATURES.md vs ADR vs new issue)
- §9 Known failure modes + remedies
- §10 Four open protocol questions for the next refinement cycle
docs/sutta-studio/curation/phase-a.md (141 lines)
- Skeleton with section-by-section to-fill markers
- Committed empty so the protocol structure is in place before work
begins; filled during the actual phase-a run
Refinements absorbed into this protocol:
From Aditya's draft Grounded Curation Loop proposal:
- Phase brief comes first
- Evidence bundle as curated artifact, not terminal spam
- Epistemic classification before tooltips
- Curation log as durable why-behind-what record
- Three gates (evidence, ghost, affordance)
- Batch sizing (a, b-d, e-h, i-o)
From my refinement pass:
- Evidence bundle includes inline `excerpt` + `decisionRelevance`, not
just citationIds. Closes the audit loop in one pass.
- Phase brief includes `tension` field. Every UI affordance must
resolve a named tension or it's decorative.
From Aditya's multi-agent observability sketch:
- Role locks codified (curator / builder / observer / human)
- Three escalation levels (L1 logs → L2 tmux → L3 ledger)
- Curation conductor angel deferred — script-then-protocol sequencing
inverted; iterate protocol first
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ce (steps 0-2)
Per CURATION_PROTOCOL.md §1, fills the first three steps of the
Grounded Curation Loop for phase-a before any packet edit. Awaits
human gate at §evidenceBundle before proceeding to alignment.
§0 Phase brief
- 3 tensions named (primary: grammar-bridge — oblique 'me' + past-
participle 'sutaṁ' + ghost 'I')
- Secondary tensions: evaṁ polysemy, transmission-frame
- Plain-language summary: 4 English words conceal 3 grammatical
facts + 1 transmission-frame claim
§1 Current packet snapshot
- 3 paliWords, 4 englishTokens, 1 relation
- Strengths: dual-register tooltips already in place, content/function
split respected
- Gaps from new schema: MorphHint empty (case/form/gender across all
segments), epistemicBasis + sourceCitationIds empty across all
senses, isAnchor unset on candidate (sutaṁ), ghostKind on 'have'
using catch-all 'required' instead of specific 'auxiliary',
refrainId open question (cross-sutta scope)
§2 Evidence bundle (7 usable citations from DPD)
- eva: dpd:18054 (emphatic narrative-opener, PRIMARY) + dpd:18051
(restrictive sense, SECONDARY for polysemy tooltip)
- me: dpd:53164 ('by me', PRIMARY) + dpd:53163 ('myself/me-object',
SECONDARY for case polysemy)
- suta: dpd:63769 (pp 'heard', PRIMARY) + dpd:63771 (nt 'what is
heard', PRIMARY for sense 2) + dpd:63770 (literal 'learned',
SECONDARY)
- Parallels: 16 work-level (DN22, MN119, MA98...) — flagged as
schema tension in §10 (where do work-level parallels live?)
- Variants: ZERO for mn10:1.1 — opening stable across all witnesses
- Gaps: SC dictionary_full low-info (parser limitation), Aṭṭhakathā
not wired (commit C deferred), comparative basis not wired
Next step (after human review): §3 alignment scaffold. No demoPacket
edit yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…3-§10
Gate verdict from Aditya: approve §0-§2 conditionally with 6
amendments + clean deferral. All applied:
1. transmission-frame tension marked as packet-level (resolutionSurface:
provenance.attribution + narrator frame + recited-speech span);
phase-a introduces but does not resolve.
2. DPD evidence for evaṁ DOWNGRADED. Verified directly: dpd.txt has
NO evaṁ/evaṃ headword, only eva (the bare particle) + evam-prefixed
sandhi compounds. The stem-stripper in build-dpd.ts conflates
evaṁ → eva mechanically, but the senses of eva (only/just/merely/
indeed/still) do NOT include the 'thus / in this way' reading
required for the opening formula. This is derivational, not
inflectional. Provider issue logged in §10 as fix-targets.
3. evaṁ sense upgraded to "Thus" + nuance "Narrative-opening deictic
('in this way') — points forward to what is about to be recited."
Grounded in Pāli grammar (Geiger §66 / Warder Ch.13), not DPD.
4. sutaṁ confirmed as anchor candidate. Both pp 'heard' and nt 'what
is heard' DPD citations preserved. Tooltips reframed per amendment:
the -ṁ ending marks neuter nom/acc sg declension; it does NOT
nominalize. Substantive use is a syntactic possibility of past
participles in Pāli generally.
5. 16 work-level Suttaplex parallels NOT placed in phase-a.parallels
(option c). Logged in §10 as schema gap — DeepLoomPacket needs a
packet-level workParallels field separate from PhaseView.parallels.
6. Filled §3 alignment, §4 linguistic, §5 bridge, §6 pedagogy, §7
epistemic audit, §8 decisions, §9 open questions, §10 schema/UI
tensions surfaced.
§7 epistemic audit table covers every claim:
- 5 lexical claims with citationIds
- 5 etymological claims (Pāli grammar)
- 5 curator-inferred claims (explicitly marked)
- 0 naked authoritative claims
§10 surfaces 6 schema/UI tensions for separate follow-up commits:
- DPD stem-stripper conflates derivational forms
- Packet-level workParallels schema gap
- Cross-sutta formula recurrence (extend Span.kind)
- Implicit-subject "I" rendering affordance
- SC dictionary_full parser limitation
- First-class curator-inference marker
No demoPacket.json edit in this commit. The proposed JSON diff
follows as a chat artifact for the second gate; demoPacket.json
edits land only after gate approval.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First MN10 phase re-curated via the Grounded Curation Loop protocol
(docs/sutta-studio/CURATION_PROTOCOL.md). Seven localized changes
in demoPacket.json phase-a + four new packet.citations entries.
Second-gate verdict from Aditya: "Approve to apply, with small
wording/schema-basis amendments before commit." All seven amendments
applied. See docs/sutta-studio/curation/phase-a.md §12 for the
amendment list.
Packet changes (phase-a):
1. a1 (evaṁ) sense upgraded to grounded deictic:
"Thus" + nuance "Narrative-opening deictic ('in this way') — points
forward to what is about to be recited" + notes contrasting with
bare-eva senses + epistemicBasis: 'etymological' (placeholder —
should be 'grammatical' once enum extends, §10.7).
2. a1.s1 tooltip softened: "Do not confuse evaṁ with bare eva, whose
common senses include 'only', 'just', or 'indeed'. Here evaṁ
functions as the adverbial deictic: 'thus; in this way.'"
3. a1.s2 tooltip rewritten: "[Niggahīta -ṁ] Marks the surface form
evaṁ, the adverbial deictic 'thus; in this way.'"
4. a2 (me): morph: { case: 'gen' } added; sense gets
sourceCitationIds + epistemicBasis: 'lexical' + confidence: 'high'
tied to cite:dpd:dpd:53164. Relation (a2→a3 "Heard BY") gets
confidence + epistemicBasis: 'etymological' (placeholder for
'grammatical', §10.7).
5. a3 (sutaṁ) marked isAnchor: true. Three existing senses retained,
each gets sourceCitationIds + epistemicBasis: 'lexical' grounded in
cite:dpd:dpd:63769 (pp 'heard'), :63771 (nt 'what is heard'),
:63770 (pp 'learned'). a3.s2 (ta) gets morph: { form: 'participle' };
a3.s3 (ṁ) gets morph: { gender: 'n', number: 'sg' } AND its
tooltips reframed to avoid the false "nominalizes" claim — now
says "Declensional ending; supports either reading; English perfect
collapses them."
6. ea2g.ghostKind: 'required' → 'auxiliary' (specific kind from
FEATURES.md §2.3 expanded set).
7. packet.citations populated with 4 entries (the cited DPD records)
carrying provenance + excerpt + license + fetchedAt — baked in so
the future audit/disagreement UI renders without re-fetch.
Curation log (docs/sutta-studio/curation/phase-a.md):
- All 12 sections filled
- §10 surfaces 9 schema/UI tensions (was 6; added 7-9 from the gate):
grammatical/curatorial EpistemicBasis values, function on MorphHint,
finer ghostKinds
- §11 Outcome documents test + build results
- §12 captures all 7 amendments from Aditya's second-gate review
Provider issue logged for follow-up commit (§10.1): DPD stem-stripper
conflates derivational forms (evaṁ → eva). Fix targets:
scripts/build-dpd.ts.
Verified:
- JSON parses cleanly; field-level readback confirms every edit
- 173/173 tests pass across 11 suites
- npx vite build succeeds (23.30s)
Tier-1 + first phase = first real grounding event. The live /sutta/demo
phase-a will improve once main updates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Kind + arrow calm-default)
First renderer chunk after the phase-a re-curation. Per Aditya's
"implicit visual grammar first, explicit explanation second"
discipline: most schema fields should become FELT structure, not
labels. This commit makes three of them visible without prose.
components/sutta-studio/PaliWord.tsx
Anchor styling on PaliWord.isAnchor=true:
- subtle warm underline (border-amber-700/30, only when no
refrain underline already present)
- slight weight increase (font-medium) on the word's serif text
Implicit cue — no badge, no "★", no label. A felt difference only.
Phase-a's `sutaṁ` is the first word in MN10 with isAnchor=true.
components/sutta-studio/EnglishWord.tsx
Per-kind ghost styling — implicit visual grammar, not labels:
- auxiliary → soft solid underline (border-slate-700/50)
- pronoun_from_verb → faint dotted line (border-slate-700/70 dotted)
- interpretive → italics only (already from isGhost styles)
- required → dotted underline (existing behavior preserved)
- article/copula/preposition_from_case/punctuation/quote_marker
→ default ghost styles only
Phase-a's "have" ghost (now ghostKind: 'auxiliary') visibly
distinguishes from generic 'required' ghosts.
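The per-kind styling can be pictured as a lookup table from ghostKind to extra class names. This is a sketch under assumptions: the kinds and the two slate colours are the ones named above, while the `border-b`, `border-solid` and `border-dotted` utilities and the helper name `ghostKindClass` are illustrative guesses, not the component's actual code.

```typescript
// Ghost kinds named in this commit message.
type GhostKind =
  | "required" | "auxiliary" | "pronoun_from_verb" | "interpretive"
  | "article" | "copula" | "preposition_from_case"
  | "punctuation" | "quote_marker";

// Extra classes layered on top of the default ghost styles.
// An empty string means "default ghost styles only".
const GHOST_KIND_CLASSES: Record<GhostKind, string> = {
  auxiliary: "border-b border-solid border-slate-700/50",
  pronoun_from_verb: "border-b border-dotted border-slate-700/70",
  interpretive: "", // italics already come from the base ghost styles
  required: "border-b border-dotted", // existing behaviour preserved
  article: "",
  copula: "",
  preposition_from_case: "",
  punctuation: "",
  quote_marker: "",
};

function ghostKindClass(kind: GhostKind | undefined): string {
  return kind ? GHOST_KIND_CLASSES[kind] : "";
}
```

Typing the table as `Record<GhostKind, string>` makes the compiler flag any kind added later without a styling decision.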
components/sutta-studio/SuttaStudioView.tsx
Arrow default opacity: 0.4 → 0.2.
Arrows are interactions, not furniture. Default state is quieter
so the hover-summon feels like the arrow appears FOR the reader,
not always-on clutter. Focused state unchanged at 0.9.
This is chunk 1 of the renderer arc:
1. ✓ visual hints + arrow calm-default (this commit)
2. About-this-text provenance panel
3. Tooltip restructure (plain/grammar split, click-Pāli cycles facets)
4. Per-sense citation chips + DPD modal
Phase-b curation paused until visual hints land and we look at
phase-a in the browser to test the implicit-first principle
empirically.
Verified: 109 provider + sutta-studio component tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ooltip
Aditya observed that clicking a Pāli segment pins its tooltip, but
the visual state doesn't communicate "this is pinned now" vs
"this is just on hover" — the cursor stays as help-? and the
styling is identical. Confusing.
Two small changes give the pin state a felt signal:
components/sutta-studio/PaliWord.tsx
Separate isPinned styling from isHovered:
- hovered: white underline + bg (existing)
- pinned: emerald-600/70 underline + bg + thin emerald ring
around the segment (ring-1 ring-emerald-700/40)
+ cursor switches to 'pointer' (signals "click target")
Hover and pin no longer share visual state.
components/sutta-studio/Tooltip.tsx
Accept optional pinned: boolean prop:
- border becomes emerald-700/70 (vs slate)
- small '×' glyph in the top-right corner indicating the
tooltip is pinned and the segment can be clicked to unpin.
- The × is decorative (pointer-events-none); actual unpinning
happens by clicking the segment, same as before.
This is the A piece of "A then B" — quick cosmetic fix to make
the existing click-pin behavior legible. B (Chunk 3, tooltip
restructure with click-cycles-facets) replaces the pin model
entirely; this distinction will be repurposed for facet-state.
Verified: 44 tests pass (SuttaStudioView + sutta-studio-utils).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the click-toggles-pin model on Pāli segments with the
grounded-curation click model: each click on a Pāli segment pins
the tooltip AND advances to the next tooltip facet. To dismiss the
pinned tooltip, click the × glyph (now interactive).
Per the user's curation rhythm: Pāli stays put; click on Pāli
cycles the tooltip's facets (Meaning / What English hides / Example
/ etc.). Click on English continues to cycle senses (separate
mechanism). The asymmetry maps to what each side is for —
Pāli is canonical, English is plural.
SuttaStudioView.tsx
- New state: tooltipFacetIndices: Record<`${phaseId}-${segId}`, number>
- New function: cycleSegmentTooltipFacet(phaseId, segId) advances
the facet index, wrapping at seg.tooltips.length. No-op for
segments with ≤1 tooltip.
- Plumbs both to PaliWordEngine.
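The facet-cycling rule can be sketched as a pure state update. The function name `nextFacetIndices` and passing the tooltip count as a parameter are assumptions; the real `cycleSegmentTooltipFacet(phaseId, segId)` presumably reads the segment itself.

```typescript
// Keyed by `${phaseId}-${segId}`, as in the commit message.
type FacetIndices = Record<string, number>;

// Returns the next state: advance by one, wrapping at tooltipCount.
// Segments with 0 or 1 tooltips leave the state untouched.
function nextFacetIndices(
  indices: FacetIndices,
  phaseId: string,
  segId: string,
  tooltipCount: number,
): FacetIndices {
  if (tooltipCount <= 1) return indices;
  const key = `${phaseId}-${segId}`;
  const current = indices[key] ?? 0;
  return { ...indices, [key]: (current + 1) % tooltipCount };
}
```

Returning a new object instead of mutating keeps the update compatible with React state setters, which compare by reference.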
PaliWord.tsx
- Accepts tooltipFacetIndices + cycleSegmentTooltipFacet props
- Picks tooltip from seg.tooltips[facetIdx] (was: implicitly
coupled to word sense index — buggy when sense count ≠
tooltip count)
- onClick: always pins the segment (no more toggle-unpin on
same-segment click) + calls cycleSegmentTooltipFacet + still
cycles segment senses if present
- Passes facetIndex/facetTotal/onUnpin to Tooltip
Tooltip.tsx
- Accepts facetIndex + facetTotal: shows a small "1/3"-style
indicator in the top-left when there are multiple facets so
the reader knows clicking again advances
- Accepts onUnpin: makes the × glyph an actual <button>
(pointer-events-auto on the button only; the tooltip body
stays pointer-events-none so it doesn't intercept hovers from
the segment below)
Verified: 109 tests pass (SuttaStudioView + sutta-studio-utils +
all provider suites).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… of viewport
Two tooltip layout bugs surfaced via Playwright verification of phase-a:
1. Horizontal overflow: long facet text on right-edge segments clipped
past viewport — caused by `whitespace-nowrap` forcing single line.
Fix: `whitespace-normal break-words` + `max-w-[min(28rem,90vw)]` +
`w-max`. Tooltip wraps to multi-line for long text but never wider
than 90% of viewport.
2. Vertical overflow at top: phase-a sits at the top of the scroll
container; with the default "above segment" position (-top-14)
the tooltip rendered off-screen at y=-14.
Fix: useLayoutEffect measures the *segment*'s position (offsetParent)
— not the tooltip's — and flips to below-segment positioning
(top-full mt-3) when there isn't room above. Measuring the segment
is invariant under flip; measuring the tooltip caused an infinite
loop (above → top<8 → flip below → top>8 → flip back).
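The flip decision in fix 2 can be reduced to a pure function of the segment's position, which is why it cannot oscillate. In this sketch `choosePlacement` and the `requiredSpaceAbovePx` threshold are hypothetical names; the commit does not state the exact threshold used.

```typescript
type Placement = "above" | "below";

// Decides placement from the SEGMENT's top offset only. The input does
// not change when the tooltip moves, so re-running the measurement after
// a flip returns the same answer and no render loop can form.
function choosePlacement(
  segmentTopPx: number,
  requiredSpaceAbovePx: number,
): Placement {
  return segmentTopPx < requiredSpaceAbovePx ? "below" : "above";
}
```

By contrast, a rule keyed on the tooltip's own top edge changes its input every time it flips, which is the loop described above.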
Also added:
- leading-relaxed for readable line-height in multi-line tooltips
- text-left so wrapped tooltips read left-to-right cleanly
- tooltip position uses bottom-full/top-full (dynamic) instead of
fixed -top-14 — adapts to tooltip's actual height
Trade-off: when the tooltip flips below for phase-a, it overlaps part
of the English row ("by me" partially obscured). Acceptable for now
because the tooltip is dismissible via the × glyph or clicking elsewhere.
Future improvement: tooltip could be semi-transparent over English, or
a higher z-index dimmer over the English row when tooltip pinned.
Verified via Playwright:
- sutaṁ ṁ segment click → tooltip at y=150 (below segment), all
overflow checks false (top, bottom, left, right)
- Zero console errors after the infinite-loop fix
- Multi-line wrap working (height=77px = 3 lines for the long
"In the formula 'me sutaṁ'..." facet text)
- 44 component tests still pass
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…provenance
Lands Level 3 audit per ADR SUTTA-008 §UI Vision and Aditya's "the audit
you are most hungry for" framing — what historical/textual object am
I reading? — work, expression, edition, transmission, translator,
annotation layer, and crucially: visible unknowns.
components/sutta-studio/AboutThisText.tsx (new)
- Compact chip mounted at top of content:
"▶ MN10 · Pāli · Theravāda · tr. Bhikkhu Sujato about"
- Click expands a structured panel with sections:
Work · Expression · Edition · Translation ·
Traditional Attribution · Annotation Layer ·
External References · Unknowns
- Traditional Attribution surfaces Provenance.attribution.confidence
("traditional" / "attested" / "disputed") — keeps tradition from
laundering itself as fact (the user's "make beliefs pay rent")
- Unknowns section makes gaps visible as honest prose, not blank
fields:
- "No single manuscript witness is attached to this packet"
- "First written attestation date is not recorded"
- "mn10:1.1 has no recorded variants" (when relevant)
- "Buddhaghosa's commentary not yet wired"
- Annotation Layer is rendered as static prose (LexiconForge curator,
DPD-backed). A first-class `Provenance.annotationLayer` schema
field is a follow-up — flagged in §10.7 of phase-a curation log.
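The Unknowns idea can be sketched as a pure function from provenance to prose. The field names `firstWritten` and `manuscripts` and the two sentences are taken from this commit; the interface shape and the `collectUnknowns` helper are assumptions for illustration, not the panel's real implementation.

```typescript
// Assumed shape: only the two field names come from the commit text.
interface ProvenanceSketch {
  firstWritten?: { date: string };
  manuscripts?: Array<{ id: string }>;
}

// Turn each missing field into an explicit sentence instead of a blank row.
function collectUnknowns(provenance: ProvenanceSketch): string[] {
  const unknowns: string[] = [];
  if (!provenance.manuscripts || provenance.manuscripts.length === 0) {
    unknowns.push('No single manuscript witness is attached to this packet');
  }
  if (!provenance.firstWritten) {
    unknowns.push('First written attestation date is not recorded');
  }
  return unknowns;
}

// The demo packet leaves both fields out, so both gaps are surfaced.
const demoUnknowns = collectUnknowns({});
```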
components/sutta-studio/SuttaStudioView.tsx
- Mount <AboutThisText packet={packet} /> after ScrollProgressBar,
before the phases. The chip is the first thing the reader sees;
the body remains collapsed by default.
components/sutta-studio/demoPacket.json
- Populate packet.provenance with what we honestly know about MN10:
attribution: speaker, audience, legendary place/date, confidence='traditional'
oralLineage: Theravāda, Pāli, 5th–1st c. BCE, bhāṇaka recitation
edition: Mahāsaṅgīti Tipiṭaka 1956, Sixth Council, VRI digital source
translation: Bhikkhu Sujato (2018, CC0, via SuttaCentral)
external: link to suttacentral.net/mn10
- Deliberately NOT populated: firstWritten, manuscripts (we don't
have witness data for this specific packet — surfaces as
Unknowns prose)
Verified via Playwright:
- Chip renders with all 4 breadcrumb parts (workId, language,
tradition, translator)
- Click expands → 8 sections render with correct field plumbing
- "Certainty: traditional" appears on the attribution row
- "No single manuscript witness..." appears in Unknowns
- Zero console errors
Tier-1 + first phase + renderer arc + Level 3 audit affordance ✓
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second MN10 phase re-curated via the Grounded Curation Loop. Six
localized changes + 5 new packet.citations. Aditya's gate-2 verdict:
approve with 2 amendments. Both applied.
Packet changes (phase-b):
1. b1 (ekaṁ): morph case=acc, number=sg on stem; case=acc on ṁ suffix.
sourceCitationIds on the 'one' sense → cite:dpd:dpd:17376 (card 'one')
+ cite:dpd:dpd:17382 (ind 'once'). confidence: high, lexical basis.
2. b2 (samayaṁ): **isAnchor: true** (amendment from Aditya — moved from
b3 to b2 because phase-b's bridge-learning point is 'ekaṁ samayaṁ
→ At one time', not 'bhagavā'). Morph on segments: gender=m on
aya root; case=acc, gender=m, number=sg on ṁ suffix. All 3 senses
get citationIds (cite:dpd:dpd:59451 for 'occasion' + 'time';
cite:dpd:dpd:59452 for 'opportunity'). Relation b2s3 → b3 'Time WHEN'
gets confidence: high + epistemicBasis: 'etymological' (placeholder
for 'grammatical' once enum extends — phase-a §10.7).
3. b3 (bhagavā): **NOT marked isAnchor** per amendment. Retains
refrainId: 'bhagava' (sufficient marker for recurring Buddha-epithet).
Morph case=nom, gender=m, number=sg on vā segment. All 3 senses get
cite:dpd:dpd:49147 (DPD packages all 3 readings — 'Sublime One' /
'Blessed One' / 'Fortunate One' — into one entry tagged as
Buddha-epithet).
4. eb1g 'At' ghost: ghostKind 'required' → **'preposition_from_case'**.
Specific reason: Pāli accusative-of-time-when (the -ṁ on ekaṁ and
samayaṁ) surfaces as English preposition. Per ADR ghost gate, the
specific kind is mandated over the catch-all.
5. packet.citations: 5 new DPD entries (eka×2, samaya×2, bhagavā×1).
Citations now total 9 (4 phase-a + 5 phase-b).
Curation log (docs/sutta-studio/curation/phase-b.md):
- All sections filled with gate-2 amendments recorded in §8.
- §7.10 new schema tension: RelationType lacks 'temporal' value.
Phase-b's b2→b3 'Time WHEN' is logically temporal, currently
encoded as 'location' as a hack. Proposed: add 'temporal' enum
value + renderer palette color. Issue to file.
- §7.11 cross-references phase-a §10.7 (EpistemicBasis missing
'grammatical' enum value — same gap surfaces here for the
relation's basis).
Verified:
- JSON parses; field-level readback confirms every amendment landed
- b3.isAnchor is None (NOT set); b2.isAnchor is True
- Citations array has 9 entries
- 44 sutta-studio component tests + utils tests pass
- npx vite build succeeds (built in 18.65s — including packet
bundling, no chunk-size regression)
Tier-1 + first 2 phases curated + renderer arc (chunks 1+A+B+about
panel + tooltip overflow fix) all on feat/opus-grounded-data-layer.
PR #38 now has 11 commits since branching from main at cfdc48c.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… principled tooltip-register check
Aditya's pushback (paraphrased): "Why are those words forbidden?
It feels fragile to have such blacklists. Can't we have some more
principled approach?"
The forbidden-words list (adverbial, deictic, cataphoric, niggahīta,
neuter nom/acc singular, declensional ending, past participle,
genitive, oblique, …) was a useful first heuristic but it has real
failure modes:
- Doesn't scale (new phases add new jargon never on the list)
- Pattern-matches WORDS, not CONCEPTS the reader stumbles on
- Forces euphemism that's sometimes worse than the original term
- Has arbitrary cutoffs (accusative isn't on it; adverbial is)
- Risks mode collapse: writers avoid the word rather than teach
the concept
Replaced with §3.4 Plain-Register Check — three criteria applied
per-tooltip, contextually, not globally:
1. Reader profile (default): a thoughtful adult, no Pāli training,
possibly familiar with popular Buddhism. Plain prose stands
alone for this reader; other readers get the structured
registers (grammar chip, audit modal).
2. Pay-rent rule: every technical term must answer (a) what
concept does it label that the reader needs precision about
and (b) why is precision needed here. If yes to both, keep it
AND gloss it in the same sentence. If no, replace with plain
English. Example: "accusative" pays rent (recurs across many
phases, names a specific case-form). Example: "in the genitive,
functioning adverbially" doesn't pay rent — replace with "the
form 'of me' here works like 'by me'".
3. Register layering: plain prose is for the default reader;
grammar chip allows any term (that's its job); audit modal
carries provider-level technical language by definition.
Plain prose doing the work of all three registers IS the
failure mode.
The old word list is preserved as DIAGNOSTIC EXAMPLES of the
failure tone — not a rule. If the curator finds themselves
reaching for one of those words in plain prose, it's a signal to
pause and check whether plain English carries the load.
Quick self-check before approving a tooltip:
1. Read aloud. Does the plain prose make sense to the default
reader without the bracketed chip?
2. For each technical term, can you answer the two pay-rent
questions?
3. Is anything in plain prose that would be better in the grammar
chip or audit modal?
Slots after §3.3 Affordance Gate as a 4th gate. Cross-references
FEATURES.md §6 (format prescription).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eferences
Per Aditya's gratitude framing: every named source in About-this-text
should resolve to a real public page where the reader can encounter
it directly. Audit + acknowledgment meet in the link.
Additive schema changes to Provenance:
edition: + url? (canonical edition page) + digitalSourceUrl? (the
digital pipeline, e.g., bilara-data GitHub) + licenseUrl? (CC deed)
translation: + url? (publisher / SC sutta page) + licenseUrl?
references?: new top-level Array<{label, url, note?, category?}> —
a curated list of works this packet rests on, framed as
acknowledgments. Category enum for renderer grouping:
'dictionary' | 'translation' | 'edition' | 'manuscript-archive'
| 'scholarly-reference' | 'commentary' | 'other'
All fields optional → no migration; old packets still validate.
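As a rough sketch, the additive fields above could be typed as follows. The field names and category values are from this commit; the type names, the optionality of `edition`/`translation` themselves, and the `groupByCategory` helper (including its fallback to 'other') are assumptions.

```typescript
type ReferenceCategory =
  | 'dictionary' | 'translation' | 'edition' | 'manuscript-archive'
  | 'scholarly-reference' | 'commentary' | 'other';

interface ProvenanceReference {
  label: string;
  url: string;
  note?: string;
  category?: ReferenceCategory;
}

// Only the newly added link fields are shown; existing fields are omitted.
interface ProvenanceLinks {
  edition?: { url?: string; digitalSourceUrl?: string; licenseUrl?: string };
  translation?: { url?: string; licenseUrl?: string };
  references?: ProvenanceReference[];
}

// Assumed renderer helper: uncategorised entries are grouped under 'other'.
function groupByCategory(
  refs: ProvenanceReference[],
): Map<ReferenceCategory, ProvenanceReference[]> {
  const groups = new Map<ReferenceCategory, ProvenanceReference[]>();
  for (const ref of refs) {
    const key: ReferenceCategory = ref.category ?? 'other';
    const bucket = groups.get(key) ?? [];
    bucket.push(ref);
    groups.set(key, bucket);
  }
  return groups;
}

const oldPacket: ProvenanceLinks = {}; // a pre-change packet still type-checks
const newPacket: ProvenanceLinks = {
  translation: { url: 'https://suttacentral.net/mn10/en/sujato' },
  references: [
    { label: 'Digital Pāli Dictionary (DPD)', url: 'https://dpdict.net', category: 'dictionary' },
    { label: 'SuttaCentral MN10 sutta page', url: 'https://suttacentral.net/mn10' },
  ],
};
const grouped = groupByCategory(newPacket.references ?? []);
```

Because every new field is optional, `oldPacket` compiles unchanged, which is the "no migration" property in type form.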
Distinct from:
- external[] — per-text registry links (BDRC, GRETIL, CBETA, …)
- per-sense Citation rows — attest specific glosses with excerpts
The renderer (next commit) will surface references with a
gratitude-register header ("What this packet rests on") and link
through to each upstream source.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… named source becomes a place to visit
Per Aditya's gratitude framing: the About panel was naming sources
without showing their proofs. Each unlinked claim was a missed
opportunity for both audit and acknowledgement. This commit gives
every named source a real link the reader can follow.
components/sutta-studio/demoPacket.json — packet.provenance:
edition: + url (tipitaka.org), + digitalSourceUrl (github.com/
suttacentral/bilara-data), + license string
translation: + url (suttacentral.net/mn10/en/sujato),
+ licenseUrl (creativecommons.org/publicdomain/zero/1.0/)
references: 7 new entries with full attribution prose —
1. Digital Pāli Dictionary (DPD) — dpdict.net, Bryan Levman et
al., CC BY-NC-SA 4.0, release v0.4.20260501
2. DPD source repository — github
3. SuttaCentral aggregated dictionary — /define
4. SuttaCentral bilara-data repo
5. Bhikkhu Sujato MN10 translation — direct sutta page
6. Vipassana Research Institute Tipiṭaka — tipitaka.org
7. SuttaCentral MN10 sutta page
components/sutta-studio/AboutThisText.tsx:
- New ExternalLink helper component with consistent styling
- Edition section: name becomes a link (when url set); digital
source line becomes a link; new license line with linked terms
- Translation section: translator becomes a link; license becomes
a link to its terms (CC0 deed)
- New "What this packet rests on" section rendering references[]:
- Italic gratitude-register intro: "Listed here as
acknowledgement, not just citation. Each link goes to a real
place the source lives — where you can meet the work on its
own terms."
- Per-reference: name as link + small mono-tag category chip
(dictionary/translation/edition/etc.) + multi-line note in
muted text
- External references section unchanged in shape, now uses
ExternalLink helper for consistency
Verified via Playwright:
- 12 outbound links rendering in the expanded panel
- All hrefs reach the expected upstream pages (tipitaka.org,
suttacentral.net, github.com/digitalpalidictionary/dpd-db,
creativecommons.org, etc.)
- 21 type-related tests + provider tests still pass
Two-commit acknowledgment arc complete: 9b5b59c (schema extension
for URLs + references) + this commit (content + renderer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d diff (awaiting gate)
Compressed two-gate format: brief + snapshot + evidence + proposed diff
in one log. Awaits gate verdict before packet edit.
§0 Phase brief — Kurūsu viharati ("[was] dwelling among the Kurus")
3 tensions: locative-plural-as-preposition (primary; no ghost
needed unlike phase-b's 'At'), historical-present (Pāli pres-tense
rendered as English past), place-vs-people-vs-region (3 readings
of the locative)
§2 Evidence bundle
- viharati: clean DPD coverage (cite:dpd:dpd:69661 + 69662) —
both pr=present, finite. EXCELLENT structured morphology data.
- kurūsu: **DPD STEM-STRIPPER BUG HITS AGAIN.** Resolved to
'kura [nt]: rice' (cite:dpd:dpd:22496) — totally wrong; kurūsu
is locative plural of 'kuru' (the Kuru people), not the
unrelated noun 'rice'. Second hit of the same conflation bug
from phase-a (evaṁ → eva). DO NOT cite the kura entry.
- Variants: zero for mn10:1.2 (line stable).
§3 Proposed packet diff
- c1 kurūsu: morph (number=pl on kurū stem; case=loc, number=pl
on su suffix); relation 'Dwelling IN' gets confidence + basis;
senses get nuance + basis but NO sourceCitationIds (provider
bug). Honest grounding: 'etymological' (Pāli grammar inference).
- c2 viharati: **isAnchor: true** (the action verb of the
geographical-frame clause); morph on ti suffix (person=3,
number=sg, tenseAspect=present, form=finite); 3 senses get
cite:dpd:dpd:69661/69662 with confidence: high.
- 2 new packet.citations (viharati's DPD entries).
- No ghost upgrades (phase-c has zero English ghosts — locative
case absorbs into sense gloss).
§5 Schema tensions
- Tension #1 (DPD stem-stripper conflation) hits 2nd time:
phase-a evaṁ→eva, phase-c kurūsu→kura. 2 of 3 phases.
Suggested: if phase-d hits a 3rd time, the case for a fix is overwhelming.
Patch is small (~10 LOC in scripts/build-dpd.ts).
- Tension #7 (EpistemicBasis 'grammatical' gap) hits 3rd time:
phase-a (evaṁ + relation), phase-b (relation), phase-c (c1
senses + relation). 3 of 3 phases. Very strong signal.
- No new tensions from phase-c.
§6 Plain-register flag (not in diff):
c2s3's only tooltip '[Thematic vowel] Class I verb marker'
strips to empty when grammar-terms is off. Fails §3.4 check.
Defer to plain-first tooltip rewrite chunk.
§7 Open questions:
- File the DPD stripper fix before phase-d, or batch later?
- At what tension-hit count do we cut a small fix commit?
Suggested heuristic: 5 hits OR after batch 2 complete.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ration
Phase-a (evaṁ→eva) and phase-c (kurūsu→kura) both surfaced the same
class of provider bug. Aditya: "more data before deciding." Two of
three phases hit a flavour of this, so investigating now.
Root cause was THREE distinct bugs, all in scripts/build-dpd.ts:
Bug 1 — Niggahīta diacritic mismatch (root cause for evaṁ).
DPD uses ṃ (U+1E43, m-with-dot-BELOW; IAST).
SuttaCentral bilara uses ṁ (U+1E41, m-with-dot-ABOVE; ISO 15919).
Same Pāli sound, different Unicode bytes. Direct lookup of bilara's
'evaṁ' against DPD's 'evaṃ' headword failed; the stem-stripper
then fell through to the unrelated particle `eva` ('only/just/
indeed') — entirely different semantics from the adverbial evaṁ
('thus; in this way').
Fix: normalize DPD's ṃ → bilara's ṁ during parsing AND during
surface-form extraction. Single source of truth.
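A minimal sketch of the normalization, assuming it is a plain string rewrite. The function name matches the helper this PR later exports from scripts/build-dpd.ts, but the body here is a guess at its behaviour, and the NFC step is an added assumption.

```typescript
// Sketch only: the real normalizeNiggahita in scripts/build-dpd.ts may differ.
const DOT_BELOW_M = '\u1E43'; // ṃ, as DPD headwords spell the niggahīta
const DOT_ABOVE_M = '\u1E41'; // ṁ, as SuttaCentral bilara spells it

function normalizeNiggahita(text: string): string {
  // Assumption: compose to NFC first so a decomposed "m + combining dot
  // below" is folded too, then map every dot-below m to the dot-above form.
  return text.normalize('NFC').split(DOT_BELOW_M).join(DOT_ABOVE_M);
}

const dpdHeadword = 'eva' + DOT_BELOW_M;   // DPD's spelling of the headword
const bilaraSurface = 'eva' + DOT_ABOVE_M; // bilara's spelling of the same word
const nowMatches = normalizeNiggahita(dpdHeadword) === bilaraSurface;
```

Applying the same function on both the parsing side and the surface-form side is what keeps the two spellings from drifting apart again.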
Bug 2 — Over-greedy 3-char endings 'ūsu' / 'ūhi'.
These were listed as single morphological endings, but locative-
plural is actually -su (with the long ū belonging to a sandhi-
lengthened stem: kuru → kurū before -su). The over-greedy strip
collapsed kurūsu → 'kur', then tried 'kur'+a = 'kura' (the noun
'rice', totally unrelated to the Kuru people).
Fix: removed 'ūsu' and 'ūhi' from PALI_ENDINGS.
Bug 3 — Missing bare 'su' / 'hi' endings + missing vowel-shortening.
Once 'ūsu' was removed, kurūsu still didn't match because:
- 'su' wasn't in the endings list at all (only 'esu' for a-stems)
- even when stripped, kurū → kuru required vowel-shortening
(the locative-plural rule lengthens stem-final vowels)
Fix: added 'su' / 'hi' to endings. Added vowel-shortening logic:
after stripping any case ending, if the stem ends in long ā/ī/ū,
also try the short variant (kurū → kuru, bhikkhū → bhikkhu).
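The strip-then-shorten rule can be sketched as below. This is not the project's tryStemStrips: the endings list is a three-item illustrative subset, and the helper name and return shape are assumptions.

```typescript
// Illustrative subset; the real PALI_ENDINGS list is longer.
const ENDINGS_SKETCH: string[] = ['esu', 'su', 'hi'];

// Long stem-final vowels and their short counterparts (ā, ī, ū).
const SHORTEN: Record<string, string> = {
  '\u0101': 'a',
  '\u012B': 'i',
  '\u016B': 'u',
};

function stemCandidates(surface: string): string[] {
  const word = surface.normalize('NFC');
  const candidates: string[] = [];
  const add = (stem: string): void => {
    if (candidates.indexOf(stem) === -1) candidates.push(stem);
  };
  for (const ending of ENDINGS_SKETCH) {
    if (word.length <= ending.length || !word.endsWith(ending)) continue;
    const stem = word.slice(0, word.length - ending.length);
    add(stem); // stem exactly as stripped, e.g. kurū
    const short = SHORTEN[stem.charAt(stem.length - 1)];
    if (short !== undefined) {
      add(stem.slice(0, -1) + short); // shortened variant, e.g. kuru
    }
  }
  return candidates;
}

const kuruStems = stemCandidates('kur\u016Bsu'); // kurūsu
```

With no 'ūsu' entry in the list, the strip can never yield 'kur', so the 'kura' lookup path is closed off by construction.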
Verified end-to-end re-ingestion:
- evaṁ → ['evaṁ', 'eva']
PRIMARY: 'thus; this; like this; similarly; in the same manner;
just as; such' (DPD's evaṃ, now normalized) — exactly the sense
phase-a's curation correctly identified as the right reading.
SECONDARY: bare 'eva' (still derivationally related; surfaced
for transparency rather than hidden).
- kurūsu → ['kurū', 'kuru']
PRIMARY: 'name of the people of Kuru; Kurus' (DPD's kurū entry,
long-vowel stem)
SECONDARY: 'name of a country' (DPD's kuru entry)
Both real Kuru entries; the unrelated 'kura' (rice) is GONE.
- Coverage: 81.6% → 86.5% on MN10 (462/534 surface forms;
26 newly-matched surfaces).
- 72 surfaces still unmatched (was 98); remaining gaps are
compounds the stem-stripper can't decompose.
Implications for curation:
- phase-a (8e7b197) intentionally did NOT cite DPD for evaṁ
because the conflation made the citation misleading. Those
citations are now available and HONEST. Backfill is optional
enrichment work; not done in this commit.
- phase-c gate-pending diff (b46aa64) similarly assumed no DPD
for kurūsu. The kurū / kuru entries are now citable. Backfill
is part of phase-c gate-2 amendments.
Tension #1 (DPD stem-stripper conflation) is fixed. Cumulative
hit count was 2/3 phases; future phases should be cleaner.
Verified: 78 tests pass, Vite build green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Third MN10 phase re-curated. 5 localized packet changes + 4 new
citations. Aditya's gate-1 surfaced the DPD stripper bug; gate-2 became
"fix the bug, then apply with the now-honest citations."
Mid-curation provider fix shipped as c33b115 (three stripper bugs:
niggahīta normalization, over-greedy -ūsu/-ūhi endings, missing bare
-su/-hi + vowel-shortening). Coverage 81.6% → 86.5%.
Packet changes (phase-c):
1. c1 (kurūsu): morph (number=pl on kurū stem; case=loc, number=pl on
su suffix). Relation c1s2 → c2 "Dwelling IN" gets confidence high +
epistemicBasis etymological (placeholder for 'grammatical' once enum
extends — Tension #7). **DPD citations now real**: kurūsu correctly
resolves to two distinct DPD entries (no longer the kura/rice misfire):
- "among the Kurus" → cite:dpd:dpd:22524 (kurū masc — "name of the
people of Kuru; Kurus")
- "in Kuru territory" → cite:dpd:dpd:22502 (kuru masc — "name of a
country")
- "with the Kuru people" → cite:dpd:dpd:22524 (secondary use)
All 3 senses: epistemicBasis 'lexical', confidence high/medium.
2. c2 (viharati): **isAnchor: true** (the action verb of the
geographical-frame clause). Morph on ti suffix (person=3, number=sg,
tenseAspect=present, form=finite) — exactly what DPD's pos=pr declares.
3 senses get cite:dpd:dpd:69661/69662 with confidence high:
- "was dwelling" → :69661 (primary)
- "was staying" → :69661 + :69662 (both senses contain this)
- "was abiding" → :69662 (primary)
3. packet.citations: 4 new entries (kurū, kuru, viharati x2). Total now
13 (4 phase-a + 5 phase-b + 4 phase-c).
Curation log: §8 records the mid-flight provider fix and how it changed
the citation landscape (kurūsu went from "no DPD citation" to two real
ones in the same draft). §9 Outcome filled. Tension #1 (DPD stripper)
marked RESOLVED.
Verified:
- JSON parses; spot-checks confirm every field landed
- c2.isAnchor = True (viharati anchor); c1 morph cases correct
- 21 component + type tests pass
- Vite build green
Tier-1 done + first 3 MN10 phases curated + first provider quality fix
shipped. Stack: feat/opus-grounded-data-layer on PR #38.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… link honest
Phase-a (8e7b197) deliberately left a1.senses[0] (evaṁ "Thus") without
a DPD citation because the stem-stripper at the time conflated evaṁ
with the unrelated bare particle `eva`. The conflation was Tension #1
on the schema-tensions list — fixed in commit c33b115 (niggahīta
normalization + endings-list fixes + vowel-shortening).
DPD's evaṃ headword (now normalized to evaṁ in our index) carries
exactly the sense phase-a's curation correctly identified: "thus; this;
like this; similarly; in the same manner; just as; such"
(cite:dpd:dpd:18134, ind).
Backfill applied:
a1.senses[0]:
- epistemicBasis: 'etymological' → 'lexical'
- sourceCitationIds: + ['cite:dpd:dpd:18134']
- confidence: 'high' (new)
- notes: updated to reference DPD's own treatment of evaṁ/eva as
distinct headwords (the curation's "do not confuse evaṁ with bare eva"
framing is VALIDATED by DPD, not contradicted by it)
packet.citations: + 1 new entry (cite:dpd:dpd:18134)
Total packet.citations: 13 → 14
phase-a curation log §13 records the backfill.
Verified:
- JSON parses; a1 has lexical basis + citation + confidence
- 21 tests still pass
- Vite build will be triggered with next renderer change
The "Do not confuse evaṁ with bare eva" tooltip on a1.s1 remains
correct; DPD's separate headword treatment makes it MORE accurate
post-fix than at original curation time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…curatorial' to EpistemicBasis
Tension #7 surfaced in phase-a/b/c curation: claims grounded in
syntactic/morphological rules were being labeled 'etymological' as the
closest enum fit. But etymology is word-history (sandhi, cognate);
these claims are GRAMMATICAL (agent-in-genitive-of-passive-participle,
accusative-of-time-when, locative-as-location). Hit 3/3 phases — strong
signal per the user's "more data first" rule.
types/suttaStudio.ts
- EpistemicBasis enum: added 'grammatical' + 'curatorial'. Now 7 values:
etymological / grammatical / commentarial / contextual / lexical /
comparative / curatorial.
- Doc block explains the resolution history and what each new value
covers ('grammatical' for syntactic rules; 'curatorial' for explicit
inference grammatically grounded but not from a single attestation).
components/sutta-studio/demoPacket.json
- 3 placeholder usages migrated from 'etymological' → 'grammatical':
* phase-a a2.s1.relation 'Heard BY' (agent-genitive-of-passive-participle
pattern)
* phase-b b2.s3.relation 'Time WHEN' (accusative-of-time-when)
* phase-c c1.s2.relation 'Dwelling IN' (locative-as-location)
- All 3 are syntactic rules, not etymology. Migration is honest; new
enum value makes the basis accurate.
- Zero remaining 'etymological' values in packet (all migrated; no
legitimate word-history claims in phase-a/b/c yet).
types/suttaStudio.test.ts
- New test: EpistemicBasis enum round-trips all 7 values.
Tension #7 closed at 3/3 phase hits. Cumulative schema-tensions
resolved by today's work:
#1 DPD stem-stripper conflation (commit c33b115)
#7 EpistemicBasis 'grammatical'/'curatorial' (this commit)
Verified: 14 types tests pass; types compile clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
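For readers who want the seven-value set in code, here is one possible encoding. The values are the ones listed in the commit above; whether types/suttaStudio.ts uses a string-literal union, a TypeScript `enum`, or something else is not stated, so the `as const` array and the type guard are assumptions.

```typescript
// Assumed encoding: a const array plus a derived string-literal union.
const EPISTEMIC_BASES = [
  'etymological', 'grammatical', 'commentarial', 'contextual',
  'lexical', 'comparative', 'curatorial',
] as const;
type EpistemicBasis = (typeof EPISTEMIC_BASES)[number];

// Runtime guard for values read back from packet JSON.
function isEpistemicBasis(value: unknown): value is EpistemicBasis {
  return typeof value === 'string'
    && (EPISTEMIC_BASES as readonly string[]).indexOf(value) !== -1;
}

// Round-trip every value through JSON, as a packet on disk would be.
const roundTripped: unknown[] = JSON.parse(JSON.stringify(EPISTEMIC_BASES));
const allSurvive = roundTripped.every(isEpistemicBasis);

// A migrated relation basis now type-checks without a placeholder.
const timeWhenBasis: EpistemicBasis = 'grammatical';
```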
…ed citations visible
ADR SUTTA-008 §UI Vision #4 ("Why does this gloss say X?") promised a
way to surface Sense.sourceCitationIds to the reader. 14 DPD citations
have accumulated in packet.citations across phase-a/b/c, all invisible
until now.
Design choice — show only when pinned, not on hover:
- Hover = reading flow → minimal noise
- Pin = audit moment → source revealed
Matches the existing pin-as-engagement model.
components/sutta-studio/SuttaStudioView.tsx
citationsById = useMemo lookup table from packet.citations. Threaded
down to PaliWordEngine.
components/sutta-studio/PaliWord.tsx
- New citationsById prop.
- In the segment render loop, resolve activeSense (segment-level senses
take priority over word-level for compounds).
- When activeSense.sourceCitationIds is non-empty, resolve via
citationsById and pass as citations[] to Tooltip.
- Imports Citation type.
components/sutta-studio/Tooltip.tsx
- New citations?: Citation[] prop.
- When pinned AND citations.length > 0, render a footer below the main
tooltip body:
* "SOURCES" header in emerald-500/70 uppercase tracking-widest
* Per-citation: short ref in slate-300 + italic excerpt in slate-500
(quoted)
- Hidden entirely on hover (matches the audit-on-pin discipline).
Verified via Playwright on /sutta/demo:
- Click sutaṁ's ṁ segment → tooltip shows facet 2/2 ("In the formula
'me sutaṁ'...") plus SOURCES footer: DPD s.v. suta (pp — heard)
"suta [pp]: heard"
- Click eva segment → tooltip shows evaṁ deictic prose plus SOURCES
footer with cite:dpd:dpd:18134 (the backfilled evaṁ entry).
- Zero console errors.
Net effect: the entire grounded-curation arc is now READABLE in the UI.
Pin any Pāli segment whose word/segment senses have citations, see
exactly which DPD entries support the gloss + the excerpt.
58 tests pass across providers, sutta-studio, types.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
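The lookup-and-gate logic in that commit, stripped of React, might look like the sketch below. Only `sourceCitationIds`, `excerpt`, the `citationsById` name, and the citation id are from the source; the interface shapes, the `short` field, its sample text, and both helper functions are hypothetical.

```typescript
// Hypothetical shapes for illustration only.
interface CitationSketch {
  id: string;
  short: string;    // short reference shown in the SOURCES footer
  excerpt?: string; // quoted excerpt carried in the packet
}
interface SenseSketch {
  gloss: string;
  sourceCitationIds?: string[];
}

// Build the id → citation table once per packet (the useMemo step).
function indexCitations(citations: CitationSketch[]): Map<string, CitationSketch> {
  const byId = new Map<string, CitationSketch>();
  for (const citation of citations) byId.set(citation.id, citation);
  return byId;
}

// Hover shows nothing; a pinned tooltip shows whatever ids resolve.
function citationsForTooltip(
  sense: SenseSketch,
  byId: Map<string, CitationSketch>,
  pinned: boolean,
): CitationSketch[] {
  if (!pinned) return [];
  const resolved: CitationSketch[] = [];
  for (const id of sense.sourceCitationIds ?? []) {
    const hit = byId.get(id);
    if (hit) resolved.push(hit); // this sketch silently skips dangling ids
  }
  return resolved;
}

const citationsById = indexCitations([
  {
    id: 'cite:dpd:dpd:18134',
    short: 'DPD s.v. evaṁ', // sample text, not the packet's actual wording
    excerpt: 'thus; this; like this; similarly; in the same manner; just as; such',
  },
]);
const evamSense: SenseSketch = { gloss: 'Thus', sourceCitationIds: ['cite:dpd:dpd:18134'] };
const onHover = citationsForTooltip(evamSense, citationsById, false);
const onPin = citationsForTooltip(evamSense, citationsById, true);
```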
…c session
Captures the entire session arc — Tier-1 grounded data layer, 3 MN10
phases curated via the Grounded Curation Loop protocol, full renderer
arc (anchor styling, pin model, click-cycles-facets, tooltip overflow
flip, About-this-text panel with linked acknowledgments, citation
chips), DPD bug fix, EpistemicBasis enum extension.
WORKLOG.md
- New top entry for 2026-05-12 with status, sources of truth,
25-commit arc summary, deferred-list, resume instructions.
- Previous 2026-05-11 entries preserved.
HANDOVER.md
- Replaces prior handover (preserved in git history).
- Sections: session summary, 25 commits categorized by arc,
what landed (6 categories), pending threads (priority/effort
ordered with context-cost notes), key context the next
instance needs (curation rhythm, DPD bug pattern, pinned-
tooltip discipline, schema-tension hit-count discipline,
gratitude register house style), resume instructions.
The "Pending threads" section is the load-bearing one — sorted
high context-value first so a fresh agent can pick tasks aligned
with their available context.
Last commit before handover-and-stop per session-boundary discipline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gressions
Adds 27 unit tests for the three root causes fixed in c33b115:
1. normalizeNiggahita — ṃ (U+1E43) → ṁ (U+1E41) conversion; idempotency;
mixed-form handling; direct codepoint assertions.
2. PALI_ENDINGS — assertions that 'ūsu'/'ūhi' are absent (the
over-greedy endings that conflated kurūsu → kura/rice) and that bare
'su'/'hi' are present (paired with vowel-shortening).
3. tryStemStrips — kurūsu produces 'kuru' (via vowel-shortening) and
does NOT produce 'kur' or 'kura' (the bug-path); same pattern verified
for bhikkhūsu; coverage of all three long vowels (ā→a, ī→i, ū→u);
evaṁ direct-match preservation; quotative-tail handling.
Refactor: build-dpd.ts now exports the pure helpers (normalizeNiggahita,
PALI_ENDINGS, QUOTATIVE_TAILS, stripQuotative, tryStemStrips) and gates
main() with a standard ESM Node entrypoint guard so importing the
module during tests is side-effect-free.
Verified: `npm run build:dpd` still runs correctly.
Per HANDOVER §Pending #4: c33b115 shipped with only end-to-end
verification. These unit tests catch regressions before they surface in
coverage drops on the next phase's curation.
Tests: 112 passed under `services/providers` + `scripts`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ings
Phase-d surfaced the same stem-stripper conflation pattern that c33b115
fixed for -ūsu/-ūhi: kurūnaṁ (gen pl of Kuru) over-stripped to 'kur'
and conflated with 'kura' (rice). Hit #3 in 3/3 batch-2 phases —
crosses the threshold phase-c §5 set for "overwhelming case to fix the
stripper."
Parallel fix mirroring c33b115:
- Removed 'ūnaṁ' and 'unaṁ' from PALI_ENDINGS (u-stem gen pl is
vowel-lengthening + bare -naṁ, not a single 4-char ending)
- Added bare 'naṁ' paired with the existing vowel-shortening rule
- Kept 'ānaṁ' (a-stem gen pl IS a real single ending in standard
analyses — dhammānaṁ = dhamm + ānaṁ, not dhammā + naṁ)
Coverage: 86.5% → 86.9% on MN10 (+2 surface forms now resolved).
kurūnaṁ now resolves to ['kurū', 'kuru'] — identical to phase-c's
kurūsu, the correct DPD entries (dpd:22524 "name of the people of
Kuru" + dpd:22502 "name of a country").
Tests: added 10 regression cases (PALI_ENDINGS membership assertions
for ūnaṁ/unaṁ/naṁ/ānaṁ, tryStemStrips coverage for kurūnaṁ +
bhikkhūnaṁ, a-stem gen pl preservation regression net). 37 pass under
build-dpd.test; 20 pass under dpd.test against the rebuilt dataset.
Resolves schema tension #1 — DPD stripper conflation. Both -su/-hi and
-naṁ patterns now closed for u-stems.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…urūnaṁ nigamo
Closes batch 2 of CURATION_PROTOCOL §6 (phase-b, phase-c, phase-d).
Changes (one phase, one commit, per protocol):
- d1 Kammāsadhammaṁ: isAnchor=true; d1s3 morph {number:sg, gender:n}
(per gate-2 amendment: don't overclaim case — neuter sg has identical
nom/acc forms; ambiguity noted in tooltip). 3 senses with separated
epistemic basis: lexical (DPD dpd:20396) / curatorial (Jātaka-derived
"Spotted-One Tamed" — note softened per amendment to make traditional
derivation honest, not lexical-asserted) / etymological (compound parse).
- d2 nāma: sense lexical + dpd:36427 (the naming-particle DPD entry,
selected from 7 nāma homonyms).
- d3 kurūnaṁ: d3s1 morph {number:pl}; d3s2 morph {case:gen, number:pl};
relation extended with confidence + epistemicBasis=grammatical
("Town OF" is case-derived). 3 senses lexical + REUSES phase-c
citations (dpd:22524 kurū + dpd:22502 kuru) — same stem, new case.
- d4 nigamo: d4s3 morph {case:nom, number:sg, gender:m}. 3 senses:
market-town/township lexical (dpd:36863 + dpd:74785); "trading center"
curatorial + low confidence per amendment (DPD doesn't attest it).
- packet.citations: 14 → 18 (added dpd:20396, dpd:36427, dpd:36863,
dpd:74785). 2 phase-c entries reused.
Methodological win (per Aditya's framing): phase-d forces the system to
separate four kinds of claim — lexical attestation, grammatical inference,
traditional/commentarial etymology, curatorial pedagogical expansion.
First phase to exercise the new 'curatorial' EpistemicBasis value (added
in 4323310) for real, on the Jātaka derivation + trading-center expansion.
Schema tensions:
- #1 (DPD stripper conflation) — RESOLVED across both -su/-hi (c33b115)
and -naṁ (be2b141, this session). All u-stem oblique plurals now
correctly handled.
- #7 (EpistemicBasis enum) — first real load on 'curatorial'; no
laundering of curator inference as etymology.
- No new tensions surfaced from phase-d.
Curation log at docs/sutta-studio/curation/phase-d.md (gates, amendments,
plain-register deferrals, open questions).
Tests: 321/322 pass (1 flake in SessionInfo unrelated to phase-d; passes
in isolation).
Batch 2 complete → re-evaluate protocol before starting batch 3.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves docs/WORKLOG.md conflict: 1242e43's temporary "claim" entry
(landed on main yesterday signalling this branch's work was in
progress) is superseded by the comprehensive "done" entry produced this
session.
Also updates the done entry to reflect today's three additional commits:
- b1b7fdb (DPD bug-fix unit tests — 37 regression cases)
- be2b141 (DPD parallel fix — -ūnaṁ/-unaṁ removal, +0.4pp coverage)
- b5a52a9 (phase-d re-curation, closes CURATION_PROTOCOL §6 batch 2)
Total: 28 commits on the branch (29 including this merge).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
First of five Tier-1 commits implementing the grounded curation data layer per ADR SUTTA-008. Lands the provider abstraction + Citation provider-attribution fields. Subsequent commits add DPD, VRI edition, Aṭṭhakathā commentary, SC bilara/suttaplex, and curation helper.
Draft because B-E are coming on this branch. Convert to ready-for-review when the Tier-1 set is complete (or when we decide to merge A independently).
What's in
- `services/providers/types.ts` — interfaces: `LexiconProvider`, `MorphologyProvider`, `CommentaryProvider`, `EditionProvider`, `WitnessProvider`, `ParallelProvider`. Every response carries `sourceId` + deterministic `citationId` per ADR amendment #2.
- `services/providers/citationHelpers.ts` — `citationIdFor` (`cite:{providerId}:{sourceId}` or `cite:{providerId}:q:{query}`) and `materializeCitation`. Citation minting is mechanical, not hand-glued.
- `services/providers/lexiconRegistry.ts` — `LexiconProviderRegistry` runs providers in parallel; preserves per-provider entries in `entriesBySource` (powers ADR UI vision #7 source-disagreement inspector); isolates one provider's failure from others.
- `services/providers/suttaCentralDictionary.ts` — `SuttaCentralDictionaryProvider` wraps the existing `/api/dictionary_full/{lemma}` endpoint as a first citizen. Per-session cache; preserves raw payload as `rawExcerpt`.
- `services/providers/index.ts` — barrel + `defaultLexiconRegistry` singleton.
- `types/suttaStudio.ts` — `Citation` extended with `provenance` / `query` / `excerpt` / `license` / `fetchedAt`; new `CitationProvenance` enum (15 sources).
What's not in (intentionally)
- `services/compiler/index.ts:387` (the `fetchDictionaryEntry` callsite) is unchanged. Compiler wiring lands with DpdProvider in commit B as one coherent unit — we change the lexicographer prompt builder once, not twice.
- `defaultLexiconRegistry` in commits B-D.
Test plan
- `services/providers/`)
- `defaultLexiconRegistry.lookup` during phase-a re-curation (task #14) — first real usage exercise
ADR alignment
- `LexiconProvider` instantiated
- `sourceId` + deterministic `citationId`
- `provenance` / `query` / `excerpt` / `license` / `fetchedAt`
- `entriesBySource` preserves per-provider entries; test demonstrates grouping by `query`
🤖 Generated with Claude Code
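The two citation-ID shapes named under "What's in" can be sketched as a small pure function. The ID formats are from the PR description; the signature, the `CitationKey` type, and the choice not to escape the query string are assumptions, and the real `citationIdFor` in services/providers/citationHelpers.ts may look different.

```typescript
// Assumed input shape: a provider either has a stable sourceId for the
// entry, or only the query string that produced the response.
type CitationKey = { sourceId: string } | { query: string };

// Deterministic: the same provider + key always yields the same id.
function citationIdFor(providerId: string, key: CitationKey): string {
  return 'sourceId' in key
    ? `cite:${providerId}:${key.sourceId}`
    : `cite:${providerId}:q:${key.query}`;
}

// Matches the DPD ids used in the curation commits, e.g. cite:dpd:dpd:17376.
const bySource = citationIdFor('dpd', { sourceId: 'dpd:17376' });
// Fallback form when only the query is known (provider id is illustrative).
const byQuery = citationIdFor('sc-dictionary-full', { query: 'sati' });
```

Deriving the id from provider and source alone means two curators citing the same entry mint the same id without coordinating.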