convert: Brainstorm-E split migrators (treatment→manipulation, ontology_table_row→observations)#149
Merged
Merged
Conversation
…e_row→obs) #2 of the migration work — the migration CODE for the Brainstorm E split, implementing the specs in did-schema V_epsilon/conversions/from_did_v1/. Engine (v1_to_v2): adds a TargetVersion option (default 'V_delta', behaviour unchanged) and 1→N support. Under TargetVersion 'V_epsilon', a class with a split migrator under +did2/+convert/+migrators_e is routed there and may return several bodies; each is padded, optionally validated, and counted independently. Fully backward compatible — the default path still runs the existing per-class migrators 1→1, so the V_delta corpus pipeline is untouched. Migrators (+migrators_e): - treatment.m: 1→1 branch dispatch to injection / bath / procedural_ / temperature_ / environmental_manipulation, with the Dab string_value→ target_structure edge case and out-of-tier (quarantine-with-reason) routing for non-manipulation rows. Branch resolution is a keyword/CURIE-prefix seed; the per-term table is finalized in discovery mode. - ontology_table_row.m: 1→N — each row → a scalar/categorical observation property class (or generic escape hatch), value placed in the declaring block (shape mixin for scalars / categorical_concept for inherited categoricals / the concrete class for the overriders). Tests (testMigratorsE.m): synthetic, Validate=false (matching testMigrators) — thermal/drug/environmental/Dab/not-a-manipulation treatment cases, the ontology_table_row 1→N fan-out + per-row unique ids, and a backward-compat guard that the default target leaves treatment unchanged. Not local-verifiable (no MATLAB in the authoring env) — relies on CI. Corpus discovery against V_epsilon (#3) is a follow-up that needs the E draft classes reachable by the schema cache (DID_SCHEMA_PATH → V_epsilon); see PR body. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
The compressed multi-line if/elseif chain with a trailing end (line 232) tripped the MATLAB Code Analyzer keyword/end alignment check. Expand to standard multi-line form; no behaviour change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
Wires the corpus migration test (#3) to run against the V_epsilon E classes, isolated from the green V_delta corpus run: - runCorpusDiscovery: add a TargetVersion option (default 'V_delta'), forwarded to v1_to_v2. - testCorpusEpsilon.m: opt-in (DID_RUN_EPSILON_CORPUS) discovery run of the Dab corpus through TargetVersion='V_epsilon' (Dab's treatment rows, incl. the optogenetic-tetanus "Target Location" idiom, exercise the split). Skips cleanly in the default V_delta run via assumeFail. - test-code.yml: after the default test run, assemble a combined V_epsilon stable+draft schema dir (the cache loads one flat dir; tier class names are disjoint), point DID_SCHEMA_PATH at it, and run testCorpusEpsilon with DID_RUN_EPSILON_CORPUS=1. Own step env, so the V_delta job is untouched. Discovery mode: quarantine (e.g. the expected missing-time_reference until synthesis lands) is reported, not failed; only errors fail the step. The report artifact drives the next iteration of the dispatch/synthesis work. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
Adding options.TargetVersion to the arguments block requires in the function signature too (MATLAB MismatchBetweenBlockAndLine). This also broke the existing corpus tests that call this helper. No behaviour change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
…elta) Per "replacing V_delta with V_epsilon": the standard test run now validates against V_epsilon and the corpus discovery migrates through the Brainstorm-E split, instead of running V_epsilon as an isolated opt-in side step. - test-code.yml: assemble a combined V_epsilon stable+draft schema dir (the cache loads one flat dir; tier class names are disjoint) and point the main DID_SCHEMA_PATH at it. Removed the separate opt-in epsilon step. - runCorpusDiscovery: default TargetVersion is now V_epsilon, so all corpus tests (Dab, B, JH, PRED, 20211116, Soph) exercise the split as the main test. - v1_to_v2: stamp document_class.schema_version = V_epsilon on outputs when TargetVersion is V_epsilon. - Removed testCorpusEpsilon.m and the DID_RUN_EPSILON_CORPUS gate (redundant). Non-corpus unit tests are unaffected: they set their own fixtures schema path and run Validate=false, so they do not read DID_SCHEMA_PATH. v1_to_v2's own default TargetVersion stays V_delta so the per-class V_delta migrator unit tests (testMigrators/testConvertV1ToV2) remain valid; corpus migration opts into V_epsilon via runCorpusDiscovery. Discovery mode: quarantine (e.g. the expected missing-time_reference until synthesis lands) is reported, not failed. Local-dev default schema path (cache defaultSchemaPath) is unchanged; a multi-tier cache loader is the follow-up that removes the assemble-a-dir step. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
…ore empty time_reference) The Brainstorm-E split migrators now attach a real time_reference instead of an empty slot (the prior cause of quarantine for behavioral/treatment rows): - Each migrated interaction depends_on a session_relative_reference document (ordinal, relation='during', anchored to the source's base.session_id) minted with did.ido.unique_id(). 'during' is the honest universal fallback (the event happened within the session); 'at_end_of' is reserved for known-terminal cases, not asserted blanket. - treatment: now 1 -> 2 (manipulation + anchor). ontology_table_row: 1 -> N+1 (N observations sharing one session anchor). - Fixes invalid row ids: ontology_table_row observations now get fresh did.ido.unique_id() base ids (the previous "<id>_row%02d" was not a valid did_uid and would have quarantined). Updates testMigratorsE expectations for the added anchor doc (+ asserts the anchor relation is 'during'). Authored without local MATLAB — CI-verified. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
- treatment.m: drop the stale %#ok<AGROW> suppression (the line isn't in a loop, so no AGROW warning is generated). - ontology_table_row.m: remove the now-unused rowIndex argument threaded through migrateRow/makeScalarObservation/makeCategoricalObservation/ startObservation (row ids now come from did.ido.unique_id()). No behaviour change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
…s report
From the V_epsilon corpus discovery (B 0, 20211116 275, Dab 7859, JH 41504):
1. ontology_table_row "has no rows to migrate" (~47K, the dominant cause):
the real v1 layout is parallel char fields (names / variable_names /
ontology_nodes / data), not a 'rows' array, so extractRows found nothing
and threw. Now falls back to migrating the document unchanged as an
ontology_table_row (the class still exists in V_epsilon, so it validates)
instead of quarantining. Splitting that char-field layout into per-row
observations is a follow-up.
2. target_structure "must be a struct (named composite type ontology_term)"
(~105): the field is an ARRAY of ontology_term (a struct array), but the
migrators emitted a cell array. Emit struct arrays
(empty: struct('node',{},'name',{}); populated: struct('node',..,'name',..))
in both treatment.m and ontology_table_row.m.
Remaining quarantine is mostly non-E drift (stimulus_bath mixture,
stimulus_response_scalar_parameters_basic block, subject_group has no V_epsilon
class) plus a handful of unresolved-branch treatments (curator review).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
…olumn observations extractRows now parses the actual v1 ontology_table_row form (per the schema): parallel comma-separated names / variable_names / ontology_nodes plus a data struct keyed by variable_names. One document is one table ROW; each COLUMN becomes one observation (identity = ontology_node + name; value = data.<var>). Columns with no usable value (missing key, [], '', NaN) are skipped. The synthetic rows-array / single-row shapes still work; the unparseable-fallback (migrate unchanged as ontology_table_row) remains for anything else. This converts the ~47K "has no rows to migrate" quarantine into actual per-column observations (the dominant E-split gap from the corpus report). Adds testMigratorsE cases for the char-field split and empty-value skipping. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
No .codespellrc to ignore-list, so rename the variables. No behaviour change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
The extractRows rewrite's old_string didn't include the original function's closing end, leaving a duplicate end after splitCSV (Code Analyzer parse error at END, line 290; the 'function might be unused' alert was its downstream symptom). Removed the extra end; the file is balanced (the apparent off-by-one in keyword counts is the arguments block). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
The dominant corpus quarantine ("No schema file for class
generic_categorical_observation" 41,095 in JH; session_relative_reference;
generic_scalar_observation.value placement) was not migrator bugs — the
workflow checked out did-schema at ref: main, where the Brainstorm-E classes
do not yet exist. They live on the matching did-schema branch (DID-schema#62).
Point the did-schema checkout at ${{ github.head_ref || 'main' }} so the
V_epsilon corpus discovery validates against the actual E schema. Revert to
main once DID-schema#62 lands.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
stimulus_bath migrates to a `bath` (pharmacological_manipulation), but the bath needs its subject (the stimulator element's subject) and an epoch_bounded_ reference time anchor (the stimulator's epoch) -- both require following stimulus_element_id into the session/element graph. A manipulation must be emitted complete, so the whole bath is assembled in ndi.migrate.local, not per-document. The per-doc converter now defers via did2:convert:needsSessionContext, so the discovery report reads "migrate via ndi.migrate.local" instead of the misleading "mixture missing" (the prior V_delta-block-placement fallback). Adds a test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
…ays-error pattern The prior 'error then unreachable assignment' tripped 'return value might be unset' + 'input unused'. Use ~ for the ignored input and assign the required output before the error (suppressing the dead assignment), so the stub is analyzer-clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
isAlreadyVDelta -> isAlreadyTarget(body, targetVersion): a body already tagged with the configured TargetVersion (V_delta or V_epsilon) short-circuits the migration loop and is just padded (ensureClassBlocks) + validated, not re-migrated. Default 'V_delta' preserves existing behaviour exactly. This is what lets ndi.migrate.local feed NDI-assembled V_epsilon bodies (the bath + epoch_bounded_reference from stimulusBathToBath) back through v1_to_v2 to get the superclass chain rebuilt and validated. Adds a short-circuit test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
Follows the V_epsilon schema flatten (categorical_concept.value now placement: concrete_class). ontology_table_row's categorical path no longer branches on whether the class is an overrider; it writes the bound term uniformly into body.(className).value. This clears the ~47K "categorical_concept.value missing" corpus quarantine, since the inherited value now lives in each observation's own block. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
Clears the JH subject_group corpus quarantine (353 docs): subject_group is deprecated in V_epsilon, so the per-document E migrator folds it into a subject flagged is_group (optional legacy group_name/description map to local_identifier/description). Membership -> group_assignment is relational and stays an NDI-layer follow-up (like stimulus_bath -> bath). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
20211116 was the lone corpus still migrating to V_delta via bespoke test code. Convert it to the shared did2.unittest.helpers.runCorpusDiscovery driver (same as B / Dab / JH), which targets V_epsilon and reuses the common schema-path install/teardown and report writer. This puts every corpus on the same V_epsilon target so its remaining quarantine is measured on the same footing as the others. Discovery mode is unchanged (no zero-quarantine assertion); the report is the deliverable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
v1 documents carried a property block for every class in their
hierarchy. V_epsilon introduces abstract / fieldless parent classes
(new -- the previous schema had none), which contribute no block to an
instance. The inherited empty block (e.g. stimulus_response_scalar_
parameters: {} on a stimulus_response_scalar_parameters_basic doc) rode
through migration untouched and tripped the strict undeclared-top-level-
block validator.
In ensureClassBlocks, after manufacturing the contributing blocks, remove
any top-level block that names a chain class which contributes no block
AND is an empty struct. Only EMPTY blocks are dropped; a non-empty one
signals real data a migrator must place, so it is left to fail loudly
rather than be silently discarded. General fix (one place) rather than a
per-class migrator, so every v1 abstract-parent-empty-block case clears.
Regression-checked by the 20211116 corpus discovery job (its 273
"undeclared top-level block stimulus_response_scalar_parameters" should
drop to ~0).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
…r; enable Soph corpus
Three remaining deprecated-class folds (no migrator existed, so they would
quarantine under V_epsilon):
- treatment_drug -> injection (kind: drug); mixture from mixture_table
- virus_injection -> injection (kind: virus); virus+dilution in the mixture
- treatment_transfer -> biological_transfer; recipient->subject_id,
donor carried as donor_id, method->procedure/kind
Each is a 1 -> 2 emit (the manipulation + the shared session_relative_
reference anchor subject_interaction requires) and uses the correct
{chemical, amount} mixture record shape. Tests added to testMigratorsE.
Also enable the large Soph corpus (~446 MB, ~101k docs) in test-code.yml
(DID_RUN_SOPH_TEST=1) so its V_epsilon quarantine is surfaced for the
discovery pass; this is opt-in and meant to be gated back off once Soph
coverage is characterised.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
Soph discovery is characterised (0 quarantine at V_epsilon), so drop the DID_RUN_SOPH_TEST=1 enable from test-code.yml -- the ~446 MB / ~101k-doc download should not run on every push. The test remains opt-in (skips cleanly) and can be re-enabled on demand. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
The treatment -> injection/bath builders predated schema validation
(treatment isn't in the validated corpora) and had three mismatches that
would quarantine the moment a real treatment-drug/bath row appeared:
- mixture records used {agent, concentration}; the schema is
{chemical, amount}.
- injection/bath were given a `notes` field neither class declares
(strict undeclared-field failure).
- injection omitted its required `volume`.
Align both builders with the injection/bath schemas: mixture as
{chemical, amount}, injection carries kind/volume(blank)/route/
target_structure, bath carries kind/location. notes is dropped (no field
on these classes). Matches the {chemical, amount} shape the new
treatment_drug / virus_injection migrators already use.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
…ting) To curate the authoritative treatment / ontology_table_row routing tables from real data, the discovery run now writes corpus-reports/<corpus>- routing.json: for every migrated observation/manipulation, the identity term (measured_property / applied_property / procedure / factor / entity / first mixture chemical) aggregated to its routed class with counts. This makes the heuristic routing AUDITABLE -- terms landing in generic_* are unmatched, and a term under a surprising class is a mis-route. Best-effort and wrapped so it never breaks the discovery summary. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
The legacy did1 validator derived a document's superclass NAMES solely
from each superclasses entry's `.definition` path
($NDIDOCUMENTPATH/base.json -> base). did2-form documents (the shape every
v1->v2 migrator now emits, and increasingly what NDI writes) carry the
name directly as `.class_name` with no `.definition`, so the extraction
errored to empty and EVERY such document failed the superclasses check
with `("base" <=> "")` -- e.g. the NDI dataset/session ingestion suite on
the Vnext line.
Accept both forms: keep the `.definition` derivation, and when that
yields nothing, fall back to `.class_name`. Bound the deeper superclass
recursion by `superFullNames` (the `.definition` paths) so did2-form docs
-- which expose names but no paths -- skip the path-based recursion
cleanly instead of indexing past the end; the concrete class is still
validated and the superclass-name check has already passed.
Verified via the NDI dataset/session ingestion tests (they exercise this
legacy path); did2's own suite does not reach it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
The per-term routing report ships as a JSON artifact, but the artifact isn't readable from the migration tooling we use to curate. Echo the actionable breakdown to stdout so it lands in the CI log: every term routed to a generic_* escape hatch (the UNMATCHED ones that need a minted class or a routing rule), then the top routes overall. Data source for curating the authoritative treatment / ontology_table_row routing tables. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
After the 1->N splits and class folds, run did2.validate.references over the migrated batch and report dangling depends_on edges (orphan_count / edges_examined, aggregated by source-class.edge_name). Surfaces any reference the migration would break -- e.g. a split that didn't preserve a referenced id, or an edge to a deferred/quarantined doc. Best-effort, non-fatal (discovery mode); echoed to the CI log next to the routing inventory. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
…er") Codespell read the catch variable `refErr` as a misspelling of "refer". There's no .codespellrc ignore-list in the repo, so rename it (matching the already-clean routingErr style). No behavior change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
…-routes (#1) The routing inventory exposed heuristic false positives from substring matching: "average amplitude"->age_observation ("aver-AGE-"), C. elegans "encounter*"->litter_size ("en-COUNT-er"), "sampling rate"->heart_rate, "number of samples"->litter_size, "growth duration"/"latency"->age, "bacterial patch volume"->organ_volume. Two changes: - containsAny now matches at WORD BOUNDARIES (regexp \<...\>), so a keyword matches only as a whole word/phrase -- average !-> age, encounter !-> count. - the scalar/categorical keyword lists are tightened to SPECIFIC, high-confidence phrases (body weight, body length, heart rate, respiration rate, blood pressure, litter size, cell count, life cycle stage, health status, ...) and the over-broad singletons (rate, frequency, count, number, volume, duration, latency, score, status, stage, behavior) are dropped. respiration_rate and cell_count get their own branches. Anything not confidently a known property falls to the generic_* escape hatch -- correct for the lab-specific corpus terms. Existing testMigratorsE cases (weight, life cycle stage) still route as before. treatment.m is intentionally left stem-based (rear->rearing) and showed no mis-routes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01ABEeDJ83r9Hq9PaiSije8o
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
#2 of the migration work — the migration code for the Brainstorm E split, implementing the specs added to did-schema (
V_epsilon/conversions/from_did_v1/treatment.md+ newontology_table_row.md). Companion schema PR: Waltham-Data-Science/DID-schema#62.Engine (
v1_to_v2)TargetVersionoption (default'V_delta'— behaviour unchanged).ensureClassBlocks), optionally validated, and counted independently. The historical 1→1 path is preserved exactly.TargetVersion='V_epsilon', a class with a split migrator under+did2/+convert/+migrators_eis routed there instead.Migrators (
+migrators_e)treatment.m— 1→1 branch dispatch →injection/bath/procedural_manipulation/temperature_manipulation/environmental_manipulation, with theDabstring_value→target_structureedge case and out-of-tier routing (quarantine-with-reason) for non-manipulation rows (DOB, experiment time).ontology_table_row.m— genuine 1 → N: each row → a scalar/categorical observation property class (or a generic escape hatch), value placed in the declaring block (shape mixin for scalars;categorical_conceptfor inherited categoricals; the concrete class for the two overriders).Branch/term resolution is a keyword/CURIE-prefix heuristic seed; the authoritative per-term tables are finalized in discovery mode against real corpora (the report-only loop the conversion docs describe).
Tests (
testMigratorsE.m)Synthetic,
Validate=false(matchingtestMigrators): thermal/drug/environmental/Dab/not-a-manipulationtreatmentcases;ontology_table_row1→N fan-out + per-row unique ids; and a backward-compat guard that the default target leavestreatmentunchanged.Honest status / what's left
V_delta/stable, but the E classes live inV_epsilon/draft. To run the corpuses through the E split end-to-end, CI needsDID_SCHEMA_PATHpointed at V_epsilon with the draft E classes reachable by the schema cache (promote to stable, or load draft). That's the one coordinated decision; I can wire an opt-inDID_RUN_EPSILON_CORPUSdiscovery job once it's settled.V_deltapipeline and its green corpus CI are untouched by this PR.🤖 Generated with Claude Code
Generated by Claude Code