feat(profile): YAML projection follow-ups — CI gate, per-dist merge, catalog inheritance#3910
Merged
Conversation
Adds three smoke tests in src/cmd/profile/profile_spec.rs: - embedded_dcat_ap_v3_parses_and_dry_compiles: locks down DCAT-AP v3 (validation disabled, eu_theme vocab, 28 field_mappings). - embedded_croissant_parses_and_dry_compiles: locks down Croissant (validation + discovery_merge disabled, recordsets populated, croissant_datatype vocab, 16 field_mappings). - all_embedded_profiles_parse_and_dry_compile: parametric loop over EMBEDDED so any future profile added to the bundle automatically inherits load + dry_compile coverage without a per-profile test edit. Addresses the "Profile dry-run validation gate in CI" follow-up from profile3-handoff.md §6. The dry_compile pass already runs at runtime via profile_spec::load; these tests exercise it at build time too so a malformed bundled YAML fails CI rather than only at first invocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today the `dcat:distribution` array is on `discovery_merge.never_overwrite` so publisher DCAT contributions to per-Distribution metadata (titles, rights, license attributions) are silently dropped. This commit adds a profile-configurable per-element merge path that lets publisher fields flow into the inferred record while qsv's own per-distribution stats (byteSize, mediaType, CSVW schema) always win. Engine - New `DistributionMerge` struct on `DiscoveryMerge` with `enabled`, `array_key`, `identity_keys`, `field_strategy`, `append_unmatched`. - `discovery_merge::merge` now handles the distribution array first when `distribution_merge.enabled`, bypassing `never_overwrite` for that one key. Each publisher Distribution is matched against an inferred one by the first identity key with a non-empty value on both sides; matched fields merge via `field_strategy` (default `fill-if-absent`). Unmatched publisher entries are silently dropped unless `append_unmatched: true`. - Helpers (`merge_distribution_array`, `coerce_to_array`, `find_distribution_match`, `non_empty_string`) keep the matching logic isolated from the top-level key loop. Profile YAMLs - `dcat-us-v3.yaml` + `dcat-ap-v3.yaml` enable per-distribution merging with `dcat:downloadURL` → `dcat:accessURL` → `@id` as the identity-key priority and `fill-if-absent` for matched fields. Unmatched publisher entries are dropped (`append_unmatched: false`) since qsv's inferred distributions are canonical for the local data. - Croissant intentionally leaves `discovery_merge.enabled: false` (no change). Docs - `resources/profiles/README.md` gains a "Discovery merge" section covering both the outer `discovery_merge` block and the new `distribution_merge` sub-block, including the identity-key precedence used by the bundled DCAT profiles. Tests - 8 new unit tests in `discovery_merge::tests`: pairs_on_download_url, falls_back_to_access_url, drops_unmatched_when_append_disabled, appends_unmatched_when_configured, handles_single_object_inputs, bypasses_never_overwrite_when_enabled, disabled_preserves_never_overwrite, empty_identity_keys_treats_all_as_unmatched. Addresses profile3-handoff.md §6 follow-up "Per-distribution merging in discovery_merge". All 133 profile unit tests + 49 profile integration tests pass; clippy + docs-drift-check clean; all four binary variants (qsv/qsvlite/qsvmcp/qsvdp) build green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today `catalog.inherit_from_dataset` is a static `Vec<String>` —
each listed Dataset key is copied verbatim onto the Catalog
envelope. Useful for `dct:publisher` style hand-offs, but it
cannot rename keys, transform values, or condition the copy on
runtime state.
This commit unlocks all three by exposing the rendered Dataset
block as `inner` on the rendering context used by both
`catalog.fields[]` templates and the catalog `title_template`.
Users gain the full FieldDecl directive surface (emit_when,
default, required_level) for catalog-side inheritance without
a new config block.
Engine
- `projection::wrap_as_catalog` now takes the analysis JSON ctx
directly (`&Value`) instead of a pre-built `minijinja::Value`.
A new `build_ctx_with_inner` helper unions the analysis ctx
with `{"inner": dataset}` so catalog templates see both the
underlying analysis vars (`pkg`/`res`/`stats`/`dpp`) and the
rendered Dataset values via `inner["..."]`.
- `wrap_in_catalog_envelope` and the in-`project` Catalog-mode
call site updated to pass the JSON ctx through.
- `title_template` rendering now uses the same ctx_with_inner,
so titles can mix `pkg.publisher` with `inner["dct:title"]`.
This is a backward-compatible superset — legacy templates that
only referenced `inner` keep working.
Profile spec
- `CatalogBlock` doc comments refreshed to document the `inner`
binding and the analysis-ctx availability. `inherit_from_dataset`
is kept as the verbatim-copy shortcut; `fields[]` is presented
as the path for rename / transform / conditional inheritance.
Docs
- `resources/profiles/README.md` gains a "Catalog envelope"
section contrasting the two inheritance paths with a worked
rename + conditional-copy example.
Tests
- 4 new unit tests in `projection::tests`:
catalog_field_template_can_reference_inner_dataset,
catalog_field_template_sees_analysis_ctx_alongside_inner,
catalog_field_emit_when_against_inner_skips_empty,
catalog_title_template_can_reference_analysis_ctx.
- Existing `catalog_mode_wraps_dataset` and
`dcat_us_v3_golden_parity_catalog` still pass byte-exact —
the bundled YAMLs use `inherit_from_dataset` so their wire
shape is unchanged.
Addresses profile3-handoff.md §6 follow-up "Catalog inheritance
from arbitrary Dataset keys". 137 profile unit tests + 49
integration tests pass; clippy + docs-drift-check clean; all
four binaries build green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues surfaced by Roborev #2499 against 6e0508b (per-distribution discovery merge): MEDIUM — `merge_distribution_array` bypassed the `forced_dcat_paths` protection. A user override at `/dcat/dcat:distribution/0/<field>` (set via `--initial-context dataset_info`) could silently be filled from publisher-discovered DCAT when the inferred value was absent, contradicting the documented force-override guarantee. `forced_dcat_paths` and `dist_key` now thread into the per-element path; each publisher field is skipped when its pointer `/dcat/<dist_key>/<idx>/<field>` is forced exactly or is the parent of a forced sub-key (matches the top-level forced-path semantics in `merge`). LOW — `array_key` was documented as defaulting to `dcat:distribution`, but the implementation used `and_then(|d| d.array_key.as_deref())`, so enabling `distribution_merge` without spelling out `array_key` silently disabled merging. Switched to `.map(|d| d.array_key.as_deref().unwrap_or("dcat:distribution"))` so the default actually fires. Tests - `dist_merge_array_key_defaults_to_dcat_distribution` - `dist_merge_honors_forced_path_on_distribution_field` - `dist_merge_honors_forced_nested_path_under_distribution_field` - `dist_merge_unrelated_forced_path_does_not_block` All 141 profile unit + 49 integration tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR delivers follow-ups for the YAML-driven qsv profile projection engine: it adds a CI-style “all embedded profiles must load + dry_compile” gate, expands discovery-merge to support identity-based per-Distribution merges, and enhances the catalog envelope to support template-driven inheritance from the rendered Dataset via an injected inner binding.
Changes:
- Add template-driven catalog inheritance by injecting
inner(rendered Dataset) into the catalog rendering context, and unify title/catalog field rendering against that context. - Introduce
discovery_merge.distribution_mergeto merge publisher DCAT distribution entries into inferred distributions by identity keys (downloadURL/accessURL/@id), honoring forced override paths. - Add/extend embedded-profile tests and update bundled profile docs + YAMLs to enable per-distribution merge for DCAT-US v3 and DCAT-AP v3.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/cmd/profile/projection.rs | Injects inner into catalog template context and adds tests ensuring catalog templates can reference both inner and analysis ctx. |
| src/cmd/profile/profile_spec.rs | Extends YAML spec with distribution_merge config, adds embedded-profile CI-gate tests. |
| src/cmd/profile/discovery_merge.rs | Implements per-distribution identity matching + merge logic with forced-path protections and extensive unit tests. |
| resources/profiles/README.md | Documents catalog envelope inheritance and per-distribution discovery merge configuration. |
| resources/profiles/dcat-us-v3.yaml | Enables distribution_merge for DCAT-US v3 with identity key priority and fill-if-absent strategy. |
| resources/profiles/dcat-ap-v3.yaml | Enables distribution_merge for DCAT-AP v3 mirroring DCAT-US v3 behavior. |
Two Copilot findings on PR #3910: 1. `projection.rs:314` — doc comment incorrectly claimed `serde_json::Value::clone` was cheap because of string reference-counting. That's wrong; it's a deep copy that allocates for nested objects/arrays/strings. Reworded to acknowledge the allocation honestly (once per catalog wrap, small ctx in typical use) and noted the future optimization would be to layer contexts inside minijinja instead of merging at the JSON layer. 2. `discovery_merge.rs:170` — `merge_distribution_array` called `a.to_vec()` on the publisher array upfront, paying a full clone even when most entries are dropped. Switched to iterating by reference (`as_slice` for arrays, `std::slice::from_ref` for the scalar coercion). Per-field clones now happen lazily inside the matched-pair arm, after the forced-path filter — so forced fields are skipped without cloning and unmatched-drop entries cost nothing. All 141 profile unit tests + 49 integration tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 38 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Picks up the §6 queued follow-ups from
profile3-handoff.mdfor the YAML-driven projection engine. Four stacked commits, each scoped to one item:3d1e2f7d3— CI gate: parametriccargo test embedded_*smoke covers every entry inEMBEDDEDforload + dry_compile, so any future profile YAML inherits coverage automatically.6e0508b2c— Per-distribution identity-based discovery merge: newDistributionMergeconfig lets publisher DCAT contribute per-Distribution metadata (titles, rights, licenses) without losing qsv's inferred stats. Matches bydcat:downloadURL/dcat:accessURL/@id.f16603dc4— Template-driven catalog inheritance: exposes the rendered Dataset asinneroncatalog.fields[]+title_templatectx, so users can rename / transform / conditionally copy via the existingFieldDeclmachinery.inherit_from_dataset: [...]stays as the verbatim-copy shortcut.c827da5c2— Roborev feat:statsadd "sortiness" statistic #2499 fix: per-dist merge now honorsforced_dcat_pathsat/dcat/<dist_key>/<idx>/<field>(exact + nested-parent), andarray_keyactually defaults todcat:distributionwhen omitted.Engine net change: ~1003 LOC added, ~23 removed across 6 files. All knowledge stays in the YAML profiles (
dcat-us-v3.yaml,dcat-ap-v3.yaml) — no hardcoded behavior added.§6 items still queued (not in this PR — both need external-dep decisions):
mlcroissantPython validator integrationTest plan
cargo test --bin qsv -F profile,feature_capable cmd::profile::— 141 unit tests pass (16 new across the 4 commits)cargo test --test tests -F profile,feature_capable -- test_profile::— 49 integration tests pass (golden parity byte-exact, no shape drift)cargo buildfor all 4 binaries:qsv(-F all_features),qsvmcp(-F qsvmcp),qsvlite(-F lite),qsvdp(-F datapusher_plus) — all greencargo clippy --bin qsv -F profile,feature_capable— no new warningscargo +nightly fmtappliedpython3 scripts/docs-drift-check.py— no drift detectedresources/profiles/README.mdupdated with "Catalog envelope" + "Discovery merge" sections (incl.distribution_mergesub-block)statsadd "sortiness" statistic #2499 closed (closed=true)🤖 Generated with Claude Code