Skip to content

feat(profile): YAML projection follow-ups — CI gate, per-dist merge, catalog inheritance#3910

Merged
jqnatividad merged 5 commits into
masterfrom
profile-follow-ups
May 27, 2026
Merged

feat(profile): YAML projection follow-ups — CI gate, per-dist merge, catalog inheritance#3910
jqnatividad merged 5 commits into
masterfrom
profile-follow-ups

Conversation

@jqnatividad
Copy link
Copy Markdown
Collaborator

Summary

Picks up the §6 queued follow-ups from profile3-handoff.md for the YAML-driven projection engine. Four stacked commits, each scoped to one item:

  • 3d1e2f7d3 — CI gate: parametric cargo test embedded_* smoke covers every entry in EMBEDDED for load + dry_compile, so any future profile YAML inherits coverage automatically.
  • 6e0508b2c — Per-distribution identity-based discovery merge: new DistributionMerge config lets publisher DCAT contribute per-Distribution metadata (titles, rights, licenses) without losing qsv's inferred stats. Matches by dcat:downloadURL / dcat:accessURL / @id.
  • f16603dc4 — Template-driven catalog inheritance: exposes the rendered Dataset as inner on catalog.fields[] + title_template ctx, so users can rename / transform / conditionally copy via the existing FieldDecl machinery. inherit_from_dataset: [...] stays as the verbatim-copy shortcut.
  • c827da5c2 — Roborev feat: stats add "sortiness" statistic #2499 fix: per-dist merge now honors forced_dcat_paths at /dcat/<dist_key>/<idx>/<field> (exact + nested-parent), and array_key actually defaults to dcat:distribution when omitted.

Engine net change: ~1003 LOC added, ~23 removed across 6 files. All knowledge stays in the YAML profiles (dcat-us-v3.yaml, dcat-ap-v3.yaml) — no hardcoded behavior added.

§6 items still queued (not in this PR — both need external-dep decisions):

  • SHACL validation backend for DCAT-AP v3
  • Croissant mlcroissant Python validator integration

Test plan

  • cargo test --bin qsv -F profile,feature_capable cmd::profile::141 unit tests pass (16 new across the 4 commits)
  • cargo test --test tests -F profile,feature_capable -- test_profile::49 integration tests pass (golden parity byte-exact, no shape drift)
  • cargo build for all 4 binaries: qsv (-F all_features), qsvmcp (-F qsvmcp), qsvlite (-F lite), qsvdp (-F datapusher_plus) — all green
  • cargo clippy --bin qsv -F profile,feature_capable — no new warnings
  • cargo +nightly fmt applied
  • python3 scripts/docs-drift-check.py — no drift detected
  • resources/profiles/README.md updated with "Catalog envelope" + "Discovery merge" sections (incl. distribution_merge sub-block)
  • Roborev feat: stats add "sortiness" statistic #2499 closed (closed=true)

🤖 Generated with Claude Code

jqnatividad and others added 4 commits May 27, 2026 07:25
Adds three smoke tests in src/cmd/profile/profile_spec.rs:

- embedded_dcat_ap_v3_parses_and_dry_compiles: locks down DCAT-AP v3
  (validation disabled, eu_theme vocab, 28 field_mappings).
- embedded_croissant_parses_and_dry_compiles: locks down Croissant
  (validation + discovery_merge disabled, recordsets populated,
  croissant_datatype vocab, 16 field_mappings).
- all_embedded_profiles_parse_and_dry_compile: parametric loop over
  EMBEDDED so any future profile added to the bundle automatically
  inherits load + dry_compile coverage without a per-profile test edit.

Addresses the "Profile dry-run validation gate in CI" follow-up from
profile3-handoff.md §6. The dry_compile pass already runs at runtime
via profile_spec::load; these tests exercise it at build time too so
a malformed bundled YAML fails CI rather than only at first invocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today the `dcat:distribution` array is on `discovery_merge.never_overwrite`
so publisher DCAT contributions to per-Distribution metadata (titles,
rights, license attributions) are silently dropped. This commit adds a
profile-configurable per-element merge path that lets publisher fields
flow into the inferred record while qsv's own per-distribution stats
(byteSize, mediaType, CSVW schema) always win.

Engine
- New `DistributionMerge` struct on `DiscoveryMerge` with `enabled`,
  `array_key`, `identity_keys`, `field_strategy`, `append_unmatched`.
- `discovery_merge::merge` now handles the distribution array first
  when `distribution_merge.enabled`, bypassing `never_overwrite` for
  that one key. Each publisher Distribution is matched against an
  inferred one by the first identity key with a non-empty value on
  both sides; matched fields merge via `field_strategy`
  (default `fill-if-absent`). Unmatched publisher entries are
  silently dropped unless `append_unmatched: true`.
- Helpers (`merge_distribution_array`, `coerce_to_array`,
  `find_distribution_match`, `non_empty_string`) keep the matching
  logic isolated from the top-level key loop.

Profile YAMLs
- `dcat-us-v3.yaml` + `dcat-ap-v3.yaml` enable per-distribution
  merging with `dcat:downloadURL` → `dcat:accessURL` → `@id` as the
  identity-key priority and `fill-if-absent` for matched fields.
  Unmatched publisher entries are dropped (`append_unmatched: false`)
  since qsv's inferred distributions are canonical for the local data.
- Croissant intentionally leaves `discovery_merge.enabled: false`
  (no change).

Docs
- `resources/profiles/README.md` gains a "Discovery merge" section
  covering both the outer `discovery_merge` block and the new
  `distribution_merge` sub-block, including the identity-key
  precedence used by the bundled DCAT profiles.

Tests
- 8 new unit tests in `discovery_merge::tests`:
  pairs_on_download_url, falls_back_to_access_url,
  drops_unmatched_when_append_disabled, appends_unmatched_when_configured,
  handles_single_object_inputs, bypasses_never_overwrite_when_enabled,
  disabled_preserves_never_overwrite,
  empty_identity_keys_treats_all_as_unmatched.

Addresses profile3-handoff.md §6 follow-up "Per-distribution merging
in discovery_merge". All 133 profile unit tests + 49 profile
integration tests pass; clippy + docs-drift-check clean; all four
binary variants (qsv/qsvlite/qsvmcp/qsvdp) build green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today `catalog.inherit_from_dataset` is a static `Vec<String>` —
each listed Dataset key is copied verbatim onto the Catalog
envelope. Useful for `dct:publisher` style hand-offs, but it
cannot rename keys, transform values, or condition the copy on
runtime state.

This commit unlocks all three by exposing the rendered Dataset
block as `inner` on the rendering context used by both
`catalog.fields[]` templates and the catalog `title_template`.
Users gain the full FieldDecl directive surface (emit_when,
default, required_level) for catalog-side inheritance without
a new config block.

Engine
- `projection::wrap_as_catalog` now takes the analysis JSON ctx
  directly (`&Value`) instead of a pre-built `minijinja::Value`.
  A new `build_ctx_with_inner` helper unions the analysis ctx
  with `{"inner": dataset}` so catalog templates see both the
  underlying analysis vars (`pkg`/`res`/`stats`/`dpp`) and the
  rendered Dataset values via `inner["..."]`.
- `wrap_in_catalog_envelope` and the in-`project` Catalog-mode
  call site updated to pass the JSON ctx through.
- `title_template` rendering now uses the same ctx_with_inner,
  so titles can mix `pkg.publisher` with `inner["dct:title"]`.
  This is a backward-compatible superset — legacy templates that
  only referenced `inner` keep working.

Profile spec
- `CatalogBlock` doc comments refreshed to document the `inner`
  binding and the analysis-ctx availability. `inherit_from_dataset`
  is kept as the verbatim-copy shortcut; `fields[]` is presented
  as the path for rename / transform / conditional inheritance.

Docs
- `resources/profiles/README.md` gains a "Catalog envelope"
  section contrasting the two inheritance paths with a worked
  rename + conditional-copy example.

Tests
- 4 new unit tests in `projection::tests`:
  catalog_field_template_can_reference_inner_dataset,
  catalog_field_template_sees_analysis_ctx_alongside_inner,
  catalog_field_emit_when_against_inner_skips_empty,
  catalog_title_template_can_reference_analysis_ctx.
- Existing `catalog_mode_wraps_dataset` and
  `dcat_us_v3_golden_parity_catalog` still pass byte-exact —
  the bundled YAMLs use `inherit_from_dataset` so their wire
  shape is unchanged.

Addresses profile3-handoff.md §6 follow-up "Catalog inheritance
from arbitrary Dataset keys". 137 profile unit tests + 49
integration tests pass; clippy + docs-drift-check clean; all
four binaries build green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues surfaced by Roborev #2499 against
6e0508b (per-distribution discovery merge):

MEDIUM — `merge_distribution_array` bypassed the
`forced_dcat_paths` protection. A user override at
`/dcat/dcat:distribution/0/<field>` (set via
`--initial-context dataset_info`) could silently be filled
from publisher-discovered DCAT when the inferred value was
absent, contradicting the documented force-override
guarantee. `forced_dcat_paths` and `dist_key` now thread into
the per-element path; each publisher field is skipped when
its pointer `/dcat/<dist_key>/<idx>/<field>` is forced
exactly or is the parent of a forced sub-key (matches the
top-level forced-path semantics in `merge`).

LOW — `array_key` was documented as defaulting to
`dcat:distribution`, but the implementation used
`and_then(|d| d.array_key.as_deref())`, so enabling
`distribution_merge` without spelling out `array_key`
silently disabled merging. Switched to
`.map(|d| d.array_key.as_deref().unwrap_or("dcat:distribution"))`
so the default actually fires.

Tests
- `dist_merge_array_key_defaults_to_dcat_distribution`
- `dist_merge_honors_forced_path_on_distribution_field`
- `dist_merge_honors_forced_nested_path_under_distribution_field`
- `dist_merge_unrelated_forced_path_does_not_block`

All 141 profile unit + 49 integration tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR delivers follow-ups for the YAML-driven qsv profile projection engine: it adds a CI-style “all embedded profiles must load + dry_compile” gate, expands discovery-merge to support identity-based per-Distribution merges, and enhances the catalog envelope to support template-driven inheritance from the rendered Dataset via an injected inner binding.

Changes:

  • Add template-driven catalog inheritance by injecting inner (rendered Dataset) into the catalog rendering context, and unify title/catalog field rendering against that context.
  • Introduce discovery_merge.distribution_merge to merge publisher DCAT distribution entries into inferred distributions by identity keys (downloadURL/accessURL/@id), honoring forced override paths.
  • Add/extend embedded-profile tests and update bundled profile docs + YAMLs to enable per-distribution merge for DCAT-US v3 and DCAT-AP v3.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/cmd/profile/projection.rs Injects inner into catalog template context and adds tests ensuring catalog templates can reference both inner and analysis ctx.
src/cmd/profile/profile_spec.rs Extends YAML spec with distribution_merge config, adds embedded-profile CI-gate tests.
src/cmd/profile/discovery_merge.rs Implements per-distribution identity matching + merge logic with forced-path protections and extensive unit tests.
resources/profiles/README.md Documents catalog envelope inheritance and per-distribution discovery merge configuration.
resources/profiles/dcat-us-v3.yaml Enables distribution_merge for DCAT-US v3 with identity key priority and fill-if-absent strategy.
resources/profiles/dcat-ap-v3.yaml Enables distribution_merge for DCAT-AP v3 mirroring DCAT-US v3 behavior.

Comment thread src/cmd/profile/projection.rs Outdated
Comment thread src/cmd/profile/discovery_merge.rs Outdated
Two Copilot findings on PR #3910:

1. `projection.rs:314` — doc comment incorrectly claimed
   `serde_json::Value::clone` was cheap because of string
   reference-counting. That's wrong; it's a deep copy that
   allocates for nested objects/arrays/strings. Reworded to
   acknowledge the allocation honestly (once per catalog wrap,
   small ctx in typical use) and noted the future optimization
   would be to layer contexts inside minijinja instead of
   merging at the JSON layer.

2. `discovery_merge.rs:170` — `merge_distribution_array`
   called `a.to_vec()` on the publisher array upfront, paying
   a full clone even when most entries are dropped. Switched
   to iterating by reference (`as_slice` for arrays,
   `std::slice::from_ref` for the scalar coercion). Per-field
   clones now happen lazily inside the matched-pair arm, after
   the forced-path filter — so forced fields are skipped
   without cloning and unmatched-drop entries cost nothing.

All 141 profile unit tests + 49 integration tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codacy-production
Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 38 complexity

Metric Results
Complexity 38

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@jqnatividad jqnatividad merged commit 68cea26 into master May 27, 2026
17 of 19 checks passed
@jqnatividad jqnatividad deleted the profile-follow-ups branch May 27, 2026 13:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants