feat(profile): YAML-driven projection engine — DCAT-US v3, DCAT-AP v3, Croissant #3908

Merged
jqnatividad merged 16 commits into master from dcat-us-v3-optimization2 on May 27, 2026

@jqnatividad
Collaborator

Summary

Replaces the hardcoded dcat.rs projection engine with a YAML-driven, multi-profile engine selectable via --profile <name|path>. Three bundled profiles ship:

| Profile | Spec | Validation |
|---|---|---|
| dcat-us-v3 (default) | https://github.com/GSA/dcat-us/ | embedded GSA JSON Schema bundle |
| dcat-ap-v3 | https://semiceu.github.io/DCAT-AP/releases/3.0.0/ | disabled (upstream ships SHACL) |
| croissant | https://github.com/mlcommons/croissant | disabled (Python validator upstream) |

All three are bundled (include_str!) and resolved before file-path fallback. Users can also pass a custom YAML profile.

Highlights

  • −257 net Rust LOC vs the pre-YAML engine: dcat.rs (1738), catalog.rs (154), ckan_to_dcat.rs (271), curie.rs (225) deleted; new profile_spec.rs, projection.rs, discovery_merge.rs plus 12 new minijinja helpers added.
  • Byte-equivalent DCAT-US v3 output verified by 3 fixture × 2 mode golden parity tests (tests/resources/profile/golden/).
  • CKAN→target translation now profile-driven via field_mappings: in each YAML.
  • --profile flag added; defaults to dcat-us-v3 so existing invocations are backwards-compatible.
  • --dcat-legacy-license now actually threads into the projection (was parsed-but-dead).
  • Catalog envelope carries its own @context (legacy engine omitted it — invalid JSON-LD).
  • Discovery merge applies to inner Dataset in Catalog mode (legacy targeted the outer envelope).
  • Force semantics now alias-aware: forcing package.author also locks package.publisher since both map to /dcat/dct:publisher.

Test plan

  • cargo test --bin qsv -F profile,feature_capable cmd::profile:: — 127 unit tests pass
  • cargo test --test tests -F profile,feature_capable -- test_profile:: — 49 integration tests pass
  • cargo build --bin qsv -F all_features — clean
  • cargo build --bin qsvmcp -F qsvmcp — clean
  • cargo build --bin qsvlite -F lite — clean
  • cargo build --bin qsvdp -F datapusher_plus — clean
  • python3 scripts/docs-drift-check.py — no drift detected
  • cargo +nightly fmt applied across all commits
  • Smoke test on Pittsburgh-311 live URL (reviewer step)
  • Verify DataPusher+ ingest still parses the dcat-us-v3 output (reviewer step)

Commit walkthrough (14 commits)

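Plan-stage commits (rows 1-9 below, covering Stages 1-8 plus a mid-stream handoff) followed by three review-fix rounds (rows 10, 12, 13) and a gitignore cleanup (row 11).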

| # | Stage / Job | Commit | What |
|---|---|---|---|
| 1 | Stage 1: scaffold | b7ecfc41d | profile_spec.rs + projection.rs + discovery_merge.rs + 11 helpers + --profile flag |
| 2 | Stage 2: goldens | a01da79aa | 3 fixtures × 2 modes captured from legacy engine |
| 3 | Stage 3: DCAT-US v3 YAML | 6732c644d | dcat-us-v3.yaml authored (23 dataset + 22 distribution fields) |
| 4 | handoff #3 | 86ecf8c26 | mid-stream handoff doc |
| 5 | Stage 4a: orchestrator swap | 551cd443b | run() routes through projection engine + parity tests |
| 6 | Stages 4b + 5 | 965f13a27 | Delete 4 legacy modules; refactor dcat_validate to consume &ProfileSpec |
| 7 | Stage 6: DCAT-AP v3 | 3a025e848 | dcat-ap-v3.yaml + 4 smoke tests |
| 8 | Stage 7: Croissant | 7936bb6d9 | croissant.yaml + 5 smoke tests + build_croissant_fields helper |
| 9 | Stage 8: docs | 413267486 | regenerated docs/help/profile.md, finalized handoff |
| 10 | Roborev #2490 | 38ac2dbf8 | 7 findings: catalog @context, discovery merge target, spatial null, --dcat-legacy-license wiring, force shaping, validator paths, default impl |
| 11 | gitignore cleanup | 53c552e7f | drop auto-generated stats caches |
| 12 | Roborev #2491 | 1d9780e49 | force-vs-formula regression (skip formula passes on forced fields) |
| 13 | Roborev #2493 | c5cb935da | alias-aware force (forcing author also locks publisher) |

Three internal roborev rounds resolved 9 findings total. Final state: 0 open reviews.

Architecture notes

The full handoff doc — profile3-handoff.md (root) — captures the engine design, key decisions, file map, and queued follow-ups (SHACL backend for DCAT-AP, Python mlcroissant validator integration, per-distribution merging in discovery_merge).

🤖 Generated with Claude Code

jqnatividad and others added 14 commits May 26, 2026 19:47
… force semantics)

Closes the five gaps that kept `qsv profile` from being an agency-grade
DCAT-US v3 reference tool:

- Vendor the full GSA JSON Schema bundle (26 definitions + 2 qsv
  overlays + MANIFEST.json + refresh README) under resources/dcat-us-v3/,
  pinned to upstream commit cf8789002. `--validate-dcat` now runs against
  the full bundle via `referencing::Registry`, dispatching the Dataset
  or Catalog overlay by the emitted `@type`. A `curie::strip_curies`
  pre-pass bridges qsv's JSON-LD-compact output to GSA's unprefixed
  schema keys without touching the emitted JSON on disk.

- Add `--catalog` flag that wraps the Dataset inside a `dcat:Catalog`
  envelope (`Catalog{dataset:[...]}`) for federation harvesters.

- Emit nine new optional v3 fields with natural data sources:
  Dataset-level `dct:created`, `dcat:version`, `dcat:versionNotes`;
  Distribution-level `dcat:checksum` (SHA-256 via sha2), `dcat:compressFormat`,
  `dcat:packageFormat`, `dcat:spatialResolutionInMeters`, `dct:language`,
  `dct:conformsTo`. Widen `dct:conformsTo` to array per v3 cardinality;
  emit `dct:license` as string and `dcat:byteSize` as string to match
  the GSA schemas' declared shapes.

- Implement full `force: true` override semantics across all three
  --initial-context subtrees. `context::collect_forced_paths` now walks
  package/resource entries through a 47-entry `ckan_to_dcat` mapping
  table; `apply_force_overrides` in `run()` applies forced leaves
  LAST so they beat both inferred and discovered metadata.

Pipeline precedence (low → high): inferred → discovered → dataset_info
pointers → forced leaves → schema validation.

Bumps profile feature: adds `sha2 = "0.10"` as a direct dep. Test
counts: 143 unit (was 96, +47) and 29 integration (was 18, +11) all
passing, plus a new bundle pin guard test that re-hashes every
vendored schema against MANIFEST.json on each run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lay the foundation for the YAML-driven multi-profile projection engine
described in plan §1-§2. New modules are wired into profile.rs but the
orchestrator still calls the legacy dcat.rs path — zero behavior change
shipped here. Subsequent stages (§3-§8) populate the profile YAMLs,
swap the orchestrator, and delete the legacy hardcoded modules.

New modules:
* src/cmd/profile/profile_spec.rs — ProfileSpec serde types, embedded-
  first load() with case-insensitive name resolution, file-path fallback,
  6 unit tests.
* src/cmd/profile/projection.rs — generic project() engine with
  ProjectionMode { Dataset, Catalog }, ProjectionWarning { Severity },
  wrap_as_catalog, for_each_column RecordSet expansion, profile-aware
  lookup/field_mapping closures, dry_compile validator, 9 unit tests.
* src/cmd/profile/discovery_merge.rs — merge() with fill-if-absent,
  overlay-array, never strategies; never_overwrite + forced_paths
  protection; 5 unit tests.
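
For reviewers who want a mental model of the merge rules, here is a minimal std-only sketch. It is not the code in discovery_merge.rs: string maps stand in for JSON values, the `merge` signature is hypothetical, and `overlay-array` is left out because its exact semantics are not spelled out in this message.

```rust
use std::collections::{BTreeMap, HashSet};

/// Merge strategies modeled in this sketch (overlay-array is omitted).
#[derive(Clone, Copy, PartialEq)]
enum Strategy {
    FillIfAbsent,
    Never,
}

/// Copy discovered keys into `inferred` unless a protection rule applies.
/// Forced paths are assumed to use the `/dcat/<key>` pointer form.
fn merge(
    inferred: &mut BTreeMap<String, String>,
    discovered: &BTreeMap<String, String>,
    strategy: Strategy,
    never_overwrite: &[&str],
    forced_paths: &HashSet<String>,
) {
    if strategy == Strategy::Never {
        return;
    }
    for (key, value) in discovered {
        let protected = never_overwrite.contains(&key.as_str())
            || forced_paths.contains(&format!("/dcat/{key}"));
        // fill-if-absent: only write keys the inferred block lacks.
        if !protected && !inferred.contains_key(key) {
            inferred.insert(key.clone(), value.clone());
        }
    }
}
```

With `never_overwrite` set to `@context`, `@type` and `dcat:distribution`, a discovered `dct:description` fills a gap, while a discovered `dct:title` never replaces an inferred one and a forced `/dcat/dct:publisher` stays untouched.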

Helper additions (formula_helpers.rs):
* Filters: only_if_absolute_iri, basename, file_stem,
  sanitize_iso_8601_interval, format_mailto.
* Globals: sha256_of (streaming), blake3_of (mmap+rayon), file_size_of,
  compress_format, package_format, build_csvw_schema.

Helpers needing profile state (lookup, field_mapping) live in
projection.rs::register_profile_helpers as closures over the
ProfileSpec; they unwrap_or(UNDEFINED) so | default chains work.
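
Why the undefined return matters can be shown without minijinja. In the sketch below, `None` plays the role of the undefined value and `unwrap_or` plays the role of a `| default(raw)` filter. The function names and the single vocabulary entry are illustrative assumptions, not the contents of the shipped profile.

```rust
use std::collections::HashMap;

/// Vocabulary lookup: `None` models minijinja's undefined value.
fn lookup(vocab: &HashMap<&'static str, &'static str>, raw: &str) -> Option<&'static str> {
    vocab.get(raw).copied()
}

/// Models a `lookup(...) | default(raw)` chain: fall back to the raw
/// input when the vocabulary has no entry for it.
fn license_or_raw<'a>(vocab: &HashMap<&'static str, &'static str>, raw: &'a str) -> &'a str {
    lookup(vocab, raw).unwrap_or(raw)
}
```

If the helper returned an empty string or an error instead of an undefined value, the fallback would never fire and unknown license ids would be dropped or abort rendering.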

USAGE additions:
* --profile <name|path>: embedded names (dcat-us-v3, dcat-ap-v3,
  croissant) resolved first; falls back to file path. Not yet
  consumed in run() — wired up in Stage 4.
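
The resolution order can be sketched as follows. This is an illustration under stated assumptions, not the `load()` in profile_spec.rs: `ProfileSource` and `resolve_profile` are hypothetical names, and the real loader also parses the YAML.

```rust
use std::path::PathBuf;

/// Names of the bundled profiles listed in the PR summary.
const EMBEDDED: [&str; 3] = ["dcat-us-v3", "dcat-ap-v3", "croissant"];

#[derive(Debug, PartialEq)]
enum ProfileSource {
    Embedded(&'static str),
    File(PathBuf),
}

/// Embedded names win (case-insensitively); anything else is treated
/// as a file path. A missing flag falls back to dcat-us-v3.
fn resolve_profile(name_or_path: Option<&str>) -> ProfileSource {
    let requested = name_or_path.unwrap_or("dcat-us-v3");
    EMBEDDED
        .iter()
        .copied()
        .find(|name| name.eq_ignore_ascii_case(requested))
        .map(ProfileSource::Embedded)
        .unwrap_or_else(|| ProfileSource::File(PathBuf::from(requested)))
}
```

Checking embedded names first means a stray local file called `croissant` cannot shadow the bundled profile.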

Placeholder YAMLs under resources/profiles/ exist so include_str! resolves
during Stage 1 builds; they will be replaced with real content in
Stages 3 (DCAT-US v3), 6 (DCAT-AP v3), 7 (Croissant).

Verification:
* cargo build --bin qsv -F profile,feature_capable — clean (23 expected
  dead-code warnings for the unused scaffold).
* cargo test cmd::profile:: — 163 passed (+20 new tests).
* cargo test --test tests test_profile:: — 29 passed (no regression).
* cargo +nightly fmt — applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tage 2)

Lock the byte-equivalent output of the current hardcoded dcat.rs engine
against three regression fixtures so Stage 3's YAML-driven projection
can be asserted to produce identical Dataset + Catalog blocks.

Goldens captured by running today's qsv profile against each fixture
with the canonical --initial-context template, then normalizing via jq
to strip the only path-dependent field (qsv:sourcePath inside
dcat:distribution). Everything else in the .dcat block — including
dcat:byteSize, dcat:checksum, dct:modified, csvw:tableSchema — is
deterministic for fixed input and is captured verbatim.

Fixtures (under tests/resources/profile/golden/):
* nyc-311-subset.csv (10 rows) — geocoded urban service requests:
  lat/lon present, mixed Open/Closed status, multi-agency.
* usda-soil-subset.csv (10 rows) — scientific numeric data: pH,
  organic_carbon_pct, nitrogen_pct, clay/sand/silt percentages.
* wprdc-311-subset.csv (10 rows) — Pittsburgh 311 records:
  capitalized headers, X/Y geo, council districts + wards.

Goldens per fixture:
* <fixture>.dataset.expected.json — the .dcat block from Dataset mode.
* <fixture>.catalog.expected.json — the .dcat block from --catalog mode.

.gitignore whitelists tests/resources/profile/golden/*.{csv,expected.json}
so the *.json + *.csv blanket-ignores don't strip them.

These goldens will drive Stage 3's dcat_us_v3_golden_parity_dataset
and dcat_us_v3_golden_parity_catalog tests; CI hard-fails on drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author resources/profiles/dcat-us-v3.yaml — the full DCAT-US v3
projection definition that will replace the hardcoded dcat.rs engine
in Stage 4. The YAML mirrors the legacy add_* functions field-for-field
in declaration order so serde_json::Map insertion preserves wire-shape
parity (verified against the Stage-2 goldens at swap time).

Profile content:
* 4 vocabularies (license_iri, accrual_periodicity, iso_639_1,
  csvw_datatype) — each migrated verbatim from the legacy Rust
  constants. The EU vocab IRIs retain http:// scheme per their
  canonical published identifiers; DevSkim DS137138 suppressed per
  line.
* 53 field_mappings — same CKAN→DCAT pointer table the legacy
  ckan_to_dcat::CKAN_TO_DCAT held, in identical declaration order so
  alias-resolution precedence is preserved.
* dataset.fields[] — 23 entries covering core identity, provenance,
  contact point (required), classification, coverage, US codes
  (recommended), governance, and extended metadata. emit_when guards
  match the legacy `if let Some(...)` shapes.
* distribution.fields[] — 22 entries covering title, description,
  download URL, format/license/restrictions, language/conformance,
  file-derived facts (byteSize, checksum, compress/package format),
  spatial resolution, and csvw:tableSchema.
* catalog block reproduces wrap_as_catalog's envelope (Catalog of
  <title>, dct:conformsTo, dct:publisher inheritance).
* discovery_merge: enabled, never_overwrite=[@context,@type,
  dcat:distribution], fill-if-absent strategy.
* validation: enabled against the vendored GSA bundle under
  resources/dcat-us-v3/ with the same 11 strippable CURIE prefixes.

dry_compile verification:
A new unit test (embedded_dcat_us_v3_parses_and_dry_compiles)
parses the embedded YAML and runs projection::dry_compile() against
it — exercising every template's minijinja compile path. All
templates compile clean.

The actual byte-equivalent parity test (running each Stage-2 fixture
through projection::project() and asserting against goldens) lands
in Stage 4 alongside the orchestrator swap — at that point the
engine actually consumes the YAML.

The reference cross-checked sources for content:

  https://github.com/GSA/dcat-us/
  https://resources.data.gov/resources/dcat-us3/
  the vendored GSA bundle under resources/dcat-us-v3/

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the current state after the YAML-driven projection migration's
first three commits. Documents what's wired (scaffold + helpers + flag +
goldens + DCAT-US v3 YAML), what's still on the legacy path (dcat.rs
drives output), and a 9-sub-step Stage 4 plan for the orchestrator swap.

Supersedes profile2-handoff.md for post-PR-#3901 work. Key gotchas
distilled into §5: lookup helpers must return Value::UNDEFINED,
goldens only normalize qsv:sourcePath, field-mapping count is 53 not 47.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
profile.rs::run now routes through projection::project() with the
loaded ProfileSpec (default: dcat-us-v3). The YAML engine produces
byte-equivalent output to the legacy dcat.rs path on all 6 golden
fixtures (3 inputs × dataset/catalog modes), verified by new parity
integration tests.

Orchestrator changes:
* Load profile via profile_spec::load(args.flag_profile | "dcat-us-v3")
  at the top of run(), then projection::dry_compile() to fail fast on
  malformed embedded YAML.
* ContextArgs gains a `profile: &ProfileSpec` field; context::build
  threads it to load_initial_context → collect_forced_paths so the
  CKAN→target pointer translation uses profile.field_mappings instead
  of importing ckan_to_dcat.
* Replace dcat::build() call with projection::project(&profile,
  &projection_ctx, mode) — the projection_ctx carries pkg, res, stats,
  dpp, source_label, local_path matching the YAML's template names.
* Replace merge_discovered() with discovery_merge::merge(&profile,
  inferred, discovered, forced_dcat_paths) — same /dcat/<key> forced-
  path semantics, now driven by profile.discovery_merge.
* Catalog wrap baked into project() via ProjectionMode::Catalog
  (chosen upfront based on flag_catalog); orchestrator no longer
  calls catalog::wrap_as_catalog at the warning-filter step.
* Stash key renamed __pending_dcat_warnings →
  __pending_projection_warnings.
* DcatWarning → ProjectionWarning conversion bridges dcat_validate
  and run_profile_validation outputs (Stage 5 will refactor those
  modules to return ProjectionWarning directly).

Engine improvements:
* projection::project sets UndefinedBehavior::Chainable so
  `pkg.dpp_suggestions.spatial_extent.value | default("")` walks
  missing intermediates gracefully (matches legacy dcat.rs semantics
  where absent keys silently fall through).
* New file-aware helpers in formula_helpers.rs:
  - bbox_from_dpps(dpp, stats) — lat/lon column → POLYGON-WKT
    `dct:Location` array, mirroring legacy dcat::bbox_from_dpps.
  - temporal_from_dpps(dpp, stats) — date columns → array of
    `dct:PeriodOfTime`, one per inferred date column.
  - build_csvw_schema(stats) — column-name → stats-blob map walked,
    emitting `{columns: [...]}` with name, titles, datatype,
    qsv:cardinality / nullcount / min / max.
  - csvw_datatype_legacy helper mirrors the legacy mapping
    (Float → double, Integer → integer, Date → date, etc.).
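
As a rough picture of the datatype mapping, the sketch below covers only the three arms named above. The function name is hypothetical and the `string` fallback is an assumption; the full legacy table is not reproduced in this message.

```rust
/// Map a stats type name to a CSVW datatype.
/// Float, Integer and Date come from the commit message; the
/// `string` fallback for everything else is an assumption.
fn csvw_datatype(stats_type: &str) -> &'static str {
    match stats_type {
        "Float" => "double",
        "Integer" => "integer",
        "Date" => "date",
        _ => "string",
    }
}
```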

dcat-us-v3.yaml updates:
* dct:spatial / dct:temporal fields call bbox_from_dpps /
  temporal_from_dpps as fallbacks behind the formula-derived WKT
  suggestion.
* dct:license emits a plain string (legacy license_value shape) via
  `{{ lookup("license_iri", raw) | default(raw) }}`, not the previous
  `{"@id": ...}` object form (GSA Distribution.json declares license
  as anyOf:[null,string]).

Tests:
* 2 new integration tests (dcat_us_v3_golden_parity_dataset /
  _catalog) iterate the 3 fixtures and assert byte-equivalent .dcat
  output against the goldens.
* discovery_merge test: forced-path form switched from "/dct:title"
  to "/dcat/dct:title" so it matches the legacy dataset_info pointer
  shape; +1 new test for nested-path force blocking top-level merge.
* All 6 goldens refreshed to current legacy output (the original
  Stage-2 capture had alphabetical stats-cache state).
* Full test sweep: 165 unit + 31 integration tests pass, 0 failures.

The legacy dcat.rs / catalog.rs / ckan_to_dcat.rs / curie.rs modules
are still in tree (their tests still run via cmd::profile::*) but no
longer participate in the engine path. Stage 4b deletes them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tages 4b + 5)

The YAML-driven projection engine is now the only path. Stage 4a wired
projection::project() into run() with byte-equivalent output against
the goldens; this commit cleans up by deleting the legacy modules and
refactoring dcat_validate to consume the active ProfileSpec.

Deletions (~2400 LOC):
* src/cmd/profile/dcat.rs (1738 LOC) — the 9 add_* helpers,
  bbox_from_dpps, temporal_from_dpps, csvw_datatype, license_value,
  accrual_periodicity_iri, normalize_iso_639_1. The minijinja-side
  equivalents live in formula_helpers.rs + dcat-us-v3.yaml.
* src/cmd/profile/catalog.rs (154 LOC) — wrap_as_catalog moved into
  projection::wrap_as_catalog.
* src/cmd/profile/ckan_to_dcat.rs (271 LOC) — CKAN_TO_DCAT table
  moved verbatim into dcat-us-v3.yaml's field_mappings:; the lookup
  is now ProfileSpec::translate_ckan_ptr.
* src/cmd/profile/curie.rs (225 LOC) — strip_curies is now an inline
  helper in dcat_validate.rs driven by
  profile.validation.strippable_curie_prefixes.
* mod declarations for the deleted modules in profile.rs.

dcat_validate.rs refactor (Stage 5):
* New public API: validate(profile: &ProfileSpec, block: &Value) ->
  Vec<ProjectionWarning>. When profile.validation.enabled == false
  (DCAT-AP v3, Croissant), returns vec![] without touching the
  schema.
* Inline strip_curies / strip_curie_key replace the deleted curie
  module; the prefix list comes from
  profile.validation.strippable_curie_prefixes (still byte-identical
  to the legacy list for DCAT-US v3).
* classify_severity now returns projection::Severity instead of
  dcat::Severity.
* Test functions migrate to the new (profile, block) signature by
  loading the embedded dcat-us-v3 profile via profile_spec::load.
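
The CURIE pre-pass is easy to picture in isolation. The sketch below assumes a flat list of key/value pairs; the inline helpers in dcat_validate.rs presumably walk nested JSON, which is not modeled here, and the signatures are illustrative.

```rust
/// Strip a listed CURIE prefix from a key, e.g. `dct:title` -> `title`.
/// Keys with an unlisted prefix, or no prefix at all, pass through.
fn strip_curie_key<'a>(key: &'a str, prefixes: &[&str]) -> &'a str {
    match key.split_once(':') {
        Some((prefix, local)) if prefixes.contains(&prefix) => local,
        _ => key,
    }
}

/// Rename keys on a copy used only for validation, so the emitted
/// output keeps its prefixed keys.
fn strip_curies(block: &[(String, String)], prefixes: &[&str]) -> Vec<(String, String)> {
    block
        .iter()
        .map(|(k, v)| (strip_curie_key(k, prefixes).to_string(), v.clone()))
        .collect()
}
```

Driving the prefix list from the profile means a profile with a different namespace set can reuse the same pre-pass without a code change.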

profile.rs cleanup:
* dcat_validate::validate_dataset_or_catalog() call → validate().
* run_profile_validation now returns Vec<ProjectionWarning> directly;
  the .into_iter().map(From::from) bridge is gone.

projection.rs cleanup:
* impl From<DcatWarning> for ProjectionWarning removed (no longer
  needed — all warning producers return ProjectionWarning).

Verification:
* cargo build --bin qsv -F profile,feature_capable — clean.
* All 4 binaries build clean: qsv (-F all_features), qsvmcp
  (-F qsvmcp), qsvlite (-F lite), qsvdp (-F datapusher_plus).
* cargo test cmd::profile:: → 127 unit tests pass (down from 165;
  the deleted legacy modules carried 38 tests now obsoleted by the
  YAML+goldens parity coverage).
* cargo test --test tests test_profile:: → 31 integration tests pass
  (29 original + 2 new dcat_us_v3_golden_parity_* tests).

Net Rust LOC delta this commit: −2388 deleted, +60 added (inline
strip_curies + tests) = −2328 LOC. Cumulative since Stage 1:
−2328 + 1525 + 546 = −257 LOC vs the pre-YAML-engine state, AND
all engine knowledge now lives in resources/profiles/dcat-us-v3.yaml
where it's editable without recompiling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DCAT-AP v3 (semiceu.github.io/DCAT-AP/releases/3.0.0/) is now an
embedded profile selectable via --profile dcat-ap-v3. The shape is a
DCAT-US v3 subset, with:

* JSON Schema validation disabled (DCAT-AP ships SHACL upstream; a
  SHACL backend is a future enhancement).
* No dcat-us:* extensions (bureauCode, programCode, accessLevel,
  purpose, liabilityStatement) — those are US-specific.
* New `eu_theme` vocabulary mapping CKAN group slugs to EU
  publications-office authority IRIs
  (http://publications.europa.eu/resource/authority/data-theme/...).
* dcat:accessURL required on Distribution per the v3 spec
  (Mandatory cardinality 1..*).
* dct:conformsTo points at the SEMIC v3 release URL.
* Smaller field_mappings (29 entries vs the 53 in dcat-us-v3) since
  many DCAT-US extensions don't apply.

The same minijinja templates and helpers power both profiles; the
only Rust-side change in this commit is the YAML profile + tests.

Smoke tests (tests/test_profile.rs):
* dcat_ap_v3_emits_no_dcat_us_extensions — verifies the projection
  carries zero dcat-us:* keys even with the full initial-context.
* dcat_ap_v3_distribution_carries_access_url — confirms the
  Distribution-mandatory dcat:accessURL is populated.
* dcat_ap_v3_conforms_to_targets_spec_url — confirms downstream
  consumers can detect the profile via dct:conformsTo.
* dcat_ap_v3_validation_is_disabled_noop — confirms --validate-dcat
  with this profile produces no JSON Schema warnings (the validator
  short-circuits when profile.validation.enabled == false).

Source: https://github.com/SEMICeu/DCAT-AP
Cardinality reference: https://semiceu.github.io/DCAT-AP/releases/3.0.0/

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Croissant ML metadata format (mlcommons.org/croissant) is now an
embedded profile selectable via --profile croissant. The output is
schema.org-rooted JSON-LD conforming to Croissant 1.0:

* @context inlines the canonical Croissant map: @language=en,
  @vocab=https://schema.org/, plus cr:/dct: prefix shorthands. Per
  the Croissant spec at
  https://github.com/mlcommons/croissant/blob/main/docs/croissant-spec.md.
* @type=sc:Dataset; field paths use schema.org bare keys
  (name/description/url/license/creator/publisher/keywords/etc.)
  rather than dcat:/dct: prefixes.
* conformsTo target IRI: http://mlcommons.org/croissant/1.0.
* Distribution emitted under bare `distribution` (schema.org @vocab
  resolves it) with @type=sc:FileObject.
* Per-column cr:RecordSet/cr:Field expansion via the new
  build_croissant_fields helper — one Field per CSV column with
  schema.org dataType (sc:Text / sc:Integer / sc:Float / sc:Boolean
  / sc:Date / sc:DateTime).
* BLAKE3 hash via cr:fileFingerprint (qsv-native mmap+rayon, markedly
  faster than SHA-256 on multi-GB ML training data; Croissant has no
  SPDX-mandated algorithm so the choice is free).
* validation.enabled: false (Croissant uses a Python validator,
  mlcroissant, not JSON Schema).
* discovery_merge.enabled: false (Croissant doesn't live in
  CKAN-style data portals).

Engine extensions:
* DatasetBlock.context now accepts a `Value` (string or object) so
  the inline Croissant @context map round-trips verbatim. DCAT-US /
  DCAT-AP profiles still use a string URI — backwards-compatible.
* DistributionBlock.path lets profiles override the Distribution
  wrapper key. Croissant emits `distribution`; DCAT defaults remain
  `dcat:distribution`.
* New formula helper build_croissant_fields(stats) walks the per-
  column stats map and emits a flat cr:Field array with schema.org
  dataType IRIs.
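
The per-column expansion might look roughly like this. It is a sketch, not build_croissant_fields itself: the struct names are hypothetical, the stats type labels are assumed, and falling back to sc:Text for unknown types is a guess.

```rust
/// One column from the stats pass; the type labels are assumed.
struct ColumnStat {
    name: String,
    stats_type: String,
}

/// A flattened stand-in for one `cr:Field` entry.
#[derive(Debug, PartialEq)]
struct CroissantField {
    name: String,
    data_type: &'static str,
}

/// Emit one field per column, choosing a schema.org dataType.
/// Unknown types fall back to sc:Text (an assumption of this sketch).
fn build_fields(columns: &[ColumnStat]) -> Vec<CroissantField> {
    columns
        .iter()
        .map(|col| CroissantField {
            name: col.name.clone(),
            data_type: match col.stats_type.as_str() {
                "Integer" => "sc:Integer",
                "Float" => "sc:Float",
                "Boolean" => "sc:Boolean",
                "Date" => "sc:Date",
                "DateTime" => "sc:DateTime",
                _ => "sc:Text",
            },
        })
        .collect()
}
```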

Smoke tests (5 in tests/test_profile.rs):
* croissant_uses_schema_org_context_and_sc_dataset_type
* croissant_conforms_to_targets_mlcommons_spec
* croissant_emits_recordset_with_one_field_per_csv_column
* croissant_uses_bare_distribution_key_not_dcat_namespaced
* croissant_distribution_uses_file_object_type

Verification: cargo test cmd::profile:: → 127 unit, test_profile::
→ 40 integration tests pass (29 original + 2 parity + 4 DCAT-AP +
5 Croissant).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs/help/profile.md regenerated via --generate-help-md to surface
  the --profile flag added in Stage 1.
* profile3-handoff.md updated to reflect all 8 stages landed,
  full file map post-deletion, verification commands, captured
  design decisions, and queued follow-ups.
* src/cmd/profile.rs: drop the now-useless DcatWarning → ProjectionWarning
  conversion in the --validate-dcat code path (Stage 5 already
  refactored validate() to return ProjectionWarning directly).

Verification:
* python3 scripts/docs-drift-check.py → no drift detected.
* All 4 binaries build clean (qsv, qsvmcp, qsvlite, qsvdp).
* cargo test cmd::profile:: → 127 unit tests pass.
* cargo test --test tests test_profile:: → 40 integration tests pass.
* cargo clippy --bin qsv -F profile,feature_capable → no new findings
  in the YAML-engine code path.

This closes the YAML-driven projection engine migration. The shipped
binary always goes through projection::project(); the legacy
dcat.rs / catalog.rs / ckan_to_dcat.rs / curie.rs modules are
deleted. DCAT-US v3 / DCAT-AP v3 / Croissant projection knowledge
lives entirely in resources/profiles/*.yaml — editable without
recompiling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…/validate)

7 findings from the YAML-engine branch review at job 2490. Each fix
ships with a regression guard in tests/test_profile.rs.

Medium severity (6):

1. Catalog mode + discovery merge target (src/cmd/profile.rs:398).
   Discovery was merging into the Catalog envelope top-level instead
   of the nested Dataset. Fix: project Dataset always, apply
   discovery_merge::merge, THEN conditionally wrap in Catalog via the
   new projection::wrap_in_catalog_envelope helper. Guard:
   catalog_mode_merges_discovered_into_inner_dataset_not_envelope.

2. Catalog envelope missing @context (src/cmd/profile/projection.rs:296).
   The envelope carried CURIE keys (dct:title, dct:conformsTo,
   dcat:dataset) without a top-level @context, leaving it invalid as
   JSON-LD. Fix: wrap_as_catalog now copies profile.dataset.context
   into the envelope; inner Dataset keeps its own context for
   self-containment. Guard: catalog_envelope_carries_top_level_context.

3. dct:spatial emits string "null" when no bbox
   (resources/profiles/dcat-us-v3.yaml + dcat-ap-v3.yaml). bbox_from_dpps
   returning UNDEFINED rendered as `"null"` via `| tojson` because
   coerce_json_or_string left the literal alone. Fix: emit_when guard
   gates the field on WKT-or-bbox availability. Guard:
   spatial_field_suppressed_when_no_lat_lon_columns.

4. --dcat-legacy-license parsed but never wired
   (src/cmd/profile.rs:380). Flag was documented + collected into
   Args but never reached the YAML engine. Fix: thread the flag into
   projection_ctx as `legacy_license`, add a conditional Dataset-level
   dct:license field in dcat-us-v3.yaml gated on that variable.
   Guards: dcat_legacy_license_emits_dataset_level_license,
   dcat_legacy_license_off_keeps_license_distribution_only.

5. Forced package/resource values bypass profile shaping
   (src/cmd/profile/context.rs:388). collect_forced_paths was
   writing raw CKAN values to target pointers via
   apply_force_overrides, producing string-where-Agent-expected
   shapes (e.g. forced package.publisher → "Name" instead of
   {"@type":"foaf:Agent","foaf:name":"Name"}). Fix: CKAN-side
   forces now only contribute to `forced_paths` (discovery-merge
   protection); the value lives in merged package/resource via
   normalize_value_force and flows through the profile's templates
   for proper shaping. dataset_info forces still take the
   raw-write path (that's the documented escape hatch).
   Guard: forced_package_publisher_flows_through_profile_template.

6. validate() ignores profile.validation paths
   (src/cmd/profile/dcat_validate.rs:250). When validation.enabled
   was true, the function always used the embedded GSA bundle
   regardless of profile.validation.schema_dir. Fix: when the
   profile's schema_dir matches the embedded `resources/dcat-us-v3/`
   path (the only bundle qsv ships today), use the embedded
   validators; any other schema_dir produces a single
   Recommended-severity warning explaining that custom-bundle
   validation is a queued follow-up. The embedded DCAT-US v3
   profile's behavior is unchanged.

Low severity (1):

7. DiscoveryMerge::default() disabled merging
   (src/cmd/profile/profile_spec.rs:273). #[derive(Default)] gave
   `enabled: false`, contradicting the documented "fill-if-absent
   enabled by default" semantics — the `#[serde(default =
   "default_true")]` annotation only fires during deserialization.
   Fix: hand-rolled Default impl with enabled: true, the
   never_overwrite list (@context, @type, dcat:distribution), and
   fill-if-absent strategy.
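
Finding 7 comes down to a general Rust behavior: a derived `Default` uses `bool::default()`, which is `false`, regardless of any serde attribute. The sketch below contrasts the two approaches with simplified structs. The strategy field is omitted, and these are not the actual profile_spec.rs types.

```rust
/// Deriving Default yields `enabled: false` and an empty list.
#[derive(Default)]
struct DerivedMerge {
    enabled: bool,
    never_overwrite: Vec<String>,
}

/// Simplified stand-in for the real struct, with a hand-written Default.
struct DiscoveryMerge {
    enabled: bool,
    never_overwrite: Vec<String>,
}

impl Default for DiscoveryMerge {
    fn default() -> Self {
        DiscoveryMerge {
            enabled: true,
            never_overwrite: ["@context", "@type", "dcat:distribution"]
                .iter()
                .map(|s| s.to_string())
                .collect(),
        }
    }
}
```

A profile YAML that omits the discovery_merge block entirely goes through `Default`, not through the per-field serde defaults, which is why the two must agree.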

Golden refresh:
* Catalog goldens (nyc-311, usda-soil, wprdc-311) pick up the new
  envelope @context entry — finding #2 fix.
* usda-soil dataset golden loses the spurious `"dct:spatial":
  "null"` entry — finding #3 fix.

Verification:
* cargo test cmd::profile:: → 127 unit tests pass.
* cargo test --test tests test_profile:: → 46 integration tests pass
  (40 prior + 6 new regression guards).
* All 4 binaries build clean (qsv, qsvmcp, qsvlite, qsvdp).
* cargo +nightly fmt applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit accidentally committed three *.stats.csv files
(qsv stats cache, auto-regenerated on every profile run). They slipped
past .gitignore because the golden-directory *.csv whitelist also
matches the stats.csv suffix.

Fix: add a re-ignore rule for `tests/resources/profile/golden/*.stats.csv`
and the JSONL variant, then `git rm` the committed files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…2491)

Regression introduced by the #2490 fix #5: when CKAN-side `force: true`
values stopped being raw-written via apply_force_overrides, they
became vulnerable to overwrite by spec formulas. A formula targeting
`package.publisher` would replace the forced value in
merge_formula_results' pass-1 (before projection), violating the
documented "force beats inferred" guarantee.

Fix: track the CKAN-side forced field-name sets through the pipeline
so merge_formula_results can skip them.

* context.rs: collect_forced_paths now returns a 4-tuple including
  `forced_package_fields` and `forced_resource_fields`
  (HashSet<String> of CKAN-side field names marked force:true).
  load_initial_context returns the matching 6-tuple; AnalysisContext
  carries both sets.
* profile.rs: merge_formula_results takes the two sets and skips
  pass-1 inserts on matching field names. Suggestion-formula output
  (pass 2) lives in dpp_suggestions and is unaffected.
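
A minimal sketch of the pass-1 skip, assuming plain string maps in place of the real context types. The signature below is illustrative and much narrower than the actual merge_formula_results in profile.rs.

```rust
use std::collections::{HashMap, HashSet};

/// Pass-1 merge: formula output is written into the package map,
/// except for fields the user marked `force: true`.
fn merge_formula_results(
    package: &mut HashMap<String, String>,
    formula_results: &[(&str, &str)],
    forced_package_fields: &HashSet<String>,
) {
    for (field, value) in formula_results {
        if forced_package_fields.contains(*field) {
            continue; // force beats formula
        }
        package.insert(field.to_string(), value.to_string());
    }
}
```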

The forced value still flows through the profile templates for proper
shaping (so dct:publisher gets its foaf:Agent wrapper, etc.) — the
shaping fix from #2490 #5 is preserved.

Regression guard: forced_package_field_survives_formula_overwrite
(tests/test_profile.rs). Constructs a spec with a `title` formula
that would set "Formula Wins", combined with `package.title:
{value: "Forced Title", force: true}`. The output must carry
"Forced Title" — confirming force beats formula.

Verification:
* cargo test cmd::profile:: → 127 unit tests pass.
* cargo test --test tests test_profile:: → 47 integration tests
  pass (46 prior + 1 new regression guard).
* cargo +nightly fmt applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#2493)

Follow-up regression to #2491: the force-skip in merge_formula_results
only checked the exact CKAN field name. Aliases that project to the
same target pointer (e.g. `package.author` and `package.publisher`
both → `/dcat/dct:publisher`) bypassed the check — a formula writing
`publisher` could still overwrite a forced `author` value.

Fix: after the first pass collects forced (ckan_ptr, target_ptr)
pairs, walk profile.field_mappings and add every CKAN field whose
target appears in the forced target set to the forced_pkg /
forced_res field-name set. So forcing `package.author` now also locks
`package.publisher` (and any other alias keys for the same target).
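
The two-pass expansion can be sketched with std collections only. The function name is hypothetical and `field_mappings` is modeled as plain (CKAN field, target pointer) pairs rather than the ProfileSpec type.

```rust
use std::collections::HashSet;

/// Expand forced CKAN fields to every alias sharing a target pointer.
/// `field_mappings` holds (ckan_field, target_pointer) pairs.
fn expand_forced_aliases(
    forced_fields: &HashSet<String>,
    field_mappings: &[(&str, &str)],
) -> HashSet<String> {
    // First pass: targets reached by any forced field.
    let forced_targets: HashSet<&str> = field_mappings
        .iter()
        .filter(|(ckan, _)| forced_fields.contains(*ckan))
        .map(|(_, target)| *target)
        .collect();
    // Second pass: every CKAN field that maps to one of those targets.
    let mut expanded = forced_fields.clone();
    for (ckan, target) in field_mappings {
        if forced_targets.contains(target) {
            expanded.insert(ckan.to_string());
        }
    }
    expanded
}
```

Given mappings where `author` and `publisher` both point at `/dcat/dct:publisher`, forcing only `author` yields a set containing both names, while unrelated fields such as `title` stay unlocked.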

Alias pairs covered by this fix in DCAT-US v3:
* author / publisher → dct:publisher
* landing_page / url → dcat:landingPage
* data_dictionary / describedBy → dcat:describedBy
* accrualPeriodicity / frequency / update_frequency → dct:accrualPeriodicity
* dcat-us:accessLevel / access_level → dcat-us:accessLevel
* accessRights / access_rights → dct:accessRights
* scopeNote / scope_note → skos:scopeNote
* liabilityStatement / liability_statement → dcat-us:liabilityStatement
* inSeries / in_series → dcat:inSeries
* versionNotes / version_notes → dcat:versionNotes
* license / license_id → distribution.dct:license
* modified / last_modified → distribution.dct:modified
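
The alias expansion can be sketched roughly like this. It is a simplified model under stated assumptions: `field_mappings` is represented as a slice of (CKAN field, target pointer) pairs, and `expand_forced_aliases` is a hypothetical name, not the function in the PR.

```rust
use std::collections::HashSet;

/// Return the forced CKAN fields plus every alias that projects to the
/// same target pointer (hypothetical helper, simplified types).
fn expand_forced_aliases(
    forced_fields: &HashSet<String>,
    field_mappings: &[(&str, &str)],
) -> HashSet<String> {
    // First pass: collect the target pointers of the forced fields.
    let forced_targets: HashSet<&str> = field_mappings
        .iter()
        .filter(|(ckan, _)| forced_fields.contains(*ckan))
        .map(|(_, target)| *target)
        .collect();

    // Second pass: every CKAN field mapping to a forced target is locked too.
    let mut expanded = forced_fields.clone();
    for (ckan, target) in field_mappings {
        if forced_targets.contains(*target) {
            expanded.insert((*ckan).to_string());
        }
    }
    expanded
}
```

Forcing `author` in a table where `author` and `publisher` share `/dcat/dct:publisher` yields a forced set containing both names, while fields with other targets stay unlocked.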

Regression guards (tests/test_profile.rs):
* forced_author_locks_publisher_alias — forces package.author,
  formula targets `publisher`, asserts foaf:name is "Forced Author".
* forced_license_id_locks_license_alias — forces resource.license_id
  to cc-by, formula targets `license` with cc-by-sa, asserts the
  CC-BY 4.0 IRI (not CC-BY-SA) lands on Distribution.

Verification:
* cargo test cmd::profile:: → 127 unit tests pass.
* cargo test --test tests test_profile:: → 49 integration tests
  pass (47 prior + 2 new alias guards).
* cargo +nightly fmt applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@github-advanced-security github-advanced-security AI left a comment


devskim found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@codacy-production

codacy-production Bot commented May 27, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues


🟢 Metrics 129 complexity

Metric Results
Complexity 129


Contributor

Copilot AI left a comment


Pull request overview

Introduces a YAML-driven metadata projection engine for qsv profile, allowing users to select bundled or custom projection “profiles” via --profile, and adds bundled profiles for DCAT-US v3 (default), DCAT-AP v3, and Croissant. This replaces the prior hardcoded DCAT projection logic with a configurable spec-driven pipeline (projection → discovery merge → overrides → validation).

Changes:

  • Add YAML ProfileSpec parsing and a generic projection engine with profile-aware discovery merge behavior.
  • Bundle three projection profiles (dcat-us-v3, dcat-ap-v3, croissant) and vendor the GSA DCAT-US v3 JSON Schema bundle plus qsv overlay schemas.
  • Extend CLI/docs/tests/fixtures to cover profile selection, catalog wrapping, force semantics, and schema pinning.

Reviewed changes

Copilot reviewed 57 out of 59 changed files in this pull request and generated 9 comments.

Show a summary per file
| File | Description |
| --- | --- |
| src/cmd/profile.rs | Wires qsv profile through the YAML-driven projection engine; adds --profile and --catalog; updates force/validation flow. |
| src/cmd/profile/profile_spec.rs | Defines the YAML schema for projection profiles and loads embedded profiles. |
| src/cmd/profile/projection.rs | Implements template-based field projection and optional catalog wrapping. |
| src/cmd/profile/discovery_merge.rs | Implements configurable merge of discovered metadata into inferred output. |
| src/cmd/profile/dcat_validate.rs | Validates output using the vendored DCAT-US v3 JSON Schema bundle (and overlays) when enabled. |
| src/cmd/profile/context.rs | Extends context-building to accept a profile (used for CKAN→target pointer translation for force semantics). |
| src/cmd/profile/formula_helpers.rs | Adds helper functions/filters used by projection templates (e.g., hashes, file size, CSVW/Croissant builders). |
| resources/profiles/README.md | Documents bundled YAML profiles, authoring, and versioning contract. |
| resources/profiles/dcat-us-v3.yaml | Bundled DCAT-US v3 projection profile and mappings/vocabularies. |
| resources/profiles/dcat-ap-v3.yaml | Bundled DCAT-AP v3 projection profile. |
| resources/profiles/croissant.yaml | Bundled Croissant projection profile. |
| resources/dcat-us-v3/README.md | Documents vendored schema bundle and refresh/pin procedure. |
| resources/dcat-us-v3/MANIFEST.json | Records upstream pin and per-file SHA-256 hashes for schema bundle integrity. |
| resources/dcat-us-v3/qsv-overlay-dataset.json | Overlay schema adding/validating qsv’s emitted dcat-us:* extensions. |
| resources/dcat-us-v3/qsv-overlay-catalog.json | Overlay schema for Catalog entry-point. |
| resources/dcat-us-v3/definitions/AccessRestriction.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/Activity.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/Address.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/Agent.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/Attribution.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/Catalog.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/CatalogRecord.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/Checksum.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/Concept.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/ConceptScheme.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/CUIRestriction.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/DataService.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/Dataset.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/DatasetSeries.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/Distribution.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/Document.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/Identifier.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/Kind.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/Location.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/Metric.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/Organization.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/PeriodOfTime.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/QualityMeasurement.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/Relationship.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/Standard.json | Vendored upstream DCAT-US v3 schema definition. |
| resources/dcat-us-v3/definitions/UseRestriction.json | Vendored upstream DCAT-US v3 schema definition. |
| tests/resources/profile/dcat-init-context.README.md | Updates/extends initial-context and force semantics documentation. |
| tests/resources/profile/dcat-init-context.json | Updates initial context fixture used by tests. |
| tests/resources/profile/golden/nyc-311-subset.csv | Golden input fixture for parity/smoke tests. |
| tests/resources/profile/golden/nyc-311-subset.dataset.expected.json | Golden expected Dataset output. |
| tests/resources/profile/golden/nyc-311-subset.catalog.expected.json | Golden expected Catalog output. |
| tests/resources/profile/golden/usda-soil-subset.csv | Golden input fixture for parity/smoke tests. |
| tests/resources/profile/golden/usda-soil-subset.dataset.expected.json | Golden expected Dataset output. |
| tests/resources/profile/golden/usda-soil-subset.catalog.expected.json | Golden expected Catalog output. |
| tests/resources/profile/golden/wprdc-311-subset.csv | Golden input fixture for parity/smoke tests. |
| tests/resources/profile/golden/wprdc-311-subset.dataset.expected.json | Golden expected Dataset output. |
| tests/resources/profile/golden/wprdc-311-subset.catalog.expected.json | Golden expected Catalog output. |
| docs/help/profile.md | Regenerated CLI help documentation for qsv profile. |
| Cargo.toml | Adds optional sha2 dep and enables it under the profile feature. |
| Cargo.lock | Locks sha2 dependency. |
| .gitignore | Ensures golden fixtures are tracked while ignoring generated stats caches. |
| profile3-handoff.md | Adds architecture/handoff documentation for the new engine. |

Comment thread docs/help/profile.md Outdated
Comment thread tests/resources/profile/dcat-init-context.README.md Outdated
Comment thread src/cmd/profile/profile_spec.rs
Comment thread profile3-handoff.md Outdated
Comment thread resources/profiles/croissant.yaml Outdated
Comment thread src/cmd/profile.rs
Comment thread src/cmd/profile.rs
Comment thread src/cmd/profile.rs
Comment thread resources/dcat-us-v3/README.md Outdated
Comment thread resources/dcat-us-v3/MANIFEST.json Dismissed
Comment thread resources/dcat-us-v3/MANIFEST.json Dismissed
Comment thread resources/profiles/croissant.yaml Dismissed
Comment thread resources/profiles/dcat-ap-v3.yaml Dismissed
Comment thread resources/profiles/dcat-us-v3.yaml Dismissed
Comment thread tests/resources/profile/golden/usda-soil-subset.dataset.expected.json Dismissed
Comment thread tests/resources/profile/golden/wprdc-311-subset.catalog.expected.json Dismissed
Comment thread tests/resources/profile/golden/wprdc-311-subset.catalog.expected.json Dismissed
Comment thread tests/resources/profile/golden/wprdc-311-subset.dataset.expected.json Dismissed
Comment thread tests/resources/profile/golden/wprdc-311-subset.dataset.expected.json Dismissed
jqnatividad and others added 2 commits May 26, 2026 23:33
Apply all 9 unresolved inline review comments. Each was verified
against the current code before action.

1. docs/help/profile.md (truncated --initial-context help)
   Reformatted the USAGE block in src/cmd/profile.rs so the
   description survives markdown-table generation: flattened the
   nested bullet list into a single paragraph and added a pointer
   to dcat-init-context.README.md for the full example.

2. tests/resources/profile/dcat-init-context.README.md
   Updated the "How package / resource force flags route to DCAT"
   section to reference the active profile's `field_mappings:` table
   + `ProfileSpec::translate_ckan_ptr` instead of the deleted
   src/cmd/profile/ckan_to_dcat.rs module.

3. src/cmd/profile/profile_spec.rs (load-time validation claim)
   Moved `projection::dry_compile` inside `load()` so the doc claim
   on `EMBEDDED` is now accurate: every template parses through
   minijinja at profile-load time, surfacing typos before
   stats/frequency/formulas run. Dropped the redundant dry_compile
   call from profile.rs::run.

4. profile3-handoff.md (hardcoded absolute path)
   Removed the `/Users/joelnatividad/.claude/plans/...` reference
   to the original plan file; the handoff now describes the engine
   without pointing at a path that doesn't exist for other
   contributors.

5. resources/profiles/croissant.yaml (misplaced key)
   Removed the no-op `strippable_curie_prefixes: []` from the
   `discovery_merge:` block — that key lives under `validation:`
   per the schema; keeping it here was misleading.

6. src/cmd/profile.rs (dead `merge_discovered` + tests)
   Deleted the orphaned legacy `merge_discovered` function (the
   orchestrator now uses `discovery_merge::merge` exclusively) and
   the 9 in-file tests that exercised it. Coverage is preserved by
   the unit tests in src/cmd/profile/discovery_merge.rs and the
   new integration tests in tests/test_profile.rs (e.g.
   `catalog_mode_merges_discovered_into_inner_dataset_not_envelope`).
   Net −168 LOC.

7-8. src/cmd/profile.rs (stale `ckan_to_dcat` doc comments)
   Updated two doc comments (`apply_force_overrides` doc + the
   force-collection comment in `run()`) so future readers find
   `field_mappings:` + `ProfileSpec::translate_ckan_ptr` instead
   of being pointed at the deleted module.

9. resources/dcat-us-v3/README.md (wrong test path)
   The pin-guard test lives at tests/test_profile.rs::dcat_us_v3_bundle_pin_manifest_matches_files,
   not the non-existent tests/test_dcat_us_bundle_pin.rs. Updated
   both the prose reference and the `cargo test` invocation.

Verification:
* cargo build --bin qsv,qsvmcp,qsvlite,qsvdp — all 4 clean.
* cargo test cmd::profile:: → 117 unit tests pass (was 127; the
  10 deleted merge_discovered tests are obsolete).
* cargo test --test tests test_profile:: → 49 integration tests
  pass (unchanged).
* cargo +nightly fmt applied.
* docs/help/profile.md regenerated via --generate-help-md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two findings from the post-fix re-review of d78d34c.

Medium (src/cmd/profile/projection.rs:dry_compile):
  The previous load-time validation only checked emit_when guards on
  dataset fields, leaving distribution and catalog field guards
  unchecked. A typo in a distribution emit_when would pass load()
  but fail silently at projection time: render_truthy treats the
  render error as false, so the field is dropped. Fix: extend
  dry_compile to syntax-check emit_when in both the distribution and
  catalog field loops. New guards:
  * dry_compile_rejects_malformed_distribution_emit_when
  * dry_compile_rejects_malformed_catalog_emit_when
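
The shape of the extended check can be sketched as below. Everything here is a simplified assumption: `FieldSpec` and `ProfileSpec` are cut down to the fields the example needs, and `check_syntax` is a toy stand-in (balanced parentheses only) for the real minijinja parse, which this sketch does not call.

```rust
struct FieldSpec {
    name: &'static str,
    emit_when: Option<&'static str>,
}

struct ProfileSpec {
    dataset: Vec<FieldSpec>,
    distribution: Vec<FieldSpec>,
    catalog: Vec<FieldSpec>,
}

/// Toy syntax check standing in for a template-engine parse:
/// it only verifies that parentheses balance.
fn check_syntax(expr: &str) -> Result<(), String> {
    let mut depth = 0i32;
    for ch in expr.chars() {
        match ch {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth < 0 {
                    return Err("unexpected `)`".to_string());
                }
            }
            _ => {}
        }
    }
    if depth == 0 {
        Ok(())
    } else {
        Err("unclosed `(`".to_string())
    }
}

/// Check emit_when guards in all three field loops, not just dataset.
fn dry_compile(profile: &ProfileSpec) -> Result<(), String> {
    let sections = [
        ("dataset", &profile.dataset),
        ("distribution", &profile.distribution),
        ("catalog", &profile.catalog),
    ];
    for (section, fields) in sections.iter() {
        for field in fields.iter() {
            if let Some(expr) = field.emit_when {
                check_syntax(expr)
                    .map_err(|e| format!("{}.{}: emit_when: {}", section, field.name, e))?;
            }
        }
    }
    Ok(())
}
```

Failing at load time turns a silently dropped field into an error that names the section and field.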

Low (src/cmd/profile/discovery_merge.rs):
  The removed merge_discovered tests carried regression coverage for
  forced discovered keys containing `/` or `~` (full-IRI JSON-LD
  properties like http://purl.org/dc/terms/title). Restore that
  coverage on discovery_merge's internal escape_token path. New
  tests:
  * forced_full_iri_key_blocks_matching_discovered_key — forced path
    with each `/` escaped to `~1` must block the matching discovered
    IRI key.
  * forced_full_iri_key_does_not_block_unrelated_discovered_key —
    escaping must not over-match; unrelated discovered keys (e.g.
    dct:identifier) still flow through.
  * escape_token_handles_rfc6901_round_trip — direct check of the
    `~`-before-`/` escape order on plain, slash, tilde, mixed, and
    full-IRI inputs.
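
For reference, the RFC 6901 escape order these tests guard can be sketched in a few lines. The function bodies are an illustration of the standard rule and may differ from the internal `escape_token` in discovery_merge.rs; `unescape_token` is added here only to show the round trip.

```rust
/// Escape one JSON Pointer reference token per RFC 6901:
/// `~` becomes `~0` first, then `/` becomes `~1`.
fn escape_token(token: &str) -> String {
    token.replace('~', "~0").replace('/', "~1")
}

/// Inverse transformation: `~1` back to `/` first, then `~0` back to `~`.
fn unescape_token(token: &str) -> String {
    token.replace("~1", "/").replace("~0", "~")
}
```

The order matters: escaping `/` first would turn a literal `/` into `~1` and then into `~01`, which decodes to the two characters `~1` instead of `/`.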

Verification:
* cargo test cmd::profile:: → 122 unit tests pass (117 prior + 5 new).
* cargo test --test tests test_profile:: → 49 integration tests pass.
* cargo +nightly fmt applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/cmd/profile/discovery_merge.rs Dismissed
Comment thread src/cmd/profile/discovery_merge.rs Dismissed
Comment thread src/cmd/profile/discovery_merge.rs Dismissed
@jqnatividad jqnatividad merged commit 511dd6a into master May 27, 2026
19 checks passed
@jqnatividad jqnatividad deleted the dcat-us-v3-optimization2 branch May 27, 2026 04:22
jqnatividad added a commit that referenced this pull request May 27, 2026
…idator framework (#3911)

* feat(profile): out-of-process validator + Croissant mlcroissant wiring

The Croissant 1.0 spec ships no JSON Schema — its canonical
validator is the `mlcroissant` Python package. Until now,
`resources/profiles/croissant.yaml` had `validation.enabled: false`
because the engine had no way to run anything but the bundled
GSA JSON-Schema validator.

This commit adds a generic out-of-process validator path so any
profile can declare a `validation.external` block pointing at a
binary on PATH, wires Croissant up to mlcroissant by default, and
sets the same pattern up for a future DCAT-AP SHACL backend
(pyshacl) without further engine work.

Engine
- New `ExternalValidator` struct on `Validation` with `command`,
  `args`, `default_severity`, `label`, `install_hint`. Independent
  of `Validation::enabled` (JSON Schema gate) — both can coexist.
- New `src/cmd/profile/external_validate.rs`:
  `validate(profile, block) -> Vec<ProjectionWarning>`. Writes
  `block` to a `.json` tempfile, spawns `command` with `{file}`
  arg substitution (or appends path when no token), captures
  stderr/stdout. Missing binary → single `Severity::Info` warning
  including `install_hint` so users see the install path at the
  moment they need it. Spawn errors → `Severity::Recommended`.
  Non-zero exit → one warning per non-empty stderr line (or
  stdout when stderr is empty); defensive fallback when output
  is empty. Exit 0 → empty Vec.
- `src/cmd/profile.rs` orchestrator wires external validation
  next to `dcat_validate::validate` under `--validate-dcat`.
  Strict mode filters out `Info`-severity findings (those are
  the missing-binary notices, not real violations) — only actual
  findings can trip `--strict-dcat`.
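
The outcome-to-warning mapping and the strict-mode filter can be sketched as a pure function. This is a simplified model: the `Outcome` enum, the `Warning` struct, the message wording, and the assumption that findings carry `default_severity` are all illustrative and are not taken from external_validate.rs.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Severity {
    Info,
    Recommended,
}

#[derive(Debug, PartialEq)]
struct Warning {
    severity: Severity,
    message: String,
}

/// Hypothetical summary of what spawning the validator produced.
enum Outcome {
    MissingBinary,
    SpawnError(String),
    Exited { success: bool, stdout: String, stderr: String },
}

fn warnings_from(
    outcome: Outcome,
    install_hint: Option<&str>,
    default_severity: Severity,
) -> Vec<Warning> {
    match outcome {
        Outcome::MissingBinary => {
            let mut message = String::from("validator binary not found on PATH");
            if let Some(hint) = install_hint {
                message.push_str("; install with: ");
                message.push_str(hint);
            }
            vec![Warning { severity: Severity::Info, message }]
        }
        Outcome::SpawnError(err) => {
            vec![Warning { severity: Severity::Recommended, message: err }]
        }
        Outcome::Exited { success: true, .. } => Vec::new(),
        Outcome::Exited { stdout, stderr, .. } => {
            // Prefer stderr; fall back to stdout when stderr is empty.
            let source = if stderr.trim().is_empty() { stdout } else { stderr };
            let mut findings: Vec<Warning> = source
                .lines()
                .filter(|line| !line.trim().is_empty())
                .map(|line| Warning {
                    severity: default_severity,
                    message: line.trim().to_string(),
                })
                .collect();
            if findings.is_empty() {
                // Defensive fallback: non-zero exit with no output at all.
                findings.push(Warning {
                    severity: default_severity,
                    message: "validator exited non-zero with no output".to_string(),
                });
            }
            findings
        }
    }
}

/// Strict mode ignores Info-severity notices such as a missing binary.
fn trips_strict(warnings: &[Warning]) -> bool {
    warnings.iter().any(|w| w.severity != Severity::Info)
}
```

Keeping the mapping separate from process spawning makes each branch testable without a validator binary installed.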

Profile YAML
- `resources/profiles/croissant.yaml` opts in:
  `command: mlcroissant`,
  `args: ["validate", "--jsonld", "{file}"]`,
  `install_hint: "pip install mlcroissant
   (https://github.com/mlcommons/croissant/tree/main/python/mlcroissant)"`.

Docs
- `resources/profiles/README.md` gains a "Validation" section
  documenting both validator paths + a config table for the
  `external` block + the Croissant example.

Tests
- 13 unit tests in `external_validate::tests` covering: missing
  binary path (with + without install_hint), label override,
  successful exit, multi-line stderr findings, stdout fallback,
  exit-with-no-output diagnostic, `{file}` substitution,
  args-append-when-no-token, severity parsing, and config-absent
  no-op. Unix-only paths gate on `cfg!(unix)`.
- `embedded_croissant_parses_and_dry_compiles` extended to
  assert the external validator config is present + reachable.

End-to-end verified: on a system without `mlcroissant` installed,
`qsv profile <input> --profile croissant --validate-dcat` emits
exactly one `Severity::Info` warning that surfaces the install
command + URL.

Addresses the final §6 follow-up from PR #3908's handoff (Croissant
`mlcroissant` Python validator integration). The same
`external_validate` module is ready to host a `pyshacl` config for
DCAT-AP v3 in a future PR.

All 154 profile unit tests + 49 integration tests pass. Clippy +
docs-drift-check clean. All four binaries (qsv/qsvmcp/qsvlite/qsvdp)
build green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* address review: gate external validator + stable field + OsString path

Three Copilot findings on PR #3911:

1. SECURITY (`src/cmd/profile.rs:507`) — `--profile <path>` loads
   arbitrary YAML from disk, and that YAML controls
   `validation.external.command`. Under `--validate-dcat` the
   engine would spawn whatever the file said, enabling arbitrary
   command execution if a user runs an untrusted profile.

   Added a trust gate:
   - New `ProfileSource` enum (`Embedded` / `FilePath`) on
     `ProfileSpec`, stamped by `load()` after parsing.
   - New CLI flag `--allow-external-validator` (default off).
   - Orchestrator runs `validation.external` only when
     `profile.source.is_embedded() || args.flag_allow_external_validator`.
     Otherwise emits a `Severity::Recommended` warning naming the
     would-be validator and pointing at the opt-in flag.

   Bundled profiles (dcat-us-v3, dcat-ap-v3, croissant) keep
   running their declared validators frictionlessly because the
   YAMLs are vetted at qsv release time; only file-loaded
   profiles require the explicit opt-in.
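
   A minimal sketch of the gate decision follows. The `ProfileSource` variants and `is_embedded()` come from the description above; the `GateDecision` type, the function name, and the warning text are assumptions for illustration, and the real `FilePath` variant may carry data.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ProfileSource {
    Embedded,
    FilePath,
}

impl ProfileSource {
    fn is_embedded(self) -> bool {
        self == ProfileSource::Embedded
    }
}

#[derive(Debug, PartialEq)]
enum GateDecision {
    Run,
    Refuse { warning: String },
}

/// Decide whether a profile-declared external command may be spawned.
fn gate_external_validator(
    source: ProfileSource,
    allow_external_validator: bool,
    command: &str,
) -> GateDecision {
    if source.is_embedded() || allow_external_validator {
        GateDecision::Run
    } else {
        // File-loaded profile without opt-in: name the command, do not run it.
        GateDecision::Refuse {
            warning: format!(
                "external validator `{}` was not run; pass --allow-external-validator to opt in",
                command
            ),
        }
    }
}
```

   The default is refusal, so an untrusted YAML file cannot trigger a spawn unless the user passes the flag explicitly.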

2. FIELD STABILITY (`external_validate.rs:128`) —
   `ProjectionWarning.field` is used as a JSON-LD key / pointer
   elsewhere, but external_validate was building it as
   `external_validate/{label}` where `label` is user-configurable.
   That broke downstream filtering and looked like a JSON pointer.

   Findings now use a stable `field: "external_validate"` and the
   label moves into the message as `<label>: <line>`. Users still
   trace findings back to source via the message; downstream
   filters can target a single string.
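
   The stable-field convention can be sketched as below. The struct is reduced to two string fields and the `finding` helper is a hypothetical name; only the `external_validate` field value and the `<label>: <line>` message shape come from the description above.

```rust
/// Stable value for the warning's `field`, independent of the label.
const EXTERNAL_FIELD: &str = "external_validate";

#[derive(Debug, PartialEq)]
struct ProjectionWarning {
    field: String,
    message: String,
}

/// Build one finding: stable field, label carried in the message.
fn finding(label: &str, line: &str) -> ProjectionWarning {
    ProjectionWarning {
        field: EXTERNAL_FIELD.to_string(),
        message: format!("{}: {}", label, line),
    }
}

/// Downstream consumers can select external findings with one comparison.
fn external_findings(warnings: &[ProjectionWarning]) -> Vec<&ProjectionWarning> {
    warnings.iter().filter(|w| w.field == EXTERNAL_FIELD).collect()
}
```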

3. OS PATH FIDELITY (`external_validate.rs:71`) — the tempfile
   path was being run through `to_string_lossy()` before
   substitution. Non-UTF-8 paths (legal on Unix) would get
   mangled and the spawned validator would then complain that
   the file doesn't exist.

   Tempfile path is kept as `OsString` end-to-end. New
   `substitute_file_token(&str, &OsStr) -> OsString` helper
   stitches the OsStr path between UTF-8 string parts (args come
   from YAML so they're always UTF-8; only the path needs OsStr
   fidelity). `resolve_args` returns `Vec<OsString>` which
   `Command::args` accepts directly.
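
   The substitution can be sketched with the standard library alone. The signatures follow the description above, but the bodies are an illustration and may differ from the code in external_validate.rs; in particular, `resolve_args` taking a slice of `&str` is an assumption.

```rust
use std::ffi::{OsStr, OsString};

const FILE_TOKEN: &str = "{file}";

/// Replace every `{file}` in a UTF-8 argument with the path, without
/// ever converting the path to a (possibly lossy) string.
fn substitute_file_token(arg: &str, path: &OsStr) -> OsString {
    let mut out = OsString::new();
    let mut parts = arg.split(FILE_TOKEN);
    if let Some(first) = parts.next() {
        out.push(first);
    }
    for part in parts {
        out.push(path);
        out.push(part);
    }
    out
}

/// Substitute the token in each argument; when no argument contains the
/// token, append the path as a trailing argument.
fn resolve_args(args: &[&str], path: &OsStr) -> Vec<OsString> {
    let mut resolved: Vec<OsString> = args
        .iter()
        .map(|arg| substitute_file_token(arg, path))
        .collect();
    if !args.iter().any(|arg| arg.contains(FILE_TOKEN)) {
        resolved.push(path.to_os_string());
    }
    resolved
}
```

   `OsString::push` accepts both `&str` and `&OsStr`, which is what lets the UTF-8 argument pieces and the raw path bytes be stitched together.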

Tests
- `resolve_args_preserves_non_utf8_path_bytes` (Unix-gated, uses
  `OsStrExt::from_bytes` with bytes `\xFF` and `\xFE`) confirms a
  path with invalid UTF-8 sequences round-trips byte-for-byte
  through substitution.
- `non_zero_exit_surfaces_one_warning_per_stderr_line` and
  related tests updated to assert the stable field + new
  `<label>: <line>` message format.

End-to-end verified with a hand-crafted `/tmp/evil.yaml`:
- Embedded croissant → spawns; missing-binary Info warning.
- File-loaded evil.yaml without flag → spawn refused;
  Recommended warning surfaces the would-be command + opt-in
  instructions.
- File-loaded + `--allow-external-validator` → spawn proceeds;
  `--strict-dcat` honors findings as expected.

All 155 profile unit tests + 49 integration tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(profile): cover external validator trust gate end-to-end

Roborev #2509 flagged that the trust gate added in 2f02c1c had no
automated coverage — the security-sensitive orchestrator path wasn't
exercised by any test. Adds 4 integration tests in tests/test_profile.rs:

- external_validator_gated_for_file_loaded_profile_without_flag:
  loads a sh-based external validator from a file, runs profile with
  --validate-dcat only, asserts the gate warning is present AND no
  SPAWNED- marker appears anywhere in warnings (proving the command
  was NOT spawned).

- external_validator_runs_for_file_loaded_profile_with_flag:
  same fixture + --allow-external-validator. Asserts the gate warning
  is absent AND a finding with the stable field "external_validate"
  and the new "fake-validator: SPAWNED-fake-validator" message format
  is surfaced (proving the spawn happened).

- external_validator_strict_dcat_fails_on_file_loaded_findings:
  flag + --strict-dcat. Asserts the command fails with the
  "external validator finding(s)" summary in stderr.

- external_validator_embedded_profile_skips_gate_warning:
  uses the embedded croissant profile. Asserts the file-loaded gate
  warning does NOT appear, regardless of whether mlcroissant itself
  is installed on the CI runner (Info-severity missing-binary
  warning is acceptable; the gate warning is not).

Three of the four tests gate on `cfg(unix)` because they spawn `sh`;
the embedded-profile test is portable.

All 53 profile integration tests + 155 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jqnatividad added a commit that referenced this pull request May 27, 2026
…s.io tarball

Copilot review (PR #3914): the package `include` whitelist only covered
`src/**/*` plus two specific resource files, so `cargo install qsv`
(any features) shipped a .crate tarball with the profile YAMLs and
SHACL/Schema resources stripped out. `include_str!` at compile time
would then fail on the missing files.

Verified with `cargo package --list`: before this commit only README.md
files leaked through; now all four profile YAMLs (dcat-us-v3,
dcat-ap-v3, croissant, geoconnex), both vendored SHACL bundles, and
the full DCAT-US v3 JSON Schema tree under definitions/ are included.

This is a pre-existing bug that affected all four profiles equally
since #3908 (DCAT-US/DCAT-AP/Croissant landed without updating the
include list); fixed comprehensively rather than only patching the
geoconnex paths Copilot flagged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jqnatividad added a commit that referenced this pull request May 27, 2026
…3914)

* feat(profile): geoconnex projection profile + pyshacl SHACL validator (feature-gated)

Phase 1 dataset-level projection emitting Geoconnex hydrologic linked-data
JSON-LD (https://docs.geoconnex.us/) wired to pyshacl against the
vendored upstream SHACL shapes (~10KB Turtle bundle from internetofwater/nabu,
commit e5d6ad39, Apache-2.0). Covers DatasetShape / ProviderShape /
PublisherShape / DistributionShape; the row-per-feature
LocationOrientedShape (mandatory gsp:asWKT geometry synthesis from
lat/lon columns) is deferred to a follow-up that introduces a
`for_each_row` projection mode.

Gated behind a new `geoconnex` cargo feature so the embedded SHACL
bytes don't ship in qsvlite / qsvmcp / qsvdp-default builds. Included
in qsv via distrib_features; qsvdp can opt in with
`-F datapusher_plus,geoconnex`. Without the feature, --profile
geoconnex falls through to load()'s existing "unknown profile" error
path listing the bundled names.
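
The feature-gated lookup can be sketched as follows. This is a simplified model: the function names and error wording are hypothetical, and the sketch uses the `cfg!` macro for brevity, whereas gating the embedded SHACL bytes themselves would need `#[cfg(feature = "geoconnex")]` on the `include_str!` items.

```rust
/// Names of the bundled profiles available in this build.
fn bundled_profile_names() -> Vec<&'static str> {
    let mut names = vec!["dcat-us-v3", "dcat-ap-v3", "croissant"];
    // Only builds compiled with the `geoconnex` feature know this profile.
    if cfg!(feature = "geoconnex") {
        names.push("geoconnex");
    }
    names
}

/// Resolve a bundled profile by name, or report the names that exist.
fn resolve_bundled(name: &str) -> Result<&'static str, String> {
    bundled_profile_names()
        .into_iter()
        .find(|candidate| *candidate == name)
        .ok_or_else(|| {
            format!(
                "unknown profile `{}`; bundled profiles: {}",
                name,
                bundled_profile_names().join(", ")
            )
        })
}
```

Because the name is absent from the list in builds without the feature, the request takes the same error path as any other unknown name.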

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(profile): roborev #2531 — profile-aware catalog keys + publisher fallback

Medium (geoconnex.yaml:266): the catalog envelope wrapper hard-coded
`dct:title` and `dcat:dataset`, which broke the JSON-LD envelope for
profiles whose @context doesn't declare those prefixes (schema.org-
rooted Geoconnex). Add `title_key` (default `dct:title`) and
`dataset_key` (default `dcat:dataset`) to CatalogBlock so the wrapper
honours profile-specific keys; thread them through wrap_as_catalog
and legacy_catalog_title. Geoconnex now pins schema:name +
schema:dataset (the canonical schema.org DataCatalog → Dataset
relation). DCAT-US/DCAT-AP defaults preserved.
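
The key defaults can be sketched like this. The field names and default values come from the description above; the `Option<String>` representation and the `envelope_keys` helper are assumptions made to keep the example small.

```rust
/// Catalog settings from a profile; `None` means "use the DCAT default".
struct CatalogBlock {
    title_key: Option<String>,
    dataset_key: Option<String>,
}

impl CatalogBlock {
    fn title_key(&self) -> &str {
        self.title_key.as_deref().unwrap_or("dct:title")
    }

    fn dataset_key(&self) -> &str {
        self.dataset_key.as_deref().unwrap_or("dcat:dataset")
    }
}

/// Keys the envelope wrapper would emit for the title and dataset list.
fn envelope_keys(catalog: &CatalogBlock) -> (String, String) {
    (
        catalog.title_key().to_string(),
        catalog.dataset_key().to_string(),
    )
}
```

A profile that sets neither key keeps the DCAT names, so existing DCAT-US and DCAT-AP output is unchanged.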

Low (geoconnex.yaml:177): schema:publisher fallback used
`pkg.maintainer | default(pkg.publisher | default(""))` which doesn't
re-fall-through when `maintainer` is defined-but-empty (common CKAN
shape). Switched to `pkg.maintainer or pkg.publisher or ""` matching
the provider block's idiom in both template body and emit_when.
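
The difference between the two fallback styles can be modelled in Rust. This is an analogy, not template code: `None` stands for an undefined value, `Some("")` for defined-but-empty, and both function names are hypothetical.

```rust
/// `default(...)`-style fallback: only an undefined value is replaced.
fn default_fallback<'a>(maintainer: Option<&'a str>, publisher: Option<&'a str>) -> &'a str {
    maintainer.or(publisher).unwrap_or("")
}

/// `or`-style fallback: undefined and empty values are both skipped.
fn or_fallback<'a>(maintainer: Option<&'a str>, publisher: Option<&'a str>) -> &'a str {
    let candidates = [maintainer, publisher];
    let found = candidates
        .iter()
        .filter_map(|value| *value)
        .find(|value| !value.is_empty());
    found.unwrap_or("")
}
```

With `maintainer` set to an empty string and `publisher` set to "Agency", the first style stops at the empty maintainer while the second reaches the publisher.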

Added `geoconnex_catalog_uses_schema_org_keys_not_dcat` regression
test asserting the envelope carries no dct:/dcat: keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(profile): roborev #2535 — regression test for empty-maintainer publisher fallback

Low (geoconnex.yaml:179): the #2531 fix changed schema:publisher to use
`pkg.maintainer or pkg.publisher or ""` but shipped without coverage.
Add geoconnex_publisher_falls_through_empty_maintainer_to_publisher —
uses an initial-context with maintainer="" (defined but blank, a common
CKAN shape), publisher="Agency", maintainer_email set, and asserts
`/dcat/schema:publisher/schema:name` resolves to "Agency".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* build(profile): include profile YAMLs + SHACL/Schema bundles in crates.io tarball

Copilot review (PR #3914): the package `include` whitelist only covered
`src/**/*` plus two specific resource files, so `cargo install qsv`
(any features) shipped a .crate tarball with the profile YAMLs and
SHACL/Schema resources stripped out. `include_str!` at compile time
would then fail on the missing files.

Verified with `cargo package --list`: before this commit only README.md
files leaked through; now all four profile YAMLs (dcat-us-v3,
dcat-ap-v3, croissant, geoconnex), both vendored SHACL bundles, and
the full DCAT-US v3 JSON Schema tree under definitions/ are included.

This is a pre-existing bug that affected all four profiles equally
since #3908 (DCAT-US/DCAT-AP/Croissant landed without updating the
include list); fixed comprehensively rather than only patching the
geoconnex paths Copilot flagged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(features): add geoconnex to all_features + distrib_features enumerations

CI's docs-drift-check flagged docs/FEATURES.md lines 30 and 36 — both
the `all_features` shortcut and `distrib_features` description list
their member features verbatim, and #3914 added `geoconnex` to both
sets in Cargo.toml without updating the prose. Inserted `geoconnex`
in alphabetical position (between `geocode` and `luau`) in both
lines. Verified with `python3 scripts/docs-drift-check.py`: no drift
detected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

3 participants