feat(profile): complete DCAT-US v3 compliance, drop Python, add URL inputs #3901

Merged
jqnatividad merged 17 commits into master from dcat-us-3-opt on May 25, 2026

@jqnatividad
Collaborator

Summary

Completes the DCAT-US v3 work started in #3898. qsv profile is now
Python-free at runtime, accepts URL inputs, emits a fully v3-compliant
dcat:Dataset block, and ships opt-in JSON Schema validation.

13 commits, +4693/-1557 lines net (deleted ~1400 lines of Python).

What landed

Native Rust formula engine (Phase 0)

  • Port all 17 DP+ jinja2_helpers.py helpers to Rust on minijinja
    (formula_engine.rs, formula_helpers.rs)
  • New sql_backend.rs backs temporal_resolution /
    guess_accrual_periodicity with Polars SQL over the input CSV
    (no subprocess, no Python)
  • Delete src/cmd/profile/py/ and py_engine.rs; drop python from
    the profile feature, add polars
  • 17 helpers ported including spatial_extent_wkt,
    spatial_extent_feature_collection, format_date, format_number,
    format_bytes, calculate_bbox_area, get_column_stats, etc.
  • Helpers accept both positional and keyword args (DP+ docstring parity)

DCAT-US v3 shape migrations (Phases 1-2)

  • Refactor dcat::build into per-section helpers (add_core_identity,
    add_provenance, add_classification, add_coverage,
    add_governance, add_distributions)
  • dct:spatial and dct:temporal now arrays (per v3 cardinality);
    temporal yields one PeriodOfTime per inferred date column
  • dct:license moved from Dataset to Distribution (strict v3); new
    --dcat-legacy-license flag re-emits at Dataset for back-compat
  • dct:conformsTo always emitted as dct:Standard object pointing at
    the v3 spec URL
  • dct:language normalized to ISO 639-1 (en-US → en,
    "English" → en)
  • dct:modified rejects ISO 8601 interval syntax (R/P1Y, etc.) —
    those values belong on dct:accrualPeriodicity

URL input + DCAT-markup discovery (Phase 3)

  • qsv profile <URL> downloads the CSV to a tempfile via streamed
    reqwest (no buffering, no OOM on large files)
  • Tempfile suffix preserves compound CSV-family extensions (.csv.gz,
    .csv.zst, .tsv.bz2, …); query strings/fragments stripped
  • Original URL surfaces as dcat:downloadURL on the Distribution; its
    basename is the fallback for the Dataset and Distribution dct:title
    (preferred over the random tempfile suffix)
  • New dcat_discover.rs sniffs DCAT markup via the HTTP
    Link: rel=describedBy header. (Sibling-URL probing and the
    HTML JSON-LD <script> sniff are queued as follow-ups.)
  • CLI: --no-dcat-discovery, --dcat-discovery-timeout
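
The suffix handling described in this list can be sketched with the standard library alone. This is only an illustration under stated assumptions: `suffix_for_url` is a hypothetical stand-in for the PR's `tempfile_suffix_for_url` (which parses with the url crate), and the extension tables are a small illustrative subset.

```rust
/// Hypothetical sketch: derive a tempfile suffix from a URL while keeping
/// CSV-family compound extensions and ignoring query strings and fragments.
fn suffix_for_url(url: &str) -> Option<String> {
    // Drop the fragment first, then the query string.
    let no_fragment = url.split('#').next().unwrap_or(url);
    let no_query = no_fragment.split('?').next().unwrap_or(no_fragment);
    // Skip scheme and authority so a host-only URL yields no file name.
    let after_scheme = no_query.split_once("://").map_or(no_query, |(_, rest)| rest);
    let (_, path) = after_scheme.split_once('/')?;
    let name = path.rsplit('/').next().filter(|s| !s.is_empty())?;
    let lower = name.to_ascii_lowercase();

    // Illustrative subset of base and compression extensions (assumption).
    const BASES: [&str; 2] = [".csv", ".tsv"];
    const COMPRESSED: [&str; 4] = [".gz", ".zst", ".bz2", ".xz"];
    for base in BASES {
        for comp in COMPRESSED {
            let compound = format!("{base}{comp}");
            if lower.ends_with(&compound) {
                return Some(compound);
            }
        }
    }
    // Otherwise fall back to the single trailing extension, if any.
    let dot = lower.rfind('.')?;
    if dot == 0 {
        return None;
    }
    Some(lower[dot..].to_string())
}
```

With this sketch, `suffix_for_url("https://example.com/data.csv.gz?token=x")` keeps the full `.csv.gz` suffix, and an extension-less basename such as a CKAN dump UUID yields no suffix at all.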

--initial-context + merge precedence (Phase 4)

  • --initial-context replaces --package-meta/--resource-meta with
    a single unified JSON input: top-level package, resource,
    dataset_info keys
  • Per-property {value, force} wrapper detection at every JSON leaf
    (package, resource, AND dataset_info) — force flag is forward-compat
    for full override semantics
  • dataset_info map of RFC 6901 JSON-Pointer → Value overrides
    applies to the assembled output last; handles numeric path segments
    for descending into arrays (e.g.
    /dcat/dcat:distribution/0/dct:license)
  • Discovered-DCAT merge fills gaps in the inferred projection
    (inferred always wins on collision; per-distribution merging
    out of scope until a per-resource identity scheme exists)
  • Template + README at tests/resources/profile/dcat-init-context.*

v3 field coverage + warnings channel (Phase 5)

  • dcat::build signature now returns (Value, Vec<DcatWarning>);
    warnings surface in the output JSON under dcat_warnings (elided
    when empty) with {field, severity, message} entries
  • Dataset additions: dcat:contactPoint (mandatory, vcard:Individual
    with mailto-prefixed email), dcat-us:bureauCode / programCode,
    dct:accrualPeriodicity (slug → EU controlled-vocab IRI),
    dcat:temporalResolution, dct:accessRights, dct:rights,
    dcat:landingPage, dcat:describedBy, dcat-us:purpose,
    skos:scopeNote, dcat-us:liabilityStatement, dcat:inSeries
  • Distribution additions: dcat:accessURL, dct:rights, dct:modified,
    dcat-us:accessRestriction / useRestriction / cuiRestriction

JSON Schema validation (Phase 6)

  • New dcat_validate.rs with embedded minimal v3 schema enforcing
    the mandatory keys per the spec landing page (@type,
    dct:title/description/identifier/publisher, dcat:contactPoint,
    dct:conformsTo, dcat:distribution)
  • CLI: --validate-dcat (warn, append to dcat_warnings) and
    --strict-dcat (fail the command)
  • Validation runs AFTER dataset_info overrides — JSON-Pointer
    overrides that supply missing mandatory fields rescue
    --strict-dcat
  • Stale build-time warnings filter against the final dcat block
    before re-emission, so a dataset_info-supplied field doesn't
    leave a stale "missing X" warning behind
  • Full GSA jsonschema bundle vendoring (under resources/dcat-us-v3/,
    pinned commit SHA, with $ref resolution) queued as a follow-up

Live-tested

End-to-end against https://data.wprdc.org/datastore/dump/5202679a-…
(Pittsburgh 311 CSV, 938K rows, 280 MB). With a populated
--initial-context from the CKAN API, every mandatory + recommended v3
field is emitted and the JSON-Schema validator passes clean (zero
dcat_warnings).

Followups (documented in commit messages, not in this PR)

  • Adding profile to qsvdp (datapusher_plus binary) — the feature-cfg
    overlap with sortcheck.rs etc. needs an audit
  • Sibling-URL DCAT discovery (.metadata.json, .dcat.json,
    datapackage.json, /.well-known/data.json)
  • HTML <script type=application/ld+json> sniff
  • Vendoring the full GSA jsonschema bundle for --validate-dcat
  • Full per-property force: true override semantics (wrapper is
    accepted forward-compat; merge currently uses fill-gaps)

Two roborev cycles addressed in-PR

Test plan

  • cargo build --bin qsv -F profile — clean
  • cargo build --bin qsv -F all_features — clean
  • cargo build --bin qsvmcp -F qsvmcp — clean (profile included)
  • cargo build --bin qsvlite -F lite — clean (profile excluded)
  • cargo build --bin qsvdp -F datapusher_plus — clean (profile
    not added to qsvdp this PR; the inclusion needs a separate
    feature-cfg audit)
  • cargo test --bin qsv -F profile cmd::profile:: — 94 unit tests
    pass
  • cargo test --test tests -F profile profile_ — integration
    tests pass (12 of the 14 — the validation tests use different
    filter strings; run with -- test_profile:: for all 14)
  • cargo +nightly fmt — clean
  • cargo clippy --bin qsv -F profile — clean
  • Live URL test: qsv profile <CKAN-resource-CSV-URL> --initial-context <init.json> --validate-dcat — produces
    compliant v3 output with no warnings

🤖 Generated with Claude Code

jqnatividad and others added 13 commits May 25, 2026 09:41
…inja)

Replaces the PyO3-based formula engine with a native Rust implementation
built on minijinja. The Python files are not yet deleted (Phase 0d) and
the python feature gate is still in place (Phase 0e), but the runtime
code path no longer touches Python — qsv profile evaluates spec formulas
entirely in-process.

* formula_engine.rs preserves the evaluate_spec / FormulaResult API
  exactly so profile.rs::merge_formula_results was untouched beyond the
  module import.
* formula_helpers.rs ports all 17 DP+ helpers; the 2 SQL-backed ones
  (temporal_resolution, guess_accrual_periodicity) return errors today,
  matching the old Python-stub behavior; Phase 0c wires them to Polars.

Verified: 22 new unit tests pass, all 5 existing integration tests pass
(including the spatial_extent_wkt end-to-end formula round-trip).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…c-e)

Phase 0c: temporal_resolution and guess_accrual_periodicity now query
the input CSV via Polars SQL (new src/cmd/profile/sql_backend.rs). Each
helper call rebuilds a fresh LazyFrame + SQLContext, registers the CSV
as table `data`, runs a CAST-to-VARCHAR DISTINCT/ORDER BY query, then
parses the result strings as dates and computes interval thresholds
identical to DP+'s Python semantics. The backend is installed on the
current thread by evaluate_spec for the render-pass duration and
cleared after; thread_local storage avoids cross-test contamination.

Phase 0d: deleted src/cmd/profile/py/ and src/cmd/profile/py_engine.rs;
removed the jinja2_helpers.py entry from .github/workflows/devskim.yml.

Phase 0e (partial): profile feature is now [feature_capable, polars,
yaml_serde] — python is out, polars is in. profile is removed from the
python pull in distrib_features.

Adding profile to datapusher_plus is DEFERRED: it triggers feature-cfg
overlap in sortcheck.rs (and likely other files) since the profile
feature carries feature_capable. Needs a broader feature-cfg audit;
queued as a follow-up.

USAGE banner updated to drop the python3+jinja2 requirement language.
docs/help/profile.md regenerated. Two tangential clippy-fix nits in
describegpt.rs and moarstats.rs picked up by --fix.

Verified: 29 unit tests pass (including 6 new SQL-backed tests), all 5
integration tests pass, cargo +nightly fmt + cargo clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure refactor — bit-identical output. The build() function becomes a
thin orchestrator over seven dedicated helpers:

  add_context_and_type    — @context + @type
  add_core_identity       — dct:title/description/identifier/modified/issued
  add_provenance          — dct:license + dct:publisher
  add_classification      — dcat:keyword + dcat:theme
  add_coverage            — dct:spatial + dct:temporal
  add_governance          — dcat-us:accessLevel
  add_distributions       — dcat:distribution array

Each helper has an inline NOTE comment pointing at the upcoming Phase 2
shape change (spatial/temporal → array, license → Distribution,
conformsTo as Standard object, language ISO 639-1) so subsequent diffs
land in the obvious spot.

Adds a take_first_str(obj, &[primary, fallback]) helper to replace the
string_opt(...).or_else(...) chains scattered through the dataset
builder.

Verified: 4 dcat unit tests + 5 integration tests pass; spatial/temporal
JSON shapes and key-order are preserved exactly so downstream consumers
see no change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Breaking changes to the dcat output shape per the DCAT-US v1.1 → v3
migration guide (https://resources.data.gov/resources/dcat-us-3-migration/).

* 2a: dct:spatial is now an array of dct:Location objects (was a single
  object). Affects both the WKT-suggestion branch and the bbox fallback.
* 2b: dct:temporal is now an array of dct:PeriodOfTime, with one entry
  per inferred date column (previously only the first DATE_FIELDS entry
  was consumed).
* 2c: dct:license moved from Dataset to Distribution. Read order:
  resource.license_id → resource.license → package.license_id →
  package.license. New CLI flag --dcat-legacy-license re-emits at the
  Dataset level for transitional back-compat (default off).
* 2d: dct:conformsTo is always emitted as a dct:Standard object pointing
  at https://resources.data.gov/resources/dcat-us3/. dct:language, when
  provided, is normalized to ISO 639-1 (en-US → en, "English" → en);
  unrecognized values are dropped (Phase 5 will warn instead).
* 2e: dct:modified rejects ISO 8601 interval syntax (R/P1Y, P1Y,
  start/end ranges) — DCAT-US v3 requires a discrete date here.
  Frequency-of-update values belong on dct:accrualPeriodicity (Phase 5).
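
A minimal sketch of the 2d and 2e checks, assuming nothing about the actual code in dcat.rs: `normalize_language` and `looks_like_iso8601_interval` are hypothetical names, the language-name table is a tiny illustrative subset, and the two-letter check does not consult the real ISO 639-1 registry.

```rust
/// Hypothetical sketch: reduce a language tag or English name to a
/// two-letter code, dropping anything unrecognized.
fn normalize_language(raw: &str) -> Option<String> {
    let lower = raw.trim().to_ascii_lowercase();
    if lower.is_empty() {
        return None;
    }
    // Illustrative subset of language names (assumption).
    let by_name = match lower.as_str() {
        "english" => Some("en"),
        "spanish" => Some("es"),
        "french" => Some("fr"),
        _ => None,
    };
    if let Some(code) = by_name {
        return Some(code.to_string());
    }
    // Otherwise keep the primary subtag of a tag such as en-US or en_US.
    let primary = lower.split(|c: char| c == '-' || c == '_').next().unwrap_or("");
    if primary.len() == 2 && primary.chars().all(|c| c.is_ascii_lowercase()) {
        Some(primary.to_string())
    } else {
        None
    }
}

/// Hypothetical sketch: true for repeating intervals (R/P1Y), bare
/// durations (P1Y), and start/end ranges, none of which is a discrete date.
fn looks_like_iso8601_interval(value: &str) -> bool {
    let v = value.trim();
    v.starts_with('R') || v.starts_with('P') || v.contains('/')
}
```

A caller would emit dct:modified only when `looks_like_iso8601_interval` returns false. Note that this heuristic also rejects slash-separated dates such as 05/25/2026, which are not ISO 8601 dates either.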

The two existing integration-test assertions that touched dct:spatial
were updated to index into the array. Added 7 new dcat unit tests
covering each shape change.

Verified: 11 dcat unit tests pass (4 existing + 7 new), all 5
integration tests pass, format + clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… 3a)

When the input arg looks like a URL (http:// or https:// prefix,
case-insensitive), download it to a NamedTempFile via reqwest::blocking
and feed the local tempfile path to the rest of the pipeline (stats,
frequency, sqlp-backed helpers). The tempfile handle is held in a local
binding so its Drop fires only when run() returns.

The original URL is preserved and stamped onto resource.url (if the
resource doesn't already declare one) so the DCAT projection's
dcat:downloadURL slot gets populated automatically. The existing
absolute-IRI guard in build_distribution keeps non-IRI inputs from
polluting the slot.

DCAT-markup discovery (Link: rel=describedBy, sibling JSON-LD URLs,
HTML script-tag JSON-LD) is queued for Phase 3b. The
--no-dcat-discovery and --dcat-discovery-timeout flags also land with
3b.

Verified: new url_detection unit test passes; all 36 existing profile
unit tests + 5 integration tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New module src/cmd/profile/dcat_discover.rs implements best-effort
DCAT-US v3 sniff for URL inputs. Ships with the most authoritative
mechanism — HTTP Link: rel=describedBy header — and is structured so
the two remaining mechanisms (sibling URLs by convention, JSON-LD
<script> blocks in HTML landing pages) can land as follow-up commits
without API churn.

* RFC 8288 Link header parser handles multi-token rel values, multiple
  comma-separated links, and case-insensitive rel matching.
* Relative IRIs resolve against the original URL via the url crate.
* extract_dcat_dataset handles the three common shapes: bare object,
  @graph array, and shape-fallback for non-conforming publishers.
* All network / parse errors are non-fatal — discovery is enrichment;
  failures fall through silently.
* Discovered DCAT surfaces under output.dcat_discovered for now. Phase
  4 will merge it with the auto-inferred projection per the documented
  precedence chain (force:true > discovered > inferred > plain seed >
  formulas).
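
The Link-header handling in the first bullet can be approximated as follows. This is a standard-library sketch, not the parser in dcat_discover.rs: `split_links` and `described_by` are hypothetical names, parameters are split naively on `;`, and relative targets are returned unresolved (the PR resolves them with the url crate).

```rust
/// Split a Link header into individual links, ignoring commas that sit
/// inside <...> targets or quoted parameter values.
fn split_links(header: &str) -> Vec<String> {
    let mut links = Vec::new();
    let mut current = String::new();
    let (mut in_angle, mut in_quote) = (false, false);
    for ch in header.chars() {
        match ch {
            '<' if !in_quote => in_angle = true,
            '>' if !in_quote => in_angle = false,
            '"' if !in_angle => in_quote = !in_quote,
            ',' if !in_angle && !in_quote => {
                links.push(std::mem::take(&mut current));
                continue;
            }
            _ => {}
        }
        current.push(ch);
    }
    if !current.trim().is_empty() {
        links.push(current);
    }
    links
}

/// Return the target of the first link whose rel value contains the
/// describedBy token, compared case-insensitively.
fn described_by(header: &str) -> Option<String> {
    for link in split_links(header) {
        let Some(open) = link.find('<') else { continue };
        let Some(close) = link[open..].find('>') else { continue };
        let target = &link[open + 1..open + close];
        let params = &link[open + close + 1..];
        for param in params.split(';') {
            let Some((name, value)) = param.split_once('=') else { continue };
            if !name.trim().eq_ignore_ascii_case("rel") {
                continue;
            }
            // A rel value may carry several space-separated tokens.
            let value = value.trim().trim_matches('"');
            if value
                .split_ascii_whitespace()
                .any(|token| token.eq_ignore_ascii_case("describedby"))
            {
                return Some(target.to_string());
            }
        }
    }
    None
}
```

Returning `None` on any malformed input matches the non-fatal stance described above: a failed sniff simply leaves the inferred projection unenriched.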

New CLI flags wired through Args:
  --no-dcat-discovery
  --dcat-discovery-timeout <secs>  (default 5)

12 new unit tests cover Link-header parsing, IRI resolution, and the
dataset-extraction shapes. All 49 profile unit tests + 5 integration
tests pass; format + clippy clean. docs/help/profile.md regenerated.

One nightly-fmt drift in describegpt.rs picked up too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eta (Phase 4a)

Single unified JSON input replaces the two old seed-meta flags. Top-level
keys:

  * package      — CKAN-shaped seed for the dataset block
  * resource     — CKAN-shaped seed for the resource block
  * dataset_info — RFC 6901 JSON-Pointer overrides into the final output

dataset_info is the escape hatch: each entry sets a value at the named
pointer in the assembled output JSON, applied last so it wins
unconditionally over inference, discovered DCAT, the CKAN block, and
formula output. Missing parent objects are auto-created; non-pointer
keys and malformed paths are silently skipped.

The fixture tests/resources/profile/dcat-init-context.json documents
every field the projection currently reads or will read in Phase 5,
with a sibling README mapping each slot to its DCAT-US v3 target
property + a per-property `force` semantics primer for Phase 4b.

Per-property `{value, force}` wrapper detection and the merge with
discovered DCAT (the actual precedence chain — force:true > discovered
> inferred > plain seed > formulas) land in Phase 4b.

Verified: 4 new unit tests for apply_pointer_overrides + 1 new
integration test (profile_initial_context_seeds_package_and_overrides_via_dataset_info)
exercise the full --initial-context flow end-to-end including
JSON-Pointer overrides, ISO 639-1 language normalization (en-US → en),
ISO 8601 interval rejection on dct:modified, and license-on-Distribution.
All 49 unit + 6 integration tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* load_initial_context now normalizes per-property {value, force}
  wrappers. The wrapper is detected by exact two-key shape (value +
  boolean force), so structural fields like contact_point: {fn, hasEmail}
  pass through untouched. Nested wrappers (inside Maps or Arrays)
  unwrap recursively.
* merge_discovered overlays publisher-stated DCAT onto the inferred
  projection with "fill gaps" semantics: inferred values (including
  --initial-context seed values) always win on collision; discovered
  fills only the slots inferred left absent. @context, @type, and
  dcat:distribution are never overwritten (per-distribution merging
  needs an identity scheme first; out of scope).
* The raw discovered DCAT still surfaces under output.dcat_discovered
  for diffing / auditing.
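
The exact two-key wrapper detection from the first bullet might look like the sketch below. It assumes a small `Json` enum standing in for serde_json::Value so the example compiles with std alone; `normalize_value_force` borrows the name used later in this PR, but the body is an illustration and `obj` is a hypothetical convenience constructor.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a JSON value (assumption, not serde_json::Value).
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Bool(bool),
    Str(String),
    Arr(Vec<Json>),
    Obj(BTreeMap<String, Json>),
}

/// Hypothetical helper for building objects tersely.
fn obj(pairs: Vec<(&str, Json)>) -> Json {
    Json::Obj(pairs.into_iter().map(|(k, v)| (k.to_string(), v)).collect())
}

/// Unwrap {value, force} wrappers recursively. Only objects with exactly
/// the keys `value` and a boolean `force` count as wrappers, so structural
/// objects such as {fn, hasEmail} pass through untouched.
fn normalize_value_force(value: Json) -> Json {
    match value {
        Json::Obj(mut map) => {
            let is_wrapper = map.len() == 2
                && map.contains_key("value")
                && matches!(map.get("force"), Some(Json::Bool(_)));
            if is_wrapper {
                let inner = map.remove("value").expect("presence checked above");
                return normalize_value_force(inner);
            }
            Json::Obj(
                map.into_iter()
                    .map(|(k, v)| (k, normalize_value_force(v)))
                    .collect(),
            )
        }
        Json::Arr(items) => Json::Arr(items.into_iter().map(normalize_value_force).collect()),
        other => other,
    }
}
```

Because the force flag is discarded during unwrapping, plain values and force:true wrappers produce the same result in this sketch, in line with the no-op behavior described in the commit message.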

The force flag is currently accepted forward-compat but no-op under
fill-gaps — since seeded values pre-populate inferred, plain values
and force:true wrappers produce identical output today. Full
override-discovered semantics can layer onto merge_discovered without
an init-context schema break.

Verified: 7 new wrapper-normalization tests + 3 new merge tests; all
63 profile unit tests and 6 integration tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (Phase 5)

dcat::build signature changes from `-> Value` to `-> (Value, Vec<DcatWarning>)`.
profile.rs::run threads the warnings into the top-level output JSON as
`dcat_warnings` (elided when empty). Each entry is {field, severity,
message} where severity is Required (mandatory v3) or Recommended.

Dataset-level additions:
* dcat:contactPoint (vcard:Individual) — MANDATORY. Reads
  package.contact_point.{fn, hasEmail}; falls back to
  package.maintainer + package.maintainer_email. Missing → Required warning.
* dcat-us:bureauCode / programCode — accepts arrays or comma-separated
  strings. Missing → Recommended warning each.
* dct:accrualPeriodicity — slug → EU controlled-vocab IRI (annual,
  monthly, daily, etc., plus R/P* aliases). Also reads
  dpp_suggestions.accrual_periodicity for formula-derived values.
* dcat:temporalResolution — pass-through ISO 8601 duration; reads
  dpp_suggestions.temporal_resolution for formula-derived values.
* dct:accessRights, dct:rights, dcat:landingPage (IRI-validated),
  dcat:describedBy (IRI-validated), dcat-us:purpose, skos:scopeNote,
  dcat-us:liabilityStatement, dcat:inSeries (IRI-validated).
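
As a rough illustration of the slug → IRI mapping for dct:accrualPeriodicity, the sketch below maps a handful of slugs and their R/P* aliases onto the EU Publications Office frequency vocabulary. The function name and the exact set of accepted slugs are assumptions; the PR's table may differ.

```rust
/// Base IRI of the EU frequency authority table.
const EU_FREQUENCY_BASE: &str = "http://publications.europa.eu/resource/authority/frequency/";

/// Hypothetical sketch: map a slug or an ISO 8601 repeating-interval alias
/// to a controlled-vocabulary IRI; unknown inputs yield None.
fn accrual_periodicity_iri(raw: &str) -> Option<String> {
    let code = match raw.trim().to_ascii_lowercase().as_str() {
        "annual" | "r/p1y" => "ANNUAL",
        "quarterly" | "r/p3m" => "QUARTERLY",
        "monthly" | "r/p1m" => "MONTHLY",
        "weekly" | "r/p1w" => "WEEKLY",
        "daily" | "r/p1d" => "DAILY",
        _ => return None,
    };
    Some(format!("{}{}", EU_FREQUENCY_BASE, code))
}
```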

Distribution-level additions:
* dcat:accessURL (IRI-validated), dct:rights, dct:modified,
  dcat-us:accessRestriction / useRestriction / cuiRestriction
  (structured objects, passed through verbatim).

Tests: 9 new dcat unit tests (contact-point happy-path/fallback/missing,
US-codes array/csv/missing, accrual-periodicity slug mapping, extended
metadata pass-through, distribution v3 additions) + 2 new integration
tests (full-v3 population emits every mandatory+recommended slot with
no warnings; missing contactPoint surfaces a Required-severity warning).
All 20 dcat + 8 integration tests pass; format + clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New src/cmd/profile/dcat_validate.rs exposes validate_dataset(value)
backed by jsonschema = 0.46 (already a workspace dep). Returns one
DcatWarning per violation — Required severity when the schema reports a
`required` violation, else Recommended.

The schema is an embedded minimal v3 enforcement of the mandatory
keys per https://resources.data.gov/resources/dcat-us3/:
@type=dcat:Dataset, dct:title/description/identifier (non-empty),
dct:publisher (with foaf:name), dcat:contactPoint (vcard:Individual
with vcard:fn + mailto-prefixed vcard:hasEmail), dct:conformsTo
(dct:Standard with @id), and dcat:distribution (≥1). Recommended
fields are intentionally NOT enforced by schema — those are surfaced
by the in-projection helpers (add_contact_point, add_us_codes, etc.)
which give richer guidance per missing field.

Follow-up: vendor the full GSA dcat-us jsonschema bundle from
https://github.com/GSA/dcat-us/tree/main/jsonschema under
resources/dcat-us-v3/ pinned to an upstream commit SHA and switch
embedded_minimal_schema() to load it via $ref resolution. The
Validator::options().build pattern already mirrors src/cmd/validate.rs
for an easy swap.

CLI flags wired through Args:
  --validate-dcat   Append schema violations to dcat_warnings.
  --strict-dcat     Under --validate-dcat, fail the command instead.

5 unit tests + 3 integration tests cover: passing minimal dataset,
missing contactPoint surfacing Required severity, bad email format
(mailto check), wrong @type rejection, missing distribution array,
full-init-context validates clean, missing-cp triggers warning, and
--strict-dcat fails the command without writing the output file.

docs/help/profile.md regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. SqlBackend now honors --delimiter and --no-headers. Added
   with_delimiter / with_has_header builders and threaded ctx_args
   values through evaluate_spec. SQL-backed helpers
   (temporal_resolution, guess_accrual_periodicity) now see the same
   columns stats/frequency saw on TSV/SSV/no-header inputs.

2. URL downloads stream via std::io::copy instead of buffering the
   full response with response.bytes() — large remote CSVs no longer
   risk OOM.

3. New tempfile_suffix_for_url helper parses the URL via url::Url,
   uses Url::path() to strip query strings and fragments, and
   preserves CSV-family compound extensions (.csv.gz, .tsv.gz,
   .csv.zst, .csv.bz2, .csv.xz, …). Previous Path::extension() on the
   raw URL turned `data.csv.gz?token=x` into garbage and dropped the
   .csv from compound extensions.

4. Reordered profile.rs::run so --initial-context dataset_info
   JSON-Pointer overrides apply BEFORE --validate-dcat / --strict-dcat
   runs. A user supplying a missing mandatory field via
   /dcat/dcat:contactPoint now satisfies validation. Build-time
   warnings are stashed under __pending_dcat_warnings during the
   intermediate phase and unstashed after the schema pass.

5. set_by_pointer now descends into existing arrays via numeric path
   segments. Previously /dcat/dcat:distribution/0/dct:license
   replaced the distribution array with {"0": {...}}, corrupting the
   DCAT shape. Out-of-range indices and non-numeric tokens against an
   array silently skip rather than convert it to an object.

6. Added Kwargs support to truncate_with_ellipsis, format_number,
   format_date, format_range, format_coordinates, and
   spatial_extent_feature_collection. Existing DP+ formulas using
   `format_date(format='%B %d, %Y')`, `truncate_with_ellipsis(length=5,
   ellipsis='…')`, `spatial_extent_feature_collection(name=, bbox=,
   feature_type=)` etc. now work. spatial_extent_feature_collection
   uses a kwargs-only signature (minijinja can't cleanly bind typed
   positional + Kwargs in an all-kwargs call); the docstring documents
   the constraint.
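
The array-descent fix in item 5 can be sketched as below. This is an illustration, not the PR's set_by_pointer: it uses a small `Json` enum in place of serde_json::Value so it builds with std alone, and it returns a bool to signal whether the write happened, which is an assumption about the interface.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a JSON value (assumption, not serde_json::Value).
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Str(String),
    Arr(Vec<Json>),
    Obj(BTreeMap<String, Json>),
}

/// Write `value` at an RFC 6901 pointer. Numeric tokens index into existing
/// arrays; out-of-range or non-numeric tokens against an array skip the
/// write instead of converting the array into an object.
fn set_by_pointer(root: &mut Json, pointer: &str, value: Json) -> bool {
    let Some(rest) = pointer.strip_prefix('/') else {
        return false; // malformed pointer: skipped
    };
    // Unescape ~1 before ~0, as RFC 6901 requires.
    let tokens: Vec<String> = rest
        .split('/')
        .map(|t| t.replace("~1", "/").replace("~0", "~"))
        .collect();
    let last = tokens.len() - 1;
    let mut current = root;
    for (i, token) in tokens.iter().enumerate() {
        match current {
            Json::Arr(items) => {
                let Ok(idx) = token.parse::<usize>() else { return false };
                let Some(slot) = items.get_mut(idx) else { return false };
                if i == last {
                    *slot = value;
                    return true;
                }
                current = slot;
            }
            Json::Obj(map) => {
                if i == last {
                    map.insert(token.clone(), value);
                    return true;
                }
                // Missing parent objects are created on the way down.
                current = map
                    .entry(token.clone())
                    .or_insert_with(|| Json::Obj(BTreeMap::new()));
            }
            _ => return false, // cannot descend into a scalar
        }
    }
    false
}
```

With this shape, /dcat/dcat:distribution/0/dct:license lands inside the first distribution object and the distribution array keeps its length.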

Tests: 4 new sql_backend (TSV, no-header roundtrip), 5 new profile.rs
(URL suffix compound/query/array-pointer regressions), 1 kwargs
end-to-end, 1 dataset_info-rescues-strict integration. All 86 unit +
12 integration tests pass.

DcatWarning + Severity gained #[derive(Deserialize)] so the
intermediate stash/unstash round-trips cleanly.

Closes roborev job 2439.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the input is a URL, the title slots (dct:title on Dataset,
dct:title on Distribution) previously surfaced the random tempfile
suffix (qsv-profile-XkZGBK) since context::build seeds resource.name
from the local file path and dcat::build's title fallback uses
Path::file_stem on the same. Both are useless to downstream consumers.

New url_title_default(url) helper parses the URL with url::Url, strips
CSV-family compound extensions (.csv.gz, .tsv.gz, .csv.zst, …) and
single extensions, then returns the last non-empty path segment.
Opaque/UUID basenames (e.g. CKAN's /datastore/dump/<uuid>) pass through
unchanged — still better than the random tempfile suffix and traceable
back to the input. Host-only / malformed URLs return None and the
caller falls back to today's tempfile-stem default.
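
A standard-library sketch of the title derivation described above, under stated assumptions: the real url_title_default parses with url::Url, while this version splits the string by hand, and the compound-extension list is an illustrative subset.

```rust
/// Hypothetical sketch: derive a human-usable title from a URL's last
/// non-empty path segment, minus CSV-family extensions.
fn url_title_default(url: &str) -> Option<String> {
    let no_fragment = url.split('#').next().unwrap_or(url);
    let no_query = no_fragment.split('?').next().unwrap_or(no_fragment);
    // Malformed (no scheme) and host-only URLs yield None.
    let (_, after_scheme) = no_query.split_once("://")?;
    let (_, path) = after_scheme.split_once('/')?;
    let segment = path.split('/').filter(|s| !s.is_empty()).last()?;
    let lower = segment.to_ascii_lowercase();

    // Illustrative subset of compound extensions (assumption).
    const COMPOUND: [&str; 5] = [".csv.gz", ".tsv.gz", ".csv.zst", ".csv.bz2", ".csv.xz"];
    for ext in COMPOUND {
        if lower.ends_with(ext) && segment.len() > ext.len() {
            return Some(segment[..segment.len() - ext.len()].to_string());
        }
    }
    // Strip a single extension; opaque basenames pass through unchanged.
    match segment.rfind('.') {
        Some(dot) if dot > 0 => Some(segment[..dot].to_string()),
        _ => Some(segment.to_string()),
    }
}
```

A caller would fall back to the tempfile-stem default whenever this returns `None`.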

profile.rs::run seeds:
  * package.title  — via .entry().or_insert(), so a real seed wins.
  * resource.name  — replaced only when its current value matches the
    tempfile stem context::build would have inserted; user-supplied
    values via --initial-context or formulas survive untouched.

Verified live against the WPRDC Pittsburgh 311 endpoint
(https://data.wprdc.org/datastore/dump/5202679a-...): both
dct:title and the Distribution dct:title now read
"5202679a-d243-402e-b82a-63189995a942" (the UUID basename) instead
of "qsv-profile-PL3u1k" (the tempfile stem). qsv:sourcePath still
records the tempfile path for traceability.

Tests: 6 new url_title_default unit tests + the 5 existing URL-suffix
ones still pass. All 92 profile unit + 12 integration tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. profile.rs: stashed build-time dcat_warnings are now filtered
   against the final dcat block before being re-emitted. New helper
   final_dcat_has_field looks up each warning's field name (top-level
   key, or JSON-Pointer path for nested fields) in the post-override
   dcat snapshot and drops the warning if the slot is populated.
   Result: a `dataset_info` override that supplies a previously-
   missing mandatory field no longer leaves a "missing X" warning
   stale in the output.

2. context.rs::load_initial_context now applies normalize_value_force
   to dataset_info too — previously only package and resource were
   unwrapped. A documented override like
     "/dcat/dcat:contactPoint": {"value": {...}, "force": true}
   now unwraps to the inner Value before being written to the output
   via set_by_pointer, so the wrapper itself doesn't become the
   dcat:contactPoint value. This rescues --strict-dcat instead of
   tripping it.

3. spatial_extent_feature_collection regained positional-arg support
   alongside kwargs. minijinja's Function impl doesn't route
   `Rest<Value> + Kwargs` cleanly — Rest greedily consumes the kwargs
   container as its only positional, leaving Kwargs empty (confirmed
   via debug print: `args.len()=1, kwargs.args()=[]`). Workaround:
   accept Rest<Value> only and detect a trailing kwargs-shaped Value
   ourselves via the public Kwargs::try_from impl, splitting it off
   as kwargs and processing the leading slice as positionals. Both
   call styles documented in DP+'s docstring now work:
     spatial_extent_feature_collection("Name", bbox, "manual")
     spatial_extent_feature_collection(name="X", bbox=[...], feature_type="m")
     spatial_extent_feature_collection("Mix", bbox=[...], feature_type="m")

Tests: 2 new helper unit tests (positional + mixed-pos-and-kw call
styles) and 2 new integration tests (stale warnings cleared by
dataset_info override; wrapped {value, force} dataset_info override
rescues --strict-dcat). All 94 profile unit tests + 14 integration
tests pass; format + clippy clean.

Closes roborev job 2440.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/cmd/profile/dcat_discover.rs Dismissed
Comment thread src/cmd/profile/dcat_discover.rs Dismissed
Comment thread src/cmd/profile/dcat_discover.rs Fixed
Comment thread src/cmd/profile/dcat_discover.rs Fixed
Comment thread src/cmd/profile/dcat.rs Fixed
Comment thread src/cmd/profile/dcat.rs Fixed
Comment thread src/cmd/profile/dcat.rs Fixed
Comment thread src/cmd/profile.rs Fixed
Comment thread src/cmd/profile/dcat_discover.rs Fixed
Comment thread src/cmd/profile/dcat_discover.rs Fixed
@codacy-production

codacy-production Bot commented May 25, 2026

Up to standards ✅

🟢 Issues: 0 new issues

🟢 Metrics: complexity 489

jqnatividad and others added 3 commits May 25, 2026 17:24
CI docs-drift-check caught a docs gap from the Phase 0e Cargo.toml
change: `profile` no longer depends on `python`, and `python` was
dropped from the `distrib_features` enumeration in Cargo.toml, but the
two doc lines in docs/FEATURES.md still listed it.

`python` is still a defined feature (`python = ["pyo3"]` exists for
the standalone `py` command), so this is purely a docs-enumeration
sync — no Cargo.toml or behavior change.

Verified: scripts/docs-drift-check.py reports "no drift detected".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Roborev 2444#1: the prior commit dropped `python` from the explicit
distrib_features enumeration but left the surrounding prose claiming
distrib_features is "all features except self_update, ui and magika".
Since `python` is also excluded now, add it to the exception clause
so the description matches reality.

Verified: scripts/docs-drift-check.py reports "no drift detected".

Closes roborev job 2444.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR advances qsv profile to a Python-free runtime implementation while completing DCAT‑US v3 shape and UX work (URL inputs, discovery, overrides, and validation) so profile output can be produced and checked entirely within the Rust binary.

Changes:

  • Replace the PyO3/Jinja2 runtime formula evaluation path with a native minijinja engine plus a Polars‑SQL backend for SQL-requiring helpers.
  • Add URL input handling (streamed download to tempfile) and best-effort DCAT discovery via HTTP Link: rel=describedBy.
  • Add unified --initial-context (including dataset_info JSON-Pointer overrides) and optional DCAT JSON Schema validation with warnings / strict failure.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/test_profile.rs | Expands integration coverage for initial-context overrides, v3 fields, warnings, and schema validation/strict mode. |
| tests/resources/profile/dcat-init-context.README.md | Adds a user-facing template README for --initial-context inputs. |
| tests/resources/profile/dcat-init-context.json | Adds a full example --initial-context JSON fixture. |
| src/cmd/profile/sql_backend.rs | Introduces Polars SQL helper backend for distinct/sorted date extraction. |
| src/cmd/profile/py/qsv_ckan_stubs.py | Removes CKAN stub module used by the old Python execution path. |
| src/cmd/profile/py/profile_engine.py | Removes Python formula evaluation engine. |
| src/cmd/profile/py/jinja2_helpers.py | Removes vendored DP+ helper module previously used at runtime. |
| src/cmd/profile/py_engine.rs | Removes the PyO3 bridge and tempdir staging logic. |
| src/cmd/profile/formula_helpers.rs | Adds Rust ports of DP+ helpers/filters/globals, including SQL-backed helpers. |
| src/cmd/profile/formula_engine.rs | Adds native minijinja formula evaluation engine preserving prior result shape. |
| src/cmd/profile/dcat_validate.rs | Adds embedded minimal DCAT-US v3 JSON Schema validation and warning mapping. |
| src/cmd/profile/dcat_discover.rs | Adds DCAT discovery via HTTP Link header and JSON-LD extraction. |
| src/cmd/profile/context.rs | Replaces package/resource seed flags with unified --initial-context and wrapper normalization. |
| src/cmd/profile.rs | Wires URL inputs, discovery, override precedence, warnings channel, and validation/strict mode into command flow. |
| src/cmd/moarstats.rs | Minor refactor to simplify header cloning. |
| src/cmd/describegpt.rs | Minor parsing refactor for --null-text extraction. |
| docs/help/profile.md | Regenerates help docs to reflect new flags and Python-free runtime behavior. |
| docs/FEATURES.md | Updates feature documentation to reflect profile no longer depending on python. |
| Cargo.toml | Updates profile feature to depend on polars (and drop python), plus related feature list changes. |
| .github/workflows/devskim.yml | Removes an ignore-glob for the now-deleted vendored Python helpers file. |
| _typos.toml | Extends allowed-words list for new terminology used in docs/code. |

Six Copilot review findings:

* formula_helpers.rs module docs: drop the stale "SQL globals
  currently short-circuit to an error" claim — Phase 0c shipped the
  Polars SQL backend and the helpers are wired through it.
* mode() determinism: previously HashMap-based, so ties were broken
  non-deterministically and could change guess_accrual_periodicity
  output across runs. Now tracks first-seen index alongside the count
  and breaks ties via Reverse(first_seen) so the earliest-encountered
  value wins (matches Python Counter.most_common(1)).
* parse_date_strings now accepts ISO 8601 datetimes with fractional
  seconds — %Y-%m-%dT%H:%M:%S%.f and %Y-%m-%d %H:%M:%S%.f — before
  falling through to second-precision / date-only / RFC 3339. DP+'s
  datetime.fromisoformat accepts these and temporal_resolution /
  guess_accrual_periodicity were hard-failing without them.
* resolve_input now uses util::create_reqwest_blocking_client (same
  helper validate / describegpt / fetch use) for consistent
  user-agent, gzip/brotli/zstd compression, rustls, and 503 retry.
* dcat_discover::discover swaps to util::create_reqwest_blocking_client
  for the same reasons. The describedBy fetch in
  discover_via_link_header is now capped at 4 MiB via
  std::io::Read::take, so a publisher-controlled Link target can't
  blow up qsv's memory if it points at a multi-GB resource.
* dcat-init-context.README.md status column: every "⏳ Phase 5" entry
  flipped to "✅ today" since the Phase 5 fields all landed; the
  surrounding prose updated to match.
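
The mode() tie-break described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in formula_helpers.rs: the name `mode_first_seen` and the `&[&str]` input type are hypothetical. It tracks a count and a first-seen index per value, then picks the maximum of `(count, Reverse(first_seen))`.

```rust
use std::cmp::Reverse;
use std::collections::HashMap;

/// Returns the most frequent value; ties go to the value seen first.
/// (Hypothetical name and signature, for illustration only.)
fn mode_first_seen<'a>(values: &[&'a str]) -> Option<&'a str> {
    // value -> (count, index of first occurrence)
    let mut tally: HashMap<&'a str, (usize, usize)> = HashMap::new();
    for (idx, &v) in values.iter().enumerate() {
        let entry = tally.entry(v).or_insert((0, idx));
        entry.0 += 1;
    }
    tally
        .into_iter()
        // Highest count wins; on equal counts the smaller first-seen index wins.
        // Keys are unique because no two values share a first-seen index, so
        // HashMap iteration order cannot influence the result.
        .max_by_key(|&(_, (count, first_seen))| (count, Reverse(first_seen)))
        .map(|(value, _)| value)
}
```

With the input `["b", "a", "a", "b"]` both values occur twice, and the sketch returns `"b"` on every run because it was encountered first.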

GitHub Advanced Security devskim noise (17 findings):

* The http:// IRIs in accrual_periodicity_iri (EU
  publications.europa.eu frequency vocab) are stable opaque
  identifiers published with the http scheme by spec, same as the
  Creative Commons / Open Data Commons IRIs in license_iri. Added
  `// DevSkim: ignore DS137138` to each.
* Same situation for the W3C DCAT canonical type IRI
  `http://www.w3.org/ns/dcat#Dataset` in dcat_discover::is_dcat_dataset
  and its two extract_dataset_* test fixtures.
* The two bare http:// strings in url_detection_recognizes_http_https_case_insensitive
  are detector inputs — exactly what is_http_url is testing — so the
  test gets DS137138 suppressions too.
* The two "TODO Phase 3b'/3b''" markers in dcat_discover.rs's
  module docstring tripped devskim's "suspicious comment" rule
  (DS176209); reworded to "(Phase 3b' follow-up)" /
  "(Phase 3b'' follow-up)" — same intent, no TODO/FIXME keyword.

Tests: 2 new mode/parse-date unit tests + all 96 unit + 14 integration
tests pass; format + clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit 866dd9d into master May 25, 2026
18 of 19 checks passed
@jqnatividad jqnatividad deleted the dcat-us-3-opt branch May 25, 2026 21:57
jqnatividad added a commit that referenced this pull request May 27, 2026
…, Croissant (#3908)

* feat(profile): comprehensive DCAT-US v3 support (Catalog, GSA bundle, force semantics)

Closes the five gaps that kept `qsv profile` from being an agency-grade
DCAT-US v3 reference tool:

- Vendor the full GSA JSON Schema bundle (26 definitions + 2 qsv
  overlays + MANIFEST.json + refresh README) under resources/dcat-us-v3/,
  pinned to upstream commit cf8789002. `--validate-dcat` now runs against
  the full bundle via `referencing::Registry`, dispatching the Dataset
  or Catalog overlay by the emitted `@type`. A `curie::strip_curies`
  pre-pass bridges qsv's JSON-LD-compact output to GSA's unprefixed
  schema keys without touching the emitted JSON on disk.

- Add `--catalog` flag that wraps the Dataset inside a `dcat:Catalog`
  envelope (`Catalog{dataset:[...]}`) for federation harvesters.

- Emit nine new optional v3 fields with natural data sources:
  Dataset-level `dct:created`, `dcat:version`, `dcat:versionNotes`;
  Distribution-level `dcat:checksum` (SHA-256 via sha2), `dcat:compressFormat`,
  `dcat:packageFormat`, `dcat:spatialResolutionInMeters`, `dct:language`,
  `dct:conformsTo`. Widen `dct:conformsTo` to array per v3 cardinality;
  emit `dct:license` as string and `dcat:byteSize` as string to match
  the GSA schemas' declared shapes.

- Implement full `force: true` override semantics across all three
  --initial-context subtrees. `context::collect_forced_paths` now walks
  package/resource entries through a 47-entry `ckan_to_dcat` mapping
  table; `apply_force_overrides` in `run()` applies forced leaves
  LAST so they beat both inferred and discovered metadata.

Pipeline precedence (low → high): inferred → discovered → dataset_info
pointers → forced leaves → schema validation.

Bumps profile feature: adds `sha2 = "0.10"` as a direct dep. Test
counts: 143 unit (was 96, +47) and 29 integration (was 18, +11) all
passing, plus a new bundle pin guard test that re-hashes every
vendored schema against MANIFEST.json on each run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(profile): scaffold YAML-driven projection engine (Stage 1)

Lay the foundation for the YAML-driven multi-profile projection engine
described in plan §1-§2. New modules are wired into profile.rs but the
orchestrator still calls the legacy dcat.rs path — zero behavior change
shipped here. Subsequent stages (§3-§8) populate the profile YAMLs,
swap the orchestrator, and delete the legacy hardcoded modules.

New modules:
* src/cmd/profile/profile_spec.rs — ProfileSpec serde types, embedded-
  first load() with case-insensitive name resolution, file-path fallback,
  6 unit tests.
* src/cmd/profile/projection.rs — generic project() engine with
  ProjectionMode { Dataset, Catalog }, ProjectionWarning { Severity },
  wrap_as_catalog, for_each_column RecordSet expansion, profile-aware
  lookup/field_mapping closures, dry_compile validator, 9 unit tests.
* src/cmd/profile/discovery_merge.rs — merge() with fill-if-absent,
  overlay-array, never strategies; never_overwrite + forced_paths
  protection; 5 unit tests.
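
As a rough illustration of the fill-if-absent strategy with never_overwrite and forced_paths protection, the sketch below merges flat string maps. The function name, the `BTreeMap<String, String>` stand-in for JSON objects, and the signature are assumptions made for this example; only the strategy name, the protected-key idea, and the "/dcat/<key>" forced-path form come from the commit notes.

```rust
use std::collections::{BTreeMap, HashSet};

/// Copies a discovered key into `inferred` only when the key is absent,
/// is not listed in `never_overwrite`, and is not protected by a forced path.
/// (Hypothetical signature; real values are JSON, not plain strings.)
fn merge_fill_if_absent(
    inferred: &mut BTreeMap<String, String>,
    discovered: &BTreeMap<String, String>,
    never_overwrite: &[&str],
    forced_paths: &HashSet<String>,
) {
    for (key, value) in discovered {
        if never_overwrite.contains(&key.as_str()) {
            continue;
        }
        // Forced paths use the "/dcat/<key>" pointer form.
        if forced_paths.contains(&format!("/dcat/{}", key)) {
            continue;
        }
        // fill-if-absent: an existing inferred value is never replaced.
        inferred.entry(key.clone()).or_insert_with(|| value.clone());
    }
}
```

In this sketch a forced path blocks the merge even when the inferred block has no value for that key yet, on the assumption that the forced value is applied later in the pipeline.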

Helper additions (formula_helpers.rs):
* Filters: only_if_absolute_iri, basename, file_stem,
  sanitize_iso_8601_interval, format_mailto.
* Globals: sha256_of (streaming), blake3_of (mmap+rayon), file_size_of,
  compress_format, package_format, build_csvw_schema.

Helpers needing profile state (lookup, field_mapping) live in
projection.rs::register_profile_helpers as closures over the
ProfileSpec; they unwrap_or(UNDEFINED) so | default chains work.

USAGE additions:
* --profile <name|path>: embedded names (dcat-us-v3, dcat-ap-v3,
  croissant) resolved first; falls back to file path. Not yet
  consumed in run() — wired up in Stage 4.
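
The embedded-first resolution order for --profile might look roughly like the sketch below. `ProfileSource` and `resolve_profile` are hypothetical names introduced for illustration; the three embedded names and the case-insensitive match with file-path fallback are taken from the notes above.

```rust
use std::path::PathBuf;

/// Embedded profile names listed in the USAGE text.
const EMBEDDED_PROFILES: [&str; 3] = ["dcat-us-v3", "dcat-ap-v3", "croissant"];

/// Hypothetical result type: either an embedded profile or a file on disk.
#[derive(Debug, PartialEq)]
enum ProfileSource {
    Embedded(&'static str),
    File(PathBuf),
}

/// Embedded names are tried first (case-insensitively);
/// anything else is treated as a file path.
fn resolve_profile(arg: &str) -> ProfileSource {
    EMBEDDED_PROFILES
        .iter()
        .copied()
        .find(|name| name.eq_ignore_ascii_case(arg))
        .map(ProfileSource::Embedded)
        .unwrap_or_else(|| ProfileSource::File(PathBuf::from(arg)))
}
```

Under this sketch, `resolve_profile("DCAT-US-V3")` resolves to the embedded `dcat-us-v3` profile, while `resolve_profile("my/custom.yaml")` falls through to the file-path branch.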

Placeholder YAMLs under resources/profiles/ exist so include_str! resolves
during Stage 1 builds; they will be replaced with real content in
Stages 3 (DCAT-US v3), 6 (DCAT-AP v3), 7 (Croissant).

Verification:
* cargo build --bin qsv -F profile,feature_capable — clean (23 expected
  dead-code warnings for the unused scaffold).
* cargo test cmd::profile:: — 163 passed (+20 new tests).
* cargo test --test tests test_profile:: — 29 passed (no regression).
* cargo +nightly fmt — applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(profile): capture goldens from legacy engine before YAML swap (Stage 2)

Lock the byte-equivalent output of the current hardcoded dcat.rs engine
against three regression fixtures so Stage 3's YAML-driven projection
can be asserted to produce identical Dataset + Catalog blocks.

Goldens captured by running today's qsv profile against each fixture
with the canonical --initial-context template, then normalizing via jq
to strip the only path-dependent field (qsv:sourcePath inside
dcat:distribution). Everything else in the .dcat block — including
dcat:byteSize, dcat:checksum, dct:modified, csvw:tableSchema — is
deterministic for fixed input and is captured verbatim.

Fixtures (under tests/resources/profile/golden/):
* nyc-311-subset.csv (10 rows) — geocoded urban service requests:
  lat/lon present, mixed Open/Closed status, multi-agency.
* usda-soil-subset.csv (10 rows) — scientific numeric data: pH,
  organic_carbon_pct, nitrogen_pct, clay/sand/silt percentages.
* wprdc-311-subset.csv (10 rows) — Pittsburgh 311 records:
  capitalized headers, X/Y geo, council districts + wards.

Goldens per fixture:
* <fixture>.dataset.expected.json — the .dcat block from Dataset mode.
* <fixture>.catalog.expected.json — the .dcat block from --catalog mode.

.gitignore whitelists tests/resources/profile/golden/*.{csv,expected.json}
so the *.json + *.csv blanket-ignores don't strip them.

These goldens will drive Stage 3's dcat_us_v3_golden_parity_dataset
and dcat_us_v3_golden_parity_catalog tests; CI hard-fails on drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(profile): ship dcat-us-v3.yaml profile (Stage 3, partial)

Author resources/profiles/dcat-us-v3.yaml — the full DCAT-US v3
projection definition that will replace the hardcoded dcat.rs engine
in Stage 4. The YAML mirrors the legacy add_* functions field-for-field
in declaration order so serde_json::Map insertion preserves wire-shape
parity (verified against the Stage-2 goldens at swap time).

Profile content:
* 4 vocabularies (license_iri, accrual_periodicity, iso_639_1,
  csvw_datatype) — each migrated verbatim from the legacy Rust
  constants. The EU vocab IRIs retain http:// scheme per their
  canonical published identifiers; DevSkim DS137138 suppressed per
  line.
* 53 field_mappings — same CKAN→DCAT pointer table the legacy
  ckan_to_dcat::CKAN_TO_DCAT held, in identical declaration order so
  alias-resolution precedence is preserved.
* dataset.fields[] — 23 entries covering core identity, provenance,
  contact point (required), classification, coverage, US codes
  (recommended), governance, and extended metadata. emit_when guards
  match the legacy `if let Some(...)` shapes.
* distribution.fields[] — 22 entries covering title, description,
  download URL, format/license/restrictions, language/conformance,
  file-derived facts (byteSize, checksum, compress/package format),
  spatial resolution, and csvw:tableSchema.
* catalog block reproduces wrap_as_catalog's envelope (Catalog of
  <title>, dct:conformsTo, dct:publisher inheritance).
* discovery_merge: enabled, never_overwrite=[@context,@type,
  dcat:distribution], fill-if-absent strategy.
* validation: enabled against the vendored GSA bundle under
  resources/dcat-us-v3/ with the same 11 strippable CURIE prefixes.

dry_compile verification:
A new unit test (embedded_dcat_us_v3_parses_and_dry_compiles)
parses the embedded YAML and runs projection::dry_compile() against
it — exercising every template's minijinja compile path. All
templates compile clean.

The actual byte-equivalent parity test (running each Stage-2 fixture
through projection::project() and asserting against goldens) lands
in Stage 4 alongside the orchestrator swap — at that point the
engine actually consumes the YAML.

Sources cross-checked for content:
  https://github.com/GSA/dcat-us/
  https://resources.data.gov/resources/dcat-us3/
  the vendored GSA bundle under resources/dcat-us-v3/

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(profile): handoff #3 — YAML projection engine, Stages 1-3 landed

Captures the current state after the YAML-driven projection migration's
first three commits. Documents what's wired (scaffold + helpers + flag +
goldens + DCAT-US v3 YAML), what's still on the legacy path (dcat.rs
drives output), and a 9-sub-step Stage 4 plan for the orchestrator swap.

Supersedes profile2-handoff.md for post-PR-#3901 work. Key gotchas
distilled into §5: lookup helpers must return Value::UNDEFINED,
goldens only normalize qsv:sourcePath, field-mapping count is 53 not 47.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(profile): wire YAML projection engine into orchestrator (Stage 4a)

profile.rs::run now routes through projection::project() with the
loaded ProfileSpec (default: dcat-us-v3). The YAML engine produces
byte-equivalent output to the legacy dcat.rs path on all 6 golden
fixtures (3 inputs × dataset/catalog modes), verified by new parity
integration tests.

Orchestrator changes:
* Load profile via profile_spec::load(args.flag_profile | "dcat-us-v3")
  at the top of run(), then projection::dry_compile() to fail fast on
  malformed embedded YAML.
* ContextArgs gains a `profile: &ProfileSpec` field; context::build
  threads it to load_initial_context → collect_forced_paths so the
  CKAN→target pointer translation uses profile.field_mappings instead
  of importing ckan_to_dcat.
* Replace dcat::build() call with projection::project(&profile,
  &projection_ctx, mode) — the projection_ctx carries pkg, res, stats,
  dpp, source_label, local_path matching the YAML's template names.
* Replace merge_discovered() with discovery_merge::merge(&profile,
  inferred, discovered, forced_dcat_paths) — same /dcat/<key> forced-
  path semantics, now driven by profile.discovery_merge.
* Catalog wrap baked into project() via ProjectionMode::Catalog
  (chosen upfront based on flag_catalog); orchestrator no longer
  calls catalog::wrap_as_catalog at the warning-filter step.
* Stash key renamed __pending_dcat_warnings →
  __pending_projection_warnings.
* DcatWarning → ProjectionWarning conversion bridges dcat_validate
  and run_profile_validation outputs (Stage 5 will refactor those
  modules to return ProjectionWarning directly).

Engine improvements:
* projection::project sets UndefinedBehavior::Chainable so
  `pkg.dpp_suggestions.spatial_extent.value | default("")` walks
  missing intermediates gracefully (matches legacy dcat.rs semantics
  where absent keys silently fall through).
* New file-aware helpers in formula_helpers.rs:
  - bbox_from_dpps(dpp, stats) — lat/lon column → POLYGON-WKT
    `dct:Location` array, mirroring legacy dcat::bbox_from_dpps.
  - temporal_from_dpps(dpp, stats) — date columns → array of
    `dct:PeriodOfTime`, one per inferred date column.
  - build_csvw_schema(stats) — column-name → stats-blob map walked,
    emitting `{columns: [...]}` with name, titles, datatype,
    qsv:cardinality / nullcount / min / max.
  - csvw_datatype_legacy helper mirrors the legacy mapping
    (Float → double, Integer → integer, Date → date, etc.).
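
The datatype mapping named in the last bullet can be pictured as a plain match. Only the three pairs spelled out above (Float, Integer, Date) come from the commit notes; the function name and the `"string"` fallback are assumptions of this sketch, since the notes elide the remaining entries.

```rust
/// Maps a qsv stats type name to a CSVW datatype name.
/// (Hypothetical name; the real helper is csvw_datatype_legacy.)
fn csvw_datatype_sketch(stats_type: &str) -> &'static str {
    match stats_type {
        "Float" => "double",
        "Integer" => "integer",
        "Date" => "date",
        // ASSUMPTION: the notes elide the rest of the table ("etc.").
        // Falling back to "string" is this sketch's choice, not the
        // confirmed legacy behavior.
        _ => "string",
    }
}
```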

dcat-us-v3.yaml updates:
* dct:spatial / dct:temporal fields call bbox_from_dpps /
  temporal_from_dpps as fallbacks behind the formula-derived WKT
  suggestion.
* dct:license emits a plain string (legacy license_value shape) via
  `{{ lookup("license_iri", raw) | default(raw) }}`, not the previous
  `{"@id": ...}` object form (GSA Distribution.json declares license
  as anyOf:[null,string]).
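
The `lookup(...) | default(raw)` template pattern has a direct Rust analogue in `Option::unwrap_or`. The sketch below is an illustration only: both function names are hypothetical, and the single vocabulary entry (including its IRI) is a placeholder rather than a copy of the shipped license_iri table.

```rust
/// Illustrative vocabulary lookup; returns None for unknown keys,
/// which plays the role of an undefined template value.
fn lookup_license_iri(key: &str) -> Option<&'static str> {
    match key {
        // ASSUMPTION: placeholder entry, not taken from dcat-us-v3.yaml.
        "cc-by" => Some("http://creativecommons.org/licenses/by/4.0/"),
        _ => None,
    }
}

/// Emits the mapped IRI when the vocabulary knows the key,
/// otherwise passes the raw value through as a plain string.
fn license_value_sketch(raw: &str) -> String {
    lookup_license_iri(raw).unwrap_or(raw).to_string()
}
```

Either way the result is a plain string, which matches the `anyOf:[null,string]` shape the notes cite for the GSA Distribution schema.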

Tests:
* 2 new integration tests (dcat_us_v3_golden_parity_dataset /
  _catalog) iterate the 3 fixtures and assert byte-equivalent .dcat
  output against the goldens.
* discovery_merge test: forced-path form switched from "/dct:title"
  to "/dcat/dct:title" so it matches the legacy dataset_info pointer
  shape; +1 new test for nested-path force blocking top-level merge.
* All 6 goldens refreshed to current legacy output (the original
  Stage-2 capture had alphabetical stats-cache state).
* Full test sweep: 165 unit + 31 integration tests pass, 0 failures.

The legacy dcat.rs / catalog.rs / ckan_to_dcat.rs / curie.rs modules
are still in tree (their tests still run via cmd::profile::*) but no
longer participate in the engine path. Stage 4b deletes them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(profile): delete legacy hardcoded engine + refactor validator (Stages 4b + 5)

The YAML-driven projection engine is now the only path. Stage 4a wired
projection::project() into run() with byte-equivalent output against
the goldens; this commit cleans up by deleting the legacy modules and
refactoring dcat_validate to consume the active ProfileSpec.

Deletions (~2400 LOC):
* src/cmd/profile/dcat.rs (1738 LOC) — the 9 add_* helpers,
  bbox_from_dpps, temporal_from_dpps, csvw_datatype, license_value,
  accrual_periodicity_iri, normalize_iso_639_1. The minijinja-side
  equivalents live in formula_helpers.rs + dcat-us-v3.yaml.
* src/cmd/profile/catalog.rs (154 LOC) — wrap_as_catalog moved into
  projection::wrap_as_catalog.
* src/cmd/profile/ckan_to_dcat.rs (271 LOC) — CKAN_TO_DCAT table
  moved verbatim into dcat-us-v3.yaml's field_mappings:; the lookup
  is now ProfileSpec::translate_ckan_ptr.
* src/cmd/profile/curie.rs (225 LOC) — strip_curies is now an inline
  helper in dcat_validate.rs driven by
  profile.validation.strippable_curie_prefixes.
* mod declarations for the deleted modules in profile.rs.

dcat_validate.rs refactor (Stage 5):
* New public API: validate(profile: &ProfileSpec, block: &Value) ->
  Vec<ProjectionWarning>. When profile.validation.enabled == false
  (DCAT-AP v3, Croissant), returns vec![] without touching the
  schema.
* Inline strip_curies / strip_curie_key replace the deleted curie
  module; the prefix list comes from
  profile.validation.strippable_curie_prefixes (still byte-identical
  to the legacy list for DCAT-US v3).
* classify_severity now returns projection::Severity instead of
  dcat::Severity.
* Test functions migrate to the new (profile, block) signature by
  loading the embedded dcat-us-v3 profile via profile_spec::load.
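
A per-key CURIE strip driven by a prefix list could be sketched as below. The signature is an assumption made for illustration; the real strip_curies presumably also walks nested JSON values, which this key-only sketch leaves out.

```rust
/// Strips a leading `prefix:` from `key` when `prefix` is in the
/// strippable list; every other key is returned unchanged.
/// (Hypothetical signature, operating on a single key only.)
fn strip_curie_key<'a>(key: &'a str, strippable: &[&str]) -> &'a str {
    match key.split_once(':') {
        Some((prefix, local)) if strippable.contains(&prefix) => local,
        _ => key,
    }
}
```

With a prefix list of `["dct", "dcat"]`, `dct:title` becomes `title`, while `@type` and `foaf:name` pass through untouched. Because the function returns a borrowed slice of the input, the emitted JSON is not modified.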

profile.rs cleanup:
* dcat_validate::validate_dataset_or_catalog() call → validate().
* run_profile_validation now returns Vec<ProjectionWarning> directly;
  the .into_iter().map(From::from) bridge is gone.

projection.rs cleanup:
* impl From<DcatWarning> for ProjectionWarning removed (no longer
  needed — all warning producers return ProjectionWarning).

Verification:
* cargo build --bin qsv -F profile,feature_capable — clean.
* All 4 binaries build clean: qsv (-F all_features), qsvmcp
  (-F qsvmcp), qsvlite (-F lite), qsvdp (-F datapusher_plus).
* cargo test cmd::profile:: → 127 unit tests pass (down from 165;
  the deleted legacy modules carried 38 tests now obsoleted by the
  YAML+goldens parity coverage).
* cargo test --test tests test_profile:: → 31 integration tests pass
  (29 original + 2 new dcat_us_v3_golden_parity_* tests).

Net Rust LOC delta this commit: −2388 deleted, +60 added (inline
strip_curies + tests) = −2328 LOC. Cumulative since Stage 1:
−2328 + 1525 + 546 = −257 LOC vs the pre-YAML-engine state, AND
all engine knowledge now lives in resources/profiles/dcat-us-v3.yaml
where it's editable without recompiling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(profile): ship dcat-ap-v3 profile + 4 smoke tests (Stage 6)

DCAT-AP v3 (semiceu.github.io/DCAT-AP/releases/3.0.0/) is now an
embedded profile selectable via --profile dcat-ap-v3. The shape is a
DCAT-US v3 subset, with:

* JSON Schema validation disabled (DCAT-AP ships SHACL upstream; a
  SHACL backend is a future enhancement).
* No dcat-us:* extensions (bureauCode, programCode, accessLevel,
  purpose, liabilityStatement) — those are US-specific.
* New `eu_theme` vocabulary mapping CKAN group slugs to EU
  publications-office authority IRIs
  (http://publications.europa.eu/resource/authority/data-theme/...).
* dcat:accessURL required on Distribution per the v3 spec
  (Mandatory cardinality 1..*).
* dct:conformsTo points at the SEMIC v3 release URL.
* Smaller field_mappings (29 entries vs the 53 in dcat-us-v3) since
  many DCAT-US extensions don't apply.

The same minijinja templates and helpers power both profiles; the
only Rust-side change in this commit is the YAML profile + tests.

Smoke tests (tests/test_profile.rs):
* dcat_ap_v3_emits_no_dcat_us_extensions — verifies the projection
  carries zero dcat-us:* keys even with the full initial-context.
* dcat_ap_v3_distribution_carries_access_url — confirms the
  Distribution-mandatory dcat:accessURL is populated.
* dcat_ap_v3_conforms_to_targets_spec_url — confirms downstream
  consumers can detect the profile via dct:conformsTo.
* dcat_ap_v3_validation_is_disabled_noop — confirms --validate-dcat
  with this profile produces no JSON Schema warnings (the validator
  short-circuits when profile.validation.enabled == false).

Source: https://github.com/SEMICeu/DCAT-AP
Cardinality reference: https://semiceu.github.io/DCAT-AP/releases/3.0.0/

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(profile): ship croissant 1.0 profile + 5 smoke tests (Stage 7)

Croissant ML metadata format (mlcommons.org/croissant) is now an
embedded profile selectable via --profile croissant. The output is
schema.org-rooted JSON-LD conforming to Croissant 1.0:

* @context inlines the canonical Croissant map: @language=en,
  @vocab=https://schema.org/, plus cr:/dct: prefix shorthands. Per
  the Croissant spec at
  https://github.com/mlcommons/croissant/blob/main/docs/croissant-spec.md.
* @type=sc:Dataset; field paths use schema.org bare keys
  (name/description/url/license/creator/publisher/keywords/etc.)
  rather than dcat:/dct: prefixes.
* conformsTo target IRI: http://mlcommons.org/croissant/1.0.
* Distribution emitted under bare `distribution` (schema.org @vocab
  resolves it) with @type=sc:FileObject.
* Per-column cr:RecordSet/cr:Field expansion via the new
  build_croissant_fields helper — one Field per CSV column with
  schema.org dataType (sc:Text / sc:Integer / sc:Float / sc:Boolean
  / sc:Date / sc:DateTime).
* BLAKE3 hash via cr:fileFingerprint (qsv-native mmap+rayon, markedly
  faster than SHA-256 on multi-GB ML training data; Croissant has no
  SPDX-mandated algorithm so the choice is free).
* validation.enabled: false (Croissant uses a Python validator,
  mlcroissant, not JSON Schema).
* discovery_merge.enabled: false (Croissant doesn't live in
  CKAN-style data portals).

Engine extensions:
* DatasetBlock.context now accepts a `Value` (string or object) so
  the inline Croissant @context map round-trips verbatim. DCAT-US /
  DCAT-AP profiles still use a string URI — backwards-compatible.
* DistributionBlock.path lets profiles override the Distribution
  wrapper key. Croissant emits `distribution`; DCAT defaults remain
  `dcat:distribution`.
* New formula helper build_croissant_fields(stats) walks the per-
  column stats map and emits a flat cr:Field array with schema.org
  dataType IRIs.

Smoke tests (5 in tests/test_profile.rs):
* croissant_uses_schema_org_context_and_sc_dataset_type
* croissant_conforms_to_targets_mlcommons_spec
* croissant_emits_recordset_with_one_field_per_csv_column
* croissant_uses_bare_distribution_key_not_dcat_namespaced
* croissant_distribution_uses_file_object_type

Verification: cargo test cmd::profile:: → 127 unit, test_profile::
→ 40 integration tests pass (29 original + 2 parity + 4 DCAT-AP +
5 Croissant).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(profile): regenerate help + finalize handoff (Stage 8)

* docs/help/profile.md regenerated via --generate-help-md to surface
  the --profile flag added in Stage 1.
* profile3-handoff.md updated to reflect all 8 stages landed,
  full file map post-deletion, verification commands, captured
  design decisions, and queued follow-ups.
* src/cmd/profile.rs: drop the now-useless DcatWarning → ProjectionWarning
  conversion in the --validate-dcat code path (Stage 5 already
  refactored validate() to return ProjectionWarning directly).

Verification:
* python3 scripts/docs-drift-check.py → no drift detected.
* All 4 binaries build clean (qsv, qsvmcp, qsvlite, qsvdp).
* cargo test cmd::profile:: → 127 unit tests pass.
* cargo test --test tests test_profile:: → 40 integration tests pass.
* cargo clippy --bin qsv -F profile,feature_capable → no new findings
  in the YAML-engine code path.

This closes the YAML-driven projection engine migration. The shipped
binary always goes through projection::project(); the legacy
dcat.rs / catalog.rs / ckan_to_dcat.rs / curie.rs modules are
deleted. DCAT-US v3 / DCAT-AP v3 / Croissant projection knowledge
lives entirely in resources/profiles/*.yaml — editable without
recompiling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(profile): address roborev #2490 findings (catalog/discovery/force/validate)

7 findings from the YAML-engine branch review at job 2490. Fixes 1-5
ship with regression guards in tests/test_profile.rs (6 guards total).

Medium severity (6):

1. Catalog mode + discovery merge target (src/cmd/profile.rs:398).
   Discovery was merging into the Catalog envelope top-level instead
   of the nested Dataset. Fix: project Dataset always, apply
   discovery_merge::merge, THEN conditionally wrap in Catalog via the
   new projection::wrap_in_catalog_envelope helper. Guard:
   catalog_mode_merges_discovered_into_inner_dataset_not_envelope.

2. Catalog envelope missing @context (src/cmd/profile/projection.rs:296).
   The envelope carried CURIE keys (dct:title, dct:conformsTo,
   dcat:dataset) without a top-level @context, leaving it invalid as
   JSON-LD. Fix: wrap_as_catalog now copies profile.dataset.context
   into the envelope; inner Dataset keeps its own context for
   self-containment. Guard: catalog_envelope_carries_top_level_context.

3. dct:spatial emits string "null" when no bbox
   (resources/profiles/dcat-us-v3.yaml + dcat-ap-v3.yaml). bbox_from_dpps
   returning UNDEFINED rendered as `"null"` via `| tojson` because
   coerce_json_or_string left the literal alone. Fix: emit_when guard
   gates the field on WKT-or-bbox availability. Guard:
   spatial_field_suppressed_when_no_lat_lon_columns.

4. --dcat-legacy-license parsed but never wired
   (src/cmd/profile.rs:380). Flag was documented + collected into
   Args but never reached the YAML engine. Fix: thread the flag into
   projection_ctx as `legacy_license`, add a conditional Dataset-level
   dct:license field in dcat-us-v3.yaml gated on that variable.
   Guards: dcat_legacy_license_emits_dataset_level_license,
   dcat_legacy_license_off_keeps_license_distribution_only.

5. Forced package/resource values bypass profile shaping
   (src/cmd/profile/context.rs:388). collect_forced_paths was
   writing raw CKAN values to target pointers via
   apply_force_overrides, producing string-where-Agent-expected
   shapes (e.g. forced package.publisher → "Name" instead of
   {"@type":"foaf:Agent","foaf:name":"Name"}). Fix: CKAN-side
   forces now only contribute to `forced_paths` (discovery-merge
   protection); the value lives in merged package/resource via
   normalize_value_force and flows through the profile's templates
   for proper shaping. dataset_info forces still take the
   raw-write path (that's the documented escape hatch).
   Guard: forced_package_publisher_flows_through_profile_template.

6. validate() ignores profile.validation paths
   (src/cmd/profile/dcat_validate.rs:250). When validation.enabled
   was true, the function always used the embedded GSA bundle
   regardless of profile.validation.schema_dir. Fix: when the
   profile's schema_dir matches the embedded `resources/dcat-us-v3/`
   path (the only bundle qsv ships today), use the embedded
   validators; any other schema_dir produces a single
   Recommended-severity warning explaining that custom-bundle
   validation is a queued follow-up. The embedded DCAT-US v3
   profile's behavior is unchanged.

Low severity (1):

7. DiscoveryMerge::default() disabled merging
   (src/cmd/profile/profile_spec.rs:273). #[derive(Default)] gave
   `enabled: false`, contradicting the documented "fill-if-absent
   enabled by default" semantics — the `#[serde(default =
   "default_true")]` annotation only fires during deserialization.
   Fix: hand-rolled Default impl with enabled: true, the
   never_overwrite list (@context, @type, dcat:distribution), and
   fill-if-absent strategy.
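
The pitfall in finding 7 is easy to reproduce with the standard library alone. In the sketch below both struct definitions are simplified stand-ins (field types are assumptions); the point is that a derived Default yields `false` for a bool, so an "enabled by default" setting needs a hand-written impl.

```rust
/// Deriving Default uses bool::default(), which is false.
#[derive(Default)]
struct DerivedMerge {
    enabled: bool,
}

/// Simplified stand-in for the real DiscoveryMerge type.
struct DiscoveryMerge {
    enabled: bool,
    never_overwrite: Vec<String>,
    strategy: String,
}

impl Default for DiscoveryMerge {
    fn default() -> Self {
        Self {
            enabled: true,
            never_overwrite: vec![
                "@context".to_string(),
                "@type".to_string(),
                "dcat:distribution".to_string(),
            ],
            strategy: "fill-if-absent".to_string(),
        }
    }
}
```

Here `DerivedMerge::default().enabled` is `false`, while `DiscoveryMerge::default().enabled` is `true`, regardless of whether the value was ever deserialized.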

Golden refresh:
* Catalog goldens (nyc-311, usda-soil, wprdc-311) pick up the new
  envelope @context entry — finding #2 fix.
* usda-soil dataset golden loses the spurious `"dct:spatial":
  "null"` entry — finding #3 fix.

Verification:
* cargo test cmd::profile:: → 127 unit tests pass.
* cargo test --test tests test_profile:: → 46 integration tests pass
  (40 prior + 6 new regression guards).
* All 4 binaries build clean (qsv, qsvmcp, qsvlite, qsvdp).
* cargo +nightly fmt applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(profile): drop auto-generated stats caches from golden dir

The previous commit accidentally committed three *.stats.csv files
(qsv stats cache, auto-regenerated on every profile run). They slipped
past .gitignore because the golden-directory *.csv whitelist also
matches the stats.csv suffix.

Fix: add a re-ignore rule for `tests/resources/profile/golden/*.stats.csv`
and the JSONL variant, then `git rm` the committed files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(profile): preserve CKAN-side force against spec formulas (roborev #2491)

Regression introduced by the #2490 fix #5: when CKAN-side `force: true`
values stopped being raw-written via apply_force_overrides, they
became vulnerable to overwrite by spec formulas. A formula targeting
`package.publisher` would replace the forced value in
merge_formula_results' pass-1 (before projection), violating the
documented "force beats inferred" guarantee.

Fix: track the CKAN-side forced field-name sets through the pipeline
so merge_formula_results can skip them.

* context.rs: collect_forced_paths now returns a 4-tuple including
  `forced_package_fields` and `forced_resource_fields`
  (HashSet<String> of CKAN-side field names marked force:true).
  load_initial_context returns the matching 6-tuple; AnalysisContext
  carries both sets.
* profile.rs: merge_formula_results takes the two sets and skips
  pass-1 inserts on matching field names. Suggestion-formula output
  (pass 2) lives in dpp_suggestions and is unaffected.
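
The pass-1 skip can be sketched as below, assuming flat string maps in place of the real JSON package object. The function name and signature are hypothetical; only the idea of skipping inserts for field names in the forced set is taken from the notes.

```rust
use std::collections::{BTreeMap, HashSet};

/// Pass-1 merge: formula results overwrite package fields
/// unless the field name is in the forced set.
/// (Hypothetical signature; real values are JSON, not plain strings.)
fn merge_formula_results_sketch(
    package: &mut BTreeMap<String, String>,
    formula_results: &BTreeMap<String, String>,
    forced_package_fields: &HashSet<String>,
) {
    for (field, value) in formula_results {
        if forced_package_fields.contains(field) {
            continue; // force beats formula
        }
        package.insert(field.clone(), value.clone());
    }
}
```

With `title` forced to "Forced Title" and a formula producing "Formula Wins" for the same field, the package keeps "Forced Title"; unforced fields still take the formula output.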

The forced value still flows through the profile templates for proper
shaping (so dct:publisher gets its foaf:Agent wrapper, etc.) — the
shaping fix from #2490 #5 is preserved.

Regression guard: forced_package_field_survives_formula_overwrite
(tests/test_profile.rs). Constructs a spec with a `title` formula
that would set "Formula Wins", combined with `package.title:
{value: "Forced Title", force: true}`. The output must carry
"Forced Title" — confirming force beats formula.

Verification:
* cargo test cmd::profile:: → 127 unit tests pass.
* cargo test --test tests test_profile:: → 47 integration tests
  pass (46 prior + 1 new regression guard).
* cargo +nightly fmt applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(profile): expand forced CKAN fields through alias mappings (roborev #2493)

Follow-up regression to #2491: the force-skip in merge_formula_results
only checked the exact CKAN field name. Aliases that project to the
same target pointer (e.g. `package.author` and `package.publisher`
both → `/dcat/dct:publisher`) bypassed the check — a formula writing
`publisher` could still overwrite a forced `author` value.

Fix: after the first pass collects forced (ckan_ptr, target_ptr)
pairs, walk profile.field_mappings and add every CKAN field whose
target appears in the forced target set to the forced_pkg /
forced_res field-name set. So forcing `package.author` now also locks
`package.publisher` (and any other alias keys for the same target).
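
The alias expansion can be sketched roughly as follows, assuming `field_mappings` is a plain map from CKAN field name to target pointer. `expand_forced_aliases` is a hypothetical helper name and the real data structures in context.rs may differ.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical helper: widen the forced field-name set so that every
/// CKAN field mapping to an already-forced target is also locked.
fn expand_forced_aliases(
    field_mappings: &BTreeMap<String, String>,
    forced_fields: &HashSet<String>,
) -> HashSet<String> {
    // Step 1: collect the target pointers of the explicitly forced fields.
    let forced_targets: HashSet<&str> = forced_fields
        .iter()
        .filter_map(|field| field_mappings.get(field))
        .map(String::as_str)
        .collect();

    // Step 2: add every field whose target is in the forced target set.
    let mut expanded = forced_fields.clone();
    for (field, target) in field_mappings {
        if forced_targets.contains(target.as_str()) {
            expanded.insert(field.clone());
        }
    }
    expanded
}
```

Forcing `author` when both `author` and `publisher` map to `/dcat/dct:publisher` yields a set containing both names; fields with other targets stay out.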

Alias pairs covered by this fix in DCAT-US v3:
* author / publisher → dct:publisher
* landing_page / url → dcat:landingPage
* data_dictionary / describedBy → dcat:describedBy
* accrualPeriodicity / frequency / update_frequency → dct:accrualPeriodicity
* dcat-us:accessLevel / access_level → dcat-us:accessLevel
* accessRights / access_rights → dct:accessRights
* scopeNote / scope_note → skos:scopeNote
* liabilityStatement / liability_statement → dcat-us:liabilityStatement
* inSeries / in_series → dcat:inSeries
* versionNotes / version_notes → dcat:versionNotes
* license / license_id → distribution.dct:license
* modified / last_modified → distribution.dct:modified

Regression guards (tests/test_profile.rs):
* forced_author_locks_publisher_alias — forces package.author,
  formula targets `publisher`, asserts foaf:name is "Forced Author".
* forced_license_id_locks_license_alias — forces resource.license_id
  to cc-by, formula targets `license` with cc-by-sa, asserts the
  CC-BY 4.0 IRI (not CC-BY-SA) lands on Distribution.

Verification:
* cargo test cmd::profile:: → 127 unit tests pass.
* cargo test --test tests test_profile:: → 49 integration tests
  pass (47 prior + 2 new alias guards).
* cargo +nightly fmt applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* address review: 9 Copilot suggestions on PR #3908

Apply all 9 unresolved inline review comments. Each was verified
against the current code before action.

1. docs/help/profile.md (truncated --initial-context help)
   Reformatted the USAGE block in src/cmd/profile.rs so the
   description survives markdown-table generation: flattened the
   nested bullet list into a single paragraph and added a pointer
   to dcat-init-context.README.md for the full example.

2. tests/resources/profile/dcat-init-context.README.md
   Updated the "How package / resource force flags route to DCAT"
   section to reference the active profile's `field_mappings:` table
   + `ProfileSpec::translate_ckan_ptr` instead of the deleted
   src/cmd/profile/ckan_to_dcat.rs module.

3. src/cmd/profile/profile_spec.rs (load-time validation claim)
   Moved `projection::dry_compile` inside `load()` so the doc claim
   on `EMBEDDED` is now accurate: every template parses through
   minijinja at profile-load time, surfacing typos before
   stats/frequency/formulas run. Dropped the redundant dry_compile
   call from profile.rs::run.

4. profile3-handoff.md (hardcoded absolute path)
   Removed the `/Users/joelnatividad/.claude/plans/...` reference
   to the original plan file; the handoff now describes the engine
   without pointing at a path that doesn't exist for other
   contributors.

5. resources/profiles/croissant.yaml (misplaced key)
   Removed the no-op `strippable_curie_prefixes: []` from the
   `discovery_merge:` block — that key lives under `validation:`
   per the schema; keeping it here was misleading.

6. src/cmd/profile.rs (dead `merge_discovered` + tests)
   Deleted the orphaned legacy `merge_discovered` function (the
   orchestrator now uses `discovery_merge::merge` exclusively) and
   the 9 in-file tests that exercised it. Coverage is preserved by
   the unit tests in src/cmd/profile/discovery_merge.rs and the
   new integration tests in tests/test_profile.rs (e.g.
   `catalog_mode_merges_discovered_into_inner_dataset_not_envelope`).
   Net −168 LOC.

7-8. src/cmd/profile.rs (stale `ckan_to_dcat` doc comments)
   Updated two doc comments (`apply_force_overrides` doc + the
   force-collection comment in `run()`) so future readers find
   `field_mappings:` + `ProfileSpec::translate_ckan_ptr` instead
   of being pointed at the deleted module.

9. resources/dcat-us-v3/README.md (wrong test path)
   The pin-guard test lives at tests/test_profile.rs::dcat_us_v3_bundle_pin_manifest_matches_files,
   not the non-existent tests/test_dcat_us_bundle_pin.rs. Updated
   both the prose reference and the `cargo test` invocation.

Verification:
* cargo build --bin qsv,qsvmcp,qsvlite,qsvdp — all 4 clean.
* cargo test cmd::profile:: → 117 unit tests pass (was 127; the
  10 deleted merge_discovered tests are obsolete).
* cargo test --test tests test_profile:: → 49 integration tests
  pass (unchanged).
* cargo +nightly fmt applied.
* docs/help/profile.md regenerated via --generate-help-md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* address roborev #2495: extend dry_compile + restore IRI escape coverage

Two findings from the post-fix re-review of d78d34c.

Medium (src/cmd/profile/projection.rs:dry_compile):
  The previous load-time validation only checked emit_when guards on
  dataset fields, leaving distribution and catalog field guards
  unchecked. A typo in a distribution emit_when would pass load()
  but fail silently at projection time (render_truthy treats the
  error as false, dropping the field). Fix: extend dry_compile to
  syntax-check emit_when in both distribution and catalog field
  loops. New guards:
  * dry_compile_rejects_malformed_distribution_emit_when
  * dry_compile_rejects_malformed_catalog_emit_when

Low (src/cmd/profile/discovery_merge.rs):
  The removed merge_discovered tests carried regression coverage for
  forced discovered keys containing `/` or `~` (full-IRI JSON-LD
  properties like http://purl.org/dc/terms/title). Restore that
  coverage on discovery_merge's internal escape_token path. New
  tests:
  * forced_full_iri_key_blocks_matching_discovered_key — forced path
    with each `/` escaped to `~1` must block the matching discovered
    IRI key.
  * forced_full_iri_key_does_not_block_unrelated_discovered_key —
    escaping must not over-match; unrelated discovered keys (e.g.
    dct:identifier) still flow through.
  * escape_token_handles_rfc6901_round_trip — direct check of the
    `~`-before-`/` escape order on plain, slash, tilde, mixed, and
    full-IRI inputs.
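
For reference, the `~`-before-`/` order comes from RFC 6901 (JSON Pointer). The sketch below is an illustrative version, not the code in discovery_merge.rs; `unescape_token` is a hypothetical inverse added only to show the round trip.

```rust
/// Escape a JSON Pointer reference token per RFC 6901.
/// `~` must become `~0` before `/` becomes `~1`; the reverse order
/// would re-escape the `~` introduced by the `/` replacement.
fn escape_token(token: &str) -> String {
    token.replace('~', "~0").replace('/', "~1")
}

/// Hypothetical inverse: undo `~1` first, then `~0`, so that an
/// escaped `~` followed by a literal `1` is not mistaken for `/`.
fn unescape_token(token: &str) -> String {
    token.replace("~1", "/").replace("~0", "~")
}
```

Under this scheme a full-IRI key such as `http://purl.org/dc/terms/title` has each `/` replaced by `~1`, which is the form a forced path has to take to match it.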

Verification:
* cargo test cmd::profile:: → 122 unit tests pass (117 prior + 5 new).
* cargo test --test tests test_profile:: → 49 integration tests pass.
* cargo +nightly fmt applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>