feat(profile): upgrade croissant profile to Croissant 1.1 by jqnatividad · Pull Request #3916 · dathere/qsv

jqnatividad · 2026-05-28T02:52:31Z

Summary

Upgrades the bundled `croissant` projection profile so `qsv profile --profile croissant` emits canonical Croissant 1.1 JSON-LD instead of 1.0. All deltas live in `resources/profiles/croissant.yaml`; two small test assertions were updated to track the new shape.

Spec deltas applied to `resources/profiles/croissant.yaml`

`conformsTo` bumped from `http://mlcommons.org/croissant/1.0` → `http://mlcommons.org/croissant/1.1`.
`@context` expanded to the canonical 1.1 prefix table: adds `sc:`, `rai:`, and the `cr:` shortcut terms (`recordSet`, `field`, `fileObject`, `source`, `extract`, `column`, `dataType`, `citeAs`, `data`, `isLiveDataset`, `key`, `references`, `regex`, `subField`, `transform`, `fileSet`, `includes`) so emitted JSON uses the bare property names mlcroissant expects.
Distribution `@type` switched from `sc:FileObject` to `cr:FileObject` (1.1 defines its own FileObject class).
`@id` slugs added on the FileObject (`data.csv`) and RecordSet (`main-table`) so `cr:Field.source` can reference the FileObject by IRI.
File hash switched from the non-canonical `cr:fileFingerprint` + `cr:Checksum` nested shape to the canonical direct `sha256` property. Uses `sha256_of`; BLAKE3 is faster but is not in the Croissant vocabulary, so spec-compliance wins here.
`cr:Field` source blocks: each field now carries the canonical `source: {fileObject: {@id: data.csv}, extract: {column: <name>}}`. The existing `build_croissant_fields` helper only emits a 1.0-style flat field list, so the recordSet template was switched to an inline Jinja loop over `stats | items` that emits the per-field source block. `qsv:cardinality` / `qsv:nullcount` diagnostics preserved.
New 1.1 field `citeAs`, populated from `pkg.citation` via a new `/package/citation → /projection/citeAs` field mapping.
Catalog envelope `conforms_to` bumped to 1.1.
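To make the per-field `source` wiring above concrete, here is a minimal Rust sketch of the shape each `cr:Field` is described as carrying. It is not qsv's Jinja template: the function names (`json_escape`, `render_field`), the `dataType` argument, and the column names are hypothetical. Only the property names (`source`, `fileObject`, `@id`, `extract`, `column`) and the `data.csv` slug are taken from the list above.

```rust
// Hypothetical sketch: renders one Croissant 1.1 `cr:Field` with its `source` block.
// This is NOT qsv's template code; all function names are illustrative.

/// Escape the characters that must be escaped inside a JSON string.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build a field object whose `source` references the FileObject by `@id`
/// and names the CSV column to extract.
fn render_field(file_object_id: &str, column: &str, data_type: &str) -> String {
    format!(
        r#"{{"@type": "cr:Field", "name": "{col}", "dataType": "{dt}", "source": {{"fileObject": {{"@id": "{id}"}}, "extract": {{"column": "{col}"}}}}}}"#,
        col = json_escape(column),
        dt = json_escape(data_type),
        id = json_escape(file_object_id),
    )
}

fn main() {
    // Column names are made up for illustration.
    for col in ["unique_key", "created_date"] {
        println!("{}", render_field("data.csv", col, "sc:Text"));
    }
}
```

Because every field points at the same `@id`, a consumer can resolve each column back to the one FileObject without relying on position in the list.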

Test updates

`tests/test_profile.rs::croissant_distribution_uses_file_object_type` now asserts `cr:FileObject` (was `sc:FileObject`).
`src/cmd/profile/profile_spec.rs::embedded_croissant_parses_and_dry_compiles` bumps `field_mappings.len()` 16 → 17 for the new `citation` mapping.

Behavioral note (one minor user-visible change)

The hash algorithm in the emitted JSON-LD changes from BLAKE3 (under cr:fileFingerprint / cr:Checksum) to SHA-256 (direct sha256 property). That's slower on multi-GB inputs but is what mlcroissant validates against. Users who relied on the BLAKE3 fingerprint shape will see it disappear from the projection.

Test plan

`cargo test -F all_features croissant`: 6/6 pass (1 unit + 5 integration)
End-to-end render on `nyc-311-subset.csv` produces canonical 1.1-shaped JSON-LD: correct `conformsTo`, full canonical `@context`, `cr:FileObject` with direct `sha256`, per-column `cr:Field` entries whose `source` blocks point back to `@id: data.csv`
Optional: install mlcroissant locally and run `qsv profile --profile croissant --validate ...` against a small dataset to confirm the canonical 1.1 validator accepts the output (CI environments without mlcroissant continue to fall through with the standard Recommended-severity warning)

🤖 Generated with Claude Code

@context

The bundled `croissant` profile now emits canonical Croissant 1.1 JSON-LD (https://github.com/mlcommons/croissant/blob/main/docs/croissant-spec-1.1.md) instead of 1.0.

Spec deltas applied to resources/profiles/croissant.yaml:

* conformsTo IRI bumped from `.../croissant/1.0` → `.../croissant/1.1`
* @context expanded to the canonical 1.1 prefix table (adds `sc:`, `rai:`, and the cr: shortcut terms `recordSet`, `field`, `fileObject`, `source`, `extract`, `column`, `dataType`, `citeAs`, `data`, `isLiveDataset`, `key`, `references`, `regex`, `subField`, `transform`, `fileSet`, `includes`)
* Distribution `@type` switched from `sc:FileObject` to `cr:FileObject` (Croissant 1.1 defines its own FileObject class)
* FileObject and RecordSet now carry stable `@id` slugs (`data.csv` and `main-table`) so cr:Field.source can reference the FileObject by IRI
* File hash switched from the non-canonical `cr:fileFingerprint` + `cr:Checksum` nested shape to the canonical direct `sha256` property mlcroissant validates against (uses `sha256_of`; BLAKE3 is faster but not in the Croissant vocabulary)
* cr:Field entries now carry the canonical `source.{fileObject:{@id}, extract:{column}}` block. The existing `build_croissant_fields` helper only emits a 1.0-style flat field list, so the recordSet template was switched to an inline Jinja loop over `stats | items` that emits the source block per column. `qsv:cardinality` / `qsv:nullcount` diagnostics preserved.
* New 1.1-first-class `citeAs` field, populated from `pkg.citation` via a new `/package/citation → /projection/citeAs` field mapping
* Catalog envelope `conforms_to` bumped to 1.1

Test updates:

* `tests/test_profile.rs::croissant_distribution_uses_file_object_type` now asserts `cr:FileObject` (was `sc:FileObject`)
* `src/cmd/profile/profile_spec.rs::embedded_croissant_parses_and_dry_compiles` bumps `field_mappings.len()` 16 → 17 for the new `citation` mapping

Verified end-to-end on `nyc-311-subset.csv`: emitted JSON-LD has the correct conformsTo, full canonical @context, `cr:FileObject` with direct `sha256`, and per-column `cr:Field` entries whose `source` blocks point back to `@id: data.csv`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codacy-production · 2026-05-28T02:53:17Z

Up to standards ✅

🟢 Issues: 0 new issues

🟢 Metrics: complexity 0

@context

Adds a Changed entry under [Unreleased] documenting the spec-version bump, the canonical 1.1 @context / @type / @id changes, the new citeAs field, and the one user-visible behavioral delta (file-hash slot switched from BLAKE3-via-cr:fileFingerprint to canonical direct sha256). References PR #3916.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR upgrades qsv’s bundled croissant projection profile so qsv profile --profile croissant emits Croissant 1.1-shaped JSON-LD (vs 1.0), aligning output with the canonical spec/validator expectations.

Changes:

Updates resources/profiles/croissant.yaml for Croissant 1.1 (expanded canonical @context, conformsTo 1.1, cr:FileObject, direct sha256, per-field source blocks, and new citeAs mapping).
Adjusts two assertions to match the new profile shape (FileObject type + field_mappings count).
Adds an Unreleased changelog entry documenting the behavior change (including hash algorithm/shape).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| resources/profiles/croissant.yaml | Implements Croissant 1.1 canonical JSON-LD shape (context, types, hashing, field source wiring, citeAs). |
| tests/test_profile.rs | Updates Croissant distribution `@type` assertion to `cr:FileObject`. |
| src/cmd/profile/profile_spec.rs | Updates embedded Croissant profile expectation for new `citation → citeAs` mapping count. |
| CHANGELOG.md | Documents the Croissant 1.1 profile upgrade and the hash slot/algorithm change. |

@context

Two fixes surfaced by running `mlcroissant validate --jsonld` against the rendered projection block:

1. `contentUrl` is mandatory on `cr:FileObject` per the 1.1 spec, but the previous template gated it behind `only_if_absolute_iri`, so local-file inputs (with no `res.url`) emitted no contentUrl at all, failing mlcroissant's mandatory-property check. The canonical 1.1 example uses a relative path (`"contentUrl": "data/data.csv"`), so a bare basename is spec-conformant. Template now falls back to `res.name | source_label | basename` when no `res.url` is set.
2. mlcroissant's canonical @context includes a longer cr: shortcut term table than what was emitted. mlcroissant logged a "JSON-LD @context is not standard" WARNING listing the missing keys. Added: `equivalentProperty`, `examples`, `fileProperty`, `format`, `jsonPath`, `md5`, `parentField`, `path`, `repeated`, `replace`, `samplingRate`, `separator`. Profile @context now matches the canonical 1.1 context one-for-one; the warning is gone.

Verified end-to-end with `pip install mlcroissant` + a populated `--initial-context`: `qsv profile --profile croissant --validate` produces `projection_warnings: []` and a direct `mlcroissant validate --jsonld projection.jsonld` reports "Done." with zero errors and zero warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…oborev #2565)

Roborev #2565 (Medium): the local-file `contentUrl` fallback used `res.name` before `source_label | basename`, but `context::build` defaults `res.name` from `Path::file_stem()` (no extension), so `in.csv` projected `"contentUrl": "in"` instead of `"in.csv"`. A user-supplied `res.name` via `--initial-context` is also typically a human label, not a usable content URL. Removed `res.name` from the fallback chain; `source_label | basename` preserves the extension and works for both default and seeded cases.

Verified: `in.csv` now projects `"contentUrl": "in.csv"`, and `mlcroissant validate` still accepts the output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
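The extension loss described in this fix comes straight from how `std::path::Path` splits a file name. The sketch below is a simplified Rust stand-in for the template's fallback chain, not the actual Jinja filters: `stem_label` and `content_url` are hypothetical names, and the absolute-IRI check from the original gate is omitted for brevity.

```rust
use std::path::Path;

/// What a `file_stem()`-derived default name looks like: the extension is dropped.
fn stem_label(source: &str) -> Option<&str> {
    Path::new(source).file_stem()?.to_str()
}

/// Hypothetical fallback: prefer an explicit URL when one is set, otherwise use
/// the basename of the source path, which keeps the extension.
fn content_url(res_url: Option<&str>, source_label: &str) -> Option<String> {
    match res_url {
        Some(url) if !url.is_empty() => Some(url.to_string()),
        _ => Path::new(source_label)
            .file_name()?
            .to_str()
            .map(String::from),
    }
}

fn main() {
    let source = "fixtures/in.csv"; // illustrative path
    println!("stem:       {:?}", stem_label(source));
    println!("contentUrl: {:?}", content_url(None, source));
}
```

Keeping the human-readable name out of the chain means the fallback depends only on the input path, so seeded and default contexts behave the same way.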

@context

…uppress DevSkim md5 context-term hit

Copilot review #3315150422 (Medium): the per-column qsv stats type → Croissant dataType mapping was inlined inside the recordSet template while `vocabularies.croissant_datatype` already defined the same table. Replaced the inline `({...})[qsv_type] | default("sc:Text")` with `lookup("croissant_datatype", qsv_type) | default("sc:Text")` so the mapping lives in exactly one place. Verified the rendered dataType values are identical (sc:Text / sc:DateTime / sc:Integer / ...) and mlcroissant still validates clean.

Copilot review #3315150455 (Medium): added `croissant_distribution_emits_canonical_sha256_not_legacy_fingerprint` asserting (a) `distribution[0].sha256` is a 64-char lowercase-hex SHA-256 digest, and (b) the pre-1.1 `cr:fileFingerprint` / `cr:Checksum` shape never appears anywhere in the projection. Prevents silent regression back to the 1.0 fingerprint layout.

Copilot review #3315150471 (Medium): added `croissant_recordset_fields_wire_source_to_file_object` asserting every emitted cr:Field carries a `source.fileObject.@id` pointing at the FileObject's `@id` and a `source.extract.column` matching the field's `name`. That's the canonical 1.1 wiring between schema metadata and actual bytes; without this guard, the per-Field template could silently drop the source block while other tests still pass on field count / type / dataType alone.

github-advanced-security[bot] (DevSkim DS126858, 2 hits): suppressed on the `md5: "cr:md5"` line. That is a JSON-LD @context shortcut term declaring that the bare `md5` key resolves to the `cr:md5` IRI, NOT actual MD5 hash usage. qsv emits SHA-256 as the canonical Croissant 1.1 fingerprint; MD5 is never computed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
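The "one table plus a default" idea behind the `lookup(...) | default("sc:Text")` change can be sketched in Rust as follows. This is an illustration only: the real table lives in `vocabularies.croissant_datatype` and its contents are not shown on this page, so the three entries and the key names (`String`, `Integer`, `DateTime`) are assumptions, as are the function names.

```rust
use std::collections::HashMap;

/// Illustrative stand-in for the shared vocabulary table.
/// Keys and entries are ASSUMED for the example, not copied from qsv.
fn croissant_datatype_table() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("String", "sc:Text"),
        ("Integer", "sc:Integer"),
        ("DateTime", "sc:DateTime"),
    ])
}

/// Look a type up in the single shared table, falling back to `sc:Text`
/// when the type is unknown (mirrors the `default("sc:Text")` filter).
fn lookup_datatype(table: &HashMap<&'static str, &'static str>, qsv_type: &str) -> &'static str {
    table.get(qsv_type).copied().unwrap_or("sc:Text")
}

fn main() {
    let table = croissant_datatype_table();
    for t in ["Integer", "DateTime", "SomethingNew"] {
        println!("{t} -> {}", lookup_datatype(&table, t));
    }
}
```

With the table defined once, adding a new type is a one-line change and every call site picks it up, which is the point of removing the inlined copy.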

…heck (roborev #2568)

Roborev #2568 (Low): the SHA-256 regression test claimed "lowercase hex" in its failure message but used `is_ascii_hexdigit()`, which also accepts uppercase A-F, so a regression to uppercase digests would pass silently. Tightened to `c.is_ascii_digit() || ('a'..='f').contains(&c)` so the check now matches the documented contract. `sha256_of` already emits lowercase hex, so the assertion still passes against current output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
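The difference between the two checks is easy to reproduce in isolation. The following Rust sketch compares them on a known digest; the helper names are hypothetical and the 64-character length check is added here to mirror the "64-char" contract, so this is not the test's actual code.

```rust
/// Loose check: `is_ascii_hexdigit` accepts 0-9, a-f AND A-F.
fn is_hex_digest_loose(s: &str) -> bool {
    s.len() == 64 && s.chars().all(|c| c.is_ascii_hexdigit())
}

/// Strict check: digits and lowercase a-f only, matching the tightened assertion.
fn is_lower_hex_digest(s: &str) -> bool {
    s.len() == 64
        && s.chars()
            .all(|c| c.is_ascii_digit() || ('a'..='f').contains(&c))
}

fn main() {
    // SHA-256 of the empty input, in lowercase hex.
    let lower = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855";
    let upper = lower.to_uppercase();

    println!("loose(lower)  = {}", is_hex_digest_loose(lower));
    println!("loose(upper)  = {}", is_hex_digest_loose(&upper));
    println!("strict(lower) = {}", is_lower_hex_digest(lower));
    println!("strict(upper) = {}", is_lower_hex_digest(&upper));
}
```

Only the strict variant rejects the uppercase form, which is the regression the tightened test is meant to catch.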

github-advanced-security AI found potential problems May 28, 2026

Comment thread resources/profiles/croissant.yaml Dismissed

jqnatividad requested a review from Copilot May 28, 2026 03:07

Copilot started reviewing on behalf of jqnatividad May 28, 2026 03:07

Copilot AI reviewed May 28, 2026

Comment thread resources/profiles/croissant.yaml Outdated

Comment thread resources/profiles/croissant.yaml

Comment thread resources/profiles/croissant.yaml

github-advanced-security AI found potential problems May 28, 2026

Comment thread resources/profiles/croissant.yaml Fixed

Comment thread resources/profiles/croissant.yaml Fixed

jqnatividad and others added 4 commits May 27, 2026 23:18

chore: rust fmt

19a0ab4

jqnatividad merged commit 11848cd into master May 28, 2026
16 of 17 checks passed

jqnatividad deleted the feat-croissant-1.1 branch May 28, 2026 03:36

jqnatividad merged 7 commits into master from feat-croissant-1.1

3 participants
