feat(profile): upgrade croissant profile to Croissant 1.1 #3916
Merged
The bundled `croissant` profile now emits canonical Croissant 1.1 JSON-LD (https://github.com/mlcommons/croissant/blob/main/docs/croissant-spec-1.1.md) instead of 1.0.

Spec deltas applied to resources/profiles/croissant.yaml:

* conformsTo IRI bumped from `.../croissant/1.0` → `.../croissant/1.1`
* @context expanded to the canonical 1.1 prefix table (adds `sc:`, `rai:`, and the cr: shortcut terms `recordSet`, `field`, `fileObject`, `source`, `extract`, `column`, `dataType`, `citeAs`, `data`, `isLiveDataset`, `key`, `references`, `regex`, `subField`, `transform`, `fileSet`, `includes`)
* Distribution `@type` switched from `sc:FileObject` to `cr:FileObject` (Croissant 1.1 defines its own FileObject class)
* FileObject and RecordSet now carry stable `@id` slugs (`data.csv` and `main-table`) so cr:Field.source can reference the FileObject by IRI
* File hash switched from the non-canonical `cr:fileFingerprint` + `cr:Checksum` nested shape to the canonical direct `sha256` property mlcroissant validates against (uses `sha256_of`; BLAKE3 is faster but not in the Croissant vocabulary)
* cr:Field entries now carry the canonical `source.{fileObject:{@id}, extract:{column}}` block. The existing `build_croissant_fields` helper only emits a 1.0-style flat field list, so the recordSet template was switched to an inline Jinja loop over `stats | items` that emits the source block per column. `qsv:cardinality` / `qsv:nullcount` diagnostics preserved.
* New 1.1-first-class `citeAs` field, populated from `pkg.citation` via a new `/package/citation → /projection/citeAs` field mapping
* Catalog envelope `conforms_to` bumped to 1.1

Test updates:

* `tests/test_profile.rs::croissant_distribution_uses_file_object_type` now asserts `cr:FileObject` (was `sc:FileObject`)
* `src/cmd/profile/profile_spec.rs::embedded_croissant_parses_and_dry_compiles` bumps `field_mappings.len()` 16 → 17 for the new `citation` mapping

Verified end-to-end on `nyc-311-subset.csv`: emitted JSON-LD has the correct conformsTo, full canonical @context, `cr:FileObject` with direct `sha256`, and per-column `cr:Field` entries whose `source` blocks point back to `@id: data.csv`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Up to standards ✅

| Metric | Results |
|---|---|
| Complexity | 0 |
Adds a Changed entry under [Unreleased] documenting the spec-version bump, the canonical 1.1 @context / @type / @id changes, the new citeAs field, and the one user-visible behavioral delta (file-hash slot switched from BLAKE3-via-cr:fileFingerprint to canonical direct sha256). References PR #3916. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR upgrades qsv's bundled `croissant` projection profile so `qsv profile --profile croissant` emits Croissant 1.1-shaped JSON-LD (vs 1.0), aligning output with the canonical spec/validator expectations.
Changes:
- Updates `resources/profiles/croissant.yaml` for Croissant 1.1 (expanded canonical `@context`, `conformsTo` 1.1, `cr:FileObject`, direct `sha256`, per-field `source` blocks, and new `citeAs` mapping).
- Adjusts two assertions to match the new profile shape (FileObject type + field_mappings count).
- Adds an Unreleased changelog entry documenting the behavior change (including hash algorithm/shape).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| resources/profiles/croissant.yaml | Implements Croissant 1.1 canonical JSON-LD shape (context, types, hashing, field source wiring, citeAs). |
| tests/test_profile.rs | Updates Croissant distribution @type assertion to cr:FileObject. |
| src/cmd/profile/profile_spec.rs | Updates embedded Croissant profile expectation for new citation → citeAs mapping count. |
| CHANGELOG.md | Documents the Croissant 1.1 profile upgrade and the hash slot/algorithm change. |
Two fixes surfaced by running `mlcroissant validate --jsonld` against the rendered projection block:

1. `contentUrl` is mandatory on `cr:FileObject` per the 1.1 spec, but the previous template gated it behind `only_if_absolute_iri`, so local-file inputs (with no `res.url`) emitted no contentUrl at all, failing mlcroissant's mandatory-property check. The canonical 1.1 example uses a relative path (`"contentUrl": "data/data.csv"`), so a bare basename is spec-conformant. Template now falls back to `res.name | source_label | basename` when no `res.url` is set.
2. mlcroissant's canonical @context includes a longer cr: shortcut term table than what was emitted. mlcroissant logged a "JSON-LD @context is not standard" WARNING listing the missing keys. Added: `equivalentProperty`, `examples`, `fileProperty`, `format`, `jsonPath`, `md5`, `parentField`, `path`, `repeated`, `replace`, `samplingRate`, `separator`. Profile @context now matches the canonical 1.1 context one-for-one; the warning is gone.

Verified end-to-end with `pip install mlcroissant` + a populated `--initial-context`: `qsv profile --profile croissant --validate` produces `projection_warnings: []` and a direct `mlcroissant validate --jsonld projection.jsonld` reports "Done." with zero errors and zero warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oborev #2565)

Roborev #2565 (Medium): the local-file `contentUrl` fallback used `res.name` before `source_label | basename`, but `context::build` defaults `res.name` from `Path::file_stem()` (no extension), so `in.csv` projected `"contentUrl": "in"` instead of `"in.csv"`. A user-supplied `res.name` via `--initial-context` is also typically a human label, not a usable content URL. Removed `res.name` from the fallback chain; `source_label | basename` preserves the extension and works for both default and seeded cases.

Verified: `in.csv` now projects `"contentUrl": "in.csv"`, and `mlcroissant validate` still accepts the output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
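
The difference between the two fallbacks described in the commit above can be reproduced with the standard library alone. This is a minimal sketch, not qsv's template code: `default_resource_name` and `content_url_fallback` are hypothetical names standing in for the `Path::file_stem()` default and the `basename` filter respectively.

```rust
use std::path::Path;

/// Hypothetical stand-in for a `file_stem()`-based default resource name:
/// drops the extension, so it is a label rather than a usable content URL.
fn default_resource_name(source_label: &str) -> Option<String> {
    Path::new(source_label)
        .file_stem()
        .map(|stem| stem.to_string_lossy().into_owned())
}

/// Hypothetical stand-in for a `basename`-style fallback:
/// keeps the final path component, extension included.
fn content_url_fallback(source_label: &str) -> Option<String> {
    Path::new(source_label)
        .file_name()
        .map(|name| name.to_string_lossy().into_owned())
}
```

For an input labelled `in.csv`, the stem-based helper yields `in` while the basename-based helper yields `in.csv`, which is why only the latter is suitable as a relative `contentUrl`.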
…uppress DevSkim md5 context-term hit
Copilot review #3315150422 (Medium): the per-column qsv stats type →
Croissant dataType mapping was inlined inside the recordSet template
while `vocabularies.croissant_datatype` already defined the same
table. Replaced the inline `({...})[qsv_type] | default("sc:Text")`
with `lookup("croissant_datatype", qsv_type) | default("sc:Text")`
so the mapping lives in exactly one place. Verified the rendered
dataType values are identical (sc:Text / sc:DateTime / sc:Integer /
...) and mlcroissant still validates clean.
Copilot review #3315150455 (Medium): added
`croissant_distribution_emits_canonical_sha256_not_legacy_fingerprint`
asserting (a) `distribution[0].sha256` is a 64-char lowercase-hex
SHA-256 digest, and (b) the pre-1.1 `cr:fileFingerprint` /
`cr:Checksum` shape never appears anywhere in the projection. Prevents
silent regression back to the 1.0 fingerprint layout.
Copilot review #3315150471 (Medium): added
`croissant_recordset_fields_wire_source_to_file_object` asserting
every emitted cr:Field carries a `source.fileObject.@id` pointing at
the FileObject's `@id` and a `source.extract.column` matching the
field's `name`. That's the canonical 1.1 wiring between schema
metadata and actual bytes; without this guard, the per-Field template
could silently drop the source block while other tests still pass on
field count / type / dataType alone.
github-advanced-security[bot] (DevSkim DS126858, 2 hits): suppressed
on the `md5: "cr:md5"` line — that's a JSON-LD @context shortcut term
declaring the bare `md5` key resolves to the `cr:md5` IRI, NOT actual
MD5 hash usage. qsv emits SHA-256 as the canonical Croissant 1.1
fingerprint; MD5 is never computed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
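
The source-wiring guard described for `croissant_recordset_fields_wire_source_to_file_object` can be illustrated on a simplified model. This is a sketch under stated assumptions, not the test from the PR: the `Field` struct and `miswired_fields` function are hypothetical, and the real test presumably inspects the rendered JSON-LD rather than typed structs.

```rust
/// Simplified, hypothetical model of one emitted cr:Field.
struct Field {
    /// The field's `name`.
    name: String,
    /// Value of `source.fileObject.@id`, if the source block is present.
    source_file_object_id: Option<String>,
    /// Value of `source.extract.column`, if the source block is present.
    source_extract_column: Option<String>,
}

/// Returns the names of fields whose source block is missing, points at a
/// different FileObject `@id`, or extracts a column other than the field's own name.
fn miswired_fields<'a>(fields: &'a [Field], file_object_id: &str) -> Vec<&'a str> {
    fields
        .iter()
        .filter(|f| {
            f.source_file_object_id.as_deref() != Some(file_object_id)
                || f.source_extract_column.as_deref() != Some(f.name.as_str())
        })
        .map(|f| f.name.as_str())
        .collect()
}
```

A guard of this shape fails as soon as one field loses its source block, even when field count, type, and dataType checks would still pass.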
…heck (roborev #2568)

Roborev #2568 (Low): the SHA-256 regression test claimed "lowercase hex" in its failure message but used `is_ascii_hexdigit()`, which also accepts uppercase A-F, so a regression to uppercase digests would pass silently. Tightened to `c.is_ascii_digit() || ('a'..='f').contains(&c)` so the check now matches the documented contract. `sha256_of` already emits lowercase hex, so the assertion still passes against current output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
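
The tightened predicate from the commit above can be sketched as a standalone function. The per-character expression is the one quoted in the commit message; the function name `is_lowercase_sha256_hex` and the 64-character length check folded into it are assumptions made for illustration.

```rust
/// Returns true only for a 64-character digest made of `0-9` and lowercase `a-f`.
/// Unlike `char::is_ascii_hexdigit`, this rejects uppercase `A-F`.
fn is_lowercase_sha256_hex(digest: &str) -> bool {
    digest.len() == 64
        && digest
            .chars()
            .all(|c| c.is_ascii_digit() || ('a'..='f').contains(&c))
}
```

An all-uppercase digest of the right length passes an `is_ascii_hexdigit()` check but is rejected here, which is the regression the review comment was concerned about.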
Summary
Upgrades the bundled `croissant` projection profile so `qsv profile --profile croissant` emits canonical Croissant 1.1 JSON-LD instead of 1.0. All deltas live in `resources/profiles/croissant.yaml`; two small test assertions were updated to track the new shape.

Spec deltas applied to `resources/profiles/croissant.yaml`
- `conformsTo` bumped from `http://mlcommons.org/croissant/1.0` → `http://mlcommons.org/croissant/1.1`.
- `@context` expanded to the canonical 1.1 prefix table: adds `sc:`, `rai:`, and the cr: shortcut terms (`recordSet`, `field`, `fileObject`, `source`, `extract`, `column`, `dataType`, `citeAs`, `data`, `isLiveDataset`, `key`, `references`, `regex`, `subField`, `transform`, `fileSet`, `includes`) so emitted JSON uses the bare property names mlcroissant expects.
- Distribution `@type` switched from `sc:FileObject` to `cr:FileObject` (1.1 defines its own FileObject class).
- Stable `@id` slugs added on the FileObject (`data.csv`) and RecordSet (`main-table`) so `cr:Field.source` can reference the FileObject by IRI.
- File hash switched from the non-canonical `cr:fileFingerprint` + `cr:Checksum` nested shape to the canonical direct `sha256` property. Uses `sha256_of`; BLAKE3 is faster but is not in the Croissant vocabulary, so spec-compliance wins here.
- `cr:Field` source blocks: each field now carries the canonical `source: {fileObject: {@id: data.csv}, extract: {column: <name>}}`. The existing `build_croissant_fields` helper only emits a 1.0-style flat field list, so the recordSet template was switched to an inline Jinja loop over `stats | items` that emits the per-field source block. `qsv:cardinality` / `qsv:nullcount` diagnostics preserved.
- New 1.1-first-class `citeAs`, populated from `pkg.citation` via a new `/package/citation → /projection/citeAs` field mapping.
- Catalog envelope `conforms_to` bumped to 1.1.

Test updates
- `tests/test_profile.rs::croissant_distribution_uses_file_object_type` now asserts `cr:FileObject` (was `sc:FileObject`).
- `src/cmd/profile/profile_spec.rs::embedded_croissant_parses_and_dry_compiles` bumps `field_mappings.len()` 16 → 17 for the new `citation` mapping.

Behavioral note (one minor user-visible change)
The hash algorithm in the emitted JSON-LD changes from BLAKE3 (under `cr:fileFingerprint` / `cr:Checksum`) to SHA-256 (direct `sha256` property). That's slower on multi-GB inputs but is what mlcroissant validates against. Users who relied on the BLAKE3 fingerprint shape will see it disappear from the projection.

Test plan
- `cargo test -F all_features croissant`: 6/6 pass (1 unit + 5 integration)
- `nyc-311-subset.csv` produces canonical 1.1-shaped JSON-LD: correct `conformsTo`, full canonical `@context`, `cr:FileObject` with direct `sha256`, per-column `cr:Field` entries whose `source` blocks point back to `@id: data.csv`
- Install `mlcroissant` locally and run `qsv profile --profile croissant --validate ...` against a small dataset to confirm the canonical 1.1 validator accepts the output (CI environments without `mlcroissant` continue to fall through with the standard Recommended-severity warning)

🤖 Generated with Claude Code