feat(profile): upgrade croissant profile to Croissant 1.1 #3916

Merged: jqnatividad merged 7 commits into master from feat-croissant-1.1 on May 28, 2026

@jqnatividad jqnatividad commented May 28, 2026

Summary

Upgrades the bundled croissant projection profile so qsv profile --profile croissant emits canonical Croissant 1.1 JSON-LD instead of 1.0. All deltas live in resources/profiles/croissant.yaml; two small test assertions were updated to track the new shape.

Spec deltas applied to resources/profiles/croissant.yaml

  • conformsTo bumped from http://mlcommons.org/croissant/1.0 → http://mlcommons.org/croissant/1.1.
  • @context expanded to the canonical 1.1 prefix table — adds sc:, rai:, and the cr: shortcut terms (recordSet, field, fileObject, source, extract, column, dataType, citeAs, data, isLiveDataset, key, references, regex, subField, transform, fileSet, includes) so emitted JSON uses the bare property names mlcroissant expects.
  • Distribution @type switched from sc:FileObject to cr:FileObject (1.1 defines its own FileObject class).
  • @id slugs added on the FileObject (data.csv) and RecordSet (main-table) so cr:Field.source can reference the FileObject by IRI.
  • File hash switched from the non-canonical cr:fileFingerprint + cr:Checksum nested shape to the canonical direct sha256 property. Uses sha256_of; BLAKE3 is faster but is not in the Croissant vocabulary, so spec-compliance wins here.
  • cr:Field source blocks: each field now carries the canonical source: {fileObject: {@id: data.csv}, extract: {column: <name>}}. The existing build_croissant_fields helper only emits a 1.0-style flat field list, so the recordSet template was switched to an inline Jinja loop over stats | items that emits the per-field source block. qsv:cardinality / qsv:nullcount diagnostics preserved.
  • New 1.1 field citeAs, populated from pkg.citation via a new /package/citation → /projection/citeAs field mapping.
  • Catalog envelope conforms_to bumped to 1.1.
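
The per-field source block described above can be sketched in Rust. This is an illustration of the emitted shape only, not the profile's Jinja template: `field_source_json` and `json_escape` are hypothetical helpers, the column name in the usage note is made up, and the hand-rolled escaping covers only quotes and backslashes (real code would use a JSON library).

```rust
/// Escape the two characters that would break a JSON string literal.
/// Control characters are not handled; this is a sketch, not a JSON encoder.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            _ => out.push(c),
        }
    }
    out
}

/// Render a 1.1-style `source` block that ties one column back to the
/// FileObject it is extracted from (hypothetical helper, not part of qsv).
fn field_source_json(file_object_id: &str, column: &str) -> String {
    format!(
        r#"{{"source":{{"fileObject":{{"@id":"{}"}},"extract":{{"column":"{}"}}}}}}"#,
        json_escape(file_object_id),
        json_escape(column)
    )
}
```

Calling `field_source_json("data.csv", "agency")` yields a block whose `fileObject.@id` matches the FileObject slug named in the list above, which is what lets a consumer resolve the field to the actual file.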

Test updates

  • tests/test_profile.rs::croissant_distribution_uses_file_object_type now asserts cr:FileObject (was sc:FileObject).
  • src/cmd/profile/profile_spec.rs::embedded_croissant_parses_and_dry_compiles bumps field_mappings.len() 16 → 17 for the new citation mapping.

Behavioral note (one minor user-visible change)

The hash algorithm in the emitted JSON-LD changes from BLAKE3 (under cr:fileFingerprint / cr:Checksum) to SHA-256 (direct sha256 property). That's slower on multi-GB inputs but is what mlcroissant validates against. Users who relied on the BLAKE3 fingerprint shape will see it disappear from the projection.

Test plan

  • cargo test -F all_features croissant — 6/6 pass (1 unit + 5 integration)
  • End-to-end render on nyc-311-subset.csv produces canonical 1.1-shaped JSON-LD: correct conformsTo, full canonical @context, cr:FileObject with direct sha256, per-column cr:Field entries whose source blocks point back to @id: data.csv
  • Optional: install mlcroissant locally and run qsv profile --profile croissant --validate ... against a small dataset to confirm the canonical 1.1 validator accepts the output (CI environments without mlcroissant continue to fall through with the standard Recommended-severity warning)

🤖 Generated with Claude Code

The bundled `croissant` profile now emits canonical Croissant 1.1
JSON-LD (https://github.com/mlcommons/croissant/blob/main/docs/croissant-spec-1.1.md)
instead of 1.0.

Spec deltas applied to resources/profiles/croissant.yaml:

* conformsTo IRI bumped from `.../croissant/1.0` → `.../croissant/1.1`
* @context expanded to the canonical 1.1 prefix table (adds `sc:`,
  `rai:`, and the cr: shortcut terms `recordSet`, `field`,
  `fileObject`, `source`, `extract`, `column`, `dataType`, `citeAs`,
  `data`, `isLiveDataset`, `key`, `references`, `regex`, `subField`,
  `transform`, `fileSet`, `includes`)
* Distribution `@type` switched from `sc:FileObject` to `cr:FileObject`
  (Croissant 1.1 defines its own FileObject class)
* FileObject and RecordSet now carry stable `@id` slugs (`data.csv` and
  `main-table`) so cr:Field.source can reference the FileObject by IRI
* File hash switched from the non-canonical
  `cr:fileFingerprint` + `cr:Checksum` nested shape to the canonical
  direct `sha256` property mlcroissant validates against (uses
  `sha256_of`; BLAKE3 is faster but not in the Croissant vocabulary)
* cr:Field entries now carry the canonical
  `source.{fileObject:{@id}, extract:{column}}` block. The existing
  `build_croissant_fields` helper only emits a 1.0-style flat field
  list, so the recordSet template was switched to an inline Jinja
  loop over `stats | items` that emits the source block per column.
  `qsv:cardinality` / `qsv:nullcount` diagnostics preserved.
* New 1.1-first-class `citeAs` field, populated from `pkg.citation`
  via a new `/package/citation → /projection/citeAs` field mapping
* Catalog envelope `conforms_to` bumped to 1.1

Test updates:

* `tests/test_profile.rs::croissant_distribution_uses_file_object_type`
  now asserts `cr:FileObject` (was `sc:FileObject`)
* `src/cmd/profile/profile_spec.rs::embedded_croissant_parses_and_dry_compiles`
  bumps `field_mappings.len()` 16 → 17 for the new `citation` mapping

Verified end-to-end on `nyc-311-subset.csv`: emitted JSON-LD has the
correct conformsTo, full canonical @context, `cr:FileObject` with
direct `sha256`, and per-column `cr:Field` entries whose `source`
blocks point back to `@id: data.csv`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codacy-production Bot commented May 28, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

🟢 Metrics 0 complexity

Metric Results
Complexity 0


Adds a Changed entry under [Unreleased] documenting the spec-version
bump, the canonical 1.1 @context / @type / @id changes, the new
citeAs field, and the one user-visible behavioral delta (file-hash
slot switched from BLAKE3-via-cr:fileFingerprint to canonical
direct sha256). References PR #3916.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR upgrades qsv’s bundled croissant projection profile so qsv profile --profile croissant emits Croissant 1.1-shaped JSON-LD (vs 1.0), aligning output with the canonical spec/validator expectations.

Changes:

  • Updates resources/profiles/croissant.yaml for Croissant 1.1 (expanded canonical @context, conformsTo 1.1, cr:FileObject, direct sha256, per-field source blocks, and new citeAs mapping).
  • Adjusts two assertions to match the new profile shape (FileObject type + field_mappings count).
  • Adds an Unreleased changelog entry documenting the behavior change (including hash algorithm/shape).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| resources/profiles/croissant.yaml | Implements Croissant 1.1 canonical JSON-LD shape (context, types, hashing, field source wiring, citeAs). |
| tests/test_profile.rs | Updates Croissant distribution @type assertion to cr:FileObject. |
| src/cmd/profile/profile_spec.rs | Updates embedded Croissant profile expectation for new citation → citeAs mapping count. |
| CHANGELOG.md | Documents the Croissant 1.1 profile upgrade and the hash slot/algorithm change. |

Two fixes surfaced by running `mlcroissant validate --jsonld` against
the rendered projection block:

1. `contentUrl` is mandatory on `cr:FileObject` per the 1.1 spec, but
   the previous template gated it behind `only_if_absolute_iri`, so
   local-file inputs (with no `res.url`) emitted no contentUrl at all
   — failing mlcroissant's mandatory-property check. The canonical
   1.1 example uses a relative path (`"contentUrl": "data/data.csv"`),
   so a bare basename is spec-conformant. Template now falls back to
   `res.name | source_label | basename` when no `res.url` is set.

2. mlcroissant's canonical @context includes a longer cr: shortcut
   term table than what was emitted. mlcroissant logged a "JSON-LD
   @context is not standard" WARNING listing the missing keys.
   Added: `equivalentProperty`, `examples`, `fileProperty`, `format`,
   `jsonPath`, `md5`, `parentField`, `path`, `repeated`, `replace`,
   `samplingRate`, `separator`. Profile @context now matches the
   canonical 1.1 context one-for-one; the warning is gone.

Verified end-to-end with `pip install mlcroissant` + a populated
`--initial-context`: `qsv profile --profile croissant --validate`
produces `projection_warnings: []` and a direct
`mlcroissant validate --jsonld projection.jsonld` reports "Done."
with zero errors and zero warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jqnatividad and others added 4 commits May 27, 2026 23:18
…oborev #2565)

Roborev #2565 (Medium): the local-file `contentUrl` fallback used
`res.name` before `source_label | basename`, but `context::build`
defaults `res.name` from `Path::file_stem()` (no extension), so
`in.csv` projected `"contentUrl": "in"` instead of `"in.csv"`. A
user-supplied `res.name` via `--initial-context` is also typically a
human label, not a usable content URL.

Removed `res.name` from the fallback chain; `source_label | basename`
preserves the extension and works for both default and seeded cases.
Verified: `in.csv` now projects `"contentUrl": "in.csv"`, and
`mlcroissant validate` still accepts the output.
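
The stem-versus-basename difference behind this fix can be reproduced with the standard library alone. The sketch below is not qsv's `context::build` or the profile template; `stem_of`, `basename_of`, and `content_url` are hypothetical names that mirror the fallback described above under the assumption that an explicit URL wins when present.

```rust
use std::path::Path;

/// What a stem-based default yields: the file name without its extension.
fn stem_of(source_label: &str) -> Option<String> {
    Path::new(source_label)
        .file_stem()
        .map(|s| s.to_string_lossy().into_owned())
}

/// What a basename yields: the final path component, extension included.
fn basename_of(source_label: &str) -> Option<String> {
    Path::new(source_label)
        .file_name()
        .map(|s| s.to_string_lossy().into_owned())
}

/// Hypothetical fallback chain: prefer an explicit URL, else the basename.
fn content_url(res_url: Option<&str>, source_label: &str) -> Option<String> {
    match res_url {
        Some(url) if !url.is_empty() => Some(url.to_string()),
        _ => basename_of(source_label),
    }
}
```

For a label such as `data/in.csv`, `stem_of` returns `in` while `basename_of` returns `in.csv`, which is why a stem-derived name is unsuitable as a `contentUrl`.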

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uppress DevSkim md5 context-term hit

Copilot review #3315150422 (Medium): the per-column qsv stats type →
Croissant dataType mapping was inlined inside the recordSet template
while `vocabularies.croissant_datatype` already defined the same
table. Replaced the inline `({...})[qsv_type] | default("sc:Text")`
with `lookup("croissant_datatype", qsv_type) | default("sc:Text")`
so the mapping lives in exactly one place. Verified the rendered
dataType values are identical (sc:Text / sc:DateTime / sc:Integer /
...) and mlcroissant still validates clean.
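
The "one table, one default" idea can be sketched as a lookup that returns nothing on a miss, with the fallback applied at the call site. This is an analogy in Rust, not the template's `lookup` filter: the function names are hypothetical, and the type names on the left are assumed for illustration rather than copied from `vocabularies.croissant_datatype`.

```rust
/// Hypothetical stand-in for a vocabulary lookup: `None` signals a miss.
/// The keys are assumed qsv stats type names; only the three dataType
/// values mentioned in the commit message are shown.
fn lookup_croissant_datatype(qsv_type: &str) -> Option<&'static str> {
    match qsv_type {
        "String" => Some("sc:Text"),
        "Integer" => Some("sc:Integer"),
        "DateTime" => Some("sc:DateTime"),
        _ => None,
    }
}

/// Apply the default in exactly one place, mirroring `| default("sc:Text")`.
fn croissant_datatype(qsv_type: &str) -> &'static str {
    lookup_croissant_datatype(qsv_type).unwrap_or("sc:Text")
}
```

Keeping the table and the default separate means a new type only needs a new table entry; callers never carry their own copy of the mapping.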

Copilot review #3315150455 (Medium): added
`croissant_distribution_emits_canonical_sha256_not_legacy_fingerprint`
asserting (a) `distribution[0].sha256` is a 64-char lowercase-hex
SHA-256 digest, and (b) the pre-1.1 `cr:fileFingerprint` /
`cr:Checksum` shape never appears anywhere in the projection. Prevents
silent regression back to the 1.0 fingerprint layout.

Copilot review #3315150471 (Medium): added
`croissant_recordset_fields_wire_source_to_file_object` asserting
every emitted cr:Field carries a `source.fileObject.@id` pointing at
the FileObject's `@id` and a `source.extract.column` matching the
field's `name`. That's the canonical 1.1 wiring between schema
metadata and actual bytes; without this guard, the per-Field template
could silently drop the source block while other tests still pass on
field count / type / dataType alone.
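
The invariant this test guards can be expressed as a small check over plain data. The sketch below is not the test in `tests/test_profile.rs`; `Field` is a minimal hypothetical stand-in for one parsed field, and `unwired_fields` simply lists the fields that violate the wiring rule.

```rust
/// Minimal stand-in for one emitted field (not qsv's real type).
struct Field {
    name: String,
    /// Value of `source.fileObject.@id`, if the source block is present.
    source_file_object_id: Option<String>,
    /// Value of `source.extract.column`, if the source block is present.
    source_column: Option<String>,
}

/// Return the names of fields whose source block is missing, points at a
/// different FileObject, or extracts a column other than the field's name.
fn unwired_fields<'a>(fields: &'a [Field], file_object_id: &str) -> Vec<&'a str> {
    fields
        .iter()
        .filter(|f| {
            f.source_file_object_id.as_deref() != Some(file_object_id)
                || f.source_column.as_deref() != Some(f.name.as_str())
        })
        .map(|f| f.name.as_str())
        .collect()
}
```

A regression test in this style would assert that the returned list is empty, so a dropped source block fails loudly even when field counts and dataTypes are unchanged.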

github-advanced-security[bot] (DevSkim DS126858, 2 hits): suppressed
on the `md5: "cr:md5"` line — that's a JSON-LD @context shortcut term
declaring the bare `md5` key resolves to the `cr:md5` IRI, NOT actual
MD5 hash usage. qsv emits SHA-256 as the canonical Croissant 1.1
fingerprint; MD5 is never computed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…heck (roborev #2568)

Roborev #2568 (Low): the SHA-256 regression test claimed "lowercase
hex" in its failure message but used `is_ascii_hexdigit()`, which
also accepts uppercase A-F — so a regression to uppercase digests
would pass silently. Tightened to
`c.is_ascii_digit() || ('a'..='f').contains(&c)` so the check now
matches the documented contract.
`sha256_of` already emits lowercase hex, so the assertion still
passes against current output.
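
The gap between the two predicates is easy to demonstrate in isolation. This is a sketch rather than the test's actual code: both function names are hypothetical, and the 64-character length check is included because the earlier commit describes the digest as 64 hex characters.

```rust
/// Strict check: exactly 64 characters, each one of 0-9 or a-f.
fn is_lowercase_hex_sha256(digest: &str) -> bool {
    digest.len() == 64
        && digest
            .chars()
            .all(|c| c.is_ascii_digit() || ('a'..='f').contains(&c))
}

/// Looser check: `is_ascii_hexdigit` also accepts uppercase A-F.
fn is_any_case_hex_sha256(digest: &str) -> bool {
    digest.len() == 64 && digest.chars().all(|c| c.is_ascii_hexdigit())
}
```

An all-uppercase digest passes the looser check but fails the strict one, which is the silent regression the tightened assertion is meant to catch.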

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit 11848cd into master May 28, 2026
16 of 17 checks passed
@jqnatividad jqnatividad deleted the feat-croissant-1.1 branch May 28, 2026 03:36
3 participants