feat(profile): emit Croissant descriptive statistics + frequency #3918

Merged: jqnatividad merged 6 commits into master from feat-croissant-descriptive-stats on May 28, 2026

@jqnatividad
Collaborator

What

Stores qsv's per-column statistics and value-frequency distributions in the Croissant 1.1 projection of qsv profile, following the spec's Representing Descriptive Statistics application.

Statistics (via the spec's annotation mechanism)

  • RecordSet-level row count, typed with Wikidata's Cardinality term (Q4049983), per the canonical person/count example.
  • Per-Field annotation arrays of summary stats — min, max, mean, median, stddev, variance, range, sum, q1, q3, mode — each typed with its DDI-CDI SummaryStatisticType term (ddi-stats:*, verified against the live 2.1.2 vocabulary) as an sc:DefinedTerm. min/max also carry the schema.org equivalentProperty (sc:minValue / sc:maxValue) for numeric columns. Annotations are emitted only when qsv actually produced the stat (e.g. quartiles are null for text columns).
{
  "@id": "main-table/population/mean",
  "value": 5483333.3333,
  "dataType": {
    "@type": "sc:DefinedTerm",
    "termCode": "ArithmeticMean",
    "name": "Arithmetic mean",
    "@id": "ddi-stats:7975ed0",
    "inDefinedTermSet": "http://rdf-vocabulary.ddialliance.org/cv/SummaryStatisticType/2.1.2/"
  }
}
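
The "emit only when qsv produced the stat" rule and the `<recordset>/<field>/<stat>` `@id` pattern from the example above can be sketched as follows. This is a minimal illustration, not the template's actual logic: the `Stat` struct and `annotation_ids` function are hypothetical names, and the real projection is rendered from `resources/profiles/croissant.yaml`.

```rust
/// One summary statistic for a column (hypothetical type). `value` is `None`
/// when the statistic was not produced, e.g. quartiles for a text column.
struct Stat<'a> {
    name: &'a str,
    value: Option<f64>,
}

/// Build the `@id` of every annotation that should be emitted for a field,
/// skipping statistics that have no value.
fn annotation_ids(record_set: &str, field: &str, stats: &[Stat]) -> Vec<String> {
    stats
        .iter()
        .filter(|s| s.value.is_some())
        .map(|s| format!("{record_set}/{field}/{}", s.name))
        .collect()
}
```

For a numeric column with a mean but no quartiles, only the `mean` id is returned, so no empty annotation is emitted for the missing stat.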

Frequency (opt-in via --croissant-frequency)

Croissant 1.1 has no canonical scalar slot for value distributions, so each column's top-N counts are emitted as a dedicated inline cr:RecordSet (<col>-frequency) of {value, count, percentage} rows — a first-class, queryable shape. Off by default to keep the projection compact; the raw counts always remain in the top-level frequency output block regardless of the flag.
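
As a rough sketch of the {value, count, percentage} row shape, the following computes top-N frequency rows from raw column values. It rests on assumptions not stated in this PR: `FreqRow` and `top_n_frequency` are hypothetical names, percentages are taken relative to the total number of values, and ties are broken alphabetically. qsv's own `frequency` command may differ on each point.

```rust
use std::collections::HashMap;

/// One row of a hypothetical `<col>-frequency` RecordSet.
#[derive(Debug, PartialEq)]
struct FreqRow {
    value: String,
    count: u64,
    percentage: f64,
}

/// Count occurrences of each value and keep the `top_n` most frequent.
/// Percentages are relative to the total number of input values (assumption).
fn top_n_frequency(values: &[&str], top_n: usize) -> Vec<FreqRow> {
    let mut counts: HashMap<&str, u64> = HashMap::new();
    for v in values {
        *counts.entry(*v).or_insert(0) += 1;
    }
    let total = values.len() as f64;
    let mut rows: Vec<FreqRow> = counts
        .into_iter()
        .map(|(value, count)| FreqRow {
            value: value.to_string(),
            count,
            percentage: 100.0 * count as f64 / total,
        })
        .collect();
    // Highest count first; ties broken by value so the output is deterministic.
    rows.sort_by(|a, b| b.count.cmp(&a.count).then_with(|| a.value.cmp(&b.value)));
    rows.truncate(top_n);
    rows
}
```

Keeping each row as a small record rather than a packed string is what makes the resulting RecordSet queryable field by field.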

Full extended stat set on a fresh run

The profile pipeline previously used the shared StatsMode::Schema (no median/quartiles/mode), so those only appeared when a richer --everything stats cache happened to exist. Added StatsMode::ProfileSchema (Schema + --quartiles --mode) and pointed the profile context builder at it. The schema command's lean Schema mode is untouched.
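
The relationship between the two modes can be pictured roughly as below. This is a hedged sketch only: the enum here mirrors just the two variants named in this PR, `extra_stats_flags` is a hypothetical helper, and the shared base arguments that the real `src/util.rs` passes to `stats` are not shown in this conversation.

```rust
/// Hypothetical mirror of the two modes discussed in this PR.
#[derive(Clone, Copy, Debug, PartialEq)]
enum StatsMode {
    Schema,
    ProfileSchema,
}

/// Extra `stats` flags a mode adds on top of the shared base arguments.
/// The base arguments themselves are an unknown here and are omitted.
fn extra_stats_flags(mode: StatsMode) -> &'static [&'static str] {
    match mode {
        // Lean mode used by `schema`: nothing extra.
        StatsMode::Schema => &[],
        // Profile mode: Schema plus quartiles and mode.
        StatsMode::ProfileSchema => &["--quartiles", "--mode"],
    }
}
```

Modelling the richer mode as a separate variant, rather than widening `Schema`, is what leaves the `schema` command's behaviour unchanged.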

Other

  • @context gains the ddi-stats prefix and an annotation term aliased to cr:annotation.
  • Threads per-column frequency (dppf) into the projection context.
  • Regenerated docs/help/profile.md.

Testing

  • Updated croissant_emits_recordset_with_one_field_per_csv_column to assert the count annotation, per-field stat annotations, and frequency RecordSets (with --croissant-frequency).
  • New croissant_frequency_off_by_default_extended_stats_on_fresh_run: verifies frequency is opt-in and median/quartiles/mode surface on a fresh run.
  • cargo clippy -F all_features clean; cargo +nightly fmt --check clean.
  • Tests pass: profile 62, schema 58, pivotp 62.

🤖 Generated with Claude Code

Store qsv's per-column statistics and value-frequency distributions in
the Croissant 1.1 projection, following the spec's "Representing
Descriptive Statistics" application
(https://docs.mlcommons.org/croissant/docs/croissant-spec-1.1.html#application-representing-descriptive-statistics).

Statistics use the spec's `annotation` mechanism:
- RecordSet-level row count, typed with Wikidata's Cardinality term
  (Q4049983), per the canonical `person/count` example.
- Per-Field `annotation` arrays of summary stats (min, max, mean,
  median, stddev, variance, range, sum, q1, q3, mode), each typed with
  its DDI-CDI SummaryStatisticType term (ddi-stats:* — verified against
  the live 2.1.2 vocabulary) as an sc:DefinedTerm. min/max carry the
  schema.org equivalentProperty (sc:minValue / sc:maxValue) for numeric
  columns. Annotations are emitted only when qsv produced the stat.

Croissant 1.1 has no canonical scalar slot for value distributions, so
each column's top-N counts are emitted as a dedicated inline
cr:RecordSet (`<col>-frequency`) of {value, count, percentage} rows — a
first-class, queryable shape. This is opt-in via the new
--croissant-frequency flag (off by default to keep the projection
compact); the raw counts always remain in the top-level `frequency`
output block.

To surface the full extended stat set (median/quartiles/mode) on a
fresh run without a pre-built `--everything` stats cache, add a new
StatsMode::ProfileSchema (Schema + --quartiles --mode) and switch the
profile context builder to it. The shared Schema mode used by `schema`
is untouched.

@context gains the `ddi-stats` prefix and an `annotation` term aliased
to cr:annotation. Threads per-column frequency (dppf) into the
projection context.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread resources/profiles/croissant.yaml Fixed
Comment thread resources/profiles/croissant.yaml Fixed
Comment thread tests/test_profile.rs Fixed
@codacy-production

codacy-production Bot commented May 28, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

🟢 Metrics 21 complexity

Metric Results
Complexity 21

Contributor

Copilot AI left a comment

Pull request overview

Adds Croissant 1.1 descriptive-statistics output to qsv profile: per-RecordSet row count annotation, per-Field stat annotations (min/max/mean/median/stddev/variance/range/sum/q1/q3/mode) typed with DDI-CDI terms, and optional per-column value-frequency RecordSets behind a new --croissant-frequency flag. Also introduces a StatsMode::ProfileSchema variant so profile runs stats with --quartiles --mode and surfaces the extended stat set without a pre-built --everything cache.

Changes:

  • New StatsMode::ProfileSchema (Schema + quartiles + mode) wired into profile's context builder.
  • New --croissant-frequency flag threads dppf into the projection and emits inline <col>-frequency RecordSets; Croissant YAML template extended with annotation arrays and DDI-CDI/Wikidata typing.
  • Test coverage updated/added for the new annotations and the opt-in frequency behaviour; help doc regenerated.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/util.rs | Adds ProfileSchema enum variant and its stats command construction. |
| src/cmd/profile/context.rs | Switches profile's schema-stats fetch to ProfileSchema. |
| src/cmd/profile.rs | Adds --croissant-frequency flag and threads frequency data into the projection context. |
| resources/profiles/croissant.yaml | Extends @context and recordset template with stat annotations and per-column frequency RecordSets. |
| docs/help/profile.md | Regenerated help to document the new flag. |
| tests/test_profile.rs | Updates existing assertion and adds a new test for opt-in behaviour and extended stats. |

Comment thread src/util.rs
jqnatividad and others added 3 commits May 28, 2026 16:08
Copilot: StatsMode::ProfileSchema only took effect when the stats cache
was regenerated. The cache reuse path loaded an existing
stats.csv.data.jsonl whenever it was newer than the input, regardless of
the producing mode — so `qsv schema` (lean, no --quartiles/--mode) then
`qsv profile` reused the lean cache and silently dropped the extended-
stat annotations. Add a content-based `stats_satisfy_mode` check: for
ProfileSchema, reuse only when the loaded stats actually carry mode (any
non-None) and quartiles (q2_median present when numeric columns exist),
else discard and regenerate. New regression test
`croissant_extended_stats_survive_lean_stats_cache_reuse` covers the
schema-then-profile path.

DevSkim DS137138 (HTTP-without-TLS): suppress on the DDI-CDI and Wikidata
vocabulary IRIs — these are http-scheme RDF/JSON-LD term identifiers
(per the Croissant spec example), not fetched endpoints; https would
break term identity. YAML template uses Jinja {# #} comments so the
markers produce no output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…olumns

Low finding (src/util.rs): stats_satisfy_mode tested `mode.is_some()`,
but `stats --mode` emits an empty `mode` (None) for all-unique columns
even though --mode ran. That made a freshly-produced ProfileSchema cache
look lean for all-unique datasets, forcing stats regeneration on every
`profile` run. Test the `mode_count` metadata field instead — it is
Some(0) for all-unique columns when --mode ran, and None in a lean
cache. Verified the cache is now reused (mtime unchanged) across repeat
profile runs on an all-unique dataset; the lean-cache regression test
still regenerates as expected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
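
The cache-sufficiency predicate described in the two commit messages above can be sketched as follows. It is an approximation under stated assumptions: `ColumnStats` is a hypothetical three-field subset of a cached stats record, and the real `stats_satisfy_mode` in `src/util.rs` may take different parameters and inspect more fields.

```rust
/// Hypothetical subset of one cached per-column stats record.
struct ColumnStats {
    is_numeric: bool,
    /// `Some(0)` for an all-unique column when `--mode` ran; `None` in a lean cache.
    mode_count: Option<u64>,
    /// `None` when `--quartiles` never ran or the column is not numeric.
    q2_median: Option<f64>,
}

/// Hypothetical mirror of the two modes discussed in this PR.
enum StatsMode {
    Schema,
    ProfileSchema,
}

/// Decide whether already-loaded stats are rich enough to be reused for `mode`.
fn stats_satisfy_mode(stats: &[ColumnStats], mode: StatsMode) -> bool {
    match mode {
        // The lean mode is satisfied by any cache.
        StatsMode::Schema => true,
        StatsMode::ProfileSchema => {
            // `mode_count`, not `mode`, proves that `--mode` actually ran.
            let mode_ran = stats.iter().any(|c| c.mode_count.is_some());
            let has_numeric = stats.iter().any(|c| c.is_numeric);
            let quartiles_ran = stats.iter().any(|c| c.q2_median.is_some());
            // Quartiles are only required when a numeric column exists.
            mode_ran && (!has_numeric || quartiles_ran)
        }
    }
}
```

With this shape, an all-unique text-only dataset whose columns carry `mode_count: Some(0)` is accepted for reuse, while a cache produced by a lean `schema` run (all `mode_count` values `None`) is rejected and regenerated.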
)

Low finding: the mode_count cache-sufficiency fix lacked regression
coverage; the existing Croissant cache test uses a non-all-unique
fixture, so reverting the predicate to mode.is_some() would still pass.
Add croissant_all_unique_dataset_reuses_profileschema_cache: profiles an
all-unique dataset twice and asserts (a) the stats cache is reused
(mtime unchanged across runs) and (b) extended Croissant annotations
(median/quartiles) remain present. Verified the test FAILS when the
predicate is reverted to mode.is_some().

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread tests/test_profile.rs Fixed
Comment thread tests/test_profile.rs Fixed
Comment thread tests/test_profile.rs Fixed
jqnatividad and others added 2 commits May 28, 2026 16:47
…2583)

Low finding: cache reuse requires cache_mtime > input_mtime (strict), so
on coarse (1s) filesystem timestamp resolution the input and the first-
run stats cache could land in the same tick — forcing a regeneration on
run 2 and failing the assertion even with the correct mode_count
predicate. Sleep after creating in.csv so the input is strictly older
than the cache the first run generates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
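
The timestamp issue can be reproduced in isolation with the sketch below. The names `cache_is_fresh` and `truncate_to_secs` are illustrative only; truncation to whole seconds stands in for a filesystem with 1s mtime resolution, and the strict `>` comparison follows the commit message rather than any code shown here.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Strict freshness check: the cache is reusable only when its mtime is
/// strictly newer than the input's mtime.
fn cache_is_fresh(cache_mtime: SystemTime, input_mtime: SystemTime) -> bool {
    cache_mtime > input_mtime
}

/// Model a filesystem that stores mtimes with 1-second resolution by
/// dropping the sub-second part of a timestamp.
fn truncate_to_secs(t: SystemTime) -> SystemTime {
    let secs = t
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0);
    UNIX_EPOCH + Duration::from_secs(secs)
}
```

If the input is written at 1000.2s and the cache at 1000.9s, both truncate to the same tick and the strict check fails, which forces a regeneration on the second run. Delaying the first run until the next second makes the input strictly older than the cache, which is the effect the added sleep achieves.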
@jqnatividad jqnatividad merged commit 0563f83 into master May 28, 2026
15 of 17 checks passed
@jqnatividad jqnatividad deleted the feat-croissant-descriptive-stats branch May 28, 2026 20:55
3 participants