feat(profile): emit Croissant descriptive statistics + frequency #3918
Store qsv's per-column statistics and value-frequency distributions in the Croissant 1.1 projection, following the spec's "Representing Descriptive Statistics" application (https://docs.mlcommons.org/croissant/docs/croissant-spec-1.1.html#application-representing-descriptive-statistics).

Statistics use the spec's `annotation` mechanism:
- RecordSet-level row count, typed with Wikidata's Cardinality term (Q4049983), per the canonical `person/count` example.
- Per-Field `annotation` arrays of summary stats (min, max, mean, median, stddev, variance, range, sum, q1, q3, mode), each typed with its DDI-CDI SummaryStatisticType term (ddi-stats:*, verified against the live 2.1.2 vocabulary) as an sc:DefinedTerm. min/max carry the schema.org equivalentProperty (sc:minValue / sc:maxValue) for numeric columns. Annotations are emitted only when qsv produced the stat.

Croissant 1.1 has no canonical scalar slot for value distributions, so each column's top-N counts are emitted as a dedicated inline cr:RecordSet (`<col>-frequency`) of {value, count, percentage} rows: a first-class, queryable shape. This is opt-in via the new --croissant-frequency flag (off by default to keep the projection compact); the raw counts always remain in the top-level `frequency` output block.

To surface the full extended stat set (median/quartiles/mode) on a fresh run without a pre-built `--everything` stats cache, add a new StatsMode::ProfileSchema (Schema + --quartiles --mode) and switch the profile context builder to it. The shared Schema mode used by `schema` is untouched.

@context gains the `ddi-stats` prefix and an `annotation` term aliased to cr:annotation. Threads per-column frequency (dppf) into the projection context.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Up to standards ✅🟢 Issues
| Metric | Results |
|---|---|
| Complexity | 21 |
Pull request overview
Adds Croissant 1.1 descriptive-statistics output to qsv profile: per-RecordSet row count annotation, per-Field stat annotations (min/max/mean/median/stddev/variance/range/sum/q1/q3/mode) typed with DDI-CDI terms, and optional per-column value-frequency RecordSets behind a new --croissant-frequency flag. Also introduces a StatsMode::ProfileSchema variant so profile runs stats with --quartiles --mode and surfaces the extended stat set without a pre-built --everything cache.
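The "emit only what was produced" behaviour of the per-Field annotations can be illustrated with a minimal sketch. The `FieldStats` and `Annotation` types and the `field_annotations` function are hypothetical names for illustration, not qsv's code (the PR implements this in a YAML template); the `<recordset>/<column>/<stat>` identifier shape is taken from the `main-table/population/mean` example in the PR description, and only a subset of the stats is modelled.

```rust
/// Hypothetical per-column stats; `None` means the stat was not produced
/// (for example, quartiles on a text column).
#[derive(Debug, Default)]
pub struct FieldStats {
    pub min: Option<f64>,
    pub max: Option<f64>,
    pub mean: Option<f64>,
    pub median: Option<f64>,
    pub q1: Option<f64>,
    pub q3: Option<f64>,
}

/// Hypothetical, simplified annotation entry: identifier plus value.
#[derive(Debug, PartialEq)]
pub struct Annotation {
    pub id: String,
    pub value: f64,
}

/// Build one annotation per stat that is actually present, skipping `None`s.
pub fn field_annotations(record_set: &str, column: &str, stats: &FieldStats) -> Vec<Annotation> {
    let candidates = [
        ("min", stats.min),
        ("max", stats.max),
        ("mean", stats.mean),
        ("median", stats.median),
        ("q1", stats.q1),
        ("q3", stats.q3),
    ];
    let mut annotations = Vec::new();
    for (name, value) in candidates.iter() {
        if let Some(v) = value {
            annotations.push(Annotation {
                id: format!("{}/{}/{}", record_set, column, name),
                value: *v,
            });
        }
    }
    annotations
}
```

A column with no numeric stats yields an empty list in this sketch, which corresponds to omitting the `annotation` array entries for that field.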
Changes:
- New `StatsMode::ProfileSchema` (Schema + quartiles + mode) wired into `profile`'s context builder.
- New `--croissant-frequency` flag threads `dppf` into the projection and emits inline `<col>-frequency` RecordSets; Croissant YAML template extended with `annotation` arrays and DDI-CDI/Wikidata typing.
- Test coverage updated/added for the new annotations and the opt-in frequency behaviour; help doc regenerated.
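To make the `{value, count, percentage}` row shape of a `<col>-frequency` RecordSet concrete, here is a minimal sketch of computing top-N rows for one column. `FrequencyRow` and `top_n_frequency` are hypothetical names; qsv derives these counts from its own frequency output, and the tie-breaking rule (alphabetical by value) is an assumption made only to keep the sketch deterministic.

```rust
use std::collections::HashMap;

/// Hypothetical row of a `<col>-frequency` RecordSet.
#[derive(Debug, Clone, PartialEq)]
pub struct FrequencyRow {
    pub value: String,
    pub count: u64,
    pub percentage: f64,
}

/// Count each distinct value, then keep the `limit` most frequent ones.
/// Percentages are relative to the total number of values in the column.
pub fn top_n_frequency(values: &[&str], limit: usize) -> Vec<FrequencyRow> {
    let mut counts: HashMap<&str, u64> = HashMap::new();
    for v in values {
        *counts.entry(*v).or_insert(0) += 1;
    }
    let total = values.len() as f64;
    let mut rows: Vec<FrequencyRow> = counts
        .into_iter()
        .map(|(value, count)| FrequencyRow {
            value: value.to_string(),
            count,
            percentage: if total > 0.0 { count as f64 * 100.0 / total } else { 0.0 },
        })
        .collect();
    // Highest count first; ties broken by value (an assumption of this sketch).
    rows.sort_by(|a, b| b.count.cmp(&a.count).then_with(|| a.value.cmp(&b.value)));
    rows.truncate(limit);
    rows
}
```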
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/util.rs | Adds ProfileSchema enum variant and its stats command construction. |
| src/cmd/profile/context.rs | Switches profile's schema-stats fetch to ProfileSchema. |
| src/cmd/profile.rs | Adds --croissant-frequency flag and threads frequency data into the projection context. |
| resources/profiles/croissant.yaml | Extends @context and recordset template with stat annotations and per-column frequency RecordSets. |
| docs/help/profile.md | Regenerated help to document the new flag. |
| tests/test_profile.rs | Updates existing assertion and adds a new test for opt-in behaviour and extended stats. |
Copilot: StatsMode::ProfileSchema only took effect when the stats cache
was regenerated. The cache reuse path loaded an existing
stats.csv.data.jsonl whenever it was newer than the input, regardless of
the producing mode — so `qsv schema` (lean, no --quartiles/--mode) then
`qsv profile` reused the lean cache and silently dropped the extended-
stat annotations. Add a content-based `stats_satisfy_mode` check: for
ProfileSchema, reuse only when the loaded stats actually carry mode (any
non-None) and quartiles (q2_median present when numeric columns exist),
else discard and regenerate. New regression test
`croissant_extended_stats_survive_lean_stats_cache_reuse` covers the
schema-then-profile path.
DevSkim DS137138 (HTTP-without-TLS): suppress on the DDI-CDI and Wikidata
vocabulary IRIs — these are http-scheme RDF/JSON-LD term identifiers
(per the Croissant spec example), not fetched endpoints; https would
break term identity. YAML template uses Jinja {# #} comments so the
markers produce no output.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…olumns

Low finding (src/util.rs): stats_satisfy_mode tested `mode.is_some()`, but `stats --mode` emits an empty `mode` (None) for all-unique columns even though --mode ran. That made a freshly-produced ProfileSchema cache look lean for all-unique datasets, forcing stats regeneration on every `profile` run. Test the `mode_count` metadata field instead: it is Some(0) for all-unique columns when --mode ran, and None in a lean cache.

Verified the cache is now reused (mtime unchanged) across repeat profile runs on an all-unique dataset; the lean-cache regression test still regenerates as expected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
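The cache-sufficiency predicate described in the two commit messages above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in src/util.rs: the `ColumnStats` struct, its field layout, the `is_numeric` flag, and the function signature are hypothetical; only the names `stats_satisfy_mode`, `StatsMode`, `mode_count`, and `q2_median` come from the PR text.

```rust
/// Simplified stand-in for the stats modes named in the PR.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StatsMode {
    Schema,
    ProfileSchema,
}

/// Hypothetical view of one column's cached stats record.
#[derive(Debug, Clone)]
pub struct ColumnStats {
    pub is_numeric: bool,
    /// `Some(0)` for an all-unique column when --mode ran; `None` in a lean cache.
    pub mode_count: Option<u64>,
    /// Present only when --quartiles ran on a numeric column.
    pub q2_median: Option<f64>,
}

/// Decide from the cache contents whether it can serve the requested mode.
pub fn stats_satisfy_mode(stats: &[ColumnStats], mode: StatsMode) -> bool {
    match mode {
        // The lean mode needs nothing beyond the base stats.
        StatsMode::Schema => true,
        StatsMode::ProfileSchema => {
            // --mode ran if any column carries the mode_count metadata field.
            let mode_ran = stats.iter().any(|c| c.mode_count.is_some());
            // Quartiles are only expected when numeric columns exist.
            let has_numeric = stats.iter().any(|c| c.is_numeric);
            let quartiles_ran = !has_numeric
                || stats.iter().any(|c| c.is_numeric && c.q2_median.is_some());
            mode_ran && quartiles_ran
        }
    }
}
```

Checking `mode_count` rather than `mode` is what lets an all-unique dataset, where every `mode` is empty, still count as a complete ProfileSchema cache.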
)

Low finding: the mode_count cache-sufficiency fix lacked regression coverage; the existing Croissant cache test uses a non-all-unique fixture, so reverting the predicate to mode.is_some() would still pass.

Add croissant_all_unique_dataset_reuses_profileschema_cache: profiles an all-unique dataset twice and asserts (a) the stats cache is reused (mtime unchanged across runs) and (b) extended Croissant annotations (median/quartiles) remain present. Verified the test FAILS when the predicate is reverted to mode.is_some().

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…2583)

Low finding: cache reuse requires cache_mtime > input_mtime (strict), so on filesystems with coarse (1s) timestamp resolution the input and the first-run stats cache could land in the same tick, forcing a regeneration on run 2 and failing the assertion even with the correct mode_count predicate. Sleep after creating in.csv so the input is strictly older than the cache the first run generates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
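The same-tick problem can be reproduced in isolation with a small sketch. `cache_is_fresh` and `truncate_to_secs` are hypothetical helpers that only model the strict `cache_mtime > input_mtime` comparison and a 1-second timestamp resolution; they are not qsv functions.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Model a filesystem that stores modification times at 1-second resolution.
pub fn truncate_to_secs(t: SystemTime) -> SystemTime {
    let since_epoch = t.duration_since(UNIX_EPOCH).unwrap_or_default();
    UNIX_EPOCH + Duration::from_secs(since_epoch.as_secs())
}

/// Strict comparison: a cache written in the same tick as the input is stale.
pub fn cache_is_fresh(cache_mtime: SystemTime, input_mtime: SystemTime) -> bool {
    cache_mtime > input_mtime
}
```

With these helpers, an input written at `t` and a cache written 200 ms later compare as fresh at full resolution but as stale once both timestamps are truncated to the same second, which is why the test sleeps before the first profile run.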
What
Stores qsv's per-column statistics and value-frequency distributions in the Croissant 1.1 projection of `qsv profile`, following the spec's Representing Descriptive Statistics application.

Statistics (via the spec's `annotation` mechanism)

- RecordSet-level row count, typed with Wikidata's Cardinality term (`Q4049983`), per the canonical `person/count` example.
- Per-Field `annotation` arrays of summary stats (min, max, mean, median, stddev, variance, range, sum, q1, q3, mode), each typed with its DDI-CDI `SummaryStatisticType` term (`ddi-stats:*`, verified against the live 2.1.2 vocabulary) as an `sc:DefinedTerm`. `min`/`max` also carry the schema.org `equivalentProperty` (`sc:minValue`/`sc:maxValue`) for numeric columns. Annotations are emitted only when qsv actually produced the stat (e.g. quartiles are null for text columns).

```json
{
  "@id": "main-table/population/mean",
  "value": 5483333.3333,
  "dataType": {
    "@type": "sc:DefinedTerm",
    "termCode": "ArithmeticMean",
    "name": "Arithmetic mean",
    "@id": "ddi-stats:7975ed0",
    "inDefinedTermSet": "http://rdf-vocabulary.ddialliance.org/cv/SummaryStatisticType/2.1.2/"
  }
}
```

Frequency (opt-in via `--croissant-frequency`)

Croissant 1.1 has no canonical scalar slot for value distributions, so each column's top-N counts are emitted as a dedicated inline `cr:RecordSet` (`<col>-frequency`) of `{value, count, percentage}` rows: a first-class, queryable shape. Off by default to keep the projection compact; the raw counts always remain in the top-level `frequency` output block regardless of the flag.

Full extended stat set on a fresh run

The profile pipeline previously used the shared `StatsMode::Schema` (no median/quartiles/mode), so those only appeared when a richer `--everything` stats cache happened to exist. Added `StatsMode::ProfileSchema` (Schema + `--quartiles --mode`) and pointed the profile context builder at it. The `schema` command's lean `Schema` mode is untouched.

Other

- `@context` gains the `ddi-stats` prefix and an `annotation` term aliased to `cr:annotation`.
- Threads per-column frequency (`dppf`) into the projection context.
- Regenerated `docs/help/profile.md`.

Testing

- Updated `croissant_emits_recordset_with_one_field_per_csv_column` to assert the count annotation, per-field stat annotations, and frequency RecordSets (with `--croissant-frequency`).
- Added `croissant_frequency_off_by_default_extended_stats_on_fresh_run`: verifies frequency is opt-in and median/quartiles/mode surface on a fresh run.
- `cargo clippy -F all_features` clean; `cargo +nightly fmt --check` clean.

🤖 Generated with Claude Code