feat(pragmastat): use stats cache to only process numeric/date/datetime columns #3593

Merged
jqnatividad merged 8 commits into master from pragmastat-use-stats-cache on Mar 9, 2026

@jqnatividad
Collaborator

No description provided.

jqnatividad and others added 7 commits March 9, 2026 07:37
Use the existing stats cache to opportunistically filter out non-numeric columns when running pragmastat without an explicit --select. Adds numeric_columns_from_cache() which reads a stats JSONL cache (if newer than the input file) and keeps only Integer/Float columns; it never triggers a stats run and only applies to path-based inputs. Integrates this filtering into read_columns and logs how many columns were skipped. Adds tests to verify cache filtering and that explicit --select bypasses the cache.
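
The gating rules in this commit message (no explicit `--select`, path-based input only, cache newer than the input file) can be illustrated with a small sketch. This is not the PR's code: `should_use_cache` and its parameters are hypothetical names, and the strict "newer than" comparison is an assumption based on the wording above.

```rust
use std::time::SystemTime;

/// Hypothetical helper: decide whether the stats cache may be consulted
/// for column filtering. It never triggers a stats run; it only answers
/// "is an existing cache usable?".
fn should_use_cache(
    explicit_select: bool,          // user passed --select
    input_is_path: bool,            // false for stdin-style inputs
    input_mtime: SystemTime,        // modification time of the input file
    cache_mtime: Option<SystemTime>, // None when no cache file exists
) -> bool {
    // An explicit --select bypasses the cache; non-path inputs have no cache.
    if explicit_select || !input_is_path {
        return false;
    }
    match cache_mtime {
        // Assumption: "newer" means strictly later than the input file.
        Some(cache) => cache > input_mtime,
        None => false,
    }
}
```

Keeping the decision in a pure function like this makes the `--select` bypass and the stale-cache case easy to test without touching the filesystem.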
Use the stats cache to detect Date/DateTime columns and parse them as
epoch milliseconds for pragmastat analysis. Point estimates (center,
bounds) are formatted as RFC3339 dates; dispersion/shift values (spread,
shift, bounds) are formatted as days with millisecond precision.

Also expands the stats cache integration to include Date/DateTime columns
alongside Integer/Float, filtering out only truly non-analysable types
(String, Boolean, NULL).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
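
As a rough illustration of the type split described in this commit, the sketch below maps a stats type name to how a column would be treated. `ColumnKind` and `classify` are hypothetical names, not identifiers from the PR; only the type names (Integer, Float, Date, DateTime, String, Boolean, NULL) come from the commit message.

```rust
/// Hypothetical classification of a column based on its stats type.
#[derive(Debug, PartialEq, Clone, Copy)]
enum ColumnKind {
    /// Integer/Float: analysed as plain numbers.
    Numeric,
    /// Date/DateTime: parsed to epoch milliseconds before analysis.
    Temporal,
    /// String, Boolean, NULL and anything unrecognised: filtered out.
    Skipped,
}

fn classify(stats_type: &str) -> ColumnKind {
    match stats_type {
        "Integer" | "Float" => ColumnKind::Numeric,
        "Date" | "DateTime" => ColumnKind::Temporal,
        _ => ColumnKind::Skipped,
    }
}
```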
…race

Replace `get_stats_records` routing in `columns_from_cache` with direct
JSONL file reading. The previous approach could trigger a full stats run
under `QSV_STATSCACHE_MODE=auto` if the cache became stale between the
manual freshness pre-check and the `get_stats_records` call. Reading the
file directly truly guarantees no stats run, removes the dummy SchemaArgs,
and adds a comment documenting the duplicate-column-name assumption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…estamps

- Skip mixed Date/Numeric column pairs in two-sample and compare2 with a
  warning, as comparing epoch-ms values against plain numbers is nonsensical
- Return empty string instead of 1970-01-01 for out-of-range timestamps in
  fmt_timestamp (e.g. confidence intervals that overshoot valid date ranges)
- Use chrono's format("%Y-%m-%d") instead of fragile string slicing for dates

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
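
The out-of-range behaviour in the second bullet can be sketched without dependencies. The commit uses chrono; the version below swaps in a std-only days-to-civil-date conversion so it stays self-contained. `fmt_date_sketch`, `civil_from_days`, and the year 0001 to 9999 bounds are assumptions for illustration, not the PR's `fmt_timestamp` or chrono's actual limits.

```rust
const MS_PER_DAY: i64 = 86_400_000;
// Assumed valid window for this sketch only.
const MIN_MS: f64 = -62_135_596_800_000.0; // 0001-01-01T00:00:00Z
const MAX_MS: f64 = 253_402_300_799_999.0; // 9999-12-31T23:59:59.999Z

/// Convert days since 1970-01-01 to (year, month, day) in the
/// proleptic Gregorian calendar.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift epoch to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, 0..=146_096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of March-based year
    let mp = (5 * doy + 2) / 153; // March-based month, 0..=11
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Format epoch milliseconds as YYYY-MM-DD, or return an empty string
/// when the value is not finite or falls outside the assumed window.
fn fmt_date_sketch(epoch_ms: f64) -> String {
    if !epoch_ms.is_finite() {
        return String::new();
    }
    let rounded = epoch_ms.round();
    if rounded < MIN_MS || rounded > MAX_MS {
        return String::new();
    }
    let (y, m, d) = civil_from_days((rounded as i64).div_euclid(MS_PER_DAY));
    format!("{y:04}-{m:02}-{d:02}")
}
```

Returning an empty string rather than a default date means a confidence bound that overshoots the window is visibly missing instead of silently reading as 1970-01-01.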
…from_cache

The JSONL cache line index already corresponds directly to the column
index in the CSV headers, so re-opening the input file to read headers
and matching by name was unnecessary. This simplifies the code and
eliminates a potential edge case with duplicate column names.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
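
The idea that a cache line's position is the column index can be sketched as follows. This is a simplified illustration: it assumes each JSONL line is an object with a string `"type"` field, and it uses a naive substring scan as a std-only stand-in for a real JSON parser. `extract_type` and `analysable_columns` are hypothetical names.

```rust
/// Naive extraction of the `"type"` string value from one JSONL line.
/// Stand-in for a proper JSON parser; it would be fooled by a value
/// that itself contains the text `"type":`.
fn extract_type(line: &str) -> Option<&str> {
    let key = "\"type\":";
    let start = line.find(key)? + key.len();
    let rest = line[start..].trim_start().strip_prefix('"')?;
    let end = rest.find('"')?;
    Some(&rest[..end])
}

/// Return (column index, type) for every analysable column, using the
/// line index in the cache as the column index in the CSV headers.
fn analysable_columns(cache_jsonl: &str) -> Vec<(usize, String)> {
    cache_jsonl
        .lines()
        .filter(|line| !line.trim().is_empty())
        .enumerate()
        .filter_map(|(idx, line)| {
            let ty = extract_type(line)?;
            matches!(ty, "Integer" | "Float" | "Date" | "DateTime")
                .then(|| (idx, ty.to_string()))
        })
        .collect()
}
```

Because columns are identified by position rather than by name, two columns that share a header name cannot be confused with each other.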
Contributor

Copilot AI left a comment

Pull request overview

This PR makes qsv pragmastat a “smart” command by leveraging the stats.csv.data.jsonl cache. It (1) automatically excludes columns that are neither numeric nor Date/DateTime when --select is not provided, and (2) adds Date/DateTime column support: parsed timestamps are converted to epoch milliseconds for analysis, and results are formatted back as dates/datetimes, with spreads/shifts expressed in days.

Changes:

  • Add opportunistic stats-cache loading in pragmastat to filter analyzable columns and detect Date/DateTime types.
  • Implement Date/DateTime parsing + output formatting (timestamp formatting for center/bounds; day conversion for spread/shift).
  • Add integration tests and update docs/README to reflect the new “smart” behavior; switch pragmastat crate to a patched git fork.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/cmd/pragmastat.rs | Loads stats cache for column filtering/type detection; parses Date/DateTime to epoch-ms; formats outputs for dates and day-based spreads/shifts. |
| tests/test_pragmastat.rs | Adds integration tests for stats-cache filtering, --select override behavior, and Date/DateTime output expectations. |
| docs/PERFORMANCE.md | Documents pragmastat as a stats-cache “smart” command and summarizes its cache usage. |
| README.md | Updates pragmastat description to mention stats-cache filtering and Date/DateTime support. |
| Cargo.toml / Cargo.lock | Pins pragmastat to a patched git fork/branch and updates lockfile accordingly. |

… validation

- Increase DAY_DECIMAL_PLACES from 5 to 8 for actual millisecond precision
  (1ms / 86_400_000 ms-per-day ≈ 1.16e-8)
- Round epoch-ms values before i64 cast in fmt_timestamp to avoid truncation
- Validate cache record count matches header count in columns_from_cache,
  ignoring caches generated with --select

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
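
The arithmetic behind these three fixes can be checked with a short sketch. The function names below are hypothetical; only the constants (5 vs 8 decimal places, 86_400_000 ms per day) and the described behaviours come from the commit message.

```rust
const MS_PER_DAY: f64 = 86_400_000.0;

/// Express a millisecond quantity (e.g. a spread or shift) in days with
/// a fixed number of decimal places.
fn fmt_days(ms: f64, decimal_places: usize) -> String {
    format!("{:.*}", decimal_places, ms / MS_PER_DAY)
}

/// Round to the nearest millisecond before casting; a bare `as i64`
/// cast would truncate toward zero instead.
fn to_epoch_ms(value: f64) -> i64 {
    value.round() as i64
}

/// A cache generated with --select covers fewer columns than the CSV
/// has headers, so a count mismatch means the cache should be ignored.
fn cache_matches_headers(record_count: usize, header_count: usize) -> bool {
    record_count == header_count
}
```

With 5 decimal places a 1 ms spread prints as `0.00000`, while 8 places gives `0.00000001`, which is why the higher precision is needed to keep a single millisecond visible.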
@jqnatividad jqnatividad merged commit fcbcc9e into master Mar 9, 2026
14 checks passed
@jqnatividad jqnatividad deleted the pragmastat-use-stats-cache branch March 9, 2026 13:26
2 participants