perf(stats): in-process sniff + warm-cache reuse for --infer-dates #3924

Merged
jqnatividad merged 4 commits into master from in-process-sniff on May 30, 2026

Conversation

@jqnatividad
Collaborator

Summary

Removes redundant qsv sniff work from the stats --infer-dates default path. --dates-whitelist defaults to sniff, so this affects essentially every stats --infer-dates invocation.

Two changes:

1. Resolve the sniff whitelist in-process (d75424744)

stats --infer-dates forked a qsv sniff --json --stats-types subprocess purely to learn which columns are Date/DateTime — reloading the 138 MB qsv binary, re-sampling the input, and JSON round-tripping a result the sniff code already had in hand as a struct.

  • New sniff::date_columns() reuses get_file_to_sniff (preserving snappy decompression, symlink canonicalization, delimiter handling) and the same Sniffer sampling with the exact subprocess defaults (--sample 1000, auto-delimiter, file dmy/mdy preference), returning Date/DateTime field names in order.
  • resolve_sniff_whitelist calls it directly; the SniffResult shim and simd_json/serde_json round-trip are deleted. The qsv sniff command path is untouched.
  • Behavior is byte-identical: Date/DateTime filter, join order, the _qsv_no_date_columns_found sentinel, and the failure error message are all preserved.
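The filtering step described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the qsv implementation: the `ColType` enum, `date_field_names`, and `whitelist_from_fields` are hypothetical names, and the comma separator is an assumption. Only the Date/DateTime filter, the preserved column order, and the `_qsv_no_date_columns_found` sentinel come from the description.

```rust
/// Column types a sniffer might report (hypothetical, simplified enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColType {
    Text,
    Integer,
    Date,
    DateTime,
}

/// Sentinel named in the PR description for "no date columns found".
pub const NO_DATE_COLUMNS: &str = "_qsv_no_date_columns_found";

/// Keep only Date/DateTime fields, preserving the original column order.
pub fn date_field_names<'a>(fields: &[(&'a str, ColType)]) -> Vec<&'a str> {
    fields
        .iter()
        .filter(|(_, ty)| matches!(ty, ColType::Date | ColType::DateTime))
        .map(|(name, _)| *name)
        .collect()
}

/// Build the whitelist string handed to stats.
/// Assumption: names are joined with a comma; the real separator may differ.
pub fn whitelist_from_fields(fields: &[(&str, ColType)]) -> String {
    let cols = date_field_names(fields);
    if cols.is_empty() {
        // No date-like columns: return the sentinel instead of an empty list.
        NO_DATE_COLUMNS.to_string()
    } else {
        cols.join(",")
    }
}
```

Working directly on an in-memory list of fields, rather than on parsed JSON, is what makes the subprocess and the serialization round-trip unnecessary.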

2. Reuse the sniff-resolved whitelist on warm cache hits (afe9886a7)

resolve_sniff_whitelist ran before the stats-cache hit check (its result feeds the cache-key whitelist), so even warm-cache hits on an unchanged file re-sniffed the input just to rebuild the key they then compared.

  • The cache sidecar now records the original unresolved value (flag_dates_whitelist_raw = "sniff") as provenance, alongside the resolved column names.
  • On a sniff request, if a sidecar exists, is newer than the input (file unchanged), and was itself sniff-derived → reuse its resolved whitelist and skip the sniff entirely.
  • Resolved names remain the cache key, so content-based cache sharing with the schema/profile/frequency runs that build stats via get_stats_records is preserved. flag_dates_whitelist_raw is excluded from the validity comparison and is #[serde(default)] so older sidecars deserialize cleanly (one self-healing recompute populates it). --force, stale caches, and explicit whitelists fall back to a fresh in-process sniff.
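The reuse conditions listed above might look something like the sketch below. It assumes a simplified `Sidecar` struct and a hypothetical `reusable_whitelist` function; only the two field names and the conditions (sniff requested, no `--force`, sidecar newer than the input, sniff-derived provenance) are taken from the description.

```rust
use std::time::SystemTime;

/// Minimal stand-in for the cache sidecar fields discussed above.
/// The field names mirror the PR text; the struct itself is hypothetical.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Sidecar {
    /// Resolved column names (the part that acts as the cache key).
    pub flag_dates_whitelist: String,
    /// Original, unresolved value, e.g. "sniff" (provenance only).
    pub flag_dates_whitelist_raw: String,
}

/// Return the cached whitelist when it is safe to skip the sniff,
/// or `None` when a fresh sniff is needed.
pub fn reusable_whitelist<'a>(
    requested: &str,
    force: bool,
    input_mtime: SystemTime,
    sidecar: Option<(&'a Sidecar, SystemTime)>,
) -> Option<&'a str> {
    // --force and explicit whitelists never reuse the cached value.
    if force || requested != "sniff" {
        return None;
    }
    // No sidecar on disk: nothing to reuse.
    let (cached, cache_mtime) = sidecar?;
    // The sidecar must be newer than the input, i.e. the file is unchanged.
    if cache_mtime <= input_mtime {
        return None;
    }
    // Older sidecars have an empty provenance field and fall through here,
    // which triggers the one-time recompute that populates it.
    if cached.flag_dates_whitelist_raw != "sniff" {
        return None;
    }
    Some(cached.flag_dates_whitelist.as_str())
}
```

In this sketch a missing or empty provenance field is treated the same as "not sniff-derived", which matches the self-healing behavior described for older sidecars.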

Performance (M4 Max, release)

| scenario | before | after | speedup |
|---|---|---|---|
| cold, 1M-row non-date col | 125.0 ms | 89.7 ms | 1.39× |
| cold, tiny file | 39.3 ms | 16.7 ms | 2.35× |
| warm cache, 1000-row date file | 46.5 ms | 10.5 ms | 4.4× |
| warm cache, 1M-row file | 64.9 ms | 9.8 ms | 6.6× |

Warm runs are now ~10 ms regardless of file size, since sniff no longer re-runs.
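For readers who want to re-derive the speedup column, the figures are simply the "before" time divided by the "after" time. The snippet below is a small arithmetic check using the timings from the table; `speedup`, `ROWS`, and `report_line` are illustrative names, not part of qsv.

```rust
/// Speedup factor: elapsed time before divided by elapsed time after.
pub fn speedup(before_ms: f64, after_ms: f64) -> f64 {
    before_ms / after_ms
}

/// (scenario, before ms, after ms) rows copied from the table above.
pub const ROWS: [(&str, f64, f64); 4] = [
    ("cold, 1M-row non-date col", 125.0, 89.7),
    ("cold, tiny file", 39.3, 16.7),
    ("warm cache, 1000-row date file", 46.5, 10.5),
    ("warm cache, 1M-row file", 64.9, 9.8),
];

/// Format one row as "scenario: N.NNx" (two decimal places).
pub fn report_line(row: &(&str, f64, f64)) -> String {
    format!("{}: {:.2}x", row.0, speedup(row.1, row.2))
}
```

The two warm-cache rows are reported to one decimal place in the table, so the computed ratios agree with them only after rounding.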

Verification

  • cargo test sniff -F all_features — 24 passed
  • cargo test stats -F all_features — 752 passed
  • cargo clippy --bin qsv -F all_features -- -D warnings — clean
  • Output byte-identical vs prior 20.1.0 binary (1M non-date col, datetime col, comma date col)
  • New regression test stats_dates_whitelist_sniff_cache_provenance covers cold-write provenance + warm reuse identical output
  • Functional log trace confirms: cold → sniffs; warm/unchanged → "Reusing sniff-resolved dates-whitelist from current stats cache"; changed file → re-sniffs

🤖 Generated with Claude Code

jqnatividad and others added 2 commits May 30, 2026 11:16
stats --infer-dates with the default "sniff" dates-whitelist forked a
`qsv sniff --json --stats-types` subprocess solely to learn which columns
are Date/DateTime. That reloaded the 138 MB qsv binary, re-sampled the
input, and JSON round-tripped a result the sniff code already had in hand
as a plain struct — ~35 ms of pure overhead per invocation.

Add sniff::date_columns(), a pub(crate) in-process function that reuses
get_file_to_sniff (preserving snappy decompression, symlink canonicalization
and delimiter handling) and the same Sniffer sampling with the exact
subprocess defaults (--sample 1000, auto-delimiter, file dmy/mdy preference),
returning the Date/DateTime field names in order. resolve_sniff_whitelist
now calls it directly; the SniffResult shim and simd_json/serde_json
round-trip are deleted. The qsv sniff command path is untouched.

Behavior is byte-identical: the Date/DateTime filter, join order, the
_qsv_no_date_columns_found sentinel, and the sniff-failure error message
are all preserved.

Perf (M4 Max, release): tiny file 39.3 -> 16.7 ms (2.35x); 1M-row non-date
column 125.0 -> 89.7 ms (1.39x), shrinking the --infer-dates overhead from
~55 ms to ~19.5 ms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
"sniff" is the DEFAULT --dates-whitelist, and resolve_sniff_whitelist ran
*before* the stats-cache hit check (its result feeds the cache-key whitelist),
so every `stats --infer-dates` run — including warm-cache hits on an unchanged
file — re-sniffed the input just to rebuild the key it then compared against.

Record the original, unresolved whitelist value ("sniff") in the cache sidecar
as flag_dates_whitelist_raw (provenance), alongside the resolved column names in
flag_dates_whitelist. On a "sniff" request, if a stats cache sidecar exists, is
newer than the input file (file unchanged since the cache was built), and was
itself sniff-derived, reuse its resolved whitelist and skip the sniff entirely.

The resolved names are still stored as the cache key, so content-based cache
sharing with the resolved-whitelist runs that schema/profile/frequency build via
get_stats_records is preserved. flag_dates_whitelist_raw is excluded from the
cache-validity comparison (zeroed before comparing) and is #[serde(default)] so
older sidecars deserialize cleanly (one self-healing recompute populates it).
--force, stale caches, and explicit (non-"sniff") whitelists all fall back to a
fresh in-process sniff.
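One way to exclude a provenance field from a validity comparison, as the message above describes ("zeroed before comparing"), is to clear it on copies of both sides and then compare the rest. This is a sketch under assumptions: `CacheArgs` and `same_cache_key` are hypothetical, and the real sidecar holds many more fields than the three shown.

```rust
/// Simplified view of the arguments recorded in a stats cache sidecar.
/// Only the field names come from the PR; the struct is hypothetical.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct CacheArgs {
    pub flag_infer_dates: bool,
    /// Resolved column names: part of the cache key.
    pub flag_dates_whitelist: String,
    /// Provenance only: must not affect cache validity.
    pub flag_dates_whitelist_raw: String,
}

/// Compare two argument sets while ignoring the provenance field.
pub fn same_cache_key(current: &CacheArgs, cached: &CacheArgs) -> bool {
    let mut a = current.clone();
    let mut b = cached.clone();
    // Zero the provenance on both copies so it cannot cause a mismatch.
    a.flag_dates_whitelist_raw.clear();
    b.flag_dates_whitelist_raw.clear();
    a == b
}
```

With this shape, a run that passed the resolved names explicitly and a run that arrived at the same names via sniffing still compare equal, which is what keeps the cache shareable.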

Perf (M4 Max, release; warm-cache repeat run vs prior 20.1.0): 1000-row date
file 46.5 -> 10.5 ms (4.4x); 1M-row file 64.9 -> 9.8 ms (6.6x). Warm runs are
now ~10 ms regardless of file size since sniff no longer re-runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codacy-production

codacy-production Bot commented May 30, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues


Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

This PR speeds up qsv stats --infer-dates by eliminating redundant sniff work: it resolves the default "sniff" dates-whitelist in-process (instead of forking qsv sniff) and reuses a previously sniff-resolved whitelist on warm stats-cache hits when the input is unchanged.

Changes:

  • Add sniff::date_columns() to sniff local files in-process and return Date/DateTime column names in order.
  • Teach stats to reuse a sniff-resolved dates whitelist from a current cache sidecar (via provenance flag_dates_whitelist_raw) to avoid re-sniffing on warm runs.
  • Add a regression test for cache provenance + warm reuse, and document the perf win in the changelog.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| src/cmd/sniff.rs | Adds in-process date column detection helper for stats --infer-dates. |
| src/cmd/stats.rs | Replaces subprocess sniff with in-process sniff and adds warm-cache whitelist reuse via sidecar provenance. |
| tests/test_stats.rs | Adds regression test covering cache provenance and warm-run output equality. |
| CHANGELOG.md | Documents the stats --infer-dates performance improvements. |

Comment thread src/cmd/stats.rs Outdated
Comment thread tests/test_stats.rs Outdated
Comment thread src/cmd/stats.rs Outdated
Comment thread tests/test_stats.rs Outdated
… in test

Copilot review on #3924:

1. read_current_sniff_whitelist could reuse a stats cache built WITHOUT
   --infer-dates. Because "sniff" is the --dates-whitelist default and resolution
   is gated on --infer-dates, such a cache stores the unresolved literal "sniff"
   in flag_dates_whitelist. A later `--infer-dates --dates-whitelist sniff` run
   reused that literal keyword, skipping the sniff and breaking date inference
   (column typed as String instead of Date). Fix: only reuse when
   cached.flag_infer_dates is true and the stored whitelist is not the literal
   "sniff". Added regression test stats_dates_whitelist_sniff_no_reuse_of_non_infer_cache.

2. Provenance test asserted on exact JSON substrings (incl. whitespace). Reworked
   to parse the sidecar JSON and assert on fields, robust to formatting changes.
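The guard added in item 1 can be sketched as a small predicate. The `CachedArgs` struct and `cached_whitelist_is_reusable` function are hypothetical, and the exact, case-sensitive match on the keyword is an assumption; the two conditions themselves are the ones stated in the fix.

```rust
/// The two cached fields the guard inspects.
/// Field names follow the PR text; the struct is a simplified stand-in.
#[derive(Debug, Clone, Default)]
pub struct CachedArgs {
    pub flag_infer_dates: bool,
    pub flag_dates_whitelist: String,
}

/// A cached whitelist is only worth reusing if the cache was built with
/// --infer-dates and holds resolved names rather than the bare keyword.
pub fn cached_whitelist_is_reusable(cached: &CachedArgs) -> bool {
    // Assumption: the keyword is compared exactly, without case folding.
    cached.flag_infer_dates && cached.flag_dates_whitelist != "sniff"
}
```

A cache written without `--infer-dates` stores the unresolved default, so both conditions fail for it and the caller falls back to a fresh sniff.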

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jqnatividad merged commit 80a47e7 into master on May 30, 2026 (19 checks passed)
jqnatividad deleted the in-process-sniff branch on May 30, 2026 16:41