perf(stats): in-process sniff + warm-cache reuse for --infer-dates#3924
Merged
Conversation
stats --infer-dates with the default "sniff" dates-whitelist forked a `qsv sniff --json --stats-types` subprocess solely to learn which columns are Date/DateTime. That reloaded the 138 MB qsv binary, re-sampled the input, and JSON round-tripped a result the sniff code already had in hand as a plain struct — ~35 ms of pure overhead per invocation. Add sniff::date_columns(), a pub(crate) in-process function that reuses get_file_to_sniff (preserving snappy decompression, symlink canonicalization and delimiter handling) and the same Sniffer sampling with the exact subprocess defaults (--sample 1000, auto-delimiter, file dmy/mdy preference), returning the Date/DateTime field names in order. resolve_sniff_whitelist now calls it directly; the SniffResult shim and simd_json/serde_json round-trip are deleted. The qsv sniff command path is untouched. Behavior is byte-identical: the Date/DateTime filter, join order, the _qsv_no_date_columns_found sentinel, and the sniff-failure error message are all preserved. Perf (M4 Max, release): tiny file 39.3 -> 16.7 ms (2.35x); 1M-row non-date column 125.0 -> 89.7 ms (1.39x), shrinking the --infer-dates overhead from ~55 ms to ~19.5 ms. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
"sniff" is the DEFAULT --dates-whitelist, and resolve_sniff_whitelist ran
*before* the stats-cache hit check (its result feeds the cache-key whitelist),
so every `stats --infer-dates` run — including warm-cache hits on an unchanged
file — re-sniffed the input just to rebuild the key it then compared against.
Record the original, unresolved whitelist value ("sniff") in the cache sidecar
as flag_dates_whitelist_raw (provenance), alongside the resolved column names in
flag_dates_whitelist. On a "sniff" request, if a stats cache sidecar exists, is
newer than the input file (file unchanged since the cache was built), and was
itself sniff-derived, reuse its resolved whitelist and skip the sniff entirely.
The resolved names are still stored as the cache key, so content-based cache
sharing with the resolved-whitelist runs that schema/profile/frequency build via
get_stats_records is preserved. flag_dates_whitelist_raw is excluded from the
cache-validity comparison (zeroed before comparing) and is #[serde(default)] so
older sidecars deserialize cleanly (one self-healing recompute populates it).
--force, stale caches, and explicit (non-"sniff") whitelists all fall back to a
fresh in-process sniff.
Perf (M4 Max, release; warm-cache repeat run vs prior 20.1.0): 1000-row date
file 46.5 -> 10.5 ms (4.4x); 1M-row file 64.9 -> 9.8 ms (6.6x). Warm runs are
now ~10 ms regardless of file size since sniff no longer re-runs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Up to standards ✅🟢 Issues
|
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR speeds up qsv stats --infer-dates by eliminating redundant sniff work: it resolves the default "sniff" dates-whitelist in-process (instead of forking qsv sniff) and reuses a previously sniff-resolved whitelist on warm stats-cache hits when the input is unchanged.
Changes:
- Add
sniff::date_columns()to sniff local files in-process and return Date/DateTime column names in order. - Teach
statsto reuse a sniff-resolved dates whitelist from a current cache sidecar (via provenanceflag_dates_whitelist_raw) to avoid re-sniffing on warm runs. - Add a regression test for cache provenance + warm reuse, and document the perf win in the changelog.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
src/cmd/sniff.rs |
Adds in-process date column detection helper for stats --infer-dates. |
src/cmd/stats.rs |
Replaces subprocess sniff with in-process sniff and adds warm-cache whitelist reuse via sidecar provenance. |
tests/test_stats.rs |
Adds regression test covering cache provenance and warm-run output equality. |
CHANGELOG.md |
Documents the stats --infer-dates performance improvements. |
… in test Copilot review on #3924: 1. read_current_sniff_whitelist could reuse a stats cache built WITHOUT --infer-dates. Because "sniff" is the --dates-whitelist default and resolution is gated on --infer-dates, such a cache stores the unresolved literal "sniff" in flag_dates_whitelist. A later `--infer-dates --dates-whitelist sniff` run reused that literal keyword, skipping the sniff and breaking date inference (column typed as String instead of Date). Fix: only reuse when cached.flag_infer_dates is true and the stored whitelist is not the literal "sniff". Added regression test stats_dates_whitelist_sniff_no_reuse_of_non_infer_cache. 2. Provenance test asserted on exact JSON substrings (incl. whitespace). Reworked to parse the sidecar JSON and assert on fields, robust to formatting changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Removes redundant
qsv sniffwork from thestats --infer-datesdefault path.--dates-whitelistdefaults tosniff, so this affects essentially everystats --infer-datesinvocation.Two changes:
1. Resolve the
sniffwhitelist in-process (d75424744)stats --infer-datesforked aqsv sniff --json --stats-typessubprocess purely to learn which columns are Date/DateTime — reloading the 138 MB qsv binary, re-sampling the input, and JSON round-tripping a result the sniff code already had in hand as a struct.sniff::date_columns()reusesget_file_to_sniff(preserving snappy decompression, symlink canonicalization, delimiter handling) and the sameSniffersampling with the exact subprocess defaults (--sample 1000, auto-delimiter, file dmy/mdy preference), returning Date/DateTime field names in order.resolve_sniff_whitelistcalls it directly; theSniffResultshim andsimd_json/serde_jsonround-trip are deleted. Theqsv sniffcommand path is untouched._qsv_no_date_columns_foundsentinel, and the failure error message are all preserved.2. Reuse the sniff-resolved whitelist on warm cache hits (
afe9886a7)resolve_sniff_whitelistran before the stats-cache hit check (its result feeds the cache-key whitelist), so even warm-cache hits on an unchanged file re-sniffed the input just to rebuild the key they then compared.flag_dates_whitelist_raw = "sniff") as provenance, alongside the resolved column names.sniffrequest, if a sidecar exists, is newer than the input (file unchanged), and was itself sniff-derived → reuse its resolved whitelist and skip the sniff entirely.schema/profile/frequencyruns that buildstatsviaget_stats_recordsis preserved.flag_dates_whitelist_rawis excluded from the validity comparison and is#[serde(default)]so older sidecars deserialize cleanly (one self-healing recompute populates it).--force, stale caches, and explicit whitelists fall back to a fresh in-process sniff.Performance (M4 Max, release)
Warm runs are now ~10 ms regardless of file size, since sniff no longer re-runs.
Verification
cargo test sniff -F all_features— 24 passedcargo test stats -F all_features— 752 passedcargo clippy --bin qsv -F all_features -- -D warnings— cleanstats_dates_whitelist_sniff_cache_provenancecovers cold-write provenance + warm reuse identical output🤖 Generated with Claude Code