perf(stats): in-process sniff + warm-cache reuse for --infer-dates by jqnatividad · Pull Request #3924 · dathere/qsv

jqnatividad · 2026-05-30T15:55:29Z

Summary

Removes redundant qsv sniff work from the stats --infer-dates default path. --dates-whitelist defaults to sniff, so this affects essentially every stats --infer-dates invocation.

Two changes:

1. Resolve the `sniff` whitelist in-process (`d75424744`)

stats --infer-dates forked a qsv sniff --json --stats-types subprocess purely to learn which columns are Date/DateTime — reloading the 138 MB qsv binary, re-sampling the input, and JSON round-tripping a result the sniff code already had in hand as a struct.

New sniff::date_columns() reuses get_file_to_sniff (preserving snappy decompression, symlink canonicalization, delimiter handling) and the same Sniffer sampling with the exact subprocess defaults (--sample 1000, auto-delimiter, file dmy/mdy preference), returning Date/DateTime field names in order.
resolve_sniff_whitelist calls it directly; the SniffResult shim and simd_json/serde_json round-trip are deleted. The qsv sniff command path is untouched.
Behavior is byte-identical: Date/DateTime filter, join order, the _qsv_no_date_columns_found sentinel, and the failure error message are all preserved.
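
As a rough illustration of the filter-and-join step described above, here is a minimal Rust sketch. The `Field` and `FieldType` types, the `resolve_whitelist` helper, and the comma separator are assumptions made for this example and are not qsv's actual definitions. Only the `date_columns` name and the `_qsv_no_date_columns_found` sentinel come from the PR text, and the sentinel is assumed to be what gets returned when no Date/DateTime column is found.

```rust
/// Hypothetical inferred type of a sniffed column (not qsv's real enum).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum FieldType {
    Date,
    DateTime,
    Text,
    Integer,
    Float,
}

/// Hypothetical sniffed field: header name plus inferred type.
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct Field {
    name: String,
    ty: FieldType,
}

/// Sentinel named in the PR description; assumed here to mark "no date columns".
#[allow(dead_code)]
const NO_DATE_COLUMNS_SENTINEL: &str = "_qsv_no_date_columns_found";

/// Return the names of Date/DateTime fields, preserving input order.
#[allow(dead_code)]
fn date_columns(fields: &[Field]) -> Vec<String> {
    fields
        .iter()
        .filter(|f| matches!(f.ty, FieldType::Date | FieldType::DateTime))
        .map(|f| f.name.clone())
        .collect()
}

/// Join the date column names into one whitelist string (comma separator is
/// an assumption), falling back to the sentinel when none were found.
#[allow(dead_code)]
fn resolve_whitelist(fields: &[Field]) -> String {
    let cols = date_columns(fields);
    if cols.is_empty() {
        NO_DATE_COLUMNS_SENTINEL.to_string()
    } else {
        cols.join(",")
    }
}
```

Because the struct is consumed directly, there is nothing to serialize or parse in between, which is the round-trip the PR removes.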

2. Reuse the sniff-resolved whitelist on warm cache hits (`afe9886a7`)

resolve_sniff_whitelist ran before the stats-cache hit check (its result feeds the cache-key whitelist), so even warm-cache hits on an unchanged file re-sniffed the input just to rebuild the key they then compared.

The cache sidecar now records the original unresolved value (flag_dates_whitelist_raw = "sniff") as provenance, alongside the resolved column names.
On a sniff request, if a sidecar exists, is newer than the input (file unchanged), and was itself sniff-derived → reuse its resolved whitelist and skip the sniff entirely.
Resolved names remain the cache key, so content-based cache sharing with the schema/profile/frequency runs that build stats via get_stats_records is preserved. flag_dates_whitelist_raw is excluded from the validity comparison and is #[serde(default)] so older sidecars deserialize cleanly (one self-healing recompute populates it). --force, stale caches, and explicit whitelists fall back to a fresh in-process sniff.
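
The reuse conditions above can be sketched as a single decision function. This is a hedged illustration, not the code in `src/cmd/stats.rs`: the `CachedArgs` struct, the `reusable_sniff_whitelist` name, and its parameters are hypothetical, and modification times are passed in as plain values so the logic can be read in isolation. The extra check on `flag_infer_dates` and on the stored whitelist not being the literal "sniff" reflects the follow-up fix commit in this PR.

```rust
use std::time::SystemTime;

/// Hypothetical subset of the fields stored in the stats cache sidecar.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct CachedArgs {
    flag_infer_dates: bool,
    /// Resolved column names (still the cache key).
    flag_dates_whitelist: String,
    /// Original unresolved value, e.g. "sniff" (provenance only).
    flag_dates_whitelist_raw: String,
}

/// Return the cached resolved whitelist if it can be reused, else `None`,
/// in which case the caller would fall back to a fresh in-process sniff.
#[allow(dead_code)]
fn reusable_sniff_whitelist(
    requested: &str,
    force: bool,
    cached: Option<&CachedArgs>,
    input_modified: SystemTime,
    sidecar_modified: SystemTime,
) -> Option<String> {
    // --force and explicit (non-"sniff") whitelists never reuse.
    if force || requested != "sniff" {
        return None;
    }
    // No sidecar means there is nothing to reuse.
    let cached = cached?;
    // A stale cache: the sidecar must be newer than the input file.
    if sidecar_modified <= input_modified {
        return None;
    }
    // The cached whitelist must itself have been sniff-derived...
    let sniff_derived = cached.flag_dates_whitelist_raw == "sniff";
    // ...and actually resolved by an --infer-dates run, not the bare keyword.
    let resolved = cached.flag_infer_dates && cached.flag_dates_whitelist != "sniff";
    if sniff_derived && resolved {
        Some(cached.flag_dates_whitelist.clone())
    } else {
        None
    }
}
```

Returning `Option<String>` keeps the fallback explicit: any `None` means the sniff runs as before, so a wrong guess about reuse can only cost time, not correctness.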

Performance (M4 Max, release)

| scenario | before | after | speedup |
| --- | --- | --- | --- |
| cold, 1M-row non-date col | 125.0 ms | 89.7 ms | 1.39× |
| cold, tiny file | 39.3 ms | 16.7 ms | 2.35× |
| warm cache, 1000-row date file | 46.5 ms | 10.5 ms | 4.4× |
| warm cache, 1M-row file | 64.9 ms | 9.8 ms | 6.6× |

Warm runs are now ~10 ms regardless of file size, since sniff no longer re-runs.
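
The speedup column is simply before divided by after. A small sketch for re-deriving it from the timings in the table (the `speedup` and `round_to` helper names are made up for this example):

```rust
/// Speedup factor: how many times faster the "after" run is.
#[allow(dead_code)]
fn speedup(before_ms: f64, after_ms: f64) -> f64 {
    before_ms / after_ms
}

/// Round `x` to the given number of decimal places.
#[allow(dead_code)]
fn round_to(x: f64, places: i32) -> f64 {
    let factor = 10f64.powi(places);
    (x * factor).round() / factor
}
```

For example, `round_to(speedup(125.0, 89.7), 2)` gives 1.39 and `round_to(speedup(64.9, 9.8), 1)` gives 6.6, matching the table.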

Verification

cargo test sniff -F all_features — 24 passed
cargo test stats -F all_features — 752 passed
cargo clippy --bin qsv -F all_features -- -D warnings — clean
Output byte-identical vs prior 20.1.0 binary (1M non-date col, datetime col, comma date col)
New regression test stats_dates_whitelist_sniff_cache_provenance covers cold-write provenance + warm reuse identical output
Functional log trace confirms: cold → sniffs; warm/unchanged → "Reusing sniff-resolved dates-whitelist from current stats cache"; changed file → re-sniffs

🤖 Generated with Claude Code

stats --infer-dates with the default "sniff" dates-whitelist forked a `qsv sniff --json --stats-types` subprocess solely to learn which columns are Date/DateTime. That reloaded the 138 MB qsv binary, re-sampled the input, and JSON round-tripped a result the sniff code already had in hand as a plain struct: ~35 ms of pure overhead per invocation.

Add sniff::date_columns(), a pub(crate) in-process function that reuses get_file_to_sniff (preserving snappy decompression, symlink canonicalization and delimiter handling) and the same Sniffer sampling with the exact subprocess defaults (--sample 1000, auto-delimiter, file dmy/mdy preference), returning the Date/DateTime field names in order. resolve_sniff_whitelist now calls it directly; the SniffResult shim and simd_json/serde_json round-trip are deleted. The qsv sniff command path is untouched.

Behavior is byte-identical: the Date/DateTime filter, join order, the _qsv_no_date_columns_found sentinel, and the sniff-failure error message are all preserved.

Perf (M4 Max, release): tiny file 39.3 -> 16.7 ms (2.35x); 1M-row non-date column 125.0 -> 89.7 ms (1.39x), shrinking the --infer-dates overhead from ~55 ms to ~19.5 ms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

"sniff" is the DEFAULT --dates-whitelist, and resolve_sniff_whitelist ran *before* the stats-cache hit check (its result feeds the cache-key whitelist), so every `stats --infer-dates` run, including warm-cache hits on an unchanged file, re-sniffed the input just to rebuild the key it then compared against.

Record the original, unresolved whitelist value ("sniff") in the cache sidecar as flag_dates_whitelist_raw (provenance), alongside the resolved column names in flag_dates_whitelist. On a "sniff" request, if a stats cache sidecar exists, is newer than the input file (file unchanged since the cache was built), and was itself sniff-derived, reuse its resolved whitelist and skip the sniff entirely.

The resolved names are still stored as the cache key, so content-based cache sharing with the resolved-whitelist runs that schema/profile/frequency build via get_stats_records is preserved. flag_dates_whitelist_raw is excluded from the cache-validity comparison (zeroed before comparing) and is #[serde(default)] so older sidecars deserialize cleanly (one self-healing recompute populates it). --force, stale caches, and explicit (non-"sniff") whitelists all fall back to a fresh in-process sniff.

Perf (M4 Max, release; warm-cache repeat run vs prior 20.1.0): 1000-row date file 46.5 -> 10.5 ms (4.4x); 1M-row file 64.9 -> 9.8 ms (6.6x). Warm runs are now ~10 ms regardless of file size since sniff no longer re-runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codacy-production · 2026-05-30T15:56:18Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues


Copilot

Pull request overview

This PR speeds up qsv stats --infer-dates by eliminating redundant sniff work: it resolves the default "sniff" dates-whitelist in-process (instead of forking qsv sniff) and reuses a previously sniff-resolved whitelist on warm stats-cache hits when the input is unchanged.

Changes:

Add sniff::date_columns() to sniff local files in-process and return Date/DateTime column names in order.
Teach stats to reuse a sniff-resolved dates whitelist from a current cache sidecar (via provenance flag_dates_whitelist_raw) to avoid re-sniffing on warm runs.
Add a regression test for cache provenance + warm reuse, and document the perf win in the changelog.
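
The PR text notes that the provenance field is excluded from the cache-validity comparison by zeroing it before comparing. A minimal sketch of that idea follows, under stated assumptions: `StatsArgs` and `same_cache_key` are hypothetical names, the struct carries only three illustrative fields, and `Default` stands in for what `#[serde(default)]` would yield when an older sidecar lacks the field.

```rust
/// Hypothetical subset of the arguments persisted in the cache sidecar.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Default)]
struct StatsArgs {
    flag_infer_dates: bool,
    /// Resolved column names; part of the cache key.
    flag_dates_whitelist: String,
    /// Provenance only; an older sidecar would leave this empty.
    flag_dates_whitelist_raw: String,
}

/// Compare two argument sets for cache validity, ignoring provenance.
#[allow(dead_code)]
fn same_cache_key(current: &StatsArgs, cached: &StatsArgs) -> bool {
    let mut a = current.clone();
    let mut b = cached.clone();
    // Zero the provenance field on both sides so it cannot affect equality.
    a.flag_dates_whitelist_raw.clear();
    b.flag_dates_whitelist_raw.clear();
    a == b
}
```

With this shape, a run that reached `created,dob` through "sniff" and a run that was given `created,dob` explicitly compare equal, which is what keeps the cache shareable across commands.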

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `src/cmd/sniff.rs` | Adds in-process date column detection helper for `stats --infer-dates`. |
| `src/cmd/stats.rs` | Replaces subprocess sniff with in-process sniff and adds warm-cache whitelist reuse via sidecar provenance. |
| `tests/test_stats.rs` | Adds regression test covering cache provenance and warm-run output equality. |
| `CHANGELOG.md` | Documents the `stats --infer-dates` performance improvements. |

… in test Copilot review on #3924:

1. read_current_sniff_whitelist could reuse a stats cache built WITHOUT --infer-dates. Because "sniff" is the --dates-whitelist default and resolution is gated on --infer-dates, such a cache stores the unresolved literal "sniff" in flag_dates_whitelist. A later `--infer-dates --dates-whitelist sniff` run reused that literal keyword, skipping the sniff and breaking date inference (column typed as String instead of Date). Fix: only reuse when cached.flag_infer_dates is true and the stored whitelist is not the literal "sniff". Added regression test stats_dates_whitelist_sniff_no_reuse_of_non_infer_cache.

2. Provenance test asserted on exact JSON substrings (incl. whitespace). Reworked to parse the sidecar JSON and assert on fields, robust to formatting changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jqnatividad and others added 2 commits May 30, 2026 11:16

docs(changelog): note in-process sniff + warm-cache reuse perf wins

472f539

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jqnatividad requested a review from Copilot May 30, 2026 15:57

Copilot started reviewing on behalf of jqnatividad May 30, 2026 15:57

Copilot AI reviewed May 30, 2026

jqnatividad merged commit 80a47e7 into master May 30, 2026
19 checks passed

jqnatividad deleted the in-process-sniff branch May 30, 2026 16:41

jqnatividad merged 4 commits into master from in-process-sniff