Skip to content

fill: review-driven cleanup, fix MapSelected unsorted-selection bug, ~22% faster hot path#3762

Merged
jqnatividad merged 1 commit intomasterfrom
fill-review-202604
Apr 27, 2026
Merged

fill: review-driven cleanup, fix MapSelected unsorted-selection bug, ~22% faster hot path#3762
jqnatividad merged 1 commit intomasterfrom
fill-review-202604

Conversation

@jqnatividad
Copy link
Copy Markdown
Collaborator

Summary

  • Correctness fix: MapSelected previously assumed selection indices were sorted and unique, so user-supplied selections like qsv fill -- "3,1" file.csv or qsv fill -- "1,1" file.csv silently dropped columns from the fill. Sort + dedup at construction; added regression tests.
  • Doc/UX: USAGE now states that --default takes precedence over --first / forward-fill (previously silent).
  • Hot-path perf (~22% faster): filled rows are now written directly into a reusable csv::ByteRecord via a new fill_into() method, eliminating two Vec<Vec<u8>> allocations per row on the common no---backfill path. Grouper lookup uses get_mut() so key.clone() and the default-value clone only happen on insertion of a new group.
  • Smaller cleanups: // safety: comment on the final-buffer-drain unwrap(), as_ref().or_else().cloned() in GroupValues::fill, hoisted default_value clone into or_insert_with, is_empty() instead of == b"", hoisted the (default, first) match into a MemorizeKind enum computed once outside the loop.

Benchmark

NYC_311_SR_2010-2020-sample-1M.csv (1M rows, 514 MB, 3 selected columns with high empty rates), 5 runs each via hyperfine:

Mode Baseline This PR Speedup
forward fill 1.829 s 1.429 s 1.28×
--groupby fill 1.870 s 1.456 s 1.28×
--default fill 1.758 s 1.372 s 1.28×
--backfill 1.815 s 1.837 s within noise (slow path unchanged)

Output is bit-identical to baseline (md5 verified for forward + groupby modes on the full 1M-row file).

Test plan

  • All existing fill tests pass (cargo t fill -F all_features)
  • Three new regression tests added:
    • fill_forward_unsorted_selection — covers MapSelected ordering bug
    • fill_forward_duplicate_selection — covers dedup behavior
    • fill_default_overrides_first — locks in the documented precedence
  • cargo +nightly clippy -F all_features -- -W clippy::perf — clean
  • cargo +nightly fmt — applied
  • docs/help/fill.md regenerated via qsv --generate-help-md
  • Manual md5 comparison of full output vs baseline (forward + groupby)

🤖 Generated with Claude Code

…~22% faster hot path

Correctness
- MapSelected assumed ascending, deduped column indices; user-supplied
  selections like "3,1" or "1,1" silently dropped columns. Sort + dedup
  in the constructor and add regression tests.
- Document that --default takes precedence over --first / forward-fill.
- Add // safety: comment to the final-buffer-drain unwrap.

Style / small perf
- GroupValues::fill: as_ref().or_else().cloned().unwrap_or(field) avoids
  cloning the default value bytes for every empty field.
- Hoist default_value clone into or_insert_with so it only runs on
  group insertion, not every row.
- Replace `row[i] == b""` with is_empty().

Hot-path perf (~22% faster on the common no-backfill path)
- Add fill_into() which writes filled rows directly into a reusable
  csv::ByteRecord using an O(1) HashSet<usize> for selected-column
  membership, eliminating two Vec<Vec<u8>> allocations per row.
- Use get_mut() in the grouper lookup so key.clone() and the default
  value clone only happen on insertion of a new group, not every row.
- Hoist the (default, first) match into a MemorizeKind enum computed
  once before the loop instead of per-row.

Benchmarks on NYC_311_SR_2010-2020-sample-1M.csv (1M rows, 514MB,
3 selected cols with high empty rates), 5 runs each, hyperfine:

  forward fill: 1.829s -> 1.429s  (1.28x faster)
  groupby fill: 1.870s -> 1.456s  (1.28x faster)
  default fill: 1.758s -> 1.372s  (1.28x faster)
  backfill:     1.815s -> 1.837s  (within noise; slow path unchanged)

Output bit-identical to baseline (md5 verified for forward + groupby).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codacy-production
Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 7 complexity

Metric Results
Complexity 7

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the fill command’s correctness, clarifies option precedence in the help text, and optimizes the non---backfill hot path by avoiding per-row heap allocations.

Changes:

  • Fixes a correctness bug where MapSelected could mis-handle unsorted or duplicate selection indices by sorting + deduping at construction time (with regression tests).
  • Documents that --default takes precedence over forward-fill / --first (making them no-ops when default is set).
  • Optimizes the no---backfill path by filling directly into a reusable csv::ByteRecord and reducing cloning on group insertion.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
src/cmd/fill.rs Fixes selection-order/dup handling, adds fill_into() fast path, and updates USAGE to document --default precedence.
tests/test_fill.rs Adds regression tests for unsorted/duplicate selection and --default precedence over --first.
docs/help/fill.md Regenerates help text to reflect --default precedence.

Comment thread src/cmd/fill.rs
@jqnatividad jqnatividad merged commit 18df0da into master Apr 27, 2026
21 checks passed
@jqnatividad jqnatividad deleted the fill-review-202604 branch April 27, 2026 16:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants