Recipe: Diff and Audit
Tier: Intermediate
Commands used: blake3, sortcheck, extsort, diff, extdedup, validate
Anchor dataset: weekly regulatory CSV exports (any periodic dataset works)
You receive a CSV every week from a partner (a vendor list, a regulatory filing, a sensor export). You need to know — quickly:
- Did anything change since last week? (fingerprint check)
- If yes, what changed? (row-level diff: added, removed, modified)
- Is the new file internally consistent? (primary-key uniqueness, sort order, no encoding glitches)
- Can the change be explained and signed off by humans?
qsv's diff does the row-level work in under 600 ms on a file of 1M rows × 9 columns. blake3 handles the fingerprint check. sortcheck, extsort, and extdedup handle the preconditions.
# Place this week's and last week's files side by side
ls this_week.csv last_week.csv
qsv sniff this_week.csv

For the recipe to work, both files must have a unique primary key column (or a composite of columns). Examples: a contract number, a parcel ID, a Case Enquiry ID.
THIS=$(qsv blake3 this_week.csv | awk '{print $1}')
LAST=$(qsv blake3 last_week.csv | awk '{print $1}')
if [ "$THIS" = "$LAST" ]; then
echo "No change — same content hash."
exit 0
fi
echo "Content changed; running row-level diff…"

BLAKE3 is multithreaded and mmap-backed, so hashing stays fast even on very large files. Use this as the gate for the rest of the pipeline.
qsv extdedup --select case_id this_week.csv --no-output 2>&1 | grep -E '\b[0-9]+\b'
qsv extdedup --select case_id last_week.csv --no-output 2>&1 | grep -E '\b[0-9]+\b'

extdedup --no-output prints the duplicate count to stderr. A non-zero count means the primary key isn't unique and diff will refuse to run.
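A duplicate count says that the key is not unique, but not which values repeat. The sketch below lists the offending keys with plain awk. It is an illustration, not part of qsv: `duplicate_keys` is a hypothetical helper name, and it assumes a header row, a comma delimiter, and no quoted fields containing commas or newlines.

```shell
# Hypothetical helper (not a qsv command): print "key,count" for every key
# value that occurs more than once.
# Assumptions: header row present, comma-delimited, no quoted commas/newlines.
duplicate_keys() {
  file=$1
  col=${2:-1}   # 1-based position of the key column; defaults to the first
  awk -F, -v col="$col" '
    NR > 1 { seen[$col]++ }                 # skip the header, tally each key
    END    { for (k in seen) if (seen[k] > 1) print k "," seen[k] }
  ' "$file" | sort                          # awk iteration order is unspecified
}
```

Calling `duplicate_keys this_week.csv 1` prints nothing when the key is unique, so an empty result is the pass condition.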
qsv sortcheck --select case_id this_week.csv && echo "this_week sorted"
qsv sortcheck --select case_id last_week.csv && echo "last_week sorted"
# If not sorted, sort them (in-memory). For files > RAM, use extsort.
qsv sort --select case_id this_week.csv > this_sorted.csv
qsv sort --select case_id last_week.csv > last_sorted.csv

qsv diff --select case_id last_sorted.csv this_sorted.csv > delta.csv
qsv count delta.csv
# 142 rows of changes

The output CSV has a diffresult column with values Add, Remove, or Modify, plus a side-by-side view of changed fields.
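To make the three categories concrete, here is a small awk sketch of what a keyed diff computes, using the labels described above. It is a conceptual illustration only and says nothing about how qsv diff is implemented. `keyed_diff` is a hypothetical helper that assumes the key is the first column, keys are unique, both files have a non-empty header row, and no field contains a quoted comma.

```shell
# Hypothetical helper (not a qsv command): classify rows of two CSVs by key.
# Prints "Add,<key>", "Remove,<key>", or "Modify,<key>", sorted.
# Usage: keyed_diff OLD.csv NEW.csv
keyed_diff() {
  awk -F, '
    FNR == 1  { next }                      # skip the header of each file
    NR == FNR { old[$1] = $0; next }        # first file: remember rows by key
    {
      if (!($1 in old))       print "Add," $1      # key only in the new file
      else if (old[$1] != $0) print "Modify," $1   # same key, different row
      delete old[$1]                        # anything left over was removed
    }
    END { for (k in old) print "Remove," k }
  ' "$1" "$2" | sort
}
```

Rows that are identical in both files produce no output, which is why an unchanged dataset yields an empty delta.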
qsv search --select diffresult '^Add$' delta.csv > added.csv
qsv search --select diffresult '^Remove$' delta.csv > removed.csv
qsv search --select diffresult '^Modify$' delta.csv > modified.csv
wc -l added.csv removed.csv modified.csv   # each count includes the header line
# Sign-off-ready triple

qsv blake3 -l 8 this_week.csv | head -c 16 > this_week.fingerprint
echo "Recorded fingerprint $(cat this_week.fingerprint) for this_week.csv"

Short BLAKE3 hashes (8 bytes = 16 hex characters) work well as cache keys and lineage markers.
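A recorded fingerprint can also serve as next week's gate, so the previous file does not have to be kept around just for the comparison. The sketch below shows one way that might look. The helper names and the `<file>.fingerprint` naming are hypothetical, and `sha256sum` is used purely as a stand-in hasher so the example runs with coreutils alone; the body of `fingerprint` is the place to swap in the qsv blake3 pipeline from this recipe.

```shell
# Hypothetical helper: print a content fingerprint for a file.
# sha256sum is a stand-in here, not what the recipe itself uses.
fingerprint() {
  sha256sum "$1" | awk '{print $1}'
}

# Hypothetical helper: succeed (exit 0) when the file differs from its
# recorded fingerprint, or when nothing has been recorded yet. The record
# is updated so that the next run compares against this run.
changed_since_last_run() {
  file=$1
  record="$file.fingerprint"                # assumed naming convention
  current=$(fingerprint "$file")
  if [ -f "$record" ] && [ "$(cat "$record")" = "$current" ]; then
    return 1                                # unchanged: skip the pipeline
  fi
  printf '%s\n' "$current" > "$record"
  return 0
}
```

A weekly job could then wrap the diff steps in `if changed_since_last_run this_week.csv; then ... fi`.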
# Files with different delimiters (tab on the left, comma on the right)
qsv diff --delimiter-left $'\t' --delimiter-right ',' \
  --select id \
  vendor_export.tsv our_canonical.csv

# Right-hand file has no header row; select the key by position
qsv diff --no-headers-right \
  --select 1 \
  with_header.csv legacy_no_headers.csv

# Composite primary key
qsv diff --select 'case_id,fiscal_year' last.csv this.csv > delta.csv

# Schema validation as a gate before sign-off
qsv validate this_week.csv contract.schema.json
if [ -f this_week.csv.invalid.csv ]; then
  echo "::error::Schema validation failed; see this_week.csv.validation-errors.tsv"
  exit 1
fi

See Recipe: JSON Schema Validation.
# Files > RAM: use ext-* commands
qsv extsort --select case_id this_week.csv > this_sorted.csv     # multithreaded
qsv extdedup --select case_id this_sorted.csv > this_unique.csv  # on-disk hash
# Repeat both steps for last_week.csv to produce last_unique.csv
qsv diff --select case_id last_unique.csv this_unique.csv > delta.csv

See Recipe: Larger-than-RAM CSV.
cat <<EOF | qsv clipboard --save
Weekly diff summary for $(date)
Added: $(qsv count added.csv)
Removed: $(qsv count removed.csv)
Modified: $(qsv count modified.csv)
Fingerprint: $(qsv blake3 this_week.csv | awk '{print $1}')
EOF

- diff benchmark: 1M rows × 9 columns in < 600 ms on an M2 Pro.
- blake3 is mmap-backed and multithreaded, so it stays fast even on very large files.
- extsort and extdedup are both multithreaded and stream to disk, which suits files orders of magnitude larger than RAM.
- For very stable files where rows almost never change, the BLAKE3 step alone usually short-circuits the whole pipeline. For files where rows change often, the full pipeline runs in a few seconds for typical (single-digit-million-row) weekly datasets.
- Indexing, Compression & Diff — every command in this recipe
- Aggregation & Statistics → extdedup — for huge files
- Aggregation & Statistics → extsort — same
- Validation & Schema — schema-level audit alongside row-level diff
- Recipe: Larger-than-RAM CSV
- docs/help/diff.md
- BLAKE3 spec