Recipe Diff and Audit

Recipe: Diff & Audit

Tier: Intermediate Commands used: blake3, sortcheck, extsort, diff, extdedup, validate Anchor dataset: weekly regulatory CSV exports (any periodic dataset works)

Problem

You receive a CSV every week from a partner (a vendor list, a regulatory filing, a sensor export). You need to know — quickly:

Did anything change since last week? (fingerprint check)
If yes, what changed? (row-level diff: added, removed, modified)
Is the new file internally consistent? (primary-key uniqueness, sort order, no encoding glitches)
Can the change be explained and signed off by humans?

qsv's diff does the row-level work in < 600 ms on 1M × 9 columns. blake3 handles the fingerprint level. sortcheck + extsort + extdedup handle the preconditions.

Data

# Place this week's and last week's files side by side
ls this_week.csv last_week.csv
qsv sniff this_week.csv

For the recipe to work, both files must have a unique primary key column (or composite of columns). Examples: a contract number, a parcel ID, a Case Enquiry ID.

Solution

1. Fingerprint-level "did anything change at all?"

THIS=$(qsv blake3 this_week.csv | awk '{print $1}')
LAST=$(qsv blake3 last_week.csv | awk '{print $1}')
if [ "$THIS" = "$LAST" ]; then
  echo "No change — same content hash."
  exit 0
fi
echo "Content changed; running row-level diff…"

BLAKE3 is multithreaded and mmap-backed — instant for files of any size. Use this as the gate for the rest of the pipeline.

2. Verify primary key uniqueness on both sides

qsv extdedup --select case_id this_week.csv --no-output 2>&1 | grep -E '\\b(0|[1-9])\\b'
qsv extdedup --select case_id last_week.csv --no-output 2>&1 | grep -E '\\b(0|[1-9])\\b'

extdedup --no-output prints the duplicate count to stderr. A non-zero count means the primary key isn't unique and diff will refuse to run.

3. Make sure both sides are sorted (or sort them)

qsv sortcheck --select case_id this_week.csv && echo "this_week sorted"
qsv sortcheck --select case_id last_week.csv && echo "last_week sorted"

# If not sorted, sort them (in-memory). For files > RAM, use extsort.
qsv sort --select case_id this_week.csv > this_sorted.csv
qsv sort --select case_id last_week.csv > last_sorted.csv

4. Diff

qsv diff --select case_id last_sorted.csv this_sorted.csv > delta.csv
qsv count delta.csv
# 142 rows of changes

The output CSV has a diffresult column with values Add, Remove, or Modify, plus a side-by-side view of changed fields.

5. Split the diff by type for human review

qsv search --select diffresult '^Add$' delta.csv > added.csv
qsv search --select diffresult '^Remove$' delta.csv > removed.csv
qsv search --select diffresult '^Modify$' delta.csv > modified.csv

wc -l added.csv removed.csv modified.csv
# Sign-off-ready triple

6. Add a fingerprint column to every input for downstream lineage

qsv blake3 -l 8 this_week.csv | head -c 16 > this_week.fingerprint
echo "Recorded fingerprint $(cat this_week.fingerprint) for this_week.csv"

Short BLAKE3 hashes (8 bytes hex = 16 chars) are great as cache keys and lineage markers.

Variations

Different delimiters on each side

qsv diff --delimiter-left $'\t' --delimiter-right ',' \
  --select id \
  vendor_export.tsv our_canonical.csv

Headers on one side, none on the other

qsv diff --no-headers-right \
  --select 1 \
  with_header.csv legacy_no_headers.csv

Composite primary key

qsv diff --select 'case_id,fiscal_year' last.csv this.csv > delta.csv

Validate schema as part of the audit

qsv validate this_week.csv contract.schema.json
if [ -f this_week.csv.invalid.csv ]; then
  echo "::error::Schema validation failed; see this_week.csv.validation-errors.tsv"
  exit 1
fi

See Recipe: JSON Schema Validation.

Larger-than-RAM diff prerequisites

# Files > RAM: use ext-* commands
qsv extsort --select case_id this_week.csv > this_sorted.csv   # multithreaded
qsv extdedup --select case_id this_sorted.csv > this_unique.csv  # on-disk hash
qsv diff --select case_id last_unique.csv this_unique.csv > delta.csv

See Recipe: Larger-than-RAM CSV.

Email a change summary

cat <<EOF | qsv clipboard --save
Weekly diff summary for $(date)
  Added: $(qsv count added.csv)
  Removed: $(qsv count removed.csv)
  Modified: $(qsv count modified.csv)
  Fingerprint: $(qsv blake3 this_week.csv | awk '{print $1}')
EOF

Performance notes

diff benchmark: 1M × 9 columns in < 600 ms on an M2 Pro.
blake3 is mmap-backed and multithreaded — instant for files of any size.
extsort and extdedup are both multithreaded and stream to disk — good for files orders of magnitude larger than RAM.
For very stable files where rows almost never change, the BLAKE3 step alone usually short-circuits the whole pipeline. For files where rows change often, the full pipeline runs in a few seconds for typical (single-digit-million-row) weekly datasets.

Recipe Diff and Audit

Recipe: Diff & Audit

Problem

Data

Solution

1. Fingerprint-level "did anything change at all?"

2. Verify primary key uniqueness on both sides

3. Make sure both sides are sorted (or sort them)

4. Diff

5. Split the diff by type for human review

6. Add a fingerprint column to every input for downstream lineage

Variations

Different delimiters on each side

Headers on one side, none on the other

Composite primary key

Validate schema as part of the audit

Larger-than-RAM diff prerequisites

Email a change summary

Performance notes

See also

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Get Started

Command Reference

Cookbook

Tuning & Internals

Ecosystem

Reference