Skip to content

Recipe Diff and Audit

Joel Natividad edited this page May 13, 2026 · 2 revisions

Recipe: Diff & Audit

Tier: Intermediate Commands used: blake3, sortcheck, extsort, diff, extdedup, validate Anchor dataset: weekly regulatory CSV exports (any periodic dataset works)

Problem

You receive a CSV every week from a partner (a vendor list, a regulatory filing, a sensor export). You need to know — quickly:

  • Did anything change since last week? (fingerprint check)
  • If yes, what changed? (row-level diff: added, removed, modified)
  • Is the new file internally consistent? (primary-key uniqueness, sort order, no encoding glitches)
  • Can the change be explained and signed off by humans?

qsv's diff does the row-level work in < 600 ms on 1M × 9 columns. blake3 handles the fingerprint level. sortcheck + extsort + extdedup handle the preconditions.

Data

# Place this week's and last week's files side by side
ls this_week.csv last_week.csv
qsv sniff this_week.csv

For the recipe to work, both files must have a unique primary key column (or composite of columns). Examples: a contract number, a parcel ID, a Case Enquiry ID.

Solution

1. Fingerprint-level "did anything change at all?"

THIS=$(qsv blake3 this_week.csv | awk '{print $1}')
LAST=$(qsv blake3 last_week.csv | awk '{print $1}')
if [ "$THIS" = "$LAST" ]; then
  echo "No change — same content hash."
  exit 0
fi
echo "Content changed; running row-level diff…"

BLAKE3 is multithreaded and mmap-backed — instant for files of any size. Use this as the gate for the rest of the pipeline.

2. Verify primary key uniqueness on both sides

qsv extdedup --select case_id this_week.csv --no-output 2>&1 | grep -E '\\b(0|[1-9])\\b'
qsv extdedup --select case_id last_week.csv --no-output 2>&1 | grep -E '\\b(0|[1-9])\\b'

extdedup --no-output prints the duplicate count to stderr. A non-zero count means the primary key isn't unique and diff will refuse to run.

3. Make sure both sides are sorted (or sort them)

qsv sortcheck --select case_id this_week.csv && echo "this_week sorted"
qsv sortcheck --select case_id last_week.csv && echo "last_week sorted"

# If not sorted, sort them (in-memory). For files > RAM, use extsort.
qsv sort --select case_id this_week.csv > this_sorted.csv
qsv sort --select case_id last_week.csv > last_sorted.csv

4. Diff

qsv diff --select case_id last_sorted.csv this_sorted.csv > delta.csv
qsv count delta.csv
# 142 rows of changes

The output CSV has a diffresult column with values Add, Remove, or Modify, plus a side-by-side view of changed fields.

5. Split the diff by type for human review

qsv search --select diffresult '^Add$' delta.csv > added.csv
qsv search --select diffresult '^Remove$' delta.csv > removed.csv
qsv search --select diffresult '^Modify$' delta.csv > modified.csv

wc -l added.csv removed.csv modified.csv
# Sign-off-ready triple

6. Add a fingerprint column to every input for downstream lineage

qsv blake3 -l 8 this_week.csv | head -c 16 > this_week.fingerprint
echo "Recorded fingerprint $(cat this_week.fingerprint) for this_week.csv"

Short BLAKE3 hashes (8 bytes hex = 16 chars) are great as cache keys and lineage markers.

Variations

Different delimiters on each side

qsv diff --delimiter-left $'\t' --delimiter-right ',' \
  --select id \
  vendor_export.tsv our_canonical.csv

Headers on one side, none on the other

qsv diff --no-headers-right \
  --select 1 \
  with_header.csv legacy_no_headers.csv

Composite primary key

qsv diff --select 'case_id,fiscal_year' last.csv this.csv > delta.csv

Validate schema as part of the audit

qsv validate this_week.csv contract.schema.json
if [ -f this_week.csv.invalid.csv ]; then
  echo "::error::Schema validation failed; see this_week.csv.validation-errors.tsv"
  exit 1
fi

See Recipe: JSON Schema Validation.

Larger-than-RAM diff prerequisites

# Files > RAM: use ext-* commands
qsv extsort --select case_id this_week.csv > this_sorted.csv   # multithreaded
qsv extdedup --select case_id this_sorted.csv > this_unique.csv  # on-disk hash
qsv diff --select case_id last_unique.csv this_unique.csv > delta.csv

See Recipe: Larger-than-RAM CSV.

Email a change summary

cat <<EOF | qsv clipboard --save
Weekly diff summary for $(date)
  Added: $(qsv count added.csv)
  Removed: $(qsv count removed.csv)
  Modified: $(qsv count modified.csv)
  Fingerprint: $(qsv blake3 this_week.csv | awk '{print $1}')
EOF

Performance notes

  • diff benchmark: 1M × 9 columns in < 600 ms on an M2 Pro.
  • blake3 is mmap-backed and multithreaded — instant for files of any size.
  • extsort and extdedup are both multithreaded and stream to disk — good for files orders of magnitude larger than RAM.
  • For very stable files where rows almost never change, the BLAKE3 step alone usually short-circuits the whole pipeline. For files where rows change often, the full pipeline runs in a few seconds for typical (single-digit-million-row) weekly datasets.

See also

Clone this wiki locally