Skip to content

Recipe Date Enrichment

Joel Natividad edited this page May 13, 2026 · 2 revisions

Recipe: Date Enrichment

Tier: Intermediate Commands used: datefmt, luau, partition, apply Anchor dataset: NYC 311 Service Requests (1M-row sample, nyc311samp.csv)

This recipe expands the original Cookbook entry. The original snippets are preserved at the bottom of Cookbook for backwards compatibility.

Problem

NYC 311 service-request records have a Created Date like 12/31/2014 11:59:45 PM. For analytics you typically want:

  • ISO 8601 normalized timestamp
  • a Quarter column (Q1, Q2, …) for fiscal-year reporting
  • a Year column for partitioning
  • a Weekday column for "is this complaint a weekday or weekend phenomenon?"
  • Turnaround Time (TAT) — how long between Created Date and Closed Date?
  • partitioned output: one file per quarter for parallel downstream processing

All of this in one qsv pipeline.

Data

curl -LO https://raw.githubusercontent.com/wiki/dathere/qsv/files/nyc311samp.csv
ls -lh nyc311samp.csv
qsv headers nyc311samp.csv

Key columns: Created Date, Closed Date, Borough, Complaint Type, Status, Resolution Action Updated Date.

For larger experiments, also try the 1M-row bundled sample: resources/test/NYC_311_SR_2010-2020-sample-1M.csv.

Solution

1. Normalize Created Date to ISO 8601

qsv datefmt 'Created Date' nyc311samp.csv > step1.csv

datefmt recognizes 19 input formats automatically. The default output is RFC 3339 (2014-12-31T23:59:45Z).

2. Add Year and Year-Month columns

qsv datefmt 'Created Date' --formatstr '%Y' --new-column Year step1.csv > step2.csv
qsv datefmt 'Created Date' --formatstr '%Y-%m' --new-column YearMonth step2.csv > step3.csv

--formatstr takes chrono strftime specifiers.

3. Add Weekday and Quarter columns

qsv datefmt 'Created Date' --formatstr '%A' --new-column Weekday step3.csv > step4.csv

%A is the full weekday name. For numeric quarter, use a small Luau script (since strftime doesn't have a quarter specifier):

-- getquarter.lua (also at docs/cookbook/lua/getquarter.lua)
local month = tonumber(string.sub(_['Created Date'], 6, 7))  -- ISO YYYY-MM
local q = math.ceil(month / 3)
return "Q" .. q
qsv luau map Quarter -x -f getquarter.lua step4.csv > step5.csv

The -x flag enables global column variables (so _['Created Date'] works).

4. Compute Turnaround Time (TAT)

# turnaroundtime.lua: returns TAT in days, or "" if Closed Date is blank
local fmt = "!yyyy-MM-dd'T'HH:mm:ssX"   -- qsv's normalized ISO
local opened = _['Created Date']
local closed = _['Closed Date']
if closed == "" then return "" end
return tostring(
  (qsv_parse_date(closed, fmt) - qsv_parse_date(opened, fmt)) / 86400
)
qsv luau map TAT -x -f turnaroundtime.lua step5.csv > step6.csv

Both scripts are bundled at docs/cookbook/lua/ — use them as starting points.

5. Partition into one file per quarter

qsv partition Quarter nyc311byqtr/ step6.csv
ls nyc311byqtr/
# Q1.csv  Q2.csv  Q3.csv  Q4.csv

Now you have four files for parallel downstream processing.

Full pipeline (one shot)

qsv datefmt 'Created Date' nyc311samp.csv \
  | qsv datefmt 'Created Date' --formatstr '%Y' --new-column Year \
  | qsv datefmt 'Created Date' --formatstr '%Y-%m' --new-column YearMonth \
  | qsv datefmt 'Created Date' --formatstr '%A' --new-column Weekday \
  | qsv luau map Quarter -x -f getquarter.lua \
  | qsv luau map TAT -x -f turnaroundtime.lua \
  > nyc311_enriched.csv

# Then partition
qsv partition Quarter nyc311byqtr/ nyc311_enriched.csv

Variations

Average TAT per Borough, in Brooklyn only

qsv search --select Borough -i brooklyn nyc311samp.csv \
  | qsv luau map TAT -x -f turnaroundtime.lua \
  | tee brooklyn-311-details.csv \
  | qsv stats --everything > brooklyn-tat-stats.csv

tee writes the intermediate to a file and pipes the same data to stats for aggregation.

Cast all *_date columns to ISO 8601 in one pass

qsv datefmt '/(?i)_date$/' nyc311samp.csv > all_dates_iso.csv

The regex column selector matches any column whose name ends in _date (case-insensitive).

Use a SQL-style aggregation instead

qsv sqlp nyc311samp.csv \
  "SELECT
     strftime('%Y-Q', \"Created Date\") || ((strftime('%m', \"Created Date\") - 1) / 3 + 1) AS quarter,
     Borough,
     COUNT(*) AS complaints,
     AVG(julianday(\"Closed Date\") - julianday(\"Created Date\")) AS avg_tat_days
   FROM nyc311samp
   WHERE \"Closed Date\" != ''
   GROUP BY quarter, Borough
   ORDER BY quarter, complaints DESC"

(Adjust strftime syntax for Polars SQL or DuckDB depending on the engine; see SQL & Polars.)

Filter business-hours-only events

qsv luau filter \
  "local h = tonumber(string.sub(_['Created Date'], 12, 13)); return h >= 9 and h <= 17" \
  nyc311_enriched.csv > business_hours.csv

Performance notes

  • datefmt is multithreaded (🚀 in the README legend) — no index required.
  • luau is single-process but the interpreter is fast (millions of rows/sec on simple expressions).
  • On the full 27M-row NYC 311 export, the whole pipeline above takes ~2 minutes on an M2 Pro with an index.
  • For a no-index NYC 311 1M-sample run: ~6 seconds.

See also

Clone this wiki locally