Skip to content

Release 0.5

Pre-release
Pre-release

Choose a tag to compare

@dunnock dunnock released this 24 May 09:58
· 26 commits to main since this release

v0.5.0 — Cycle-9 polish

  • --sample: replaced full-file read with RowSelection-based sampling — now O(sample size), not O(file size). Massive perf win on multi-GB files.
  • --columns composes with --schema — projection now also filters schema output, not just rows.
  • Timestamp format consistency: scan-stats timestamps now match the same RFC 3339 / Z format used by --csv and --ndjson dump modes.
  • n_distinct null-adjustment semantics documented in SKILL.md (distinct count subtracts 1 when null_count > 0).
  • Exit codes unified (POSIX): user-input errors → 2 (was 3), file errors → 1. SKILL.md updated.

v0.4.0 — Cycles 6–8

  • --diff A B schema comparison; exits 0 if identical, 1 if different. Supports --json for machine-readable diff.
  • n_distinct column in --scan-stats — distinct count per column when scanning.
  • Arrow IPC decode in --kv-meta: the ARROW:schema key is now decoded as FlatBuffer schema (no more opaque base64).
  • Structured type fields in --diff JSON output.
  • -h/--help strings rewritten with examples and full descriptions for every flag.
  • --scan-stats timestamp quoting fix in detail mode.

v0.3.0 — Cycles 3–4

  • -d --scan-stats: when a file was written without embedded statistics (common with Arrow writers), pqls now scans the whole file to compute min/max/null counts.
  • Arrow-written file note: detail mode prints a hint when stats are absent, pointing at --scan-stats.
  • Various cycle-4 bug fixes (help exit code, diff JSON shape, scan-stats timing edge cases).

v0.2.0 — Cycle 2

First major feature expansion after v0.

  • --schema — print schema only (column name + type), no row count or path decoration.
  • --json — machine-readable JSON output for --schema, --kv-meta, --check, --partition-stats, --diff.
  • --ndjson — stream rows as newline-delimited JSON. Combines with --head N and --sample N.
  • --sample N — emit N randomly-sampled rows; requires --ndjson or --csv.
  • --columns col1,col2 — comma-separated projection list.
  • --kv-meta — print Parquet key-value metadata (writer version, custom properties).
  • SKILL.md shipped at repo root — agent contract (when to use, input/output contracts, exit codes, common patterns).
  • README: comparison table vs parquet-tools / DuckDB / fastparquet, agent usage section.

v0.1.0 — Initial release

  • Single static Linux binary, curl | sh installer.
  • Default mode: schema, row count, row groups, file size.
  • -d/--detail: per-row-group statistics with embedded min/max/null counts.
  • --csv [--head N]: dump rows as CSV.
  • -r/--recursive: walk directories; list .parquet files.
  • Hive-partitioned dataset discovery (directory mode).
  • GitHub Actions: CI on PR + push to main, release matrix on tag push (x86_64-unknown-linux-gnu, aarch64-unknown-linux-gnu via cross).
  • MIT OR Apache-2.0 dual license.