v1.1.0 - Major Feature Release: 23 New Data Processing Commands

@ivbeg ivbeg released this 18 Jan 08:01

🎉 Major Feature Release

This release adds 23 new data processing commands across three phases, along with major improvements to schema generation, statistics performance, and database ingestion.

✨ New Commands

Phase 1 - Fundamental Data Processing (7 commands):

  • count - Count rows with DuckDB optimization
  • table - Pretty-print data as aligned table
  • head - Extract first N rows
  • tail - Extract last N rows
  • enum - Add row numbers, UUIDs, or constants
  • reverse - Reverse row order
  • fixlengths - Normalize field counts
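To make the Phase 1 behavior concrete, here is a minimal Python sketch of what head, tail, enum, and fixlengths do to a list of rows. It is an illustration only, not the tool's implementation: the function names, the empty-string padding value, and the row numbering starting at 1 are all assumptions.

```python
from collections import deque
from itertools import islice


def head(rows, n):
    """Return the first n rows of any iterable of rows."""
    return list(islice(rows, n))


def tail(rows, n):
    """Return the last n rows while holding at most n rows in memory."""
    return list(deque(rows, maxlen=n))


def enumerate_rows(rows, start=1):
    """Prepend a row number to every row (numbering from 1 is an assumption)."""
    return [[str(i)] + list(row) for i, row in enumerate(rows, start)]


def fix_lengths(rows, fill=""):
    """Pad short rows so every row has as many fields as the longest one."""
    rows = [list(row) for row in rows]
    width = max((len(row) for row in rows), default=0)
    return [row + [fill] * (width - len(row)) for row in rows]


sample_rows = [["a", "b", "c"], ["1", "2"], ["3"]]
padded = fix_lengths(sample_rows)
numbered = enumerate_rows(padded)
```

The tail helper uses a bounded deque so that a long input is streamed rather than loaded in full.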

Phase 2 - Data Cleaning & Transformation (9 commands):

  • sort - Sort by columns with numeric/descending options
  • sample - Random sampling (fixed count or percentage)
  • search - Regex-based search and filtering
  • dedup - Remove duplicates with key-field options
  • fill - Fill empty/null values with strategies
  • rename - Rename fields by mapping or regex
  • explode - Split columns by separator
  • replace - String replacement with regex support
  • cat - Concatenate files by rows or columns
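As a rough illustration of two Phase 2 operations, the sketch below shows key-based deduplication and one possible fill strategy in plain Python. The release notes do not name the available strategies, so forward fill is an assumption here, as are the function names and the choice to keep the first row seen for each key.

```python
def dedup(rows, key_fields=None):
    """Drop duplicate rows, keeping the first occurrence.

    With key_fields, two rows are duplicates when those fields match;
    without it, the whole row is compared.
    """
    seen = set()
    unique = []
    for row in rows:
        if key_fields:
            key = tuple(row[field] for field in key_fields)
        else:
            key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_forward(rows, field):
    """Replace empty or null values in one field with the last non-empty value."""
    last = None
    filled = []
    for row in rows:
        value = row.get(field)
        if value in ("", None):
            if last is not None:
                row = {**row, field: last}  # copy, so the input is not mutated
        else:
            last = value
        filled.append(row)
    return filled


records = [
    {"id": "1", "city": "Oslo"},
    {"id": "1", "city": ""},
    {"id": "2", "city": ""},
]
unique_by_id = dedup(records, key_fields=["id"])
filled = fill_forward(records, "city")
```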

Phase 3 - Advanced Data Processing (7 commands):

  • join - Relational joins (inner, left, right, full outer)
  • diff - Compare files and show differences
  • exclude - Remove rows based on keys
  • transpose - Swap rows and columns
  • sniff - Detect file properties
  • slice - Extract rows by range or index
  • fmt - Reformat CSV with formatting options
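The four join types listed for join differ only in which unmatched rows they keep. The following sketch shows that difference with a simple hash join over lists of dictionaries. It is a hypothetical illustration, not the command's implementation: the function name, the `how` values, and filling missing fields with None are assumptions.

```python
def join_rows(left, right, key, how="inner"):
    """Join two lists of dict rows on a shared key field."""
    if how not in {"inner", "left", "right", "full"}:
        raise ValueError(f"unsupported join type: {how}")

    left_fields = {field for row in left for field in row}
    right_fields = {field for row in right for field in row}

    # Index the right side by key so each left row is matched in O(1).
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    matched_keys = set()
    result = []
    for lrow in left:
        matches = index.get(lrow[key], [])
        if matches:
            matched_keys.add(lrow[key])
            for rrow in matches:
                # On a field name collision the right-hand value wins.
                result.append({**lrow, **rrow})
        elif how in ("left", "full"):
            # Keep the unmatched left row, with right-only fields set to None.
            result.append({**dict.fromkeys(right_fields), **lrow})

    if how in ("right", "full"):
        for rrow in right:
            if rrow[key] not in matched_keys:
                # Keep the unmatched right row, with left-only fields set to None.
                result.append({**dict.fromkeys(left_fields), **rrow})
    return result


people = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
scores = [{"id": 2, "score": 10}, {"id": 3, "score": 20}]
inner = join_rows(people, scores, "id", how="inner")
full = join_rows(people, scores, "id", how="full")
```

An inner join keeps only id 2, a left join adds id 1, a right join adds id 3, and a full outer join keeps all three.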

🚀 Performance Improvements

  • Stats Command: 10-100x faster with DuckDB engine for CSV, JSONL, JSON, and Parquet files
  • DuckDB Integration: Automatic engine selection for optimal performance
  • Batch Operations: Improved performance with write_bulk() for large datasets

📋 Schema Improvements

  • Format Exports: Support for JSON Schema, Avro, Parquet, and Cerberus formats
  • Full Output Support: Text, JSON, and YAML output formats now work correctly
  • AI Documentation: Working AI-powered field descriptions with provider selection
  • Record Counting: Statistics now include record counts in schema output

🗄️ Database Ingestion

  • MySQL Support: Auto-create table, upsert, and batch operations
  • SQLite Support: File and in-memory databases with PRAGMA optimizations
  • Improved Performance: Better support for PostgreSQL, DuckDB, MongoDB, and Elasticsearch
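For readers unfamiliar with the pattern, the sketch below shows what batched ingestion into SQLite with PRAGMA tuning can look like using Python's standard sqlite3 module. It is not the ingester's code: the table name, its columns, the specific PRAGMA settings, and the use of INSERT OR REPLACE as an upsert are all assumptions chosen for illustration.

```python
import sqlite3


def ingest(rows, db_path=":memory:", batch_size=1000):
    """Insert (id, name) tuples in batches and return the final row count."""
    conn = sqlite3.connect(db_path)
    try:
        # Trade durability for speed during bulk loads (assumed settings).
        conn.execute("PRAGMA synchronous = OFF")
        conn.execute("PRAGMA journal_mode = MEMORY")
        # Create the target table on first use.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, name TEXT)"
        )
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            with conn:  # one transaction per batch, committed on success
                conn.executemany(
                    "INSERT OR REPLACE INTO records (id, name) VALUES (?, ?)",
                    batch,
                )
        return conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
    finally:
        conn.close()


row_count = ingest([(1, "a"), (2, "b"), (1, "c")], batch_size=2)
```

Because the third tuple reuses id 1, it replaces the first row instead of adding a new one.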

🔄 Migration & Deprecations

  • Iterabledata Migration: All commands now use external iterabledata library
  • Resource Management: Improved cleanup with try/finally blocks
  • Deprecated: Local IterableData and DataWriter classes (use open_iterable() instead)
  • Deprecated: scheme command (use schema --format cerberus instead)
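The try/finally cleanup mentioned above follows a standard Python pattern, sketched here with the built-in csv module rather than the iterabledata API. The function name and the CSV input are assumptions; the point is only that the handle is closed even if reading raises an error.

```python
import csv


def count_rows(path):
    """Count data rows in a CSV file, always closing the file handle."""
    handle = open(path, newline="", encoding="utf-8")
    try:
        reader = csv.DictReader(handle)  # first line is treated as the header
        return sum(1 for _ in reader)
    finally:
        # Runs on success and on error, so the handle is never leaked.
        handle.close()
```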

🐛 Bug Fixes

  • Fixed resource leaks in statistics, textproc, and ingester commands
  • Fixed schema command output format options being ignored
  • Fixed schema command AI documentation not working
  • Fixed missing record counting in schema output

Full Changelog: v1.0.18...v1.1.0