Batch data format converter — stream files over 10GB between CSV, JSON, Parquet, YAML, Avro, and more.
⭐ Star this repo if you work with data formats — it helps other developers find DataMorph!
Part of the Revenue Holdings developer tool ecosystem.
Every data engineer writes throwaway scripts to convert CSV to Parquet, JSON to YAML, or Avro to JSON. DataMorph replaces them with one CLI command. It streams row by row, so files over 10GB never have to fit in memory. It infers schemas automatically, so you don't hand-write them. And it validates data against schemas in CI, so malformed data is caught before it reaches production.
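To make the row-by-row idea concrete, here is a minimal sketch of streaming conversion that uses only the Python standard library. It is not DataMorph's implementation: the function names `stream_rows` and `csv_to_jsonl` are hypothetical, and CSV to JSONL was picked only because it needs no third-party packages.

```python
import csv
import io
import json
from typing import Dict, Iterator, TextIO


def stream_rows(source: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one CSV row at a time as a dict (hypothetical helper)."""
    # csv.DictReader reads lazily, so only the current row is held in memory.
    yield from csv.DictReader(source)


def csv_to_jsonl(source: TextIO, sink: TextIO) -> int:
    """Write each CSV row as one JSON line; return the number of rows written."""
    count = 0
    for row in stream_rows(source):
        sink.write(json.dumps(row) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # In-memory demo; real use would pass open file handles instead.
    demo_in = io.StringIO("id,name\n1,Ada\n2,Grace\n")
    demo_out = io.StringIO()
    print(csv_to_jsonl(demo_in, demo_out))
    print(demo_out.getvalue(), end="")
```

Because the reader and writer only ever touch one row, memory use stays roughly constant regardless of input size. Columnar targets such as Parquet would need rows buffered into batches before writing, which this sketch does not attempt.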
Before DataMorph:
# One-off script #47 you'll never find again
import pandas as pd
df = pd.read_csv('huge_file.csv') # OOM on 5GB+
df.to_parquet('output.parquet')

After DataMorph:
datamorph convert huge_file.csv output.parquet # Streams, no OOM

| Who | What they do with DataMorph |
|---|---|
| Data engineers | Convert CSV exports to Parquet for Athena/BigQuery analytics |
| Backend devs | Transform JSON API responses to CSV for reporting |
| DevOps teams | Validate data file schemas in CI before deployment |
| ML engineers | Batch-convert training data between formats (JSONL → Parquet) |
| QA teams | Inspect and validate data file schemas without writing code |
- 6+ format pairs: CSV ↔ JSON ↔ JSONL ↔ YAML ↔ Parquet ↔ Avro ↔ Protobuf
- Streaming: Row-by-row processing for files >10GB — never OOM
- Schema inference: Auto-detect field types from data, export as JSON
- Schema validation: Check data files against expected schemas (CI-friendly, exit codes)
- Batch mode: Convert entire directories at once with --recursive
- CLI commands: convert, batch, schema, validate, formats
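The schema inference and validation features listed above can be sketched in a few lines of standard-library Python. This is an illustration of the general technique, not DataMorph's code: the names `infer_type`, `infer_schema`, `validate`, and `exit_code` are hypothetical, and the three type labels (`integer`, `number`, `string`) are an assumption, so the schema JSON that DataMorph actually exports may look different.

```python
import csv
import io
import sys
from typing import Dict, Iterable, List, Optional

# Widening order (assumed): every integer is a number, anything can be a string.
_RANK = {"integer": 0, "number": 1, "string": 2}


def infer_type(value: Optional[str]) -> str:
    """Classify one CSV cell by trying progressively looser parsers."""
    for type_name, cast in (("integer", int), ("number", float)):
        try:
            cast(value)
            return type_name
        except (TypeError, ValueError):
            continue
    return "string"


def infer_schema(rows: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Infer one type per column, widening when rows disagree."""
    schema: Dict[str, str] = {}
    for row in rows:
        for field, value in row.items():
            seen = infer_type(value)
            current = schema.get(field, seen)
            schema[field] = max(current, seen, key=_RANK.__getitem__)
    return schema


def validate(rows: Iterable[Dict[str, str]], schema: Dict[str, str]) -> List[str]:
    """Return human-readable mismatches; an empty list means the data is valid."""
    errors: List[str] = []
    # Line 1 is the header; assumes one record per physical line.
    for line_no, row in enumerate(rows, start=2):
        for field, expected in schema.items():
            value = row.get(field)
            if value is None:
                errors.append(f"line {line_no}: missing field {field!r}")
            elif _RANK[infer_type(value)] > _RANK[expected]:
                errors.append(f"line {line_no}: {field!r} should be {expected}, got {value!r}")
    return errors


def exit_code(csv_text: str, schema: Dict[str, str]) -> int:
    """Mimic a CI-friendly check: 0 when valid, 1 on any mismatch."""
    errors = validate(csv.DictReader(io.StringIO(csv_text)), schema)
    for message in errors:
        print(message, file=sys.stderr)
    return 1 if errors else 0


if __name__ == "__main__":
    good = "id,price\n1,9.5\n2,12\n"
    bad = "id,price\n1,9.5\nx,free\n"
    schema = infer_schema(csv.DictReader(io.StringIO(good)))
    print(schema)
    print(exit_code(bad, schema))
```

Returning a nonzero code on the first sign of drift is what lets a CI runner fail the job without parsing any output. A wrapper script would typically pass the result to `sys.exit`.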
pip (Python):
pip install git+https://github.com/Coding-Dev-Tools/datamorph.git
npm (Node.js wrapper, publishing pending):
# Not yet available; install via pip instead
Then run: datamorph --help
Homebrew (macOS/Linux):
brew tap Coding-Dev-Tools/tap
brew install datamorph
Scoop (Windows):
scoop bucket add Coding-Dev-Tools https://github.com/Coding-Dev-Tools/scoop-bucket
scoop install datamorph

# Convert a single file
datamorph convert input.csv output.parquet
datamorph convert input.json output.csv
datamorph convert input.yaml output.json
datamorph convert input.parquet output.csv
# Batch convert all files in a directory
datamorph batch ./csv_data/ ./parquet_data/ --from csv --to parquet --recursive
# Inspect schema
datamorph schema data.parquet
datamorph schema data.csv --json-output
# Validate data against a schema
datamorph validate data.csv # structural check
datamorph validate data.csv --schema schema.json # against expected schema
datamorph validate data.csv --strict --json-output # strict mode, JSON output (CI)
# Export schema for validation
datamorph schema data.csv --json-output > schema.json
# List supported formats
datamorph formats

DataMorph's validate command exits with code 1 on schema mismatches, so a failed check fails the CI job:
# GitHub Actions example
- name: Validate data schemas
  run: |
    datamorph validate data/events.csv --schema schemas/events.json --strict
    datamorph validate data/users.json --schema schemas/users.json --strict

# Generic CI: fail on schema drift
# The explicit exit 1 keeps the step failing; a bare "|| echo" would swallow the error code
datamorph validate data.csv --schema schema.json || { echo "Schema mismatch detected!"; exit 1; }

# Auto-convert data files as a build step
datamorph batch ./raw_data/ ./processed/ --from csv --to parquet

| Feature | DataMorph | pandas | csvkit | frictionless |
|---|---|---|---|---|
| CSV → Parquet | ✅ | ✅ | ❌ | ✅ |
| Streaming >10GB | ✅ | ❌ (OOM) | ❌ | ✅ |
| Schema validation | ✅ | ❌ | ❌ | ✅ |
| Batch directory convert | ✅ | ❌ | ❌ | ❌ |
| 6+ format pairs | ✅ | ✅ (with libs) | ❌ | ✅ |
| CI exit codes | ✅ | ❌ | ❌ | ✅ |
| Zero-config | ✅ | ❌ | ✅ | ❌ |
| Single binary install | ✅ | ❌ | ✅ | ❌ |
DataMorph vs pandas: by default, pandas loads the entire file into memory, and chunked reads take extra code. DataMorph streams row by row, so there is no OOM on large files and no DataFrame boilerplate.
DataMorph vs csvkit: csvkit only handles CSV. DataMorph supports 6+ formats including Parquet and Avro.
DataMorph vs frictionless: frictionless is a framework with a Python API. DataMorph is a zero-config CLI — pip install and go.
| Tier | Price | Features |
|---|---|---|
| Free | $0 | CLI only, 100 conversions/mo |
| Pro | $12/mo | Unlimited conversions, streaming, batch mode, all formats |
| Suite | $49/mo ($39/mo annual) | All 11 Revenue Holdings tools |
Get a license key at devforge.dev/pricing.
Part of Revenue Holdings — a suite of 11 developer CLI tools built by autonomous AI agents. Also check out ConfigDrift (config drift detection), SchemaForge (ORM/schema conversion), Envault (env sync/secret rotation), API Contract Guardian (breaking change detection), APIGhost (mock servers), DeployDiff (infrastructure diffs), json2sql (JSON → SQL), click-to-mcp (CLI → MCP server), and DeadCode (dead code cleanup).
MIT — Revenue Holdings