A deterministic CLI that profiles every column of a CSV/TSV file and lints it for data-quality problems — ragged rows, type drift, missing values, duplicates, mixed date formats, numeric outliers, and optional schema violations — with a quality score, A–F grade and JSON/Markdown reports.
datalint reads your CSV/TSV files, infers each column's type, profiles the
data, and reports every quality issue that would trip up an import or analysis —
100% locally, no API key, no server, and no dependency on a data library
(the CSV parser is hand-rolled).
CSV is the universal data format, and it's almost always messy. A file that "looks fine" in a spreadsheet hides:
- Ragged rows: an unescaped comma silently shifts every column after it.
- Type drift: a number column with a stray `N/A`, `—`, or `1.2.3`.
- Mixed date formats: `2024-01-05` next to `01/06/2024` (which is which?).
- Missing values, duplicates, stray whitespace, inconsistent casing (`US` vs `us`), and outliers that are really data-entry errors.
Eyeballing this doesn't scale, and feeding a 50k-row file to an LLM gets you a confident-but-wrong summary. You want a deterministic, repeatable audit you can run on every export and gate in CI. That's datalint.
- 🧱 Dependency-free CSV/TSV parser — RFC 4180 quotes, embedded newlines, escaped quotes, CRLF/LF, plus automatic delimiter detection.
- 🔎 Column profiling — inferred type, empty rate, distinct count, min/max/mean, and top values for every column.
- 🚦 12 built-in checks — ragged rows, duplicate/empty headers, empty columns/rows, missing values, type drift, whitespace, mixed date formats, inconsistent casing, duplicate rows, and numeric outliers (Tukey/IQR).
- 📐 Optional schema — required, type, enum, min/max, regex pattern, unique, not-null constraints per column.
- 📊 Quality score + A–F grade, per file and overall.
- 📄 JSON & Markdown export, colored console output, CI gate exit codes.
- ⚙️ Config file, custom delimiter, headerless mode, per-rule severities.
- 🔒 Runs entirely offline. Nothing is uploaded.
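
To make the column-profiling bullet concrete, here is a minimal sketch of how such per-column statistics could be computed. The `ColumnProfile` shape and the `profileColumn` helper are hypothetical names for illustration only; the output of datalint itself may be structured differently.

```typescript
// Hypothetical profile shape; not taken from datalint's source.
interface ColumnProfile {
  emptyRate: number;
  distinct: number;
  min?: number;
  max?: number;
  mean?: number;
  top: Array<[string, number]>;
}

function profileColumn(cells: string[], topN = 3): ColumnProfile {
  // Whitespace-only cells are treated as empty (an assumption of this sketch).
  const filled = cells.map((c) => c.trim()).filter((c) => c !== "");
  const counts = new Map<string, number>();
  for (const value of filled) counts.set(value, (counts.get(value) ?? 0) + 1);

  // Most frequent values first; ties keep first-seen order (sort is stable).
  const top = Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN);

  const profile: ColumnProfile = {
    emptyRate: cells.length === 0 ? 0 : (cells.length - filled.length) / cells.length,
    distinct: counts.size,
    top,
  };

  // Numeric stats only when every non-empty cell parses as a finite number.
  const nums = filled.map(Number);
  if (nums.length > 0 && nums.every((n) => Number.isFinite(n))) {
    profile.min = Math.min(...nums);
    profile.max = Math.max(...nums);
    profile.mean = nums.reduce((sum, n) => sum + n, 0) / nums.length;
  }
  return profile;
}

const amount = profileColumn(["10", "20", "", "20"]);
```

Here `amount` describes a four-cell column with one empty cell and two distinct values.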
# run without installing
npx @didrod2539/datalint scan data.csv
# or install
npm install -g @didrod2539/datalint # global CLI (provides `datalint`)
npm install -D @didrod2539/datalint # project dev-dependency (for CI)
Node ≥ 18. ESM + CJS + TypeScript types.
datalint scan data.csv
data.csv 42/100 (F) 12 rows × 8 cols · comma
• id integer · 11 distinct
• email email · 12 distinct
• country string · 5 distinct
• signup_date date · 11 distinct
• amount decimal · 10 distinct
• note string · 3 distinct 75% empty
✗ 1 row(s) have a different column count than the header (8)
✗ Duplicate header "email" (columns 3 and 4)
⚠ Column "note" is 75.0% empty (9/12)
⚠ Column "amount" looks decimal but 1 value(s) don't match
⚠ Column "signup_date" mixes 2 date formats
⚠ 1 duplicate row(s)
ℹ Column "country" has 1 value(s) that differ only by case
Overall 42/100 (F) 1 file(s), 12 row(s), 2 error(s), 4 warning(s), 1 info
datalint scan [...targets] # analyze CSV/TSV files or directories
datalint report <input.json> # re-render a saved JSON report as Markdown
datalint init # scaffold datalint.config.json (with a schema)
datalint --help
datalint --version
`scan` options:

| Option | Description |
|---|---|
| `--config <file>` | Path to a config file (otherwise auto-detected) |
| `--delimiter <char>` | `,`, `\t`, `;`, `\|`, or `auto` (default) |
| `--no-header` | Treat the first row as data (synthesize column names) |
| `--json <file>` | Write a JSON report |
| `--md <file>` | Write a Markdown report |
| `--min-score <n>` | Exit non-zero if the overall score < n (CI gate) |
| `--quiet` | Hide info-level issues in the console |

Point `scan` at a directory and it recursively finds every `*.csv`, `*.tsv`, and `*.txt` file.
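
For readers curious how an `auto` delimiter setting can work, the sketch below shows one plausible heuristic: choose the candidate that appears the same non-zero number of times on every sampled line. This is an assumption for illustration, not datalint's actual algorithm, and `detectDelimiter` and `countOutsideQuotes` are hypothetical names.

```typescript
const CANDIDATES = [",", "\t", ";", "|"];

// Count occurrences of `delimiter` in one line, skipping quoted sections.
function countOutsideQuotes(line: string, delimiter: string): number {
  let inQuotes = false;
  let count = 0;
  for (const ch of line) {
    if (ch === '"') inQuotes = !inQuotes; // an escaped "" toggles twice, so state is preserved
    else if (ch === delimiter && !inQuotes) count++;
  }
  return count;
}

// Pick the candidate with a consistent, non-zero count across the sampled
// lines, preferring the one that yields more columns; fall back to comma.
// Simplification: lines are split naively, so quoted embedded newlines
// in the sample are not handled.
function detectDelimiter(text: string, sampleLines = 10): string {
  const lines = text
    .split(/\r?\n/)
    .filter((l) => l.length > 0)
    .slice(0, sampleLines);
  let best = ",";
  let bestCount = 0;
  for (const candidate of CANDIDATES) {
    const counts = lines.map((l) => countOutsideQuotes(l, candidate));
    const first = counts[0] ?? 0;
    const consistent = counts.every((c) => c === first);
    if (consistent && first > bestCount) {
      best = candidate;
      bestCount = first;
    }
  }
  return best;
}
```

Because quoted sections are skipped, a comma inside `"x,y"` does not count toward the comma candidate in a tab-separated file.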
Full reports for the bundled sample files are in
examples/sample-report.md and
examples/sample-report.json.
📸 Screenshot / demo GIF placeholder: `./docs/screenshot.png` (record the terminal running `npx @didrod2539/datalint scan examples/messy.csv`).
Create datalint.config.json (or run datalint init):
{
"delimiter": "auto",
"hasHeader": true,
"maxEmptyRate": 0.1,
"enumThreshold": 20,
"outlierIqrFactor": 1.5,
"minScore": 80,
"disableRules": [],
"ruleSeverity": { "inconsistent-case": "warning" },
"schema": [
{ "name": "id", "type": "integer", "required": true, "unique": true },
{ "name": "email", "type": "email", "notNull": true },
{ "name": "amount", "type": "decimal", "min": 0, "max": 100000 },
{ "name": "country", "enum": ["US", "CA", "UK"] }
]
}

| Field | Meaning |
|---|---|
| `delimiter` | `"auto"` or a literal delimiter |
| `hasHeader` | Whether row 1 is a header |
| `maxEmptyRate` | Warn columns above this empty rate (0–1) |
| `enumThreshold` | Max distinct values for casing checks to apply |
| `outlierIqrFactor` | Tukey IQR multiplier (1.5 default; 0 disables outliers) |
| `minScore` | CI gate threshold (overridable with `--min-score`) |
| `disableRules` | Rule ids to turn off |
| `ruleSeverity` | Override severity per rule id |
| `schema` | Optional per-column constraints |

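
To show what `outlierIqrFactor` controls, here is a minimal sketch of Tukey's IQR fences: a value is an outlier when it falls below Q1 − k·IQR or above Q3 + k·IQR, where k is the factor. The quantile convention (linear interpolation) and the minimum sample size of 4 are assumptions of this sketch; datalint may compute quartiles differently. `findOutliers` and `quantile` are hypothetical names.

```typescript
// Linear-interpolated quantile of an ascending-sorted, non-empty array
// (one common convention; others exist).
function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

function findOutliers(values: number[], iqrFactor = 1.5): number[] {
  // A factor of 0 disables the check, mirroring the config table above.
  // The length guard is an assumption: quartiles of tiny samples are not meaningful.
  if (iqrFactor <= 0 || values.length < 4) return [];
  const sorted = [...values].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  const lower = q1 - iqrFactor * iqr;
  const upper = q3 + iqrFactor * iqr;
  return values.filter((v) => v < lower || v > upper);
}

const flagged = findOutliers([10, 12, 11, 13, 12, 500]);
```

For this input the quartiles are 11.25 and 12.75, so the fences sit at 9 and 15 and only `500` is flagged. Raising the factor widens the fences and flags fewer values.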
Rule ids: ragged-rows, duplicate-headers, empty-column, empty-row,
missing-values, type-drift, whitespace, mixed-date-formats,
inconsistent-case, duplicate-rows, outliers, and schema-*.
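
As an illustration of how a rule and its config knob can interact, the sketch below models an `inconsistent-case`-style check gated by `enumThreshold`: casing is only compared when the column has few enough distinct values to look like an enumeration. The function name `findCaseVariants` and the exact gating logic are assumptions, not datalint's implementation.

```typescript
// Returns groups of values that differ only by letter case, or [] when the
// column has more distinct values than `enumThreshold`.
function findCaseVariants(cells: string[], enumThreshold = 20): string[][] {
  const distinct = Array.from(new Set(cells.filter((c) => c !== "")));
  if (distinct.length > enumThreshold) return []; // free text, not an enum-like column

  // Group distinct values by their lowercased form.
  const groups = new Map<string, string[]>();
  for (const value of distinct) {
    const key = value.toLowerCase();
    const group = groups.get(key) ?? [];
    group.push(value);
    groups.set(key, group);
  }
  return Array.from(groups.values()).filter((g) => g.length > 1);
}

const variants = findCaseVariants(["US", "us", "CA", "UK", "US"]);
```

With the default threshold, `variants` contains the single group `["US", "us"]`. A high-cardinality column such as free-text notes would exceed the threshold and be skipped.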
- Gate a data pipeline in CI. Add `datalint scan ./exports --min-score 85` to your workflow. A nightly export that arrives with shifted columns or a broken date format fails the build instead of corrupting downstream tables.
- Vet a file before import. Before loading a vendor/marketing CSV into your warehouse, run `datalint scan leads.csv --md audit.md` and fix what it finds.
- Profile an unfamiliar dataset. Run `datalint scan dataset.csv` to instantly see each column's type, null rate, distinct count and ranges: a fast EDA pass without spinning up a notebook.
import { promises as fs } from "node:fs";
import { analyze, buildReport, toMarkdown } from "@didrod2539/datalint";
// Assumption: `content` is the raw file text, read here as UTF-8.
const content = await fs.readFile("data.csv", "utf8");
const ds = analyze({ source: "data.csv", content });
console.log(ds.score, ds.grade, ds.profiles, ds.issues);
const report = buildReport([ds], { version: "0.1.0" });
await fs.writeFile("report.md", toMarkdown(report));

- Excel (`.xlsx`) and Parquet input.
- Cross-file referential checks (foreign keys across CSVs).
- A `--fix` mode to auto-trim whitespace and normalize obvious issues.
- An HTML report with charts.
- A GitHub Action that comments data-quality results on PRs.
- Streaming mode for very large files.
Does it send my data anywhere?
No. datalint runs entirely on your machine: no API key, no telemetry, no uploads, no network calls.
Do I need to define a schema?
No. datalint is useful with zero config: it infers column types and catches drift, duplicates, missing values, etc. A schema is optional for stricter checks.
How does it parse CSV?
With a small, hand-rolled RFC 4180 parser (no external CSV library) that handles quoted fields, embedded delimiters/newlines, escaped quotes and CRLF/LF, so behavior is fully predictable. The delimiter is auto-detected or set via config.
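
The parsing rules named in this answer fit in a small state machine. The sketch below is a generic illustration of RFC 4180-style parsing, not datalint's actual parser; `parseCsv` is a hypothetical name.

```typescript
function parseCsv(text: string, delimiter = ","): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = "";
  let inQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') {
        field += '"'; // escaped quote ("") inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        field += ch; // delimiters and newlines are literal inside quotes
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === delimiter) {
      row.push(field);
      field = "";
    } else if (ch === "\n" || ch === "\r") {
      if (ch === "\r" && text[i + 1] === "\n") i++; // treat CRLF as one line break
      row.push(field);
      rows.push(row);
      row = [];
      field = "";
    } else {
      field += ch;
    }
  }
  // Flush the last record when the input does not end with a line break.
  if (field !== "" || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

const parsed = parseCsv('id,note\r\n1,"a ""quoted"", value"\r\n2,"line1\nline2"\n');
```

The sample input mixes CRLF and LF, an escaped quote, an embedded delimiter, and an embedded newline, and parses to three rows of two fields each. Rows with a different field count than the header are what a ragged-rows check would then flag.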
How are dates / types detected?
By deterministic pattern matching (src/infer.ts). Type inference is
conservative; ambiguous cells fall back to string. The date check recognizes
common ISO and slash/dot formats and flags a column that mixes more than one.
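
A minimal sketch of pattern-based date-format detection follows. The three patterns and their labels are hypothetical stand-ins; the formats datalint actually recognizes are defined in `src/infer.ts` and may differ.

```typescript
// Hypothetical format labels and patterns, for illustration only.
const DATE_PATTERNS: Array<[string, RegExp]> = [
  ["iso (YYYY-MM-DD)", /^\d{4}-\d{2}-\d{2}$/],
  ["slash (NN/NN/YYYY)", /^\d{1,2}\/\d{1,2}\/\d{4}$/],
  ["dot (NN.NN.YYYY)", /^\d{1,2}\.\d{1,2}\.\d{4}$/],
];

// Returns the label of the first matching pattern, or null if the cell
// does not look like a date (conservative: no guessing).
function dateFormatOf(cell: string): string | null {
  const value = cell.trim();
  for (const [label, pattern] of DATE_PATTERNS) {
    if (pattern.test(value)) return label;
  }
  return null;
}

// Collects the distinct date formats seen in a column.
function dateFormatsUsed(cells: string[]): string[] {
  const seen = new Set<string>();
  for (const cell of cells) {
    const format = dateFormatOf(cell);
    if (format !== null) seen.add(format);
  }
  return Array.from(seen);
}

const formats = dateFormatsUsed(["2024-01-05", "01/06/2024", "n/a"]);
```

A column would be flagged as mixed when `formats.length > 1`, as it is here. Note that the slash pattern deliberately does not decide between day-first and month-first; it only records that the slash style is present.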
Is the quality score official?
No — it's a transparent metric: each issue costs a base penalty plus an amount
scaled by how much of the data it affects, weighted by severity (src/score.ts).
Use it to track and gate quality.
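
The shape of that metric can be sketched as follows. Every constant here (the severity weights, the base penalty, the scaled penalty) is a made-up value for illustration; the real numbers live in `src/score.ts` and are not reproduced here. `qualityScore` and the `Issue` shape are likewise hypothetical.

```typescript
type Severity = "error" | "warning" | "info";

interface Issue {
  severity: Severity;
  affectedRatio: number; // share of the data affected, 0..1
}

// Hypothetical constants, chosen only to make the example concrete.
const SEVERITY_WEIGHT: Record<Severity, number> = { error: 3, warning: 2, info: 1 };
const BASE_PENALTY = 2;
const SCALED_PENALTY = 10;

function qualityScore(issues: Issue[]): number {
  let penalty = 0;
  for (const issue of issues) {
    const ratio = Math.min(1, Math.max(0, issue.affectedRatio));
    // Base cost plus a cost proportional to the affected share, weighted by severity.
    penalty += SEVERITY_WEIGHT[issue.severity] * (BASE_PENALTY + SCALED_PENALTY * ratio);
  }
  // Clamp to the 0..100 range.
  return Math.max(0, Math.round(100 - penalty));
}

const score = qualityScore([{ severity: "warning", affectedRatio: 0.5 }]);
```

Under these illustrative constants, a single warning affecting half the data costs 2 × (2 + 10 × 0.5) = 14 points, giving a score of 86.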
My valid data is being flagged — how do I silence it?
Use disableRules, ruleSeverity, maxEmptyRate, or outlierIqrFactor in the
config. Every heuristic is tunable.
Contributions welcome! Each check is a small, self-contained rule in
src/rules/. See CONTRIBUTING.md and the
Code of Conduct.
git clone https://github.com/didrod205/datalint.git
cd datalint
npm install
npm test
npm run build
node dist/cli.js scan examples/messy.csv
MIT © datalint contributors
datalint is free, MIT-licensed, and built in spare time. If it caught a bad export before it hit production, please consider supporting it:
- ⭐ Star this repo — free, and it helps others find it.
- 🍋 Sponsor via Lemon Squeezy — one-time or recurring.
Where your support goes: Excel/Parquet input, cross-file referential checks,
a --fix autoclean mode, an HTML report, a PR-commenting GitHub Action, and fast
issue responses.