Knit

High-performance synthetic data generation toolkit.

Knit generates realistic, blueprint-driven synthetic datasets at scale. Define your data model in a declarative TOML or JSON blueprint, and Knit handles execution planning, deterministic generation, output formatting, and optional noise injection — all from a single CLI command.

Installation

cargo install knit

Or build from source:

git clone https://github.com/Yaming-Hub/knit.git
cd knit
cargo build --release

Features

  • Declarative blueprint language — Define entities, fields, generators, and relationships in .knit.toml files with inheritance via extends, modular composition via include, reusable field groups via mixins, and custom domain types.
  • Rich generator library — Sequences (with jitter and cyclic values), distributions (normal, uniform, Pareto, Zipf, Dirichlet, multinomial, …), patterns, UUIDs, one-of, derived expressions (63+ built-in functions with pipe |> operator), temporal generators (event streams, relative offsets), conditional distributions, correlated fields (copulas), and graph topologies.
  • Expression engine — Full arithmetic, comparison, boolean, math, string, and type-cast functions in derived field expressions with Pratt-parser evaluation and vectorized Arrow execution (a short sketch follows this list).
  • Deterministic output — Seeded RNG tree ensures identical datasets across runs for any given seed.
  • Multiple output formats — Parquet, CSV, JSON, JSONL, Arrow IPC, Avro, and SQL INSERT with configurable compression (Snappy, LZ4, Zstd).
  • Noise injection — Post-generation perturbation pipeline with 11 built-in perturbators (null injection, Gaussian noise, typos, outliers, duplicates, value drift, format corruption, swap, truncation, FK violation, temporal spikes) plus scoped/conditional noise with where predicates and missing_field at the serialization layer.
  • Time series — Composable numeric time series (trend, seasonality, AR, mean-reversion, regime-switching, holiday effects) and irregular event streams with exponential arrivals, seasonality, and business-hour filtering.
  • Graph modeling — 7 topology models (ER, Barabási–Albert, Watts–Strogatz, lattice, stochastic block, configuration, complete) with edge properties, degree distributions, self-referential hierarchies, and selection strategies.
  • Reverse engineering — Ingest existing data, profile distributions, and fit blueprints automatically (knit learn).
  • Behavioral modeling — Define actor personas with trait distributions, activity-driven row counts, temporal biases, social graphs, and conversation threading. Learn behavioral patterns from existing data with knit learn --actors.
  • Incremental learning — Process datasets larger than memory in bounded chunks with streaming statistics and persistent state files.
  • Dictionary extraction — Automatically extracts domain-specific vocabularies from eligible high-cardinality string columns for realistic text generation.
  • Foreign-key integrity — Automatic topological ordering and key stores ensure referential integrity across entities.
  • Plugin architecture — Register custom generators at runtime via the GeneratorPlugin trait, or load WASM modules dynamically with --plugin path/to/gen.wasm (requires wasm-plugins feature).
  • Scalable — Batch-oriented Arrow columnar engine with Rayon parallelism; sampled key stores for 100M+ row entities.
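
As a taste of the expression engine, here is a minimal sketch of a derived field that composes functions with the |> pipe operator. The generator keys (type = "derived", expr, depends_on) and the ${field} reference syntax are taken from the examples later in this README; the function names lower and trim are illustrative assumptions, not a confirmed list of built-ins:

[[entities.fields]]
name = "username"
data_type = "string"
[entities.fields.generator]
type = "derived"
# "lower" and "trim" are assumed function names for illustration only;
# Knit ships 63+ built-in functions for derived expressions.
expr = "${email} |> lower |> trim"
depends_on = ["email"]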

Architecture

flowchart LR
    blueprint([Blueprint TOML/JSON]) -->|blueprint| model[DataModel]
    model -->|plan| plan[ExecutionPlan]
    plan -->|gen| batches[RecordBatches]
    batches -->|noise| perturbed[Perturbed Batches]
    perturbed -->|bind| output([Parquet / CSV / JSON / Arrow])
    learn[learn] -->|ingest + profile| model
    cli[cli] --> blueprint

Quick Start

Install

# Clone and build
git clone https://github.com/Yaming-Hub/knit.git
cd knit
cargo build --release

# The binary is at target/release/knit

Create a Blueprint

Create a file called demo.knit.toml:

blueprint_version = "1.0"

[model]
name = "demo"
seed = 42

[[entities]]
name = "users"
count = 10000

[[entities.fields]]
name = "id"
data_type = "int"
primary_key = true
[entities.fields.generator]
type = "sequence"
start = 1
step = 1

[[entities.fields]]
name = "email"
data_type = "string"
[entities.fields.generator]
type = "pattern"
pattern = "user####@example.com"

[[entities.fields]]
name = "age"
data_type = "int"
[entities.fields.generator]
type = "distribution"
kind = "normal"
[entities.fields.generator.params]
mean = 35.0
std_dev = 12.0

Generate

# Validate blueprint
knit validate demo.knit.toml

# Preview execution plan
knit plan demo.knit.toml

# Generate data (default: Parquet)
knit generate demo.knit.toml -o ./data

# Generate as CSV with a specific seed
knit generate demo.knit.toml --format csv --seed 123 -o ./data

# Dry run — validate and plan without generating
knit generate demo.knit.toml --dry-run

Blueprint Composition

Build large blueprints from reusable fragments using include:

# main.knit.toml
include = ["users.knit.toml", "products.knit.toml"]

[model]
name = "my_project"
seed = 42

# Add entities specific to this blueprint
[[entities]]
name = "orders"
count = 5000
# ...

[[relationships]]
name = "orders_to_users"
from = "orders"
to = "users"
kind = "many_to_one"

Rules:

  • Fragments define entities, relationships, personas, etc. — but no [model] section (see the fragment sketch after this list)
  • Name conflicts between two included fragments are errors; a definition in the main blueprint silently overrides one from a fragment
  • Includes are recursive and diamond-safe (each file loaded at most once)
  • Security: absolute paths and .. traversal are rejected
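
For reference, a fragment is an ordinary blueprint file minus the [model] section. A minimal users.knit.toml fragment might look like this (field choices are illustrative, reusing only keys shown in the Quick Start):

# users.knit.toml: a fragment defines entities but no [model] section
[[entities]]
name = "users"
count = 10000

[[entities.fields]]
name = "id"
data_type = "int"
primary_key = true
[entities.fields.generator]
type = "sequence"
start = 1
step = 1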

See examples/modular/ for a working example.

Module Structure

Knit is published as a single crate. Internally it is organized into modules:

| Module | Description |
| --- | --- |
| knit::core | Shared types: DataModel, Entity, Field, Value, GeneratorSpec |
| knit::blueprint | TOML/JSON parsing, validation, blueprint inheritance (extends) |
| knit::plan | Compiles a DataModel into an ExecutionPlan with RNG tree |
| knit::gen | Generation engine: executes plans → Arrow RecordBatches |
| knit::noise | Post-generation perturbation pipeline (11 perturbators) |
| knit::bind | Output sinks: Parquet, CSV, JSON, JSONL, Arrow IPC, Avro, SQL |
| knit::learn | Data ingestion, profiling, distribution fitting, blueprint inference, behavioral persona discovery |
| knit::scale | Multi-dimensional scaling: actor, time, and custom categorical dimensions |
| knit::tokenize | Dataset tokenization for safe sharing (string replacement, numeric/temporal shifts) |
| knit::enrich | Enrich models with statistical knowledge from reference data samples |
| knit::model | Serialization and conversion of learned model directories |
| knit::decision | Decision logging and reporting for pipeline transparency |
| knit::cli | Binary commands: validate, plan, generate, blueprint, init, learn, inspect, completions, generators, scale, tokenize, enrich, model |

Examples

The examples/ directory contains 25+ sample blueprints. Highlights:

  • ecommerce.knit.toml — Users, products, orders, reviews with FK relationships
  • ecommerce_behavioral.knit.toml — Persona-driven purchasing: 4 customer segments, activity-driven orders, temporal shopping biases, review threading
  • email_traffic.knit.toml — Email messaging with sender/receiver personas
  • financial.knit.toml — Accounts and transactions with risk scoring
  • hr_org.knit.toml — Employees with behavioral personas, activity-driven tasks, manager hierarchy, and work-hour temporal biases
  • iot_sensors.knit.toml — Devices, sensor readings, and alerts with FK chains
  • server_logs.knit.toml — Servers, HTTP requests, and error logs
  • social_platform.knit.toml — Social network with actor graphs, persona-driven temporal patterns, burst sessions, and posts/comments/DMs
  • time_series_metrics.knit.toml — Composable numeric time series with trend, seasonality, AR, mean-reversion, and holiday effects
  • event_stream.knit.toml — Irregular time series with exponential arrivals
  • conditional_distribution.knit.toml — Distribution-dependent field correlations
  • vector_distributions.knit.toml — Dirichlet and multinomial distributions
  • relative_offset.knit.toml — Distribution, constant, and simple offset modes
  • scoped_noise.knit.toml — Conditional noise injection with scope predicates
  • pipe_operator.knit.toml — Expression function composition with |>
  • mixins.knit.toml — Reusable field groups across entities
  • custom_types.knit.toml — Domain type aliases
  • sequence_jitter.knit.toml — Temporal randomization with jitter offsets
  • sequence_values.knit.toml — Round-robin cyclic value sequences
  • timezone_business_hours.knit.toml — Timezone-aware event generation
  • selection_strategies.knit.toml — FK selection strategies (sequential, clustered)
  • edge_properties.knit.toml — Relationship properties on graph edges
  • hierarchy.knit.toml — Self-referential hierarchies with depth control
  • holiday_effect.knit.toml — Date-based time series spikes/dips
  • count_expressions.knit.toml — Parameterized entity counts
  • degree_distribution.knit.toml — Power-law cardinality patterns
  • nested_objects.knit.toml — Hierarchical struct fields
  • modular/ — Modular composition example: users.knit.toml and products.knit.toml fragments composed via include in ecommerce.knit.toml
  • cli_test.knit.toml — Minimal blueprint for integration testing

Generate all examples:

for schema in examples/*.knit.toml; do
  knit generate "$schema" -o "data/$(basename "$schema" .knit.toml)" --format csv
done

Reverse Engineering

Infer a blueprint from existing data and re-generate:

# Learn a blueprint from CSV files
knit learn ./my-data/ -o inferred.knit.toml

# Learn from a large dataset (sample first 10k rows per table)
knit learn ./big-data/ -o inferred.knit.toml --sample 10000

# Review and customize the inferred blueprint, then generate
knit generate inferred.knit.toml -o ./synthetic-data --format parquet

Incremental Learning

For datasets too large to fit in memory, use incremental mode to process data in chunks. Each invocation updates a persistent state file:

# Process data in batches — each call updates the state file
knit learn ./chunk1/ --state learn.state
knit learn ./chunk2/ --state learn.state
knit learn ./chunk3/ --state learn.state

# Finalize: emit blueprint from accumulated statistics
knit learn --state learn.state --finalize -o schema.knit.toml

Incremental mode uses streaming algorithms (Welford for mean/variance, HyperLogLog for cardinality, reservoir sampling for distribution fitting) so memory usage stays bounded regardless of dataset size.
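
For intuition, Welford's update (the textbook formulation, which may differ from Knit's exact code) maintains the running mean M_k and the sum of squared deviations S_k per column in constant memory:

$$M_k = M_{k-1} + \frac{x_k - M_{k-1}}{k}, \qquad S_k = S_{k-1} + (x_k - M_{k-1})(x_k - M_k)$$

After k samples the sample variance is S_k / (k - 1), so no raw values need to be retained.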

Dictionary Extraction

When learning from data, Knit automatically extracts domain-specific dictionaries for eligible high-cardinality string columns (e.g., product names, person names) that don't match a standard faker pattern:

knit learn ./products/ -o schema.knit.toml
# Creates: schema.knit.toml + products_name.dict.txt (alongside the schema)

The learned blueprint references the dictionary file, and generation draws values from it — producing output that matches the domain vocabulary of the original data. Dictionary extraction works in both batch and incremental modes. Extracted dictionaries are capped at ~10,000 entries for large vocabularies.
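
As a purely hypothetical illustration of the mechanism (the dictionary generator's type name and path key below are assumptions, not Knit's confirmed schema), the learned blueprint might reference the extracted file along these lines:

[[entities.fields]]
name = "name"
data_type = "string"
[entities.fields.generator]
# Hypothetical shape: the actual type and key names emitted by `knit learn`
# may differ. The point is that the blueprint references the .dict.txt file
# and generation samples values from it.
type = "dictionary"
path = "products_name.dict.txt"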

Behavioral Modeling

Define actor personas with distinct behavioral traits to generate data with realistic human-like patterns:

# Define behavioral segments
[[personas]]
name = "power_user"
weight = 0.15
[personas.traits]
activity_rate = 20.0    # events/month
peak_hours = 9.0        # preferred hour of day

[[personas]]
name = "casual_user"
weight = 0.85
[personas.traits]
activity_rate = 3.0
peak_hours = 20.0

# Mark an entity as an actor with persona assignment
[[entities]]
name = "users"
count = 1000
actor = true
persona_distribution = "personas"

# Activity-driven row counts (total rows = sum of per-actor trait values)
[[entities]]
name = "events"
count = 5000  # fallback estimate
[entities.activity_count]
actor_field = "user_id"
trait = "activity_rate"

# FK field linking events to their actor
[[entities.fields]]
name = "user_id"
data_type = "int"
actor_column = true

# Temporal bias — timestamps cluster around each actor's peak_hours
[[entities.fields]]
name = "created_at"
data_type = "datetime"
[entities.fields.generator]
type = "actor_temporal"
trait = "peak_hours"

# Relationship required for activity_count resolution
[[relationships]]
name = "event_user"
from = "events"
to = "users"
kind = "many_to_one"
foreign_key = "user_id"

Learn behavioral patterns from existing data:

# Infer personas and actor relationships from data
knit learn ./my-data/ --actors -o behavioral.knit.toml

# Inspect discovered behavioral structure
knit inspect behavioral.knit.toml --actors

# Generate with persona-driven realism
knit generate behavioral.knit.toml -o ./synthetic

See examples/social_platform.knit.toml and examples/ecommerce_behavioral.knit.toml for complete behavioral blueprints.

Parameterized Blueprints

Derived expressions can reference --param values using ${param.key} syntax:

[[entities.fields]]
name = "email"
data_type = "string"
[entities.fields.generator]
type = "derived"
expr = "${name}@${param.domain}"
depends_on = ["name"]

knit generate schema.knit.toml -o out/ --param domain=example.com

Unresolved params stay as literal ${param.key} in the output.

CLI Reference

knit [OPTIONS] <COMMAND>

Commands:
  validate     Parse and validate a blueprint file
  plan         Show execution plan (dry run)
  generate     Generate synthetic data
  blueprint    Blueprint manipulation (expand, normalize, diff)
  init         Create a starter blueprint
  learn        Infer blueprint from data
  inspect      Inspect state files or blueprint summaries
  generators   List available generator types
  completions  Generate shell completions
  scale        Multi-dimensional scaling of learned models (actor, time, custom dimensions)
  tokenize     Tokenize a dataset for safe sharing
  enrich       Enrich a model with statistics from reference data
  model        Convert and serialize learned model directories

Global options:
  --seed <N>            Override blueprint seed
  --format <FMT>        Output format (parquet|csv|json|jsonl|arrow|avro|sql)
  --compression <ALG>   Compression (none|snappy|gzip|lz4|zstd)
  --parallel <N>        Worker threads (0 = auto)
  --batch-size <N>      Rows per batch (default: 8192)
  --count <N|Nx>        Override row count (absolute or multiplier, e.g. 100, 0.1x, 10x)
  --param key=value     Override blueprint parameter (repeatable)
  --json                Machine-readable JSON output
  --dry-run             Validate and plan only
  --no-noise            Skip noise injection
  -q, --quiet           Suppress non-error output
  -v, --verbose         Debug logging
  --version             Show version

Learn-specific options:
  --sample <N>          Limit rows per table (faster profiling on large data)
  --state <PATH>        Incremental mode: persist statistics to a state file
  --finalize            Emit blueprint from state without processing new data
  --strict              Error on reprocessing same source into same state (default: warn)
  --actors              Enable behavioral modeling (persona discovery, actor graphs)

Inspect options:
  --actors              Show behavioral summary (personas, relationships, generators)

Contributing

See CONTRIBUTING.md for build instructions, coding conventions, and PR guidelines.

License

This project is licensed under the MIT License — see LICENSE for details.
