Data Caterer - Data Generation and Validation Tool

Overview

Generate data for databases, files, messaging systems or HTTP requests via the UI, Scala/Java SDK or YAML input, executed via Spark. Run data validations after generating data to ensure it is consumed correctly.

Full docs can be found here. A demo of the UI can be found here.

Features

Metadata discovery
Batch and/or event data generation
Maintain referential integrity across any dataset
Create custom data generation/validation scenarios
Clean up generated data
Data validation
Suggest data validations
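
To make "maintain referential integrity" and seeded generation concrete, here is a minimal Python sketch of the idea. It does not use Data Caterer's SDK. The `generate` function, the `account_id`, `txn_id`, `balance` and `amount` fields and the `ACC0001` id format are all made up for illustration.

```python
import random


def generate(seed, n_accounts=3, n_transactions=10):
    """Return (accounts, transactions) where every transaction points at a generated account."""
    # A dedicated, seeded generator makes the output repeatable across runs.
    rng = random.Random(seed)

    accounts = [
        {"account_id": f"ACC{i:04d}", "balance": round(rng.uniform(0, 1000), 2)}
        for i in range(n_accounts)
    ]
    account_ids = [account["account_id"] for account in accounts]

    # Foreign key: account_id is only ever drawn from the ids generated above.
    transactions = [
        {
            "txn_id": i,
            "account_id": rng.choice(account_ids),
            "amount": round(rng.uniform(1, 50), 2),
        }
        for i in range(n_transactions)
    ]
    return accounts, transactions
```

Calling `generate(seed=42)` twice yields identical records, and no transaction refers to an account that was not generated.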

Quick start

Mac download
Windows download
1. After downloading, go to the 'Downloads' folder and select 'Extract All' on data-caterer-windows
2. Double-click 'DataCaterer-1.0.0' to install Data Caterer
3. Click on 'More info', then click 'Run anyway' at the bottom
4. Go to the '/Program Files/DataCaterer' folder and run the DataCaterer application
5. If your browser doesn't open, go to http://localhost:9898 in your preferred browser
Linux download

Docker

docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer:0.7.0

Open localhost:9898.
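
The container may take a moment to start serving the UI. As a convenience, the sketch below polls the address until something answers. It is a hypothetical helper, not part of Data Caterer, and it assumes the UI answers plain HTTP on the port mapped in the command above.

```python
import time
import urllib.error
import urllib.request


def wait_for_ui(url="http://localhost:9898", timeout_s=60.0, interval_s=1.0):
    """Poll url until any HTTP response arrives; return False once timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status, so it is up.
            return True
        except (urllib.error.URLError, OSError):
            # Nothing is listening yet (or the request timed out); retry shortly.
            time.sleep(interval_s)
    return False
```

For example, `wait_for_ui()` returns `True` once the UI responds and `False` if nothing answers within a minute.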

Run Scala/Java examples

git clone git@github.com:data-catering/data-caterer-example.git
cd data-caterer-example && ./run.sh
# check results in docker/sample/report/index.html

Integrations

Supported data sources

Data Caterer supports the following data sources:

Data Source Type	Data Source	Sponsor
Database	Postgres, MySQL, Cassandra	N
File	CSV, JSON, ORC, Parquet	N
Messaging	Kafka, Solace	Y
HTTP	REST API	Y
Metadata	Marquez, OpenMetadata, OpenAPI/Swagger	Y

Supported use cases

Insert into single data sink
Insert into multiple data sinks
1. Foreign keys associated between data sources
2. Number of records per column value
Set random seed at column and whole data generation level
Generate real-looking data (via DataFaker) and edge cases
1. Names, addresses, places etc.
2. Edge cases for each data type (e.g. newline character in string, maximum integer, NaN, 0)
3. Nullability
Send events progressively
Automatically insert data into data source
1. Read metadata from data source and insert for all sub data sources (e.g. tables)
2. Get statistics from existing data in the data source, if any exists
Track and delete generated data
Extract data profiling and metadata from given data sources
1. Calculate the total number of combinations
Validate data
1. Basic column validations (not null, contains, equals, greater than)
2. Aggregate validations (group by account_id and sum amounts should be less than 100, each account should have at least one transaction)
3. Upstream data source validations (generate data and then check same data is inserted in another data source with potential transformations)
4. Column name validations (check count and ordering of column names)
Data migration validations
1. Ensure row counts are equal
2. Check both data sources have same values for key columns
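
The validation use cases above can be pictured as simple checks over rows. The following Python sketch only illustrates the logic of a column check, an aggregate check and the two migration checks. It is not Data Caterer's validation API, and every function name in it is hypothetical.

```python
from collections import defaultdict


def not_null(rows, column):
    """Basic column validation: every row has a non-null value in column."""
    return all(row.get(column) is not None for row in rows)


def sum_per_group_below(rows, group_col, value_col, limit):
    """Aggregate validation: the sum of value_col per group_col is below limit."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_col]] += row[value_col]
    return all(total < limit for total in totals.values())


def same_row_count(source_rows, target_rows):
    """Migration validation: both data sources hold the same number of rows."""
    return len(source_rows) == len(target_rows)


def same_key_values(source_rows, target_rows, key_cols):
    """Migration validation: both data sources hold the same key column values."""
    def keys(rows):
        return sorted(tuple(row[col] for col in key_cols) for row in rows)

    return keys(source_rows) == keys(target_rows)
```

With rows such as `{"account_id": "ACC1", "amount": 40.0}`, calling `sum_per_group_below(rows, "account_id", "amount", 100)` expresses the "group by account_id and sum amounts should be less than 100" example.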

Run Configurations

Different ways to run Data Caterer based on your use case:

Sponsorship

Data Caterer is set up under a sponsorware model where all features are available to sponsors. The core features are available in this project, as the open core, for anyone to use, fork, update or improve.

Sponsors have access to the following features:

Metadata discovery
All data sources (see here for all data sources)
Batch and Event generation
Auto generation from data connections or metadata sources
Suggest data validations
Clean up generated data
Run as many times as you want, not charged by usage
Plus more to come

Find out more details here to help with sponsorship.

This is inspired by the mkdocs-material project which follows the same model.

Contributing

View details here about how you can contribute to the project.

Additional Details

Design

Design motivations and details can be found here.

Roadmap

Check here for the full list.

UI

Allow the application to run with UI enabled
Runs as a long-lived app with UI that interacts with the existing app as a single container
Ability to run as UI, Spark job or both
Persist data in files or database (Postgres)
UI will show the history of data generation/validation runs, delete generated data, create new scenarios, define data connections

Distribution

Docker

gradle clean :api:shadowJar :app:shadowJar
docker build --build-arg "APP_VERSION=0.7.0" --build-arg "SPARK_VERSION=3.5.0" --no-cache -t datacatering/data-caterer:0.7.0 .
docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone -v data-caterer-data:/opt/data-caterer --name datacaterer datacatering/data-caterer:0.7.0
#open localhost:9898

Jpackage

JPACKAGE_BUILD=true gradle clean :api:shadowJar :app:shadowJar
# Mac
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-mac.cfg"
# Windows
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-windows.cfg"
# Linux
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-linux.cfg"
