Generate data for databases, files, messaging systems, or HTTP requests via the UI, Scala/Java SDK, or YAML input, executed via Spark. Run data validations after generating data to ensure it is consumed correctly.
Full docs can be found here. A demo of the UI can be found here.
- Metadata discovery
- Batch and/or event data generation
- Maintain referential integrity across any dataset
- Create custom data generation/validation scenarios
- Clean up generated data
- Data validation
- Suggest data validations
- Mac download
- Windows download
  - After downloading, go to the 'Downloads' folder and select 'Extract All' on data-caterer-windows
  - Double-click 'DataCaterer-1.0.0' to install Data Caterer
  - Click on 'More info', then click 'Run anyway' at the bottom
  - Go to the '/Program Files/DataCaterer' folder and run the DataCaterer application
  - If your browser doesn't open, go to http://localhost:9898 in your preferred browser
- Linux download
- Docker
Run the following command, then open http://localhost:9898:
docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer:0.7.0
git clone git@github.com:data-catering/data-caterer-example.git
cd data-caterer-example && ./run.sh
# check results in docker/sample/report/index.html
Data Caterer supports the following data sources:
| Data Source Type | Data Source | Sponsor |
|---|---|---|
| Database | Postgres, MySQL, Cassandra | N |
| File | CSV, JSON, ORC, Parquet | N |
| Messaging | Kafka, Solace | Y |
| HTTP | REST API | Y |
| Metadata | Marquez, OpenMetadata, OpenAPI/Swagger | Y |
- Insert into single data sink
- Insert into multiple data sinks
- Foreign keys associated between data sources
- Number of records per column value
- Set random seed at column and whole data generation level
- Generate real-looking data (via DataFaker) and edge cases
  - Names, addresses, places etc.
  - Edge cases for each data type (e.g. newline character in string, maximum integer, NaN, 0)
- Nullability
- Send events progressively
- Automatically insert data into data source
- Read metadata from data source and insert for all sub data sources (e.g. tables)
- Get statistics from existing data in the data source, if any exists
- Track and delete generated data
- Extract data profiling and metadata from given data sources
- Calculate the total number of combinations
- Validate data
  - Basic column validations (not null, contains, equals, greater than)
  - Aggregate validations (e.g. the sum of amounts grouped by account_id should be less than 100; each account should have at least one transaction)
  - Upstream data source validations (generate data, then check that the same data is inserted in another data source, with potential transformations)
  - Column name validations (check count and ordering of column names)
- Data migration validations
  - Ensure row counts are equal
  - Check both data sources have the same values for key columns
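
To make the validation types listed above more concrete, here is a minimal sketch in plain Python. It does not use Data Caterer's SDK or YAML syntax; every function name (`generate`, `not_null`, `sum_per_group`, `same_row_count`, `same_key_values`), the column names, and the sample values are hypothetical and exist only for illustration. It generates seeded sample data with a shared `account_id` key, then runs a basic column check, two aggregate checks, and two migration-style checks.

```python
import random
from collections import defaultdict


def generate(seed, n_accounts=3, txns_per_account=2):
    """Generate accounts and transactions that share account_id values."""
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    accounts = [
        {"account_id": f"ACC{i:03d}", "balance": round(rng.uniform(0, 1000), 2)}
        for i in range(n_accounts)
    ]
    # Every transaction reuses an existing account_id (referential integrity),
    # with a fixed number of records per account_id value.
    transactions = [
        {"account_id": acc["account_id"], "amount": round(rng.uniform(1, 40), 2)}
        for acc in accounts
        for _ in range(txns_per_account)
    ]
    return accounts, transactions


def not_null(rows, column):
    """Basic column validation: no row has a missing value for the column."""
    return all(row.get(column) is not None for row in rows)


def sum_per_group(rows, key, value):
    """Aggregate helper: sum of `value` grouped by `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def same_row_count(source, target):
    """Migration validation: both data sets have the same number of rows."""
    return len(source) == len(target)


def same_key_values(source, target, key):
    """Migration validation: both data sets hold the same key column values."""
    return sorted(r[key] for r in source) == sorted(r[key] for r in target)


accounts, transactions = generate(seed=42)
totals = sum_per_group(transactions, "account_id", "amount")
migrated = [dict(row) for row in transactions]  # stand-in for a second data source

checks = {
    "account_id_not_null": not_null(transactions, "account_id"),
    "sum_amount_below_100": all(total < 100 for total in totals.values()),
    "every_account_has_transaction": {a["account_id"] for a in accounts} <= set(totals),
    "row_counts_equal": same_row_count(transactions, migrated),
    "key_columns_match": same_key_values(transactions, migrated, "account_id"),
}
for name, passed in checks.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

In this sketch the "second data source" is just a copy of the generated rows; in practice it would be whatever system consumed or migrated the data.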
Different ways to run Data Caterer based on your use case:
Data Caterer follows a sponsorware model: all features are available to sponsors, while the core features are available here in this project, as the open core, for anyone to use, fork, update, or improve.
Sponsors have access to the following features:
- Metadata discovery
- All data sources (see here for all data sources)
- Batch and Event generation
- Auto generation from data connections or metadata sources
- Suggest data validations
- Clean up generated data
- Run as many times as you want, not charged by usage
- Plus more to come
Find out more details about sponsorship here.
This is inspired by the mkdocs-material project which follows the same model.
View details here about how you can contribute to the project.
Design motivations and details can be found here.
- Allow the application to run with UI enabled
- Runs as a long-lived app with UI that interacts with the existing app as a single container
- Ability to run as UI, Spark job or both
- Persist data in files or database (Postgres)
- UI will show the history of data generation/validation runs, delete generated data, create new scenarios, define data connections
gradle clean :api:shadowJar :app:shadowJar
docker build --build-arg "APP_VERSION=0.7.0" --build-arg "SPARK_VERSION=3.5.0" --no-cache -t datacatering/data-caterer:0.7.0 .
docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone -v data-caterer-data:/opt/data-caterer --name datacaterer datacatering/data-caterer:0.7.0
# open localhost:9898
JPACKAGE_BUILD=true gradle clean :api:shadowJar :app:shadowJar
# Mac
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-mac.cfg"
# Windows
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-windows.cfg"
# Linux
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-linux.cfg"