diff --git a/README.md b/README.md
index b26f07e1..371d95ba 100644
--- a/README.md
+++ b/README.md
@@ -1,25 +1,65 @@
-

+

Datafold

-

-data-diff: compare datasets fast, within or across SQL databases
-

+

+data-diff: Compare datasets fast, within or across SQL databases
+![data-diff-logo](docs/data-diff-logo.png)
+


+# Use Cases
+
+## Data Migration & Replication Testing
+Compare source to target and check for discrepancies when moving data between systems:
+- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
+- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
+- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)
+
+
+## Data Development Testing
+Test SQL code and preview changes by comparing development/staging environment data to production:
+1. Make a change to some SQL code
+2. Run the SQL code to create a new dataset
+3. Compare the dataset with its production version or another iteration
+
+

+ dbt +

+ +
+ data-diff integrates with dbt Core to seamlessly compare local development to production datasets
+
+
+
+![data-development-testing](docs/development_testing.png)
+
+
+
+> [dbt Cloud users should check out Datafold's out-of-the-box deployment testing integration](https://www.datafold.com/data-deployment-testing)
+
+:eyes: **Watch [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)**
+
+**[Get started with data-diff & dbt](https://docs.datafold.com/development_testing/open_source)**
+
+Also available in a [VS Code Extension](https://marketplace.visualstudio.com/items?itemName=Datafold.datafold-vscode)
+
+Reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) for advice and support
+
+
 # How it works
 When comparing the data, `data-diff` utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:
-## joindiff
+## `joindiff`
 - Recommended for comparing data within the same database
 - Uses the outer join operation to diff the rows as efficiently as possible within the same database
 - Fully relies on the underlying database engine for computation
 - Requires both datasets to be queryable with a single SQL query
 - Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset
-## hashdiff
+## `hashdiff`
 - Recommended for comparing datasets across different databases
 - Can also be helpful in diffing very large tables with few expected differences within the same database
 - Employs a divide-and-conquer algorithm based on hashing and binary search
@@ -52,52 +92,23 @@ data-diff \
 Check out [documentation](https://docs.datafold.com/reference/open_source/cli) for the full command reference.
-# Use cases
-
-## Data Migration & Replication Testing
-Compare source to target and check for discrepancies when moving data between systems:
-- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
-- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
-- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)
-
-
-## Data Development Testing
-Test SQL code and preview changes by comparing development/staging environment data to production:
-1. Make a change to some SQL code
-2. Run the SQL code to create a new dataset
-3. Compare the dataset with its production version or another iteration
-
-

- dbt -

-
-`data-diff` integrates with dbt Core and dbt Cloud to seamlessly compare local development to production datasets.
-
-:eyes: **Watch [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)**
-
-**[Get started with data-diff & dbt](https://docs.datafold.com/development_testing/open_source)**
-
-Also available in a [VS Code Extension](https://marketplace.visualstudio.com/items?itemName=Datafold.datafold-vscode)
-
-Reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) for advice and support
-
 # Supported databases
 | Database | Status | Connection string |
 |---------------|-------------------------------------------------------------------------------------------------------------------------------------|--------|
-| PostgreSQL >=10 | 💚 | `postgresql://:@:5432/` |
-| MySQL | 💚 | `mysql://:@:5432/` |
-| Snowflake | 💚 | `"snowflake://[:]@//?warehouse=&role=[&authenticator=externalbrowser]"` |
-| BigQuery | 💚 | `bigquery:///` |
-| Redshift | 💚 | `redshift://:@:5439/` |
-| Oracle | 💛 | `oracle://:@/database` |
-| Presto | 💛 | `presto://:@:8080/` |
-| Databricks | 💛 | `databricks://:@//` |
-| Trino | 💛 | `trino://:@:8080/` |
-| Clickhouse | 💛 | `clickhouse://:@:9000/` |
-| Vertica | 💛 | `vertica://:@:5433/` |
-| DuckDB | 💛 | |
+| PostgreSQL >=10 | 🟢 | `postgresql://:@:5432/` |
+| MySQL | 🟢 | `mysql://:@:5432/` |
+| Snowflake | 🟢 | `"snowflake://[:]@//?warehouse=&role=[&authenticator=externalbrowser]"` |
+| BigQuery | 🟢 | `bigquery:///` |
+| Redshift | 🟢 | `redshift://:@:5439/` |
+| Oracle | 🟡 | `oracle://:@/database` |
+| Presto | 🟡 | `presto://:@:8080/` |
+| Databricks | 🟡 | `databricks://:@//` |
+| Trino | 🟡 | `trino://:@:8080/` |
+| Clickhouse | 🟡 | `clickhouse://:@:9000/` |
+| Vertica | 🟡 | `vertica://:@:5433/` |
+| DuckDB | 🟡 | |
 | ElasticSearch | 📝 | |
 | Planetscale | 📝 | |
 | Pinot | 📝 | |
@@ -105,8 +116,8 @@ Reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archive
 | Kafka | 📝 | |
 | SQLite | 📝 | |
-* 💚: Implemented and thoroughly tested.
-* 💛: Implemented, but not thoroughly tested yet.
+* 🟢: Implemented and thoroughly tested.
+* 🟡: Implemented, but not thoroughly tested yet.
 * ⏳: Implementation in progress.
 * 📝: Implementation planned. Contributions welcome.
diff --git a/docs/data-diff-logo.png b/docs/data-diff-logo.png
new file mode 100644
index 00000000..545d6414
Binary files /dev/null and b/docs/data-diff-logo.png differ
diff --git a/docs/development_testing.png b/docs/development_testing.png
new file mode 100644
index 00000000..4c306efe
Binary files /dev/null and b/docs/development_testing.png differ
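
Note on the `hashdiff` section touched by this diff: the README describes it as a divide-and-conquer algorithm based on hashing and binary search. The sketch below illustrates that general idea only and is not data-diff's actual implementation. It assumes each table is an in-memory dict keyed by an integer primary key; `segment_checksum`, `hashdiff`, the MD5 checksum, and the `threshold` parameter are hypothetical names and choices made for the illustration.

```python
import hashlib


def segment_checksum(table, lo, hi):
    """Hash every row whose key falls in [lo, hi), in key order."""
    digest = hashlib.md5()
    for key in sorted(k for k in table if lo <= k < hi):
        digest.update(repr((key, table[key])).encode())
    return digest.hexdigest()


def hashdiff(a, b, lo, hi, threshold=4):
    """Yield (sign, key, row) for rows that differ between a and b in [lo, hi)."""
    if segment_checksum(a, lo, hi) == segment_checksum(b, lo, hi):
        return  # checksums match: skip the whole segment
    if hi - lo <= threshold:
        # Segment is small enough to compare row by row.
        for key in range(lo, hi):
            in_a, in_b = key in a, key in b
            if in_a != in_b or (in_a and a[key] != b[key]):
                if in_a:
                    yield ("-", key, a[key])
                if in_b:
                    yield ("+", key, b[key])
        return
    # Checksums differ on a large segment: bisect and recurse into each half.
    mid = (lo + hi) // 2
    yield from hashdiff(a, b, lo, mid, threshold)
    yield from hashdiff(a, b, mid, hi, threshold)


# Hypothetical data: 100 rows, one changed and one deleted in the target.
source = {i: f"row-{i}" for i in range(100)}
target = dict(source)
target[42] = "changed"
del target[7]

result = list(hashdiff(source, target, 0, 100))
print(result)
```

Because matching segments are skipped after a single checksum comparison, the work is concentrated on the few key ranges that actually differ, which fits the README's remark that this mode suits very large tables with few expected differences.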