diff --git a/README.md b/README.md
index 01491a25..1a941bfb 100644
--- a/README.md
+++ b/README.md
@@ -5,11 +5,11 @@
 # **data-diff**
 ## What is `data-diff`?
-data-diff is a **free, open-source tool** that enables data professionals to detect differences in values between any two tables. It's fast, easy to use, and reliable. Even at massive scale.
+data-diff is a **free, open-source tool** that enables data professionals to detect differences in values between any two tables.
 ## Documentation
-[**🗎 Documentation website**](https://docs.datafold.com/os_diff/about) - our detailed documentation has everything you need to start diffing.
+[**🗎 Documentation website**](https://docs.datafold.com/guides/os_data_diff) - our detailed documentation has everything you need to start diffing.
 ### Databases we support
@@ -27,31 +27,11 @@ data-diff is a **free, open-source tool** that enables data professionals to det
 - DuckDB >=0.6
 - SQLite (coming soon)
-For their corresponding connection strings, check out our [detailed table](https://docs.datafold.com/os_diff/databases_we_support).
+For their corresponding connection strings, check out our [detailed table](https://github.com/datafold/data-diff/blob/master/docs/supported-databases.md).
 #### Looking for a database not on the list?
 If a database is not on the list, we'd still love to support it. [Please open an issue](https://github.com/datafold/data-diff/issues) to discuss it, or vote on existing requests to push them up our todo list.
-## Use cases
-
-### Diff Tables Between Databases
-#### Quickly identify issues when moving data between databases
-

-[image: diff2]
-
-### Diff Tables Within a Database
-#### Improve code reviews by identifying data problems you don't have tests for
-[image: Intro to Diff]

-
-&nbsp;
-&nbsp;
-
 ## Get started
 ### Installation
@@ -92,7 +72,7 @@ Once you've installed `data-diff`, you can run it from the command line.
 data-diff DB1_URI TABLE1_NAME DB2_URI TABLE2_NAME [OPTIONS]
 ```
-Be sure to read [the docs](https://docs.datafold.com/os_diff/how_to_use/how_to_use_with_command_line) for detailed instructions how to build one of these commands depending on your database setup.
+Be sure to read [the docs](https://docs.datafold.com/reference/open_source/cli) for detailed instructions how to build one of these commands depending on your database setup.
 #### Code Example: Diff Tables Between Databases
 Here's an example command for your copy/pasting, taken from the screenshot above when we diffed data between Snowflake and Postgres.
@@ -130,22 +110,18 @@ In both code examples, I've used `<>` carrots to represent values that **should
 We know that in some cases, the data-diff command can become long and dense. And maybe you're new to the command line.
-* We're here to help [on slack](https://locallyoptimistic.slack.com/archives/C03HUNGQV0S) if you have ANY questions as you use `data-diff` in your workflow.
-* You can also post a question in [GitHub Discussions](https://github.com/datafold/data-diff/discussions).
-
-
-To get a Slack invite - [click here](https://locallyoptimistic.com/community/)
+* We're here to help! Post a question in [GitHub Discussions](https://github.com/datafold/data-diff/discussions).
 ## How to Use
-* [How to use from the shell (or: command-line)](https://docs.datafold.com/os_diff/how_to_use/how_to_use_with_command_line)
-* [How to use from Python](https://docs.datafold.com/os_diff/how_to_use/how_to_use_with_python)
-* [How to use with TOML configuration file](https://docs.datafold.com/os_diff/how_to_use/how_to_use_with_toml)
-* [Usage Analytics & Data Privacy](https://docs.datafold.com/os_diff/usage_analytics_data_privacy)
+* [Examples with dbt, joindiff, and hashdiff](https://docs.datafold.com/reference/open_source/cli#examples)
+* [Examples with Python](https://data-diff.readthedocs.io/en/latest/python-api.html)
+* [How to use with TOML configuration file](https://docs.datafold.com/reference/open_source/cli#toml-config-file)
 ## How to Contribute
 * Feel free to open an issue or contribute to the project by working on an existing issue.
 * Please read the [contributing guidelines](https://github.com/datafold/data-diff/blob/master/CONTRIBUTING.md) to get started.
+* To add a new database driver, check out [docs](https://github.com/datafold/data-diff/blob/master/docs/new-database-driver-guide.rst).
 Big thanks to everyone who contributed so far:
@@ -155,7 +131,10 @@ Big thanks to everyone who contributed so far:
 ## Technical Explanation
-Check out this [technical explanation](https://docs.datafold.com/os_diff/technical_explanation) of how data-diff works.
+Check out this [technical explanation](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md) of how data-diff works.
+
+## Analytics
+- [Usage Analytics & Data Privacy](https://github.com/datafold/data-diff/blob/master/docs/usage_analytics.md)
 ## License
diff --git a/docs/common_use_cases.md b/docs/common_use_cases.md
new file mode 100644
index 00000000..bfe6fdc6
--- /dev/null
+++ b/docs/common_use_cases.md
@@ -0,0 +1,14 @@
+# Common Use Cases
+
+## joindiff
+- **Inspect differences between branches**. Make sure your code results in only expected changes.
+- **Validate stability of critical downstream tables**. When refactoring a data pipeline, rest assured that the changes you make to upstream models have not impacted critical downstream models depended on by users and systems.
+- **Conduct better code reviews**. No matter how thoughtfully you review the code, run a diff to ensure that you don't accidentally approve an error.
+
+## hashdiff
+- **Verify data migrations**. Verify that all data was copied when doing a critical data migration. For example, migrating from Heroku PostgreSQL to Amazon RDS.
+- **Verify data pipelines**. Moving data from a relational database to a warehouse/data lake with Fivetran, Airbyte, Debezium, or some other pipeline.
+- **Maintain data integrity SLOs**. You can create and monitor your SLO of e.g. 99.999% data integrity, and alert your team when data is missing.
+- **Debug complex data pipelines**. Data can get lost in pipelines that may span a half-dozen systems. data-diff helps you efficiently track down where a row got lost without needing to individually inspect intermediate datastores.
+- **Detect hard deletes for an `updated_at`-based pipeline**. If you're copying data to your warehouse based on an `updated_at`-style column, data-diff can find any hard-deletes that you may have missed.
+- **Make your replication self-healing**. You can use data-diff to self-heal by using the diff output to write/update rows in the target database.
diff --git a/docs/index.rst b/docs/index.rst
index b20d77ed..476508c5 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -4,22 +4,12 @@
    :hidden:
    python-api
+   python_examples
 data-diff
 ---------
-**Data-diff** is a command-line tool and Python library to efficiently diff
-rows across two different databases.
-
-⇄ Verifies across many different databases (e.g. *PostgreSQL* -> *Snowflake*) !
-
-🔍 Outputs diff of rows in detail
-
-🚨 Simple CLI/API to create monitoring and alerts
-
-🔥 Verify 25M+ rows in <10s, and 1B+ rows in ~5min.
-
-♾️ Works for tables with 10s of billions of rows
+**Data-diff** is a command-line tool and Python library for comparing tables in and across databases.
 For more information, `See our README `_
@@ -32,4 +22,4 @@ Resources
 - :doc:`python-api`
 - The rest of the `documentation`_
-.. _documentation: https://docs.datafold.com/os_diff/about/
+.. _documentation: https://docs.datafold.com/guides/os_data_diff
diff --git a/docs/python_examples.rst b/docs/python_examples.rst
new file mode 100644
index 00000000..a3a3bf0f
--- /dev/null
+++ b/docs/python_examples.rst
@@ -0,0 +1,46 @@
+Python API Examples
+-------------------
+
+**Example 1: Diff tables in mysql and postgresql**
+
+.. code-block:: python
+
+    # Optional: Set logging to display the progress of the diff
+    import logging
+    logging.basicConfig(level=logging.INFO)
+
+    from data_diff import connect_to_table, diff_tables
+
+    table1 = connect_to_table("postgresql:///", "table_name", "id")
+    table2 = connect_to_table("mysql:///", "table_name", "id")
+
+    for different_row in diff_tables(table1, table2):
+        plus_or_minus, columns = different_row
+        print(plus_or_minus, columns)
+
+
+**Example 2: Connect to snowflake using dictionary configuration**
+
+.. code-block:: python
+
+    SNOWFLAKE_CONN_INFO = {
+        "driver": "snowflake",
+        "user": "erez",
+        "account": "whatever",
+        "database": "TESTS",
+        "warehouse": "COMPUTE_WH",
+        "role": "ACCOUNTADMIN",
+        "schema": "PUBLIC",
+        "key": "snowflake_rsa_key.p8",
+    }
+
+    snowflake_table = connect_to_table(SNOWFLAKE_CONN_INFO, "table_name")  # Uses id by default
+
+Run ``help(connect_to_table)`` and ``help(diff_tables)`` or read our API reference to learn more about the different options:
+
+- connect_to_table_
+
+- diff_tables_
+
+.. _connect_to_table: https://data-diff.readthedocs.io/en/latest/python-api.html#data_diff.connect_to_table
+.. _diff_tables: https://data-diff.readthedocs.io/en/latest/python-api.html#data_diff.diff_tables
\ No newline at end of file
diff --git a/docs/usage_analytics.md b/docs/usage_analytics.md
new file mode 100644
index 00000000..76b88b94
--- /dev/null
+++ b/docs/usage_analytics.md
@@ -0,0 +1,22 @@
+# Usage Analytics & Data Privacy
+
+data-diff collects anonymous usage data to help our team improve the tool and to apply development efforts to where our users need them most.
+
+We capture two events: one when the data-diff run starts, and one when it is finished. No user data or potentially sensitive information is or ever will be collected. The captured data is limited to:
+
+- Operating System and Python version
+- Types of databases used (postgresql, mysql, etc.)
+- Sizes of tables diffed, run time, and diff row count (numbers only)
+- Error message, if any, truncated to the first 20 characters.
+- A persistent UUID to identify the session, stored in `~/.datadiff.toml`
+
+To disable, use one of the following methods:
+
+* **CLI**: use the `--no-tracking` flag.
+* **Config file**: set `no_tracking = true` (for example, under `[run.default]`)
+* **Python API**:
+  ```python
+  import data_diff
+  # Invoke the following before making any API calls
+  data_diff.disable_tracking()
+  ```
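
Reviewer note, not part of the patch: Example 1 in `docs/python_examples.rst` unpacks each result of `diff_tables` into a `(plus_or_minus, columns)` pair. A minimal sketch of how such pairs might be grouped by sign before acting on them could look like the following. The names `sample_diff` and `group_by_sign` are hypothetical, and the rows are made-up stand-ins so the snippet runs without a database connection or `data_diff` installed.

```python
from collections import defaultdict

# Hypothetical stand-in for the pairs yielded by diff_tables(table1, table2);
# the row values here are invented for illustration only.
sample_diff = [
    ("-", ("1", "alice")),
    ("+", ("1", "alicia")),
    ("+", ("2", "bob")),
]


def group_by_sign(diff_rows):
    """Collect the column tuples of each diff row under its sign."""
    grouped = defaultdict(list)
    for plus_or_minus, columns in diff_rows:
        grouped[plus_or_minus].append(columns)
    return dict(grouped)


grouped = group_by_sign(sample_diff)
for sign, rows in grouped.items():
    print(sign, len(rows))
```

Grouping first keeps the decision about what to do with each side (report, alert, or write back to a target) separate from the iteration over the diff itself.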