Make DuckDB data diffs work better #716
Merged
sungchun12 changed the title from "Feature/motherduck-support" to "Make DuckDB data diffs work better" on Sep 26, 2023
sungchun12 force-pushed the feature/motherduck-support branch from d77e435 to a1c755f on October 2, 2023 22:46
Working dry run using motherduck with datafold_demo:

  target: dev
  outputs:
    dev:
      type: duckdb
      schema: development
      # path: 'datafold_demo.duckdb'
      path: 'md:datafold_demo?motherduck_token={{ env_var("motherduck_token") }}'
      threads: 16

~/De/data-diff/data_diff_demo $ data-diff --dbt --debug
Running with data-diff=0.9.2 (Update 0.9.3 is available!)
12:39:25 INFO Parsing file dbt_project.yml dbt_parser.py:287
INFO Parsing file /Users/sung/Desktop/data-diff/data_diff_demo/target/manifest.json dbt_parser.py:280
INFO Parsing file target/run_results.json dbt_parser.py:253
INFO config: prod_database='datafold_demo' prod_schema='production' dbt_parser.py:159
prod_custom_schema=None datasource_id=6357
INFO Parsing file /Users/sung/Desktop/data-diff/data_diff_demo/profiles.yml dbt_parser.py:294
DEBUG Found PKs via Uniqueness tests: {'order_id'} dbt_parser.py:458
12:39:27 DEBUG Running SQL (DuckDB): SET GLOBAL TimeZone='UTC' base.py:879
DEBUG Running SQL (DuckDB): SELECT column_name, data_type, datetime_precision, numeric_precision, base.py:879
numeric_scale FROM datafold_demo.information_schema.columns WHERE table_name = 'orders' AND
table_schema = 'production'
DEBUG Running SQL (DuckDB): SELECT column_name, data_type, datetime_precision, numeric_precision, base.py:879
numeric_scale FROM datafold_demo.information_schema.columns WHERE table_name = 'orders' AND
table_schema = 'development'
DEBUG Running SQL (DuckDB): SELECT column_name, data_type, datetime_precision, numeric_precision, base.py:879
numeric_scale FROM datafold_demo.information_schema.columns WHERE table_name = 'orders' AND
table_schema = 'production'
DEBUG Running SQL (DuckDB): SELECT TRIM("status") FROM "datafold_demo"."production"."orders" base.py:879
LIMIT 64
DEBUG [DuckDB] Schema = {'order_id': Integer(precision=0, python_type=<class 'int'>), schema.py:12
'customer_id': Integer(precision=0, python_type=<class 'int'>), 'order_date':
UnknownColType(text='DATE'), 'status': String_VaryingAlphanum(), 'credit_card_amount':
Float(precision=13), 'coupon_amount': Float(precision=13), 'bank_transfer_amount':
Float(precision=13), 'gift_card_amount': Float(precision=13), 'amount':
Float(precision=13)}
DEBUG Running SQL (DuckDB): SELECT column_name, data_type, datetime_precision, numeric_precision, base.py:879
numeric_scale FROM datafold_demo.information_schema.columns WHERE table_name = 'orders' AND
table_schema = 'development'
DEBUG Running SQL (DuckDB): SELECT TRIM("status") FROM "datafold_demo"."development"."orders" base.py:879
LIMIT 64
12:39:28 DEBUG [DuckDB] Schema = {'order_id': Integer(precision=0, python_type=<class 'int'>), schema.py:12
'customer_id': Integer(precision=0, python_type=<class 'int'>), 'order_date':
UnknownColType(text='DATE'), 'status': String_VaryingAlphanum(), 'credit_card_amount':
Float(precision=13), 'coupon_amount': Float(precision=13), 'bank_transfer_amount':
Float(precision=13), 'gift_card_amount': Float(precision=13), 'amount':
Float(precision=13)}
DEBUG Testing for duplicate keys joindiff_tables.py:230
INFO Validating that the are no duplicate keys in columns: ['order_id'] joindiff_tables.py:243
DEBUG Running SQL (DuckDB): SELECT count(*) AS "total", count(distinct base.py:879
coalesce("order_id"::VARCHAR, '<null>')) AS "total_distinct" FROM
"datafold_demo"."production"."orders"
DEBUG Collecting stats for table #1 joindiff_tables.py:270
DEBUG Querying for different rows joindiff_tables.py:208
DEBUG Running SQL (DuckDB): SELECT sum("amount") AS "sum_amount", sum("credit_card_amount") AS base.py:879
"sum_credit_card_amount", sum("customer_id") AS "sum_customer_id", sum("gift_card_amount")
AS "sum_gift_card_amount", sum("bank_transfer_amount") AS "sum_bank_transfer_amount",
sum("coupon_amount") AS "sum_coupon_amount", count(*) AS "count" FROM
"datafold_demo"."production"."orders"
DEBUG Running SQL (DuckDB): SELECT * FROM (SELECT ("tmp2"."order_id" IS NULL) AS base.py:879
"is_exclusive_a", ("tmp1"."order_id" IS NULL) AS "is_exclusive_b", CASE WHEN
"tmp1"."order_id" is distinct from "tmp2"."order_id" THEN 1 ELSE 0 END AS
"is_diff_order_id", CASE WHEN "tmp1"."amount" is distinct from "tmp2"."amount" THEN 1 ELSE
0 END AS "is_diff_amount", CASE WHEN "tmp1"."order_date" is distinct from
"tmp2"."order_date" THEN 1 ELSE 0 END AS "is_diff_order_date", CASE WHEN
"tmp1"."credit_card_amount" is distinct from "tmp2"."credit_card_amount" THEN 1 ELSE 0 END
AS "is_diff_credit_card_amount", CASE WHEN "tmp1"."customer_id" is distinct from
"tmp2"."customer_id" THEN 1 ELSE 0 END AS "is_diff_customer_id", CASE WHEN
"tmp1"."gift_card_amount" is distinct from "tmp2"."gift_card_amount" THEN 1 ELSE 0 END AS
"is_diff_gift_card_amount", CASE WHEN "tmp1"."bank_transfer_amount" is distinct from
"tmp2"."bank_transfer_amount" THEN 1 ELSE 0 END AS "is_diff_bank_transfer_amount", CASE
WHEN "tmp1"."status" is distinct from "tmp2"."status" THEN 1 ELSE 0 END AS
"is_diff_status", CASE WHEN "tmp1"."coupon_amount" is distinct from "tmp2"."coupon_amount"
THEN 1 ELSE 0 END AS "is_diff_coupon_amount", "tmp1"."order_id"::VARCHAR AS "order_id_a",
"tmp2"."order_id"::VARCHAR AS "order_id_b", "tmp1"."amount"::DECIMAL(38, 13)::VARCHAR AS
"amount_a", "tmp2"."amount"::DECIMAL(38, 13)::VARCHAR AS "amount_b",
"tmp1"."order_date"::VARCHAR AS "order_date_a", "tmp2"."order_date"::VARCHAR AS
"order_date_b", "tmp1"."credit_card_amount"::DECIMAL(38, 13)::VARCHAR AS
"credit_card_amount_a", "tmp2"."credit_card_amount"::DECIMAL(38, 13)::VARCHAR AS
"credit_card_amount_b", "tmp1"."customer_id"::VARCHAR AS "customer_id_a",
"tmp2"."customer_id"::VARCHAR AS "customer_id_b", "tmp1"."gift_card_amount"::DECIMAL(38,
13)::VARCHAR AS "gift_card_amount_a", "tmp2"."gift_card_amount"::DECIMAL(38, 13)::VARCHAR
AS "gift_card_amount_b", "tmp1"."bank_transfer_amount"::DECIMAL(38, 13)::VARCHAR AS
"bank_transfer_amount_a", "tmp2"."bank_transfer_amount"::DECIMAL(38, 13)::VARCHAR AS
"bank_transfer_amount_b", "tmp1"."status"::VARCHAR AS "status_a", "tmp2"."status"::VARCHAR
AS "status_b", "tmp1"."coupon_amount"::DECIMAL(38, 13)::VARCHAR AS "coupon_amount_a",
"tmp2"."coupon_amount"::DECIMAL(38, 13)::VARCHAR AS "coupon_amount_b" FROM
"datafold_demo"."production"."orders" "tmp1" FULL OUTER JOIN
"datafold_demo"."development"."orders" "tmp2" ON ("tmp1"."order_id" = "tmp2"."order_id"))
tmp3 WHERE (("is_diff_order_id" = 1) OR ("is_diff_amount" = 1) OR ("is_diff_order_date" =
1) OR ("is_diff_credit_card_amount" = 1) OR ("is_diff_customer_id" = 1) OR
("is_diff_gift_card_amount" = 1) OR ("is_diff_bank_transfer_amount" = 1) OR
("is_diff_status" = 1) OR ("is_diff_coupon_amount" = 1))
INFO Validating that the are no duplicate keys in columns: ['order_id'] joindiff_tables.py:243
DEBUG Running SQL (DuckDB): SELECT count(*) AS "total", count(distinct base.py:879
coalesce("order_id"::VARCHAR, '<null>')) AS "total_distinct" FROM
"datafold_demo"."development"."orders"
DEBUG Done collecting stats for table #1 joindiff_tables.py:306
DEBUG Collecting stats for table #2 joindiff_tables.py:270
DEBUG Running SQL (DuckDB): SELECT sum("amount") AS "sum_amount", sum("credit_card_amount") AS base.py:879
"sum_credit_card_amount", sum("customer_id") AS "sum_customer_id", sum("gift_card_amount")
AS "sum_gift_card_amount", sum("bank_transfer_amount") AS "sum_bank_transfer_amount",
sum("coupon_amount") AS "sum_coupon_amount", count(*) AS "count" FROM
"datafold_demo"."development"."orders"
DEBUG Done collecting stats for table #2 joindiff_tables.py:306
DEBUG Testing for null keys joindiff_tables.py:252
DEBUG Running SQL (DuckDB): SELECT "order_id" FROM "datafold_demo"."production"."orders" WHERE base.py:879
("order_id" IS NULL)
DEBUG Running SQL (DuckDB): SELECT "order_id" FROM "datafold_demo"."development"."orders" WHERE base.py:879
("order_id" IS NULL)
DEBUG Counting exclusive rows joindiff_tables.py:352
DEBUG Running SQL (DuckDB): SELECT count(*) FROM (SELECT * FROM (SELECT ("tmp2"."order_id" IS base.py:879
NULL) AS "is_exclusive_a", ("tmp1"."order_id" IS NULL) AS "is_exclusive_b", CASE WHEN
"tmp1"."order_id" is distinct from "tmp2"."order_id" THEN 1 ELSE 0 END AS
"is_diff_order_id", CASE WHEN "tmp1"."amount" is distinct from "tmp2"."amount" THEN 1 ELSE
0 END AS "is_diff_amount", CASE WHEN "tmp1"."order_date" is distinct from
"tmp2"."order_date" THEN 1 ELSE 0 END AS "is_diff_order_date", CASE WHEN
"tmp1"."credit_card_amount" is distinct from "tmp2"."credit_card_amount" THEN 1 ELSE 0 END
AS "is_diff_credit_card_amount", CASE WHEN "tmp1"."customer_id" is distinct from
"tmp2"."customer_id" THEN 1 ELSE 0 END AS "is_diff_customer_id", CASE WHEN
"tmp1"."gift_card_amount" is distinct from "tmp2"."gift_card_amount" THEN 1 ELSE 0 END AS
"is_diff_gift_card_amount", CASE WHEN "tmp1"."bank_transfer_amount" is distinct from
"tmp2"."bank_transfer_amount" THEN 1 ELSE 0 END AS "is_diff_bank_transfer_amount", CASE
WHEN "tmp1"."status" is distinct from "tmp2"."status" THEN 1 ELSE 0 END AS
"is_diff_status", CASE WHEN "tmp1"."coupon_amount" is distinct from "tmp2"."coupon_amount"
THEN 1 ELSE 0 END AS "is_diff_coupon_amount", "tmp1"."order_id"::VARCHAR AS "order_id_a",
"tmp2"."order_id"::VARCHAR AS "order_id_b", "tmp1"."amount"::DECIMAL(38, 13)::VARCHAR AS
"amount_a", "tmp2"."amount"::DECIMAL(38, 13)::VARCHAR AS "amount_b",
"tmp1"."order_date"::VARCHAR AS "order_date_a", "tmp2"."order_date"::VARCHAR AS
"order_date_b", "tmp1"."credit_card_amount"::DECIMAL(38, 13)::VARCHAR AS
"credit_card_amount_a", "tmp2"."credit_card_amount"::DECIMAL(38, 13)::VARCHAR AS
"credit_card_amount_b", "tmp1"."customer_id"::VARCHAR AS "customer_id_a",
"tmp2"."customer_id"::VARCHAR AS "customer_id_b", "tmp1"."gift_card_amount"::DECIMAL(38,
13)::VARCHAR AS "gift_card_amount_a", "tmp2"."gift_card_amount"::DECIMAL(38, 13)::VARCHAR
AS "gift_card_amount_b", "tmp1"."bank_transfer_amount"::DECIMAL(38, 13)::VARCHAR AS
"bank_transfer_amount_a", "tmp2"."bank_transfer_amount"::DECIMAL(38, 13)::VARCHAR AS
"bank_transfer_amount_b", "tmp1"."status"::VARCHAR AS "status_a", "tmp2"."status"::VARCHAR
AS "status_b", "tmp1"."coupon_amount"::DECIMAL(38, 13)::VARCHAR AS "coupon_amount_a",
"tmp2"."coupon_amount"::DECIMAL(38, 13)::VARCHAR AS "coupon_amount_b" FROM
"datafold_demo"."production"."orders" "tmp1" FULL OUTER JOIN
"datafold_demo"."development"."orders" "tmp2" ON ("tmp1"."order_id" = "tmp2"."order_id"))
tmp3 WHERE (("is_diff_order_id" = 1) OR ("is_diff_amount" = 1) OR ("is_diff_order_date" =
1) OR ("is_diff_credit_card_amount" = 1) OR ("is_diff_customer_id" = 1) OR
("is_diff_gift_card_amount" = 1) OR ("is_diff_bank_transfer_amount" = 1) OR
("is_diff_status" = 1) OR ("is_diff_coupon_amount" = 1)) AND ("is_exclusive_a" OR
"is_exclusive_b")) tmp4
DEBUG Counting differences per column joindiff_tables.py:338
DEBUG Running SQL (DuckDB): SELECT sum("is_diff_order_id"), sum("is_diff_amount"), base.py:879
sum("is_diff_order_date"), sum("is_diff_credit_card_amount"), sum("is_diff_customer_id"),
sum("is_diff_gift_card_amount"), sum("is_diff_bank_transfer_amount"),
sum("is_diff_status"), sum("is_diff_coupon_amount") FROM (SELECT ("tmp2"."order_id" IS
NULL) AS "is_exclusive_a", ("tmp1"."order_id" IS NULL) AS "is_exclusive_b", CASE WHEN
"tmp1"."order_id" is distinct from "tmp2"."order_id" THEN 1 ELSE 0 END AS
"is_diff_order_id", CASE WHEN "tmp1"."amount" is distinct from "tmp2"."amount" THEN 1 ELSE
0 END AS "is_diff_amount", CASE WHEN "tmp1"."order_date" is distinct from
"tmp2"."order_date" THEN 1 ELSE 0 END AS "is_diff_order_date", CASE WHEN
"tmp1"."credit_card_amount" is distinct from "tmp2"."credit_card_amount" THEN 1 ELSE 0 END
AS "is_diff_credit_card_amount", CASE WHEN "tmp1"."customer_id" is distinct from
"tmp2"."customer_id" THEN 1 ELSE 0 END AS "is_diff_customer_id", CASE WHEN
"tmp1"."gift_card_amount" is distinct from "tmp2"."gift_card_amount" THEN 1 ELSE 0 END AS
"is_diff_gift_card_amount", CASE WHEN "tmp1"."bank_transfer_amount" is distinct from
"tmp2"."bank_transfer_amount" THEN 1 ELSE 0 END AS "is_diff_bank_transfer_amount", CASE
WHEN "tmp1"."status" is distinct from "tmp2"."status" THEN 1 ELSE 0 END AS
"is_diff_status", CASE WHEN "tmp1"."coupon_amount" is distinct from "tmp2"."coupon_amount"
THEN 1 ELSE 0 END AS "is_diff_coupon_amount", "tmp1"."order_id"::VARCHAR AS "order_id_a",
"tmp2"."order_id"::VARCHAR AS "order_id_b", "tmp1"."amount"::DECIMAL(38, 13)::VARCHAR AS
"amount_a", "tmp2"."amount"::DECIMAL(38, 13)::VARCHAR AS "amount_b",
"tmp1"."order_date"::VARCHAR AS "order_date_a", "tmp2"."order_date"::VARCHAR AS
"order_date_b", "tmp1"."credit_card_amount"::DECIMAL(38, 13)::VARCHAR AS
"credit_card_amount_a", "tmp2"."credit_card_amount"::DECIMAL(38, 13)::VARCHAR AS
"credit_card_amount_b", "tmp1"."customer_id"::VARCHAR AS "customer_id_a",
"tmp2"."customer_id"::VARCHAR AS "customer_id_b", "tmp1"."gift_card_amount"::DECIMAL(38,
13)::VARCHAR AS "gift_card_amount_a", "tmp2"."gift_card_amount"::DECIMAL(38, 13)::VARCHAR
AS "gift_card_amount_b", "tmp1"."bank_transfer_amount"::DECIMAL(38, 13)::VARCHAR AS
"bank_transfer_amount_a", "tmp2"."bank_transfer_amount"::DECIMAL(38, 13)::VARCHAR AS
"bank_transfer_amount_b", "tmp1"."status"::VARCHAR AS "status_a", "tmp2"."status"::VARCHAR
AS "status_b", "tmp1"."coupon_amount"::DECIMAL(38, 13)::VARCHAR AS "coupon_amount_a",
"tmp2"."coupon_amount"::DECIMAL(38, 13)::VARCHAR AS "coupon_amount_b" FROM
"datafold_demo"."production"."orders" "tmp1" FULL OUTER JOIN
"datafold_demo"."development"."orders" "tmp2" ON ("tmp1"."order_id" = "tmp2"."order_id"))
tmp3 WHERE (("is_diff_order_id" = 1) OR ("is_diff_amount" = 1) OR ("is_diff_order_date" =
1) OR ("is_diff_credit_card_amount" = 1) OR ("is_diff_customer_id" = 1) OR
("is_diff_gift_card_amount" = 1) OR ("is_diff_bank_transfer_amount" = 1) OR
("is_diff_status" = 1) OR ("is_diff_coupon_amount" = 1))
12:39:29 INFO Diffing complete joindiff_tables.py:165
datafold_demo.production.orders <> datafold_demo.development.orders
Rows Added Rows Removed
------------ --------------
0 94
Updated Rows: 0
Unchanged Rows: 5
Values Updated:
amount: 0
order_date: 0
credit_card_amount: 0
customer_id: 0
gift_card_amount: 0
bank_transfer_amount: 0
status: 0
coupon_amount: 0
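
For readers skimming the log, the summary above can be approximated in plain Python. This is a minimal sketch of the idea behind the FULL OUTER JOIN query, not data-diff's implementation: the function name `join_diff_summary` and the sample rows are hypothetical, and it assumes production is the first table (so rows found only there count as removed) and that both tables share the same columns.

```python
from collections import Counter


def join_diff_summary(prod, dev):
    """Summarise differences between two tables keyed by primary key.

    `prod` and `dev` map a primary key to a dict of column values.
    Keys only in `prod` are "removed", keys only in `dev` are "added",
    mirroring the is_exclusive_a / is_exclusive_b flags in the logged SQL.
    """
    prod_keys, dev_keys = set(prod), set(dev)
    shared = prod_keys & dev_keys

    values_updated = Counter()  # per-column count of differing values
    updated_rows = 0
    for key in shared:
        # `!=` treats None == None as equal, like IS DISTINCT FROM does for NULLs
        changed = [col for col, val in prod[key].items() if dev[key].get(col) != val]
        values_updated.update(changed)
        if changed:
            updated_rows += 1

    return {
        "rows_added": len(dev_keys - prod_keys),
        "rows_removed": len(prod_keys - dev_keys),
        "updated_rows": updated_rows,
        "unchanged_rows": len(shared) - updated_rows,
        "values_updated": dict(values_updated),
    }


# Hypothetical sample data, keyed by order_id
prod = {
    1: {"status": "placed", "amount": 10.0},
    2: {"status": "shipped", "amount": 20.0},
    3: {"status": "returned", "amount": 5.0},
}
dev = {
    1: {"status": "placed", "amount": 10.0},
    2: {"status": "shipped", "amount": 25.0},
}
summary = join_diff_summary(prod, dev)
print(summary)
```

In the run above, the same accounting gives 94 rows present only in production and 5 rows identical in both schemas.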
dlawin reviewed and approved these changes on Oct 9, 2023
Comment isn't blocking, existing smell that could be handled in a subsequent PR
sungchun12 force-pushed the feature/motherduck-support branch from cd55fd6 to f26e11b on October 10, 2023 16:19
Did a |
sungchun12 force-pushed the feature/motherduck-support branch from f26e11b to 1c26031 on October 10, 2023 16:20
dlawin approved these changes on Oct 10, 2023
Get duckdb/motherduck to work with data-diff using joindiff within the dbt-core integration. Right now, it defaults to hashdiff, which is not as performant.
Scrub sensitive motherduck tokens in logs.
Known limitation: SaaS-only mode will not work with data-diff. Supporting it would require changing multiple query runner functions to avoid race conditions, which is outside the scope of this PR.
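
To illustrate the token-scrubbing item, here is a minimal sketch of masking the `motherduck_token` query parameter before a connection string is logged. It is an assumption about one reasonable approach, not the code in this PR: the helper name `scrub_motherduck_token`, the regular expression, and the mask string are all hypothetical.

```python
import re

# Matches "motherduck_token=<value>" up to the next "&", quote, or whitespace.
_TOKEN_RE = re.compile(r"(motherduck_token=)[^&\s'\"]+", re.IGNORECASE)


def scrub_motherduck_token(text, mask="**********"):
    """Return `text` with any motherduck_token value replaced by `mask`."""
    # A function replacement avoids the mask being read as a backreference.
    return _TOKEN_RE.sub(lambda match: match.group(1) + mask, text)


# Hypothetical connection string with a made-up token value
path = "md:datafold_demo?motherduck_token=abc123.secret"
scrubbed = scrub_motherduck_token(path)
print(scrubbed)  # md:datafold_demo?motherduck_token=**********
```

Applying the helper to every message before it reaches the logger keeps the database name visible for debugging while hiding the credential.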