This repository was archived by the owner on May 17, 2024. It is now read-only.
Databricks connection does not close properly when using --dbt #873
Closed as not planned
Labels: bug (Something isn't working), stale (Issues/PRs that have gone stale), triage
Description
Describe the bug
After the diff completes, the connection stays open for a while (about a minute, per the timestamps below) until it dies. Diff results are returned to the user successfully, but are followed by an error message.
Version info:
(datafold-demo) ➜ datafold-demo git:(test-2) data-diff --version
v0.11.1
Excerpt of debug output:
Running with data-diff=0.11.1
08:28:30 INFO Parsing file dbt_project.yml dbt_parser.py:287
INFO Parsing file dbt_parser.py:280
/Users/leoebfolsom/git/leoebfolsom-databricks-demo/datafold-demo/target/manifest.json
INFO Parsing file target/run_results.json dbt_parser.py:253
INFO config: prod_database='dev' prod_schema='dbt_leoebfolsomdatabricksdemo_prod' dbt_parser.py:159
prod_custom_schema=None datasource_id=7498
INFO Parsing file /Users/leoebfolsom/.dbt/profiles.yml dbt_parser.py:294
DEBUG Found PKs via Tags [dim_orgs]: ['org_id'] dbt_parser.py:454
INFO [Databricks] Starting a threadpool, size=1. base.py:1237
DEBUG Running SQL (Databricks): base.py:982
SET TIME ZONE 'UTC'
08:28:34 DEBUG Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') base.py:980
SELECT * FROM (SELECT TRIM(`sub_plan`) FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs`)
AS LIMITED_SELECT LIMIT 64
DEBUG Mixed Alphanum/Non-Alphanum values detected in column base.py:1153
dev.dbt_leoebfolsomdatabricksdemo_prod.dim_orgs.sub_plan. It cannot be used as a key.
INFO [Databricks] Schema = {'org_id': Integer(_notes=[], precision=0, python_type=<class 'int'>), schema.py:44
'created_at': Timestamp(_notes=[], precision=6, rounds=True), 'num_users': Integer(_notes=[],
precision=0, python_type=<class 'int'>), 'sub_created_at': Timestamp(_notes=[], precision=6,
rounds=True), 'sub_plan': Text(_notes=[], collation=None), 'sub_price': Integer(_notes=[],
precision=0, python_type=<class 'int'>)}
08:28:35 DEBUG Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs') base.py:980
SELECT * FROM (SELECT TRIM(`sub_plan`) FROM `dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs`)
AS LIMITED_SELECT LIMIT 64
DEBUG Mixed Alphanum/Non-Alphanum values detected in column base.py:1153
dev.dbt_leoebfolsomdatabricksdemo_dev.dim_orgs.sub_plan. It cannot be used as a key.
INFO [Databricks] Schema = {'org_id': Integer(_notes=[], precision=0, python_type=<class 'int'>), schema.py:44
'created_at': Timestamp(_notes=[], precision=6, rounds=True), 'num_users': Integer(_notes=[],
precision=0, python_type=<class 'int'>), 'sub_created_at': Timestamp(_notes=[], precision=6,
rounds=True), 'sub_plan': Text(_notes=[], collation=None), 'sub_price': Integer(_notes=[],
precision=0, python_type=<class 'int'>)}
DEBUG Testing for duplicate keys: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') <> joindiff_tables.py:230
('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')
DEBUG Running SQL (Databricks): base.py:982
SELECT cast(min(`org_id`) as string), cast(max(`org_id`) as string) FROM
`dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs`
INFO Validating that the are no duplicate keys in columns: ['org_id'] for ('dev', joindiff_tables.py:243
'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs')
DEBUG Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') base.py:980
SELECT count(*) AS `total`, count(distinct coalesce(cast(`org_id` as string), '<null>')) AS
`total_distinct` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs`
DEBUG Running SQL (Databricks): base.py:982
SELECT cast(min(`org_id`) as string), cast(max(`org_id`) as string) FROM
`dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs`
INFO Diffing segments at key-range: (100000001)..(100000158). size: table1 <= 157, table2 <= 157 diff_tables.py:318
INFO . Diffing segment 1/32, key-range: (100000001)..(100000005), size <= None joindiff_tables.py:183
DEBUG Collecting stats for table #1: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') joindiff_tables.py:270
DEBUG Querying for different rows: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') joindiff_tables.py:208
DEBUG Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') base.py:980
SELECT sum(`sub_price`) AS `sum_sub_price`, sum(`num_users`) AS `sum_num_users`, count(*) AS
`count` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs` WHERE (`org_id` >= 100000001)
AND (`org_id` < 100000005)
DEBUG Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') base.py:980
SELECT * FROM (SELECT (`tmp2`.`org_id` IS NULL) AS `is_exclusive_a`, (`tmp1`.`org_id` IS NULL) AS
`is_exclusive_b`, CASE WHEN `tmp1`.`org_id` is distinct from `tmp2`.`org_id` THEN 1 ELSE 0 END AS
`is_diff_org_id`, CASE WHEN `tmp1`.`sub_price` is distinct from `tmp2`.`sub_price` THEN 1 ELSE 0
END AS `is_diff_sub_price`, CASE WHEN `tmp1`.`sub_plan` is distinct from `tmp2`.`sub_plan` THEN 1
ELSE 0 END AS `is_diff_sub_plan`, CASE WHEN `tmp1`.`sub_created_at` is distinct from
`tmp2`.`sub_created_at` THEN 1 ELSE 0 END AS `is_diff_sub_created_at`, CASE WHEN
`tmp1`.`created_at` is distinct from `tmp2`.`created_at` THEN 1 ELSE 0 END AS `is_diff_created_at`,
CASE WHEN `tmp1`.`num_users` is distinct from `tmp2`.`num_users` THEN 1 ELSE 0 END AS
`is_diff_num_users`, cast(`tmp1`.`org_id` as string) AS `org_id_a`, cast(`tmp2`.`org_id` as string)
AS `org_id_b`, cast(`tmp1`.`sub_price` as string) AS `sub_price_a`, cast(`tmp2`.`sub_price` as
string) AS `sub_price_b`, cast(`tmp1`.`sub_plan` as string) AS `sub_plan_a`, cast(`tmp2`.`sub_plan`
as string) AS `sub_plan_b`,
date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp1`.`sub_created_at` as timestamp)) /
1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `sub_created_at_a`,
date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp2`.`sub_created_at` as timestamp)) /
1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `sub_created_at_b`,
date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp1`.`created_at` as timestamp)) /
1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `created_at_a`,
date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp2`.`created_at` as timestamp)) /
1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `created_at_b`,
cast(`tmp1`.`num_users` as string) AS `num_users_a`, cast(`tmp2`.`num_users` as string) AS
`num_users_b` FROM (SELECT * FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs` WHERE
(`org_id` >= 100000001) AND (`org_id` < 100000005)) `tmp1` FULL OUTER JOIN (SELECT * FROM
`dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs` WHERE (`org_id` >= 100000001) AND (`org_id` <
100000005)) `tmp2` ON (`tmp1`.`org_id` = `tmp2`.`org_id`)) tmp3 WHERE ((`is_diff_org_id` = 1) OR
(`is_diff_sub_price` = 1) OR (`is_diff_sub_plan` = 1) OR (`is_diff_sub_created_at` = 1) OR
(`is_diff_created_at` = 1) OR (`is_diff_num_users` = 1))
INFO Validating that the are no duplicate keys in columns: ['org_id'] for ('dev', joindiff_tables.py:243
'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')
DEBUG Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs') base.py:980
SELECT count(*) AS `total`, count(distinct coalesce(cast(`org_id` as string), '<null>')) AS
`total_distinct` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs`
08:28:36 DEBUG Done collecting stats for table #1: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', joindiff_tables.py:306
'dim_orgs')
DEBUG Collecting stats for table #2: ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs') joindiff_tables.py:270
DEBUG Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs') base.py:980
SELECT sum(`sub_price`) AS `sum_sub_price`, sum(`num_users`) AS `sum_num_users`, count(*) AS
`count` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs` WHERE (`org_id` >= 100000001) AND
(`org_id` < 100000005)
DEBUG Done collecting stats for table #2: ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', joindiff_tables.py:306
'dim_orgs')
DEBUG Testing for null keys: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') <> joindiff_tables.py:252
('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')
DEBUG Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') base.py:980
SELECT `org_id` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs` WHERE (`org_id` >=
100000001) AND (`org_id` < 100000005) AND (`org_id` IS NULL)
.... many repetitive such logs ...
DEBUG Counting differences per column: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', joindiff_tables.py:346
'dim_orgs') <> ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')
DEBUG Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') <> ('dev', base.py:980
'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')
SELECT sum(`is_diff_org_id`), sum(`is_diff_sub_price`), sum(`is_diff_sub_plan`),
sum(`is_diff_sub_created_at`), sum(`is_diff_created_at`), sum(`is_diff_num_users`) FROM (SELECT
(`tmp2`.`org_id` IS NULL) AS `is_exclusive_a`, (`tmp1`.`org_id` IS NULL) AS `is_exclusive_b`, CASE
WHEN `tmp1`.`org_id` is distinct from `tmp2`.`org_id` THEN 1 ELSE 0 END AS `is_diff_org_id`, CASE
WHEN `tmp1`.`sub_price` is distinct from `tmp2`.`sub_price` THEN 1 ELSE 0 END AS
`is_diff_sub_price`, CASE WHEN `tmp1`.`sub_plan` is distinct from `tmp2`.`sub_plan` THEN 1 ELSE 0
END AS `is_diff_sub_plan`, CASE WHEN `tmp1`.`sub_created_at` is distinct from
`tmp2`.`sub_created_at` THEN 1 ELSE 0 END AS `is_diff_sub_created_at`, CASE WHEN
`tmp1`.`created_at` is distinct from `tmp2`.`created_at` THEN 1 ELSE 0 END AS `is_diff_created_at`,
CASE WHEN `tmp1`.`num_users` is distinct from `tmp2`.`num_users` THEN 1 ELSE 0 END AS
`is_diff_num_users`, cast(`tmp1`.`org_id` as string) AS `org_id_a`, cast(`tmp2`.`org_id` as string)
AS `org_id_b`, cast(`tmp1`.`sub_price` as string) AS `sub_price_a`, cast(`tmp2`.`sub_price` as
string) AS `sub_price_b`, cast(`tmp1`.`sub_plan` as string) AS `sub_plan_a`, cast(`tmp2`.`sub_plan`
as string) AS `sub_plan_b`,
date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp1`.`sub_created_at` as timestamp)) /
1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `sub_created_at_a`,
date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp2`.`sub_created_at` as timestamp)) /
1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `sub_created_at_b`,
date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp1`.`created_at` as timestamp)) /
1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `created_at_a`,
date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp2`.`created_at` as timestamp)) /
1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `created_at_b`,
cast(`tmp1`.`num_users` as string) AS `num_users_a`, cast(`tmp2`.`num_users` as string) AS
`num_users_b` FROM (SELECT * FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs` WHERE
(`org_id` >= 100000125) AND (`org_id` < 100000158)) `tmp1` FULL OUTER JOIN (SELECT * FROM
`dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs` WHERE (`org_id` >= 100000125) AND (`org_id` <
100000158)) `tmp2` ON (`tmp1`.`org_id` = `tmp2`.`org_id`)) tmp3 WHERE ((`is_diff_org_id` = 1) OR
(`is_diff_sub_price` = 1) OR (`is_diff_sub_plan` = 1) OR (`is_diff_sub_created_at` = 1) OR
(`is_diff_created_at` = 1) OR (`is_diff_num_users` = 1))
INFO Diffing complete: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') <> ('dev', joindiff_tables.py:165
'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')
dev.dbt_leoebfolsomdatabricksdemo_prod.dim_orgs <> dev.dbt_leoebfolsomdatabricksdemo_dev.dim_orgs
Primary Keys: ['org_id']
rows PROD <> DEV
--------- ------ ------------ -----------------
Total 157 10 [-147]
Added 0
Removed -147
Different 0
Unchanged 10
columns # diff values
-------------- ---------------
created_at 0
num_users 0
sub_created_at 0
sub_plan 0
sub_price 0
08:29:38 ERROR Attempt to close session raised a local exception: client.py:277
HTTPSConnectionPool(host='<redacted>.cloud.databricks.com', port=443): Max retries
exceeded with url: /sql/1.0/warehouses/<redacted> (Caused by SSLError(SSLError(0,
'Underlying socket connection gone (_ssl.c:2407)')))
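Not a fix for the root cause, but for anyone hitting this: the failure is purely local (the close of an already-dead session raises), so it can be suppressed defensively. Below is a minimal sketch of that pattern; `close_quietly` and the connection objects are illustrative names, not data-diff or databricks-sql-connector internals.

```python
import logging

logger = logging.getLogger("databricks.close")

def close_quietly(connection) -> bool:
    """Try to close a connection; log and swallow local close errors.

    Once the underlying socket is gone (as in the SSLError above),
    close() can only fail locally, so there is nothing actionable
    to surface to the user.
    """
    try:
        connection.close()
        return True
    except Exception as exc:  # e.g. SSLError: socket already gone
        logger.warning("Attempt to close session raised a local exception: %s", exc)
        return False
```

Wrapping the shutdown path like this would turn the ERROR at the end of the run into a warning (or silence) without affecting the diff results, which are already complete by that point.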
Describe the environment
OS: MacBook Pro 16-inch, 2021; Apple M1 Max; 64 GB memory; macOS Ventura 13.4.1
data-diff version: 0.11.1
(I also found the same issue in 0.9.17)