This repository was archived by the owner on May 17, 2024. It is now read-only.

Databricks connection does not close properly when using --dbt #873

@leoebfolsom

Description


Describe the bug
After the diff completes, the Databricks connection is not closed cleanly; it lingers until the underlying socket dies (roughly a minute later in the log below: results at 08:28:36, close error at 08:29:38). Diff results are successfully returned to the user, but they are followed by an error message from the failed session close.
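To confirm the "lingers until it dies" behavior independently of data-diff, the close attempt can be timed and its exception captured. This is a hedged sketch, not data-diff's actual code: `close_session` is a hypothetical stand-in for whatever callable tears down the Databricks session.

```python
import time

def time_session_close(close_session):
    """Return (elapsed_seconds, exception_or_None) for one close attempt.

    close_session is a hypothetical stand-in for the client's teardown
    call; any exception it raises is captured rather than propagated.
    """
    start = time.monotonic()
    try:
        close_session()
    except Exception as exc:
        return time.monotonic() - start, exc
    return time.monotonic() - start, None
```

In the log below, the gap between the final diff output and the close error is about 60 seconds, which is consistent with the socket idling out rather than being closed promptly.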

Version info:

(datafold-demo) ➜  datafold-demo git:(test-2) data-diff --version
v0.11.1

Excerpt of debug output:

Running with data-diff=0.11.1
08:28:30 INFO     Parsing file dbt_project.yml                                                                  dbt_parser.py:287
         INFO     Parsing file                                                                                  dbt_parser.py:280
                  /Users/leoebfolsom/git/leoebfolsom-databricks-demo/datafold-demo/target/manifest.json                          
         INFO     Parsing file target/run_results.json                                                          dbt_parser.py:253
         INFO     config: prod_database='dev' prod_schema='dbt_leoebfolsomdatabricksdemo_prod'                  dbt_parser.py:159
                  prod_custom_schema=None datasource_id=7498                                                                     
         INFO     Parsing file /Users/leoebfolsom/.dbt/profiles.yml                                             dbt_parser.py:294
         DEBUG    Found PKs via Tags [dim_orgs]: ['org_id']                                                     dbt_parser.py:454
         INFO     [Databricks] Starting a threadpool, size=1.                                                        base.py:1237
         DEBUG    Running SQL (Databricks):                                                                           base.py:982
                  SET TIME ZONE 'UTC'                                                                                            
08:28:34 DEBUG    Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs')                 base.py:980
                  SELECT * FROM (SELECT TRIM(`sub_plan`) FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs`)             
                  AS LIMITED_SELECT LIMIT 64                                                                                     
         DEBUG    Mixed Alphanum/Non-Alphanum values detected in column                                              base.py:1153
                  dev.dbt_leoebfolsomdatabricksdemo_prod.dim_orgs.sub_plan. It cannot be used as a key.                          
         INFO     [Databricks] Schema = {'org_id': Integer(_notes=[], precision=0, python_type=<class 'int'>),       schema.py:44
                  'created_at': Timestamp(_notes=[], precision=6, rounds=True), 'num_users': Integer(_notes=[],                  
                  precision=0, python_type=<class 'int'>), 'sub_created_at': Timestamp(_notes=[], precision=6,                   
                  rounds=True), 'sub_plan': Text(_notes=[], collation=None), 'sub_price': Integer(_notes=[],                     
                  precision=0, python_type=<class 'int'>)}                                                                       
08:28:35 DEBUG    Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')                  base.py:980
                  SELECT * FROM (SELECT TRIM(`sub_plan`) FROM `dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs`)              
                  AS LIMITED_SELECT LIMIT 64                                                                                     
         DEBUG    Mixed Alphanum/Non-Alphanum values detected in column                                              base.py:1153
                  dev.dbt_leoebfolsomdatabricksdemo_dev.dim_orgs.sub_plan. It cannot be used as a key.                           
         INFO     [Databricks] Schema = {'org_id': Integer(_notes=[], precision=0, python_type=<class 'int'>),       schema.py:44
                  'created_at': Timestamp(_notes=[], precision=6, rounds=True), 'num_users': Integer(_notes=[],                  
                  precision=0, python_type=<class 'int'>), 'sub_created_at': Timestamp(_notes=[], precision=6,                   
                  rounds=True), 'sub_plan': Text(_notes=[], collation=None), 'sub_price': Integer(_notes=[],                     
                  precision=0, python_type=<class 'int'>)}                                                                       
         DEBUG    Testing for duplicate keys: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') <> joindiff_tables.py:230
                  ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')                                                       
         DEBUG    Running SQL (Databricks):                                                                           base.py:982
                  SELECT cast(min(`org_id`) as string), cast(max(`org_id`) as string) FROM                                       
                  `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs`                                                          
         INFO     Validating that the are no duplicate keys in columns: ['org_id'] for ('dev',             joindiff_tables.py:243
                  'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs')                                                              
         DEBUG    Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs')                 base.py:980
                  SELECT count(*) AS `total`, count(distinct coalesce(cast(`org_id` as string), '<null>')) AS                    
                  `total_distinct` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs`                                    
         DEBUG    Running SQL (Databricks):                                                                           base.py:982
                  SELECT cast(min(`org_id`) as string), cast(max(`org_id`) as string) FROM                                       
                  `dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs`                                                           
         INFO     Diffing segments at key-range: (100000001)..(100000158). size: table1 <= 157, table2 <= 157  diff_tables.py:318
         INFO     . Diffing segment 1/32, key-range: (100000001)..(100000005), size <= None                joindiff_tables.py:183
         DEBUG    Collecting stats for table #1: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') joindiff_tables.py:270
         DEBUG    Querying for different rows: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs')   joindiff_tables.py:208
         DEBUG    Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs')                 base.py:980
                  SELECT sum(`sub_price`) AS `sum_sub_price`, sum(`num_users`) AS `sum_num_users`, count(*) AS                   
                  `count` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs` WHERE (`org_id` >= 100000001)               
                  AND (`org_id` < 100000005)                                                                                     
         DEBUG    Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs')                 base.py:980
                  SELECT * FROM (SELECT (`tmp2`.`org_id` IS NULL) AS `is_exclusive_a`, (`tmp1`.`org_id` IS NULL) AS              
                  `is_exclusive_b`, CASE WHEN `tmp1`.`org_id` is distinct from `tmp2`.`org_id` THEN 1 ELSE 0 END AS              
                  `is_diff_org_id`, CASE WHEN `tmp1`.`sub_price` is distinct from `tmp2`.`sub_price` THEN 1 ELSE 0               
                  END AS `is_diff_sub_price`, CASE WHEN `tmp1`.`sub_plan` is distinct from `tmp2`.`sub_plan` THEN 1              
                  ELSE 0 END AS `is_diff_sub_plan`, CASE WHEN `tmp1`.`sub_created_at` is distinct from                           
                  `tmp2`.`sub_created_at` THEN 1 ELSE 0 END AS `is_diff_sub_created_at`, CASE WHEN                               
                  `tmp1`.`created_at` is distinct from `tmp2`.`created_at` THEN 1 ELSE 0 END AS `is_diff_created_at`,            
                  CASE WHEN `tmp1`.`num_users` is distinct from `tmp2`.`num_users` THEN 1 ELSE 0 END AS                          
                  `is_diff_num_users`, cast(`tmp1`.`org_id` as string) AS `org_id_a`, cast(`tmp2`.`org_id` as string)            
                  AS `org_id_b`, cast(`tmp1`.`sub_price` as string) AS `sub_price_a`, cast(`tmp2`.`sub_price` as                 
                  string) AS `sub_price_b`, cast(`tmp1`.`sub_plan` as string) AS `sub_plan_a`, cast(`tmp2`.`sub_plan`            
                  as string) AS `sub_plan_b`,                                                                                    
                  date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp1`.`sub_created_at` as timestamp)) /              
                  1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `sub_created_at_a`,                        
                  date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp2`.`sub_created_at` as timestamp)) /              
                  1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `sub_created_at_b`,                        
                  date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp1`.`created_at` as timestamp)) /                  
                  1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `created_at_a`,                            
                  date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp2`.`created_at` as timestamp)) /                  
                  1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `created_at_b`,                            
                  cast(`tmp1`.`num_users` as string) AS `num_users_a`, cast(`tmp2`.`num_users` as string) AS                     
                  `num_users_b` FROM (SELECT * FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs` WHERE                  
                  (`org_id` >= 100000001) AND (`org_id` < 100000005)) `tmp1` FULL OUTER JOIN (SELECT * FROM                      
                  `dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs` WHERE (`org_id` >= 100000001) AND (`org_id` <             
                  100000005)) `tmp2` ON (`tmp1`.`org_id` = `tmp2`.`org_id`)) tmp3 WHERE ((`is_diff_org_id` = 1) OR               
                  (`is_diff_sub_price` = 1) OR (`is_diff_sub_plan` = 1) OR (`is_diff_sub_created_at` = 1) OR                     
                  (`is_diff_created_at` = 1) OR (`is_diff_num_users` = 1))                                                       
         INFO     Validating that the are no duplicate keys in columns: ['org_id'] for ('dev',             joindiff_tables.py:243
                  'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')                                                               
         DEBUG    Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')                  base.py:980
                  SELECT count(*) AS `total`, count(distinct coalesce(cast(`org_id` as string), '<null>')) AS                    
                  `total_distinct` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs`                                     
08:28:36 DEBUG    Done collecting stats for table #1: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod',        joindiff_tables.py:306
                  'dim_orgs')                                                                                                    
         DEBUG    Collecting stats for table #2: ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')  joindiff_tables.py:270
         DEBUG    Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')                  base.py:980
                  SELECT sum(`sub_price`) AS `sum_sub_price`, sum(`num_users`) AS `sum_num_users`, count(*) AS                   
                  `count` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs` WHERE (`org_id` >= 100000001) AND            
                  (`org_id` < 100000005)                                                                                         
         DEBUG    Done collecting stats for table #2: ('dev', 'dbt_leoebfolsomdatabricksdemo_dev',         joindiff_tables.py:306
                  'dim_orgs')                                                                                                    
         DEBUG    Testing for null keys: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') <>      joindiff_tables.py:252
                  ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')                                                       
         DEBUG    Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs')                 base.py:980
                  SELECT `org_id` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs` WHERE (`org_id` >=                  
                  100000001) AND (`org_id` < 100000005) AND (`org_id` IS NULL)                                                   

.... many repetitive such logs ...

         DEBUG    Counting differences per column: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod',           joindiff_tables.py:346
                  'dim_orgs') <> ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')                                        
         DEBUG    Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') <> ('dev',      base.py:980
                  'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')                                                               
                  SELECT sum(`is_diff_org_id`), sum(`is_diff_sub_price`), sum(`is_diff_sub_plan`),                               
                  sum(`is_diff_sub_created_at`), sum(`is_diff_created_at`), sum(`is_diff_num_users`) FROM (SELECT                
                  (`tmp2`.`org_id` IS NULL) AS `is_exclusive_a`, (`tmp1`.`org_id` IS NULL) AS `is_exclusive_b`, CASE             
                  WHEN `tmp1`.`org_id` is distinct from `tmp2`.`org_id` THEN 1 ELSE 0 END AS `is_diff_org_id`, CASE              
                  WHEN `tmp1`.`sub_price` is distinct from `tmp2`.`sub_price` THEN 1 ELSE 0 END AS                               
                  `is_diff_sub_price`, CASE WHEN `tmp1`.`sub_plan` is distinct from `tmp2`.`sub_plan` THEN 1 ELSE 0              
                  END AS `is_diff_sub_plan`, CASE WHEN `tmp1`.`sub_created_at` is distinct from                                  
                  `tmp2`.`sub_created_at` THEN 1 ELSE 0 END AS `is_diff_sub_created_at`, CASE WHEN                               
                  `tmp1`.`created_at` is distinct from `tmp2`.`created_at` THEN 1 ELSE 0 END AS `is_diff_created_at`,            
                  CASE WHEN `tmp1`.`num_users` is distinct from `tmp2`.`num_users` THEN 1 ELSE 0 END AS                          
                  `is_diff_num_users`, cast(`tmp1`.`org_id` as string) AS `org_id_a`, cast(`tmp2`.`org_id` as string)            
                  AS `org_id_b`, cast(`tmp1`.`sub_price` as string) AS `sub_price_a`, cast(`tmp2`.`sub_price` as                 
                  string) AS `sub_price_b`, cast(`tmp1`.`sub_plan` as string) AS `sub_plan_a`, cast(`tmp2`.`sub_plan`            
                  as string) AS `sub_plan_b`,                                                                                    
                  date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp1`.`sub_created_at` as timestamp)) /              
                  1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `sub_created_at_a`,                        
                  date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp2`.`sub_created_at` as timestamp)) /              
                  1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `sub_created_at_b`,                        
                  date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp1`.`created_at` as timestamp)) /                  
                  1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `created_at_a`,                            
                  date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp2`.`created_at` as timestamp)) /                  
                  1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `created_at_b`,                            
                  cast(`tmp1`.`num_users` as string) AS `num_users_a`, cast(`tmp2`.`num_users` as string) AS                     
                  `num_users_b` FROM (SELECT * FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs` WHERE                  
                  (`org_id` >= 100000125) AND (`org_id` < 100000158)) `tmp1` FULL OUTER JOIN (SELECT * FROM                      
                  `dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs` WHERE (`org_id` >= 100000125) AND (`org_id` <             
                  100000158)) `tmp2` ON (`tmp1`.`org_id` = `tmp2`.`org_id`)) tmp3 WHERE ((`is_diff_org_id` = 1) OR               
                  (`is_diff_sub_price` = 1) OR (`is_diff_sub_plan` = 1) OR (`is_diff_sub_created_at` = 1) OR                     
                  (`is_diff_created_at` = 1) OR (`is_diff_num_users` = 1))                                                       
         INFO     Diffing complete: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') <> ('dev',   joindiff_tables.py:165
                  'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')                                                               

dev.dbt_leoebfolsomdatabricksdemo_prod.dim_orgs <> dev.dbt_leoebfolsomdatabricksdemo_dev.dim_orgs 
Primary Keys: ['org_id'] 

rows       PROD    <>            DEV
---------  ------  ------------  -----------------
Total      157                   10 [-147]
Added              0
Removed            -147
Different          0
Unchanged          10

columns           # diff values
--------------  ---------------
created_at                    0
num_users                     0
sub_created_at                0
sub_plan                      0
sub_price                     0 

08:29:38 ERROR    Attempt to close session raised a local exception:                                                client.py:277
                  HTTPSConnectionPool(host='<redacted>.cloud.databricks.com', port=443): Max retries                      
                  exceeded with url: /sql/1.0/warehouses/<redacted> (Caused by SSLError(SSLError(0,                        
                  'Underlying socket connection gone (_ssl.c:2407)')))                                                           
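One possible mitigation (a hedged sketch, not the actual data-diff fix): since the diff results have already been delivered by the time the session is torn down, an exception raised while closing an already-dead session could be logged and swallowed instead of surfacing as an error. `close_session` is again a hypothetical stand-in for the client's teardown call.

```python
import logging

logger = logging.getLogger("safe_close")

def safe_close(close_session):
    """Attempt to close a session; return True on success, False if it raised.

    Exceptions during close (e.g. an SSLError from a socket that has
    already gone away) are logged as warnings and not re-raised.
    """
    try:
        close_session()
    except Exception as exc:
        logger.warning("Ignoring exception during session close: %s", exc)
        return False
    return True
```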

Describe the environment

OS: MacBook Pro 16-inch, 2021; Apple M1 Max; 64 GB memory; macOS Ventura 13.4.1
data-diff version: 0.11.1
(I also found the same issue in 0.9.17)

Metadata

Labels: bug (Something isn't working), stale (Issues/PRs that have gone stale), triage