dbt-databricks 1.6.X overeager introspection (describe extended ...) of all tables in schema #442

Closed
jeremyyeo opened this issue Sep 8, 2023 · 5 comments · Fixed by #446
Labels
bug Something isn't working

Comments


jeremyyeo commented Sep 8, 2023

Describe the bug

dbt-databricks==1.6.2 appears to be aggressively introspecting all tables in a schema prior to running any models. This causes a huge bottleneck and a large performance regression relative to prior versions (i.e. dbt-databricks~=1.5.0).

Steps To Reproduce

  1. Use a schema with hundreds of random tables (below, hive_metastore.dbt_jyeo has 300+ random delta tables):
    (screenshot: the hive_metastore.dbt_jyeo schema listing showing 300+ _tmp_* delta tables)
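
To seed a comparable schema, a loop like the following works (a sketch, not part of the original report; it assumes databricks-sql-connector==2.7.0 from the pip freeze below and the same placeholder credentials as profiles.yml; the script name and table count are hypothetical):

# seed_tables.py (hypothetical helper to create ~300 throwaway delta tables)
from databricks import sql

conn = sql.connect(
    server_hostname="<host>.cloud.databricks.com",
    http_path="/sql/1.0/endpoints/<endpoint>",
    access_token="<token>",
)
cursor = conn.cursor()
for i in range(300):
    # mirrors the _tmp_0__N naming that shows up in the logs below
    cursor.execute(
        f"create table if not exists hive_metastore.dbt_jyeo._tmp_0__{i} using delta as select 1 id"
    )
cursor.close()
conn.close()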

  2. Set up the dbt project:

# ~/.dbt/profiles.yml
databricks:
  target: default
  outputs:
    default: 
      type: databricks
      schema: dbt_jyeo
      host: <host>.cloud.databricks.com
      token: <token>
      http_path: /sql/1.0/endpoints/<endpoint>

# dbt_project.yml
name: my_dbt_project
profile: databricks
config-version: 2
version: 1.0

models:
  my_dbt_project:
    +materialized: table

-- models/foo.sql
select 1 id

  3. Build the model foo with 1.6.latest:
$ dbt run

============================== 14:08:22.625823 | abba8c69-2569-4628-b491-529c280f4946 ==============================
14:08:22.625823 [info ] [MainThread]: Running with dbt=1.6.2
14:08:22.677385 [debug] [MainThread]: running dbt with arguments {'printer_width': '80', 'indirect_selection': 'eager', 'write_json': 'True', 'log_cache_events': 'False', 'partial_parse': 'False', 'cache_selected_only': 'False', 'profiles_dir': '/Users/jeremy/.dbt', 'debug': 'True', 'version_check': 'True', 'log_path': '/Users/jeremy/src/dbt-basic/logs', 'fail_fast': 'False', 'warn_error': 'None', 'use_colors': 'True', 'use_experimental_parser': 'False', 'no_print': 'None', 'quiet': 'False', 'log_format': 'default', 'introspect': 'True', 'warn_error_options': 'WarnErrorOptions(include=[], exclude=[])', 'invocation_command': 'dbt --debug run', 'target_path': 'None', 'static_parser': 'True', 'send_anonymous_usage_stats': 'True'}
14:08:24.550026 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'project_id', 'label': 'abba8c69-2569-4628-b491-529c280f4946', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x105a844f0>]}
14:08:24.566183 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'adapter_info', 'label': 'abba8c69-2569-4628-b491-529c280f4946', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1180ed100>]}
14:08:24.567362 [info ] [MainThread]: Registered adapter: databricks=1.6.2
14:08:24.615542 [debug] [MainThread]: checksum: 8d66c46a7c5c0e5717f369273dbde9db64740684d9ba42e6442c1808d698ae9f, vars: {}, profile: , target: , version: 1.6.2
14:08:24.617734 [debug] [MainThread]: Partial parsing not enabled
14:08:26.376420 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'load_project', 'label': 'abba8c69-2569-4628-b491-529c280f4946', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x11831c0d0>]}
14:08:26.397580 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'resource_counts', 'label': 'abba8c69-2569-4628-b491-529c280f4946', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1182de670>]}
14:08:26.399297 [info ] [MainThread]: Found 1 model, 0 sources, 0 exposures, 0 metrics, 471 macros, 0 groups, 0 semantic models
14:08:26.400651 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': 'abba8c69-2569-4628-b491-529c280f4946', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x118253ca0>]}
14:08:26.403185 [info ] [MainThread]: 
14:08:26.404658 [debug] [MainThread]: Acquiring new databricks connection 'master'
14:08:26.407274 [debug] [ThreadPool]: Acquiring new databricks connection 'list_schemas'
14:08:26.430592 [debug] [ThreadPool]: Using databricks connection "list_schemas"
14:08:26.431796 [debug] [ThreadPool]: On list_schemas: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.2", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_schemas"} */

    show databases
  
14:08:26.432519 [debug] [ThreadPool]: Opening a new connection, currently in state init
14:08:27.681523 [info ] [ThreadPool]: databricks-sql-connector adapter: Successfully opened session 01ee4dec-8716-10b8-b18c-9ed310510eec
14:08:28.253997 [debug] [ThreadPool]: SQL status: OK in 1.8200000524520874 seconds
14:08:28.290675 [debug] [ThreadPool]: On list_schemas: Close
14:08:28.291899 [info ] [ThreadPool]: databricks-sql-connector adapter: Closing session 01ee4dec-8716-10b8-b18c-9ed310510eec
14:08:28.575694 [debug] [ThreadPool]: Re-using an available connection from the pool (formerly list_schemas, now list_None_dbt_jyeo)
14:08:28.601968 [debug] [ThreadPool]: Spark adapter: NotImplemented: add_begin_query
14:08:28.607101 [debug] [ThreadPool]: Using databricks connection "list_None_dbt_jyeo"
14:08:28.607745 [debug] [ThreadPool]: On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.2", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */
show tables in `dbt_jyeo`
  
14:08:28.608339 [debug] [ThreadPool]: Opening a new connection, currently in state closed
14:08:29.504308 [info ] [ThreadPool]: databricks-sql-connector adapter: Successfully opened session 01ee4dec-882b-11d6-b9a4-41ee41de1bfa
14:08:30.051957 [debug] [ThreadPool]: SQL status: OK in 1.440000057220459 seconds
14:08:30.085716 [debug] [ThreadPool]: Using databricks connection "list_None_dbt_jyeo"
14:08:30.086622 [debug] [ThreadPool]: On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.2", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */
show views in `dbt_jyeo`
  
14:08:30.603752 [debug] [ThreadPool]: SQL status: OK in 0.5199999809265137 seconds
14:08:30.623097 [debug] [ThreadPool]: Using databricks connection "list_None_dbt_jyeo"
14:08:30.624081 [debug] [ThreadPool]: On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.2", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */

      describe extended `dbt_jyeo`.`_tmp_0__0`
  
14:08:31.371161 [debug] [ThreadPool]: SQL status: OK in 0.75 seconds
14:08:31.381174 [debug] [ThreadPool]: Using databricks connection "list_None_dbt_jyeo"
14:08:31.382018 [debug] [ThreadPool]: On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.2", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */

      describe extended `dbt_jyeo`.`_tmp_0__1`
  
14:08:32.102468 [debug] [ThreadPool]: SQL status: OK in 0.7200000286102295 seconds
14:08:32.109431 [debug] [ThreadPool]: Using databricks connection "list_None_dbt_jyeo"
14:08:32.110426 [debug] [ThreadPool]: On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.2", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */

      describe extended `dbt_jyeo`.`_tmp_0__10`
  
14:08:32.775299 [debug] [ThreadPool]: SQL status: OK in 0.6600000262260437 seconds
14:08:32.787004 [debug] [ThreadPool]: Using databricks connection "list_None_dbt_jyeo"
14:08:32.788015 [debug] [ThreadPool]: On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.2", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */

      describe extended `dbt_jyeo`.`_tmp_0__100`
  
14:08:33.485988 [debug] [ThreadPool]: SQL status: OK in 0.699999988079071 seconds
14:08:33.499019 [debug] [ThreadPool]: Using databricks connection "list_None_dbt_jyeo"
14:08:33.500208 [debug] [ThreadPool]: On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.2", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */

      describe extended `dbt_jyeo`.`_tmp_0__1000`
  
14:08:34.205464 [debug] [ThreadPool]: SQL status: OK in 0.699999988079071 seconds
14:08:34.217129 [debug] [ThreadPool]: Using databricks connection "list_None_dbt_jyeo"
14:08:34.218309 [debug] [ThreadPool]: On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.2", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */

      describe extended `dbt_jyeo`.`_tmp_0__1001`
  
14:08:34.898771 [debug] [ThreadPool]: SQL status: OK in 0.6800000071525574 seconds
14:08:34.910728 [debug] [ThreadPool]: Using databricks connection "list_None_dbt_jyeo"
14:08:34.911870 [debug] [ThreadPool]: On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.2", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */

      describe extended `dbt_jyeo`.`_tmp_0__1002`
  
14:08:35.509657 [debug] [ThreadPool]: SQL status: OK in 0.6000000238418579 seconds
14:08:35.522024 [debug] [ThreadPool]: Using databricks connection "list_None_dbt_jyeo"
14:08:35.523289 [debug] [ThreadPool]: On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.2", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */

      describe extended `dbt_jyeo`.`_tmp_0__1003`
  
14:08:36.142313 [debug] [ThreadPool]: SQL status: OK in 0.6200000047683716 seconds
14:08:36.148765 [debug] [ThreadPool]: Using databricks connection "list_None_dbt_jyeo"
14:08:36.149505 [debug] [ThreadPool]: On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.2", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */

      describe extended `dbt_jyeo`.`_tmp_0__1004`
  
14:08:36.734367 [debug] [ThreadPool]: SQL status: OK in 0.5799999833106995 seconds
14:08:36.743767 [debug] [ThreadPool]: Using databricks connection "list_None_dbt_jyeo"
14:08:36.744673 [debug] [ThreadPool]: On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.2", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */

      describe extended `dbt_jyeo`.`_tmp_0__1005`
  
14:08:37.367365 [debug] [ThreadPool]: SQL status: OK in 0.6200000047683716 seconds
  
<<<<<<<<<<TRUNCATED>>>>>>>>>>

02:11:49  SQL status: OK in 0.5899999737739563 seconds
02:11:49  Using databricks connection "list_None_dbt_jyeo"
02:11:49  On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.2", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */

      describe extended `dbt_jyeo`.`bar`
  
02:11:49  SQL status: OK in 0.5899999737739563 seconds
02:11:49  Using databricks connection "list_None_dbt_jyeo"
02:11:49  On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.2", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */

      describe extended `dbt_jyeo`.`foo`
  
02:11:50  SQL status: OK in 0.6000000238418579 seconds
02:11:50  Using databricks connection "list_None_dbt_jyeo"
02:11:50  On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.2", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */

      describe extended `dbt_jyeo`.`inc_with_append`
  
02:11:51  SQL status: OK in 0.5899999737739563 seconds
02:11:51  Using databricks connection "list_None_dbt_jyeo"
02:11:51  On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.2", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */

      describe extended `dbt_jyeo`.`inc_without_append`
  
02:11:51  SQL status: OK in 0.5799999833106995 seconds
02:11:51  On list_None_dbt_jyeo: ROLLBACK
02:11:51  Databricks adapter: NotImplemented: rollback
02:11:51  On list_None_dbt_jyeo: Close
02:11:51  databricks-sql-connector adapter: Closing session 01ee4dec-882b-11d6-b9a4-41ee41de1bfa
02:11:51  Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': 'abba8c69-2569-4628-b491-529c280f4946', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x118291580>]}
02:11:51  Spark adapter: NotImplemented: add_begin_query
02:11:51  Spark adapter: NotImplemented: commit
02:11:51  Concurrency: 1 threads (target='default')
02:11:51  
02:11:51  Began running node model.my_dbt_project.foo
02:11:51  1 of 1 START sql table model dbt_jyeo.foo ...................................... [RUN]
02:11:51  Re-using an available connection from the pool (formerly list_None_dbt_jyeo, now model.my_dbt_project.foo)
02:11:51  Began compiling node model.my_dbt_project.foo
02:11:51  Writing injected SQL for node "model.my_dbt_project.foo"
02:11:51  Timing info for model.my_dbt_project.foo (compile): 14:11:51.969753 => 14:11:51.981695
02:11:51  Began executing node model.my_dbt_project.foo
02:11:52  Spark adapter: NotImplemented: add_begin_query
02:11:52  Using databricks connection "model.my_dbt_project.foo"
02:11:52  On model.my_dbt_project.foo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.2", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "node_id": "model.my_dbt_project.foo"} */

      describe extended `dbt_jyeo`.`foo`
  
02:11:52  Opening a new connection, currently in state closed
02:11:53  databricks-sql-connector adapter: Successfully opened session 01ee4ded-0186-1cec-89ad-513fbff9480e
02:11:53  SQL status: OK in 1.7599999904632568 seconds
02:11:53  Writing runtime sql for node "model.my_dbt_project.foo"
02:11:53  Using databricks connection "model.my_dbt_project.foo"
02:11:53  On model.my_dbt_project.foo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.2", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "node_id": "model.my_dbt_project.foo"} */

  
    
        create or replace table `dbt_jyeo`.`foo`
      
      
    using delta
      
      
      
      
      
      
      
      as
      select 1 id
  
02:11:56  SQL status: OK in 3.0899999141693115 seconds
02:11:57  Timing info for model.my_dbt_project.foo (execute): 14:11:51.984375 => 14:11:57.003655
02:11:57  On model.my_dbt_project.foo: ROLLBACK
02:11:57  Databricks adapter: NotImplemented: rollback
02:11:57  On model.my_dbt_project.foo: Close
02:11:57  databricks-sql-connector adapter: Closing session 01ee4ded-0186-1cec-89ad-513fbff9480e
02:11:57  Sending event: {'category': 'dbt', 'action': 'run_model', 'label': 'abba8c69-2569-4628-b491-529c280f4946', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x118193e20>]}
02:11:57  1 of 1 OK created sql table model dbt_jyeo.foo ................................. [OK in 5.34s]
02:11:57  Finished running node model.my_dbt_project.foo
02:11:57  On master: ROLLBACK
02:11:57  Opening a new connection, currently in state init
02:11:58  databricks-sql-connector adapter: Successfully opened session 01ee4ded-0490-1ba7-964b-b1130001391f
02:11:58  Databricks adapter: NotImplemented: rollback
02:11:58  Spark adapter: NotImplemented: add_begin_query
02:11:58  Spark adapter: NotImplemented: commit
02:11:58  On master: ROLLBACK
02:11:58  Databricks adapter: NotImplemented: rollback
02:11:58  On master: Close
02:11:58  databricks-sql-connector adapter: Closing session 01ee4ded-0490-1ba7-964b-b1130001391f
02:11:58  Connection 'master' was properly closed.
02:11:58  Connection 'model.my_dbt_project.foo' was properly closed.
02:11:58  
02:11:58  Finished running 1 table model in 0 hours 3 minutes and 32.05 seconds (212.05s).
02:11:58  Command end result
02:11:58  
02:11:58  Completed successfully
02:11:58  
02:11:58  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
02:11:58  Command `dbt run` succeeded at 14:11:58.479854 after 215.90 seconds
02:11:58  Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10474a2b0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x118291be0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1182ffa90>]}
02:11:58  Flushing usage events

Logs above are heavily truncated, but you get the idea: we're running describe extended on everything in dbt_jyeo.

Expected behavior

We should not be running describe extended on everything in the schema; only the relevant models (like foo) need it.
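
Concretely, startup introspection should stop at listing the schema, with describe extended issued only for the models being built. A sketch of the two query patterns, taken from the logs in this issue:

-- expected (and what dbt-databricks 1.5.x issues, per the logs under "Additional context"):
show tables in `dbt_jyeo`;
show views in `dbt_jyeo`;
-- ...then one introspection per selected model during the run:
describe extended `dbt_jyeo`.`foo`;

-- actual on 1.6.x: the two listings above, followed by
describe extended `dbt_jyeo`.`_tmp_0__0`;
describe extended `dbt_jyeo`.`_tmp_0__1`;
-- ...one describe extended per table in the schema, before any model runs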

Screenshots and log output

See above.

System information

The output of dbt --version:

Core:
  - installed: 1.6.2
  - latest:    1.6.2 - Up to date!

Plugins:
  - databricks: 1.6.2 - Up to date!
  - spark:      1.6.0 - Up to date!

The operating system you're using:

macOS

The output of python --version:

Python 3.9.13

pip freeze on 1.6

agate==1.7.1
alembic==1.12.0
attrs==23.1.0
Babel==2.12.1
certifi==2023.7.22
cffi==1.15.1
charset-normalizer==3.2.0
click==8.1.7
colorama==0.4.6
databricks-sdk==0.8.0
databricks-sql-connector==2.7.0
dbt-core==1.6.2
dbt-databricks==1.6.2
dbt-extractor==0.4.1
dbt-semantic-interfaces==0.2.0
dbt-spark==1.6.0
et-xmlfile==1.1.0
greenlet==2.0.2
hologram==0.0.16
idna==3.4
importlib-metadata==6.8.0
isodate==0.6.1
jaraco.classes==3.3.0
Jinja2==3.1.2
jsonschema==3.2.0
keyring==24.2.0
leather==0.3.4
Logbook==1.5.3
lz4==4.3.2
Mako==1.2.4
MarkupSafe==2.1.3
mashumaro==3.8.1
minimal-snowplow-tracker==0.0.2
more-itertools==8.14.0
msgpack==1.0.5
networkx==3.1
numpy==1.25.2
oauthlib==3.2.2
openpyxl==3.1.2
packaging==23.1
pandas==1.5.3
parsedatetime==2.6
pathspec==0.11.2
protobuf==4.24.3
pyarrow==13.0.0
pycparser==2.21
pydantic==1.10.12
pyrsistent==0.19.3
python-dateutil==2.8.2
python-slugify==8.0.1
pytimeparse==1.1.8
pytz==2023.3.post1
PyYAML==6.0.1
requests==2.31.0
six==1.16.0
SQLAlchemy==1.4.49
sqlparams==5.1.0
sqlparse==0.4.4
text-unidecode==1.3
thrift==0.16.0
typing_extensions==4.7.1
urllib3==1.26.16
zipp==3.16.2

Additional context

If we downgrade to dbt-databricks 1.5.latest and rerun:

$ dbt run
14:19:25.777034 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'start', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x104b17f70>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x105c1f3a0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x105c1ff10>]}


============================== 14:19:25.784520 | 8b592af7-eb8d-474c-8b6a-3838791c91f2 ==============================
14:19:25.784520 [info ] [MainThread]: Running with dbt=1.5.6
14:19:25.785964 [debug] [MainThread]: running dbt with arguments {'printer_width': '80', 'indirect_selection': 'eager', 'log_cache_events': 'False', 'write_json': 'True', 'partial_parse': 'False', 'cache_selected_only': 'False', 'warn_error': 'None', 'fail_fast': 'False', 'profiles_dir': '/Users/jeremy/.dbt', 'log_path': '/Users/jeremy/src/dbt-basic/logs', 'debug': 'True', 'version_check': 'True', 'use_colors': 'True', 'use_experimental_parser': 'False', 'no_print': 'None', 'quiet': 'False', 'log_format': 'default', 'introspect': 'True', 'warn_error_options': 'WarnErrorOptions(include=[], exclude=[])', 'static_parser': 'True', 'target_path': 'None', 'send_anonymous_usage_stats': 'True'}
14:19:49.461366 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'project_id', 'label': '8b592af7-eb8d-474c-8b6a-3838791c91f2', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x105ab05b0>]}
14:19:49.496752 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'adapter_info', 'label': '8b592af7-eb8d-474c-8b6a-3838791c91f2', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x117801ee0>]}
14:19:49.498118 [info ] [MainThread]: Registered adapter: databricks=1.5.5
14:19:49.542470 [debug] [MainThread]: checksum: 36b91491eb2b55e69ff526c98e337edadb953c9826242889489b2fea40332ad6, vars: {}, profile: , target: , version: 1.5.6
14:19:49.544442 [debug] [MainThread]: Partial parsing not enabled
14:19:50.959109 [debug] [MainThread]: 1699: static parser successfully parsed foo.sql
14:19:51.060864 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'load_project', 'label': '8b592af7-eb8d-474c-8b6a-3838791c91f2', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1179b8d90>]}
14:19:51.071696 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'resource_counts', 'label': '8b592af7-eb8d-474c-8b6a-3838791c91f2', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x117a1ea60>]}
14:19:51.073137 [info ] [MainThread]: Found 1 model, 0 tests, 0 snapshots, 0 analyses, 415 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
14:19:51.074581 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': '8b592af7-eb8d-474c-8b6a-3838791c91f2', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x117a1e580>]}
14:19:51.078237 [info ] [MainThread]: 
14:19:51.080705 [debug] [MainThread]: Acquiring new databricks connection 'master'
14:19:51.084346 [debug] [ThreadPool]: Acquiring new databricks connection 'list_schemas'
14:19:51.103138 [debug] [ThreadPool]: Using databricks connection "list_schemas"
14:19:51.104344 [debug] [ThreadPool]: On list_schemas: /* {"app": "dbt", "dbt_version": "1.5.6", "dbt_databricks_version": "1.5.5", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_schemas"} */

    show databases
  
14:19:51.105161 [debug] [ThreadPool]: Opening a new connection, currently in state init
14:19:52.298865 [info ] [ThreadPool]: databricks-sql-connector adapter: Successfully opened session 01ee4dee-1f1d-114e-a7bb-c3f7869bd0a5
14:19:52.980662 [debug] [ThreadPool]: SQL status: OK in 1.8700000047683716 seconds
14:19:52.999324 [debug] [ThreadPool]: On list_schemas: Close
14:19:53.000443 [info ] [ThreadPool]: databricks-sql-connector adapter: Closing session 01ee4dee-1f1d-114e-a7bb-c3f7869bd0a5
14:19:53.243535 [debug] [ThreadPool]: Re-using an available connection from the pool (formerly list_schemas, now list_None_dbt_jyeo)
14:19:53.266851 [debug] [ThreadPool]: Spark adapter: NotImplemented: add_begin_query
14:19:53.267712 [debug] [ThreadPool]: Using databricks connection "list_None_dbt_jyeo"
14:19:53.268395 [debug] [ThreadPool]: On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.5.6", "dbt_databricks_version": "1.5.5", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */
show tables in `dbt_jyeo`
  
14:19:53.269280 [debug] [ThreadPool]: Opening a new connection, currently in state closed
14:19:54.151557 [info ] [ThreadPool]: databricks-sql-connector adapter: Successfully opened session 01ee4dee-203f-1ce8-a5d0-4d7763c55c0e
14:19:54.692294 [debug] [ThreadPool]: SQL status: OK in 1.4199999570846558 seconds
14:19:54.713247 [debug] [ThreadPool]: Using databricks connection "list_None_dbt_jyeo"
14:19:54.714082 [debug] [ThreadPool]: On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.5.6", "dbt_databricks_version": "1.5.5", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */
show views in `dbt_jyeo`
  
14:19:55.139263 [debug] [ThreadPool]: SQL status: OK in 0.41999998688697815 seconds
14:19:55.156175 [debug] [ThreadPool]: On list_None_dbt_jyeo: ROLLBACK
14:19:55.157347 [debug] [ThreadPool]: Databricks adapter: NotImplemented: rollback
14:19:55.158238 [debug] [ThreadPool]: On list_None_dbt_jyeo: Close
14:19:55.158963 [info ] [ThreadPool]: databricks-sql-connector adapter: Closing session 01ee4dee-203f-1ce8-a5d0-4d7763c55c0e
14:19:55.538206 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': '8b592af7-eb8d-474c-8b6a-3838791c91f2', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x117a0b940>]}
14:19:55.539550 [debug] [MainThread]: Spark adapter: NotImplemented: add_begin_query
14:19:55.540212 [debug] [MainThread]: Spark adapter: NotImplemented: commit
14:19:55.541260 [info ] [MainThread]: Concurrency: 1 threads (target='default')
14:19:55.541859 [info ] [MainThread]: 
14:19:55.547490 [debug] [Thread-1  ]: Began running node model.my_dbt_project.foo
14:19:55.548461 [info ] [Thread-1  ]: 1 of 1 START sql table model dbt_jyeo.foo ...................................... [RUN]
14:19:55.549707 [debug] [Thread-1  ]: Re-using an available connection from the pool (formerly list_None_dbt_jyeo, now model.my_dbt_project.foo)
14:19:55.550552 [debug] [Thread-1  ]: Began compiling node model.my_dbt_project.foo
14:19:55.636659 [debug] [Thread-1  ]: Writing injected SQL for node "model.my_dbt_project.foo"
14:19:55.639315 [debug] [Thread-1  ]: Timing info for model.my_dbt_project.foo (compile): 14:19:55.551143 => 14:19:55.638801
14:19:55.640253 [debug] [Thread-1  ]: Began executing node model.my_dbt_project.foo
14:19:55.663655 [debug] [Thread-1  ]: Spark adapter: NotImplemented: add_begin_query
14:19:55.664386 [debug] [Thread-1  ]: Using databricks connection "model.my_dbt_project.foo"
14:19:55.664907 [debug] [Thread-1  ]: On model.my_dbt_project.foo: /* {"app": "dbt", "dbt_version": "1.5.6", "dbt_databricks_version": "1.5.5", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "node_id": "model.my_dbt_project.foo"} */

      describe extended `dbt_jyeo`.`foo`
  
14:19:55.665460 [debug] [Thread-1  ]: Opening a new connection, currently in state closed
14:19:56.639609 [info ] [Thread-1  ]: databricks-sql-connector adapter: Successfully opened session 01ee4dee-21ba-1c9f-84b7-092f7b334d33
14:19:57.598261 [debug] [Thread-1  ]: SQL status: OK in 1.9299999475479126 seconds
14:19:57.685121 [debug] [Thread-1  ]: Writing runtime sql for node "model.my_dbt_project.foo"
14:19:57.687009 [debug] [Thread-1  ]: Using databricks connection "model.my_dbt_project.foo"
14:19:57.688188 [debug] [Thread-1  ]: On model.my_dbt_project.foo: /* {"app": "dbt", "dbt_version": "1.5.6", "dbt_databricks_version": "1.5.5", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "node_id": "model.my_dbt_project.foo"} */

  
    
        create or replace table `dbt_jyeo`.`foo`
      
      
    using delta
      
      
      
      
      
      
      as
      select 1 id
  
14:20:00.968429 [debug] [Thread-1  ]: SQL status: OK in 3.2799999713897705 seconds
14:20:01.030909 [debug] [Thread-1  ]: Timing info for model.my_dbt_project.foo (execute): 14:19:55.640742 => 14:20:01.030153
14:20:01.032479 [debug] [Thread-1  ]: On model.my_dbt_project.foo: ROLLBACK
14:20:01.033861 [debug] [Thread-1  ]: Databricks adapter: NotImplemented: rollback
14:20:01.035172 [debug] [Thread-1  ]: On model.my_dbt_project.foo: Close
14:20:01.036711 [info ] [Thread-1  ]: databricks-sql-connector adapter: Closing session 01ee4dee-21ba-1c9f-84b7-092f7b334d33
14:20:01.263678 [debug] [Thread-1  ]: Sending event: {'category': 'dbt', 'action': 'run_model', 'label': '8b592af7-eb8d-474c-8b6a-3838791c91f2', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x117ac51f0>]}
14:20:01.264762 [info ] [Thread-1  ]: 1 of 1 OK created sql table model dbt_jyeo.foo ................................. [OK in 5.71s]
14:20:01.265670 [debug] [Thread-1  ]: Finished running node model.my_dbt_project.foo
14:20:01.267750 [debug] [MainThread]: On master: ROLLBACK
14:20:01.268933 [debug] [MainThread]: Opening a new connection, currently in state init
14:20:02.143088 [info ] [MainThread]: databricks-sql-connector adapter: Successfully opened session 01ee4dee-2503-1172-a656-9313b589d77a
14:20:02.145555 [debug] [MainThread]: Databricks adapter: NotImplemented: rollback
14:20:02.147707 [debug] [MainThread]: Spark adapter: NotImplemented: add_begin_query
14:20:02.150199 [debug] [MainThread]: Spark adapter: NotImplemented: commit
14:20:02.153210 [debug] [MainThread]: On master: ROLLBACK
14:20:02.154782 [debug] [MainThread]: Databricks adapter: NotImplemented: rollback
14:20:02.156157 [debug] [MainThread]: On master: Close
14:20:02.157623 [info ] [MainThread]: databricks-sql-connector adapter: Closing session 01ee4dee-2503-1172-a656-9313b589d77a
14:20:02.391046 [debug] [MainThread]: Connection 'master' was properly closed.
14:20:02.392693 [debug] [MainThread]: Connection 'model.my_dbt_project.foo' was properly closed.
14:20:02.396284 [info ] [MainThread]: 
14:20:02.397162 [info ] [MainThread]: Finished running 1 table model in 0 hours 0 minutes and 11.32 seconds (11.32s).
14:20:02.398807 [debug] [MainThread]: Command end result
14:20:02.413292 [info ] [MainThread]: 
14:20:02.414365 [info ] [MainThread]: Completed successfully
14:20:02.415551 [info ] [MainThread]: 
14:20:02.416888 [info ] [MainThread]: Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
14:20:02.418859 [debug] [MainThread]: Command `dbt run` succeeded at 14:20:02.418656 after 36.69 seconds
14:20:02.419643 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x104b17f70>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x117954ca0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x102b84e20>]}
14:20:02.420511 [debug] [MainThread]: Flushing usage events

Didn't have to truncate the logs this time.

^ ~37 seconds vs ~215 seconds.
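
That gap is consistent with one describe extended per table. A back-of-envelope check (a sketch using assumed round numbers taken from the timings in the logs above):

# rough numbers, not a precise measurement
tables = 300          # tables sitting in dbt_jyeo
per_describe = 0.65   # seconds; the logs show roughly 0.58-0.75s per describe extended
print(tables * per_describe)  # ~195s of pure introspection, in line with ~215s vs ~37s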

The implications here are significant. For example, some folks write all their dbt Cloud Slim CI tables to the same schema, so with many open PRs the schema fills up more and more, and having to describe extended potentially many thousands of objects before a model is even run is a huge performance hit.

pip freeze on 1.5

agate==1.7.0
alembic==1.12.0
attrs==23.1.0
Babel==2.12.1
certifi==2023.7.22
cffi==1.15.1
charset-normalizer==3.2.0
click==8.1.7
colorama==0.4.6
databricks-sdk==0.1.7
databricks-sql-connector==2.7.0
dbt-core==1.5.6
dbt-databricks==1.5.5
dbt-extractor==0.4.1
dbt-spark==1.5.2
et-xmlfile==1.1.0
future==0.18.3
greenlet==2.0.2
hologram==0.0.16
idna==3.4
importlib-metadata==6.8.0
isodate==0.6.1
jaraco.classes==3.3.0
Jinja2==3.1.2
jsonschema==4.19.0
jsonschema-specifications==2023.7.1
keyring==24.2.0
leather==0.3.4
Logbook==1.5.3
lz4==4.3.2
Mako==1.2.4
MarkupSafe==2.1.3
mashumaro==3.6
minimal-snowplow-tracker==0.0.2
more-itertools==10.1.0
msgpack==1.0.5
networkx==2.8.8
numpy==1.25.2
oauthlib==3.2.2
openpyxl==3.1.2
packaging==23.1
pandas==1.5.3
parsedatetime==2.4
pathspec==0.11.2
protobuf==4.24.3
pyarrow==13.0.0
pycparser==2.21
python-dateutil==2.8.2
python-slugify==8.0.1
pytimeparse==1.1.8
pytz==2023.3.post1
PyYAML==6.0.1
referencing==0.30.2
requests==2.28.2
rpds-py==0.10.2
six==1.16.0
SQLAlchemy==1.4.49
sqlparams==5.1.0
sqlparse==0.4.4
text-unidecode==1.3
thrift==0.16.0
typing_extensions==4.7.1
urllib3==1.26.16
Werkzeug==2.3.7
zipp==3.16.2
@jeremyyeo jeremyyeo added the bug Something isn't working label Sep 8, 2023
@jeremyyeo jeremyyeo changed the title dbt-databricks 1.6 overeager introspection (describe extended ...) of all tables in schema dbt-databricks 1.6.2 overeager introspection (describe extended ...) of all tables in schema Sep 8, 2023
jeremyyeo (Author) commented Sep 8, 2023

Fwiw, I tested dbt-databricks==1.6.1 and dbt-databricks==1.6.0 after finding #403 and both exhibited the same behaviour as dbt-databricks==1.6.2:

02:50:21  Using databricks connection "list_None_dbt_jyeo"
02:50:21  On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.1", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */

      describe extended `dbt_jyeo`.`_tmp_0__1274`
  
02:50:22  SQL status: OK in 0.6299999952316284 seconds
02:50:22  Using databricks connection "list_None_dbt_jyeo"
02:50:22  On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.1", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */

      describe extended `dbt_jyeo`.`_tmp_0__1275`
  
02:50:22  SQL status: OK in 0.6000000238418579 seconds
02:50:22  Using databricks connection "list_None_dbt_jyeo"
02:50:22  On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.1", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */

      describe extended `dbt_jyeo`.`_tmp_0__1276`
  
02:50:23  SQL status: OK in 0.6000000238418579 seconds
02:50:23  Using databricks connection "list_None_dbt_jyeo"
02:50:23  On list_None_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.2", "dbt_databricks_version": "1.6.1", "databricks_sql_connector_version": "2.7.0", "profile_name": "databricks", "target_name": "default", "connection_name": "list_None_dbt_jyeo"} */

      describe extended `dbt_jyeo`.`_tmp_0__1277`
  
^C02:50:24  SQL status: OK in 0.6000000238418579 seconds

@jeremyyeo jeremyyeo changed the title dbt-databricks 1.6.2 overeager introspection (describe extended ...) of all tables in schema dbt-databricks 1.6.X overeager introspection (describe extended ...) of all tables in schema Sep 8, 2023
benc-db (Collaborator) commented Sep 8, 2023

@susodapop I thought we fixed this...:/

benc-db (Collaborator) commented Sep 8, 2023

Thanks for reporting @jeremyyeo

dataders (Contributor) commented:

@jeremyyeo, shot in the dark: does the issue reproduce if you specify a catalog in the profile?

My interpretation of #231 is that the faster caching only happens in that case, but I have yet to see the conditional code path that impl.py takes when catalog is not provided...
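
For anyone trying that: catalog sits alongside schema in the profile. A sketch of the repro profile from above with the field added (the catalog value here is illustrative):

# ~/.dbt/profiles.yml
databricks:
  target: default
  outputs:
    default:
      type: databricks
      catalog: hive_metastore   # or a Unity Catalog name such as main
      schema: dbt_jyeo
      host: <host>.cloud.databricks.com
      token: <token>
      http_path: /sql/1.0/endpoints/<endpoint>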

benc-db (Collaborator) commented Sep 12, 2023

It appears to be a hive_metastore-specific thing. In my personal testing, I don't hit this issue if I specify some other catalog, but it reproduces when specifying hive_metastore.
