
dbt compile runs in 7+ minutes #243

Closed · dot2dotseurat opened this issue Dec 28, 2022 · 8 comments
Labels: bug, performance, Stale

@dot2dotseurat

Describe the bug


Running dbt compile takes upwards of 7 minutes, and sometimes more than 10 minutes. Project size is listed below, as reported by the dbt logs.

✗ dbt compile
15:48:46  Running with dbt=1.2.3
15:48:48  Found 537 models, 842 tests, 0 snapshots, 0 analyses, 551 macros, 0 operations, 42 seed files, 672 sources, 0 exposures, 0 metrics
15:48:48  
15:54:40  Concurrency: 8 threads (target='dev')
15:54:40  
15:55:14  Done.


Steps To Reproduce


The project is open-sourced here. The issue can be reproduced by cloning the repo, installing the pipenv included in the repo, and running dbt compile.

Expected behavior


I would expect compiling to run in close to 30 seconds, based on what I've seen for other dbt projects.


System information

The output of dbt --version:

Core:
  - installed: 1.2.3
  - latest:    1.3.1 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - databricks: 1.2.3 - Update available!
  - spark:      1.2.0 - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using:
Mac Big Sur version 11.5.2, M1 chip

The output of python --version:
Python 3.9.15


@dot2dotseurat dot2dotseurat added the bug Something isn't working label Dec 28, 2022
@ueshin (Collaborator) commented Dec 28, 2022

Hi @dot2dotseurat, thanks for the report.

I tried the project locally and it took about 4 minutes even without any tables created, and I'd guess it will be even worse with tables created.
It's spending a long time retrieving metadata from more than 200 schemas.

I suspect this is not an adapter issue, but rather how dbt-core builds its metadata cache, which requires running queries for each schema, IIUC.
cc @jtcohen6

@dot2dotseurat (Author)

Hi @ueshin,

Thank you so much for the quick reply.

As I understand it, dbt compile doesn't require a database connection to run. I just did a quick check, and dbt compile runs in the expected couple of seconds if I remove the database credentials from my profiles.yml (replacing host, token, etc. with placeholders). After restoring all of the credentials, the project again compiles in 7+ minutes.
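For reference, the placeholder check looked roughly like the following profiles.yml target (project name and all values are hypothetical placeholders, not the real credentials):

```yaml
# profiles.yml — with placeholder credentials, dbt can still parse the
# project, but anything that actually queries the warehouse will fail.
my_project:
  target: dev
  outputs:
    dev:
      type: databricks
      host: PLACEHOLDER.cloud.databricks.com
      http_path: /sql/1.0/endpoints/PLACEHOLDER
      token: PLACEHOLDER
      schema: dev
```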

So I think you are right that it's something about querying metadata, but it's odd that dbt compile can apparently run successfully without running those queries.

Anyway, I am happy to open an issue with dbt-core if @jtcohen6 agrees that's the better place for this inquiry.

Thanks again for the quick response.

@jtcohen6 (Contributor) commented Jan 5, 2023

Thanks for tagging me in @ueshin - sorry for the delay.

@dot2dotseurat This isn't quite right:

As I understand it, dbt compile doesn't require a database connection to run.

While dbt doesn't require a database connection to parse your project, or to list the resources from it, dbt compile does require a database connection, because some Jinja-SQL models may depend on introspective queries (implicitly or explicitly) in order to template out SQL. If you have any models containing macros like (e.g.) {{ is_incremental() }}, {{ run_query(...) }}, or {{ dbt_utils.get_column_values(...) }} — all of those require running queries first, or accessing the relation cache to understand which objects do or don't exist in the database, before dbt can produce the compiled SQL for that model. That remains true even though dbt is not actually running that model / its compiled SQL.
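As a minimal illustration (the model and table names are hypothetical), a model like the following can't be compiled offline: run_query() executes SQL at compile time, and is_incremental() consults the relation cache to learn whether the target table already exists:

```sql
-- models/events_rollup.sql (hypothetical example)
{% set statuses = run_query("select distinct status from raw.events") %}

select *
from {{ ref('stg_events') }}
{% if is_incremental() %}
  -- requires knowing whether {{ this }} already exists in the warehouse
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```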

At the same time, I understand that cache population is a time-consuming step, especially on Spark/Databricks today. The good news is that #231 should (hopefully) speed up cache population significantly; I'm guessing that change will be landing in the next release of dbt-databricks (v1.4).

I also just opened a dbt-core issue, proposing an optional config to skip cache population entirely (dbt-labs/dbt-core#6526), which may be faster or slower depending on the command, environment, and selection criteria.

@ueshin (Collaborator) commented Jan 5, 2023

@jtcohen6 Thanks for the reply.

The good news is that #231 should (hopefully) speed up cache population significantly

Unfortunately, #231 won't help this case. When building the cache, dbt-core asks adapters to run list_relations_without_caching for each schema. If the project defines many schemas, the method will still be called once per schema.
I guess it's the same with dbt-snowflake, dbt-bigquery, and other adapters?
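The per-schema pattern described above can be sketched roughly as follows. This is a simplified illustration, not dbt-core's actual code: the adapter method name matches dbt's real hook, but FakeAdapter, populate_cache, and the schema list are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

class FakeAdapter:
    """Stand-in for a dbt adapter; illustrative only."""

    def list_relations_without_caching(self, schema):
        # A real adapter would run a metadata query here,
        # e.g. `show tables in <schema>` on Spark/Databricks.
        return [f"{schema}.table_{i}" for i in range(2)]

def populate_cache(adapter, schemas, threads=8):
    # One metadata call per schema, parallelized up to `threads` workers —
    # so 200+ schemas means 200+ calls regardless of how fast each one is.
    cache = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(adapter.list_relations_without_caching, schemas)
        for schema, relations in zip(schemas, results):
            cache[schema] = relations
    return cache

cache = populate_cache(FakeAdapter(), [f"schema_{i}" for i in range(200)])
print(len(cache))  # one metadata call per schema -> 200 entries
```

Even with #231 making each call cheaper, the call count itself scales with the number of schemas, which is the bottleneck ueshin describes.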

@jtcohen6 (Contributor) commented Jan 9, 2023

When building the cache, dbt-core asks adapters to run list_relations_without_caching for each schema. If the project defines many schemas, the method will still be called once per schema.

This is true; the overall mechanism remains the same. The important difference is speed. In our testing several months ago, in the issues & draft PRs that eventually led to #231, we found that show tables in [schema] + show views in [schema] is significantly faster than show table extended in [schema] like '*'. Also, dbt will parallelize the queries for different schemas, up to the configured number of --threads.
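Concretely, the two approaches compared above look like this per schema (the schema name is a placeholder):

```sql
-- Pre-#231: one heavyweight metadata query per schema
SHOW TABLE EXTENDED IN my_schema LIKE '*';

-- With #231: two lightweight queries per schema
SHOW TABLES IN my_schema;
SHOW VIEWS IN my_schema;
```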

I guess it's the same with dbt-snowflake, dbt-bigquery, or other adapters?

Correct, it's the same. On other adapters, the biggest difference is that the metadata queries powering cache population are quite fast. On dbt-bigquery, we skip running a query entirely, and just make an API call.

Even then, we do run into performance issues at serious scale (e.g. dbt-labs/dbt-snowflake#83, dbt-labs/dbt-bigquery#205). Hence the interest in allowing users to either do partial caching (supported as an experimental config), or skip cache population entirely (new issue/proposal I linked above).
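For the partial-caching option mentioned above, the experimental config can be set in profiles.yml (a sketch, assuming dbt-core v1.1+ where cache_selected_only was introduced; check the dbt docs for your version):

```yaml
# profiles.yml — experimental: only cache schemas for selected resources
config:
  cache_selected_only: True
```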

@ueshin (Collaborator) commented Jan 10, 2023

@dot2dotseurat Is it possible to upgrade dbt to 1.3?
If possible, could you try dbt-databricks 1.3.3rc1 that contains #231 to see how much it can improve?

@dot2dotseurat (Author)

@ueshin I gave it a shot and I didn't see an improvement. :/

@github-actions

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue.
