Evaluate schema discovery queries that happen at startup #2127

Closed
cmcarthur opened this issue Feb 12, 2020 · 1 comment · Fixed by #2157

cmcarthur commented Feb 12, 2020

dbt runs a bunch of introspective queries at startup:

  • finding the schemas that exist in the database
  • finding the tables in those schemas
  • creating schemas if they don't exist

On some databases (postgres/redshift), these information schema queries are pretty quick. On other databases (snowflake, spark), these queries can (and do) run for tens of seconds! This means that the typical flow for running dbt looks like:

$ dbt run
<wait a not-insignificant number of seconds>
12:00:00 | Running with 8 threads
....

These queries are unavoidable. We experimented with using faster alternatives to the information schema (#1877) and ultimately, we must reckon with the reality that hitting the information schema can be slow. The next best thing that we can do is understand and optimize when these queries happen in order to provide the best possible experience for users running dbt.
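
For concreteness, here is a rough sketch of the three kinds of introspective queries described above, written against the Snowflake Python connector. This is illustrative only, not dbt's actual code; the connection parameters and the target schema name ('analytics') are placeholders.

```python
# Illustrative sketch of dbt-style startup introspection (not dbt's actual code).
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password"  # placeholders
)
cur = conn.cursor()

# 1. Which schemas already exist in the target database?
cur.execute("select schema_name from information_schema.schemata")
existing_schemas = {row[0].lower() for row in cur.fetchall()}

# 2. Which relations live in the schemas dbt will build into? (fills the relation cache)
cur.execute(
    "select table_schema, table_name, table_type "
    "from information_schema.tables "
    "where lower(table_schema) = 'analytics'"
)
relations = cur.fetchall()

# 3. Create the target schema only if it is missing.
if "analytics" not in existing_schemas:
    cur.execute("create schema if not exists analytics")
```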

drewbanin changed the title from "Deeply evaluate schema discovery queries that happen at startup, decide which ones don't need to happen" to "Evaluate schema discovery queries that happen at startup" on Feb 20, 2020
beckjake (Contributor) commented:

There are really only two metadata queries that occur before a run starts on snowflake:

  • we list all the schemas from the information schema. We do this to avoid costly unnecessary 'create schema if not exists ...' calls.
    • One alternative here is to always try to create the schema if it doesn't exist. Very unscientifically: against DBT_TEST, the list_schemas call took 2.21s. A create schema if not exists for an existing schema took 2.24s. If that's at least close to normal, changing this will hurt performance in the >1 schemas case.
    • We could also try to use show schemas in {database}. The downsides are:
      1. You have to account for the >10k schemas case and fall back (probably fine!)
      2. You have to account for the database case sensitivity, likely with an extra show databases call
      3. It seems like you have to use table(result_scan(last_query_id())) or similar to get useful data out of show commands, which doesn't seem substantially faster than querying information_schema. Maybe I'm way off here; if I am, this might be worth doing (a rough sketch of this pattern follows this list).
    • We could parallelize this, using one thread per database (see the threadpool sketch after this list).
  • We fill the cache with information about relations, by querying information_schema.tables. This area is probably the only hope for improving performance on this front.
    • Previous work on making catalogs faster didn't seem to indicate that describe or show offered any significant performance improvement over querying the information_schema, and they have similar limits around 10k rows. I'm happy to create a branch that does it this way so we can try it out on some large snowflake environments and see what happens.
    • We could do this up front, but parallelize this part across all the databases we're querying. This will probably get slower in the one-db case (paying the threading overhead costs) and faster everywhere else.
    • Now that we fully lazy-load our connections, it's possible that we could lazy-fill the cache on a schema-by-schema basis for non-postgres databases (postgres/redshift need to create the metadata links for drop ... cascades into views, so I think they'd still have to do this up-front). I don't think that's the worst idea ever, though there are thread-coordination issues we'd have to consider; bottlenecking there could easily tank performance. This idea is the most likely to cause weird, annoying deadlock-y bugs and hard-to-predict performance behavior. The graph queue would probably have to get smarter about ordering and schedule tasks in the same database more carefully.
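
To make the show schemas idea above concrete (point 3 of its downsides in particular), here is a rough sketch of the result_scan pattern against a placeholder DBT_TEST database. It's a sketch, not a drop-in replacement for what dbt does today.

```python
import snowflake.connector

# A snowflake-connector-python connection; credentials are placeholders.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password"
)
cur = conn.cursor()

# SHOW commands return at most ~10k rows, so a fallback path would still be needed.
cur.execute("show schemas in database DBT_TEST")

# To reuse the result in SQL, query it back via result_scan; SHOW output columns
# are lower-case identifiers, hence the quoted "name".
cur.execute('select "name" from table(result_scan(last_query_id()))')
schemas = [row[0] for row in cur.fetchall()]
```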
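And a rough sketch of the per-database parallelism floated above (one thread per database, for listing schemas and for filling the cache), in the spirit of the threadpool change that eventually closed this issue via #2157. The list_relations callable is a hypothetical stand-in for the adapter's metadata query.

```python
from concurrent.futures import ThreadPoolExecutor

def populate_cache(databases, list_relations):
    """Query each database's metadata on its own thread and merge the results."""
    cache = {}
    with ThreadPoolExecutor(max_workers=len(databases) or 1) as pool:
        # One task per database: run its slow information_schema query concurrently.
        futures = {pool.submit(list_relations, db): db for db in databases}
        for future, db in futures.items():
            cache[db] = future.result()  # blocks; re-raises any worker exception
    return cache

# Usage (list_relations is whatever callable runs the metadata query for one database):
# cache = populate_cache(["analytics", "raw"], list_relations=my_adapter_query)
```

As noted above, this pays a small threading overhead in the one-database case and should win everywhere else.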

dbt also makes a number of unnecessary BEGIN and COMMIT calls, each of which takes ~0.5s on my machine. We can save a couple of seconds by skipping those; there seems to be no good reason for these queries to run in a transaction, beyond dbt's innate eagerness to wrap everything in transactions.
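
A minimal sketch of what skipping that wrapper could look like; run_metadata_query is a hypothetical helper, and the explicit begin/commit statements stand in for the round trips described above.

```python
def run_metadata_query(cursor, sql, wrap_in_transaction=False):
    """Run one introspective query, optionally wrapped in an explicit transaction."""
    if wrap_in_transaction:
        cursor.execute("begin")     # extra ~0.5s round trip
    cursor.execute(sql)
    rows = cursor.fetchall()
    if wrap_in_transaction:
        cursor.execute("commit")    # another ~0.5s round trip
    return rows

# Startup introspection is read-only, so the transaction wrapper can simply be skipped:
# schemas = run_metadata_query(cur, "select schema_name from information_schema.schemata")
```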

beckjake added a commit that referenced this issue on Feb 27, 2020:

Use threadpools for filling the cache and listing schemas (#2127)