-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cache schema for selected models #4860
cache schema for selected models #4860
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for submitting the PR!! Added some questions and suggestions(some of them are more like making the codebase cleaner since we are here).
One other thing we should add is some tests to test the expected behavior. I understand this is might be tricky in terms of what use cases we want to test and how to properly test those. And I am not sure if we have an existing test for this kind of stuff. But would love to hear your thoughts and work with you on it.
core/dbt/task/runnable.py
Outdated
def populate_adapter_cache(self, adapter): | ||
adapter.set_relations_cache(self.manifest) | ||
def populate_adapter_cache(self, adapter, required_schemas: Set[BaseRelation] = None): | ||
if flags.SELECTED_SCHEMA_CACHE is True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like we can just remove this function and call adapter.set_relations_cache
in places that call populate_adapter_cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
populate_adapter_cache
here in GraphRunnableTask
has been inherited by RunTask
where it is mostly used. Moving this logic to before_run
function would duplicate the same logic in before_run
function in GraphRunnableTask
and RunTask
class.
If there is room to reorganize stuffs around, might make sense to do by someone more experienced with these code then me. This is the first time I am touching dbt-core :-)
core/dbt/adapters/base/impl.py
Outdated
@@ -337,11 +337,12 @@ def _get_catalog_schemas(self, manifest: Manifest) -> SchemaSearchMap: | |||
# databases | |||
return info_schema_name_map | |||
|
|||
def _relations_cache_for_schemas(self, manifest: Manifest) -> None: | |||
def _relations_cache_for_schemas(self, manifest: Manifest, cache_schemas: Set[BaseRelation] = None) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are here, let's just remove this function and move all of the logic into set_relations_cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because these methods are defined in the adapters
module, these are potentially breaking changes to the adapter interface. E.g. the Postgres plugin reimplements this method (currently with two arguments), and would require updating:
dbt-core/plugins/postgres/dbt/adapters/postgres/impl.py
Lines 129 to 131 in 5e0a765
def _relations_cache_for_schemas(self, manifest): | |
super()._relations_cache_for_schemas(manifest) | |
self._link_cached_relations(manifest) |
@@ -436,8 +436,9 @@ def defer_to_manifest(self, adapter, selected_uids: AbstractSet[str]): | |||
|
|||
def before_run(self, adapter, selected_uids: AbstractSet[str]): | |||
with adapter.connection_named("master"): | |||
self.create_schemas(adapter, selected_uids) | |||
self.populate_adapter_cache(adapter) | |||
required_schemas = self.get_model_schemas(adapter, selected_uids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtcohen6 get_model_schemas
would remove all of the node that is not in selected_uids
, is there a case we need adapter cache to contain more nodes than selected_nodes
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ChenyuLInx I'm not sure! That's the critical question animating in this issue/PR. I will say, having this PR makes it a lot easier to test.
- Will dbt return incorrect results, if it fails to cache an unselected relation? I don't think so, as long as caching happens consistently at the schema level. (That's an open question for us right now on Spark: [CT-302] [Spike] Benchmark perf for
show tables
,show views
dbt-spark#296) - How will this work for defer? When
--defer
is enabled, it's necessary to calladapter.get_relation
for all unselected parents of selected models, to figure out whether to use a "dev" version or defer to the "prod" version. In my local testing, this still works! dbt will just record a cache miss, and run the query independently:
08:35:23.500589 [debug] [MainThread]: On "master": cache miss for schema "jerco.dbt_jcohen_schema_a", this is inefficient
It's likely that this will be less efficient. So, I think it's important that the behavior remains configurable. I'm not opposed to moving forward with this, in order to test the waters. We may someday wish to turn it on by default.
@@ -1084,6 +1084,27 @@ def parse_args(args, cls=DBTArgumentParser): | |||
""", | |||
) | |||
|
|||
schema_cache_flag = p.add_mutually_exclusive_group() | |||
schema_cache_flag.add_argument( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fairly new to all of the command arguments, would like to hear why you chose to add two arguments that would change the same dest
vs the default behavior is not use this feature and a just one flag to enable the feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One could set selected_schema_cache: true
in profile.yml as general config for their project but may choose to override that with --no-selected-schema-cache
in cli for specific scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right on! The inheritance order for "global" configs is CLI flag > env var (DBT_{config}
) > profile config
I would vote for naming this config in a way that's slightly more generic (not every database calls it "schema," not every database caches at the schema level). Perhaps CACHE_SELECTED_ONLY
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@karunpoudel Very cool! Thanks so much for taking the idea and running with it. I left a few comments
core/dbt/adapters/base/impl.py
Outdated
@@ -337,11 +337,12 @@ def _get_catalog_schemas(self, manifest: Manifest) -> SchemaSearchMap: | |||
# databases | |||
return info_schema_name_map | |||
|
|||
def _relations_cache_for_schemas(self, manifest: Manifest) -> None: | |||
def _relations_cache_for_schemas(self, manifest: Manifest, cache_schemas: Set[BaseRelation] = None) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because these methods are defined in the adapters
module, these are potentially breaking changes to the adapter interface. E.g. the Postgres plugin reimplements this method (currently with two arguments), and would require updating:
dbt-core/plugins/postgres/dbt/adapters/postgres/impl.py
Lines 129 to 131 in 5e0a765
def _relations_cache_for_schemas(self, manifest): | |
super()._relations_cache_for_schemas(manifest) | |
self._link_cached_relations(manifest) |
@@ -1084,6 +1084,27 @@ def parse_args(args, cls=DBTArgumentParser): | |||
""", | |||
) | |||
|
|||
schema_cache_flag = p.add_mutually_exclusive_group() | |||
schema_cache_flag.add_argument( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right on! The inheritance order for "global" configs is CLI flag > env var (DBT_{config}
) > profile config
I would vote for naming this config in a way that's slightly more generic (not every database calls it "schema," not every database caches at the schema level). Perhaps CACHE_SELECTED_ONLY
?
@@ -436,8 +436,9 @@ def defer_to_manifest(self, adapter, selected_uids: AbstractSet[str]): | |||
|
|||
def before_run(self, adapter, selected_uids: AbstractSet[str]): | |||
with adapter.connection_named("master"): | |||
self.create_schemas(adapter, selected_uids) | |||
self.populate_adapter_cache(adapter) | |||
required_schemas = self.get_model_schemas(adapter, selected_uids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ChenyuLInx I'm not sure! That's the critical question animating in this issue/PR. I will say, having this PR makes it a lot easier to test.
- Will dbt return incorrect results, if it fails to cache an unselected relation? I don't think so, as long as caching happens consistently at the schema level. (That's an open question for us right now on Spark: [CT-302] [Spike] Benchmark perf for
show tables
,show views
dbt-spark#296) - How will this work for defer? When
--defer
is enabled, it's necessary to calladapter.get_relation
for all unselected parents of selected models, to figure out whether to use a "dev" version or defer to the "prod" version. In my local testing, this still works! dbt will just record a cache miss, and run the query independently:
08:35:23.500589 [debug] [MainThread]: On "master": cache miss for schema "jerco.dbt_jcohen_schema_a", this is inefficient
It's likely that this will be less efficient. So, I think it's important that the behavior remains configurable. I'm not opposed to moving forward with this, in order to test the waters. We may someday wish to turn it on by default.
rename flag to cache_selected_only, update postgres adapter: function _relations_cache_for_schemas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Thanks @karunpoudel for making it happen! We added some pre-commit hook to format the code. I think if you follow the instruction here and push the changes then the failed GHA will pass.
@jtcohen6, @ChenyuLInx , I am thinking of adding test to check cached relation after supplying this new flag. Do you have existing integration test to verify that caching is working properly for project with multiple database/schema. I can add a new test to that. |
|
I'd be happy to have this change as is, as an optional/experimental flag, included in v1.1.0-rc1. We already have a follow-on issue open to add more rigorous testing and tracking (#4961), ahead of this being something we'd want to turn on by default. |
@@ -1088,6 +1088,27 @@ def parse_args(args, cls=DBTArgumentParser): | |||
""", | |||
) | |||
|
|||
schema_cache_flag = p.add_mutually_exclusive_group() | |||
schema_cache_flag.add_argument( | |||
"--cache_selected_only", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I just caught: for consistency with other flags, this should probably be --cache-selected-only
(hyphens not underscores)
Rebased + merged in #5036 |
resolves #4688
Description
Cache schema for selected models only instead of caching all schemas in a project when
--cache_selected_only
option is provided in cli or config. This will help improve startup time when a project has many databases and schema but only few models are selected.This behavior is similar to how schemas are created currently. Dbt only creates schema (if they don't exists) for selected models only rather than all schema defined in a project.
Checklist