cache schema for selected models #4860

karunpoudel · 2022-03-13T00:20:01Z

resolves #4688

Description

When the --cache_selected_only option is provided on the CLI or in config, cache schemas only for the selected models instead of caching every schema in the project. This will help improve startup time when a project has many databases and schemas but only a few models are selected.

This behavior mirrors how schemas are currently created: dbt creates schemas (if they don't already exist) only for the selected models, rather than for every schema defined in the project.
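
A minimal sketch of the idea described above, assuming a simplified manifest in which each node carries a database and a schema. `Node`, `schemas_for_selected`, and the sample data are hypothetical stand-ins for illustration, not dbt-core's actual classes:

```python
from typing import Dict, NamedTuple, Set, Tuple


class Node(NamedTuple):
    """Hypothetical stand-in for a manifest node."""
    database: str
    schema: str


def schemas_for_selected(
    nodes: Dict[str, Node], selected_uids: Set[str]
) -> Set[Tuple[str, str]]:
    """Return the distinct (database, schema) pairs used by the selected nodes."""
    return {
        (node.database, node.schema)
        for uid, node in nodes.items()
        if uid in selected_uids
    }


nodes = {
    "model.proj.a": Node("db1", "schema_a"),
    "model.proj.b": Node("db1", "schema_b"),
    "model.proj.c": Node("db2", "schema_c"),
}

# With one model selected, only its schema would need to be cached.
required = schemas_for_selected(nodes, {"model.proj.a"})
print(required)
```

With all three models selected, the same function would return all three pairs, which matches the existing cache-everything behavior for that project.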

Checklist

I have signed the CLA
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
I have added information about my change to be included in the CHANGELOG.

ChenyuLInx

Thanks for submitting the PR! I added some questions and suggestions (some of them are more about making the codebase cleaner while we are here).
One other thing we should add is tests for the expected behavior. I understand this might be tricky, both in terms of which use cases we want to test and how to test them properly, and I am not sure we have an existing test for this kind of thing. But I would love to hear your thoughts and work with you on it.

ChenyuLInx · 2022-03-15T22:15:09Z

core/dbt/task/runnable.py

-    def populate_adapter_cache(self, adapter):
-        adapter.set_relations_cache(self.manifest)
+    def populate_adapter_cache(self, adapter, required_schemas: Set[BaseRelation] = None):
+        if flags.SELECTED_SCHEMA_CACHE is True:


Feels like we can just remove this function and call adapter.set_relations_cache in the places that call populate_adapter_cache.

populate_adapter_cache, defined here in GraphRunnableTask, is inherited by RunTask, where it is mostly used. Moving this logic into before_run would duplicate it across the before_run functions of both GraphRunnableTask and RunTask.
If there is room to reorganize things, it might make sense for that to be done by someone more experienced with this code than me. This is the first time I am touching dbt-core :-)

ChenyuLInx · 2022-03-15T22:17:49Z

core/dbt/adapters/base/impl.py

@@ -337,11 +337,12 @@ def _get_catalog_schemas(self, manifest: Manifest) -> SchemaSearchMap:
        # databases
        return info_schema_name_map

-    def _relations_cache_for_schemas(self, manifest: Manifest) -> None:
+    def _relations_cache_for_schemas(self, manifest: Manifest, cache_schemas: Set[BaseRelation] = None) -> None:


Since we are here, let's just remove this function and move all of the logic into set_relations_cache

Because these methods are defined in the adapters module, these are potentially breaking changes to the adapter interface. E.g. the Postgres plugin reimplements this method (currently with two arguments), and would require updating:

dbt-core/plugins/postgres/dbt/adapters/postgres/impl.py

Lines 129 to 131 in 5e0a765

    def _relations_cache_for_schemas(self, manifest):
        super()._relations_cache_for_schemas(manifest)
        self._link_cached_relations(manifest)
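
To illustrate why widening the signature can break plugins, here is a toy sketch. The class names `BaseAdapter`, `OldPlugin`, and `UpdatedPlugin` are hypothetical, and the method bodies are placeholders rather than dbt's implementation. The only point is that an override written against the old signature fails once the base class starts passing the extra argument:

```python
from typing import Optional, Set


class BaseAdapter:
    """Hypothetical base adapter with the widened hook signature."""

    def set_relations_cache(self, manifest, required_schemas: Optional[Set[str]] = None):
        # The base class now forwards the extra argument to the hook.
        self._relations_cache_for_schemas(manifest, required_schemas)

    def _relations_cache_for_schemas(self, manifest, cache_schemas: Optional[Set[str]] = None):
        # Placeholder logic: cache the given schemas, or everything in the manifest.
        self.cached = cache_schemas if cache_schemas is not None else set(manifest)


class OldPlugin(BaseAdapter):
    # Override written against the old signature: no cache_schemas parameter.
    def _relations_cache_for_schemas(self, manifest):
        super()._relations_cache_for_schemas(manifest)


class UpdatedPlugin(BaseAdapter):
    # Override updated to accept and pass through the new argument.
    def _relations_cache_for_schemas(self, manifest, cache_schemas=None):
        super()._relations_cache_for_schemas(manifest, cache_schemas)


manifest = ["schema_a", "schema_b"]

try:
    OldPlugin().set_relations_cache(manifest, {"schema_a"})
    old_plugin_error = None
except TypeError as exc:
    # The old override cannot accept the second positional argument.
    old_plugin_error = exc

updated = UpdatedPlugin()
updated.set_relations_cache(manifest, {"schema_a"})
print(type(old_plugin_error).__name__, updated.cached)
```

Giving the new parameter a default keeps direct callers of the base method working, but it does not protect subclasses that override the hook, which is why the plugin needs updating too.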

ChenyuLInx · 2022-03-15T22:27:41Z

core/dbt/task/run.py

@@ -436,8 +436,9 @@ def defer_to_manifest(self, adapter, selected_uids: AbstractSet[str]):

    def before_run(self, adapter, selected_uids: AbstractSet[str]):
        with adapter.connection_named("master"):
-            self.create_schemas(adapter, selected_uids)
-            self.populate_adapter_cache(adapter)
+            required_schemas = self.get_model_schemas(adapter, selected_uids)


@jtcohen6 get_model_schemas would drop every node that is not in selected_uids. Is there a case where we need the adapter cache to contain more nodes than those in selected_uids?

@ChenyuLInx I'm not sure! That's the critical question animating this issue/PR. I will say, having this PR makes it a lot easier to test.

Will dbt return incorrect results, if it fails to cache an unselected relation? I don't think so, as long as caching happens consistently at the schema level. (That's an open question for us right now on Spark: [CT-302] [Spike] Benchmark perf for show tables, show views dbt-spark#296)

How will this work for defer? When --defer is enabled, it's necessary to call adapter.get_relation for all unselected parents of selected models, to figure out whether to use a "dev" version or defer to the "prod" version. In my local testing, this still works! dbt will just record a cache miss, and run the query independently:

08:35:23.500589 [debug] [MainThread]: On "master": cache miss for schema "jerco.dbt_jcohen_schema_a", this is inefficient

It's likely that this will be less efficient. So, I think it's important that the behavior remains configurable. I'm not opposed to moving forward with this, in order to test the waters. We may someday wish to turn it on by default.
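
The cache-miss fallback described above can be pictured with a toy cache. This is a sketch under stated assumptions: `SchemaCache`, its methods, and the `warehouse` dictionary are hypothetical and do not reflect dbt's real cache classes. A lookup in an uncached schema still gives the right answer, it just costs an extra query:

```python
from typing import Callable, Dict, List, Set


class SchemaCache:
    """Toy relation cache keyed by schema name (not dbt's real cache)."""

    def __init__(self, list_relations: Callable[[str], List[str]]):
        self._list_relations = list_relations  # stand-in for a database query
        self._by_schema: Dict[str, Set[str]] = {}
        self.misses: List[str] = []

    def populate(self, schemas: Set[str]) -> None:
        """Cache the relations of the given schemas up front."""
        for schema in schemas:
            self._by_schema[schema] = set(self._list_relations(schema))

    def get_relation(self, schema: str, name: str) -> bool:
        """Report whether a relation exists, querying directly on a cache miss."""
        if schema not in self._by_schema:
            # Cache miss: slower, but the answer is still correct.
            self.misses.append(schema)
            return name in self._list_relations(schema)
        return name in self._by_schema[schema]


# The selected model lives in dev_b; its unselected parent would live in dev_a.
warehouse = {"dev_a": [], "dev_b": ["child"]}
cache = SchemaCache(lambda schema: warehouse.get(schema, []))
cache.populate({"dev_b"})  # only the selected model's schema is cached

child_exists = cache.get_relation("dev_b", "child")         # served from cache
parent_in_dev = cache.get_relation("dev_a", "parent")       # miss, then a query
print(child_exists, parent_in_dev, cache.misses)
```

In this toy setup the parent is not found in the dev schema, which is the situation where deferral would fall back to the "prod" version.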

ChenyuLInx · 2022-03-15T22:30:42Z

core/dbt/main.py

@@ -1084,6 +1084,27 @@ def parse_args(args, cls=DBTArgumentParser):
        """,
    )

+    schema_cache_flag = p.add_mutually_exclusive_group()
+    schema_cache_flag.add_argument(


I am fairly new to the command-line arguments, so I would like to hear why you chose to add two arguments that set the same dest, rather than leaving the feature off by default and adding a single flag to enable it.

One could set selected_schema_cache: true in profiles.yml as the general config for their project, but choose to override it with --no-selected-schema-cache on the CLI for a specific scenario.

Right on! The inheritance order for "global" configs is CLI flag > env var (DBT_{config}) > profile config

I would vote for naming this config in a way that's slightly more generic (not every database calls it "schema," not every database caches at the schema level). Perhaps CACHE_SELECTED_ONLY?
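
The precedence order stated above (CLI flag > env var > profile config) can be sketched as a small resolver. `resolve_flag` and its parameters are hypothetical names for illustration, and the set of strings treated as truthy is an assumption rather than dbt's actual parsing:

```python
from typing import Mapping, Optional


def resolve_flag(
    name: str,
    cli_value: Optional[bool],
    env: Mapping[str, str],
    profile_config: Mapping[str, bool],
    default: bool = False,
) -> bool:
    """Resolve a boolean config: CLI flag > env var (DBT_<NAME>) > profile config."""
    if cli_value is not None:
        return cli_value
    env_value = env.get(f"DBT_{name.upper()}")
    if env_value is not None:
        # Assumption: a small set of truthy strings; real parsing may differ.
        return env_value.strip().lower() in ("1", "true", "t", "yes", "y")
    return profile_config.get(name, default)


profile = {"cache_selected_only": True}

# Profile config applies when nothing else is set.
from_profile = resolve_flag("cache_selected_only", None, {}, profile)
# An env var overrides the profile.
from_env = resolve_flag(
    "cache_selected_only", None, {"DBT_CACHE_SELECTED_ONLY": "false"}, profile
)
# A CLI flag overrides both.
from_cli = resolve_flag(
    "cache_selected_only", True, {"DBT_CACHE_SELECTED_ONLY": "false"}, profile
)
print(from_profile, from_env, from_cli)
```

Using `None` for "not given on the CLI" is what makes a paired negative flag useful: it lets an explicit false on the command line be distinguished from the flag being absent.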

jtcohen6

@karunpoudel Very cool! Thanks so much for taking the idea and running with it. I left a few comments

jtcohen6 · 2022-03-18T08:27:39Z

core/dbt/adapters/base/impl.py

jtcohen6 · 2022-03-18T08:29:03Z

core/dbt/main.py

jtcohen6 · 2022-03-18T08:41:25Z

core/dbt/task/run.py


ChenyuLInx

Looks good to me! Thanks @karunpoudel for making it happen! We added some pre-commit hooks to format the code. I think if you follow the instructions here and push the changes, the failing GHA checks will pass.

karunpoudel · 2022-04-04T18:23:23Z

@jtcohen6, @ChenyuLInx, I am thinking of adding a test that checks the cached relations after supplying this new flag. Do you have an existing integration test that verifies caching works properly for a project with multiple databases/schemas? I could add a new test to that.

ChenyuLInx · 2022-04-05T23:01:31Z

verify that caching is working properly for project with multiple database/schema
@karunpoudel I don't think we do. Thanks for adding those! Let me know if there's anything I can help with! We haven't made the announcement yet, but we are switching to a new testing framework; you can find examples here, and here are some initial docs. It is not fully done, so if you feel more comfortable using the original test framework, that also works for me. But I highly recommend checking out the new one.

jtcohen6 · 2022-04-11T09:05:51Z

I'd be happy to have this change as is, as an optional/experimental flag, included in v1.1.0-rc1. We already have a follow-on issue open to add more rigorous testing and tracking (#4961), ahead of this being something we'd want to turn on by default.

jtcohen6 · 2022-04-11T16:54:52Z

core/dbt/main.py

@@ -1088,6 +1088,27 @@ def parse_args(args, cls=DBTArgumentParser):
        """,
    )

+    schema_cache_flag = p.add_mutually_exclusive_group()
+    schema_cache_flag.add_argument(
+        "--cache_selected_only",


One thing I just caught: for consistency with other flags, this should probably be --cache-selected-only (hyphens not underscores)
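
For reference, a standalone argparse sketch of a hyphenated on/off flag pair in a mutually exclusive group. The `--no-cache-selected-only` spelling and the use of `store_const` are assumptions for illustration, not necessarily what the PR's diff contains. The two flags share a `dest` and stay `None` when neither is given:

```python
import argparse

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
group.add_argument(
    "--cache-selected-only",
    action="store_const",
    const=True,
    default=None,
    dest="cache_selected_only",
)
group.add_argument(
    "--no-cache-selected-only",
    action="store_const",
    const=False,
    dest="cache_selected_only",
)

enabled = parser.parse_args(["--cache-selected-only"])
disabled = parser.parse_args(["--no-cache-selected-only"])
unset = parser.parse_args([])
print(enabled.cache_selected_only, disabled.cache_selected_only, unset.cache_selected_only)
```

The hyphens only affect the command-line spelling; the parsed attribute is still `cache_selected_only`. Passing both flags together makes argparse exit with a usage error.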

jtcohen6 · 2022-04-12T16:04:56Z

Rebased + merged in #5036

cache schema for selected models

54f69f8

karunpoudel requested a review from a team as a code owner March 13, 2022 00:20

karunpoudel requested a review from a team March 13, 2022 00:20

karunpoudel requested review from a team as code owners March 13, 2022 00:20

cla-bot bot added the cla:yes label Mar 13, 2022

jtcohen6 added the Team:Execution label Mar 15, 2022

ChenyuLInx reviewed Mar 15, 2022

Create Features-20220316-003847.yaml

7a89c54

jtcohen6 reviewed Mar 18, 2022

karunpoudel and others added 4 commits March 20, 2022 02:08

rename flag, update postgres adapter

b56534f

rename flag to cache_selected_only, update postgres adapter: function _relations_cache_for_schemas

Update Features-20220316-003847.yaml

b710ea9

Merge remote-tracking branch 'upstream/main' into cache-schema-for-se…

5613ec6

…lected-models

added test for cache_selected_only flag

345adec

ChenyuLInx mentioned this pull request Mar 25, 2022

[CT-429] Add tracking and more tests for cache only selected nodes #4961

Closed

ChenyuLInx approved these changes Mar 25, 2022

formatted as per pre-commit

9885b5a

jtcohen6 mentioned this pull request Mar 26, 2022

New global config: cache_selected_only dbt-labs/docs.getdbt.com#1273

Closed

1 task

jtcohen6 reviewed Apr 11, 2022

jtcohen6 mentioned this pull request Apr 12, 2022

Add experimental cache_selected_only config #5036

Merged

4 tasks

jtcohen6 closed this Apr 12, 2022
