
Dramatically faster caching #433

Closed
wants to merge 10 commits into from

Conversation

jtcohen6
Contributor

Rebase of #342

Description

Eventually, we may want to investigate the feasibility of column-level cache (in)validation. For now, let's just stick to the core behavior. Columns will still be cached, but get_columns_in_relation will skip over the cached values and run a describe query instead.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have run changie new to create a changelog entry

@cla-bot cla-bot bot added the cla:yes label Aug 19, 2022
@jtcohen6 jtcohen6 changed the title Jerco/pr 342 run tests Dramatically faster caching Aug 19, 2022
from hologram import JsonDict

Self = TypeVar("Self", bound="SparkColumn")


@dataclass
-class SparkColumn(dbtClassMixin, Column):
+class SparkColumn(FakeAPIObject, Column):
jtcohen6 (Contributor, Author)

This enables SparkColumn to be validated as a field on SparkRelation
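For illustration, a hypothetical sketch (not the exact dbt classes) of what that enables: once the column type is validatable, the relation can carry its columns as a typed field.

```python
# Hypothetical sketch, not the PR's exact classes: a validatable SparkColumn
# can be stored directly as a field on the relation and round-tripped
# through serialization/validation along with it.
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SparkColumn:
    column: str
    dtype: str


@dataclass
class SparkRelation:
    schema: str
    identifier: str
    # Columns travel with the relation, which is what lets them live in the
    # relations cache alongside the rest of the relation's metadata.
    columns: Dict[str, SparkColumn] = field(default_factory=dict)
    is_delta: Optional[bool] = None
```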

return f"updated a nonexistent relationship: {str(self.relation)}"


class SparkRelationsCache(RelationsCache):
jtcohen6 (Contributor, Author)

The events above, and cache methods below, could absolutely move into dbt-core. The big idea here is:

  • At the start of the run, we populate a sparser cache: which relations exist, what their names are, what types they are
  • For certain operations, we need to look up more detailed information (e.g. Is this a Delta table?). In that case, we look up the info and update the cached relation, so that the next lookup will be free (see the sketch below).
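A rough sketch of the cache side of that idea, using assumed class and method names rather than the PR's exact API:

```python
# Rough sketch of the sparse-then-enriched cache described above. The class
# and method names here are assumptions for illustration only.
import threading
from typing import Dict, Optional, Tuple

Key = Tuple[Optional[str], str, str]  # (database, schema, identifier)


class SparseRelationsCache:
    def __init__(self) -> None:
        self.lock = threading.RLock()
        self.relations: Dict[Key, object] = {}

    def add(self, key: Key, relation: object) -> None:
        # Start of the run: store sparse entries (existence, name, type only).
        with self.lock:
            self.relations[key] = relation

    def upsert_with_information(self, key: Key, relation: object) -> None:
        # After a detailed lookup (e.g. a describe query told us it's a Delta
        # table), overwrite the sparse entry so the next lookup is free.
        with self.lock:
            self.relations[key] = relation

    def get(self, key: Key) -> Optional[object]:
        with self.lock:
            return self.relations.get(key)
```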

Comment on lines +289 to +291
rows: List[agate.Row] = self.execute_macro(
    GET_COLUMNS_IN_RELATION_RAW_MACRO_NAME, kwargs={"relation": relation}
)
jtcohen6 (Contributor, Author)

This is the describe table extended query that Spark uses to return detailed information, including the columns in the table
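For context, an illustrative sketch of the shape of that output: one (col_name, data_type, comment) row per column, followed by separator and table-metadata rows. The real parsing lives in the adapter; this only shows the split.

```python
# Illustrative sketch: `describe table extended` yields one
# (col_name, data_type, comment) row per column, then blank separator rows
# and "# ..." metadata rows. Only the leading rows describe columns.
from typing import Iterable, List, Tuple


def split_column_rows(rows: Iterable[Tuple[str, str, str]]) -> List[Tuple[str, str]]:
    columns: List[Tuple[str, str]] = []
    for col_name, data_type, _comment in rows:
        if not col_name or col_name.startswith("#"):
            # A blank row or a "# Detailed Table Information"-style header
            # marks the end of the column section.
            break
        columns.append((col_name, data_type))
    return columns
```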

Comment on lines +262 to +264
# We shouldn't access columns from the cache, until we've implemented
# proper cache update or invalidation at the column level
# https://github.com/dbt-labs/dbt-spark/issues/431
jtcohen6 (Contributor, Author)

Most relation attributes are unchanging, unless the relation is dropped and recreated. (dbt-core already has a mechanism to record that drop in its cache.)

However, on a few occasions, dbt does alter the columns within a relation: namely, to handle schema evolution in snapshots and incremental models (on_schema_change). We don't have a core mechanism to invalidate or update the cache when this happens. So, even though we have been and are still recording columns in the cache, we need to make get_columns_in_relation skip the cache and run a query every time.
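In rough pseudocode (helper names such as the cache-update call are assumptions, not the PR's exact body), that behavior looks like this:

```python
# Rough sketch of get_columns_in_relation as described above; helper names
# like the cache-update call are assumptions for illustration.
def get_columns_in_relation(self, relation):
    # Deliberately skip any cached columns: on_schema_change in snapshots and
    # incremental models can alter columns without invalidating the cache
    # (https://github.com/dbt-labs/dbt-spark/issues/431).
    rows = self.execute_macro(
        GET_COLUMNS_IN_RELATION_RAW_MACRO_NAME, kwargs={"relation": relation}
    )
    columns = self.parse_describe_extended(relation, rows)
    # Columns are still recorded in the cache for other lookups to reuse,
    # even though this method never reads them back.
    self.cache_added_columns(relation, columns)
    return columns
```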

Comment on lines +209 to +211
def get_relation(
    self, database: Optional[str], schema: str, identifier: str, needs_information=False
) -> Optional[BaseRelation]:
jtcohen6 (Contributor, Author), Aug 19, 2022

This adds a fourth argument to the get_relation signature, which only has 3 in dbt-core's "base" adapter implementation.

The idea: If we need "sparse" information, a sparse cache lookup will suffice. If we need "bonus" information (e.g. file format), then we need to first check to see if that additional information is available in the cache from a previous lookup. If not, we'll run a query to look it up, and update the cache accordingly.
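A minimal sketch of that dispatch, assuming a `_get_updated_relation` helper that runs the describe query and writes the richer relation back to the cache (names are illustrative, not the PR's exact code):

```python
# Minimal sketch of the needs_information dispatch described above.
# self.cache.get_relation, the `information` attribute, and
# _get_updated_relation are assumed names for illustration.
def get_relation(
    self, database: Optional[str], schema: str, identifier: str, needs_information=False
) -> Optional[BaseRelation]:
    cached = self.cache.get_relation(database, schema, identifier)
    if cached is None or not needs_information:
        # Sparse lookup: existence, name, and type are enough.
        return cached
    if getattr(cached, "information", None):
        # A previous detailed lookup already enriched this entry: free.
        return cached
    # Detailed lookup: run the describe query once, cache the richer relation,
    # and return it so later needs_information lookups don't query again.
    return self._get_updated_relation(cached)
```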

ueshin (Contributor) left a comment

We need to use needs_information=True here too.

{%- set existing_relation = load_relation(this) -%}

ueshin added a commit to databricks/dbt-databricks that referenced this pull request Dec 5, 2022
### Description

Avoids the `show table extended` command.

This is based on dbt-labs/dbt-spark#433.

1. Create a table/view list with `show tables in {{ relation }}` and `show views in {{ relation }}` commands, or the `get_tables` API when `catalog` is provided (see the sketch after this list).
2. Retrieve additional information with a `describe extended {{ relation }}` command:
    1. `get_relation` with `needs_information=True`
    2. `get_columns_in_relation`
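A sketch of step 1 above (the SQL commands are as listed; the surrounding Python and the `execute` helper are illustrative assumptions):

```python
# Illustrative sketch of building the sparse relation list from
# `show tables in ...` and `show views in ...` instead of the slower
# `show table extended ...`. The execute() helper and row shapes are assumed.
from typing import Callable, Iterable, List, Sequence, Tuple


def list_relations_sparse(
    execute: Callable[[str], Iterable[Sequence[str]]], schema: str
) -> List[Tuple[str, str, str]]:
    tables = execute(f"show tables in {schema}")  # rows like (database, tableName, isTemporary)
    views = execute(f"show views in {schema}")    # rows like (namespace, viewName, isTemporary)
    view_names = {row[1] for row in views}
    relations = []
    for row in tables:
        name = row[1]
        # In Spark SQL, `show tables` also lists views, so use the `show views`
        # result to tell the two apart.
        kind = "view" if name in view_names else "table"
        relations.append((schema, name, kind))
    return relations
```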
ueshin added a commit to databricks/dbt-databricks that referenced this pull request Dec 6, 2022
@jtcohen6 jtcohen6 closed this Jan 17, 2023
@mikealfare mikealfare deleted the jerco/pr-342-run-tests branch March 1, 2023 00:48