
Implementing metadata freshness checks #481

Merged: 5 commits merged into main on Oct 23, 2023
Conversation

benc-db (Collaborator) commented Oct 13, 2023

Description

Addresses the 'metadata freshness changes' part of dbt-labs/dbt-core#8307
Doing this in a way that can work for both hive and unity was interesting. I'm a little concerned about the accuracy/perf of the hive implementation, but it's the best I can do, and I don't have a great way to insert a check for hive metastore earlier in the call chain.

@dataders any concern that if someone has a view as a source (as is the case in my bigger transform jobs), this mechanism doesn't work? I didn't see any logic that only calls down to this path when the relation is a table, and both using the information schema in this way and describe history are table-only features. I mean, we could look at information_schema.views, but last_altered doesn't give the same sort of information for views as it does for tables.
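
For reference, a rough sketch of the two query shapes under discussion, with illustrative catalog/schema/table names (the actual macro in this PR may differ):

```sql
-- Unity Catalog: the catalog's information schema tracks last_altered directly
select table_schema, table_name, last_altered
from my_catalog.information_schema.tables
where table_schema = 'my_schema';

-- Hive metastore: no information schema, so fall back to the Delta history
-- and take the most recent commit timestamp
select max(timestamp) as last_modified
from (describe history hive_metastore.my_schema.my_table);
```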

Checklist

  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

benc-db (Collaborator Author) commented Oct 13, 2023

BTW, there may be additional tests required: the guide calls out two more functional tests. But since those tests have a lot of Snowflake-specific comments in them (and so far exist only in the Snowflake adapter), I'm waiting until they're part of core before implementing them here. I wanted to get this code up for people to look at while I'm away.

```
@@ -115,3 +125,13 @@ def matches(
    @classproperty
    def get_relation_type(cls) -> Type[DatabricksRelationType]:
        return DatabricksRelationType

    def information_schema(self, view_name=None) -> InformationSchema:
```
benc-db (Collaborator Author):

This is mostly copied from core, but I had to override so that it would use an information schema that used "`" for quoting, since InformationSchema inherits from BaseRelation.
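
To illustrate the quoting issue (with a hypothetical catalog name): Databricks quotes identifiers with backticks, while BaseRelation's default rendering would produce ANSI-style double quotes, which Databricks does not treat as identifier quoting:

```sql
-- What the Databricks information schema reference needs to look like:
select * from `my-catalog`.information_schema.tables;

-- What the unmodified BaseRelation quoting would render (not valid
-- identifier quoting on Databricks):
-- select * from "my-catalog".information_schema.tables;
```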

```
        project.adapter.drop_schema(relation)

    def test_get_relation_last_modified(self, project, set_env_vars, custom_schema):
        project.run_sql(
```
benc-db (Collaborator Author):
This test is currently hand-copied from dbt-snowflake. Nothing I do in the test is specific to Databricks, so hopefully it gets pulled into core.

benc-db (Collaborator Author):

Essentially the test just checks that we were able to run the metadata commands without warning/erroring. We definitely need another test that asserts that something useful happens. Will add after I'm back.

benc-db (Collaborator Author):

Updated the test to ensure the freshness checks passed, as opposed to merely not producing a warning or error. @dataders might be worth considering this form for core; to me at least it's more obvious what the test is doing (as opposed to the probe function). Also, this file did not match the format I found in the documentation for sources.json, btw.

dataders (Contributor) commented:
@benc-db you da real MVP for posting this. I'll definitely flag this for the adapters eng team to look at once we're back from Coalesce

susodapop previously approved these changes Oct 23, 2023

susodapop (Contributor) left a comment:
LGTM with a few comments. Agree we need a test of the true output of the freshness check, but the smoke test is a good start. Nice work.

```python
    _capabilities = CapabilityDict(
        {Capability.TableLastModifiedMetadata: CapabilitySupport(support=Support.Full)}
    )

    @available.parse(lambda *a, **k: 0)
```
Contributor:

Is this the first time our dialect has needed to use _capabilities? Or had we just inherited the _capabilities from dbt-spark? Not a blocker for this PR, but I wonder if more capabilities should be reflected explicitly here.

benc-db (Collaborator Author):

This is new in 1.7.

```python
    quote_character: str = "`"

    def is_hive_metastore(self):
        return self.database is None or self.database == "hive_metastore"
```
Contributor:

Just confirming: does dbt's database == Databricks' schema?

benc-db (Collaborator Author):

database == catalog.
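
In other words, dbt's database/schema/identifier map onto Databricks' three-level namespace. A quick illustration with hypothetical names:

```sql
-- Databricks addresses tables as <catalog>.<schema>.<table>;
-- dbt's database is the catalog, and dbt's schema is the schema.
select * from main.my_schema.my_table;
```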

```sql
{%- for relation in relations -%}
    select '{{ relation.schema }}' as schema,
           '{{ relation.identifier }}' as identifier,
           max(timestamp) as last_modified,
```
Contributor:

Just confirming: the output of a Databricks SQL query that includes a max() aggregation is never more than a single row, yes? That's why there is no GROUP BY present here?

benc-db (Collaborator Author):

In the absence of a GROUP BY, yes, it will be a single row.
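
To make that concrete, here is roughly what the rendered query might look like for two relations (illustrative names; the real macro's column list and quoting may differ). Each per-relation select aggregates to exactly one row, and the rows are combined with union all:

```sql
select 'my_schema' as schema,
       'orders' as identifier,
       max(timestamp) as last_modified
from (describe history my_schema.orders)
union all
select 'my_schema' as schema,
       'customers' as identifier,
       max(timestamp) as last_modified
from (describe history my_schema.customers);
```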

dbt/include/databricks/macros/metadata.sql (review thread outdated, resolved)
tests/functional/adapter/test_source_freshness.py (review thread outdated, resolved)
Co-authored-by: Jesse <jesse.whitehouse@databricks.com>
benc-db merged commit 9127720 into main on Oct 23, 2023
21 checks passed