Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAP-910] Support Metadata Freshness #785

Closed
peterallenwebb opened this issue Sep 22, 2023 · 7 comments · Fixed by #796
Closed

[ADAP-910] Support Metadata Freshness #785

peterallenwebb opened this issue Sep 22, 2023 · 7 comments · Fixed by #796
Assignees
Labels
enhancement New feature or request

Comments

@peterallenwebb
Copy link
Contributor

peterallenwebb commented Sep 22, 2023

Describe the feature

Support metadata-based freshness by implementing the new macro and feature flag described in #8704.

Who will this benefit?

Everyone who wants faster freshness results from Snowflake!

@sp-tkerlavage
Copy link

sp-tkerlavage commented Sep 27, 2023

Are external tables being tracked as part of this implementation?

I think for external tables the information_schema.tables.last_altered column is the time of the last refresh (or last ALTER), not the last time a file came in.

So for example if an external table has stale data, an ALTER TABLE ... REFRESH is performed but no new files are added (because the pipeline that is dropping fiels into the S3 bucket is broken for example), I think the last_altered column will have the timestamp of when the ALTER TABLE ... REFRESH that was performed. If this will cover external tables and my understanding of the metadata is correct, then this might lead to the dbt metadata reflecting that the external table is fresher than it actually is thus obfuscating an issue (a pipeline being broken)

The directory (if enabled) of the external stage would have the list of files and when they were added to the stage. So pretty much just wondering if this feature will be a behind the scenes query of the information_schema, or if freshness source can be a user defined query like originally outlined here dbt-labs/dbt-core#7012

@peterallenwebb
Copy link
Contributor Author

@sp-tkerlavage Thanks for your insight on this. Our current intention with Snowflake is to implement the metadata option using LAST_ALTERED, but I'll investigate what you are saying about external tables before we complete the feature. In general, the metadata maintained by warehouses/DBMSes comes with a wide array of warnings and caveats, and we're going to have to figure out the right balance between allowing users the convenience of relying on it and warning them about the potential drawbacks.

@graciegoheen
Copy link

graciegoheen commented Oct 9, 2023

Hi @sp-tkerlavage thanks for raising this! This won't be in-scope for 1.7, but I've opened a new feature request to track this and added it to our Applied State epic.

Would you be able to add an example query you might use to determine source freshness, so we make sure we solve your use case?

dbt-labs/dbt-core#8797

@sp-tkerlavage
Copy link

sp-tkerlavage commented Oct 10, 2023

In Snowflake an external stage can have a directory which essentially stores a list of files in the external storage (S3 Bucket, etc) along with other useful information like relative_path, url, size, MD5, and last_modified.
https://docs.snowflake.com/en/user-guide/data-load-dirtables

This directory table can be automatically refreshed when files are added via SQS
https://docs.snowflake.com/en/user-guide/data-load-dirtables-auto

The directory is not a standalone object, its just a layer on top of a stage. It can be queried like this:

SELECT relative_path, 
    last_modified, 
    CONVERT_TIMEZONE('UTC', last_modified) last_modified_utc
FROM DIRECTORY( '@database.schema.stage_name' )

However, you would have to parse the relative path to extract the name of the source table. So if for example you have a bucket that is partitioned like this:
<bucket>/source/source_table/yyyy/mm/dd/hh
You would have to do something like this

SELECT relative_path, 
    SPLIT_PART(relative_path, '/', 1) source_database,
    SPLIT_PART(relative_path, '/', 2) source_table,
    SPLIT_PART(relative_path, '/', 3) year
    ...
    last_modified, 
    CONVERT_TIMEZONE('UTC', last_modified) last_modified_utc
FROM DIRECTORY( '@prod_cdc.meta.dms' )

You could then just group by the source_database and source_table to get the max last_modified for each and do the freshness comparison.

Obviously different S3 directory structures would require different parsing.

I actually have an incremental model for this directory parsing logic in my own project, so that may be a cleaner approach. However, I'm not sure if ref'ing a model in a freshness test is even possible. And even if it is, the model would have to be ran first. But again, I'm not entirely clear on the parsing order so it may not even be possible to ref a model in a freshness test on the face of it.

@tjirab
Copy link

tjirab commented Jan 19, 2024

Minor concern: LAST_ALTERED gets updated upon DML & DDL changes, and maintenance ops. It does not necessarily mean fresh data has become available.

@mikealfare
Copy link
Contributor

Minor concern: LAST_ALTERED gets updated upon DML & DDL changes, and maintenance ops. It does not necessarily mean fresh data has become available.

Ah, good to know. @Fleid, in the context of dbt-managed objects, how large of an issue do you see this as? Is this something worth addressing with some level of urgency?

@mikealfare
Copy link
Contributor

Minor concern: LAST_ALTERED gets updated upon DML & DDL changes, and maintenance ops. It does not necessarily mean fresh data has become available.

This bug has been opened as a separate issue: #899. This issue will remain closed and conversation on it may not necessarily be seen due to status. Please move any new conversation over to the new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants