[CT-2144] [Feature] Allow source freshness to be evaluated from table metadata when adapter supports it #7012

Closed
Tracked by #8316
adamcunnington-mlg opened this issue Feb 20, 2023 · 24 comments
Labels: enhancement (New feature or request), Impact: Adapters


adamcunnington-mlg (Author) commented Feb 20, 2023:

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Currently, source freshness can only be determined by querying a column in a table. When there are a lot of tables, even with a high number of threads, the amount of time it takes to compute source freshness might be unacceptably long - it's effectively just metadata collation after all.

However, many databases track table modification times (although they don't always distinguish DDL changes from DML changes) and expose them via a metadata route. For example, BigQuery has the well-known INFORMATION_SCHEMA views, which expose various metadata attributes.

This isn't consistently available across databases, but there's already precedent for adapter-specific functionality/availability in dbt.

I propose a config parameter - perhaps a location key - that lets the user control where the freshness information comes from (query vs. metadata). Metadata would only be supported for specific adapters, and where it is, the backend implementation would probably look something like this:

  • evaluate all source YAML configs to determine which want to use metadata
  • produce the minimum number of queries possible - e.g. maybe just one metadata query against a database-level information schema view (in BigQuery you can use region-foo.INFORMATION_SCHEMA rather than the dataset-specific INFORMATION_SCHEMA; see the sketch below this list)
  • iterate through response to get source-table-level information
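
For illustration, here's a minimal sketch of what that single metadata query could look like on BigQuery, assuming the region-qualified INFORMATION_SCHEMA.TABLE_STORAGE view (discussed further down this thread; the choice of view and region here is an assumption, not a settled design):

-- Hypothetical single region-level metadata query: one scan covers every
-- source table, instead of one freshness query per table.
SELECT
  table_schema,                          -- the BigQuery dataset
  table_name,
  storage_last_modified_time AS last_modified
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE;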

It would massively speed up the computation of source freshness and reduce query load/cost (although the latter is negligible and not the main motivation for the FR).

Lastly, a significant benefit is that it doesn't rely on a specific column being consistently available - it provides a more generic implementation for the masses.

Describe alternatives you've considered

An alternative is to somehow allow a per-statement increase in threads, so we could just spam the database with lots of queries to evaluate source freshness faster.

Who will this benefit?

Everyone! But the benefit is especially pronounced for those who:

  • Configure their dbt prod job to be dynamic/diff-based, e.g. using source_status:fresher+ (docs)
  • Have a lot of source tables - e.g. some BI context that isn't internal and experiences growth - this would probably resonate with anyone who uses phrases like "onboarding feeds"

Are you interested in contributing this feature?

Yes - I can get one of my devs to help, although we're new to contributing, so it may be a slower route than if the core team can afford to prioritise it.

Acceptance Criteria

  1. Add a new static boolean property to the base adapter called METADATA_FRESHNESS_SUPPORT. It will default to False.
  2. When an adapter implementation overrides this property and sets it to True, and the user has requested that freshness for a source be determined via metadata, dbt-core will look for a macro called collect_freshness_via_metadata(sources). The new macro will accept a List[BaseRelation] and return a query which will generate a table with two columns: source and last_modified. This will allow core to batch freshness requests.
  3. In the YML description of a source, freshness can be requested via metadata, with the exact details flexible.
  4. [Optional? Discuss in Refinement/Estimation] For sources marked for freshness via metadata, dbt should issue a single query instead of one per source (see the sketch below this list).
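
As a rough illustration of the contract in points 2 and 4, this is the kind of SQL a BigQuery implementation of collect_freshness_via_metadata might render (a hypothetical sketch; the dataset and table names are placeholders, and the choice of INFORMATION_SCHEMA view is an assumption):

-- Hypothetical rendered output of collect_freshness_via_metadata(sources):
-- one row per requested relation, with a `source` identifier column that
-- dbt-core can map back to the corresponding source node.
SELECT
  CONCAT(table_schema, '.', table_name) AS source,
  storage_last_modified_time            AS last_modified
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_schema = 'analytics'              -- placeholder dataset
  AND table_name IN ('orders', 'customers');  -- placeholder source tables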

dbt Labs Supported Adapter Implementation Links

adamcunnington-mlg added the enhancement (New feature or request) and triage labels Feb 20, 2023
github-actions bot changed the title from "[Feature] Allow source freshness to be evaluated from table metadata when adapter supports it" to "[CT-2144] [Feature] Allow source freshness to be evaluated from table metadata when adapter supports it" Feb 20, 2023
jtcohen6 (Contributor):

Slack thread with additional context / motivation

jtcohen6 added the Team:Adapters (Issues designated for the adapter area of the code) label Feb 21, 2023
Fleid (Contributor) commented Feb 28, 2023:

Thanks for opening the issue @adamcunnington-mlg

As a conversation starter, what do you think of something like this:

  • freshness
    • loaded_at_location
      • expression: default, current behavior
      • query: expecting a full SQL statement instead (example below this list)
      • metadata: pre-configured by the adapter to leverage known metadata fields
    • loaded_at_field
      • if expression: current behavior
      • if query: a SELECT statement returning a single value
      • if metadata: ignored
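
To make the query option concrete, here's a hedged example of the kind of statement a user might supply (table and column names are placeholders):

-- Hypothetical loaded_at_field when loaded_at_location is "query": a full
-- SELECT statement returning a single freshness timestamp, rather than the
-- bare column name or expression used today.
SELECT MAX(event_loaded_at) FROM analytics.web_events;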

We need to go through the different platforms and see if we really need a query one, or if we can just go metadata everywhere instead.

@jtcohen6 do you know of other areas in dbt where a similar choice is offered? I'm wondering about the ergonomics of the location/field setup.

Fleid added the Refinement (Maintainer input needed) label and removed the triage label Feb 28, 2023
adamcunnington-mlg (Author):

@Fleid thanks for this - suggestion looks great to me.

The only thing I was thinking about this morning: when doing the per-adapter implementation, we should be careful to digest the docs and ensure there are no caveats from the DB providers about the metadata being stale/laggy. That would significantly undermine the checks!

Fleid (Contributor) commented Mar 1, 2023:

That's a good point. I'd rather we don't support metadata if the implementation is subpar. Do you have a specific vendor in mind for that comment, or is that more of a general concern?

adamcunnington-mlg (Author):

@Fleid no specific concern. I've come across this for BQ before but for attributes like storage.

Having said that, a bigger concern I have right now is that last-modified doesn't seem to be available in the INFORMATION_SCHEMA.TABLES view. I'm hoping they've moved it somewhere else (it was definitely in the previous version of TABLES), but I can't find it... hopefully your adapter team can. I can raise a support ticket with Google if it helps.

jtcohen6 (Contributor) commented Mar 2, 2023:

My sense is that, when you select max(column_value) from some_table, Snowflake should already be able to answer that as a metadata-only query. (I know the same isn't true of BigQuery, at least in the sense of the bytes billed.)

I think the more compelling aspect of this approach is the ability to grab the freshness of multiple sources from a single query. If there's an information_schema view containing a last_modified field for all tables in a database/schema, dbt could determine the freshness of all sources in that database/schema in one go. That would require some more significant refactoring of how the FreshnessTask works currently, as it expects to spin up a thread for each selected source, and run a query for each (in parallel). It's not so dissimilar from how our "catalog" queries work today, for docs generate.

Fleid (Contributor) commented Mar 3, 2023:

I like that Jeremy. A lot. But that will be a slow one.

In the meantime, should we let users be clever for us at the table level?
By that I mean limiting the feature set to:

version: 2

sources:
  - name: <source_name>
    freshness:
      # warn_after / error_after, as today
    loaded_at_field: <column_name_or_expression_or_statement>
    loaded_at_location: expression | query

    tables:
      - name: <table_name>
        freshness:
          # warn_after / error_after, as today
        loaded_at_field: <column_name_or_expression_or_statement>
        loaded_at_location: expression | query
        ...

The idea being that loaded_at_field should always return a single value, and loaded_at_location declares whether it is to be interpreted as an expression (a column name or expression; the default, current behavior) or a query (a full SQL statement).

To be honest, I'm not too happy with the loaded_at_field / loaded_at_location terms. If loaded_at_location is the location where we find the thing that is loaded_at, then the associated field should either be named for its value, loaded_at_value (because that's what we want, and we provide an expression for it), or for its form, loaded_at_expression (an expression/statement).
I may be happier with loaded_at_field_type instead of location. But then metadata will feel clunky.

adamcunnington-mlg (Author):

This feels like a good short-term compromise to me. It could always be labelled as a beta feature and subject to change; anyone using it wholesale could move to an updated version supported in a later dbt-core release.

What would the next step be?

P.S. for BigQuery, I've raised a ticket with Google Cloud to ask where on earth the table-last-updated metadata is, because for the life of me I cannot find it in any information schema table, which seems nuts. I assume I'm missing something.

Fleid (Contributor) commented Mar 9, 2023:

In terms of next step, we're projecting multiple changes for sources for 1.6. I think that would be a great time to include this.
But that means July.

The adapters team is still catching up on all the PRs across the repos, plus regression and bugs. So I don't see us being able to touch this before then even if we wanted to.

Fleid added this to the v1.6 milestone Mar 9, 2023
Fleid removed the Refinement (Maintainer input needed) label Mar 9, 2023
adamcunnington-mlg (Author):

I'm cool with that. Thank you.

I'll post an update when Google Cloud support confirm where the table-last-updated metadata lives.

adamcunnington-mlg (Author):

The docs are currently outdated, as TABLE_STORAGE is a preview feature. It's pretty irritating that BigQuery has removed a common feature - "table last updated" - from the previous information schema view and added it to a new preview table instead! Grr, hopefully this will be GA soon.
https://cloud.google.com/bigquery/docs/information-schema-table-storage
Also, I note the documentation says: "The data in this table is not kept in real time, and might be delayed by a few seconds to a few minutes." For most dbt schedules, this isn't going to be an issue - I think you should give the user the choice. I'd be fine with this limitation, but I have asked the BigQuery team to better qualify it.
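
For anyone wanting to gauge that lag on their own data, here's a quick (hypothetical) check against the view from the page linked above - compare these timestamps with your pipeline's known load times (the dataset name is a placeholder):

-- How long ago was each table last modified, according to the metadata?
-- Recently loaded tables showing unexpectedly large values here would
-- indicate the documented metadata delay.
SELECT
  table_name,
  storage_last_modified_time,
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), storage_last_modified_time, MINUTE)
    AS minutes_since_last_modified
FROM `region-eu`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_schema = 'analytics'  -- placeholder dataset
ORDER BY storage_last_modified_time DESC;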


adamcunnington-mlg (Author):

@Fleid hi there - please can I check this is still pencilled in for July? It hurts us more and more every day - we're running 550 queries an hour rather than 1!

adamcunnington-mlg (Author):

@Fleid I'd be grateful if you could comment on the above now that 1.5 has shipped. Just looking to manage internal expectations and plan around this. Our source freshness job sometimes takes 10+ minutes now - it should take 10 seconds. Thanks

Fleid (Contributor) commented May 9, 2023:

Hi @adamcunnington-mlg, we've been slipping on MVs, and I'm not 100% sure we're going to be able to fit the managed sources work into 1.6. I'm exploring alternative options to get this issue done independently.

Fleid self-assigned this May 9, 2023
adamcunnington-mlg (Author):

@Fleid hi - just wondering if you've reached any conclusions? Since first raising this issue, we're now averaging 10 minutes for our source freshness step - and it continues to increase, which scares us! This feels like such a small (relatively) feature with such a big impact.

graciegoheen (Contributor):

Hey @adamcunnington-mlg - have you tried adding a freshness filter config to limit the amount of data being processed by the source freshness check? It doesn't negate the needs discussed in this thread, but it might be a short-term solution for improving the runtime of your source freshness check (especially if your sources are partitioned tables).

adamcunnington-mlg (Author) commented Jun 9, 2023:

@graciegoheen thanks for the thought - unfortunately, that won't help in our case, as we wouldn't know what to filter for; the whole point of the source freshness stage is to understand what has changed. Our data integration pipeline is triggered almost arbitrarily - a huge volume of arbitrary schedules as well as ad-hoc actions which could affect historic date ranges. It is the very outcome of this step that allows our build stage to actually be an effective delta.

Fleid (Contributor) commented Jun 21, 2023:

Hey @adamcunnington-mlg - sadly we didn't have the chance to take a stab at this, and I don't see us being able to in the near future.

@dataders if you do end up re-opening the box of managed sources, this should be on your radar ;)

adamcunnington-mlg (Author):

@Fleid thanks for the update - although super disappointing! What would it take to get commitment against an upcoming minor version?

We'd offer to PR this, but my feeling is it will take a lot to get accepted: it touches a fairly fundamental part of dbt that affects a core concept in the docs, and I suspect there could be varying opinions on how it should work - perhaps you'd want multi-adapter support from the get-go. Is there precedent for extending generic functionality but filling in the implementation for just one backend? Perhaps we will fork so we can add it for BQ.

jtcohen6 (Contributor):

Removing this from the v1.6 milestone for now. However: I do think there are some compelling use cases that we could enable with faster/cheaper source freshness checks—on the data platforms that support it (Snowflake and BigQuery at minimum), and understanding that each platform implementation will have some latency/limitations.

The trickiest thing about the approach here will be our need to run a single query, and then map that query's result back to multiple nodes, rather than running individual queries (in parallel) for each source node. As we chatted about further up in the thread, there's some precedent for this — that's the pattern dbt follows right now for docs generate. We know that can lead to significant time/memory consumption at scale, so we'd want this implementation to still respect node selection (--select): If I'm only checking the freshness for a handful of sources, filter down the information schema / run only a subset of show statements accordingly.
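
As a sketch of that selection-aware behavior (hypothetical, reusing the BigQuery view discussed earlier in the thread; names are placeholders), the batched query would be narrowed to just the selected relations:

-- Metadata freshness query scoped to the sources matched by --select,
-- rather than scanning every table in the region.
SELECT
  table_schema,
  table_name,
  storage_last_modified_time AS last_modified
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_schema = 'analytics'
  AND table_name IN ('orders', 'customers');  -- resolved from --select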

jtcohen6 removed this from the v1.6 milestone Jun 22, 2023
Fleid (Contributor) commented Jun 28, 2023:

It's not unreasonable to scope that down to BQ only at the moment. We need to be pragmatic.

@adamcunnington-mlg I understand you not wanting to waste cycles. I do think that if you can get a lightweight implementation out for BQ, following @jtcohen6's advice above, there is a path to getting it merged.

adamcunnington-mlg (Author) commented Jul 5, 2023:

@Fleid thanks, we'll take a look at it - although even then, it does seem like a fairly significant change.

Am I right in understanding that the work will be split in the following ways:

  • dbt-core: will need to be changed to only run 1 query and then map back results, and to expect different config fields
  • dbt-bigquery: will contain the logic that will query INFORMATION_SCHEMA and return the results to dbt-core

dataders added this to the v1.7 milestone Aug 3, 2023
graciegoheen added the Impact: Adapters label and removed the Team:Adapters (Issues designated for the adapter area of the code) label Sep 25, 2023
graciegoheen removed this from the v1.7 milestone Jan 22, 2024
martynydbt modified the milestone: v1.8 Feb 8, 2024
Fleid (Contributor) commented Feb 23, 2024:

@adamcunnington-mlg I'm closing this as solved in 1.7.3.
Let's open a follow-up issue if you think we can improve on the current implementation.

Fleid closed this as completed Feb 23, 2024
graciegoheen (Contributor) commented Feb 23, 2024:

@Fleid I believe the final follow-up here is allowing source freshness metadata to be collected in batch - we have a follow-up ticket for that here, currently slotted for the 1.8 release.
