
Allow partition filters to be passed to source freshness checks #1495

Closed
whittid4 opened this issue May 31, 2019 · 5 comments · Fixed by #1776
Labels: bigquery, enhancement

Comments

@whittid4

Feature

Allow partition filters to be passed to source freshness checks

If the source table has a mandatory partition filter (as BigQuery offers) then the out-of-the-box freshness check will fail since the partition filter cannot be passed to the SQL query.

Cannot query over table 'prod.core.stock' without a filter over column(s) 'date' that can be used for partition elimination

This is because the freshness check generates and runs the following query:

select
      max(date) as max_loaded_at,
      CURRENT_TIMESTAMP() as snapshotted_at
    from `prod`.`core`.`stock`
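
For context, this error comes from source tables created with BigQuery's require_partition_filter option. An illustrative DDL (the table and column names are hypothetical, mirroring the error above):

-- hypothetical table definition that would trigger the error above
create table `prod.core.stock`
partition by date
options (require_partition_filter = true)
as
select date '2019-05-31' as date, 'sku-1' as sku, 100 as quantity;

Any query against such a table must include a predicate on date, which the generated query above does not have.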

Who will this benefit?

This feature would benefit users who have mandatory partition filters applied to their tables. It is currently an issue on BigQuery, but it would apply to any warehouse with a mandatory partition feature.

@drewbanin
Contributor

Thanks for making this issue @whittid4! I think that mandatory partition requirements are a good idea for source tables on BQ, so I'd be interested in figuring out a way to make this work.

The challenge here is that dbt tries to discern:

  1. if there is a record within the acceptable freshness window
  2. what the most recent datapoint for the table is

Adding a filter here won't necessarily impact dbt's ability to determine whether fresh records exist, but it could impact dbt's knowledge of when the table was last loaded into.

I can imagine a filter config to the freshness block that looks like:

version: 2

sources:
  - name: ecommerce
    loaded_at_field: date

    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
      filter: "date > date_sub(current_date(), INTERVAL 7 days)     # this is new
      
    tables:
      - name: orders
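
If dbt appended the filter as an extra predicate, the generated freshness query would presumably become something like this (a sketch only; the exact SQL is up to the implementation):

select
      max(date) as max_loaded_at,
      CURRENT_TIMESTAMP() as snapshotted_at
    from `prod`.`core`.`stock`
    -- sketch: assumes the configured filter is injected verbatim
    where date > date_sub(current_date(), interval 7 day)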

You buy that?

@thenaturalist

thenaturalist commented Sep 19, 2019

Hey, adding to this: I'd make filter a parameter at the source (i.e. table) level, given that the exact same problem arises when testing time-partitioned (source) tables in BQ.

At the moment the workaround would be, as you suggested @drewbanin, to create views for the sources, but that creates somewhat of a semantic duplication that requires extra context.

Would be great if it could all be kept neatly in one place.
Cheers
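
For concreteness, a table-level variant of the earlier sketch might look like this (hypothetical; it assumes the same filter key would be honored under a table's own freshness block):

version: 2

sources:
  - name: ecommerce
    loaded_at_field: date

    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}

    tables:
      - name: orders
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 12, period: hour}
          # hypothetical: per-table filter overriding the source-level config
          filter: "date > date_sub(current_date(), interval 7 day)"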

@drewbanin
Contributor

Thanks for that @thenaturalist - really good point!

@drewbanin drewbanin added this to the Louisa May Alcott milestone Sep 19, 2019
@drewbanin drewbanin added enhancement New feature or request bigquery labels Sep 19, 2019
@thenaturalist

Hey @drewbanin, another issue I noticed today: the current limitation of taking the route through ephemeral models forces a duplication of all sources in a given schema.yml if I want to enable proper data lineage by using the {{ source() }} reference. Basically, there is one source_a_validation table referencing a filtered view, and another source_a table which is actually used in queries, as it contains all the data.

@drewbanin
Contributor

Hey @thenaturalist - check out #1776 - that in-progress PR implements filtering for the dbt source snapshot-freshness command.
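
For reference, that filter would apply when snapshotting freshness from the CLI, along the lines of:

# scoping to a single source with --select (source name from the example above)
dbt source snapshot-freshness --select ecommerce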

Regarding schema tests: we're going to hold off on automatically applying filters to schema tests on BQ for the moment. This is very possible to implement in user-space, and it may be something we choose to support natively in dbt in the future.

You can find all of dbt's built-in schema tests here. You can make your own copies of these test macros specifying filters as required. Here's an example of what that might look like:

{% macro test_unique(model) %}

{% set column_name = kwargs.get('column_name', kwargs.get('arg')) %}

select count(*)
from (

    select
        {{ column_name }}

    from {{ model }}
    where {{ column_name }} is not null
      -- NEW: Add a filter on the partition time
      and _PARTITIONTIME > '2019-01-01'
    group by {{ column_name }}
    having count(*) > 1

) validation_errors

{% endmacro %}

You can make this _partitiontime filter dynamic (e.g. check the last 30 days), or you can supply an argument to the schema test. See the docs on custom schema tests for more information. If you drop this macro into your dbt project's macros/ directory, dbt will prefer your schema test macro over the built-in version. Alternatively, you can name the macro something like test_unique_filtered, then test columns with unique_filtered instead of unique.
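
If you go the test_unique_filtered route, applying it in a schema.yml looks like any other schema test (the model and column names here are made up for illustration):

version: 2

models:
  - name: stock
    columns:
      - name: sku
        tests:
          # hypothetical custom test defined in macros/
          - unique_filtered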

beckjake added a commit that referenced this issue Sep 26, 2019: Add filter field to source table definitions (#1495)