
Allow partition filters to be passed to source freshness checks #1495

Closed
whittid4 opened this issue May 31, 2019 · 5 comments · Fixed by #1776
Labels: bigquery, enhancement

Comments

@whittid4

Feature

Allow partition filters to be passed to source freshness checks

If the source table has a mandatory partition filter (as BigQuery offers) then the out-of-the-box freshness check will fail since the partition filter cannot be passed to the SQL query.

Cannot query over table 'prod.core.stock' without a filter over column(s) 'date' that can be used for partition elimination

This is because the freshness check generates and runs the following query:

select
      max(date) as max_loaded_at,
      CURRENT_TIMESTAMP() as snapshotted_at
    from `prod`.`core`.`stock`
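
For context, this error comes from source tables created with BigQuery's require_partition_filter option. An illustrative DDL (the table and column names are hypothetical, mirroring the error above):

-- hypothetical table definition that would trigger the error above
create table `prod.core.stock`
partition by date
options (require_partition_filter = true)
as
select date '2019-05-31' as date, 'sku-1' as sku, 100 as quantity;

Any query against such a table must include a predicate on date, which the generated query above does not have.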

Who will this benefit?

This feature would benefit users who have mandatory partition filters applied to their tables. It is currently an issue on BigQuery, but it would apply to any warehouse with a mandatory partition feature.

@drewbanin
Contributor

Thanks for making this issue @whittid4! I think that mandatory partition requirements are a good idea for source tables on BQ, so I'd be interested in figuring out a way to make this work.

The challenge here is that dbt tries to discern:

  1. if there is a record within the acceptable freshness window
  2. what the most recent datapoint for the table is

Adding a filter here won't necessarily impact dbt's ability to determine whether fresh records exist, but it could impact dbt's knowledge of when the table was last loaded into.

I can imagine a filter config to the freshness block that looks like:

version: 2

sources:
  - name: ecommerce
    loaded_at_field: date

    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
      filter: "date > date_sub(current_date(), INTERVAL 7 days)     # this is new
      
    tables:
      - name: orders
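
If dbt appended the filter as an extra predicate, the generated freshness query would presumably become something like this (a sketch only; the exact SQL is up to the implementation):

select
      max(date) as max_loaded_at,
      CURRENT_TIMESTAMP() as snapshotted_at
    from `prod`.`core`.`stock`
    -- sketch: assumes the configured filter is injected verbatim
    where date > date_sub(current_date(), interval 7 day)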

You buy that?

@thenaturalist

thenaturalist commented Sep 19, 2019

Hey, adding to this: I'd make filter a parameter at the source (i.e. table) level, given that the exact same problem arises when testing time-partitioned (source) tables in BQ.

At the moment the workaround would be, as you suggested @drewbanin, to create views for the sources, but that creates somewhat of a semantic duplication that requires extra context.

Would be great if it could all be kept neatly in one place.
Cheers
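
For concreteness, a table-level variant of the earlier sketch might look like this (hypothetical; it assumes the same filter key would be honored under a table's own freshness block):

version: 2

sources:
  - name: ecommerce
    loaded_at_field: date

    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}

    tables:
      - name: orders
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 12, period: hour}
          # hypothetical: per-table filter overriding the source-level config
          filter: "date > date_sub(current_date(), interval 7 day)"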

@drewbanin
Contributor

Thanks for that @thenaturalist - really good point!

@drewbanin drewbanin added this to the Louisa May Alcott milestone Sep 19, 2019
@drewbanin drewbanin added enhancement New feature or request bigquery labels Sep 19, 2019
@thenaturalist

Hey @drewbanin, another issue I noticed today: the current limitation of taking the route through ephemeral models forces a duplication of all sources in a given schema.yml if I want to enable proper data lineage by using the {{ source() }} reference. Basically, there is one source_a_validation table referencing a filtered view, and another source_a table which is actually used in queries, as it contains all the data.

@drewbanin
Contributor

Hey @thenaturalist - check out #1776 - that in-progress PR implements filtering for the dbt source snapshot-freshness command.
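
For reference, that filter would apply when snapshotting freshness from the CLI, along the lines of:

# scoping to a single source with --select (source name from the example above)
dbt source snapshot-freshness --select ecommerce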

Regarding schema tests: we're going to hold off on automatically applying filters to schema tests on BQ for the moment. This is very possible to implement in user-space, and it may be something we choose to support natively in dbt in the future.

You can find all of dbt's built-in schema tests here. You can make your own copies of these test macros specifying filters as required. Here's an example of what that might look like:

{% macro test_unique(model) %}

{% set column_name = kwargs.get('column_name', kwargs.get('arg')) %}

select count(*)
from (

    select
        {{ column_name }}

    from {{ model }}
    where {{ column_name }} is not null
      -- NEW: Add a filter on the partition time
      and _PARTITIONTIME > '2019-01-01'
    group by {{ column_name }}
    having count(*) > 1

) validation_errors

{% endmacro %}

You can make this _partitiontime filter dynamic (e.g. check the last 30 days), or you can supply an argument to the schema test. See the docs on custom schema tests for more information. If you drop this macro into your dbt project's macros/ directory, dbt will prefer your schema test macro over the built-in version. Alternatively, you can name the macro something like test_unique_filtered, then test columns with unique_filtered instead of unique.
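
If you go the test_unique_filtered route, applying it in a schema.yml looks like any other schema test (the model and column names here are made up for illustration):

version: 2

models:
  - name: stock
    columns:
      - name: sku
        tests:
          # hypothetical custom test defined in macros/
          - unique_filtered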

beckjake added a commit that referenced this issue Sep 26, 2019: Add filter field to source table definitions (#1495)