
Feature/add interval arg to values every n datepart #110

Conversation

@lewisarmistead (Contributor)

This PR addresses issue #109 by adding an optional argument for grouping by count of date_part in expect_row_values_to_have_data_for_every_n_datepart.

On the note from that issue about whether mod will work across databases, I saw that dbt Labs uses mod in one of their cross-DB utility macros, so I'm assuming it's safe. I verified that the changes run using Snowflake, and I saw that BigQuery, Postgres, Redshift, & Spark all have the same mod function as well.

@clausherther (Contributor) commented Sep 28, 2021

@lewisarmistead thanks for this PR!

However, it looks like, at least in Postgres, both arguments to mod need to be integers, or at least of the same type. I added a test to schema.yml:

  - name: timeseries_data_extended
    tests:
        - dbt_expectations.expect_table_columns_to_match_ordered_list:
            column_list: ["date_day", "row_value", "row_value_log"]
        - dbt_expectations.expect_row_values_to_have_data_for_every_n_datepart:
            date_col: date_day
            date_part: day
        - dbt_expectations.expect_row_values_to_have_data_for_every_n_datepart:
            date_col: date_day
            date_part: day
            interval: 2

which fails for me:

Database Error in test dbt_expectations_expect_row_values_to_have_data_for_every_n_datepart_timeseries_data_extended_date_day__day__2 (models/schema_tests/schema.yml)
  function mod(double precision, integer) does not exist
  LINE 157:     where mod(date_part('day', date_day), 2) = 0
                      ^
  HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

I also tried interval: 2.0, which fails with:

function mod(double precision, numeric) does not exist
  LINE 157:     where mod(date_part('day', date_day), 2.0) = 0

@clausherther (Contributor)

@lewisarmistead btw, I'm not sure I fully understand the use case (or the implementation of it). I know you're mostly interested in hourly intervals, but something like day / interval: 7 would simply check whether we have data on the 7th/14th/21st/28th of each month (i.e. whether the day number is divisible by 7 without remainder), which doesn't seem to have a lot of use cases as far as I can tell. Based on the updated README, I would think that having data for each "7-day" period implies every nth day, which I can see some use for, but I don't think that's what this implementation allows?
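To make the distinction concrete, here's a minimal Python sketch (the daily spine starting 2021-01-01 and the interval of 7 are made-up illustration data, not anything from the PR) contrasting the two readings: "day-of-month divisible by 7" versus "every 7th day from a reference date":

```python
from datetime import date, timedelta

start = date(2021, 1, 1)
days = [start + timedelta(days=i) for i in range(60)]

# Reading 1 (what mod(day_number, 7) = 0 checks): dates whose
# day-of-month is divisible by 7, i.e. the 7th/14th/21st/28th.
divisible_day = [d for d in days if d.day % 7 == 0]

# Reading 2 ("every nth day"): dates a whole multiple of 7 days
# after the reference start date.
every_nth = [d for d in days if (d - start).days % 7 == 0]

print(divisible_day[:3])  # the 7th, 14th, 21st of January
print(every_nth[:3])      # Jan 1, Jan 8, Jan 15
```

Note how reading 2 depends on the reference date, while reading 1 resets at every month boundary.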

@clausherther (Contributor)

Re: the type issue above, this may work (I can't see needing non-integer arguments?)

    {% if interval %}
    where mod(
            cast({{ dbt_date.date_part(date_part, 'date_' + date_part) }} as {{ dbt_utils.type_int() }}),
            cast({{interval}} as {{ dbt_utils.type_int() }})
        ) = 0
    {% endif %}

…e loop.index (calogica#112)

* Fixes calogica#111 - refactor row_number to use loop.index

* Update CHANGELOG
@lewisarmistead (Contributor, Author)

@clausherther - apologies for the intermittent attention here, but thanks for the great feedback! I started work on your suggestions and will push up some changes for your review soon.

RE: integer arguments for the mod function - I agree completely with forcing the type to int. Would it be worth error handling for non-int inputs? e.g. if someone uses 0.5 days for the interval?

RE: checking data presence over a period rather than each X interval - I did misunderstand this, and I'm close to getting the proper aggregation developed for the condition when the interval arg is passed. I'll have something for your review soon on that front too.

@clausherther (Contributor)

@lewisarmistead thanks for working on this and contributing in the first place! Let me know if I can help with anything here.

This change handles the mod function, which only takes integer arguments, more robustly. It also aggregates row counts across intervals when joining on the date spine to correctly detect data presence in the target model

update conditions based on interval

update styling

update styling
@lewisarmistead (Contributor, Author)

Thanks @clausherther - I added some changes reflecting the comments above. I didn't want to slow down execution of the original test, so I favored no-op conditions over readability/concision. Let me know your thoughts!

clausherther and others added 9 commits October 6, 2021 08:31
…oth start and end dates are set (calogica#115)

* fix none when both test dates are set

* Add support for dbt 0.21 (calogica#116)

* Update README.md

* fix none when both test dates are set

* Update README

Co-authored-by: Claus Herther <claus@calogica.com>
for checking presence every n-date_parts instead of every date_part
… github.com:lewisarmistead/dbt-expectations into pr/lewisarmistead/110
@clausherther (Contributor)

FYI, rebased to incorporate #115 and #116

from
base_date_windows d
left join model_data f
on f.date_{{ date_part }} >= d.date_{{ date_part }} and f.date_{{ date_part }} < d.interval_end
@clausherther (Contributor)

@lewisarmistead fyi, BigQuery won't let you do a left join like that

LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.

base_date_windows d
left join
model_data f
on d.date_{{ date_part }} <= f.date_{{ date_part }} and
@clausherther (Contributor)

@lewisarmistead FYI, BigQuery won't let you do a left join like this.

LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.

The condition added to the model_data CTE is meant to (loosely) emulate Snowflake's [`TIME_SLICE`](https://docs.snowflake.com/en/sql-reference/functions/time_slice.html), which should allow exact matches to the base_dates CTE for better time bucketing.
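As a rough sketch of that idea (the function name, the hour granularity, and the 2-hour interval here are hypothetical illustrations, not the PR's actual code), snapping a timestamp down to the start of its interval-wide bucket via a datediff/mod arithmetic round trip looks like:

```python
from datetime import datetime, timedelta

def time_slice(ts, start, interval_hours):
    # Hypothetical emulation of Snowflake's TIME_SLICE for hour buckets:
    # whole hours elapsed since the reference start...
    diff_hours = int((ts - start).total_seconds() // 3600)
    # ...minus the remainder mod the interval, snaps ts down to the
    # boundary of its bucket.
    return start + timedelta(hours=diff_hours - diff_hours % interval_hours)

start = datetime(2021, 1, 1)
print(time_slice(datetime(2021, 1, 1, 5, 30), start, 2))  # 2021-01-01 04:00:00
```

Because every timestamp in a bucket maps to the same boundary value, an equality join against the date spine becomes possible, which is what BigQuery requires for a left join.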
@lewisarmistead (Contributor, Author)

@clausherther - I recently pushed some changes that I think fix the BigQuery issue and simplify the additions in this PR overall. I wound up spinning up an instance of BigQuery, so I've now tested on Snowflake and BigQuery. Apologies for the wait here, and thanks for the feedback!

@clausherther (Contributor) left a comment

Couple of small comments, I think we're almost there! Generally, though: could we add a few comments to explain what the mod and other additions do? I'm afraid a year (or a month?) from now, I might not remember what these sections do. Thanks!!

{{dbt_utils.dateadd(date_part, 'interval_diff', 'truncated_date')}} as date_{{ date_part }},
count(*) as row_cnt
from (
select
@clausherther (Contributor)

Could we break this out into a CTE instead of a subquery? Generally, we try to stay away from subqueries.

select
cast({{ dbt_utils.date_trunc(date_part, date_col) }} as {{ dbt_expectations.type_datetime() }}) as truncated_date,
mod(
cast({{dbt_utils.datediff("'"~start_date~"'", date_col, date_part)}} as {{ dbt_utils.type_int() }}),
@clausherther (Contributor)

Nit pick: we'll want space before and after the ~ operator, e.g. "'" ~ start_date ~ "'"

from (
select
cast({{ dbt_utils.date_trunc(date_part, date_col) }} as {{ dbt_expectations.type_datetime() }}) as truncated_date,
mod(
@clausherther (Contributor)

Is this mod transform the same as on line 47? If so, I'd like to refactor this in a way that doesn't repeat code.

@lewisarmistead (Contributor, Author)

@clausherther - thanks for the feedback. To address some of the comments you left, I removed the subquery in the model_data CTE and instead passed the columns from that subquery into the dateadd function that was using them directly as SQL statements. Let me know if you'd like any other formatting changes as well.

I like your idea of making the use of these mod functions more DRY, but I'm not sure the best way to go about that right now. Both uses of mod starting on lines 47 and 65 are for calculating the difference between a datetime column and regular intervals starting at a reference date - the former is for downsampling the spine to a less granular interval and the latter is used for calculating differences that can be used to slice timestamps to the desired interval (similar to Snowflake's TIME_SLICE). Some options here:

  • create a macro that will calculate differences between a datetime column and nearest regular intervals starting at a certain reference date (params: start/reference date, datetime column, date part, interval)
    • pro: DRY
    • con: seems pretty narrow in scope - it would just be configuring the two arguments to pass to mod, i.e.
{% macro datetime_interval_diff(start_date, date_col, date_part, interval) %}
        mod(
            cast({{ dbt_utils.datediff("'" ~ start_date ~ "'", date_col, date_part) }} as {{ dbt_utils.type_int() }}),
            cast({{interval}} as {{ dbt_utils.type_int() }})
        )
{% endmacro %}
  • leave as is
    • pro: potentially more manageable by not tying the two functional goals of downsampling and slicing together and showing the query more directly
    • con: feels hacky and hard to read
  • create the macro in the first option and use it in a time_slice-like macro - my main thought here is that these macros potentially belong in a different package - I'm also not sure where you would want macros like these, e.g. at the top of this script or elsewhere?

I'm open to any ideas on this front, so let me know your thoughts!

@lewisarmistead (Contributor, Author)

Rereading this, I'm also noticing that creating a macro for how we're using mod here is probably useful, e.g. for ensuring that start_date is the appropriate type, which the current working version doesn't handle independently.

@clausherther (Contributor)

Hi @lewisarmistead! Really appreciate you refactoring this. I think even with the section that concatenates the mod statement, this looks a lot cleaner to me (we had two uses of mod previously, I think?)
I don't think, at this point, I'd invest more in refactoring that section out into a macro since, as you point out, it'd be pretty narrow in scope.
However, since that section is pretty dense, could we just add a solid comment on what exactly is going on in there? Maybe even with a quick example of what you're calculating? Without seeing data, I have a hard time unraveling the dateadd/mod/datediff logic.
Other than that, I think we're there!
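For what it's worth, the general dateadd/mod/datediff pattern can be walked through with a small Python example (the sample dates, 2-day interval, and bucket_start helper are hypothetical illustrations, not the PR's actual macro output):

```python
from datetime import date, timedelta

start = date(2021, 1, 1)  # reference date (e.g. the spine's start date)
interval = 2              # bucket width, in days

def bucket_start(d):
    # datediff(start, d, day) mod interval: how many days d sits past
    # the most recent interval boundary
    interval_diff = (d - start).days % interval
    # dateadd(day, -interval_diff, d): snap d back onto that boundary
    return d - timedelta(days=interval_diff)

for d in [start + timedelta(days=i) for i in range(5)]:
    print(d, "->", bucket_start(d))
# Jan 1 and Jan 2 map to Jan 1; Jan 3 and Jan 4 map to Jan 3; Jan 5 maps to Jan 5
```

Every date in a 2-day window collapses to the window's first day, so grouping on the snapped date yields per-interval row counts.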

@clausherther self-requested a review November 6, 2021 15:55
@clausherther (Contributor) left a comment

Once we add some comments to the outlined section, I think we're good to go. Thanks!

cast({{ dbt_utils.date_trunc(date_part, date_col) }} as {{ dbt_expectations.type_datetime() }}) as date_{{ date_part }},

{% else %}

@clausherther (Contributor)

Would be great to add comments, along with an example showing sample data, of what this section does.

@lewisarmistead (Contributor, Author)

I added some comments but had some difficulty keeping them brief. Please edit as you see fit, and feel free to reach out with any questions

@lewisarmistead (Contributor, Author)

> Once we add some comments to the outlined section, I think we're good to go. Thanks!

Thanks, @clausherther! Let me know if you need anything else

@clausherther self-requested a review November 10, 2021 16:07
@clausherther (Contributor) left a comment

LGTM!

@clausherther (Contributor)

@lewisarmistead thanks for all the code comments and hanging in there with me as we worked through this!

@lewisarmistead (Contributor, Author)

> @lewisarmistead thanks for all the code comments and hanging in there with me as we worked through this!

Happy to contribute - thanks for the feedback @clausherther !
