
Feature/add interval arg to values every n datepart #110

Conversation

@lewisarmistead (Contributor)

This PR addresses issue #109 by adding an optional argument for grouping by count of date_part in expect_row_values_to_have_data_for_every_n_datepart.

On the note from that issue about whether mod will work across databases, I saw that dbt Labs uses mod in one of their cross-DB utility macros, so I'm assuming it's safe. I verified that the changes run using Snowflake, and I saw that BigQuery, Postgres, Redshift, & Spark all have the same mod function as well.

@clausherther (Contributor) commented Sep 28, 2021

@lewisarmistead thanks for this PR!

However, it looks like, at least in Postgres, both arguments to mod need to be integers, or at least of the same type. I added a test to schema.yml:

  - name: timeseries_data_extended
    tests:
        - dbt_expectations.expect_table_columns_to_match_ordered_list:
            column_list: ["date_day", "row_value", "row_value_log"]
        - dbt_expectations.expect_row_values_to_have_data_for_every_n_datepart:
            date_col: date_day
            date_part: day
        - dbt_expectations.expect_row_values_to_have_data_for_every_n_datepart:
            date_col: date_day
            date_part: day
            interval: 2

which fails for me:

Database Error in test dbt_expectations_expect_row_values_to_have_data_for_every_n_datepart_timeseries_data_extended_date_day__day__2 (models/schema_tests/schema.yml)
  function mod(double precision, integer) does not exist
  LINE 157:     where mod(date_part('day', date_day), 2) = 0
                      ^
  HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

I also tried interval: 2.0, which fails with:

function mod(double precision, numeric) does not exist
  LINE 157:     where mod(date_part('day', date_day), 2.0) = 0

@clausherther (Contributor)

@lewisarmistead btw, I'm not sure I fully understand the use case (or the implementation of it). I know you're mostly interested in hourly intervals, but something like day / interval: 7 would simply check whether we have data on the 7th/14th/21st/28th of each month (i.e. whether the day number is divisible by 7 without remainder), which doesn't seem to have a lot of use cases as far as I can tell. Based on the updated README, I would think that having data for each "7-day" period implies every nth day, which I can see some use for, but I don't think that's what this implementation allows?
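To make the distinction concrete, here's a minimal Python sketch (the daily spine starting 2021-01-01 and the interval of 7 are made-up illustration data, not anything from the PR) contrasting the two readings: "day-of-month divisible by 7" versus "every 7th day from a reference date":

```python
from datetime import date, timedelta

start = date(2021, 1, 1)
days = [start + timedelta(days=i) for i in range(60)]

# Reading 1 (what mod(day_number, 7) = 0 checks): dates whose
# day-of-month is divisible by 7, i.e. the 7th/14th/21st/28th.
divisible_day = [d for d in days if d.day % 7 == 0]

# Reading 2 ("every nth day"): dates a whole multiple of 7 days
# after the reference start date.
every_nth = [d for d in days if (d - start).days % 7 == 0]

print(divisible_day[:3])  # the 7th, 14th, 21st of January
print(every_nth[:3])      # Jan 1, Jan 8, Jan 15
```

Note how reading 2 depends on the reference date, while reading 1 resets at every month boundary.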

@clausherther (Contributor)

Re: the type issue above, this may work (I can't see needing non-integer arguments?)

    {% if interval %}
    where mod(
            cast({{ dbt_date.date_part(date_part, 'date_' + date_part) }} as {{ dbt_utils.type_int() }}),
            cast({{interval}} as {{ dbt_utils.type_int() }})
        ) = 0
    {% endif %}

…e loop.index (calogica#112)

* Fixes calogica#111 - refactor row_number to use loop.index

* Update CHANGELOG
@lewisarmistead (Contributor, Author)

@clausherther - apologies for the intermittent attention here, but thanks for the great feedback! I started work on your suggestions and will push up some changes for your review soon.

RE: integer arguments for the mod function - I agree completely with forcing the type to int. Would it be worth error handling for non-int inputs? e.g. if someone uses 0.5 days for the interval?

RE: checking data presence over a period rather than each X interval - I did misunderstand this, and I'm close to getting the proper aggregation developed for the condition when the interval arg is passed. I'll have something for your review soon on that front too.

@clausherther (Contributor)

@lewisarmistead thanks for working on this and contributing in the first place! Let me know if I can help with anything here.

This change handles the mod function, which only takes integer arguments, more robustly. It also aggregates row counts across intervals when joining on the date spine to correctly detect data presence in the target model

update conditions based on interval

update styling

update styling
@lewisarmistead (Contributor, Author)

Thanks @clausherther - I added some changes reflecting the comments above. I didn't want to slow down execution of the original test, so I favored no-op conditions over readability/concision. Let me know your thoughts!

clausherther and others added 9 commits October 6, 2021 08:31
…oth start and end dates are set (calogica#115)

* fix none when both test dates are set

* Add support for dbt 0.21 (calogica#116)

* Update README.md

* fix none when both test dates are set

* Update README

Co-authored-by: Claus Herther <claus@calogica.com>
for checking presence every n-date_parts instead of every date_part
… github.com:lewisarmistead/dbt-expectations into pr/lewisarmistead/110
@clausherther (Contributor)

FYI, rebased to incorporate #115 and #116

from
base_date_windows d
left join model_data f
on f.date_{{ date_part }} >= d.date_{{ date_part }} and f.date_{{ date_part }} < d.interval_end
@clausherther (Contributor)

@lewisarmistead fyi, BigQuery won't let you do a left join like that

LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.

base_date_windows d
left join
model_data f
on d.date_{{ date_part }} <= f.date_{{ date_part }} and
@clausherther (Contributor)

@lewisarmistead FYI, BigQuery won't let you do a left join like this.

LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.

The condition added to the model_data CTE is meant to (loosely) emulate Snowflake's [`TIME_SLICE`](https://docs.snowflake.com/en/sql-reference/functions/time_slice.html), which should allow exact matches to the base_dates CTE for better time bucketing.
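As a rough sketch of that idea (the function name, the hour granularity, and the 2-hour interval here are hypothetical illustrations, not the PR's actual code), snapping a timestamp down to the start of its interval-wide bucket via a datediff/mod arithmetic round trip looks like:

```python
from datetime import datetime, timedelta

def time_slice(ts, start, interval_hours):
    # Hypothetical emulation of Snowflake's TIME_SLICE for hour buckets:
    # whole hours elapsed since the reference start...
    diff_hours = int((ts - start).total_seconds() // 3600)
    # ...minus the remainder mod the interval, snaps ts down to the
    # boundary of its bucket.
    return start + timedelta(hours=diff_hours - diff_hours % interval_hours)

start = datetime(2021, 1, 1)
print(time_slice(datetime(2021, 1, 1, 5, 30), start, 2))  # 2021-01-01 04:00:00
```

Because every timestamp in a bucket maps to the same boundary value, an equality join against the date spine becomes possible, which is what BigQuery requires for a left join.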
@lewisarmistead (Contributor, Author)

@clausherther - I recently pushed some changes that I think fix the BigQuery issue and simplify the additions in this PR overall. I wound up spinning up an instance of BigQuery, so I've now tested on Snowflake and BigQuery. Apologies for the wait here, and thanks for the feedback!

@clausherther (Contributor) left a comment

Couple of small comments, I think we're almost there! Generally, though: could we add a few comments to explain what the mod and other additions do? I'm afraid a year (or a month?) from now, I might not remember what these sections do. Thanks!!

{{dbt_utils.dateadd(date_part, 'interval_diff', 'truncated_date')}} as date_{{ date_part }},
count(*) as row_cnt
from (
select
@clausherther (Contributor)

Could we break this out into a CTE instead of a subquery? Generally, we try to stay away from subqueries.

select
cast({{ dbt_utils.date_trunc(date_part, date_col) }} as {{ dbt_expectations.type_datetime() }}) as truncated_date,
mod(
cast({{dbt_utils.datediff("'"~start_date~"'", date_col, date_part)}} as {{ dbt_utils.type_int() }}),
@clausherther (Contributor)

Nit pick: we'll want space before and after the ~ operator, e.g. "'" ~ start_date ~ "'"

from (
select
cast({{ dbt_utils.date_trunc(date_part, date_col) }} as {{ dbt_expectations.type_datetime() }}) as truncated_date,
mod(
@clausherther (Contributor)

Is this mod transform the same as on line 47? If so, I'd like to refactor this in a way that doesn't repeat code.

@lewisarmistead (Contributor, Author)

@clausherther - thanks for the feedback. To address some of the comments you left, I removed the subquery in the model_data CTE and instead passed the columns from that subquery into the dateadd function that was using them directly as SQL statements. Let me know if you'd like any other formatting changes as well.

I like your idea of making the use of these mod functions more DRY, but I'm not sure the best way to go about that right now. Both uses of mod starting on lines 47 and 65 are for calculating the difference between a datetime column and regular intervals starting at a reference date - the former is for downsampling the spine to a less granular interval and the latter is used for calculating differences that can be used to slice timestamps to the desired interval (similar to Snowflake's TIME_SLICE). Some options here:

  • create a macro that will calculate differences between a datetime column and nearest regular intervals starting at a certain reference date (params: start/reference date, datetime column, date part, interval)
    • pro: DRY
    • con: seems pretty narrow in scope - it would just be configuring the two arguments to pass to mod, i.e.
{% macro datetime_interval_diff(start_date, date_col, date_part, interval) %}
        mod(
            cast({{ dbt_utils.datediff("'" ~ start_date ~ "'", date_col, date_part) }} as {{ dbt_utils.type_int() }}),
            cast({{interval}} as {{ dbt_utils.type_int() }})
        )
{% endmacro %}
  • leave as is
    • pro: potentially more manageable by not tying the two functional goals of downsampling and slicing together and showing the query more directly
    • con: feels hacky and hard to read
  • create the macro in the first option and use it in a time_slice-like macro - my main thought here is that these macros potentially belong in a different package - I'm also not sure where you would want macros like these, e.g. at the top of this script or elsewhere?

I'm open to any ideas on this front, so let me know your thoughts!

@lewisarmistead (Contributor, Author)

Rereading this, I'm also noticing that creating a macro for how we're using mod here is probably useful, e.g. for ensuring that start_date is the appropriate type, which the current working version doesn't handle independently.

@clausherther (Contributor)

Hi @lewisarmistead! Really appreciate you refactoring this. I think even with the section that concatenates the mod statement, this looks a lot cleaner to me (we had two uses of mod previously, I think?)
I don't think, at this point, I'd invest more in refactoring that section out into a macro since, as you point out, it'd be pretty narrow in scope.
However, since that section is pretty dense, could we just add a solid comment on what exactly is going on in there? Maybe even with a quick example of what you're calculating? Without seeing data, I have a hard time unraveling the dateadd/mod/datediff logic.
Other than that, I think we're there!
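For what it's worth, the general dateadd/mod/datediff pattern can be walked through with a small Python example (the sample dates, 2-day interval, and bucket_start helper are hypothetical illustrations, not the PR's actual macro output):

```python
from datetime import date, timedelta

start = date(2021, 1, 1)  # reference date (e.g. the spine's start date)
interval = 2              # bucket width, in days

def bucket_start(d):
    # datediff(start, d, day) mod interval: how many days d sits past
    # the most recent interval boundary
    interval_diff = (d - start).days % interval
    # dateadd(day, -interval_diff, d): snap d back onto that boundary
    return d - timedelta(days=interval_diff)

for d in [start + timedelta(days=i) for i in range(5)]:
    print(d, "->", bucket_start(d))
# Jan 1 and Jan 2 map to Jan 1; Jan 3 and Jan 4 map to Jan 3; Jan 5 maps to Jan 5
```

Every date in a 2-day window collapses to the window's first day, so grouping on the snapped date yields per-interval row counts.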

@clausherther self-requested a review November 6, 2021 15:55
@clausherther (Contributor) left a comment

Once we add some comments to the outlined section, I think we're good to go. Thanks!

cast({{ dbt_utils.date_trunc(date_part, date_col) }} as {{ dbt_expectations.type_datetime() }}) as date_{{ date_part }},

{% else %}

@clausherther (Contributor)

Would be great to add comments, along with an example showing sample data, of what this section does.

@lewisarmistead (Contributor, Author)

I added some comments but had some difficulty keeping them brief. Please edit as you see fit, and feel free to reach out with any questions

@lewisarmistead (Contributor, Author)

> Once we add some comments to the outlined section, I think we're good to go. Thanks!

Thanks, @clausherther! Let me know if you need anything else

@clausherther self-requested a review November 10, 2021 16:07
@clausherther (Contributor) left a comment

LGTM!

@clausherther (Contributor)

@lewisarmistead thanks for all the code comments and hanging in there with me as we worked through this!

@lewisarmistead (Contributor, Author)

> @lewisarmistead thanks for all the code comments and hanging in there with me as we worked through this!

Happy to contribute - thanks for the feedback @clausherther !
