
Support for hourly, weekly, monthly bucket size in anomaly tests #439

Closed
rloredo opened this issue Aug 21, 2022 · 7 comments · Fixed by elementary-data/dbt-data-reliability#85 or elementary-data/dbt-data-reliability#202
Labels
Enhancement New feature or request

Comments

@rloredo
Contributor

rloredo commented Aug 21, 2022

In anomaly detection, the default bucket size is 1 day. I couldn't find a way to configure the size of these buckets.

Checking the code here I think that it's not an easy fix, since it also depends on how the data is collected before this query is executed.

I would keep exploring, but perhaps you can guide me on what should be done to be able to do this.

@Maayan-s
Contributor

Hi @rloredo! Thanks for opening this issue, makes perfect sense.
You are right that this is not an easy fix.
However, we had this in mind as a future option when we wrote the tests, so it's not a fundamental change.
I'll test a bit and describe the changes required here, so we can discuss further :)

@rloredo rloredo changed the title Support for hourly, weekly, monthly Support for hourly, weekly, monthly bucket size in anomaly tests Aug 21, 2022
@Maayan-s
Contributor

Thanks again @rloredo, let me know what you think about this:

Here's a short background on our tests -
Each anomalies test is a macro under ./macros/edr/tests/. Each test runs two queries: one to collect new metrics, and one to calculate the anomaly score. The results of the queries are saved to tables and merged into the elementary models in an on-run-end hook.
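The two-query flow could be pictured roughly like this (illustrative shapes only, not the package's actual SQL; the table and column names are made up):

```sql
-- Query 1, metrics collection (illustrative): aggregate the monitored
-- columns into daily buckets.
select
    date_trunc('day', updated_at) as bucket_start,
    count(*)                      as row_count,
    avg(amount)                   as amount_avg
from some_monitored_model
group by 1;

-- Query 2, anomaly score (illustrative): a z-score of each bucket's
-- metric against the mean and standard deviation of the training window.
select
    bucket_start,
    (row_count - avg(row_count) over ())
        / nullif(stddev(row_count) over (), 0) as anomaly_score
from collected_metrics;
```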

[Diagram: elementary anomalies tests]

My suggestion is to break the change down into pieces.
I think the best place to start would be the column-level tests. After we add this end-to-end, it should be pretty easy to reuse the macros and change the other tests as well.

Here's a description of what's needed to support this:

  1. We need to add bucket_size (is this the best name?) to the config of the tests. I think it makes sense to configure it at the model and/or test level. You can look at the implementation of timestamp_column here. I suggest having a list of supported time formats, and if the config is not in the list, returning day as the default.

  2. This config needs to be added as a param to 4 tests, starting with:

  • test_all_columns_anomalies
  • test_column_anomalies

  3. The tests should pass the bucket size as a param to the following queries:

  • column_monitoring_query
  • get_anomaly_scores_query

  4. In the queries there is a macro named elementary.date_trunc that today always receives 'day', but it can receive any date part supported by the 4 platforms we support ('hour', 'week' and 'month' are all supported!).

  5. In the metrics_final CTE, the following columns should be affected by the config:

  • bucket_end
  • bucket_duration_hours
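The first and last steps above could be sketched roughly as follows. Everything here is an assumption for illustration: the macro name `get_bucket_size`, how the config value reaches it, and `dateadd`/`datediff` as stand-ins for the package's cross-platform date macros; the lateral column aliases are Snowflake-flavored.

```sql
{# Sketch of step 1: resolve the configured bucket size against a
   whitelist, falling back to 'day' when it is missing or unsupported.
   The macro name and signature are hypothetical. #}
{% macro get_bucket_size(configured_value) %}
    {%- set supported = ['hour', 'day', 'week', 'month'] -%}
    {%- if configured_value in supported -%}
        {{- configured_value -}}
    {%- else -%}
        {{- 'day' -}}
    {%- endif -%}
{% endmacro %}

{# Sketch of steps 4-5: once the date part is a variable instead of a
   hardcoded 'day', bucket_end and bucket_duration_hours follow from it. #}
select
    {{ elementary.date_trunc(bucket_size, 'metric_timestamp') }} as bucket_start,
    dateadd('{{ bucket_size }}', 1, bucket_start)                as bucket_end,
    datediff('hour', bucket_start, bucket_end)                   as bucket_duration_hours
from metrics_final
```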

As the anomaly scores query is shared between all the tests and has complex logic, I think it's better if the team makes the required changes to support the different buckets there.

@rloredo
Contributor Author

rloredo commented Aug 22, 2022

Hi, @Maayan-s! Thank you for your fast response. It looks clearer now :)
I can dedicate some time to doing some of the steps, but the last part wasn't clear to me: do you want to leave that piece to the team? I wouldn't mind waiting for you to do it either.

For the bucket_size thing, I think it's good to change the name... I think we can use this as inspiration. Note that they also use lookback_periods, trend_periods, and test_periods. I guess that days_back should also change to periods_back (or lookback_periods).

Let me know what you think and if you want me to take some of this!

@Maayan-s
Contributor

Thanks, @rloredo!
We would do it at some point, but I don't think we will get to it in the next several weeks.
If you are open to taking on the steps I described, this will make it easier for us to prioritize.
We could have someone on the team finish the anomaly_scores_query changes and integration to the UI as well.

What do you think?

@rloredo
Contributor Author

rloredo commented Aug 22, 2022

Great, I can give it a try then :)

@rloredo
Contributor Author

rloredo commented Aug 28, 2022

So, I did most of the changes in elementary-data/dbt-data-reliability#85
These tests are working and writing the data correctly to __test tables:

  • table_anomalies
  • all_columns_anomalies
  • column_anomalies

I couldn't make dimension_anomalies work, mainly because of the metrics_anomaly_score view.
Also, I've only tested the daily_buckets_cte (renamed to period_buckets_cte) on Snowflake and BigQuery; I don't have access to Redshift.
I didn't run any integration or e2e tests either.
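For reference, a period-bucket CTE of this kind might look roughly like the following on Snowflake (purely illustrative, not the PR's actual implementation, which is dispatched per adapter; the hourly period and 14-day window are made-up example values):

```sql
-- Hypothetical Snowflake-flavored sketch: enumerate hourly bucket
-- boundaries covering a 14-day lookback window (336 = 14 days * 24 hours).
with period_buckets as (
    select
        dateadd(
            'hour',
            seq4(),
            dateadd('day', -14, date_trunc('hour', current_timestamp()))
        ) as bucket_start
    from table(generator(rowcount => 336))
)
select
    bucket_start,
    dateadd('hour', 1, bucket_start) as bucket_end
from period_buckets
```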

The #TODO's are marked in the code.

This is as far as I can get! Let me know if someone will pick it up and if you need any clarification :)

@Maayan-s
Contributor

Wow @rloredo, thank you so much!
Awesome work, you actually implemented a lot more than I expected :)
I'll find time to review in the next few days!
Also, we will internally prioritize the remaining tasks to make this functionality operational.

@elongl elongl transferred this issue from elementary-data/dbt-data-reliability Nov 29, 2022