
Support for hourly, weekly, monthly bucket size in anomaly tests #439

Closed
rloredo opened this issue Aug 21, 2022 · 7 comments · Fixed by elementary-data/dbt-data-reliability#85 or elementary-data/dbt-data-reliability#202
Labels
Enhancement New feature or request

Comments

@rloredo
Contributor

rloredo commented Aug 21, 2022

In anomaly detection, the default bucket size is 1 day. I couldn't find a way to configure the size of these buckets.

Checking the code here I think that it's not an easy fix, since it also depends on how the data is collected before this query is executed.

I would keep exploring, but perhaps you can guide me on what should be done to be able to do this.

@Maayan-s
Contributor

Hi @rloredo! Thanks for opening this issue, makes perfect sense.
You are right that this is not an easy fix.
However, we had this in mind as a future option when we wrote the tests, so it's not a fundamental change.
I'll test a bit and describe the changes required here, so we can discuss further :)

@rloredo rloredo changed the title Support for hourly, weekly, monthly Support for hourly, weekly, monthly bucket size in anomaly tests Aug 21, 2022
@Maayan-s
Contributor

Thanks again @rloredo, let me know what you think about this:

Here's a short background on our tests -
Each anomalies test is a macro under ./macros/edr/tests/. Each test runs two queries: one to collect new metrics, and one to calculate the anomaly score. The results of the queries are saved to tables and merged into the elementary models in an on-run-end hook.
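The two-query flow could be pictured roughly like this (illustrative shapes only, not the package's actual SQL; the table and column names are made up):

```sql
-- Query 1, metrics collection (illustrative): aggregate the monitored
-- columns into daily buckets.
select
    date_trunc('day', updated_at) as bucket_start,
    count(*)                      as row_count,
    avg(amount)                   as amount_avg
from some_monitored_model
group by 1;

-- Query 2, anomaly score (illustrative): a z-score of each bucket's
-- metric against the mean and standard deviation of the training window.
select
    bucket_start,
    (row_count - avg(row_count) over ())
        / nullif(stddev(row_count) over (), 0) as anomaly_score
from collected_metrics;
```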

[Diagram: elementary anomalies tests]

My suggestion is to break the change down into pieces.
I think the best place to start would be the column-level tests. After we add this end-to-end, it should be pretty easy to reuse the macros and change the other tests as well.

Here's a description of what's needed to support this:

  1. We need to add bucket_size (is this the best name?) to the config of the tests. I think it makes sense to configure it at the model and/or test level. You can look at the implementation of timestamp_column here. I suggest having a list of supported time formats, and if the config is not in the list, returning day as the default.

  2. This config needs to be added as a param to 4 tests, starting with:

  • test_all_columns_anomalies
  • test_column_anomalies

  3. The tests should pass the bucket size as a param to the following queries:

  • column_monitoring_query
  • get_anomaly_scores_query

  4. In the queries there is a macro named elementary.date_trunc that today always receives 'day', but it can receive any date part supported by the 4 platforms we support ('hour', 'week' and 'month' are all supported!).

  5. In the metrics_final CTE, the following columns should be affected by the config:

  • bucket_end
  • bucket_duration_hours
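The first and last steps above could be sketched roughly as follows. Everything here is an assumption for illustration: the macro name `get_bucket_size`, how the config value reaches it, and `dateadd`/`datediff` as stand-ins for the package's cross-platform date macros; the lateral column aliases are Snowflake-flavored.

```sql
{# Sketch of step 1: resolve the configured bucket size against a
   whitelist, falling back to 'day' when it is missing or unsupported.
   The macro name and signature are hypothetical. #}
{% macro get_bucket_size(configured_value) %}
    {%- set supported = ['hour', 'day', 'week', 'month'] -%}
    {%- if configured_value in supported -%}
        {{- configured_value -}}
    {%- else -%}
        {{- 'day' -}}
    {%- endif -%}
{% endmacro %}

{# Sketch of steps 4-5: once the date part is a variable instead of a
   hardcoded 'day', bucket_end and bucket_duration_hours follow from it. #}
select
    {{ elementary.date_trunc(bucket_size, 'metric_timestamp') }} as bucket_start,
    dateadd('{{ bucket_size }}', 1, bucket_start)                as bucket_end,
    datediff('hour', bucket_start, bucket_end)                   as bucket_duration_hours
from metrics_final
```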

As the anomaly scores query is shared between all the tests and has complex logic, I think it's better if the team makes the required changes to support the different buckets there.

@rloredo
Contributor Author

rloredo commented Aug 22, 2022

Hi, @Maayan-s! Thank you for your fast response. It looks clearer now :)
I can dedicate some time to doing some of the steps, but the last part wasn't clear to me: do you want to leave that piece to the team? I wouldn't mind waiting for you to do it either.

For the bucket_size thing, I think it's good to change the name... I think we can use this as inspiration. Note that they also use lookback_periods, trend_periods, and test_periods. I guess that days_back should also change to periods_back (or lookback_periods).

Let me know what you think and if you want me to take some of this!

@Maayan-s
Contributor

Thanks, @rloredo!
We would do it at some point, but I don't think we will get to it in the next several weeks.
If you are open to taking on the steps I described, this will make it easier for us to prioritize.
We could have someone on the team finish the anomaly_scores_query changes and integration to the UI as well.

What do you think?

@rloredo
Contributor Author

rloredo commented Aug 22, 2022

Great, I can give it a try then :)

@rloredo
Contributor Author

rloredo commented Aug 28, 2022

So, I did most of the changes in elementary-data/dbt-data-reliability#85
These tests are working and writing the data correctly to __test tables:

  • table_anomalies
  • all_columns_anomalies
  • column_anomalies

I couldn't make dimension_anomalies work, mainly because of the metrics_anomaly_score view.
Also, I've only tested the daily_buckets_cte (renamed to period_buckets_cte) on Snowflake and BigQuery; I don't have access to Redshift.
I didn't run any integration or e2e tests either.
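For reference, a period-bucket CTE of this kind might look roughly like the following on Snowflake (purely illustrative, not the PR's actual implementation, which is dispatched per adapter; the hourly period and 14-day window are made-up example values):

```sql
-- Hypothetical Snowflake-flavored sketch: enumerate hourly bucket
-- boundaries covering a 14-day lookback window (336 = 14 days * 24 hours).
with period_buckets as (
    select
        dateadd(
            'hour',
            seq4(),
            dateadd('day', -14, date_trunc('hour', current_timestamp()))
        ) as bucket_start
    from table(generator(rowcount => 336))
)
select
    bucket_start,
    dateadd('hour', 1, bucket_start) as bucket_end
from period_buckets
```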

The #TODO's are marked in the code.

This is as far as I can get! Let me know if someone will pick it up and if you need any clarification :)

@Maayan-s
Contributor

Wow @rloredo, thank you so much!
Awesome work, you actually implemented a lot more than I expected :)
I'll find time to review in the next few days!
Also, we will internally prioritize the remaining tasks to make this functionality operational.

@elongl elongl transferred this issue from elementary-data/dbt-data-reliability Nov 29, 2022