Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --hours-back CLI option for edr monitor #1548

Open
miktros opened this issue Jun 6, 2024 · 3 comments · May be fixed by #1549
Open

Add --hours-back CLI option for edr monitor #1548

miktros opened this issue Jun 6, 2024 · 3 comments · May be fixed by #1549

Comments

@miktros
Copy link

miktros commented Jun 6, 2024

Is your feature request related to a problem? Please describe.

Documentation for volume_anomalies lists hour as an option for configuring detection_period. However, configuring detection_period using hour results in compilation error: Missing mandatory configuration: ['backfill_days']

Describe the solution you'd like

Elementary tests like volume_anomalies test allow configuring time_bucket by the hour. I would like to be able to configure the detection_period using the hour option so that I can arrange for test runs such that anomaly alerts are emitted based on test failures of comparing the row count of the most recent hourly time bucket of detection_period against row count of time buckets for the last training_period days.

Describe alternatives you've considered

Introduce a new CLI option --hours-back for edr monitor to optionally set a number-of-hours limit to how far back should edr monitor look for pending alerts. If provided, it overrides --days-back.

I have a POC implementation that seems to work. PR to add optional --hours-back for edr monitor here.

Additional context

None.

Would you be willing to contribute this feature?

I am open to contributing to this feature and would appreciate any guidance you can provide.

@ellakz
Copy link
Contributor

ellakz commented Jun 17, 2024

Hi @miktros
Indeed, when we added the detection_period and training_period params we translated them under the hood to "days_back" and "backfill_days", making "hour" unit not be supported.
The solution for this is to change the operation under the hood in the dbt package and make the queries really use hours for the training and detection.
However, in your PR you added an "hours_back" option to the monitor CLI so I'm not sure I understand how it solves that same problem?
If you could shed some light on it I would be happy to review.
Thanks!

@miktros
Copy link
Author

miktros commented Jun 18, 2024

By splitting data into 1-hour time buckets in volume.anomalies tests, we were hoping to be able to run dbt test and then edr monitor every hour so that alert is emitted only for a test with row count in the most recent 1-hour time bucket that is lower than past 21 day average (time bucket failure). We use the "hours-back" option, edr monitor --hours-back 1, to achieve the alerting behavior as described above. With the "days-back" option, alert is emitted each time edr monitor is run when there is a time bucket failure any time during the day, even though the most recent time bucket has no failure. In short, we want per hour alert notification that reflects failure condition of the most recent hour.

@ellakz
Copy link
Contributor

ellakz commented Jun 19, 2024

So I'm not sure this behavior really works, because when the failed test happened in the last hour an alert will be fired, even if the failed metric isn't from the last hour, because you have a detection period of 1 day - so every failure in the past 1 day will fail the test, and if it happens in the hour prior to running edr monitor an alert will be fired.

A different workaround I can suggest is to use the alert_suppression_interval flag in your CLI - setting it to 24 for example will only alert on the same issue once every 24 hours, preventing you from getting duplicate alerts on the same issue. Does this help?

If not-
If I understand correctly, what you want to do it open a PR in the dbt-package repo, changing the way the detection_period param is handled in the test queries.
In here You can see the param 'detection_period' is being translated to 'backfill_days' which is being then piped into the rest of the macros handling the test query, but actually handling it and translating it to the chosen unit and setting the query accordingly is the required change to achieve the behavior you need.
I understand this is not the easiest contribution to do, and at some point we will probably do it ourselves. But I can't commit to a timeline at the moment unfortunately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants