
If a date filter is implied by the days_back parameter in a data volume alert, filter by the timestamp_column in the monitored_table CTE #1158

Closed
garfieldthesam opened this issue Sep 13, 2023 · 1 comment

garfieldthesam commented Sep 13, 2023

Is your feature request related to a problem? Please describe.
Unless I'm misunderstanding how the package works, adding a days_back config to a data volume anomaly detection test means that no more than days_back days' worth of data should be used to evaluate the test.
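
For reference, this is the shape of configuration I'm describing (a minimal sketch based on my understanding of the package; the model name, column name, and parameter values are placeholders):

models:
  - name: very_large_table
    tests:
      - elementary.volume_anomalies:
          timestamp_column: partition_ts   # placeholder partition timestamp column
          time_bucket:
            period: day
            count: 1
          days_back: 14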

I created a data volume test on a very large table that is well optimized when queried with a filter on its partition timestamp column. I was surprised that my tests against this table, with days_back configured, were timing out. Digging into the code executed on the Databricks cluster, I found that the first CTE was the following:

with monitored_table as (
  select
    *
  from
    {{very_large_table}}
),

If I hadn't set an appropriate timeout on the cluster where these tests ran, this could have run up very large infra costs!

Other anomaly tests may follow a similar pattern, but I have not tested them.

This makes adoption of elementary more painful than it needs to be because:

  • It creates a predictable (and potentially costly) failure state for anomaly detection tests
  • The developer must dig into the compiled code and troubleshoot why a test that should run easily timed out
  • The documentation doesn't make it clear that the days_back parameter doesn't filter the table in question using the timestamp_column as one might expect

Describe the solution you'd like
Given that the test already knows the timestamp_column and the number of days_back, I would expect the compiled code to look something like the following (in Spark SQL syntax), filtering the table down to the date range needed to evaluate the test:

monitored_table as (
  select
    *
  from
    {{very_large_table}}
  where
    date_trunc('{{time_bucket}}', {{timestamp_column}}) between
        date_trunc('{{time_bucket}}', date_sub(current_date(), {{days_back}} + 1)) and
        date_sub(date_trunc('{{time_bucket}}', current_date()), 1)
),

Describe alternatives you've considered

  1. Update the documentation to make it clearer that the days_back parameter does not filter the table in question, and that those filters need to be applied using the where_expression parameter instead (see the sketch after this list)
  2. Do nothing, since it may be complicated to implement this filtering given all of the parameters that can be applied to a test
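
For alternative 1, the workaround I have in mind looks roughly like this (a sketch only; where_expression is the parameter mentioned above, and the column name and lookback window are placeholders I haven't verified):

models:
  - name: very_large_table
    tests:
      - elementary.volume_anomalies:
          timestamp_column: partition_ts
          days_back: 14
          # manually push the date filter down to the partition column
          where_expression: "partition_ts >= date_sub(current_date(), 15)"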

Additional context
Somewhat related to but distinct from issue #1329

Would you be willing to contribute this feature?
May be willing to contribute at some point in Oct 2023 or later.

@haritamar
Collaborator

Hi @garfieldthesam!
Thanks for opening this issue, and sorry for the delay.
Similar to my response to the other issue, I believe this filtering is now done as you'd expect, so I tend to think this is no longer an issue.
Closing, but feel free to re-open if you feel otherwise.
