
If a date filter is implied by the days_back parameter in a data volume alert, filter by the timestamp_column in the monitored_table CTE #1158

Closed
garfieldthesam opened this issue Sep 13, 2023 · 1 comment

garfieldthesam commented Sep 13, 2023

Is your feature request related to a problem? Please describe.
Unless I'm misunderstanding how the package works, adding a days_back config to a data volume anomaly detection test means that no more than days_back days' worth of data should be used to evaluate the test.
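
For reference, this is the shape of configuration I'm describing (a minimal sketch based on my understanding of the package; the model name, column name, and parameter values are placeholders):

models:
  - name: very_large_table
    tests:
      - elementary.volume_anomalies:
          timestamp_column: partition_ts   # placeholder partition timestamp column
          time_bucket:
            period: day
            count: 1
          days_back: 14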

I created a data volume test on a very large table that is well optimized when queried with a filter on its partition timestamp column. I was surprised that my tests against this table, with days_back configured, were timing out. Digging into the code executed on the Databricks cluster, I found that the first CTE was the following:

with monitored_table as (
  select
    *
  from
    {{very_large_table}}
),

If I hadn't set an appropriate timeout on the cluster where these tests ran, this could have run up very large infra costs!

Other anomaly tests may follow a similar pattern, but I have not tested them.

This makes adoption of elementary more painful than it needs to be because:

  • It creates a predictable (and potentially costly) failure state for anomaly detection tests
  • The developer must dig into the compiled code and troubleshoot why a test that should run easily timed out
  • The documentation doesn't make it clear that the days_back parameter doesn't filter the table in question using the timestamp_column as one might expect

Describe the solution you'd like
Given that the test already knows the timestamp_column and the number of days_back, I would expect the compiled code to look something like the following (in Spark SQL syntax), filtering the table down to the date range needed to evaluate the test:

monitored_table as (
  select
    *
  from
    {{very_large_table}}
  where
    date_trunc('{{time_bucket}}', {{timestamp_column}}) between
        date_trunc('{{time_bucket}}', date_sub(current_date(), {{days_back}} + 1)) and
        date_sub(date_trunc('{{time_bucket}}', current_date()), 1)
),

Describe alternatives you've considered

  1. Update the documentation to make it clearer that the days_back parameter does not filter the table in question, and that those filters need to be applied using the where_expression parameter instead (see the sketch after this list)
  2. Do nothing, since it may be complicated to implement this filtering given all of the parameters that can be applied to a test
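
For alternative 1, the workaround I have in mind looks roughly like this (a sketch only; where_expression is the parameter mentioned above, and the column name and lookback window are placeholders I haven't verified):

models:
  - name: very_large_table
    tests:
      - elementary.volume_anomalies:
          timestamp_column: partition_ts
          days_back: 14
          # manually push the date filter down to the partition column
          where_expression: "partition_ts >= date_sub(current_date(), 15)"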

Additional context
Somewhat related to but distinct from issue #1329

Would you be willing to contribute this feature?
May be willing to contribute at some point in Oct 2023 or later.

@haritamar
Collaborator

Hi @garfieldthesam!
Thanks for opening this issue, and sorry for the delay.
Similar to my response to the other issue, I believe this filtering is now done as you'd expect, so I tend to think this is no longer an issue.
Closing, but feel free to re-open if you feel otherwise.
