If a date filter is implied by the `days_back` parameter in a data volume alert, filter by the `timestamp_column` in the `monitored_table` CTE #1158
**Is your feature request related to a problem? Please describe.**
Unless I'm misunderstanding how the package works, adding a `days_back` config to a data volume anomaly detection test means no more than `days_back` days' worth of data will be used in evaluating the test.

I created a data volume test on a very large table that's well-optimized when queried with a filter on the table's partition timestamp column. I was surprised that my tests against this table with `days_back` parameters configured were timing out. I was then surprised to see that, in the code executed on the Databricks cluster, the first CTE selected from the monitored table with no filter on the timestamp column at all. If I hadn't set an appropriate timeout on the cluster where these tests ran, it could have run up very large infra costs!
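For illustration, a minimal sketch of the shape of unfiltered first CTE being described; the table name `analytics.orders` is assumed, and the actual SQL compiled by elementary will differ:

```sql
-- Hypothetical sketch: the compiled test's first CTE selects the whole table,
-- with no predicate on the partition timestamp column, forcing a full scan.
with monitored_table as (
  select * from analytics.orders
)
select count(*) as row_count
from monitored_table
```

On a partitioned table, a scan like this touches every partition regardless of how small `days_back` is.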
It may be the case that other anomaly tests follow a similar pattern, but I have not tested them.
This makes adoption of elementary more painful than it needs to be, because the `days_back` parameter doesn't filter the table in question using the `timestamp_column`, as one might expect.

**Describe the solution you'd like**
Given that the test already knows the `timestamp_column` and the number of `days_back`, I would expect the compiled code to filter the table down to just the date range needed to evaluate the test, e.g. with a `where` predicate on the `timestamp_column` (in Spark SQL syntax).

**Describe alternatives you've considered**

Documenting that the `days_back` parameter does not filter the table in question, and that those filters need to be applied using the `where_expression` parameter instead.

**Additional context**
Somewhat related to but distinct from issue #1329
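To make the requested behavior concrete, here is a minimal Spark SQL sketch of a filtered `monitored_table` CTE, assuming a hypothetical table `analytics.orders`, a `timestamp_column` of `updated_at`, and `days_back: 7` (none of these names come from the original report):

```sql
-- Hypothetical sketch: restrict the monitored table to the evaluation window
-- implied by days_back. A predicate on the partition timestamp column lets
-- Spark prune partitions, so only days_back days of data are scanned.
with monitored_table as (
  select *
  from analytics.orders
  where updated_at >= date_sub(current_date(), 7)  -- days_back = 7
)
select count(*) as row_count
from monitored_table
```

Because the predicate is on the partition column, this query reads only the relevant partitions rather than the full table.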
**Would you be willing to contribute this feature?**

May be willing to contribute at some point in Oct 2023 or later.