
[ELE-1286] Allow a test to specify a training data filter to prevent over fitting thresholds to previous incidents #980

Closed
mossyyy opened this issue Jul 6, 2023 · 3 comments


mossyyy commented Jul 6, 2023

Currently, there's a limitation when using anomaly tests: if an incident is present in your training data, it will be reflected in how your thresholds are calculated. This can lead to a situation where an incident occurs and then recurs within x days without being flagged, because it "looks" similar to the previous incident, which is now influencing the thresholds. See below for an example of this in an hour_of_day seasonal freshness graph.

On Jun 30th you can see an incident occur: we get a linearly increasing delta between the current time and the last load time, as no new data comes in. The same issue occurs on Jul 5th; however, it is now considered within the threshold.

Note: the periodic daily bumps are expected (hence why hour_of_day is used, to take them into account).
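To make the failure mode concrete, here is a small sketch (with invented toy numbers, not taken from the graph) of how a single incident in the training window can inflate a simple mean-plus-three-sigma threshold enough that a repeat incident slips under it:

```python
# Toy illustration: one incident in the training data inflates a
# mean + 3*sigma threshold, so a similar repeat incident is missed.
# All values are invented for illustration.
import statistics

def threshold(training, sigmas=3):
    """Upper anomaly threshold: mean + sigmas * population stdev."""
    return statistics.mean(training) + sigmas * statistics.pstdev(training)

# Normal freshness deltas (hours since last load), then the same
# history polluted by a single 24-hour incident.
normal = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.1]
with_incident = normal + [24.0]

clean_threshold = threshold(normal)          # roughly 1.4 hours
polluted_threshold = threshold(with_incident)  # roughly 27 hours

# A repeat 20-hour incident is caught by the clean threshold
# but sails under the polluted one.
repeat_incident = 20.0
print(repeat_incident > clean_threshold)     # -> True (flagged)
print(repeat_incident > polluted_threshold)  # -> False (missed)
```

This is of course a simplification of whatever model Elementary actually fits, but it captures why excluding known incidents from training matters.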

Describe the solution you'd like
I think there are two possible approaches.

  1. Have some way to retrospectively tag something as an incident. In theory you could do this with an incident seed file containing the start and finish time of each incident, but this is somewhat clunky, as it requires a PR to be merged into dbt every time you want to remove an incident from the training data. Maybe this is still a good approach?
  2. Apply prior knowledge at test-config time by adding out-of-bounds thresholds for the training data, e.g. a min and max tolerance. Anything outside that tolerance would be ignored when calculating thresholds, but included when looking at the detection periods. I would envisage this working with the following yaml added to the anomaly test config:
training_thresholds:
    min_value: <when training any value < this will be ignored>
    max_value: <when training any value > this will be ignored>

One might ask: if you can apply that knowledge of the thresholds, should you be doing an anomaly test in the first place, rather than a simple static threshold test? I think an anomaly test is still worthwhile, as it allows tighter tolerances during normally consistent periods.
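Embedded in a model's schema file, option 2 might look like the sketch below. Note that training_thresholds, min_value, and max_value are hypothetical keys proposed by this issue, not existing Elementary config, and the model name, timestamp column, and tolerance values are illustrative assumptions:

```yaml
# Hypothetical sketch only: training_thresholds does not exist yet.
models:
  - name: my_model
    tests:
      - elementary.freshness_anomalies:
          timestamp_column: loaded_at
          seasonality: hour_of_day
          training_thresholds:
            min_value: 0   # training values below this would be ignored
            max_value: 6   # training values above this would be ignored
```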

Describe alternatives you've considered
See option 1 above.

Would you be willing to contribute this feature?
More than happy to work on this, but keen for feedback on the options and a discussion first, as I imagine this is an area that elementary may have already discussed internally.

From SyncLinear.com | ELE-1286


Maayan-s commented Jul 9, 2023

Hi @mossyyy,
Definitely something we have discussed; @oravi can elaborate on it.
Until he does: for this specific use case I would recommend adding a where_expression that filters out these few anomalous hours from your training set.
When you add a where_expression to a test, Elementary "resets" the metrics (as this is a change to the underlying dataset) and recalculates them without the excluded rows.
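The suggested workaround might look something like the sketch below, where where_expression is the documented parameter, while the test name, model name, column, and incident window are illustrative assumptions:

```yaml
# Sketch of the where_expression workaround; dates are illustrative.
models:
  - name: my_model
    tests:
      - elementary.freshness_anomalies:
          timestamp_column: loaded_at
          seasonality: hour_of_day
          where_expression: >
            loaded_at not between '2023-06-30 08:00' and '2023-06-30 18:00'
```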

@Maayan-s Maayan-s added the linear label Jul 9, 2023
@Maayan-s Maayan-s changed the title Allow a test to specify a training data filter to prevent over fitting thresholds to previous incidents [ELE-1286] Allow a test to specify a training data filter to prevent over fitting thresholds to previous incidents Jul 9, 2023

Maayan-s commented Jul 9, 2023

[screenshot attachment]

@haritamar

Hi @mossyyy!
This is now available with the anomaly_exclude_metrics parameter:
https://docs.elementary-data.com/data-tests/anomaly-detection-configuration/anomaly-exclude-metrics
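For reference, a minimal sketch of using that parameter; per the linked docs the expression can reference fields such as metric_date, while the test name, model name, column, and the excluded date here are illustrative assumptions:

```yaml
# Sketch using the documented anomaly_exclude_metrics parameter.
models:
  - name: my_model
    tests:
      - elementary.freshness_anomalies:
          timestamp_column: loaded_at
          anomaly_exclude_metrics: "metric_date = '2023-06-30'"
```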

Cheers,
Itamar
