
[ELE-1286] Allow a test to specify a training data filter to prevent over fitting thresholds to previous incidents #980

Closed
mossyyy opened this issue Jul 6, 2023 · 3 comments


mossyyy commented Jul 6, 2023

Currently, there's a limitation when using anomaly tests: if an incident is present in your training data, it will be reflected in how your thresholds are calculated. This can lead to a situation where an incident occurs and then recurs within x days without being flagged, because it "looks" similar to the previous incident, which is now influencing the thresholds. See below for an example of this in an hour_of_day seasonal freshness graph.

On Jun 30th you can see an incident occur: we get a linearly increasing delta between the current time and the last load time, as no new data comes in. The same issue occurs on Jul 5th; however, it is now considered within the threshold.

Note: the periodic daily bumps are expected (hence why hour_of_day is used, to take them into account).
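To make the failure mode concrete, here is a small sketch (with invented toy numbers, not taken from the graph) of how a single incident in the training window can inflate a simple mean-plus-three-sigma threshold enough that a repeat incident slips under it:

```python
# Toy illustration: one incident in the training data inflates a
# mean + 3*sigma threshold, so a similar repeat incident is missed.
# All values are invented for illustration.
import statistics

def threshold(training, sigmas=3):
    """Upper anomaly threshold: mean + sigmas * population stdev."""
    return statistics.mean(training) + sigmas * statistics.pstdev(training)

# Normal freshness deltas (hours since last load), then the same
# history polluted by a single 24-hour incident.
normal = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.1]
with_incident = normal + [24.0]

clean_threshold = threshold(normal)          # roughly 1.4 hours
polluted_threshold = threshold(with_incident)  # roughly 27 hours

# A repeat 20-hour incident is caught by the clean threshold
# but sails under the polluted one.
repeat_incident = 20.0
print(repeat_incident > clean_threshold)     # -> True (flagged)
print(repeat_incident > polluted_threshold)  # -> False (missed)
```

This is of course a simplification of whatever model Elementary actually fits, but it captures why excluding known incidents from training matters.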

Describe the solution you'd like
I think there are two possible approaches.

  1. Have some way to retrospectively tag something as an incident. In theory you could do this with an incident seed file containing the start and finish time of each incident, but this is somewhat clunky, as it requires a PR to be merged into dbt every time you want to remove an incident from the training data. Maybe this is still a good approach?
  2. Apply prior knowledge at test-config time by adding out-of-bounds thresholds for the training data, e.g. a min and max tolerance. Anything outside that tolerance would be ignored when calculating thresholds, but included when looking at the detection periods. I would envisage this working with the following yaml added to the anomaly test config:
training_thresholds:
    min_value: <when training any value < this will be ignored>
    max_value: <when training any value > this will be ignored>

One might ask: if you can apply that knowledge of the thresholds, should you be doing an anomaly test in the first place, rather than a simple static threshold test? I think an anomaly test is still worthwhile, as it allows tighter tolerances during normally consistent periods.
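Embedded in a model's schema file, option 2 might look like the sketch below. Note that training_thresholds, min_value, and max_value are hypothetical keys proposed by this issue, not existing Elementary config, and the model name, timestamp column, and tolerance values are illustrative assumptions:

```yaml
# Hypothetical sketch only: training_thresholds does not exist yet.
models:
  - name: my_model
    tests:
      - elementary.freshness_anomalies:
          timestamp_column: loaded_at
          seasonality: hour_of_day
          training_thresholds:
            min_value: 0   # training values below this would be ignored
            max_value: 6   # training values above this would be ignored
```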

Describe alternatives you've considered
See option 1 above.

Would you be willing to contribute this feature?
More than happy to work on this, but keen for feedback on the options and a discussion first, as I imagine this is an area that elementary may have already discussed internally.

From SyncLinear.com | ELE-1286


Maayan-s commented Jul 9, 2023

Hi @mossyyy,
Definitely something we have discussed; @oravi can elaborate on it.
Until he does: for this specific use case I would recommend adding a where_expression that filters out these few anomalous hours from your training set.
When you add a where_expression to a test, Elementary "resets" the metrics (as this is a change to the underlying dataset) and recalculates them without the excluded rows.
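The suggested workaround might look something like the sketch below, where where_expression is the documented parameter, while the test name, model name, column, and incident window are illustrative assumptions:

```yaml
# Sketch of the where_expression workaround; dates are illustrative.
models:
  - name: my_model
    tests:
      - elementary.freshness_anomalies:
          timestamp_column: loaded_at
          seasonality: hour_of_day
          where_expression: >
            loaded_at not between '2023-06-30 08:00' and '2023-06-30 18:00'
```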

@Maayan-s Maayan-s added the linear label Jul 9, 2023
@Maayan-s Maayan-s changed the title Allow a test to specify a training data filter to prevent over fitting thresholds to previous incidents [ELE-1286] Allow a test to specify a training data filter to prevent over fitting thresholds to previous incidents Jul 9, 2023

Maayan-s commented Jul 9, 2023

[screenshot attachment]

@haritamar

Hi @mossyyy!
This is now available with the anomaly_exclude_metrics parameter:
https://docs.elementary-data.com/data-tests/anomaly-detection-configuration/anomaly-exclude-metrics
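For reference, a minimal sketch of using that parameter; per the linked docs the expression can reference fields such as metric_date, while the test name, model name, column, and the excluded date here are illustrative assumptions:

```yaml
# Sketch using the documented anomaly_exclude_metrics parameter.
models:
  - name: my_model
    tests:
      - elementary.freshness_anomalies:
          timestamp_column: loaded_at
          anomaly_exclude_metrics: "metric_date = '2023-06-30'"
```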

Cheers,
Itamar
