[ML] Improvements to the outlier influence calculation #2256
Conversation
double logCdfComplement{common::CTools::fastLog(
    std::max(cdfComplement, std::numeric_limits<double>::min()))};
return P_OUTLIER.value(logCdfComplement);
};
Note that this represents a small change to how we compute outlier scores for extreme outliers. Previously the score would saturate, which meant we couldn't reliably compute the derivative of the overall score w.r.t. changes in feature values. This extends the range down to std::numeric_limits<double>::min(). It also avoids making the function as flat around P = 0.5, for the same reason. The resulting scores are still broadly similar.
LGTM
Currently, we compute influence of features on the outlier score by searching for the minimum value of the outlier score allowing one coordinate to vary at a time. This works well if the data aren't projected, but when the data are high dimensional we randomly project them. We then have to work out how changes in projected coordinates should be attributed to the original features. This is only possible approximately and can lead to confusing results.
The normal definition of influence is the (Gateaux) derivative of the outlier score with respect to the feature values. This has the advantage that we can simply compute the change in the projection space for small changes in a single feature value. This change switches to using this definition. For the unprojected case the results are broadly similar: the unit tests pass with minor tweaks. However, this fixes the confusing results in the projected case.