@tveasey tveasey commented Apr 25, 2022

Currently, we compute the influence of each feature on the outlier score by searching for the minimum value of the outlier score while allowing one coordinate to vary at a time. This works well if the data aren't projected, but when the data are high dimensional we randomly project them. We then have to work out how changes in the projected coordinates should be attributed to the original features. This is only possible approximately and can lead to confusing results.

The normal definition of influence is the (Gateaux) derivative of the outlier score with respect to the feature values. This has the advantage that we can simply compute the change in the projected space for a small change in a single feature value. This change switches to using this definition. For the unprojected case the results are broadly similar: the unit tests pass with minor tweaks. However, it fixes the confusing results in the projected case.
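As a rough illustration of the derivative-based definition, here is a minimal sketch that perturbs one original feature at a time, pushes the perturbed point through the projection, and takes a central difference of the score. It is not the code in this PR: `TVec`, `TMat`, `project`, `featureInfluence`, the step size `eps` and the normalisation to unit sum are all assumptions made for the example, and `score` stands in for whatever outlier score is evaluated in the projected space.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using TVec = std::vector<double>;
// Row-major matrix: projection[i] holds the weights of the i'th projected coordinate.
using TMat = std::vector<TVec>;

// Map a point in the original feature space into the projected space: y = P x.
TVec project(const TMat& projection, const TVec& x) {
    TVec y(projection.size(), 0.0);
    for (std::size_t i = 0; i < projection.size(); ++i) {
        assert(projection[i].size() == x.size());
        for (std::size_t j = 0; j < x.size(); ++j) {
            y[i] += projection[i][j] * x[j];
        }
    }
    return y;
}

// Estimate |d score(P x) / d x_j| for each original feature j by central
// differences, then normalise so the influences sum to one (hypothetical choice).
TVec featureInfluence(const TMat& projection,
                      const TVec& x,
                      const std::function<double(const TVec&)>& score,
                      double eps = 1e-4) {
    TVec influence(x.size(), 0.0);
    double norm{0.0};
    for (std::size_t j = 0; j < x.size(); ++j) {
        // Scale the step with the feature magnitude to limit rounding error.
        double h{eps * std::max(1.0, std::fabs(x[j]))};
        TVec xPlus{x};
        TVec xMinus{x};
        xPlus[j] += h;
        xMinus[j] -= h;
        double derivative{(score(project(projection, xPlus)) -
                           score(project(projection, xMinus))) / (2.0 * h)};
        influence[j] = std::fabs(derivative);
        norm += influence[j];
    }
    if (norm > 0.0) {
        for (auto& value : influence) {
            value /= norm;
        }
    }
    return influence;
}
```

Because only a single feature is perturbed before projecting, the attribution back to the original features falls out directly, with no need to invert or approximate the projection.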

@tveasey tveasey added the WIP label Apr 25, 2022
double logCdfComplement{common::CTools::fastLog(
std::max(cdfComplement, std::numeric_limits<double>::min()))};
return P_OUTLIER.value(logCdfComplement);
};
Contributor Author

Note that this represents a small change to how we compute outlier scores for extreme outliers. Previously the score would saturate, which meant we couldn't reliably compute the derivative of the overall score w.r.t. changes in feature values. This extends the range down to std::numeric_limits<double>::min(). For the same reason, it also avoids making the function as flat around P = 0.5. The resulting scores are still broadly similar.
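To show why saturation breaks the derivative, the sketch below compares the sensitivity of a clamped log cdf complement under two floors. It is only an illustration: `std::log` stands in for `common::CTools::fastLog`, the mapping through `P_OUTLIER` is omitted, and the floor of `1e-100` is a made-up value chosen to show the effect, not the limit the code used before this change.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Log of the tail probability, clamped from below at `floor` (must be positive).
double clampedLogCdfComplement(double cdfComplement, double floor) {
    assert(floor > 0.0);
    return std::log(std::max(cdfComplement, floor));
}

// Central difference of the clamped log with respect to log(cdfComplement).
// A value near 1 means the function still responds to changes in the tail
// probability; a value of 0 means it has saturated at the floor.
double sensitivity(double cdfComplement, double floor) {
    const double factor{1.01};
    double up{clampedLogCdfComplement(cdfComplement * factor, floor)};
    double down{clampedLogCdfComplement(cdfComplement / factor, floor)};
    return (up - down) / (2.0 * std::log(factor));
}
```

For an extreme tail probability such as `1e-200`, `sensitivity(1e-200, 1e-100)` is exactly zero because both perturbed values are clamped to the same floor, whereas with `std::numeric_limits<double>::min()` as the floor the sensitivity stays close to one, so a derivative with respect to the feature values remains usable.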

@valeriy42 valeriy42 left a comment

LGTM

@tveasey tveasey merged commit 2cdec4f into elastic:main Sep 5, 2022
@tveasey tveasey deleted the outlier-influence branch September 5, 2022 16:11