[ML] Improvements to the outlier influence calculation #2256
Conversation
double logCdfComplement{common::CTools::fastLog(
    std::max(cdfComplement, std::numeric_limits<double>::min()))};
return P_OUTLIER.value(logCdfComplement);
};
Note that this represents a small change to how we compute outlier scores for extreme outliers. Previously the score would saturate, which meant we couldn't reliably compute the derivative of the overall score w.r.t. changes in feature values. This extends the range down to std::numeric_limits<double>::min(). It also avoids making the function as flat around P = 0.5, for the same reason. The resulting scores are still broadly similar.
LGTM
Currently, we compute influence of features on the outlier score by searching for the minimum value of the outlier score allowing one coordinate to vary at a time. This works well if the data aren't projected, but when the data are high dimensional we randomly project them. We then have to work out how changes in projected coordinates should be attributed to the original features. This is only possible approximately and can lead to confusing results.
The normal definition of influence is the (Gateaux) derivative of the outlier score with respect to the feature values. This has the advantage that we can simply compute the change in the projection space for small changes in a single feature value. This change switches to using this definition. For the unprojected case the results are broadly similar: the unit tests pass with minor tweaks. However, this fixes the confusing results in the projected case.