[ML] Compute SHAP values for supervised learning #857
Conversation
This is looking like a great first step. I haven't yet been through the implementation in detail, but thought I'd submit my comments to date. Aside from some minor things, my biggest concern is that we effectively write Shapley values for every feature and row into a vector of vectors, then write a subset of these values back to the data frame. This slightly subverts the intention of that class, which is that it can move off heap if needed. There seems to be no obstacle (we only need random access to Shapley values we've already written) to using a strategy which writes all the values directly into the data frame and then writes out a subset to the results objects at the end (this also reduces peak memory usage when copying across to the resized frame). Importantly, this allows us to move all that state off heap if we need to.
Thanks for working through the comments @valeriy42! This is basically LGTM.
As discussed offline, it would be nice to avoid copying SPath in the recursion, but we can make this change in a follow-up PR. The one thing I'd really like to see is a unit test that checks agreement with the brute-force calculation, iterating over the power set of feature values, for small random trees. I'm also happy to postpone this test to a later PR, since you already have a fair bit of testing in place. I'm going to go ahead and approve on this basis.
# Conflicts:
#   include/maths/CBoostedTreeImpl.h
#   lib/api/CDataFrameTrainBoostedTreeRegressionRunner.cc
This PR introduces the computation of SHAP (SHapley Additive exPlanation) values for feature importance. Refer to "Consistent Individualized Feature Attribution for Tree Ensembles" by Lundberg et al. for details on the original algorithm.
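For reference, the quantity being computed is the standard Shapley value from cooperative game theory, applied to the model's conditional expectation. For a feature set $F$ and a point $x$:

```latex
\phi_i \;=\; \sum_{S \subseteq F \setminus \{i\}}
    \frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}
    \Bigl[\, f_{S \cup \{i\}}\bigl(x_{S \cup \{i\}}\bigr) - f_{S}\bigl(x_{S}\bigr) \Bigr]
```

where $f_S(x_S)$ is the model's expected output given only the features in $S$. The tree-ensemble algorithm in Lundberg et al. computes these values efficiently by tracking feature paths through each tree instead of enumerating all $2^{|F|}$ subsets.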
This PR adds a number of minor adjustments to the original change #857 for SHAP estimation:
- SHAP columns are returned as an integer range instead of a vector, to avoid instantiating the vector
- splitPath is passed by reference in the recursive call to avoid creating copies
- "NoNormalization" is removed from the test names, since we don't do any normalization (this may become relevant in the future, once we have multiclass classification)