
[ML] Compute SHAP values for supervised learning #857

Merged · 52 commits into elastic:master · Dec 11, 2019

Conversation

valeriy42 (Contributor) commented Nov 28, 2019

This PR introduces the computation of SHAP (SHapley Additive exPlanation) values for feature importance. Refer to "Consistent Individualized Feature Attribution for Tree Ensembles" by Lundberg et al. for details of the original algorithm.
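For reference, the quantity computed per feature and row is the classical Shapley value. With N the set of features and f_S(x) the model's expected prediction when only the features in S are fixed to the row's values (notation as in the Lundberg et al. paper), it is the weighted average of the feature's marginal contribution over all subsets of the remaining features:

```latex
\phi_i(f, x) \;=\; \sum_{S \subseteq N \setminus \{i\}}
    \frac{|S|!\,\bigl(|N| - |S| - 1\bigr)!}{|N|!}
    \left[ f_{S \cup \{i\}}(x) - f_S(x) \right]
```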

valeriy42 removed the WIP label on Nov 29, 2019
tveasey (Contributor) left a comment

This is looking like a great first step. I haven't yet been through the implementation in detail, but thought I'd submit my comments to date. Aside from some minor things, my biggest concern is that we effectively write the Shapley values for every feature and row into a vector of vectors, then write a subset of these values back to the data frame. This slightly subverts the intention of that class, which is that it can move off heap if needed. There seems to be no obstacle (we only need random access to Shapley values we've already written) to a strategy which writes all the values directly into the data frame and then writes out a subset to the results objects at the end (this also reduces peak memory usage when we copy across to the resized frame). Importantly, it allows us to move all that state off heap if we need to.
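A rough sketch of the strategy being suggested (purely illustrative; the struct and function names below are hypothetical stand-ins, not the real CDataFrame interface): reserve the SHAP columns in the frame up front and write each row's values straight into them, so the state lives wherever the frame's storage lives, including off heap.

```cpp
// Hypothetical illustration only; names do not correspond to the ml-cpp API.
#include <cstddef>
#include <vector>

// Stand-in for a data frame with extra columns reserved for SHAP values.
// The point is that its storage could equally well be memory mapped / off heap.
struct Frame {
    std::size_t numberColumns;
    std::vector<double> storage; // row-major
    double& cell(std::size_t row, std::size_t column) {
        return storage[row * numberColumns + column];
    }
};

// Instead of buffering everything in a std::vector<std::vector<double>> and
// copying a subset back later, write each row's SHAP values in place.
void writeShapForRow(Frame& frame,
                     std::size_t row,
                     std::size_t firstShapColumn,
                     const std::vector<double>& shapForRow) {
    for (std::size_t i = 0; i < shapForRow.size(); ++i) {
        frame.cell(row, firstShapColumn + i) = shapForRow[i];
    }
}
```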

Files with review comments:
lib/api/CDataFrameTrainBoostedTreeRegressionRunner.cc
lib/api/CDataFrameTrainBoostedTreeRunner.cc
lib/api/unittest/CDataFrameAnalyzerTrainingTest.cc
lib/maths/CBoostedTreeFactory.cc
lib/maths/CTreeShapFeatureImportance.cc (several comments)
tveasey (Contributor) left a comment

Thanks for working through the comments @valeriy42! This is basically LGTM.

As discussed offline, it would be nice to avoid copying SPath in the recursion, but we can make this change in a following PR. The one thing I'd really like to see would be a unit test that checks agreement against the brute-force calculation, iterating over the power set of feature values, for small random trees. I'm also happy to postpone this test to a later PR, since you already have a fair bit of testing in place. I'm going to go ahead and approve on this basis.
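A sketch of the kind of brute-force cross-check being suggested (hypothetical helper names, not the actual unit test): compute Shapley values directly from their definition and compare them with what CTreeShapFeatureImportance writes for small random trees. Here predictConditional(mask) is an assumed callback returning the tree's expected prediction when exactly the features in the bit mask are fixed to the row's values; enumerating all 2^n subsets is precisely what the production algorithm avoids.

```cpp
// Hypothetical brute-force Shapley reference for a unit test (illustrative).
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

namespace {

double factorial(std::size_t n) {
    double result = 1.0;
    for (std::size_t i = 2; i <= n; ++i) {
        result *= static_cast<double>(i);
    }
    return result;
}

std::vector<double>
bruteForceShap(std::size_t numberFeatures,
               const std::function<double(std::uint64_t)>& predictConditional) {
    std::vector<double> phi(numberFeatures, 0.0);
    std::uint64_t numberSubsets = std::uint64_t{1} << numberFeatures;
    for (std::size_t i = 0; i < numberFeatures; ++i) {
        std::uint64_t feature = std::uint64_t{1} << i;
        for (std::uint64_t subset = 0; subset < numberSubsets; ++subset) {
            if (subset & feature) {
                continue; // S ranges over subsets which exclude feature i.
            }
            std::size_t sizeS = std::bitset<64>(subset).count();
            double weight = factorial(sizeS) *
                            factorial(numberFeatures - sizeS - 1) /
                            factorial(numberFeatures);
            phi[i] += weight * (predictConditional(subset | feature) -
                                predictConditional(subset));
        }
    }
    return phi;
}
}
```

The test would then assert, for a handful of small random trees and rows, that these values agree with the ones produced by the path-dependent algorithm up to a numerical tolerance.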

Files with review comments:
lib/api/CDataFrameTrainBoostedTreeClassifierRunner.cc
lib/maths/CBoostedTreeFactory.cc
lib/maths/CDataFrameCategoryEncoder.cc
valeriy42 (Contributor, Author) commented:

This PR introduces the computation of SHAP (SHapley Additive exPlanation) values for feature importance. Refer to "Consistent Individualized Feature Attribution for Tree Ensembles" by Lundberg et al. for details of the original algorithm.

valeriy42 merged commit 781ee92 into elastic:master on Dec 11, 2019
valeriy42 added a commit to valeriy42/ml-cpp that referenced this pull request Dec 11, 2019
This PR introduces the computation of SHAP (SHapley Additive exPlanation) values for feature importance. Refer to "Consistent Individualized Feature Attribution for Tree Ensembles" by Lundberg et al. for details of the original algorithm.
valeriy42 added a commit that referenced this pull request Dec 11, 2019
This PR introduces the computation of SHAP (SHapley Additive exPlanation) values for feature importance. Refer to "Consistent Individualized Feature Attribution for Tree Ensembles" by Lundberg et al. for details of the original algorithm.
valeriy42 deleted the feature/feature-importance branch on December 13, 2019 at 08:10
valeriy42 added a commit that referenced this pull request Dec 16, 2019
This PR adds a number of minor adjustments to the original change #857 for SHAP estimation:
- SHAP columns are returned as an integer range instead of a vector, so the vector never needs to be instantiated
- splitPath is passed by reference in the recursive call to avoid creating copies
- "NoNormalization" is removed from the test names, since we don't do any normalization (this may become relevant in the future, once we have multiclass classification)
valeriy42 added a commit to valeriy42/ml-cpp that referenced this pull request Dec 16, 2019
This PR adds a number of minor adjustments to the original change elastic#857 for SHAP estimation:
- SHAP columns are returned as an integer range instead of a vector, so the vector never needs to be instantiated
- splitPath is passed by reference in the recursive call to avoid creating copies
- "NoNormalization" is removed from the test names, since we don't do any normalization (this may become relevant in the future, once we have multiclass classification)
valeriy42 added a commit that referenced this pull request Dec 16, 2019
This PR adds a number of minor adjustments to the original change #857 for SHAP estimation:
- SHAP columns are returned as an integer range instead of a vector, so the vector never needs to be instantiated
- splitPath is passed by reference in the recursive call to avoid creating copies
- "NoNormalization" is removed from the test names, since we don't do any normalization (this may become relevant in the future, once we have multiclass classification)