
[ML] Compute SHAP values for supervised learning #857

Merged · 52 commits into elastic:master · Dec 11, 2019

Conversation

valeriy42 (Contributor) commented Nov 28, 2019

This PR introduces the computation of SHAP (SHapley Additive exPlanation) values for feature importance. Refer to "Consistent Individualized Feature Attribution for Tree Ensembles" by Lundberg et al. for details of the original algorithm.
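For reference, the quantity computed per feature and row is the classical Shapley value. With N the set of features and f_S(x) the model's expected prediction when only the features in S are fixed to the row's values (notation as in the Lundberg et al. paper), it is the weighted average of the feature's marginal contribution over all subsets of the remaining features:

```latex
\phi_i(f, x) \;=\; \sum_{S \subseteq N \setminus \{i\}}
    \frac{|S|!\,\bigl(|N| - |S| - 1\bigr)!}{|N|!}
    \left[ f_{S \cup \{i\}}(x) - f_S(x) \right]
```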

valeriy42 removed the WIP label on Nov 29, 2019
tveasey (Contributor) left a comment

This is looking like a great first step. I haven't yet been through the implementation in detail, but thought I'd submit my comments to date. Aside from some minor things, my biggest concern is that we effectively write the Shapley values for every feature and row into a vector of vectors, then write a subset of these values back to the data frame. This slightly subverts the intention of that class, which is that it can move off heap if needed. There seems to be no obstacle (we only need random access to Shapley values we've already written) to a strategy which writes all the values directly into the data frame and then writes out a subset to the results objects at the end (this also reduces peak memory usage when we copy across to the resized frame). Importantly, it allows us to move all that state off heap if we need to.
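A rough sketch of the strategy being suggested (purely illustrative; the struct and function names below are hypothetical stand-ins, not the real CDataFrame interface): reserve the SHAP columns in the frame up front and write each row's values straight into them, so the state lives wherever the frame's storage lives, including off heap.

```cpp
// Hypothetical illustration only; names do not correspond to the ml-cpp API.
#include <cstddef>
#include <vector>

// Stand-in for a data frame with extra columns reserved for SHAP values.
// The point is that its storage could equally well be memory mapped / off heap.
struct Frame {
    std::size_t numberColumns;
    std::vector<double> storage; // row-major
    double& cell(std::size_t row, std::size_t column) {
        return storage[row * numberColumns + column];
    }
};

// Instead of buffering everything in a std::vector<std::vector<double>> and
// copying a subset back later, write each row's SHAP values in place.
void writeShapForRow(Frame& frame,
                     std::size_t row,
                     std::size_t firstShapColumn,
                     const std::vector<double>& shapForRow) {
    for (std::size_t i = 0; i < shapForRow.size(); ++i) {
        frame.cell(row, firstShapColumn + i) = shapForRow[i];
    }
}
```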

Files with review comments:
lib/api/CDataFrameTrainBoostedTreeRegressionRunner.cc
lib/api/CDataFrameTrainBoostedTreeRunner.cc
lib/api/unittest/CDataFrameAnalyzerTrainingTest.cc
lib/maths/CBoostedTreeFactory.cc
lib/maths/CTreeShapFeatureImportance.cc (several comments)
tveasey (Contributor) left a comment

Thanks for working through the comments @valeriy42! This is basically LGTM.

As discussed offline, it would be nice to avoid copying SPath in the recursion, but we can make this change in a following PR. The one thing I'd really like to see would be a unit test that checks agreement against the brute-force calculation, iterating over the power set of feature values, for small random trees. I'm also happy to postpone this test to a later PR, since you already have a fair bit of testing in place. I'm going to go ahead and approve on this basis.
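A sketch of the kind of brute-force cross-check being suggested (hypothetical helper names, not the actual unit test): compute Shapley values directly from their definition and compare them with what CTreeShapFeatureImportance writes for small random trees. Here predictConditional(mask) is an assumed callback returning the tree's expected prediction when exactly the features in the bit mask are fixed to the row's values; enumerating all 2^n subsets is precisely what the production algorithm avoids.

```cpp
// Hypothetical brute-force Shapley reference for a unit test (illustrative).
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

namespace {

double factorial(std::size_t n) {
    double result = 1.0;
    for (std::size_t i = 2; i <= n; ++i) {
        result *= static_cast<double>(i);
    }
    return result;
}

std::vector<double>
bruteForceShap(std::size_t numberFeatures,
               const std::function<double(std::uint64_t)>& predictConditional) {
    std::vector<double> phi(numberFeatures, 0.0);
    std::uint64_t numberSubsets = std::uint64_t{1} << numberFeatures;
    for (std::size_t i = 0; i < numberFeatures; ++i) {
        std::uint64_t feature = std::uint64_t{1} << i;
        for (std::uint64_t subset = 0; subset < numberSubsets; ++subset) {
            if (subset & feature) {
                continue; // S ranges over subsets which exclude feature i.
            }
            std::size_t sizeS = std::bitset<64>(subset).count();
            double weight = factorial(sizeS) *
                            factorial(numberFeatures - sizeS - 1) /
                            factorial(numberFeatures);
            phi[i] += weight * (predictConditional(subset | feature) -
                                predictConditional(subset));
        }
    }
    return phi;
}
}
```

The test would then assert, for a handful of small random trees and rows, that these values agree with the ones produced by the path-dependent algorithm up to a numerical tolerance.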

Files with review comments:
lib/api/CDataFrameTrainBoostedTreeClassifierRunner.cc
lib/maths/CBoostedTreeFactory.cc
lib/maths/CDataFrameCategoryEncoder.cc
valeriy42 (Contributor, Author) commented:

This PR introduces the computation of SHAP (SHapley Additive exPlanation) values for feature importance. Refer to "Consistent Individualized Feature Attribution for Tree Ensembles" by Lundberg et al. for details of the original algorithm.

valeriy42 merged commit 781ee92 into elastic:master on Dec 11, 2019
valeriy42 added a commit to valeriy42/ml-cpp that referenced this pull request Dec 11, 2019
This PR introduces the computation of SHAP (SHapley Additive exPlanation) values for feature importance. Refer to "Consistent Individualized Feature Attribution for Tree Ensembles" by Lundberg et al. for details of the original algorithm.
valeriy42 added a commit that referenced this pull request Dec 11, 2019
This PR introduces the computation of SHAP (SHapley Additive exPlanation) values for feature importance. Refer to "Consistent Individualized Feature Attribution for Tree Ensembles" by Lundberg et al. for details of the original algorithm.
valeriy42 deleted the feature/feature-importance branch on December 13, 2019 at 08:10
valeriy42 added a commit that referenced this pull request Dec 16, 2019
This PR adds a number of minor adjustments to the original change #857 for SHAP estimation:
- SHAP columns are returned as an integer range instead of a vector, so the vector never needs to be instantiated
- splitPath is passed by reference in the recursive call to avoid creating copies
- "NoNormalization" is removed from the test names, since we don't do any normalization (this may become relevant in the future, once we have multiclass classification)
valeriy42 added a commit to valeriy42/ml-cpp that referenced this pull request Dec 16, 2019
This PR adds a number of minor adjustments to the original change elastic#857 for SHAP estimation:
- SHAP columns are returned as an integer range instead of a vector, so the vector never needs to be instantiated
- splitPath is passed by reference in the recursive call to avoid creating copies
- "NoNormalization" is removed from the test names, since we don't do any normalization (this may become relevant in the future, once we have multiclass classification)
valeriy42 added a commit that referenced this pull request Dec 16, 2019
This PR adds a number of minor adjustments to the original change #857 for SHAP estimation:
- SHAP columns are returned as an integer range instead of a vector, so the vector never needs to be instantiated
- splitPath is passed by reference in the recursive call to avoid creating copies
- "NoNormalization" is removed from the test names, since we don't do any normalization (this may become relevant in the future, once we have multiclass classification)