[ML] Add information about samples per node to the tree #991

Merged: 25 commits from valeriy42:ml-cpp-850 into elastic:master, Feb 17, 2020

Conversation

@valeriy42 (Contributor) commented Feb 6, 2020

This PR extends the definition of the tree node by adding information about the number of training samples that passed through the node (numberSamples or number_samples). The JSON schema for the inference model is adjusted accordingly.

Since this changes the schema for persist/restore of the tree implementation, I bumped the version and removed 7.5 and 7.6 from the list of supported versions. My reasoning: restoring from the old schema and setting the number of samples to 0 would break feature importance at inference time.

I also adjusted the feature importance computation to use the pre-computed number of samples instead of recomputing it on the fly.

Closes #850
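
For concreteness, here is a minimal sketch of the extended node layout, assuming the field names from the description above; this is illustrative C++, not the actual ml-cpp class:

```cpp
// Illustrative only: roughly the shape of the extended tree node, not
// the actual ml-cpp class. The new field is persisted as "number_samples".
#include <cstddef>
#include <cstdint>

struct STreeNodeSketch {
    std::size_t splitFeature = 0;  // index of the feature used for the split
    double splitValue = 0.0;       // assumed convention: go left if value < splitValue
    std::int32_t leftChild = -1;   // -1 marks a leaf
    std::int32_t rightChild = -1;
    double nodeValue = 0.0;        // prediction contribution at a leaf
    std::size_t numberSamples = 0; // NEW: training samples that passed through this node
};
```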

@valeriy42 valeriy42 changed the title [ML] Add number_samples field to the node [ML] Add information about samples per node to the tree Feb 6, 2020
@valeriy42 (Contributor, Author) commented:

retest

@tveasey (Contributor) left a comment:

I've done a pass through and left some minor style comments. However, I have two more significant concerns: 1) I'm worried about the cost of computing the number of rows from the row mask where you currently do this (avoiding that cost was the motivation for introducing leftChildHasFewerRows in the first place), and 2) I wonder if we should use the test rows as well when computing sample counts: this is a change in behaviour, and it seems better to use all available data for these. To me, these point to keeping the step that computes the counts and writes them to the tree separate from the main training loop, since they aren't needed during training. It could happen as a single standalone pass at the end of training.
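
A minimal sketch of the standalone pass suggested here, assuming a flat array-of-nodes tree layout and a "go left if value < split" convention; names and types are illustrative, not the actual ml-cpp API:

```cpp
// Illustrative sketch of the single standalone pass: after training, route
// every available row (train and test alike) down the tree once and count
// arrivals per node. Not the actual ml-cpp API.
#include <cstddef>
#include <cstdint>
#include <vector>

struct SNode {
    std::size_t splitFeature = 0;
    double splitValue = 0.0;
    std::int32_t leftChild = -1; // -1 marks a leaf
    std::int32_t rightChild = -1;
    std::size_t numberSamples = 0;
};

void computeNumberSamples(const std::vector<std::vector<double>>& rows,
                          std::vector<SNode>& tree) {
    for (auto& node : tree) {
        node.numberSamples = 0;
    }
    for (const auto& row : rows) {
        std::int32_t i = 0; // start at the root
        for (;;) {
            ++tree[i].numberSamples;
            if (tree[i].leftChild < 0) {
                break; // reached a leaf
            }
            i = row[tree[i].splitFeature] < tree[i].splitValue
                    ? tree[i].leftChild
                    : tree[i].rightChild;
        }
    }
}
```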

Review threads (all resolved): include/maths/CBoostedTree.h, include/maths/CTreeShapFeatureImportance.h, lib/maths/CBoostedTreeImpl.cc, lib/maths/CBoostedTreeLeafNodeStatistics.cc, lib/maths/CTreeShapFeatureImportance.cc
@tveasey (Contributor) left a comment:

Good stuff @valeriy42! I've done a second pass through. I think there are some hangover TODOs from the refactoring. Also, it seems like we can exploit having the counts on the node to simplify SHAP code slightly.
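
One way the counts can simplify the SHAP code, sketched under the same illustrative node layout as above (not the actual ml-cpp implementation): the expected value of any subtree, which TreeSHAP needs when descending branches whose split feature is not fixed by the point being explained, reduces to a cover-weighted average.

```cpp
// Illustrative sketch: with numberSamples stored on every node, the expected
// value of a subtree is just a sample-count-weighted average of leaf values,
// so no row masks need to be consulted. Not the actual ml-cpp code.
#include <cstdint>
#include <vector>

struct SNode {
    std::int32_t leftChild = -1; // -1 marks a leaf
    std::int32_t rightChild = -1;
    double nodeValue = 0.0;      // prediction at a leaf
    double numberSamples = 0.0;
};

double expectedValue(const std::vector<SNode>& tree, std::int32_t i) {
    const SNode& node = tree[i];
    if (node.leftChild < 0) {
        return node.nodeValue;
    }
    double pLeft = tree[node.leftChild].numberSamples / node.numberSamples;
    return pLeft * expectedValue(tree, node.leftChild) +
           (1.0 - pLeft) * expectedValue(tree, node.rightChild);
}
```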

Review threads (all resolved): include/maths/CTreeShapFeatureImportance.h, lib/maths/CBoostedTreeImpl.cc, lib/maths/CBoostedTreeLeafNodeStatistics.cc, lib/maths/CTreeShapFeatureImportance.cc
@valeriy42 (Contributor, Author) commented:

Thank you, @tveasey, for the review. I addressed your comments and removed the unnecessary code. Let me know if everything is OK now.

@tveasey (Contributor) left a comment:

Thanks for working through the suggestions, looks good!

@valeriy42 (Contributor, Author) commented:

retest

@valeriy42 valeriy42 merged commit 309db87 into elastic:master Feb 17, 2020
@valeriy42 valeriy42 deleted the ml-cpp-850 branch February 17, 2020 15:32
valeriy42 added a commit to valeriy42/ml-cpp that referenced this pull request Feb 17, 2020
@tveasey tveasey mentioned this pull request Feb 17, 2020
benwtrent added a commit to elastic/elasticsearch that referenced this pull request Feb 21, 2020
This adds machine learning model feature importance calculations to the inference processor. 

The new flag in the configuration matches the analytics parameter name: `num_top_feature_importance_values`
Example:
```
"inference": {
   "field_mappings": {},
   "model_id": "my_model",
   "inference_config": {
      "regression": {
         "num_top_feature_importance_values": 3
      }
   }
}
```

This will write to the document as follows:
```
"inference" : {
   "feature_importance" : { 
      "FlightTimeMin" : -76.90955548511226,
      "FlightDelayType" : 114.13514762158526,
      "DistanceMiles" : 13.731580450792187
   },
   "predicted_value" : 108.33165831875137,
   "model_id" : "my_model"
}
```

This is done by calculating the [SHAP values](https://arxiv.org/abs/1802.03888).

It requires that models have populated `number_samples` for each tree node. This is not available to models that were created before 7.7. 
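
A hedged sketch of why populated counts are mandatory for this computation: TreeSHAP weights the branches under a split whose feature is absent from the document being explained by the fraction of training samples that followed each branch, so absent or zero counts (as in pre-7.7 models) leave the weights undefined. The function below is illustrative, not the actual Java or C++ code:

```cpp
// Illustrative sketch (not the actual implementation): each branch under an
// unconstrained split is weighted by the fraction of training samples that
// followed it. Without populated counts the weight is undefined, hence the
// hard requirement on number_samples.
#include <stdexcept>

double branchWeight(long parentSamples, long childSamples) {
    if (parentSamples <= 0) {
        throw std::invalid_argument(
            "number_samples not populated: feature importance unavailable");
    }
    return static_cast<double>(childSamples) /
           static_cast<double>(parentSamples);
}
```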

Additionally, if the inference config requests feature_importance and not all nodes have been upgraded yet, pipeline creation is disallowed. This is a safeguard for mixed-version environments where only some ingest nodes have been upgraded.

NOTE: the algorithm is a Java port of the one laid out in ml-cpp: https://github.com/elastic/ml-cpp/blob/master/lib/maths/CTreeShapFeatureImportance.cc

usability blocked by: elastic/ml-cpp#991
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Feb 21, 2020
…c#52218)

benwtrent added a commit to elastic/elasticsearch that referenced this pull request Feb 21, 2020
#52666)
