
[ML] Adds feature importance option to inference processor #52218

Merged

Conversation

benwtrent
Member

This adds machine learning model feature importance calculations to the inference processor.

The new flag in the configuration matches the analytics parameter name: `num_top_feature_importance_values`
Example:

"inference": {
   "field_mappings": {},
   "model_id": "my_model",
   "inference_config": {
      "regression": {
         "num_top_feature_importance_values": 3
      }
   }
}

This will write to the document as follows:

"inference" : {
   "feature_importance" : { 
      "FlightTimeMin" : -76.90955548511226,
      "FlightDelayType" : 114.13514762158526,
      "DistanceMiles" : 13.731580450792187
   },
   "predicted_value" : 108.33165831875137,
   "model_id" : "my_model"
}
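With `num_top_feature_importance_values` set to 3, only the features whose contributions have the largest magnitude are written back to the document. A minimal, self-contained sketch of that kind of top-N selection (plain Java for illustration, not the actual processor code):

```
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class TopFeatureImportance {

    // Keep the numTop features whose importance values have the largest magnitude.
    static Map<String, Double> topN(Map<String, Double> importance, int numTop) {
        return importance.entrySet().stream()
            .sorted(Comparator.comparingDouble(
                (Map.Entry<String, Double> e) -> Math.abs(e.getValue())).reversed())
            .limit(numTop)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        Map<String, Double> importance = Map.of(
            "FlightDelayType", 114.13514762158526,
            "FlightTimeMin", -76.90955548511226,
            "DistanceMiles", 13.731580450792187,
            "Carrier", 0.42); // illustrative extra feature that gets dropped

        // Prints the three entries shown in the example document above.
        System.out.println(topN(importance, 3));
    }
}
```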

This is done by calculating the [SHAP values](https://arxiv.org/abs/1802.03888).

It requires that models have populated `number_samples` for each tree node. This is not available to models that were created before 7.7.

Additionally, if the inference config requests `feature_importance` and not all nodes have been upgraded yet, pipeline creation is rejected. This is a safeguard for mixed-version environments where only some ingest nodes have been upgraded.
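A rough, self-contained sketch of that kind of check (the class, method, and version handling below are invented for illustration; the real check uses Elasticsearch's own Version and cluster state types):

```
import java.util.List;

// Illustrative sketch of the mixed-version safeguard described above.
public class FeatureImportanceVersionGuard {

    /** Throws if feature importance is requested but any node is older than 7.7.0. */
    static void validate(int numTopFeatureImportanceValues, List<int[]> nodeVersions) {
        boolean allNodesSupportIt = nodeVersions.stream()
            .allMatch(v -> v[0] > 7 || (v[0] == 7 && v[1] >= 7));
        if (numTopFeatureImportanceValues > 0 && allNodesSupportIt == false) {
            throw new IllegalArgumentException(
                "[num_top_feature_importance_values] requires every node to be on version 7.7.0 or later");
        }
    }

    public static void main(String[] args) {
        // A mixed 7.6 / 7.7 cluster: requesting feature importance is rejected.
        validate(3, List.of(new int[] {7, 7}, new int[] {7, 6}));
    }
}
```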

NOTE: the algorithm is a Java port of the one laid out in ml-cpp: https://github.com/elastic/ml-cpp/blob/master/lib/maths/CTreeShapFeatureImportance.cc

usability blocked by: elastic/ml-cpp#991

@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@benwtrent
Member Author

@elasticmachine update branch

Contributor

@droberts195 left a comment


Looks good.

My only real comment is that the thread safety of `Tree` needs thinking about. If these objects are in fact only ever used from one thread, then just remove the TODO and the volatiles, and add a class Javadoc comment stating that they must only be used by one thread at a time. But if they can be used from multiple threads, then your TODO is correct and extra work is required to ensure thread safety.

```
(Optional, integer)
If set, feature importance for the top
most important features will be computed. Importance is calculated
using the SHAP (SHapley Additive exPlanations) method as described in
```
Contributor


You might be able to use the macro from https://github.com/elastic/elasticsearch/pull/52283/files#diff-876579125ed69329aeee2dff00706206R909 to avoid duplicating this text.

```
.collect(Collectors.toMap(featureNames::get, i -> featureImportance[i])));
}

//TODO synchronized?
```
Contributor


If it is possible for this to be called from two different threads at around the same time, then there is a bigger problem that making this method `synchronized` will not solve. In the private `featureImportance(List<Double>, Map<String, String> featureDecoder)` method above, a simultaneous call to `calculateNodeEstimatesIfNeeded()` could have changed `maxDepth` but not yet `nodeEstimates` at the time those members are read in `featureImportance()`.

If multithreaded access to `Tree` objects is possible, then the `featureImportance(List<Double>, Map<String, String> featureDecoder)` method needs to be `synchronized`, and this method needs a comment saying it must only be called from `featureImportance(List<Double>, Map<String, String> featureDecoder)`. Then `nodeEstimates` and `maxDepth` wouldn't need to be `volatile`.
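A minimal sketch of the access pattern being suggested (field and method names follow the discussion, but this is not the actual `Tree` implementation):

```
import java.util.List;
import java.util.Map;

// Sketch of the suggested locking pattern: the public entry point is synchronized
// and the lazy initialisation helper is only ever called from it, so
// nodeEstimates and maxDepth no longer need to be volatile.
class Tree {

    private double[] nodeEstimates; // lazily computed, guarded by "this"
    private int maxDepth;           // lazily computed, guarded by "this"

    synchronized Map<String, Double> featureImportance(List<Double> features,
                                                       Map<String, String> featureDecoder) {
        calculateNodeEstimatesIfNeeded();
        // ... walk the tree and accumulate per-feature SHAP values using
        // nodeEstimates and maxDepth (omitted in this sketch) ...
        return Map.of();
    }

    // Must only be called from featureImportance(List, Map), which holds the lock.
    private void calculateNodeEstimatesIfNeeded() {
        if (nodeEstimates != null) {
            return;
        }
        maxDepth = 0;                   // placeholder values for the sketch
        nodeEstimates = new double[0];
    }
}
```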

Member Author


@droberts195 I do think this will be accessible from multiple threads. Multiple docs can come into different pipelines on the same node referencing the same model, so they would all hit the same cached instance.

I am going to make this synchronized, and also add a note to the classes saying that they must be thread safe.

Contributor

@droberts195 left a comment


LGTM

@benwtrent
Member Author

@elasticmachine update branch

@benwtrent
Member Author

run elasticsearch-ci/docs

@benwtrent benwtrent merged commit 20f5427 into elastic:master Feb 21, 2020
@benwtrent benwtrent deleted the feature/ml-inference-feature-importance branch February 21, 2020 21:36
@benwtrent benwtrent changed the title from "[ML] Adds feature importance to option to inference processor" to "[ML] Adds feature importance option to inference processor" on Feb 21, 2020
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Feb 21, 2020
…c#52218)

benwtrent added a commit that referenced this pull request Feb 21, 2020
#52666)
