
[ML] Adds feature importance option to inference processor #52218

Merged

Conversation

benwtrent
Member

This adds machine learning model feature importance calculations to the inference processor.

The new flag in the configuration matches the analytics parameter name: `num_top_feature_importance_values`
Example:

"inference": {
   "field_mappings": {},
   "model_id": "my_model",
   "inference_config": {
      "regression": {
         "num_top_feature_importance_values": 3
      }
   }
}

This will write to the document as follows:

"inference" : {
   "feature_importance" : { 
      "FlightTimeMin" : -76.90955548511226,
      "FlightDelayType" : 114.13514762158526,
      "DistanceMiles" : 13.731580450792187
   },
   "predicted_value" : 108.33165831875137,
   "model_id" : "my_model"
}
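With `num_top_feature_importance_values` set to 3, only the features whose contributions have the largest magnitude are written back to the document. A minimal, self-contained sketch of that kind of top-N selection (plain Java for illustration, not the actual processor code):

```
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class TopFeatureImportance {

    // Keep the numTop features whose importance values have the largest magnitude.
    static Map<String, Double> topN(Map<String, Double> importance, int numTop) {
        return importance.entrySet().stream()
            .sorted(Comparator.comparingDouble(
                (Map.Entry<String, Double> e) -> Math.abs(e.getValue())).reversed())
            .limit(numTop)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        Map<String, Double> importance = Map.of(
            "FlightDelayType", 114.13514762158526,
            "FlightTimeMin", -76.90955548511226,
            "DistanceMiles", 13.731580450792187,
            "Carrier", 0.42); // illustrative extra feature that gets dropped

        // Prints the three entries shown in the example document above.
        System.out.println(topN(importance, 3));
    }
}
```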

This is done by calculating the [SHAP values](https://arxiv.org/abs/1802.03888).

It requires that models have populated `number_samples` for each tree node. This is not available to models that were created before 7.7.

Additionally, if the inference config requests `feature_importance` and not all nodes have been upgraded yet, pipeline creation is rejected. This is a safeguard for mixed-version environments where only some ingest nodes have been upgraded.
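A rough, self-contained sketch of that kind of check (the class, method, and version handling below are invented for illustration; the real check uses Elasticsearch's own Version and cluster state types):

```
import java.util.List;

// Illustrative sketch of the mixed-version safeguard described above.
public class FeatureImportanceVersionGuard {

    /** Throws if feature importance is requested but any node is older than 7.7.0. */
    static void validate(int numTopFeatureImportanceValues, List<int[]> nodeVersions) {
        boolean allNodesSupportIt = nodeVersions.stream()
            .allMatch(v -> v[0] > 7 || (v[0] == 7 && v[1] >= 7));
        if (numTopFeatureImportanceValues > 0 && allNodesSupportIt == false) {
            throw new IllegalArgumentException(
                "[num_top_feature_importance_values] requires every node to be on version 7.7.0 or later");
        }
    }

    public static void main(String[] args) {
        // A mixed 7.6 / 7.7 cluster: requesting feature importance is rejected.
        validate(3, List.of(new int[] {7, 7}, new int[] {7, 6}));
    }
}
```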

NOTE: the algorithm is a Java port of the one laid out in ml-cpp: https://github.com/elastic/ml-cpp/blob/master/lib/maths/CTreeShapFeatureImportance.cc

usability blocked by: elastic/ml-cpp#991

@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@benwtrent
Member Author

@elasticmachine update branch

Contributor

@droberts195 left a comment


Looks good.

My only real comment is that the thread safety of `Tree` needs thinking about. If these objects are in fact only ever used from one thread, then just remove the TODO and the volatiles, and add a class Javadoc comment stating that they must only be used by one thread at a time. But if they can be used from multiple threads, then your TODO is correct and extra work is required to ensure thread safety.

```
(Optional, integer)
If set, feature importance for the top
most important features will be computed. Importance is calculated
using the SHAP (SHapley Additive exPlanations) method as described in
```
Contributor


You might be able to use the macro from https://github.com/elastic/elasticsearch/pull/52283/files#diff-876579125ed69329aeee2dff00706206R909 to avoid duplicating this text.

```
.collect(Collectors.toMap(featureNames::get, i -> featureImportance[i])));
}

//TODO synchronized?
```
Contributor


If it is possible for this to be called from two different threads at around the same time, then there is a bigger problem that making this method `synchronized` will not solve. In the private `featureImportance(List<Double>, Map<String, String> featureDecoder)` method above, a simultaneous call to `calculateNodeEstimatesIfNeeded()` could have changed `maxDepth` but not yet `nodeEstimates` at the time those members are read in `featureImportance()`.

If multithreaded access to `Tree` objects is possible, then the `featureImportance(List<Double>, Map<String, String> featureDecoder)` method needs to be `synchronized`, and this method needs a comment saying it must only be called from `featureImportance(List<Double>, Map<String, String> featureDecoder)`. Then `nodeEstimates` and `maxDepth` wouldn't need to be `volatile`.
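A minimal sketch of the access pattern being suggested (field and method names follow the discussion, but this is not the actual `Tree` implementation):

```
import java.util.List;
import java.util.Map;

// Sketch of the suggested locking pattern: the public entry point is synchronized
// and the lazy initialisation helper is only ever called from it, so
// nodeEstimates and maxDepth no longer need to be volatile.
class Tree {

    private double[] nodeEstimates; // lazily computed, guarded by "this"
    private int maxDepth;           // lazily computed, guarded by "this"

    synchronized Map<String, Double> featureImportance(List<Double> features,
                                                       Map<String, String> featureDecoder) {
        calculateNodeEstimatesIfNeeded();
        // ... walk the tree and accumulate per-feature SHAP values using
        // nodeEstimates and maxDepth (omitted in this sketch) ...
        return Map.of();
    }

    // Must only be called from featureImportance(List, Map), which holds the lock.
    private void calculateNodeEstimatesIfNeeded() {
        if (nodeEstimates != null) {
            return;
        }
        maxDepth = 0;                   // placeholder values for the sketch
        nodeEstimates = new double[0];
    }
}
```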

Member Author


@droberts195 I do think this will be accessible from multiple threads. Multiple docs can come into different pipelines on the same node referencing the same model, so they would all hit the same cached instance.

I am going to make this synchronized, and also add a note to the classes saying that they must be thread safe.

Contributor

@droberts195 left a comment


LGTM

@benwtrent
Member Author

@elasticmachine update branch

@benwtrent
Member Author

run elasticsearch-ci/docs

@benwtrent benwtrent merged commit 20f5427 into elastic:master Feb 21, 2020
@benwtrent benwtrent deleted the feature/ml-inference-feature-importance branch February 21, 2020 21:36
@benwtrent benwtrent changed the title from "[ML] Adds feature importance to option to inference processor" to "[ML] Adds feature importance option to inference processor" on Feb 21, 2020
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Feb 21, 2020
…c#52218)

benwtrent added a commit that referenced this pull request Feb 21, 2020
#52666)
