Optimizes mean calculation routine in treeinterpreter/treeinterpreter.py #24
I tried to use treeinterpreter to calculate feature contribution components for a large dataset consisting of 55K rows with ~15K features per row, and even though I parallelized the computation using Spark, I was not able to run the code successfully.
One of the issues I faced was the tremendous amount of memory required by treeinterpreter for each run. It turned out that in my case most of the memory was used by _predict_forest to assemble lists of biases, contributions and predictions, which are later used to calculate the corresponding mean vectors.
To improve the memory usage and runtime of the code, I propose using an iterative method for computing averages, as summarized in http://www.heikohoffmann.de/htmlthesis/node134.html
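To illustrate, here is a minimal sketch of that incremental update applied to a stream of per-tree result arrays (the function name and shapes are hypothetical, not taken from treeinterpreter itself). Instead of appending every tree's output to a list and averaging at the end, the running mean is updated in place as `mean_k = mean_{k-1} + (x_k - mean_{k-1}) / k`, so only one array of the result's shape is held in memory:

```python
import numpy as np

def iterative_mean(arrays):
    """Mean of a stream of equal-shaped arrays without storing them all.

    Uses the incremental update from Hoffmann's thesis:
        mean_k = mean_{k-1} + (x_k - mean_{k-1}) / k
    which is algebraically equivalent to summing and dividing,
    but needs O(1) arrays of memory instead of O(n).
    """
    mean = None
    for k, x in enumerate(arrays, start=1):
        x = np.asarray(x, dtype=float)
        if mean is None:
            mean = x.copy()  # first element initializes the running mean
        else:
            mean += (x - mean) / k  # in-place incremental update
    return mean
```

Applied inside `_predict_forest`, the same update could replace the lists of biases, contributions and predictions with three running-mean arrays, one per quantity being averaged.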