
[ML] Feature importance performance optimization #1005

Merged — valeriy42 merged 28 commits into elastic:master from the Performance-improvement branch on Feb 19, 2020

Conversation

valeriy42 (Contributor) commented on Feb 17, 2020:

This PR aims to improve the computational efficiency of the feature importance computation. To this end, it introduces a contiguous memory array, pre-reserved up front, to store the elements of the split path. Furthermore, I split the scale values away from fractionOnes, fractionZeros, and splitIndex to improve cache efficiency. Altogether, this roughly halved the computation time.

I will look into improving SHAP algorithms by introducing suitable heuristics in a follow-up PR.
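
A minimal sketch of the two ideas, with hypothetical names (the actual implementation lives in CTreeShapFeatureImportance): path elements for all recursion levels share one contiguous, pre-reserved vector, and the frequently scanned scale values sit in a parallel array of their own so reading them does not drag the colder fractionOnes/fractionZeros/splitIndex fields through the cache:

```cpp
#include <cstddef>
#include <vector>

// Hot path-element fields kept together; the scale values live in a
// separate parallel vector (see below) so scanning them doesn't pull
// these fields into cache lines unnecessarily.
struct SPathElement {
    double s_FractionOnes = 0.0;
    double s_FractionZeros = 0.0;
    int s_SplitIndex = -1;
};

class CShapWorkspace {
public:
    explicit CShapWorkspace(std::size_t maxDepth) {
        // One contiguous arena shared by every recursion level: each level
        // appends a copy of the current path at the end, so no further
        // allocations happen once this capacity is reserved up front.
        std::size_t capacity{(maxDepth + 2) * (maxDepth + 3) / 2};
        m_Path.reserve(capacity);
        m_Scale.reserve(capacity);
    }

private:
    std::vector<SPathElement> m_Path; // fractionOnes, fractionZeros, splitIndex
    std::vector<double> m_Scale;      // scale values, split out for cache efficiency
};

int main() {
    CShapWorkspace workspace{10}; // e.g. trees of depth at most 10
    static_cast<void>(workspace);
    return 0;
}
```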

valeriy42 removed the WIP label on Feb 17, 2020
tveasey (Contributor) left a comment:

Overall this looks great @valeriy42! My main comment is that it feels like we could better encapsulate this implementation if we made use of CPathElementAccessor to wrap both iterators. Can you see any obstacles to doing this? WDYT?

On this hunk of the change:

```cpp
TDoubleVec scaleVector;
// need a bit more memory than max depth
pathVector.reserve(((maxDepthOverall + 2) * (maxDepthOverall + 3)) / 2);
scaleVector.reserve(((maxDepthOverall + 2) * (maxDepthOverall + 3)) / 2);
```
tveasey (Contributor):

I think these probably need to be resizes: the copies shapRecursive makes into reserved-but-unconstructed elements would be undefined behaviour. Alternatively, you could wrap the containers in std::back_inserter iterators.
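
For context, a minimal sketch of the distinction the comment is drawing (container names hypothetical): reserve only allocates capacity, so writing to elements past size() is undefined behaviour, whereas resize actually constructs them; a std::back_inserter sidesteps the issue by growing the vector as it writes:

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

int main() {
    std::vector<double> path;

    path.reserve(8);  // capacity is 8, but size() is still 0
    // path[0] = 1.0; // undefined behaviour: no element has been constructed

    path.resize(8);   // size() is now 8 and all elements are value-initialized
    path[0] = 1.0;    // well-defined

    // Alternative: keep the reserve() and append through a back_inserter,
    // which push_backs each element and so never writes past size().
    std::vector<double> scale;
    scale.reserve(8);
    std::fill_n(std::back_inserter(scale), 8, 0.0);

    return 0;
}
```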

tveasey (Contributor):

I also think it would be nice to note down the origin of the + 2 and + 3. I can see that this comes from the sum of an arithmetic progression in the worst case up to maxDepthOverall, but it would be useful to explain it better. At the same time, you could explain the overall strategy of copying the "current path" to the end of each memory arena.
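
One consistent reading of the arithmetic (my reconstruction; the exact origin of the + 2 and + 3 is an assumption, not text from the PR): if the path copied at recursion depth $d$ holds up to $d$ elements, and the effective depth runs to maxDepthOverall + 2 because of two extra slots beyond the tree depth, then appending the current path to the arena at every level needs at most

$$\sum_{d=1}^{D+2} d \;=\; \frac{(D+2)(D+3)}{2}, \qquad D = \text{maxDepthOverall},$$

elements, which is exactly the capacity reserved in the snippet above.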

valeriy42 (Contributor, Author) replied:

I added a comment in ca3d891

valeriy42 (Contributor, Author) commented:

Thank you @tveasey for the review comments. I refactored the code and implemented your suggestions. Please let me know if everything is ok now.

tveasey (Contributor) left a comment:

Thanks @valeriy42. There are a couple of other loop simplifications in unwindPath that got missed. However, I'm happy to go ahead and approve. Great work!
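
The thread doesn't show which simplifications were meant, so here is a hedged sketch of the general kind (all names hypothetical, loosely following the element-removal step of the published TreeSHAP algorithm): the explicit index loop that shifts the remaining path elements over the removed one collapses into a single std::copy, which is well-defined for overlapping ranges when copying left:

```cpp
#include <algorithm>
#include <vector>

struct SPathElement {
    double s_FractionOnes = 0.0;
    double s_FractionZeros = 0.0;
    int s_SplitIndex = -1;
};

// Hypothetical sketch: drop the element at 'pathIndex' from a path of
// 'pathLength' elements. The hand-written loop
//     for (int j = pathIndex; j + 1 < pathLength; ++j) { path[j] = path[j + 1]; }
// collapses into one std::copy; copying forward is well-defined for
// overlapping ranges when shifting left.
void shiftPathLeft(std::vector<SPathElement>& path, int pathIndex, int pathLength) {
    std::copy(path.begin() + pathIndex + 1,
              path.begin() + pathLength,
              path.begin() + pathIndex);
}

int main() {
    std::vector<SPathElement> path(5);
    shiftPathLeft(path, 2, 5); // removes the element at index 2
    return 0;
}
```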

valeriy42 merged commit bd8a143 into elastic:master on Feb 19, 2020
valeriy42 deleted the Performance-improvement branch on February 19, 2020 at 14:13
valeriy42 added a commit to valeriy42/ml-cpp that referenced this pull request on Feb 19, 2020
valeriy42 added a commit that referenced this pull request on Feb 20, 2020:

Backport of #1005