[ML] Parallelise the feature importance calculation for classification and regression over trees #1277
Looks good altogether. I just have a couple of minor questions.
core::parallel_for_each(m_Forest->begin(), m_Forest->size(), computeTreeShap);

m_ReducedShapValues = m_PerThreadShapValues[0];
for (std::size_t i = 1; i < m_PerThreadShapValues.size(); ++i) {
Maybe add a note here that it can be replaced with std::reduce
and parallel execution once we have C++17 support?
I don't think this can go until the standard introduces executors, because we definitely want these to be executed in a thread pool rather than having to spawn threads. So, based on current plans, this will have to stay until we are able to adopt C++23. Given how far away this is, I'm not sure it is worth a comment at the moment.
core::CContainerPrinter::print(indices));
BOOST_REQUIRE_EQUAL(core::CContainerPrinter::print(expectedNames),
core::CContainerPrinter::print(names));
BOOST_REQUIRE_EQUAL(core::CContainerPrinter::print(expectedShap),
Why is it necessary to compare the string representations instead of the numerical values?
It was just for code brevity, i.e. it allows me to compare all values (to reasonable accuracy) in one line.
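For comparison, a one-line numerical check is also possible with a small helper. The sketch below is only an illustration: `allClose` is a hypothetical name, not a function from the code under review, and the absolute tolerance is an assumption about what "reasonable accuracy" means here.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the code under review): returns true if
// both vectors have the same length and every pair of corresponding elements
// differs by at most `tolerance` in absolute value.
bool allClose(const std::vector<double>& expected,
              const std::vector<double>& actual,
              double tolerance) {
    if (expected.size() != actual.size()) {
        return false;
    }
    for (std::size_t i = 0; i < expected.size(); ++i) {
        // Written as a negated <= so that a NaN on either side fails the check.
        if (!(std::fabs(expected[i] - actual[i]) <= tolerance)) {
            return false;
        }
    }
    return true;
}
```

A test could then assert on the result of a single call such as `allClose(expectedShap, shap, 1e-6)`, at the cost of a less informative failure message than the printed string comparison gives.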
Thanks for the review @valeriy42! Note I made a somewhat significant refactor in |
LGTM. Good work!
…n and regression over trees (elastic#1277)
We can split the work to compute feature importance over the trees in the forest, since we just sum the per-tree feature importances at the end.
To do this, I needed to expose an interface to parallel_for_each which takes a list of functions to run on subsets of the iterator range. We use member variables for the algorithm state to avoid reallocating for each training example, and I need to bind that state to each function upfront.
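The idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this change: `TTree`, `CTreeShapAccumulator` and `featureImportance` are hypothetical names, a "tree" is reduced to its per-feature contribution vector, and plain `std::thread` stands in for the thread pool behind `parallel_for_each`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

using TDoubleVec = std::vector<double>;
// Hypothetical stand-in for a tree: its per-feature importance contribution.
using TTree = TDoubleVec;

// Hypothetical worker functor. Each instance owns its accumulator, so the
// state is allocated once upfront and never shared between threads.
class CTreeShapAccumulator {
public:
    explicit CTreeShapAccumulator(std::size_t numberFeatures)
        : m_Values(numberFeatures, 0.0) {}

    void operator()(const TTree& tree) {
        assert(tree.size() == m_Values.size());
        for (std::size_t i = 0; i < tree.size(); ++i) {
            m_Values[i] += tree[i];
        }
    }

    const TDoubleVec& values() const { return m_Values; }

private:
    TDoubleVec m_Values;
};

TDoubleVec featureImportance(const std::vector<TTree>& forest,
                             std::size_t numberFeatures,
                             std::size_t numberThreads) {
    // Use at least one worker and no more workers than trees.
    numberThreads = std::max<std::size_t>(1, std::min(numberThreads, forest.size()));

    // One pre-built functor per worker: the "list of functions" bound upfront.
    std::vector<CTreeShapAccumulator> workers(
        numberThreads, CTreeShapAccumulator{numberFeatures});

    // Each worker processes one contiguous subset of the trees.
    std::size_t chunk{(forest.size() + numberThreads - 1) / numberThreads};
    std::vector<std::thread> threads;
    threads.reserve(numberThreads);
    for (std::size_t t = 0; t < numberThreads; ++t) {
        std::size_t begin{std::min(t * chunk, forest.size())};
        std::size_t end{std::min(begin + chunk, forest.size())};
        threads.emplace_back([&forest, &workers, t, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                workers[t](forest[i]);
            }
        });
    }
    for (auto& thread : threads) {
        thread.join();
    }

    // Sequential reduction: sum the per-worker results.
    TDoubleVec result(workers[0].values());
    for (std::size_t t = 1; t < workers.size(); ++t) {
        for (std::size_t i = 0; i < result.size(); ++i) {
            result[i] += workers[t].values()[i];
        }
    }
    return result;
}
```

Because each functor only touches its own accumulator, no locking is needed while the trees are processed, and the only sequential step is the final sum over workers.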