Skip to content

Conversation

tveasey
Copy link
Contributor

@tveasey tveasey commented May 28, 2020

We can split the work to compute feature importance up over the trees in the forest since we just sum up the per tree feature importances at the end.

In order to do this, I needed to expose an interface to parallel_for_each which takes a list of functions to run on subsets of the iterator range because we use member variables for the algorithm state to avoid reallocating for each training example, which I need to bind to each function upfront.

@tveasey tveasey requested a review from valeriy42 May 28, 2020 09:22
@tveasey tveasey changed the title [ML] Parallelise the feature importance calculation over trees for classification and regression [ML] Parallelise the feature importance calculation for classification and regression over trees May 28, 2020
Copy link
Contributor

@valeriy42 valeriy42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good altogether. I just have a couple of minor questions.

core::parallel_for_each(m_Forest->begin(), m_Forest->size(), computeTreeShap);

m_ReducedShapValues = m_PerThreadShapValues[0];
for (std::size_t i = 1; i < m_PerThreadShapValues.size(); ++i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a note here that it can be replaces with std::reduce and parallel execution once we have C++17 support?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this can go until the standard introduces executors, because we definitely want these to be executed in a thread pool rather than have to spawn threads. So this will have to stay until we are able to adopt C++23 based on current plans. Given how far away this is, I'm not sure it is worth a comment at the moment.

core::CContainerPrinter::print(indices));
BOOST_REQUIRE_EQUAL(core::CContainerPrinter::print(expectedNames),
core::CContainerPrinter::print(names));
BOOST_REQUIRE_EQUAL(core::CContainerPrinter::print(expectedShap),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it necessary to compare the string representatives instead of the numerical values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was just for code brevity, i.e. it allows me to compare all values (to reasonable accuracy) in one line.

@tveasey
Copy link
Contributor Author

tveasey commented May 28, 2020

Thanks for the review @valeriy42! Note I made a somewhat significant refactor in Concurrency to tighten up the new functions' error handling. Other than that I think I've commented on or addressed all your comments. Can you take another look and in particular it's worth checking ea0317c?

Copy link
Contributor

@valeriy42 valeriy42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Good work!

@tveasey tveasey merged commit c9c4863 into elastic:master May 29, 2020
@tveasey tveasey deleted the thread-feature-importance branch May 29, 2020 18:55
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request May 29, 2020
tveasey added a commit that referenced this pull request Jun 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants