[ML] Parallelise the feature importance calculation for classification and regression over trees #1277

tveasey · 2020-05-28T09:22:24Z

We can split the work of computing feature importance over the trees in the forest, since we just sum the per-tree feature importances at the end.

To do this, I needed to expose a parallel_for_each interface which takes a list of functions to run on subsets of the iterator range. We keep the algorithm state in member variables to avoid reallocating for each training example, so I need to bind that state to each function upfront.
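
As a rough illustration of the idea described above (not the actual ml-cpp code), the following sketch runs one stateful function object per chunk of the forest and sums the per-function results at the end. Every name here (`TTree`, `CTreeImportanceAccumulator`, `parallelForEachChunk`, `featureImportance`) is hypothetical, a "tree" is reduced to a vector of per-feature contributions, and the sketch spawns plain `std::thread`s, whereas the real `core::parallel_for_each` presumably schedules work on a thread pool.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a tree: just its per-feature contributions.
using TTree = std::vector<double>;
using TDoubleVec = std::vector<double>;

// Hypothetical worker: owns its accumulator, so state is allocated once
// upfront and never shared between threads.
class CTreeImportanceAccumulator {
public:
    explicit CTreeImportanceAccumulator(std::size_t numberFeatures)
        : m_Importance(numberFeatures, 0.0) {}

    void operator()(const TTree& tree) {
        assert(tree.size() == m_Importance.size());
        for (std::size_t i = 0; i < m_Importance.size(); ++i) {
            m_Importance[i] += tree[i];
        }
    }

    const TDoubleVec& importance() const { return m_Importance; }

private:
    TDoubleVec m_Importance;
};

// Run functions[j] over the j'th contiguous chunk of the forest.
// Assumption: one thread per function, for illustration only.
void parallelForEachChunk(const std::vector<TTree>& forest,
                          std::vector<CTreeImportanceAccumulator>& functions) {
    assert(!functions.empty());
    std::size_t chunk = (forest.size() + functions.size() - 1) / functions.size();
    std::vector<std::thread> threads;
    for (std::size_t j = 0; j < functions.size(); ++j) {
        std::size_t begin = std::min(j * chunk, forest.size());
        std::size_t end = std::min(begin + chunk, forest.size());
        threads.emplace_back([&forest, &functions, j, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                functions[j](forest[i]);
            }
        });
    }
    for (auto& thread : threads) {
        thread.join();
    }
}

// Sum the per-function partial importances once all chunks are done.
TDoubleVec featureImportance(const std::vector<TTree>& forest,
                             std::size_t numberFeatures,
                             std::size_t numberThreads) {
    assert(numberThreads > 0);
    std::vector<CTreeImportanceAccumulator> functions(
        numberThreads, CTreeImportanceAccumulator{numberFeatures});
    parallelForEachChunk(forest, functions);
    TDoubleVec result(numberFeatures, 0.0);
    for (const auto& function : functions) {
        for (std::size_t i = 0; i < numberFeatures; ++i) {
            result[i] += function.importance()[i];
        }
    }
    return result;
}
```

Because each function object only ever touches its own accumulator, no locking is needed while the chunks run; the only sequential step is the final sum.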

valeriy42

Looks good altogether. I just have a couple of minor questions.

include/core/Concurrency.h

lib/maths/CTreeShapFeatureImportance.cc

valeriy42 · 2020-05-28T12:20:21Z

lib/maths/CTreeShapFeatureImportance.cc

+        core::parallel_for_each(m_Forest->begin(), m_Forest->size(), computeTreeShap);
+
+        m_ReducedShapValues = m_PerThreadShapValues[0];
+        for (std::size_t i = 1; i < m_PerThreadShapValues.size(); ++i) {


Maybe add a note here that it can be replaced with std::reduce and parallel execution once we have C++17 support?

I don't think this can go until the standard introduces executors, because we definitely want these to be executed in a thread pool rather than have to spawn threads. So this will have to stay until we are able to adopt C++23 based on current plans. Given how far away this is, I'm not sure it is worth a comment at the moment.
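
For reference, a minimal sketch of what the suggested `std::reduce` form of this reduction could look like, assuming C++17 and assuming the per-thread values are plain vectors of doubles (the real element type of `m_PerThreadShapValues` is not shown in this excerpt). The helper name `sumPerThreadValues` is hypothetical. This uses the sequential overload; the parallel variant would pass an execution policy such as `std::execution::par` as the first argument, which is the form the reply above argues against because it gives no control over running on a thread pool.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

using TDoubleVec = std::vector<double>;

// Hypothetical helper: element-wise sum of the per-thread partial results.
// Element-wise addition is commutative and (up to floating point rounding)
// associative, which is what std::reduce requires of its binary operation.
TDoubleVec sumPerThreadValues(const std::vector<TDoubleVec>& perThreadValues,
                              std::size_t dimension) {
    return std::reduce(perThreadValues.begin(), perThreadValues.end(),
                       TDoubleVec(dimension, 0.0),
                       [](TDoubleVec lhs, const TDoubleVec& rhs) {
                           assert(lhs.size() == rhs.size());
                           for (std::size_t i = 0; i < lhs.size(); ++i) {
                               lhs[i] += rhs[i];
                           }
                           return lhs;
                       });
}
```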

valeriy42 · 2020-05-28T12:26:54Z

lib/maths/unittest/CTreeShapFeatureImportanceTest.cc

+                                    core::CContainerPrinter::print(indices));
+                BOOST_REQUIRE_EQUAL(core::CContainerPrinter::print(expectedNames),
+                                    core::CContainerPrinter::print(names));
+                BOOST_REQUIRE_EQUAL(core::CContainerPrinter::print(expectedShap),


Why is it necessary to compare the string representations instead of the numerical values?

It was just for code brevity, i.e. it allows me to compare all values (to reasonable accuracy) in one line.
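
To make the trade-off concrete, here is a small sketch contrasting the two styles of check. The `print` function is a hypothetical stand-in, not `core::CContainerPrinter::print`, whose exact formatting is not shown here; it assumes values are printed to six significant digits. `approximatelyEqual` is likewise a hypothetical helper showing the explicit numerical alternative.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical printer: formats each value to six significant digits, so two
// containers print identically when they agree to roughly that precision.
std::string print(const std::vector<double>& values) {
    std::ostringstream result;
    result << std::setprecision(6) << "[";
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (i > 0) {
            result << ", ";
        }
        result << values[i];
    }
    result << "]";
    return result.str();
}

// The explicit alternative: element-wise comparison with an absolute tolerance.
bool approximatelyEqual(const std::vector<double>& lhs,
                        const std::vector<double>& rhs,
                        double tolerance) {
    assert(tolerance >= 0.0);
    if (lhs.size() != rhs.size()) {
        return false;
    }
    for (std::size_t i = 0; i < lhs.size(); ++i) {
        if (std::abs(lhs[i] - rhs[i]) > tolerance) {
            return false;
        }
    }
    return true;
}
```

Comparing the printed strings checks every element in a single assertion, at the cost of tying the effective tolerance to the print precision rather than stating it explicitly.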

tveasey · 2020-05-28T16:58:25Z

Thanks for the review @valeriy42! Note I made a somewhat significant refactor in Concurrency to tighten up the new functions' error handling. Other than that, I think I've commented on or addressed all your comments. Can you take another look? In particular, it's worth checking ea0317c.

valeriy42

LGTM. Good work!

tveasey added 2 commits May 28, 2020 10:19

Thread feature importance by distributing over trees

5e4fb1f

Tidy up includes

0a78314

tveasey added >enhancement review v8.0.0 :ml/DataFrameAnalysis v7.9.0 labels May 28, 2020

tveasey requested a review from valeriy42 May 28, 2020 09:22

tveasey changed the title ~~[ML] Parallelise the feature importance calculation over trees for classification and regression~~ [ML] Parallelise the feature importance calculation for classification and regression over trees May 28, 2020

tveasey added 2 commits May 28, 2020 10:24

Docs

fee6ec1

Tweak

3f033f2

valeriy42 reviewed May 28, 2020

tveasey and others added 3 commits May 28, 2020 16:22

Better variable name

cc7ea3d

Fix error handling, reentry and better function signatures

ea0317c

Merge branch 'master' into thread-feature-importance

6408ca6

valeriy42 approved these changes May 29, 2020

tveasey merged commit c9c4863 into elastic:master May 29, 2020

tveasey deleted the thread-feature-importance branch May 29, 2020 18:55

tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request May 29, 2020

[ML] Parallelise the feature importance calculation for classificatio…

db114c2

…n and regression over trees (elastic#1277)

tveasey mentioned this pull request May 29, 2020

[7.9][ML] Parallelise the feature importance calculation for classification and regression over trees #1292

Merged

tveasey added a commit that referenced this pull request Jun 2, 2020

[ML] Parallelise the feature importance calculation for classificatio…

427c31d

…n and regression over trees (#1292) Backport #1277.
