[ML] Improve data access pattern computing best splits for classification and regression #1312

tveasey · 2020-06-11T17:01:10Z

The feature derivatives for boosted tree splits are laid out in increasing feature index order, but when we sample a feature bag we don't ensure it is sorted. As a result, computing the best split accesses the derivatives in random order rather than as a linear scan. This change sorts the feature bag after sampling and reworks the function signatures to avoid allocating a vector for every node.
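
To illustrate the access pattern described above, here is a minimal sketch. It is not the ml-cpp code: `sampleFeatureBag`, `sumDerivatives`, the flat derivative layout and the use of `std::mt19937` are all assumptions made for this example. The bag is written into a caller-owned vector so its capacity can be reused from node to node, and it is sorted so the pass over the derivatives only moves forward in memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: fill `bag` with `bagSize` distinct feature indices in
// increasing order. `bag` is owned by the caller, so once it has grown to
// `numberFeatures` elements no further allocation is needed per node.
void sampleFeatureBag(std::size_t numberFeatures,
                      std::size_t bagSize,
                      std::mt19937& rng,
                      std::vector<std::size_t>& bag) {
    bag.resize(numberFeatures);
    std::iota(bag.begin(), bag.end(), std::size_t{0});
    std::shuffle(bag.begin(), bag.end(), rng);
    bag.resize(std::min(bagSize, numberFeatures));
    // Sorting turns the later pass over the derivatives into a forward scan.
    std::sort(bag.begin(), bag.end());
}

// Hypothetical stand-in for the split search: derivatives are assumed to be
// stored feature by feature, `stride` values per feature.
double sumDerivatives(const std::vector<double>& derivatives,
                      std::size_t stride,
                      const std::vector<std::size_t>& bag) {
    double total = 0.0;
    for (std::size_t feature : bag) {
        assert((feature + 1) * stride <= derivatives.size());
        const double* begin = derivatives.data() + feature * stride;
        total = std::accumulate(begin, begin + stride, total);
    }
    return total;
}
```

With a sorted bag the offsets `feature * stride` are visited in increasing order, which is the linear scan the description refers to.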

droberts195

LGTM

tveasey · 2020-06-11T19:32:00Z

retest

tveasey · 2020-06-12T10:57:26Z

Thanks for the review @droberts195! While I was in the area I realised two things:

I could avoid all allocations in the categorical sample without replacement.
There was a bug when more samples were requested than there are values: a missing return. (This case wasn't actually exercised in the code, but it is clearly worth fixing.)

I also added a unit test. I decided it isn't worth breaking this out into a separate change, since it is also essentially about improving cache performance, but it is worth having a look at e548cc6 as well.
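
The two points above can be sketched roughly as follows. This is an illustrative assumption, not the `CSampling` implementation: the function name, signature, `std::mt19937` generator and the simple linear search are all hypothetical. It shows the general shape of sampling categories without replacement using only caller-owned storage, including the early return when at least as many samples as values are requested.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical sketch: draw up to `n` distinct category indices with
// probability proportional to the non-negative `weights`. `weights` is used
// as scratch space (chosen entries are zeroed) and `result` is caller-owned,
// so the function creates no temporary vectors of its own.
void sampleWithoutReplacement(std::mt19937& rng,
                              std::vector<double>& weights,
                              std::size_t n,
                              std::vector<std::size_t>& result) {
    result.clear();
    if (n >= weights.size()) {
        // Edge case: at least as many samples as values were requested,
        // so every index is returned.
        result.resize(weights.size());
        std::iota(result.begin(), result.end(), std::size_t{0});
        return; // Without this return the sampling loop below would also run.
    }
    for (std::size_t drawn = 0; drawn < n; ++drawn) {
        double total = std::accumulate(weights.begin(), weights.end(), 0.0);
        if (total <= 0.0) {
            break; // No categories with positive weight remain.
        }
        std::uniform_real_distribution<double> uniform{0.0, total};
        double u = uniform(rng);
        std::size_t chosen = weights.size();
        for (std::size_t i = 0; i < weights.size(); ++i) {
            if (weights[i] <= 0.0) {
                continue;
            }
            chosen = i; // The last live category doubles as a rounding fallback.
            if (u < weights[i]) {
                break;
            }
            u -= weights[i];
        }
        assert(chosen < weights.size());
        result.push_back(chosen);
        weights[chosen] = 0.0; // Remove the category from later draws.
    }
}
```

If `result` is reused across calls, `push_back` stays within its existing capacity after the first few calls. The linear search is chosen here for brevity only.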

droberts195

Still LGTM

Neither of the comments is essential to address if the CI goes green and you need to merge quickly.

Sort the feature bag

a3d0152

tveasey added >enhancement review v8.0.0 :ml/DataFrameAnalysis v7.9.0 labels Jun 11, 2020

Docs

7467fd0

droberts195 approved these changes Jun 11, 2020

Tidy up sample without replacement to avoid all allocations + fix edge case + unit test

e548cc6

droberts195 approved these changes Jun 12, 2020

lib/maths/CSampling.cc (outdated, resolved)

lib/maths/CSampling.cc (outdated, resolved)

Review comments

d957489

tveasey merged commit 029e232 into elastic:master Jun 12, 2020

tveasey deleted the sort-feature-bag branch June 12, 2020 13:46

tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Jun 12, 2020

[ML] Improve data access pattern computing best splits for classification and regression (elastic#1312)

e4db3c5

tveasey mentioned this pull request Jun 12, 2020

[7.9][ML] Improve data access pattern computing best splits for classification and regression #1315

Merged

tveasey added a commit that referenced this pull request Jun 12, 2020

[ML] Improve data access pattern computing best splits for classification and regression (#1315) Backport #1312.

578c661
