
[ML] Performance optimisation for classification and regression model training #2024


Merged: 9 commits into elastic:main on Sep 14, 2021

Conversation

@tveasey (Contributor) commented on Sep 13, 2021

Caching example splits proves to be a good runtime optimisation. Testing with our experiment driver on around 250 small data sets, the mean runtime dropped from 380s to 320s, and the runtime improvements were larger on the larger data sets. I'll attach some benchmarks for large data sets as well. I also removed CImmutableRadixSet, which is no longer used since looking up splits is no longer on the hot path.

Note that this does have the potential to change results slightly. The source of the difference is that we now store the candidate splits in float precision rather than double precision.
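For context, here is a minimal sketch of the idea, not the actual ml-cpp code: the class name, member names, and use of 8-bit bucket indices are illustrative assumptions. Candidate splits are computed once per feature and stored in float precision, and each row's bucket index is cached so that tree growing reads a precomputed index instead of searching the split values.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch: candidate splits per feature stored once in float
// precision, with each row's bucket index cached so tree training never has
// to search the split values again.
class CCachedSplits {
public:
    // Compute, for every row, which bucket its value falls into given the
    // (sorted) candidate split values for each feature. Done once up front.
    void cache(const std::vector<std::vector<double>>& columns,
               const std::vector<std::vector<double>>& candidateSplits) {
        m_Splits.clear();
        m_Buckets.clear();
        for (std::size_t feature = 0; feature < columns.size(); ++feature) {
            // Store splits in float: this halves the memory and, per the PR,
            // may change results slightly relative to double precision.
            std::vector<float> splits(candidateSplits[feature].begin(),
                                      candidateSplits[feature].end());
            // Assumes at most 256 candidate splits per feature.
            std::vector<std::uint8_t> buckets(columns[feature].size());
            for (std::size_t row = 0; row < columns[feature].size(); ++row) {
                auto value = static_cast<float>(columns[feature][row]);
                buckets[row] = static_cast<std::uint8_t>(
                    std::upper_bound(splits.begin(), splits.end(), value) -
                    splits.begin());
            }
            m_Splits.push_back(std::move(splits));
            m_Buckets.push_back(std::move(buckets));
        }
    }

    // During tree training, bucket lookup is a cached array read rather than
    // a binary search over the split values.
    std::uint8_t bucket(std::size_t feature, std::size_t row) const {
        return m_Buckets[feature][row];
    }

private:
    std::vector<std::vector<float>> m_Splits;
    std::vector<std::vector<std::uint8_t>> m_Buckets;
};
```

With the bucket indices cached, split lookup drops off the hot path, which is also why a structure like CImmutableRadixSet for fast split search is no longer needed.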

@valeriy42 (Contributor) left a comment


Good work and great results! I have just a few comments regarding readability. Other than those, LGTM! 👍

@tveasey (Contributor, Author) commented on Sep 14, 2021

The CI failure was a timing-related one in CStringStoreTest/testStringStore, so I'm going to go ahead and merge this.

@tveasey merged commit a5b812b into elastic:main on Sep 14, 2021
@tveasey deleted the cache-splits-main branch on September 14, 2021 at 14:35
tveasey added a commit that referenced this pull request Sep 15, 2021
… training (#2027)


This ports #2024 to the incremental training feature branch. It collides with a fair amount of code change there, so I've essentially reapplied the change to this branch.
@tveasey (Contributor, Author) commented on Sep 16, 2021

Training on 80% of Higgs 1M: runtime before 9057s, after 7641s.

tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Oct 11, 2021
… training (elastic#2024)

@tveasey added the v8.0.0 label on Oct 25, 2021