[ML] Faster quantile estimation #881

tveasey · 2019-12-09T15:17:36Z

This change was motivated by profiling boosted tree training on large data sets (particularly data sets with many metric-valued features). In this case, updating the quantile sketch to decide on candidate splits can contribute significantly to overall run time (60% before this change).

This makes three significant changes:

1. It introduces a faster version of the quantile sketch which is less optimised for space. This has members for the collection of merge costs and their indices, used to decide which knots to merge next, since allocating these each time reduce is called is expensive. It also means we can cache the random numbers used to break ties.
2. It switches to using a bit mask of stale costs rather than having to search a collection to check if a cost needs to be recomputed.
3. It increases the amount by which the sketch is compressed on each reduce (and so this happens less frequently).
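To illustrate the second point, here is a minimal sketch of the stale-cost bit mask idea. It is not the code in `CQuantileSketch`: the class name `StaleCostTracker`, its methods and the cost callback are all hypothetical, and it assumes one merge cost per candidate knot. The buffers are held as members so nothing is allocated per call, and checking whether a cost is stale is an indexed bit test rather than a search.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration only; names and layout are assumptions,
// not the CQuantileSketch implementation.
class StaleCostTracker {
public:
    explicit StaleCostTracker(std::size_t n)
        : m_Costs(n, 0.0), m_Stale(n, true) {}

    // Flag a cost for recomputation, e.g. because a neighbouring knot was merged.
    void markStale(std::size_t i) {
        assert(i < m_Stale.size());
        m_Stale[i] = true;
    }

    // Constant time check: no search through a collection of stale indices.
    bool isStale(std::size_t i) const { return m_Stale[i]; }

    double cost(std::size_t i) const { return m_Costs[i]; }

    // Recompute only the stale costs and return how many were recomputed.
    template<typename F>
    std::size_t refresh(F computeCost) {
        std::size_t recomputed{0};
        for (std::size_t i = 0; i < m_Costs.size(); ++i) {
            if (m_Stale[i]) {
                m_Costs[i] = computeCost(i);
                m_Stale[i] = false;
                ++recomputed;
            }
        }
        return recomputed;
    }

private:
    // Members rather than locals, so they are not reallocated on every call.
    std::vector<double> m_Costs;
    // One bit per cost (std::vector<bool> is bit packed).
    std::vector<bool> m_Stale;
};
```

After an initial `refresh`, marking two entries stale and refreshing again recomputes just those two costs and leaves the rest untouched.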

I also made a variety of small optimisations. All in all, I consistently get around a 2x performance improvement updating the quantile sketch as a result of these changes on Linux, Mac and Windows, for the parameters I used for boosted tree training.

edsavage

LGTM

Just a few questions and suggestions - feel free to ignore!

include/maths/CQuantileSketch.h

lib/maths/CBoostedTreeImpl.cc

lib/maths/CQuantileSketch.cc

edsavage · 2019-12-12T12:48:16Z

lib/maths/CQuantileSketch.cc

+
+    std::size_t merged{this->target()};
+    std::ptrdiff_t numberMergeCandidates{static_cast<std::ptrdiff_t>(m_Knots.size()) - 3};
+    boost::random::uniform_01<double> u01;


Also, the template type defaults to double anyway but I guess it doesn't hurt to be explicit about it...

I kind of prefer to make this explicit; saves having to check docs/code.
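For anyone reading along, the point about defaulted versus explicit template arguments can be shown with the standard library analogue. This is only a sketch: the snippet under review uses `boost::random::uniform_01`, whereas this uses `std::uniform_real_distribution` so that it compiles without Boost, and the helper names `drawExplicit` and `drawDefault` are made up for illustration.

```cpp
#include <random>
#include <type_traits>

// std::uniform_real_distribution defaults its template parameter to double,
// so the defaulted and explicit spellings name the same type.
static_assert(std::is_same<std::uniform_real_distribution<>,
                           std::uniform_real_distribution<double>>::value,
              "default RealType is double");

// Hypothetical helper: the explicit spelling, no need to check the docs.
inline double drawExplicit(std::mt19937_64& rng) {
    std::uniform_real_distribution<double> u01{0.0, 1.0};
    return u01(rng);
}

// Hypothetical helper: the defaulted spelling, identical behaviour.
inline double drawDefault(std::mt19937_64& rng) {
    std::uniform_real_distribution<> u01{0.0, 1.0};
    return u01(rng);
}
```

Both helpers produce the same value from identically seeded generators, so the explicit form costs nothing and documents the intent.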

lib/maths/CQuantileSketch.cc

lib/maths/unittest/CQuantileSketchTest.cc

tveasey · 2019-12-12T16:06:03Z

Thanks for the review @edsavage, good suggestions!

edsavage · 2019-12-12T16:52:33Z

Looks like CI failed due to some failing integration tests :-/

REPRODUCE WITH: ./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner' --tests "org.elasticsearch.xpack.ml.integration.ClassificationIT.testWithOnlyTrainingRowsAndTrainingPercentIsHundred" -Dtests.seed=8A3A5CC6F4804780 -Dtests.security.manager=true -Dtests.locale=id-ID -Dtests.timezone=UCT -Dcompiler.java=13
    java.lang.AssertionError:
    Expected: <1>
         but: was <0>
        at __randomizedtesting.SeedInfo.seed([8A3A5CC6F4804780:D22211090BFF5E96]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.assertModelStatePersisted(MlNativeDataFrameAnalyticsIntegTestCase.java:282)
        at org.elasticsearch.xpack.ml.integration.ClassificationIT.testWithOnlyTrainingRowsAndTrainingPercentIsHundred(ClassificationIT.java:139)


REPRODUCE WITH: ./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner' --tests "org.elasticsearch.xpack.ml.integration.ClassificationIT.testWithOnlyTrainingRowsAndTrainingPercentIsFifty_DependentVariableIsKeyword" -Dtests.seed=8A3A5CC6F4804780 -Dtests.security.manager=true -Dtests.locale=id-ID -Dtests.timezone=UCT -Dcompiler.java=13
org.elasticsearch.xpack.ml.integration.ClassificationIT > testWithOnlyTrainingRowsAndTrainingPercentIsFifty_DependentVariableIsKeyword FAILED
    java.lang.AssertionError:
    Expected: <1>
         but: was <0>
        at __randomizedtesting.SeedInfo.seed([8A3A5CC6F4804780:951CD433D8811756]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.assertModelStatePersisted(MlNativeDataFrameAnalyticsIntegTestCase.java:282)
        at org.elasticsearch.xpack.ml.integration.ClassificationIT.testWithOnlyTrainingRowsAndTrainingPercentIsFifty(ClassificationIT.java:200)
        at org.elasticsearch.xpack.ml.integration.ClassificationIT.testWithOnlyTrainingRowsAndTrainingPercentIsFifty_DependentVariableIsKeyword(ClassificationIT.java:213)

REPRODUCE WITH: ./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner' --tests "org.elasticsearch.xpack.ml.integration.ClassificationIT.testSingleNumericFeatureAndMixedTrainingAndNonTrainingRows" -Dtests.seed=8A3A5CC6F4804780 -Dtests.security.manager=true -Dtests.locale=id-ID -Dtests.timezone=UCT -Dcompiler.java=13

org.elasticsearch.xpack.ml.integration.ClassificationIT > testSingleNumericFeatureAndMixedTrainingAndNonTrainingRows FAILED
    java.lang.AssertionError:
    Expected: <1>
         but: was <0>
        at __randomizedtesting.SeedInfo.seed([8A3A5CC6F4804780:9E2EF94FB8764B89]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.assertModelStatePersisted(MlNativeDataFrameAnalyticsIntegTestCase.java:282)
        at org.elasticsearch.xpack.ml.integration.ClassificationIT.testSingleNumericFeatureAndMixedTrainingAndNonTrainingRows(ClassificationIT.java:98)

REPRODUCE WITH: ./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner' --tests "org.elasticsearch.xpack.ml.integration.ClassificationIT.testWithOnlyTrainingRowsAndTrainingPercentIsFifty_DependentVariableIsInteger" -Dtests.seed=8A3A5CC6F4804780 -Dtests.security.manager=true -Dtests.locale=id-ID -Dtests.timezone=UCT -Dcompiler.java=13

org.elasticsearch.xpack.ml.integration.ClassificationIT > testWithOnlyTrainingRowsAndTrainingPercentIsFifty_DependentVariableIsInteger FAILED
    java.lang.AssertionError:
    Expected: <1>
         but: was <0>
        at __randomizedtesting.SeedInfo.seed([8A3A5CC6F4804780:A0B70A97EEB6FA1A]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.assertModelStatePersisted(MlNativeDataFrameAnalyticsIntegTestCase.java:282)
        at org.elasticsearch.xpack.ml.integration.ClassificationIT.testWithOnlyTrainingRowsAndTrainingPercentIsFifty(ClassificationIT.java:200)
        at org.elasticsearch.xpack.ml.integration.ClassificationIT.testWithOnlyTrainingRowsAndTrainingPercentIsFifty_DependentVariableIsInteger(ClassificationIT.java:219)

tveasey · 2019-12-12T17:55:23Z

Indeed, I don't want to debug this as part of this change. It was caused by a tangential change to use the std uniform distribution. I've reverted that, but kept the change to the pseudo-RNG so we can cut across easily later. I'll raise an issue to investigate this.

tveasey · 2019-12-13T09:54:58Z

In fact, I just needed to merge master.

tveasey added 3 commits December 3, 2019 18:19

A fast to update version of quantile estimation

90185c1

More speed ups

1ed5f52

Merge branch 'master' into faster-quantile-estimation

16447c4

tveasey added >enhancement review v8.0.0 :ml/DataFrameAnalysis v7.6.0 labels Dec 9, 2019

tveasey and others added 5 commits December 9, 2019 15:20

Docs

4c5dd93

Making virtual adds a pointer to the vtable

02ba88f

Test fallout

8f26c22

Test threshold

f4db3ad

Merge branch 'master' into faster-quantile-estimation

a88ec02

tveasey requested a review from edsavage December 11, 2019 13:20

edsavage approved these changes Dec 12, 2019

tveasey added 4 commits December 12, 2019 15:33

Make checksum virtual

644fbbb

Switch to std uniform distribution

f10d5fa

Name and explain constant

d8058f5

Build fallout

3501c38

Revert distribution generator

65a8878

tveasey added 3 commits December 12, 2019 20:50

Merge branch 'master' into faster-quantile-estimation

2c447da

The failure was unrelated to the distribution change

d5bfd34

Build warnings

5cd0ec1

tveasey merged commit bbe151d into elastic:master Dec 13, 2019

tveasey deleted the faster-quantile-estimation branch December 13, 2019 09:55

tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Dec 13, 2019

[ML] Faster quantile estimation (elastic#881)

afcd122

tveasey mentioned this pull request Dec 13, 2019

[7.6][ML] Faster quantile estimation #902

Merged

tveasey added a commit that referenced this pull request Dec 13, 2019

[7.6][ML] Faster quantile estimation (#902)

44655a5

Backport #881.
