Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Faster quantile estimation #881

Merged
merged 16 commits into from
Dec 13, 2019

Conversation

tveasey
Copy link
Contributor

@tveasey tveasey commented Dec 9, 2019

This change was motivated by profiling boosted tree training on large data sets (particularly data sets with many metric valued features). In this case, updating the quantile sketch in order to decide on candidate splits can significantly contribute to overall run time (60% before this change).

This makes three significant changes:

  1. It introduces a faster version of the quantile sketch which is less optimised for space. This has members for the collection of merge costs and their indices, used to decide which knots to merge next, since allocating these each time reduce is called is expensive. It also means we can cache the random numbers used to break ties.
  2. It switches to using a bit mask of stale costs rather than having to search a collection to check if a cost needs to be recomputed.
  3. It increases the amount by which the sketch is compressed on each reduce (and so this happens less frequently).

I also made a variety of small optimisations. All in all, I consistently get around a 2x performance improvement updating the quantile sketch as a result of these changes on Linux, Mac and Windows, for parameters I used for boosted tree training.

Copy link
Contributor

@edsavage edsavage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Just a few questions and suggestions - feel free to ignore!

include/maths/CQuantileSketch.h Outdated Show resolved Hide resolved
lib/maths/CBoostedTreeImpl.cc Outdated Show resolved Hide resolved
lib/maths/CQuantileSketch.cc Outdated Show resolved Hide resolved

std::size_t merged{this->target()};
std::ptrdiff_t numberMergeCandidates{static_cast<std::ptrdiff_t>(m_Knots.size()) - 3};
boost::random::uniform_01<double> u01;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the template type defaults to double anyway but I guess it doesn't hurt to be explicit about it...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of prefer to make this explicit; saves having to check docs/code.

lib/maths/CQuantileSketch.cc Outdated Show resolved Hide resolved
lib/maths/unittest/CQuantileSketchTest.cc Show resolved Hide resolved
@tveasey
Copy link
Contributor Author

tveasey commented Dec 12, 2019

Thanks for the review @edsavage, good suggestions!

@edsavage
Copy link
Contributor

Looks like CI failed due to some failing integration tests :-/

REPRODUCE WITH: ./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner' --tests "org.elasticsearch.xpack.ml.integration.ClassificationIT.testWithOnlyTrainingRowsAndTrainingPercentIsHundred" -Dtests.seed=8A3A5CC6F4804780 -Dtests.security.manager=true -Dtests.locale=id-ID -Dtests.timezone=UCT -Dcompiler.java=13
    java.lang.AssertionError:
    Expected: <1>
         but: was <0>
        at __randomizedtesting.SeedInfo.seed([8A3A5CC6F4804780:D22211090BFF5E96]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.assertModelStatePersisted(MlNativeDataFrameAnalyticsIntegTestCase.java:282)
        at org.elasticsearch.xpack.ml.integration.ClassificationIT.testWithOnlyTrainingRowsAndTrainingPercentIsHundred(ClassificationIT.java:139)


REPRODUCE WITH: ./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner' --tests "org.elasticsearch.xpack.ml.integration.ClassificationIT.testWithOnlyTrainingRowsAndTrainingPercentIsFifty_DependentVariableIsKeyword" -Dtests.seed=8A3A5CC6F4804780 -Dtests.security.manager=true -Dtests.locale=id-ID -Dtests.timezone=UCT -Dcompiler.java=13
org.elasticsearch.xpack.ml.integration.ClassificationIT > testWithOnlyTrainingRowsAndTrainingPercentIsFifty_DependentVariableIsKeyword FAILED
    java.lang.AssertionError:
    Expected: <1>
         but: was <0>
        at __randomizedtesting.SeedInfo.seed([8A3A5CC6F4804780:951CD433D8811756]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.assertModelStatePersisted(MlNativeDataFrameAnalyticsIntegTestCase.java:282)
        at org.elasticsearch.xpack.ml.integration.ClassificationIT.testWithOnlyTrainingRowsAndTrainingPercentIsFifty(ClassificationIT.java:200)
        at org.elasticsearch.xpack.ml.integration.ClassificationIT.testWithOnlyTrainingRowsAndTrainingPercentIsFifty_DependentVariableIsKeyword(ClassificationIT.java:213)

REPRODUCE WITH: ./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner' --tests "org.elasticsearch.xpack.ml.integration.ClassificationIT.testSingleNumericFeatureAndMixedTrainingAndNonTrainingRows" -Dtests.seed=8A3A5CC6F4804780 -Dtests.security.manager=true -Dtests.locale=id-ID -Dtests.timezone=UCT -Dcompiler.java=13

org.elasticsearch.xpack.ml.integration.ClassificationIT > testSingleNumericFeatureAndMixedTrainingAndNonTrainingRows FAILED
    java.lang.AssertionError:
    Expected: <1>
         but: was <0>
        at __randomizedtesting.SeedInfo.seed([8A3A5CC6F4804780:9E2EF94FB8764B89]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.assertModelStatePersisted(MlNativeDataFrameAnalyticsIntegTestCase.java:282)
        at org.elasticsearch.xpack.ml.integration.ClassificationIT.testSingleNumericFeatureAndMixedTrainingAndNonTrainingRows(ClassificationIT.java:98)

REPRODUCE WITH: ./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner' --tests "org.elasticsearch.xpack.ml.integration.ClassificationIT.testWithOnlyTrainingRowsAndTrainingPercentIsFifty_DependentVariableIsInteger" -Dtests.seed=8A3A5CC6F4804780 -Dtests.security.manager=true -Dtests.locale=id-ID -Dtests.timezone=UCT -Dcompiler.java=13

org.elasticsearch.xpack.ml.integration.ClassificationIT > testWithOnlyTrainingRowsAndTrainingPercentIsFifty_DependentVariableIsInteger FAILED
    java.lang.AssertionError:
    Expected: <1>
         but: was <0>
        at __randomizedtesting.SeedInfo.seed([8A3A5CC6F4804780:A0B70A97EEB6FA1A]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.assertModelStatePersisted(MlNativeDataFrameAnalyticsIntegTestCase.java:282)
        at org.elasticsearch.xpack.ml.integration.ClassificationIT.testWithOnlyTrainingRowsAndTrainingPercentIsFifty(ClassificationIT.java:200)
        at org.elasticsearch.xpack.ml.integration.ClassificationIT.testWithOnlyTrainingRowsAndTrainingPercentIsFifty_DependentVariableIsInteger(ClassificationIT.java:219)

@tveasey
Copy link
Contributor Author

tveasey commented Dec 12, 2019

Indeed, I don't want to debug this as part of this change. It was caused by tangential change to use std uniform distribution. I've reverted that, but kept the change to the pseudo rng so we can cut across easily at some point. I'll raise an issue to investigate this at some point.

@tveasey
Copy link
Contributor Author

tveasey commented Dec 13, 2019

In fact, I just needed to merge master.

@tveasey tveasey merged commit bbe151d into elastic:master Dec 13, 2019
@tveasey tveasey deleted the faster-quantile-estimation branch December 13, 2019 09:55
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Dec 13, 2019
tveasey added a commit that referenced this pull request Dec 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants