[ML] Faster quantile estimation #881
Conversation
LGTM
Just a few questions and suggestions - feel free to ignore!
lib/maths/CQuantileSketch.cc

```cpp
std::size_t merged{this->target()};
std::ptrdiff_t numberMergeCandidates{static_cast<std::ptrdiff_t>(m_Knots.size()) - 3};
boost::random::uniform_01<double> u01;
```
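To make the excerpt above concrete, here is a minimal sketch of the kind of randomised tie-breaking it sets up: pick the cheapest adjacent pair of knots to merge, perturbing equal costs with pre-generated uniform random numbers. The knot representation, the cost function, and the function name `cheapestMerge` are all my assumptions for illustration, not the actual CQuantileSketch implementation.

```cpp
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Stand-in for the real knot type: (value, count).
using TFloatFloatPr = std::pair<double, double>;

// Return the index i of the adjacent pair (knots[i], knots[i+1]) with the
// minimum merge cost. Costs are perturbed by a tiny multiple of a cached
// random number so exact ties don't always resolve to the same knot,
// which would repeatedly collapse the same region of the sketch.
std::size_t cheapestMerge(const std::vector<TFloatFloatPr>& knots,
                          const std::vector<double>& tieBreakers) {
    std::size_t best{1};
    double bestCost{std::numeric_limits<double>::max()};
    // Start at 1 and stop before the last knot so the endpoints,
    // which anchor the sketch's range, are never merged away.
    for (std::size_t i = 1; i + 1 < knots.size(); ++i) {
        // Count-weighted gap: merging close, lightly weighted knots is cheap.
        double cost{(knots[i + 1].first - knots[i].first) *
                    (knots[i].second + knots[i + 1].second)};
        // Break exact cost ties with a cached uniform random number.
        cost *= 1.0 + 1e-8 * tieBreakers[i];
        if (cost < bestCost) {
            bestCost = cost;
            best = i;
        }
    }
    return best;
}
```

Caching the tie-breakers in a vector, rather than drawing from the generator inside the comparison, keeps the inner loop branch-light and the selection reproducible within one reduce.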
Also, the template type defaults to double anyway, but I guess it doesn't hurt to be explicit about it...
I kind of prefer to make this explicit; saves having to check docs/code.
Thanks for the review @edsavage, good suggestions!
Looks like CI failed due to some failing integration tests :-/
Indeed, I don't want to debug this as part of this change. It was caused by a tangential change to use the std uniform distribution. I've reverted that, but kept the change to the pseudo-RNG so we can cut across easily at some point. I'll raise an issue to investigate.
In fact, I just needed to merge master.
This change was motivated by profiling boosted tree training on large data sets, particularly those with many metric-valued features. In this case, updating the quantile sketch to decide on candidate splits can contribute significantly to overall run time (60% before this change).
This makes three significant changes. Among them, we call `reduce` less often, since each call is expensive; this also means we can cache the random numbers used to break ties. I also made a variety of small optimisations. All in all, I consistently get around a 2x performance improvement updating the quantile sketch as a result of these changes on Linux, Mac and Windows, for the parameters I used for boosted tree training.
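The two ideas the description mentions can be sketched together: buffer incoming values so the expensive reduce step runs once per batch rather than once per insertion, and generate the tie-breaking random numbers once per reduce rather than once per comparison. Everything below (the `BufferedSketch` class, the pairwise compression) is an illustrative stand-in under those assumptions, not the actual CQuantileSketch code.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

class BufferedSketch {
public:
    explicit BufferedSketch(std::size_t maxSize) : m_MaxSize{maxSize} {}

    void add(double x) {
        m_Buffer.push_back(x);
        // Amortise: only reduce when the buffer fills up.
        if (m_Buffer.size() >= m_MaxSize) {
            this->reduce();
        }
    }

    const std::vector<double>& knots() {
        this->reduce();
        return m_Knots;
    }

private:
    void reduce() {
        if (m_Buffer.empty()) {
            return;
        }
        // Cache one batch of uniform random numbers up front for this
        // reduce, instead of drawing inside the merge loop.
        std::vector<double> tieBreakers(m_Buffer.size());
        std::uniform_real_distribution<double> u01{0.0, 1.0};
        std::generate(tieBreakers.begin(), tieBreakers.end(),
                      [&] { return u01(m_Rng); });

        m_Knots.insert(m_Knots.end(), m_Buffer.begin(), m_Buffer.end());
        m_Buffer.clear();
        std::sort(m_Knots.begin(), m_Knots.end());

        // Placeholder compression: halve the knots until within budget,
        // using a cached random number to pick which of each pair survives
        // (a stand-in for cost-based merging with random tie-breaking).
        while (m_Knots.size() > m_MaxSize) {
            std::vector<double> compressed;
            for (std::size_t i = 0; i + 1 < m_Knots.size(); i += 2) {
                compressed.push_back(tieBreakers[i % tieBreakers.size()] < 0.5
                                         ? m_Knots[i]
                                         : m_Knots[i + 1]);
            }
            if (m_Knots.size() % 2 == 1) {
                compressed.push_back(m_Knots.back());
            }
            m_Knots = std::move(compressed);
        }
    }

    std::size_t m_MaxSize;
    std::vector<double> m_Buffer;
    std::vector<double> m_Knots;
    std::mt19937 m_Rng{42};
};
```

The buffering means the per-insertion cost is a single `push_back` in the common case, which is where the bulk of the speed-up in a hot update path would come from.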