Conversation

@tveasey (Contributor) commented on Aug 12, 2022:

This is the second part of #2380.

Quantile estimation during training is expensive for data sets with many numeric features. This change introduces various speed-ups in this area:

  1. Batch-compute merge costs using explicit vectorisation,
  2. Use a larger reduction factor when merging buckets. I've held the reduced size fixed to maintain roughly similar accuracy; the upshot is more memory but less work per bucket,
  3. Improve handling of duplicate values: deduplicate new values before merging, since this reduces the cost of std::inplace_merge (see the sketch after this list),
  4. Be a bit more careful about how random numbers for tiebreaks are generated.
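
For item 3, a minimal sketch of the dedupe-then-merge idea; the container layout and the helper name dedupeAndMerge are illustrative assumptions, not the actual CQuantileSketch code:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using TFloatFloatPr = std::pair<float, float>; // (value, count)
using TFloatFloatPrVec = std::vector<TFloatFloatPr>;

// Illustrative only: sort and deduplicate the newly added values, accumulating
// their counts, before merging them into the already sorted prefix. Collapsing
// exact duplicates first shrinks the right-hand range and so reduces the work
// std::inplace_merge has to do.
void dedupeAndMerge(TFloatFloatPrVec& knots, std::size_t unsortedBegin) {
    auto begin = knots.begin() + static_cast<std::ptrdiff_t>(unsortedBegin);
    std::sort(begin, knots.end());

    // Collapse runs of equal values in the new (suffix) range, summing counts.
    auto out = begin;
    for (auto i = begin; i != knots.end(); /**/) {
        auto j = i;
        float count{0.0F};
        for (/**/; j != knots.end() && j->first == i->first; ++j) {
            count += j->second;
        }
        *out++ = {i->first, count};
        i = j;
    }
    knots.erase(out, knots.end());

    // Merge the deduplicated suffix into the sorted prefix.
    std::inplace_merge(knots.begin(),
                       knots.begin() + static_cast<std::ptrdiff_t>(unsortedBegin),
                       knots.end());
}
```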

Together these drop the amortised add cost from around 100 ns to 50 ns per item on my i9. The remaining cost is roughly 50% std::nth_element and 30% std::sort + std::inplace_merge; short of dropping the requirement that we extract the smallest k merge costs, it is unlikely we can make big inroads into these. While doing this it was useful to extend CFloatStorage so it can be used in constexpr expressions. I also mark all its methods noexcept, which they are and always will be; in theory this allows some extra optimisations, although the compiler should be able to deduce they don't throw.
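
As a rough illustration of that CFloatStorage change (the class below is a made-up stand-in, not the actual ml-cpp implementation), marking the wrapper's members constexpr and noexcept lets it appear in compile-time constants and documents that its operations never throw:

```cpp
// Hypothetical sketch of a constexpr-friendly, noexcept float wrapper.
class CFloatStorageSketch {
public:
    constexpr CFloatStorageSketch() noexcept = default;
    constexpr CFloatStorageSketch(float value) noexcept : m_Value{value} {}
    constexpr operator float() const noexcept { return m_Value; }
    constexpr bool operator<(CFloatStorageSketch rhs) const noexcept {
        return m_Value < rhs.m_Value;
    }

private:
    float m_Value{0.0F};
};

// Because everything is constexpr, instances can be compile-time constants.
constexpr CFloatStorageSketch SMALLEST_STORABLE{1.0F};
```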

Separately, I cap the maximum number of values we'll use to estimate quantiles. The accuracy of the quantiles we care about converges quickly with sample size. For example, the count of values less (greater) than the sample 1st (99th) percentile would be ~ Binomial(n, 0.01) for sample size n, so the relative error in the count would be O(10.0 / n^(1/2)) (the arithmetic is spelled out below). I therefore cap the maximum sample size at 50000, for which this error is around 4%.
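
Spelling out the arithmetic behind that bound (this is just the standard binomial standard-deviation argument; nothing here is specific to the implementation): the count has mean $np$ and standard deviation $\sqrt{np(1-p)}$, so

$$
\frac{\sqrt{np(1-p)}}{np} \;=\; \sqrt{\frac{1-p}{np}} \;\approx\; \frac{10}{\sqrt{n}} \quad \text{for } p = 0.01, \qquad \frac{10}{\sqrt{50000}} \approx 0.045,
$$

which is the roughly 4% figure quoted for n = 50000.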

Finally, following on from #2364, we don't really need to expose the interpolation style outside the class (all uses were linear and that is unlikely to change), so I took the opportunity to remove the option from the constructor.

@valeriy42 (Contributor) left a comment:
LGTM. Great results!

```cpp
std::size_t fastSketchSize(double reductionFactor, std::size_t size) {
    size = static_cast<std::size_t>(
        static_cast<double>(size) * CQuantileSketch::REDUCTION_FACTOR / reductionFactor + 0.5);
    return size + (3 - (size + 1) % 3) % 3;
}
```
Contributor commented:
Where does 3 come from? Can we use a constant here?

@tveasey (Contributor, Author) replied:
This ensures that the loop which computes the weights never has any left-over costs to compute, since we take steps of size 3 (see the small check below).
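
To make that concrete, here is a tiny self-contained check of the rounding used in fastSketchSize above; padToStepOfThree is a name made up for illustration, not a function in the code base:

```cpp
#include <cassert>
#include <cstddef>

// The padding from fastSketchSize: after rounding, (size + 1) % 3 == 0, so a
// loop which processes entries in blocks of three never has a partial block
// left over.
std::size_t padToStepOfThree(std::size_t size) {
    return size + (3 - (size + 1) % 3) % 3;
}

int main() {
    for (std::size_t size = 1; size < 20; ++size) {
        std::size_t padded{padToStepOfThree(size)};
        assert((padded + 1) % 3 == 0);
        assert(padded >= size && padded - size < 3);
    }
    return 0;
}
```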

@tveasey merged commit ef8785d into elastic:main on Aug 16, 2022.
@tveasey deleted the quantiles-optimisation branch on August 16, 2022 at 16:42.