Conversation

@tveasey (Contributor) commented on Aug 12, 2022:

This is the second part of #2380.

Quantile estimation during training is expensive for data sets with many numeric features. This change introduces various speed-ups in this area:

  1. Batch-compute merge costs using explicit vectorisation,
  2. Use a larger reduction factor when merging buckets. I've held the reduced size fixed to maintain roughly similar accuracy; the upshot is more memory but less work per bucket,
  3. Improve handling of duplicate values: deduplicate new values before merging, since this reduces the cost of std::inplace_merge (see the sketch after this list),
  4. Be a bit more careful about how random numbers for tiebreaks are generated.
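
For item 3, a minimal sketch of the dedupe-then-merge idea; the container layout and the helper name dedupeAndMerge are illustrative assumptions, not the actual CQuantileSketch code:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using TFloatFloatPr = std::pair<float, float>; // (value, count)
using TFloatFloatPrVec = std::vector<TFloatFloatPr>;

// Illustrative only: sort and deduplicate the newly added values, accumulating
// their counts, before merging them into the already sorted prefix. Collapsing
// exact duplicates first shrinks the right-hand range and so reduces the work
// std::inplace_merge has to do.
void dedupeAndMerge(TFloatFloatPrVec& knots, std::size_t unsortedBegin) {
    auto begin = knots.begin() + static_cast<std::ptrdiff_t>(unsortedBegin);
    std::sort(begin, knots.end());

    // Collapse runs of equal values in the new (suffix) range, summing counts.
    auto out = begin;
    for (auto i = begin; i != knots.end(); /**/) {
        auto j = i;
        float count{0.0F};
        for (/**/; j != knots.end() && j->first == i->first; ++j) {
            count += j->second;
        }
        *out++ = {i->first, count};
        i = j;
    }
    knots.erase(out, knots.end());

    // Merge the deduplicated suffix into the sorted prefix.
    std::inplace_merge(knots.begin(),
                       knots.begin() + static_cast<std::ptrdiff_t>(unsortedBegin),
                       knots.end());
}
```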

Together these drop the amortised add cost from around 100 ns to 50 ns per item on my i9. The remaining cost is roughly 50% std::nth_element and 30% std::sort + std::inplace_merge; short of dropping the requirement that we extract the smallest k merge costs, it is unlikely we can make big inroads into these. While doing this it was useful to extend CFloatStorage so it can be used in constexpr expressions. I also mark all its methods noexcept, which they are and always will be; in theory this allows some extra optimisations, although the compiler should be able to deduce they don't throw.
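
As a rough illustration of that CFloatStorage change (the class below is a made-up stand-in, not the actual ml-cpp implementation), marking the wrapper's members constexpr and noexcept lets it appear in compile-time constants and documents that its operations never throw:

```cpp
// Hypothetical sketch of a constexpr-friendly, noexcept float wrapper.
class CFloatStorageSketch {
public:
    constexpr CFloatStorageSketch() noexcept = default;
    constexpr CFloatStorageSketch(float value) noexcept : m_Value{value} {}
    constexpr operator float() const noexcept { return m_Value; }
    constexpr bool operator<(CFloatStorageSketch rhs) const noexcept {
        return m_Value < rhs.m_Value;
    }

private:
    float m_Value{0.0F};
};

// Because everything is constexpr, instances can be compile-time constants.
constexpr CFloatStorageSketch SMALLEST_STORABLE{1.0F};
```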

Separately, I cap the maximum number of values we'll use to estimate quantiles. The accuracy of the quantiles we care about converges quickly with sample size. For example, the count of values less (greater) than the sample 1st (99th) percentile would be ~ Binomial(n, 0.01) for sample size n, so the relative error in the count would be O(10.0 / n^(1/2)) (the arithmetic is spelled out below). I therefore cap the maximum sample size at 50000, for which this error is around 4%.
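
Spelling out the arithmetic behind that bound (this is just the standard binomial standard-deviation argument; nothing here is specific to the implementation): the count has mean $np$ and standard deviation $\sqrt{np(1-p)}$, so

$$
\frac{\sqrt{np(1-p)}}{np} \;=\; \sqrt{\frac{1-p}{np}} \;\approx\; \frac{10}{\sqrt{n}} \quad \text{for } p = 0.01, \qquad \frac{10}{\sqrt{50000}} \approx 0.045,
$$

which is the roughly 4% figure quoted for n = 50000.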

Finally, following on from #2364, we don't really need to expose the interpolation style outside the class (all uses were linear and that is unlikely to change), so I took the opportunity to remove the option from the constructor.

@valeriy42 (Contributor) left a comment:
LGTM. Great results!

```cpp
std::size_t fastSketchSize(double reductionFactor, std::size_t size) {
    size = static_cast<std::size_t>(
        static_cast<double>(size) * CQuantileSketch::REDUCTION_FACTOR / reductionFactor + 0.5);
    return size + (3 - (size + 1) % 3) % 3;
}
```
Contributor commented:
Where does 3 come from? Can we use a constant here?

@tveasey (Contributor, Author) replied:
This ensures that the loop which computes the weights never has any left-over costs to compute, since we take steps of size 3 (see the small check below).
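
To make that concrete, here is a tiny self-contained check of the rounding used in fastSketchSize above; padToStepOfThree is a name made up for illustration, not a function in the code base:

```cpp
#include <cassert>
#include <cstddef>

// The padding from fastSketchSize: after rounding, (size + 1) % 3 == 0, so a
// loop which processes entries in blocks of three never has a partial block
// left over.
std::size_t padToStepOfThree(std::size_t size) {
    return size + (3 - (size + 1) % 3) % 3;
}

int main() {
    for (std::size_t size = 1; size < 20; ++size) {
        std::size_t padded{padToStepOfThree(size)};
        assert((padded + 1) % 3 == 0);
        assert(padded >= size && padded - size < 3);
    }
    return 0;
}
```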

@tveasey merged commit ef8785d into elastic:main on Aug 16, 2022.
@tveasey deleted the quantiles-optimisation branch on August 16, 2022 at 16:42.