
[ML] Improve hyperparameter tuning performance #1941

Merged · 37 commits · Jul 12, 2021

Conversation

@tveasey (Contributor) commented Jul 2, 2021

Our default hyperparameter tuning becomes extremely expensive at runtime on large data sets. Currently, our best advice is to use a low train fraction, but this means final train (which is a couple of orders of magnitude faster by comparison) also sees less data. This can be addressed by using a smaller proportion of the data than 1 - 1 / (number of folds) for training during hyperparameter tuning. This gives us both an escape hatch to avoid pathological runtime when someone runs against a very large data set, and a means of letting the user choose to train fast at the cost of a small amount of accuracy. I tried this out on a few different data sets and the following result is typical:

| train fraction | run time / s | R^2 |
| --- | --- | --- |
| 0.05 | 31 | 0.9969 |
| 0.3 | 48 | 0.9985 |
| 0.5 * | 64 | 0.9986 |

* This is the default behaviour. The performance gain is not proportional to the fraction of data used because:

  1. There are fixed overheads
  2. We already downsample when training and tune this parameter, preferring more downsampling if the accuracy isn't significantly affected.

(Note that, compared to downsampling while training, this has a couple of advantages if runtime is a priority: it gives a guarantee on worst-case runtime, and it improves caching since the same rows are always selected rather than a different random sample for each tree.)

This also introduces a hard limit on the maximum number of rows we will use for hyperparameter tuning and enforces it by selecting a lower train fraction when the limit would otherwise be exceeded.
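To make the interplay of these two mechanisms concrete, here is a minimal sketch; it is not the actual ml-cpp code, the function and parameter names are invented, and the 500k row cap in the example is an assumption. It derives a tuning train fraction from the 0.5 default in the table above, an optional user override, and a hard row limit.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>

// Sketch only: derive the train fraction used per fold during hyperparameter tuning.
// userFraction <= 0 means "use the default".
double tuningTrainFraction(std::size_t totalRows,
                           std::size_t numberFolds,
                           std::size_t maxTuningRows,
                           double userFraction) {
    // With plain k-fold cross-validation each fold would train on 1 - 1/#folds of the
    // data; hyperparameter tuning now uses a smaller default fraction (0.5 above).
    double foldFraction = 1.0 - 1.0 / static_cast<double>(numberFolds);
    double defaultFraction = 0.5;
    double fraction = userFraction > 0.0 ? userFraction : defaultFraction;
    // Enforce the hard cap on tuning rows by lowering the fraction if needed.
    double cap = static_cast<double>(maxTuningRows) / static_cast<double>(totalRows);
    return std::min({fraction, cap, foldFraction});
}

int main() {
    // Hypothetical example: 10M rows, 5 folds, cap tuning at 500k rows -> 0.05.
    std::cout << tuningTrainFraction(10000000, 5, 500000, -1.0) << '\n';
    return 0;
}
```

In the example the row cap dominates, so the tuning fraction drops to 0.05 even though the user did not request a lower fraction.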

A second improvement relates to the initial hyperparameter line search. I observed that:

  1. We often don't significantly improve on its result during the final BO-driven hyperparameter optimisation.
  2. We leave some performance on the table because, for large data sets, we don't get a good estimate of the true minimum of the loss from the best value selected by line search.

The second problem is slightly tricky. For small data sets, losses at different hyperparameter settings are typically noisy and our current strategy of fitting a parabola through the points works well. For large data sets the loss curve is smooth, but often non-parabolic over the range we explore. Fitting a LOWESS regression to the loss curve instead performs well for finding the "true" minimum (also better than interpolation by a GP, which I tried). The issue of different amounts of noise is handled by choosing the amount of smoothing by maximum likelihood. This change generally gives us a small performance bump and, importantly, means we can often skip fine tuning altogether. As such, I now allow max_optimization_rounds_per_hyperparameter to be set to zero.
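To illustrate the shape of the idea, the sketch below is a minimal LOWESS-style smoother plus a grid minimiser. It is not the CLowess implementation: the names are invented, and the neighbourhood size k is simply passed in rather than chosen by maximum likelihood as described above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Smoothed value at x0: weighted linear fit over the k nearest points with a
// tricube kernel (the classic LOWESS local regression).
double lowessAt(const std::vector<double>& x, const std::vector<double>& y,
                double x0, std::size_t k) {
    std::vector<double> d(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        d[i] = std::fabs(x[i] - x0);
    }
    std::vector<double> sorted(d);
    std::nth_element(sorted.begin(), sorted.begin() + k - 1, sorted.end());
    double dmax = std::max(sorted[k - 1], 1e-12);

    // Weighted least squares for y ~ a + b * (x - x0).
    double sw = 0, swx = 0, swy = 0, swxx = 0, swxy = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        double u = d[i] / dmax;
        if (u >= 1.0) {
            continue;
        }
        double w = std::pow(1.0 - u * u * u, 3.0); // tricube weight
        double dx = x[i] - x0;
        sw += w; swx += w * dx; swy += w * y[i];
        swxx += w * dx * dx; swxy += w * dx * y[i];
    }
    double denom = sw * swxx - swx * swx;
    if (std::fabs(denom) < 1e-12) {
        return swy / std::max(sw, 1e-12); // fall back to the weighted mean
    }
    double b = (sw * swxy - swx * swy) / denom;
    double a = (swy - b * swx) / sw;
    return a; // fitted value at x0, where dx = 0
}

// Approximate minimiser of the smoothed loss curve on a fine grid.
double argminSmoothedLoss(const std::vector<double>& x, const std::vector<double>& y,
                          std::size_t k, std::size_t gridPoints = 200) {
    double lo = *std::min_element(x.begin(), x.end());
    double hi = *std::max_element(x.begin(), x.end());
    double best = lo;
    double bestValue = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < gridPoints; ++i) {
        double x0 = lo + (hi - lo) * static_cast<double>(i) /
                    static_cast<double>(gridPoints - 1);
        double value = lowessAt(x, y, x0, k);
        if (value < bestValue) {
            bestValue = value;
            best = x0;
        }
    }
    return best;
}
```

The point is simply that a local linear fit with a tricube kernel can track a smooth but non-parabolic loss curve, so its minimiser is a better estimate of the true minimum than a single global parabola.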

One last issue was that, when we characterised the loss variance across folds, we included the variance in the mean loss. For small data sets in particular, this can be due to sampling effects: one test fold contains examples which are harder to predict. Over multiple rounds we can estimate this component effectively and remove it from the variance when we fit the GP.

@tveasey added the v7.15.0 label Jul 2, 2021
@valeriy42 (Contributor) left a comment

Good work on improving the performance by integrating a function smoother into the minimizer! I have a couple of comments, mostly regarding readability.

include/maths/CLowess.h (review comments resolved)
```cpp
BOOST_REQUIRE_CLOSE_ABSOLUTE(
    0.0, bias, 4.0 * std::sqrt(noiseVariance / static_cast<double>(trainRows)));
// Good R^2...
BOOST_TEST_REQUIRE(rSquared > 0.98);
```
@valeriy42 (Contributor): nice!

lib/maths/unittest/CLowessTest.cc (review comments resolved)
Comment on lines +199 to +203:

```cpp
// Test minimization of some training loss curves from boosted tree hyperparameter
// line searches for:
// 1. Miniboone
// 2. Car-parts
// 3. Boston
```
@valeriy42 (Contributor): good work! 🚀

@tveasey (Contributor, Author) commented Jul 9, 2021

Thanks for the review @valeriy42! I think I've addressed everything. I also disabled writing out extra stats and model metadata for the time being. This requires changes to the Java code as well and I'll make those together.

@valeriy42 (Contributor) left a comment

Good work, and thank you for the explanation of how the mean variance is handled. I have just a couple of minor comments: take them or leave them. LGTM 🚀

```cpp
TSizeVecVec testingMasks;
this->setupMasks(numberFolds, trainingMasks, testingMasks);

TDoubleVec K(17);
```
@valeriy42 (Contributor): I would have preferred to call m_K m_SmoothingParameter. Due to our coding standards, it has to be a capital K, although it relates to the small k in the formulas.

lib/maths/CBoostedTreeImpl.cc (review comment resolved)
Comment on lines +620 to +647:

```cpp
// So what are we doing here? When we supply function values we also supply their
// error variance. Typically these might be the mean test loss function across
// folds and their variance for a particular choice of hyperparameters. Sticking
// with this example, the variance allows us to estimate the error w.r.t. the
// true generalisation error due to finite sample size. We can think of the source
// of this variance as being due to two effects: one which shifts the loss values
// in each fold (this might be due to some folds simply having more hard examples)
// and another which permutes the order of loss values. A shift in the loss function
// is not something we wish to capture in the GP: it shouldn't materially affect
// where to choose points to test since any sensible optimisation strategy should
// only care about the difference in loss between points, which is unaffected by a
// shift. More formally, if we assume the shift and permutation errors are independent
// we have for losses l_i, mean loss per fold m_i and mean loss for a given set of
// hyperparameters m that the variance is
//
//   sum_i{ (l_i - m)^2 } = sum_i{ (l_i - m_i + m_i - m)^2 }
//                        = sum_i{ (l_i - m_i)^2 } + sum_i{ (m_i - m)^2 }
//                        = "permutation variance" + "shift variance"          (1)
//
// with the cross-term expected to be small by independence. (Note, the independence
// assumption is reasonable if one assumes that the shift is due to a mismatch in hard
// examples since we choose folds independently at random.) We can estimate the
// shift variance by looking at the mean loss over all distinct hyperparameter settings
// and we assume it is supplied as the parameter m_ExplainedErrorVariance. It should
// also be smaller than the variance by construction although for numerical stability
// we prevent the difference becoming too small. As discussed, here we wish to return
// the permutation variance which we get by rearranging (1).
```

@valeriy42 (Contributor): Nice 👍 Thank you for the explanation! 📖
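As an illustration of rearranging (1) above, the sketch below estimates the shift variance from per-fold mean losses over all hyperparameter settings tried so far and subtracts it from the latest round's total variance. It is only a sketch under those assumptions, not the actual ml-cpp code; all names are invented.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

double mean(const std::vector<double>& values) {
    double sum = 0.0;
    for (double value : values) {
        sum += value;
    }
    return sum / static_cast<double>(values.size());
}

// losses[r][j] = test loss of fold j for the r-th hyperparameter setting tried so far.
// Returns an estimate of the "permutation" variance of the latest round's fold losses,
// i.e. the total variance minus the "shift" variance explained by per-fold mean losses.
double permutationVariance(const std::vector<std::vector<double>>& losses) {
    std::size_t folds = losses[0].size();

    // Estimate the per-fold shift as the mean loss of each fold over all settings.
    std::vector<double> foldMeans(folds, 0.0);
    for (const auto& round : losses) {
        for (std::size_t j = 0; j < folds; ++j) {
            foldMeans[j] += round[j];
        }
    }
    for (double& foldMean : foldMeans) {
        foldMean /= static_cast<double>(losses.size());
    }

    // "Shift" variance: spread of the per-fold means about their grand mean.
    double grandMean = mean(foldMeans);
    double shiftVariance = 0.0;
    for (double foldMean : foldMeans) {
        shiftVariance += (foldMean - grandMean) * (foldMean - grandMean);
    }
    shiftVariance /= static_cast<double>(folds);

    // Total variance of the latest round's fold losses about their mean.
    const auto& latest = losses.back();
    double roundMean = mean(latest);
    double totalVariance = 0.0;
    for (double loss : latest) {
        totalVariance += (loss - roundMean) * (loss - roundMean);
    }
    totalVariance /= static_cast<double>(folds);

    // Rearranging (1), with a small floor for numerical stability.
    return std::max(totalVariance - shiftVariance, 1e-8 * totalVariance);
}
```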

@tveasey tveasey merged commit 09d5444 into elastic:master Jul 12, 2021
@tveasey tveasey deleted the select-data-size branch July 12, 2021 13:13
tveasey added a commit that referenced this pull request Jul 28, 2021
…1960)

This makes two changes to deal better with small data sets highlighted by a failure in our QA suite as a result of #1941.
In particular,
1. We could miss out rare classes altogether from our validation set for small data sets.
2. We can lose a lot of accuracy by over restricting the number of features we use for small data sets.

Problem 1 is a result of the stratified sampling we perform. If a class is rare and the data set is small, we could choose never to sample it in the validation set because it could constitute fewer than one example per fold. In this case the fraction of each class changes significantly in the remaining unsampled set for each fold we sample, but we compute the desired class counts once upfront based on their overall frequency. We simply need to recompute the desired counts per class based on the frequencies in the remainder in the loop which samples each new fold (see the sketch after this commit message).

Problem 2 requires that we allow ourselves to use more features than are implied by our default constraint of having n examples per feature for small data sets. Since we automatically remove nuisance features based on their MICe with the target, we typically don't suffer a loss in QoR from allowing ourselves to select extra features. Furthermore, for small data sets runtime is never problematic. For the multi-class classification problem which surfaced this issue, accuracy increases from around 0.2 to 0.9 as a result of this change.
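The sketch below illustrates the Problem 1 fix referenced above: the desired per-class counts are recomputed from the rows that remain unsampled before each fold is drawn, so a rare class whose relative frequency grows in the remainder is eventually sampled. The names and the rounding choice are hypothetical; this is not the ml-cpp implementation.

```cpp
#include <cstddef>
#include <map>

using TClassCounts = std::map<int, std::size_t>;

// remaining: unsampled rows per class; foldSize: rows to draw for the next fold.
// Desired counts are recomputed from the *remaining* class frequencies rather than
// once upfront from the overall frequencies.
TClassCounts desiredCountsForFold(const TClassCounts& remaining, std::size_t foldSize) {
    std::size_t remainingTotal = 0;
    for (const auto& entry : remaining) {
        remainingTotal += entry.second;
    }

    TClassCounts desired;
    for (const auto& entry : remaining) {
        double expected = static_cast<double>(foldSize) *
                          static_cast<double>(entry.second) /
                          static_cast<double>(remainingTotal);
        // Simple nearest-integer rounding for this sketch.
        desired[entry.first] = static_cast<std::size_t>(expected + 0.5);
    }
    return desired;
}

// The sampling loop would call desiredCountsForFold once per fold and subtract the
// rows it actually sampled from `remaining` before the next iteration.
```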
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Aug 16, 2021
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Aug 16, 2021
…lastic#1960)

tveasey added a commit that referenced this pull request Aug 16, 2021
…1992)
