
[ML] Improve regression and classification QoR for small data sets #1960

Merged
merged 5 commits into elastic:master from small-data-sets on Jul 28, 2021

Conversation

@tveasey (Contributor) commented Jul 26, 2021

This makes two changes to deal better with small data sets, prompted by a failure in our QA suite which surfaced as a result of #1941. In particular,

  1. For small data sets, rare classes could be missed from the validation set altogether.
  2. We can lose a lot of accuracy by over-restricting the number of features we use for small data sets.

Problem 1 is a result of the stratified sampling we perform. If a class is rare and the data set is small, we could choose never to sample it for the validation set because it could constitute fewer than one example per fold. In this case, the fraction of each class changes significantly in the remaining unsampled set after each fold we sample, but we compute the desired class counts only once upfront based on their overall frequencies. We simply need to recompute the desired count per class based on the frequencies in the remainder, inside the loop which samples each new fold.
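
For concreteness, here is a minimal C++ sketch of the idea (not the actual ml-cpp code; the class labels, fold count and rounding rule are illustrative only): the desired per-class counts are recomputed from the remaining unsampled examples at the start of each fold, so a rare class still gets sampled even when it amounts to fewer than one example per fold overall.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <map>
#include <random>
#include <vector>

int main() {
    // Hypothetical tiny data set: one class label per example, with class 2 rare.
    std::vector<int> labels{0, 0, 0, 0, 0, 1, 1, 1, 1, 2};
    std::size_t numberFolds{5};

    std::mt19937 rng{42};
    std::vector<std::size_t> remainder(labels.size());
    for (std::size_t i = 0; i < remainder.size(); ++i) {
        remainder[i] = i;
    }

    for (std::size_t fold = 0; fold < numberFolds; ++fold) {
        // Recompute the desired count per class from the *remaining* examples,
        // not from the overall frequencies computed once upfront.
        std::map<int, std::size_t> remainingCounts;
        for (auto i : remainder) {
            ++remainingCounts[labels[i]];
        }

        std::size_t foldsLeft{numberFolds - fold};
        std::vector<std::size_t> foldSample;
        for (const auto& [label, count] : remainingCounts) {
            // Round up so a class with fewer than one remaining example per
            // fold is still sampled (illustrative choice, not ml-cpp's).
            std::size_t desired{(count + foldsLeft - 1) / foldsLeft};

            std::vector<std::size_t> candidates;
            for (auto i : remainder) {
                if (labels[i] == label) {
                    candidates.push_back(i);
                }
            }
            std::shuffle(candidates.begin(), candidates.end(), rng);
            candidates.resize(std::min(desired, candidates.size()));
            foldSample.insert(foldSample.end(), candidates.begin(), candidates.end());
        }

        // Remove the sampled examples from the pool for subsequent folds.
        for (auto i : foldSample) {
            remainder.erase(std::find(remainder.begin(), remainder.end(), i));
        }

        std::cout << "fold " << fold << " sampled " << foldSample.size() << " examples\n";
    }
    return 0;
}
```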

Problem 2 requires that we allow ourselves to use more features than are implied by our default constraint of having n examples per feature for small data sets. Since we automatically remove nuisance features based on their MICe with the target, we typically don't suffer a loss in QoR from allowing ourselves to select extra features. Furthermore, runtime is never problematic for small data sets. For the multi-class classification problem which surfaced this issue, accuracy increases from around 0.2 to 0.9 as a result of this change!
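
As a rough illustration of the second change (a hedged sketch; the maximumNumberFeatures helper, the threshold and the ratios are hypothetical, not the values used by ml-cpp), relaxing the examples-per-feature constraint for small data sets could look like this:

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>

// Hypothetical helper: an upper bound on the number of features to use given
// the number of training examples. Thresholds and ratios are illustrative.
std::size_t maximumNumberFeatures(std::size_t numberTrainingRows) {
    // Default constraint: roughly 20 examples per feature.
    std::size_t defaultRowsPerFeature{20};
    // For small data sets relax the constraint so we don't over-restrict the
    // number of features; MICe-based filtering still drops nuisance features.
    std::size_t rowsPerFeature{numberTrainingRows < 500
                                   ? std::max<std::size_t>(defaultRowsPerFeature / 4, 1)
                                   : defaultRowsPerFeature};
    return std::max<std::size_t>(numberTrainingRows / rowsPerFeature, 1);
}

int main() {
    for (std::size_t rows : {50, 200, 1000, 10000}) {
        std::cout << rows << " rows -> up to " << maximumNumberFeatures(rows)
                  << " features\n";
    }
    return 0;
}
```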

@valeriy42 (Contributor) left a comment

LGTM 🚀

@tveasey tveasey merged commit 556afc1 into elastic:master Jul 28, 2021
@tveasey tveasey deleted the small-data-sets branch July 28, 2021 13:22
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Aug 16, 2021
…lastic#1960)
