[7.x][ML] Improve regression and classification QoR for small data sets #1992

Merged
merged 2 commits into from
Aug 16, 2021

Conversation

tveasey
Contributor

@tveasey tveasey commented Aug 16, 2021

Backport #1960.

…lastic#1960)

This makes two changes to better handle small data sets. Both problems were highlighted by a failure in our QA suite that resulted from elastic#1941.
In particular,
1. For small data sets, we could omit rare classes from our validation set altogether.
2. We can lose a lot of accuracy by over-restricting the number of features we use for small data sets.

Problem 1 is a result of the stratified sampling we perform. If a class is rare and the data set is small, we could end up
never sampling it in the validation set because it could constitute fewer than one example per fold. In this case, the
fraction of each class in the remaining unsampled set changes significantly with each fold we sample, but we
computed the desired class counts once upfront based on their overall frequency. The fix is to recompute the desired
counts per class from the frequencies in the remainder, inside the loop which samples each new fold.
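
The difference between the two ways of computing the per-class counts can be sketched as follows. This is a minimal illustration and not the ml-cpp implementation: the function `stratified_folds`, its `recompute` flag, and the round-half-up rule are all assumptions made for the example.

```python
import math
import random
from collections import defaultdict


def stratified_folds(labels, num_folds, recompute=True, seed=0):
    """Split row indices into num_folds stratified folds (hypothetical helper).

    With recompute=False the per-class target is fixed upfront from the
    overall class counts. With recompute=True it is recomputed from the
    rows that are still unsampled before each fold is drawn.
    """
    rng = random.Random(seed)
    remaining = defaultdict(list)
    for row, label in enumerate(labels):
        remaining[label].append(row)
    for rows in remaining.values():
        rng.shuffle(rows)

    # Targets computed once from the overall class frequencies.
    upfront = {c: len(rows) / num_folds for c, rows in remaining.items()}

    folds = []
    for fold in range(num_folds):
        folds_left = num_folds - fold
        chosen = []
        for c, rows in remaining.items():
            # Recomputed target: spread what is left of class c evenly
            # over the folds that still have to be sampled.
            target = len(rows) / folds_left if recompute else upfront[c]
            take = min(len(rows), math.floor(target + 0.5))  # round half up
            chosen.extend(rows.pop() for _ in range(take))
        folds.append(sorted(chosen))
    return folds


# 50 rows, 5 folds: class "b" has only 2 rows, i.e. 0.4 rows per fold.
labels = ["a"] * 48 + ["b"] * 2
fixed = stratified_folds(labels, 5, recompute=False)
adaptive = stratified_folds(labels, 5, recompute=True)
```

With the upfront targets, class "b" rounds to zero rows in every fold and is never sampled. With recomputed targets its share of the remainder grows as folds are drawn, so both of its rows end up in some fold.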

Problem 2 requires that, for small data sets, we allow ourselves to use more features than are implied by our default
constraint of having n examples per feature. Since we automatically remove nuisance features based on their MICe with
the target, we typically don't suffer any loss in QoR from allowing ourselves to select extra features. Furthermore, for small
data sets runtime is never problematic. For the multi-class classification problem which exposed this issue,
accuracy increases from around 0.2 to 0.9 as a result of this change.
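
One way to picture the relaxed constraint is a cap on the number of selected features that never drops below a fixed floor. The description above does not give the value of n or say how the limit is relaxed, so everything in this sketch is an assumption: the function name, the `rows_per_feature` and `small_data_floor` parameters, and their default values.

```python
def max_selected_features(num_rows, num_features,
                          rows_per_feature=50, small_data_floor=10):
    """Upper bound on the features to keep (all values are hypothetical)."""
    # Default constraint: at least rows_per_feature examples per feature.
    by_rows = num_rows // rows_per_feature
    # Relaxation for small data sets: never cap below small_data_floor.
    return min(num_features, max(by_rows, small_data_floor))


small = max_selected_features(num_rows=100, num_features=20)
large = max_selected_features(num_rows=100_000, num_features=20)
```

In this sketch a 100-row data set would be limited to 2 features by the default constraint alone, and the floor raises that to 10. For large data sets the floor has no effect.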
@tveasey tveasey merged commit 6c8e23c into elastic:7.x Aug 16, 2021
@tveasey tveasey deleted the port/1960 branch August 16, 2021 16:58