[ML] Improve regression and classification QoR for small data sets #1960

tveasey · 2021-07-26T13:35:25Z

This makes two changes to better handle small data sets. Both were highlighted by a failure in our QA suite following #1941. In particular:

1. We could miss rare classes altogether from our validation set for small data sets.
2. We can lose a lot of accuracy by over-restricting the number of features we use for small data sets.

Problem 1 is a result of the stratified sampling we perform. If a class is rare and the data set is small, we could end up never sampling it in the validation set because it could constitute fewer than one example per fold. In this case, the fraction of each class in the remaining unsampled set changes significantly with each fold we sample, but we compute the desired class counts once up front based on their overall frequency. The fix is to recompute the desired count per class from the frequencies in the remainder, inside the loop which samples each new fold.
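To illustrate the idea behind the fix for problem 1, here is a minimal Python sketch. It is not the code in this PR: the function name `stratified_folds`, the randomised rounding of fractional counts, and the example data are all assumptions made for illustration. The point it shows is that the desired count per class is derived from what is left in the remainder each time a fold is sampled, so a class with fewer than one example per fold is still placed in some fold.

```python
import random
from collections import defaultdict


def stratified_folds(labels, num_folds, seed=0):
    """Hypothetical sketch: split row indices into stratified folds,
    recomputing the desired count per class from the remainder for each fold."""
    rng = random.Random(seed)

    # Group row indices by class and shuffle within each class.
    remaining = defaultdict(list)
    for row, label in enumerate(labels):
        remaining[label].append(row)
    for rows in remaining.values():
        rng.shuffle(rows)

    folds = []
    for k in range(num_folds):
        folds_left = num_folds - k
        fold = []
        for label in sorted(remaining):
            rows = remaining[label]
            # Desired count is based on the rows still unsampled,
            # not on the overall class frequency computed up front.
            desired = len(rows) / folds_left
            count = int(desired)
            # Assumption: handle the fractional part by randomised rounding.
            if rng.random() < desired - count:
                count += 1
            fold.extend(rows.pop() for _ in range(count))
        folds.append(fold)
    return folds


# 27 rows of a common class and 3 rows of a rare class, split into 5 folds.
labels = ["common"] * 27 + ["rare"] * 3
folds = stratified_folds(labels, num_folds=5)
```

With counts fixed up front and rounded down, the rare class here would get `int(3 / 5) == 0` rows in every fold. In the sketch, the last fold always takes whatever remains of each class, so every row, including the rare ones, ends up in exactly one fold.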

Problem 2 requires that, for small data sets, we allow ourselves to use more features than our default constraint of n examples per feature implies. Since we automatically remove nuisance features based on their MICe with the target, we typically don't suffer a loss in QoR from allowing ourselves to select extra features. Furthermore, for small data sets runtime is never problematic. For the multi-class classification problem which exposed this issue, accuracy increases from around 0.2 to 0.9 as a result of this change!
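As a rough sketch of the kind of relaxation problem 2 describes, the Python snippet below caps the number of selected features at one per fixed number of rows, but never lets that cap fall below a floor for small data sets. The function name, the `rows_per_feature` value, and the `small_data_floor` value are hypothetical and are not taken from this PR.

```python
def max_features_to_select(num_rows, num_candidate_features,
                           rows_per_feature=50, small_data_floor=10):
    """Hypothetical sketch of a relaxed feature cap for small data sets."""
    # Default constraint: one feature per `rows_per_feature` rows.
    default_cap = num_rows // rows_per_feature
    # Relaxation: small data sets may still use up to `small_data_floor` features.
    relaxed_cap = max(default_cap, small_data_floor)
    # Never select more features than exist, and always allow at least one.
    return max(1, min(relaxed_cap, num_candidate_features))


small = max_features_to_select(num_rows=150, num_candidate_features=20)
large = max_features_to_select(num_rows=5000, num_candidate_features=200)
```

Under these assumed values, the 150-row data set would be limited to 3 features by the default constraint alone, whereas the relaxed cap allows 10. The large data set is unaffected, because its default cap already exceeds the floor.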

valeriy42

LGTM 🚀


Better handling of small data sets

5e6d4ef

tveasey added review >non-issue v8.0.0 :ml/DataFrameAnalysis v7.15.0 labels Jul 26, 2021

tveasey requested a review from valeriy42 July 26, 2021 13:35

tveasey removed the >non-issue label Jul 26, 2021

tveasey added 4 commits July 26, 2021 14:37

Docs

9f55a09

Correct test threshold for estimated memory for new behaviour

f38ef8d

Test failures on Linux and Windows

c709881

Test failure on Linux

9b1ff1b

valeriy42 approved these changes Jul 28, 2021


tveasey merged commit 556afc1 into elastic:master Jul 28, 2021

tveasey deleted the small-data-sets branch July 28, 2021 13:22

tveasey mentioned this pull request Aug 16, 2021

[7.x][ML] Improve regression and classification QoR for small data sets #1992

Merged
