[7.x][ML] Improve regression and classification QoR for small data sets #1992

Merged 2 commits on Aug 16, 2021

Commits on Aug 16, 2021

  1. [ML] Improve regression and classification QoR for small data sets (elastic#1960)
    
    This makes two changes to better handle small data sets, prompted by a failure in our QA suite that resulted from elastic#1941.
    In particular:
    1. For small data sets, rare classes could be missing altogether from our validation set.
    2. For small data sets, we can lose a lot of accuracy by over-restricting the number of features we use.
    
    Problem 1 is a result of the stratified sampling we perform. If a class is rare and the data set is small, we might
    never sample it into the validation set because it can constitute fewer than one example per fold. In this case, the
    fraction of each class in the remaining unsampled set changes significantly with each fold we sample, but we
    computed the desired class counts once upfront based on their overall frequency. The fix is to recompute the desired
    counts per class from the frequencies in the remainder, inside the loop which samples each new fold.
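
A minimal Python sketch of the difference between the two strategies follows. It is not the ml-cpp implementation: the function name `stratified_folds`, the `recompute` flag, and the choice to truncate fractional counts toward zero are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, num_folds, recompute=True, seed=0):
    """Split row indices into num_folds stratified folds (illustrative only).

    recompute=False: desired per-class counts are computed once, upfront,
    from the overall class frequencies.
    recompute=True: desired per-class counts are recomputed from the rows
    still unsampled before each new fold is drawn.
    """
    rng = random.Random(seed)

    # Group the unsampled row indices by class label.
    remaining = defaultdict(list)
    for row, label in enumerate(labels):
        remaining[label].append(row)
    for rows in remaining.values():
        rng.shuffle(rows)

    # Desired count per class per fold, fixed from the overall frequencies.
    upfront = {label: len(rows) / num_folds for label, rows in remaining.items()}

    folds = []
    for fold in range(num_folds):
        folds_left = num_folds - fold
        sampled = []
        for label, rows in remaining.items():
            desired = len(rows) / folds_left if recompute else upfront[label]
            # Assumption: fractional counts are truncated, so a class with
            # fewer than one expected example per fold gets zero rows.
            count = min(int(desired), len(rows))
            sampled += [rows.pop() for _ in range(count)]
        folds.append(sampled)
    return folds


labels = ["common"] * 18 + ["rare"] * 2
for recompute in (False, True):
    folds = stratified_folds(labels, num_folds=5, recompute=recompute)
    rare = sum(labels[row] == "rare" for fold in folds for row in fold)
    print(f"recompute={recompute}: rare examples sampled = {rare}")
```

With two rare examples and five folds, the upfront desired count is 0.4 per fold, so the rare class is never sampled. When the count is recomputed from the remainder, it reaches one example once only two folds are left, so both rare examples end up in a fold.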
    
    Problem 2 requires that, for small data sets, we allow ourselves to use more features than are implied by our default
    constraint of having n examples per feature. Since we automatically remove nuisance features based on their MICe with
    the target, we typically don't suffer a loss in QoR from allowing ourselves to select extra features. Furthermore, runtime
    is never problematic for small data sets. For the multi-class classification problem which exposed this issue,
    accuracy increases from around 0.2 to 0.9 as a result of this change.
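
The commit message does not say how the constraint is relaxed. As a rough illustration only, the Python sketch below shows one hypothetical way a cap of n examples per feature could be loosened for small data sets by adding a floor. The function `max_selected_features`, the `min_features` parameter, and all numeric values are assumptions, not taken from ml-cpp.

```python
def max_selected_features(num_rows, num_candidates, rows_per_feature, min_features=1):
    """Upper bound on how many features to select (illustrative only).

    rows_per_feature plays the role of "n examples per feature" in the
    default constraint. min_features is a hypothetical floor which keeps
    the cap from collapsing when num_rows is small.
    """
    # Default constraint: one feature per rows_per_feature examples.
    cap = num_rows // rows_per_feature
    # Hypothetical relaxation: never go below the floor.
    cap = max(cap, min_features)
    # We can never select more features than there are candidates.
    return min(cap, num_candidates)


# Hypothetical numbers: 100 rows, 20 candidate features, n = 50.
print(max_selected_features(100, 20, rows_per_feature=50))                   # 2
print(max_selected_features(100, 20, rows_per_feature=50, min_features=10))  # 10
```

The floor only matters for small data sets. Once `num_rows // rows_per_feature` exceeds `min_features`, the default constraint applies unchanged.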
    tveasey committed Aug 16, 2021
    Commit 7fea7c7
  2. Test robustness

    tveasey committed Aug 16, 2021
    Commit b8af317