Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Stratified cross validation for regression #784

Merged
merged 9 commits into from
Oct 28, 2019

Conversation

tveasey
Copy link
Contributor

@tveasey tveasey commented Oct 25, 2019

This implements stratified fractional cross validation for regression, i.e. independently and uniformly random sample from each decile of the target variable. It also better controls the proportion of samples per fold (to be nearest integer to "total training examples" / "number folds") for classification.

Copy link
Contributor

@valeriy42 valeriy42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just a single minor comment about the variable naming.

Since stratified regression samplings for regression are not as common as for classification, it would be great if you would provide an activation switch for it. It would fine if you would do it in a follow-up PR.

}
};

TDoubleVec weights;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name weights is confusing here. Can you maybe rename it to something more self-explanatory, i.e. decentileCounts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok good suggestion. I renamed in this commit. (Note I don't use decile because I passed in the number of buckets to optionally disable.)

@tveasey tveasey merged commit 4cde1d7 into elastic:master Oct 28, 2019
@tveasey tveasey deleted the stratified-cv-for-regression branch October 28, 2019 20:57
}
LOG_DEBUG(<< "variance in test set target percentile = "
<< maths::CBasicStatistics::variance(testTargetDecileMoments));
BOOST_REQUIRE(maths::CBasicStatistics::variance(testTargetDecileMoments) < 0.02);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth noting that BOOST_TEST_REQUIRE here would print the values either side of the < in the event of failure, whereas BOOST_REQUIRE won't.

It seems that BOOST_REQUIRE(condition) is the equivalent of BOOST_TEST_REQUIRE((condition)). I wasn't aware of this until now, and BOOST_REQUIRE is the cleaner solution for complex conditions where diagnostics cannot be printed. But for this line the improved diagnostics could be printed by BOOST_TEST_REQUIRE.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this in #788

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants