-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Stratified cross validation for regression #784
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Just a single minor comment about the variable naming.
Since stratified regression samplings for regression are not as common as for classification, it would be great if you would provide an activation switch for it. It would fine if you would do it in a follow-up PR.
lib/maths/CDataFrameUtils.cc
Outdated
} | ||
}; | ||
|
||
TDoubleVec weights; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable name weights
is confusing here. Can you maybe rename it to something more self-explanatory, i.e. decentileCounts
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok good suggestion. I renamed in this commit. (Note I don't use decile because I passed in the number of buckets to optionally disable.)
} | ||
LOG_DEBUG(<< "variance in test set target percentile = " | ||
<< maths::CBasicStatistics::variance(testTargetDecileMoments)); | ||
BOOST_REQUIRE(maths::CBasicStatistics::variance(testTargetDecileMoments) < 0.02); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worth noting that BOOST_TEST_REQUIRE
here would print the values either side of the <
in the event of failure, whereas BOOST_REQUIRE
won't.
It seems that BOOST_REQUIRE(condition)
is the equivalent of BOOST_TEST_REQUIRE((condition))
. I wasn't aware of this until now, and BOOST_REQUIRE
is the cleaner solution for complex conditions where diagnostics cannot be printed. But for this line the improved diagnostics could be printed by BOOST_TEST_REQUIRE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this in #788
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
This implements stratified fractional cross validation for regression, i.e. independently and uniformly random sample from each decile of the target variable. It also better controls the proportion of samples per fold (to be nearest integer to "total training examples" / "number folds") for classification.