Fix 'RF' error for LightGBM Classifier #1302
Conversation
Codecov Report
@@           Coverage Diff           @@
##             main    #1302   +/-   ##
=======================================
  Coverage   99.94%   99.94%
=======================================
  Files         213      213
  Lines       13357    13387    +30
=======================================
+ Hits        13349    13379    +30
  Misses          8        8
Continue to review full report at Codecov.
evalml/pipelines/components/estimators/classifiers/lightgbm_classifier.py
@bchen1116 thanks for digging into this! I understand 1 of the 2 bug fixes. I left a comment asking for an explanation of the 2nd, just so I can follow along. I also left a couple questions/suggestions about how to set up the new default parameters. Approved pending resolution of those conversations.
evalml/pipelines/components/estimators/classifiers/lightgbm_classifier.py
@@ -30,7 +30,7 @@ class LightGBMClassifier(Estimator):
     SEED_MIN = 0
     SEED_MAX = SEED_BOUNDS.max_bound

-    def __init__(self, boosting_type="gbdt", learning_rate=0.1, n_estimators=100, max_depth=0, num_leaves=31, min_child_samples=20, n_jobs=-1, random_state=0, **kwargs):
+    def __init__(self, boosting_type="gbdt", learning_rate=0.1, n_estimators=100, max_depth=0, num_leaves=31, min_child_samples=20, n_jobs=-1, random_state=0, bagging_fraction=0.9, bagging_freq=0, **kwargs):
Why default `bagging_freq` to 0? Won't that cause the bug when `boosting_type="rf"`? What default does lightgbm choose for this parameter?
LightGBM defaults to 0 for `bagging_freq`. Users can set it to 1 and change `bagging_fraction` if they want to speed up computation and randomly select data for other boosting types, but it's required to be 1 for `boosting_type=rf` (along with `0 < bagging_fraction < 1.0`).
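To make the constraint concrete, here is a minimal sketch against lightgbm's scikit-learn API (not code from this PR; the toy data and the 0.9 value are made up). It uses `subsample` and `subsample_freq`, lightgbm's scikit-learn aliases for `bagging_fraction` and `bagging_freq`:

```python
import numpy as np
from lightgbm import LGBMClassifier

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)

# With the library defaults (bagging disabled), boosting_type="rf" errors out:
# LGBMClassifier(boosting_type="rf").fit(X, y)  # raises LightGBMError

# Enabling bagging satisfies the "rf" requirement: a bagging frequency of at
# least 1 and a bagging fraction strictly between 0 and 1.
clf = LGBMClassifier(boosting_type="rf", subsample=0.9, subsample_freq=1)
clf.fit(X, y)
```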
Got it. This looks good. Is 0.9 the default `bagging_fraction` in lightgbm?
@dsherry it defaults to 1.0
evalml/pipelines/components/estimators/classifiers/lightgbm_classifier.py
"n_jobs": n_jobs} | ||
"n_jobs": n_jobs, | ||
"bagging_freq": bagging_freq, | ||
"bagging_fraction": bagging_fraction} |
@bchen1116 could you please explain why adding these two parameters fixed the bug?
As some background, LightGBM has 4 boosting types: "gbdt", "dart", "goss", and "rf". `bagging_freq` refers to the frequency of bagging: LightGBM bags every `bagging_freq = k` iterations (0 means it doesn't bag at all). `bagging_fraction` refers to the fraction of data randomly selected without resampling (1 means select all, 0 means none). This can help speed up the training process.

The default `bagging_freq` that LightGBM sets is 0, which works with `gbdt`, `dart`, and `goss`. However, for `rf`, since it's a random forest, LightGBM requires that bagging is used, which means `bagging_freq` must be 1 and `bagging_fraction` must be set below 1.0. By adding those two parameters and changing `bagging_freq` when `boosting_type=rf`, we have a simple fix that avoids this bug.
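A minimal sketch of the fix described above (a hypothetical simplification, not the exact evalml constructor; the helper name and structure are illustrative):

```python
def build_lgbm_parameters(boosting_type="gbdt", bagging_fraction=0.9, bagging_freq=0, **kwargs):
    """Assemble the parameter dict passed to lightgbm, fixing up bagging for "rf"."""
    # If the boosting type is random forest, bagging is required by lightgbm,
    # so we set bagging_freq to 1 in order to avoid errors. bagging_fraction
    # defaults to 0.9, keeping it strictly below 1.0 as "rf" requires.
    if boosting_type == "rf":
        bagging_freq = 1
    parameters = {
        "boosting_type": boosting_type,
        "bagging_freq": bagging_freq,
        "bagging_fraction": bagging_fraction,
    }
    parameters.update(kwargs)
    return parameters
```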
Thanks for the clear explanation! That makes sense.
Can we tweak the comment you left on line 48:
if the boosting type is random forest, bagging is required by lightgbm, so we set bagging_freq to 1 in order to avoid errors
@bchen1116 looks great!
"n_jobs": n_jobs} | ||
"n_jobs": n_jobs, | ||
"bagging_freq": bagging_freq, | ||
"bagging_fraction": bagging_fraction} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clear explanation! That makes sense.
Can we tweak the comment you left on line 48:
if the boosting type is random forest, bagging is required by lightgbm, so we set bagging_freq to 1 in order to avoid errors
LightGBM doesn't have smart defaults for `bagging_fraction`.

Thanks @bchen1116. Yep, agreed, we may be able to find a better default than the value of 0.9 for `bagging_fraction`.
fix #1251 and #1267

- Add in `bagging_freq` and `bagging_fraction` parameters for LightGBM classifier
- Set `num_leaves` hyperparameter to start at 2 rather than 1, since LightGBM expects a value > 1 (see the sketch below)
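For the second change, a hypothetical sketch of what the adjusted hyperparameter range could look like, assuming skopt-style ranges as evalml components use (the other entries shown are illustrative, not taken from this PR):

```python
from skopt.space import Integer, Real

# num_leaves now starts at 2: lightgbm rejects values <= 1.
hyperparameter_ranges = {
    "learning_rate": Real(0.000001, 1),
    "n_estimators": Integer(10, 100),
    "num_leaves": Integer(2, 100),
}
```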