
Fix 'RF' error for LightGBM Classifier #1302

Merged
merged 12 commits into from Oct 20, 2020

Conversation


@bchen1116 bchen1116 commented Oct 13, 2020

Fixes #1251 and #1267

Add bagging_freq and bagging_fraction parameters to the LightGBM classifier
Set the num_leaves hyperparameter to start at 2 rather than 1, since LightGBM expects a value greater than 1
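The two changes above can be sketched roughly as follows. The class and attribute names here are simplified stand-ins for illustration, not evalml's actual component code:

```python
# Hypothetical sketch of the two changes described in this PR;
# names and the num_leaves upper bound are illustrative, not evalml's code.

class LightGBMClassifierSketch:
    # num_leaves now starts at 2, since LightGBM requires num_leaves > 1
    hyperparameter_ranges = {
        "num_leaves": (2, 100),  # previously started at 1; upper bound illustrative
    }

    def __init__(self, boosting_type="gbdt", num_leaves=31,
                 bagging_fraction=0.9, bagging_freq=0, **kwargs):
        # bagging_freq and bagging_fraction are now explicit parameters
        # instead of being silently left to LightGBM's defaults
        self.parameters = {
            "boosting_type": boosting_type,
            "num_leaves": num_leaves,
            "bagging_freq": bagging_freq,
            "bagging_fraction": bagging_fraction,
        }
```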

@bchen1116 bchen1116 self-assigned this Oct 13, 2020

codecov bot commented Oct 13, 2020

Codecov Report

Merging #1302 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1302   +/-   ##
=======================================
  Coverage   99.94%   99.94%           
=======================================
  Files         213      213           
  Lines       13357    13387   +30     
=======================================
+ Hits        13349    13379   +30     
  Misses          8        8           
Impacted Files Coverage Δ
...ents/estimators/classifiers/lightgbm_classifier.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_components.py 100.00% <100.00%> (ø)
...alml/tests/component_tests/test_lgbm_classifier.py 100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d254e31...b3e8b40.


@dsherry dsherry left a comment


@bchen1116 thanks for digging into this! I understand 1 of the 2 bug fixes. I left a comment asking for an explanation of the 2nd, just so I can follow along. I also left a couple questions/suggestions about how to set up the new default parameters. Approved pending resolution of those conversations.

@@ -30,7 +30,7 @@ class LightGBMClassifier(Estimator):
SEED_MIN = 0
SEED_MAX = SEED_BOUNDS.max_bound

-    def __init__(self, boosting_type="gbdt", learning_rate=0.1, n_estimators=100, max_depth=0, num_leaves=31, min_child_samples=20, n_jobs=-1, random_state=0, **kwargs):
+    def __init__(self, boosting_type="gbdt", learning_rate=0.1, n_estimators=100, max_depth=0, num_leaves=31, min_child_samples=20, n_jobs=-1, random_state=0, bagging_fraction=0.9, bagging_freq=0, **kwargs):
dsherry (Collaborator):

Why default bagging_freq to 0? Won't that cause the bug when boosting_type="rf"? What default does lightgbm choose for this parameter?

bchen1116 (Contributor, Author):

LightGBM defaults to 0 for bagging_freq. Users can set it to 1 and change bagging_fraction if they want to speed up computation and randomly select data for other boosting types, but it's required to be 1 for boosting_type=rf (along with 0 < bagging_fraction < 1.0).

dsherry (Collaborator):

Got it. This looks good. Is 0.9 the default bagging_fraction in lightgbm?

bchen1116 (Contributor, Author):

@dsherry it defaults to 1.0

-                  "n_jobs": n_jobs}
+                  "n_jobs": n_jobs,
+                  "bagging_freq": bagging_freq,
+                  "bagging_fraction": bagging_fraction}
dsherry (Collaborator):

@bchen1116 could you please explain why adding these two parameters fixed the bug?

bchen1116 (Contributor, Author):

As background, LightGBM has four boosting types: "gbdt", "dart", "goss", and "rf". bagging_freq controls the frequency of bagging: LightGBM bags every bagging_freq = k iterations (0 means no bagging). bagging_fraction is the fraction of the data randomly selected without resampling (1.0 selects all of it, 0 none); bagging can help speed up training.

The default bagging_freq LightGBM sets is 0, which works with gbdt, dart, and goss. However, for rf, since it's a random forest, LightGBM requires bagging, which means bagging_freq must be 1 and bagging_fraction must be below 1.0. By adding those two parameters and changing bagging_freq when boosting_type=rf, we apply a simple fix that avoids this bug.
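The constraint described above can be mirrored in a small sketch. The exact check lives inside LightGBM itself; this hypothetical helper just restates its documented rule for "rf":

```python
def valid_bagging_config(boosting_type, bagging_freq=0, bagging_fraction=1.0):
    """Hypothetical sketch mirroring LightGBM's documented rule:
    'rf' requires bagging, i.e. bagging_freq >= 1 (the fix sets it to 1)
    and 0 < bagging_fraction < 1.0. Other boosting types accept no bagging."""
    if boosting_type == "rf":
        return bagging_freq >= 1 and 0 < bagging_fraction < 1.0
    return True  # gbdt, dart, goss work with or without bagging

# With LightGBM's own defaults (bagging_freq=0, bagging_fraction=1.0):
valid_bagging_config("gbdt")                               # True
valid_bagging_config("rf")                                 # False -- the reported bug
valid_bagging_config("rf", bagging_freq=1,
                     bagging_fraction=0.9)                 # True -- the fix
```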

dsherry (Collaborator):

Thanks for the clear explanation! That makes sense.

Can we tweak the comment you left on line 48:

if the boosting type is random forest, bagging is required by lightgbm, so we set bagging_freq to 1 in order to avoid errors


@dsherry dsherry left a comment


@bchen1116 looks great!

bchen1116 commented Oct 20, 2020

LightGBM doesn't have smart defaults for bagging_freq and bagging_fraction, so when boosting_type=rf, we have to manually set bagging_fraction < 1.0 along with bagging_freq=1. We chose bagging_fraction = 0.9, although we didn't test for an 'ideal' value, nor does LightGBM recommend one, so this number can be updated whenever necessary.
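The override discussed in this thread could look roughly like the sketch below; the helper name is hypothetical, not evalml's actual code, and 0.9 is the untuned default the thread settles on:

```python
def params_for_lightgbm(boosting_type, bagging_freq=0, bagging_fraction=0.9):
    """Hypothetical sketch of the fix: return the parameters actually
    handed to lightgbm, forcing bagging on for random forest mode."""
    params = {
        "boosting_type": boosting_type,
        "bagging_freq": bagging_freq,
        "bagging_fraction": bagging_fraction,
    }
    # if the boosting type is random forest, bagging is required by
    # lightgbm, so we set bagging_freq to 1 in order to avoid errors
    if boosting_type == "rf":
        params["bagging_freq"] = 1
    return params
```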

dsherry commented Oct 20, 2020

Thanks @bchen1116. Yep, agreed: we may be able to find a better default than 0.9 for bagging_fraction. Having a fixed value for now is preferable to more "magic" behavior where the actual value passed to lightgbm changes.

@dsherry dsherry mentioned this pull request Oct 29, 2020
@freddyaboulton freddyaboulton deleted the bc_1251_rf branch May 13, 2022 14:58
Successfully merging this pull request may close these issues.

LightGBM errors out during AutoMLSearch