
Fix stacked ensemble and LightGBM errors in AutoMLSearch #1388

Merged: 16 commits from ange_fix_stacked_ensemble into main on Nov 9, 2020

Conversation

angela97lin
Contributor

Closes #1376

@angela97lin angela97lin added this to the November 2020 milestone Oct 31, 2020
@angela97lin angela97lin self-assigned this Oct 31, 2020
@angela97lin angela97lin marked this pull request as ready for review November 2, 2020 15:35
@angela97lin
Contributor Author

Ah @bchen1116, I see you put up #1369 for review. Our approaches to getting rid of the warnings are different, but perhaps you have more context about what to do or whether I'm missing something? I could back out the LightGBM changes in favor of #1369 :)

@alteryx alteryx deleted a comment from codecov bot Nov 2, 2020
@angela97lin
Contributor Author

angela97lin commented Nov 2, 2020

Codecov passes here: https://codecov.io/gh/alteryx/evalml/commit/d388e9a957332b5cdecf036633b523db8c2ebe4a/graphs

Edit: Looking at the regenerated codecov report, codecov fails because the total number of lines has decreased, so the percentage coverage is also decreasing. 😬

@codecov

codecov bot commented Nov 2, 2020

Codecov Report

Merging #1388 (dd49ed0) into main (d78d1f2) will decrease coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1388     +/-   ##
=========================================
- Coverage   100.0%   100.0%   -0.0%     
=========================================
  Files         214      214             
  Lines       14107    14073     -34     
=========================================
- Hits        14100    14066     -34     
  Misses          7        7             
Impacted Files Coverage Δ
...lines/components/ensemble/stacked_ensemble_base.py 100.0% <ø> (ø)
...components/ensemble/stacked_ensemble_classifier.py 100.0% <ø> (ø)
...ents/estimators/classifiers/lightgbm_classifier.py 100.0% <100.0%> (ø)
evalml/pipelines/components/utils.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 100.0% <100.0%> (ø)
...alml/tests/component_tests/test_lgbm_classifier.py 100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d78d1f2...dd49ed0.

Contributor

@bchen1116 bchen1116 left a comment

LGTM! I didn't run into the errors when I tried to repro.

For LightGBM, we just have different implementations that result in the same end behavior, so I'm fine closing my PR in favor of this one. I did leave a comment about always leaving subsample=None.

# avoid lightgbm warnings having to do with parameter aliases
- if lg_parameters['bagging_freq']:
+ if lg_parameters['bagging_freq'] is not None:
bchen1116
Contributor

@angela97lin I think the if/elif sections are fine, since for rf, we need to do bagging, whereas for goss, lightgbm doesn't accept bagging. For the last check though, I would set subsample and subsample_freq to None by default since as long as bagging values are passed in, it'll throw the warning if subsample values aren't None.
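
For reference, a rough sketch of the structure under discussion, using hypothetical names and placeholder values rather than the actual evalml implementation:

# hypothetical sketch: lg_parameters is the dict of parameters handed to lightgbm
if lg_parameters['boosting_type'] == 'rf':
    # rf (random forest) mode requires bagging, so make sure it is enabled (placeholder values)
    lg_parameters['bagging_freq'] = 1
    lg_parameters['bagging_fraction'] = 0.9
elif lg_parameters['boosting_type'] == 'goss':
    # goss does not accept bagging, so disable it
    lg_parameters['bagging_freq'] = 0
    lg_parameters['bagging_fraction'] = 1.0
# the "last check" under discussion: clear the sklearn-style aliases so lightgbm
# does not warn about subsample / subsample_freq conflicting with the bagging parameters
if lg_parameters['bagging_freq'] is not None:
    lg_parameters['subsample'] = None
    lg_parameters['subsample_freq'] = None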

Contributor Author

@bchen1116 Just to make sure I'm understanding, are you suggesting that rather than checking if bagging_freq is None, always update subsample and subsample_freq to None? :o

dsherry
Contributor

Right, so I thought this was basically patching a bug in lightgbm. subsample is supposed to be an alias for bagging_fraction, and subsample_freq is supposed to be an alias for bagging_freq.

What I'm remembering is that @bchen1116 found that when you set bagging_fraction or bagging_freq, you have to set both subsample and subsample_freq to None in order to avoid a warning. (Is that right? I may be off by a little bit)

If I'm right, what @angela97lin has now should work, no? What case would that not cover? If I'm wrong, well, haha, we should do whatever avoids warnings and produces good performance from lightgbm.
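
As a concrete illustration of the alias behavior described above, a hypothetical standalone repro against lightgbm directly (not code from this PR; the toy dataset is made up):

import numpy as np
import lightgbm as lgb

# tiny toy dataset, just enough to trigger a fit
X = np.random.rand(100, 5)
y = np.random.randint(2, size=100)

# bagging_freq / bagging_fraction are passed through to lightgbm, while the sklearn-style
# aliases subsample / subsample_freq keep their defaults, so lightgbm warns, e.g.:
#   [LightGBM] [Warning] bagging_fraction is set=0.5, subsample=1.0 will be ignored. ...
model = lgb.LGBMClassifier(bagging_freq=1, bagging_fraction=0.5)
model.fit(X, y)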

Contributor Author

@dsherry @bchen1116 Oooo. Based on what you guys have said, does that mean I should be checking whether either bagging_freq or bagging_fraction is set (not None)? (currently just if lg_parameters['bagging_freq'] is not None:)

dsherry
Contributor

@angela97lin yep I think so!

Perhaps the quickest way to resolve this is to write a unit test where we try various combos of all these params, run LightGBMClassifier.fit on a tiny dataset each time, and assert that no warnings are captured by the warnings module or printed in the output?
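
A minimal sketch of the kind of test being suggested, assuming LightGBMClassifier and a small X, y fixture are available from the evalml test suite (as it turns out later in this thread, lightgbm's warnings are not actually catchable this way):

import warnings

# hypothetical sketch; LightGBMClassifier and a small X, y come from the evalml test fixtures
def test_lightgbm_fit_no_alias_warnings(X, y):
    combos = [{}, {'bagging_freq': 1}, {'bagging_fraction': 0.5},
              {'bagging_freq': 1, 'bagging_fraction': 0.5}]
    for params in combos:
        clf = LightGBMClassifier(**params)
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter('always')
            clf.fit(X, y)
        messages = [str(w.message) for w in caught]
        assert not any('ignored' in m or 'alias' in m for m in messages)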

Contributor

Right, that should do it. I think if bagging_freq is None, then it should ignore bagging_fraction, but I do think the safest way to do it would be to just set subsample/subsample_freq to None as long as either of the bagging args is set. Sorry, didn't see this earlier!
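
Per that suggestion, the final check would look roughly like this (a hypothetical sketch, assuming the same lg_parameters dict as in the snippet above; not the exact evalml code):

# clear the aliases whenever either bagging parameter is set, not just bagging_freq
if lg_parameters['bagging_freq'] is not None or lg_parameters['bagging_fraction'] is not None:
    lg_parameters['subsample'] = None
    lg_parameters['subsample_freq'] = None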

(Resolved review thread on evalml/tests/automl_tests/test_automl.py)
@bchen1116 bchen1116 mentioned this pull request Nov 2, 2020
Contributor

@freddyaboulton freddyaboulton left a comment

@angela97lin Looks good! Thanks for making the change :) I left a couple of questions but nothing blocking hehe.

Contributor

@dsherry dsherry left a comment

Cool! I left some questions, will approve when we resolve those conversations

(Resolved review thread on evalml/pipelines/components/utils.py)
@dsherry
Contributor

dsherry commented Nov 6, 2020

@angela97lin let's merge #1413 to tweak codecov and then retrigger the unit tests on this PR and see if that unblocks codecov. If not, I can definitely merge this 😅

@dsherry
Contributor

dsherry commented Nov 6, 2020

> @angela97lin let's merge #1413 to tweak codecov and then retrigger the unit tests on this PR and see if that unblocks codecov. If not, I can definitely merge this 😅

Looks like it worked 🎊

@angela97lin
Contributor Author

@dsherry Did some testing; it looks like it's not a pytest thing but rather LightGBM itself, since I'm not able to redirect the output from stdout:

import io
from contextlib import redirect_stdout, redirect_stderr

# X, y and LightGBMClassifier are already defined in the surrounding context
with io.StringIO() as buf, redirect_stdout(buf):
    print('redirected')
    clf = LightGBMClassifier(bagging_freq=1, bagging_fraction=0.5)
    clf.fit(X, y)
    output = buf.getvalue()  # is an empty string

(I also tried with redirect_stderr.)

Really odd; I'm not sure how the LightGBM warnings are being printed, but I'm going to remove the tests that used pytest.warns since that's not relevant here, and merge this.
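
This is consistent with the warnings being printed by the native LightGBM library straight to the process-level stdout/stderr, which redirect_stdout cannot intercept because it only swaps Python's sys.stdout. A capture at the OS file-descriptor level, for example via pytest's capfd fixture, should see that output; a hypothetical sketch, reusing the same assumed X, y and LightGBMClassifier:

# hypothetical sketch: capfd captures at the OS file-descriptor level,
# so it also sees output written directly by C/C++ code
def test_lightgbm_fit_prints_no_warnings(capfd, X, y):
    clf = LightGBMClassifier(bagging_freq=1, bagging_fraction=0.5)
    clf.fit(X, y)
    out, err = capfd.readouterr()
    assert '[Warning]' not in out + err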

Contributor

@dsherry dsherry left a comment

🚢 ! Thanks for giving the warning coverage a shot, too bad lightgbm won't work with that 🤷‍♂️

@angela97lin
Contributor Author

> 🚢 ! Thanks for giving the warning coverage a shot, too bad lightgbm won't work with that 🤷‍♂️

Yeah, I guess it's not a big issue for now--if we continue to see similar warnings, we can revisit this!

@angela97lin angela97lin merged commit cf50801 into main Nov 9, 2020
@angela97lin angela97lin deleted the ange_fix_stacked_ensemble branch November 9, 2020 18:56
@dsherry dsherry mentioned this pull request Nov 24, 2020
Successfully merging this pull request may close these issues:

LightGBM and stacked ensemble classifiers fail for AutoMLSearch