Integrate ensemble methods in AutoML#1253

Merged
angela97lin merged 64 commits into main from 1130_ensemble_in_automl
Oct 22, 2020
Conversation

@angela97lin
Contributor

@angela97lin angela97lin commented Oct 1, 2020

Closes #1130

@angela97lin angela97lin self-assigned this Oct 1, 2020
@angela97lin angela97lin added this to the October 2020 milestone Oct 1, 2020
@codecov

codecov bot commented Oct 5, 2020

Codecov Report

Merging #1253 into main will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1253      +/-   ##
==========================================
+ Coverage   99.95%   99.95%   +0.01%     
==========================================
  Files         213      213              
  Lines       13436    13555     +119     
==========================================
+ Hits        13429    13548     +119     
  Misses          7        7              
Impacted Files Coverage Δ
...lml/automl/automl_algorithm/iterative_algorithm.py 100.00% <100.00%> (ø)
evalml/automl/automl_search.py 99.61% <100.00%> (+0.01%) ⬆️
...lines/components/ensemble/stacked_ensemble_base.py 100.00% <100.00%> (ø)
...components/ensemble/stacked_ensemble_classifier.py 100.00% <100.00%> (ø)
.../components/ensemble/stacked_ensemble_regressor.py 100.00% <100.00%> (ø)
evalml/pipelines/utils.py 100.00% <100.00%> (ø)
evalml/tests/automl_tests/test_automl.py 100.00% <100.00%> (ø)
...lml/tests/automl_tests/test_iterative_algorithm.py 100.00% <100.00%> (ø)
evalml/tests/conftest.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update edd91f5...8c1937f. Read the comment docs.

@angela97lin
Contributor Author

@dsherry Okay! I kicked off a test last night. Seems like there aren't errors now, but I am getting a lot of warnings based on the update:

[screenshot: warning output from the test run]

Are we curious enough to wait for the tests to complete and see how performance changes with LightGBM? Or is that a nice-to-have, but not necessary for merging?

Collaborator

@jeremyliweishih jeremyliweishih left a comment


LGTM on my end - just a couple things to clean up.

Contributor

@freddyaboulton freddyaboulton left a comment


@angela97lin I think this is great! I have some questions (mainly for my understanding lol)

if self.batch_number == 1:
    self._first_batch_results.append((score_to_minimize, pipeline.__class__))

if pipeline.model_family not in self._best_pipeline_params and score_to_minimize is not None:
Contributor

How come we check None and not np.nan (the score that AutoML uses when a pipeline fails)?

Contributor Author

Haha, I think this came out of the test_pipelines_in_batch_return_none test, which checks what happens if the pipeline returns None 😅 When None is returned, we get a TypeError: '<' not supported between instances of 'NoneType' and 'int'. np.nan doesn't throw this error since you can compare against it (though score_to_minimize < current_best_score will always be False, so we never update the best value to np.nan).
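For anyone following along, here's a quick illustration of the comparison behavior described above (plain Python/NumPy, not evalml code):

```python
import numpy as np

current_best_score = 1.0

# Comparing None to a number raises TypeError in Python 3.
try:
    None < current_best_score
except TypeError as err:
    print("None comparison raised:", err)

# np.nan compares without raising, but every ordering comparison is False,
# so a nan score can never replace the current best.
print(np.nan < current_best_score)  # False
print(np.nan > current_best_score)  # False
```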

Contributor

@dsherry dsherry left a comment


@angela97lin so exciting that we're ready to 🚢 this, nice going!

As discussed, let's change the AutoML default to ensembling=False. Other than that, all that's blocking is one comment in the iterative algorithm and the log message in _compute_cv_scores.

It would be great to add the missing docstrings as well, and I left a couple other suggestions.

    ensemble = _make_stacked_ensemble_pipeline(input_pipelines, input_pipelines[0].problem_type)
    next_batch.append(ensemble)
else:
    idx = (self._batch_number - 1) % len(self._first_batch_results)
Contributor

I believe this should be updated to (self._batch_number - 1) % (len(self._first_batch_results) + 1), to match the other modular arithmetic you added in the elif above.

This selects which pipeline class we should tune in the current batch, when we're not in the 0th batch or in a stacked ensemble batch.

This is an aside, but in the future I hope we can figure out how to represent the state here in a less confusing way! I think our requirements for this code have grown a little beyond its means, haha.

Contributor Author

Ahh, really good catch. We only have to account for the extra ensemble batch when self.ensembling is True:

num_pipeline_classes = (len(self._first_batch_results) + 1) if self.ensembling else len(self._first_batch_results)
idx = (self._batch_number - 1) % num_pipeline_classes
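To make the round-robin arithmetic concrete, here's a hypothetical standalone sketch (the function name and values are illustrative, not from the evalml source). With ensembling enabled, each cycle gains one extra slot for the ensemble batch, so the modulus grows by one:

```python
def pipeline_index(batch_number, num_first_batch_results, ensembling):
    """Index of the pipeline class to tune in a given batch (batch 0 excluded)."""
    num_slots = num_first_batch_results + 1 if ensembling else num_first_batch_results
    return (batch_number - 1) % num_slots

# Without ensembling, 3 pipeline classes simply cycle.
print([pipeline_index(b, 3, False) for b in range(1, 7)])
# [0, 1, 2, 0, 1, 2]

# With ensembling, index 3 marks the ensemble batch ending each cycle.
print([pipeline_index(b, 3, True) for b in range(1, 9)])
# [0, 1, 2, 3, 0, 1, 2, 3]
```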

assert any([p != dummy_binary_pipeline_classes[0]({}).parameters for p in all_parameters])

for i in range(1, 5):
    for _ in range(len(dummy_binary_pipeline_classes)):
Contributor

This is totally a style nit-pick / me just confusing myself, haha, but why the double-for here?

Contributor Author

Ah, the for i in range(1, 5) loop iterates over cycles of batches (each cycle goes through every pipeline class), so it tests that the ensemble is called at the end of every cycle. The inner for loop then runs once per pipeline class within a cycle. Really confusing stuff 😂

@angela97lin
Contributor Author

@dsherry @freddyaboulton Thanks for all of the great feedback! I've addressed all of the comments you guys made, so this PR should be good to merge :D If there's anything to follow up on, we can open a separate PR after!

@angela97lin angela97lin merged commit bc9d185 into main Oct 22, 2020
@angela97lin angela97lin deleted the 1130_ensemble_in_automl branch October 22, 2020 04:21
@dsherry dsherry mentioned this pull request Oct 29, 2020
Development

Successfully merging this pull request may close these issues.

Integrate ensembling into AutoML