Removed check for ensemble when training best pipeline #2037

Merged

ParthivNaresh merged 21 commits into main from 1931-Train-Best-Pipeline-Ensemble on Apr 6, 2021

Conversation

ParthivNaresh
Contributor

Fixes #1931

@CLAassistant

CLAassistant commented Mar 26, 2021

CLA assistant check
All committers have signed the CLA.

@ParthivNaresh
Contributor Author

ParthivNaresh commented Mar 29, 2021

Perf tests here

@codecov

codecov bot commented Mar 29, 2021

Codecov Report

Merging #2037 (4feb980) into main (8df4215) will decrease coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #2037     +/-   ##
=========================================
- Coverage   100.0%   100.0%   -0.0%     
=========================================
  Files         288      288             
  Lines       23425    23419      -6     
=========================================
- Hits        23415    23409      -6     
  Misses         10       10             
Impacted Files                              Coverage Δ
evalml/automl/automl_search.py              100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py    100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8df4215...4feb980.

@ParthivNaresh ParthivNaresh marked this pull request as ready for review March 31, 2021 00:58
Contributor @chukarsten left a comment

This looks good. I think it would be good to answer the question you pointed out in the sequential engine, but that's not blocking your work.

@@ -34,7 +34,7 @@ def evaluate_batch(self, pipelines):
             X, y = self.X_train, self.y_train
             if pipeline.model_family == ModelFamily.ENSEMBLE:
                 X, y = self.X_train.iloc[self.ensembling_indices], self.y_train.iloc[self.ensembling_indices]
-            elif self.ensembling_indices is not None:
+            elif self.ensembling_indices is not None: # Is this necessary?
Contributor

Someone will have to comment on this. When this got extracted from AutoML into the engine, we lost track of who put it in there. It's kind of strange, undocumented behavior.

Contributor Author

I just wanted to understand this behaviour better. Is this essentially setting the training indices to the complement of the ensembling_indices for non-ensemble model families when ensembling is turned on? @freddyaboulton @angela97lin @dsherry @bchen1116 @jeremyliweishih

Contributor

Yep! We don't want to train the other pipelines on the data that we set aside for ensembling, in order to prevent overfitting. This means that when ensembling is enabled, we train the ensemble pipelines on the held-out ensembling indices and all other pipelines on the rest of the data.
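A minimal sketch of that split on made-up data (the names X_train, y_train, and ensembling_indices mirror the engine code above, but the sizes and splitting logic here are purely illustrative, not evalml's actual implementation):

import numpy as np
import pandas as pd

# Illustrative only: hold out a slice of rows for the ensembler and give every
# other pipeline the complement, so the ensemble is evaluated on data its
# input pipelines never saw during training.
rng = np.random.default_rng(0)
X_train = pd.DataFrame({"feature": rng.normal(size=100)})
y_train = pd.Series(rng.integers(0, 2, size=100))

# Hypothetical 20% holdout for ensembling.
ensembling_indices = rng.choice(len(X_train), size=20, replace=False)
other_indices = np.setdiff1d(np.arange(len(X_train)), ensembling_indices)

# Ensemble pipelines train on the holdout rows...
X_ens, y_ens = X_train.iloc[ensembling_indices], y_train.iloc[ensembling_indices]
# ...while all other pipelines train on the remaining rows.
X_rest, y_rest = X_train.iloc[other_indices], y_train.iloc[other_indices]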

@ParthivNaresh
Contributor Author

I'm turning this back into a draft PR because I can't confirm that ensembling is running for these performance tests. I need at least 50 iterations, and unfortunately every job I've tried to run with it hasn't finished. I'll be looking for another way around this.

@ParthivNaresh
Contributor Author

I've decided to run with a limited set of model families to reduce the number of iterations; this should let the run finish while still allowing ensembling to take place.
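As a rough illustration of the kind of search configuration being described, assuming evalml's AutoMLSearch API of that era; the model families, batch count, and data below are placeholder choices, not the ones used in the actual perf tests:

from evalml.automl import AutoMLSearch

# X_train / y_train are assumed to be a user-supplied feature matrix and target.
automl = AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type="binary",
    # Limiting the model families keeps the iteration count manageable...
    allowed_model_families=["random_forest", "linear_model"],
    # ...while still running enough batches for an ensembling batch to fire.
    max_batches=10,
    ensembling=True,
)
automl.search()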

Contributor @bchen1116 left a comment

LGTM! Thanks for making these changes. Left one cleanup comment.


 # when best_pipeline == -1, model is ensembling,
 # otherwise, the model is a different model
 # the ensembling_num formula is taken from AutoMLSearch
 if best_pipeline == -1:
     assert automl.best_pipeline.model_family == ModelFamily.ENSEMBLE
-    assert len(mock_fit.call_args_list[-1][0][0]) == len(ensembling_indices)
-    assert len(mock_fit.call_args_list[-1][0][1]) == len(ensembling_indices)
+    assert len(mock_fit.call_args_list[-1][0][0]) == len(X)
Contributor

I think we can just move these 2 assert len(...) outside of the if-else statement to reduce repeated code.

if best_pipeline == -1:
    assert automl.best_pipeline.model_family == ModelFamily.ENSEMBLE
else:
    assert automl.best_pipeline.model_family != ModelFamily.ENSEMBLE
assert len(mock_fit.call_args_list[-1][0][0]) == len(X)
assert len(mock_fit.call_args_list[-1][0][1]) == len(y)

Contributor Author

Great point!
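As background for the assertions above: with unittest.mock, call_args_list[-1] is the most recent call and [0] selects its positional-argument tuple, so [0][0] and [0][1] are the X and y passed to fit. A self-contained illustration with made-up data:

from unittest.mock import MagicMock

mock_fit = MagicMock()
mock_fit([1, 2, 3], [0, 1, 0])  # stand-in for pipeline.fit(X, y)

# call_args_list[-1] is the last call; [0] selects its positional args.
last_positional_args = mock_fit.call_args_list[-1][0]
assert len(last_positional_args[0]) == 3  # X passed to fit
assert len(last_positional_args[1]) == 3  # y passed to fit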

@ParthivNaresh ParthivNaresh marked this pull request as ready for review April 5, 2021 15:04
Contributor @dsherry left a comment

🚢 !

X_train = self.X_train
y_train = self.y_train
Contributor

👏 beautiful
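To summarize the behaviour change behind this diff: previously an ensemble best pipeline was refit only on the ensembling holdout, whereas after this PR the best pipeline is always refit on the full training data, which is what #1931 asks for. A simplified sketch; the function names below are invented for illustration and are not evalml's actual methods:

from evalml.model_family import ModelFamily

def train_best_pipeline_old(best_pipeline, X_train, y_train, ensembling_indices):
    # Old behaviour (simplified): an ensemble best pipeline was trained on the
    # ensembling holdout rows only.
    if ensembling_indices is not None and best_pipeline.model_family == ModelFamily.ENSEMBLE:
        X_train = X_train.iloc[ensembling_indices]
        y_train = y_train.iloc[ensembling_indices]
    return best_pipeline.fit(X_train, y_train)

def train_best_pipeline_new(best_pipeline, X_train, y_train):
    # New behaviour (simplified): the ensemble check is gone and the best
    # pipeline is always trained on the entire training set.
    return best_pipeline.fit(X_train, y_train)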

@ParthivNaresh ParthivNaresh merged commit eb19b68 into main Apr 6, 2021
@freddyaboulton freddyaboulton deleted the 1931-Train-Best-Pipeline-Ensemble branch May 13, 2022 15:18
Successfully merging this pull request may close these issues.

AutoML: Train best pipeline on entire dataset if it's an ensemble