Removed check for ensemble when training best pipeline #2037
Conversation
Perf tests here.
Codecov Report

```
@@            Coverage Diff            @@
##              main    #2037    +/-   ##
==========================================
- Coverage    100.0%   100.0%   -0.0%
==========================================
  Files          288      288
  Lines        23425    23419      -6
==========================================
- Hits         23415    23409      -6
  Misses          10       10
==========================================
```

Continue to review the full report at Codecov.
This looks good. I think it would be good to answer the question you pointed out in the sequential engine, but that's not blocking your work.
```diff
@@ -34,7 +34,7 @@ def evaluate_batch(self, pipelines):
             X, y = self.X_train, self.y_train
             if pipeline.model_family == ModelFamily.ENSEMBLE:
                 X, y = self.X_train.iloc[self.ensembling_indices], self.y_train.iloc[self.ensembling_indices]
-            elif self.ensembling_indices is not None:
+            elif self.ensembling_indices is not None:  # Is this necessary?
```
Someone will have to comment on this. When this got extracted from AutoML into the engine, we lost track of who put it in there. It's kind of strange, undocumented behavior.
I just wanted to understand this behaviour better. Is this essentially setting the training indices to the complement of the `ensembling_indices` for non-ensemble model families when ensembling is turned on? @freddyaboulton @angela97lin @dsherry @bchen1116 @jeremyliweishih
Yep! We don't want to train the other pipelines on the data that we set aside for ensembling, in order to prevent overfitting. This means that when we enable ensembling, we want to train ensemble pipelines on certain indices of the data and the other pipelines on the rest of the data.
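For illustration, here is a minimal sketch of that split, assuming pandas-style inputs; `split_for_ensembling` and its signature are hypothetical, not the engine's actual code:

```python
# Hypothetical helper illustrating the split described above -- not the
# actual engine implementation.
import numpy as np

def split_for_ensembling(X_train, y_train, ensembling_indices, is_ensemble):
    """Return the rows a pipeline should train on.

    Ensemble pipelines train on the held-aside rows; every other pipeline
    trains on the complement, so the ensemble's training data was never
    seen by its base pipelines.
    """
    if is_ensemble:
        return X_train.iloc[ensembling_indices], y_train.iloc[ensembling_indices]
    # Complement of the ensembling indices for non-ensemble model families.
    rest = np.setdiff1d(np.arange(len(X_train)), ensembling_indices)
    return X_train.iloc[rest], y_train.iloc[rest]
```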
I'm turning this back into a draft PR because I can't confirm that ensembling is running for these performance tests. I need at least 50 iterations, and unfortunately every job I've tried to run with it hasn't finished. I'll be looking for another way around this.
I've decided to run with limited model families to reduce the iterations; this should result in a completed run while still allowing ensembling to take place.
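For reference, a hedged sketch of that kind of setup; the parameter values here are illustrative, not the ones used in the perf run:

```python
from evalml.automl import AutoMLSearch

# Restricting allowed_model_families keeps each batch small, so the
# ensembling batch is reached in fewer total iterations. Assumes X and y
# are already loaded; problem_type and max_batches are illustrative.
automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    allowed_model_families=["random_forest", "linear_model"],
    ensembling=True,
    max_batches=10,
)
automl.search()
```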
LGTM! Thanks for making these changes. Left one cleanup comment.
```diff
         # when best_pipeline == -1, model is ensembling,
         # otherwise, the model is a different model
         # the ensembling_num formula is taken from AutoMLSearch
         if best_pipeline == -1:
             assert automl.best_pipeline.model_family == ModelFamily.ENSEMBLE
-            assert len(mock_fit.call_args_list[-1][0][0]) == len(ensembling_indices)
-            assert len(mock_fit.call_args_list[-1][0][1]) == len(ensembling_indices)
+            assert len(mock_fit.call_args_list[-1][0][0]) == len(X)
```
I think we can just move these two `assert len(...)` checks outside of the if-else statement to reduce repeated code:
```python
if best_pipeline == -1:
    assert automl.best_pipeline.model_family == ModelFamily.ENSEMBLE
else:
    assert automl.best_pipeline.model_family != ModelFamily.ENSEMBLE
assert len(mock_fit.call_args_list[-1][0][0]) == len(X)
assert len(mock_fit.call_args_list[-1][0][1]) == len(y)
```
Great point!
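As an aside on the mock indexing used in these asserts: `call_args_list[-1][0]` is the positional-argument tuple of the most recent call, so `[0][0]` and `[0][1]` are the `X` and `y` passed to the last `fit`. A tiny self-contained illustration with `unittest.mock` (not the actual test):

```python
from unittest.mock import MagicMock

mock_fit = MagicMock()
mock_fit("X_first", "y_first")  # stands in for pipeline.fit(X, y)
mock_fit("X_last", "y_last")

args, kwargs = mock_fit.call_args_list[-1]  # most recent call
assert args[0] == "X_last"                  # first positional arg: X
assert args[1] == "y_last"                  # second positional arg: y
```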
…alteryx/evalml into 1931-Train-Best-Pipeline-Ensemble
🚢 !
```python
X_train = self.X_train
y_train = self.y_train
```
👏 beautiful
Fixes #1931