Removed check for ensemble when training best pipeline #2037
Changes from 8 commits
```diff
@@ -34,7 +34,7 @@ def evaluate_batch(self, pipelines):
         X, y = self.X_train, self.y_train
         if pipeline.model_family == ModelFamily.ENSEMBLE:
             X, y = self.X_train.iloc[self.ensembling_indices], self.y_train.iloc[self.ensembling_indices]
-        elif self.ensembling_indices is not None:
+        elif self.ensembling_indices is not None:  # Is this necessary?
             training_indices = [i for i in range(len(self.X_train)) if i not in self.ensembling_indices]
             X = self.X_train.iloc[training_indices]
             y = self.y_train.iloc[training_indices]
```

Review thread on the `elif self.ensembling_indices is not None:` line:

> Someone will have to comment on this. When this got extracted from AutoML into the engine, we lost track of who put it in there. It's kind of strange, undocumented behavior.

> I just wanted to understand this behaviour better. Is this essentially setting the training indices to the complement of the `ensembling_indices`?

> Yep! We don't want to train the other pipelines on the data that we set aside for ensembling, in order to prevent overfitting. This means that when we enable ensembling, we want to train ensemble pipelines on certain indices of the data and the other pipelines on the rest of the data.
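The split described in that reply is easy to reproduce outside the engine. Below is a minimal sketch of the complement-index behavior, assuming plain pandas objects in place of the engine's `self.X_train`/`self.y_train` (the data and index values are invented for illustration, not taken from EvalML):

```python
import pandas as pd

# Invented toy data standing in for self.X_train / self.y_train.
X_train = pd.DataFrame({"feature": range(10)})
y_train = pd.Series(range(10))
ensembling_indices = [7, 8, 9]  # rows held out for the stacked ensembler

# Non-ensemble pipelines train on the complement of the ensembling rows,
# mirroring the elif branch in the diff above.
training_indices = [i for i in range(len(X_train)) if i not in ensembling_indices]

X_other, y_other = X_train.iloc[training_indices], y_train.iloc[training_indices]
X_ens, y_ens = X_train.iloc[ensembling_indices], y_train.iloc[ensembling_indices]

# The two row sets are disjoint and together cover the training data.
assert len(X_other) + len(X_ens) == len(X_train)
assert set(X_other.index).isdisjoint(X_ens.index)
```

Because the sets are disjoint, the ensembler never trains on rows its input pipelines were fitted on, which is the overfitting guard described in the reply above.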
```diff
@@ -2254,15 +2254,13 @@ def test_automl_ensembling_best_pipeline(mock_fit, mock_score, mock_rankings, in
     ensembling_num = (1 + len(automl.allowed_pipelines) + len(automl.allowed_pipelines) * automl._pipelines_per_batch + 1) + best_pipeline
     mock_rankings.return_value = pd.DataFrame({"id": ensembling_num, "pipeline_name": "stacked_ensembler", "score": 0.1}, index=[0])
     automl.search()
-    training_indices, ensembling_indices, _, _ = split_data(ww.DataTable(np.arange(X.shape[0])), y, problem_type='binary', test_size=ensemble_split_size, random_seed=0)
-    training_indices, ensembling_indices = training_indices.to_dataframe()[0].tolist(), ensembling_indices.to_dataframe()[0].tolist()
     # when best_pipeline == -1, model is ensembling,
     # otherwise, the model is a different model
     # the ensembling_num formula is taken from AutoMLSearch
     if best_pipeline == -1:
         assert automl.best_pipeline.model_family == ModelFamily.ENSEMBLE
-        assert len(mock_fit.call_args_list[-1][0][0]) == len(ensembling_indices)
-        assert len(mock_fit.call_args_list[-1][0][1]) == len(ensembling_indices)
+        assert len(mock_fit.call_args_list[-1][0][0]) == len(X)
+        assert len(mock_fit.call_args_list[-1][0][1]) == len(y)
     else:
         assert automl.best_pipeline.model_family != ModelFamily.ENSEMBLE
         assert len(mock_fit.call_args_list[-1][0][0]) == len(X)
```

Review thread on the new assertions:

> I think we can just move these 2:
>
> ```python
> if best_pipeline == -1:
>     assert automl.best_pipeline.model_family == ModelFamily.ENSEMBLE
> else:
>     assert automl.best_pipeline.model_family != ModelFamily.ENSEMBLE
> assert len(mock_fit.call_args_list[-1][0][0]) == len(X)
> assert len(mock_fit.call_args_list[-1][0][1]) == len(y)
> ```

> Great point!
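For anyone unfamiliar with the assertion style used in this test: on a mock, `call_args_list[-1][0]` is the positional-argument tuple of the most recent call, so `[0][0]` and `[0][1]` are the `X` and `y` passed to the last `fit`. A self-contained sketch using only `unittest.mock` (independent of EvalML; the sizes are arbitrary):

```python
from unittest.mock import MagicMock

mock_fit = MagicMock()

# Stand-in for the final "train the best pipeline" call the test inspects.
X, y = list(range(100)), list(range(100))
mock_fit(X, y)

# call_args_list[-1] is the most recent call; [0] is its positional-args
# tuple, so [0][0] is the X argument and [0][1] is the y argument.
assert len(mock_fit.call_args_list[-1][0][0]) == len(X)
assert len(mock_fit.call_args_list[-1][0][1]) == len(y)
```

This is why the hoisted asserts in the suggestion above work for both branches: after this PR, the best pipeline is trained on the full `X` and `y` regardless of model family.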
> 👏 beautiful