Removed check for ensemble when training best pipeline #2037

Merged
merged 21 commits into main from 1931-Train-Best-Pipeline-Ensemble on Apr 6, 2021
Changes from 8 commits
Commits
7637c95 Removed check for ensemble when training best pipeline (ParthivNaresh, Mar 25, 2021)
5401a6f Release Notes (ParthivNaresh, Mar 25, 2021)
75adb37 Merge branch 'main' into 1931-Train-Best-Pipeline-Ensemble (ParthivNaresh, Mar 25, 2021)
61cd146 Merge branch 'main' into 1931-Train-Best-Pipeline-Ensemble (ParthivNaresh, Mar 26, 2021)
d30ec10 Merge branch 'main' into 1931-Train-Best-Pipeline-Ensemble (ParthivNaresh, Mar 29, 2021)
6a3a1df Fix test (ParthivNaresh, Mar 29, 2021)
1675ba9 Merge branch 'main' into 1931-Train-Best-Pipeline-Ensemble (ParthivNaresh, Mar 29, 2021)
5816183 Merge branch 'main' into 1931-Train-Best-Pipeline-Ensemble (ParthivNaresh, Mar 30, 2021)
d24f00c Merge branch 'main' into 1931-Train-Best-Pipeline-Ensemble (ParthivNaresh, Mar 31, 2021)
93d116e test fix and comment removal (ParthivNaresh, Mar 31, 2021)
89977dd Merge branch 'main' into 1931-Train-Best-Pipeline-Ensemble (ParthivNaresh, Mar 31, 2021)
de1dfbd Merge branch 'main' into 1931-Train-Best-Pipeline-Ensemble (ParthivNaresh, Apr 4, 2021)
cad5f56 Merge branch 'main' into 1931-Train-Best-Pipeline-Ensemble (ParthivNaresh, Apr 5, 2021)
1aeffb8 Merge branch '1931-Train-Best-Pipeline-Ensemble' of ssh://github.com/… (ParthivNaresh, Apr 5, 2021)
1de3377 dependency update check (ParthivNaresh, Apr 5, 2021)
7dde1ff revert update (ParthivNaresh, Apr 5, 2021)
f5d9903 update networkx dependency (ParthivNaresh, Apr 5, 2021)
c35afcc Trigger Build (ParthivNaresh, Apr 5, 2021)
8ab3b2f Merge branch 'main' into 1931-Train-Best-Pipeline-Ensemble (ParthivNaresh, Apr 5, 2021)
62ee784 Trigger Build (ParthivNaresh, Apr 5, 2021)
4feb980 catboost update (ParthivNaresh, Apr 5, 2021)
1 change: 1 addition & 0 deletions docs/source/release_notes.rst
@@ -29,6 +29,7 @@ Release Notes
     * Updated ``OneHotEncoder`` to drop one redundant feature by default for features with two categories :pr:`1997`
     * Added a ``PolynomialDetrender`` component :pr:`1992`
 * Fixes
+    * Changed best pipeline to train on the entire dataset rather than just ensemble indices for ensemble problems :pr:`2037`
     * Updated binary classification pipelines to use objective decision function during scoring of custom objectives :pr:`1934`
 * Changes
     * Removed ``data_checks`` parameter, ``data_check_results`` and data checks logic from ``AutoMLSearch`` :pr:`1935`
7 changes: 2 additions & 5 deletions evalml/automl/automl_search.py
@@ -529,11 +529,8 @@ def _find_best_pipeline(self):
     if not (self._best_pipeline and self._best_pipeline == self.get_pipeline(best_pipeline['id'])):
         best_pipeline = self.get_pipeline(best_pipeline['id'])
         if self._train_best_pipeline:
-            if best_pipeline.model_family == ModelFamily.ENSEMBLE:
-                X_train, y_train = self.X_train.iloc[self.ensembling_indices], self.y_train.iloc[self.ensembling_indices]
-            else:
-                X_train = self.X_train
-                y_train = self.y_train
+            X_train = self.X_train
+            y_train = self.y_train

Contributor: 👏 beautiful

             if hasattr(self.data_splitter, "transform_sample"):
                 train_indices = self.data_splitter.transform_sample(X_train, y_train)
                 X_train = X_train.iloc[train_indices]
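
For orientation, here is a minimal sketch of what the best-pipeline training path looks like after this change. It is not the actual evalml source: the standalone helper train_best_pipeline and its signature are assumptions for illustration, while the transform_sample handling mirrors the diff context above.

# Hypothetical standalone helper sketching the simplified _find_best_pipeline branch.
def train_best_pipeline(best_pipeline, X_train, y_train, data_splitter=None):
    # After this PR the best pipeline is always trained on the full training set,
    # whether or not it is an ensemble.
    if data_splitter is not None and hasattr(data_splitter, "transform_sample"):
        # Optional subsampling, matching the context lines shown above.
        train_indices = data_splitter.transform_sample(X_train, y_train)
        X_train = X_train.iloc[train_indices]
        y_train = y_train.iloc[train_indices]
    best_pipeline.fit(X_train, y_train)
    return best_pipeline
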
2 changes: 1 addition & 1 deletion evalml/automl/engine/sequential_engine.py
@@ -34,7 +34,7 @@ def evaluate_batch(self, pipelines):
     X, y = self.X_train, self.y_train
     if pipeline.model_family == ModelFamily.ENSEMBLE:
         X, y = self.X_train.iloc[self.ensembling_indices], self.y_train.iloc[self.ensembling_indices]
-    elif self.ensembling_indices is not None:
+    elif self.ensembling_indices is not None:  # Is this necessary?

Collaborator: Someone will have to comment on this. When this got extracted from AutoML into the engine, we lost track of who put it in there. It's kind of strange, undocumented behavior.


Contributor Author: I just wanted to understand this behaviour better. Is this essentially setting the training indices to the complement of the ensembling_indices for non-Ensemble families when ensembling is turned on? @freddyaboulton @angela97lin @dsherry @bchen1116 @jeremyliweishih


Contributor: Yep! We don't want to train the other pipelines on the data that we set aside for ensembling, in order to prevent overfitting. This means that when we enable ensembling, we want to train ensemble pipelines on certain indices of the data and the other pipelines on the rest of the data (see the sketch after this diff).

         training_indices = [i for i in range(len(self.X_train)) if i not in self.ensembling_indices]
         X = self.X_train.iloc[training_indices]
         y = self.y_train.iloc[training_indices]
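
To make the behavior discussed in the thread above concrete, here is a minimal standalone sketch with assumed helper names rather than the evalml API: when ensembling is enabled, the stacked ensemble trains on the reserved ensembling rows and every other pipeline trains on their complement, so the base pipelines never see the rows held out for the ensemble.

import pandas as pd

# Hypothetical helper (not part of evalml) mirroring the row selection in evaluate_batch.
def select_training_data(X: pd.DataFrame, y: pd.Series, is_ensemble: bool, ensembling_indices=None):
    if ensembling_indices is None:
        # Ensembling disabled: every pipeline trains on the full dataset.
        return X, y
    if is_ensemble:
        # The stacked ensemble trains only on the rows reserved for it.
        return X.iloc[ensembling_indices], y.iloc[ensembling_indices]
    # All other pipelines train on the complement of the ensembling rows,
    # which is what guards against overfitting the ensemble's held-out data.
    training_indices = [i for i in range(len(X)) if i not in ensembling_indices]
    return X.iloc[training_indices], y.iloc[training_indices]
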
6 changes: 2 additions & 4 deletions evalml/tests/automl_tests/test_automl.py
@@ -2254,15 +2254,13 @@ def test_automl_ensembling_best_pipeline(mock_fit, mock_score, mock_rankings, in
     ensembling_num = (1 + len(automl.allowed_pipelines) + len(automl.allowed_pipelines) * automl._pipelines_per_batch + 1) + best_pipeline
     mock_rankings.return_value = pd.DataFrame({"id": ensembling_num, "pipeline_name": "stacked_ensembler", "score": 0.1}, index=[0])
     automl.search()
-    training_indices, ensembling_indices, _, _ = split_data(ww.DataTable(np.arange(X.shape[0])), y, problem_type='binary', test_size=ensemble_split_size, random_seed=0)
-    training_indices, ensembling_indices = training_indices.to_dataframe()[0].tolist(), ensembling_indices.to_dataframe()[0].tolist()
     # when best_pipeline == -1, model is ensembling,
     # otherwise, the model is a different model
     # the ensembling_num formula is taken from AutoMLSearch
     if best_pipeline == -1:
         assert automl.best_pipeline.model_family == ModelFamily.ENSEMBLE
-        assert len(mock_fit.call_args_list[-1][0][0]) == len(ensembling_indices)
-        assert len(mock_fit.call_args_list[-1][0][1]) == len(ensembling_indices)
+        assert len(mock_fit.call_args_list[-1][0][0]) == len(X)

Contributor: I think we can just move these 2 assert len(...) statements outside of the if-else to reduce repeated code.

if best_pipeline == -1:
    assert automl.best_pipeline.model_family == ModelFamily.ENSEMBLE
else:
    assert automl.best_pipeline.model_family != ModelFamily.ENSEMBLE
assert len(mock_fit.call_args_list[-1][0][0]) == len(X)
assert len(mock_fit.call_args_list[-1][0][1]) == len(y)

Contributor Author: Great point!

+        assert len(mock_fit.call_args_list[-1][0][1]) == len(y)
     else:
         assert automl.best_pipeline.model_family != ModelFamily.ENSEMBLE
         assert len(mock_fit.call_args_list[-1][0][0]) == len(X)
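
As a footnote on the assertions above, here is a small self-contained illustration (a plain MagicMock and toy data, not the evalml test fixtures) of why mock_fit.call_args_list[-1][0][0] is the X passed to the most recent fit call: index [0] of a recorded call holds its positional arguments.

import pandas as pd
from unittest.mock import MagicMock

mock_fit = MagicMock()
X = pd.DataFrame({"a": range(10)})
y = pd.Series(range(10))

# Pretend the pipeline's fit method was patched with mock_fit and then called once.
mock_fit(X, y)

last_positional_args = mock_fit.call_args_list[-1][0]  # the (X, y) tuple of the last call
assert len(last_positional_args[0]) == len(X)
assert len(last_positional_args[1]) == len(y)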