Best pipeline trained by AutoMLSearch gets different score than cloned version trained on X_train #2844

Closed
angela97lin opened this issue Sep 24, 2021 · 4 comments · Fixed by #2948

angela97lin (Contributor) commented Sep 24, 2021:

Repro:


# dataset and modification come from the vowpal wabbit example
import numpy as np
import evalml
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.make_hastie_10_2(n_samples=15000, random_state=1)
X = X.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=256)

automl = evalml.automl.AutoMLSearch(X_train=X_train, y_train=y_train, 
                                    problem_type='binary', verbose=True, max_batches=3,
                                    train_best_pipeline=True)
automl.search()
automl.best_pipeline.score(X_test, y_test, objectives=['Accuracy Binary'])

This score (0.933) is different from:

clone = automl.best_pipeline.clone()
clone.fit(X_train, y_train)
clone.score(X_test, y_test, objectives=['Accuracy Binary'])

This feels like a bug to me. If we set train_best_pipeline to False and then fit on X_train and y_train, we get the same result as the clone:

automl = evalml.automl.AutoMLSearch(X_train=X_train, y_train=y_train, 
                                    problem_type='binary', verbose=True, max_batches=3,
                                    train_best_pipeline=False)
automl.search()
automl.best_pipeline.fit(X_train, y_train)
automl.best_pipeline.score(X_test, y_test, objectives=['Accuracy Binary'])

Is it possible that the data we're using to train the best pipeline is not the whole X_train/y_train? Or is some data transformation happening?

angela97lin added the bug label on Sep 24, 2021.
freddyaboulton (Contributor) commented:

@angela97lin Does it happen for regression problems? Could it be threshold tuning? I believe AutoMLSearch's train_best_pipeline will tune the threshold by default.
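
A quick way to check this hypothesis (a sketch, assuming the tuned decision boundary is exposed via the pipeline's threshold attribute, as on evalml's BinaryClassificationPipeline):

# If threshold tuning is the cause, the best pipeline trained by AutoMLSearch
# should carry a tuned threshold rather than the default of None (plain 0.5 cutoff).
print(automl.best_pipeline.threshold)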

angela97lin (Contributor, Author) commented:

@freddyaboulton Nope, it doesn't happen for regression problems, so I think you're correct. Setting optimize_thresholds to False (it defaults to True) also changes the behavior such that the two results are the same.
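
For reference, a minimal sketch of that check, reusing the repro setup above (optimize_thresholds is an existing AutoMLSearch parameter):

automl = evalml.automl.AutoMLSearch(X_train=X_train, y_train=y_train,
                                    problem_type='binary', verbose=True, max_batches=3,
                                    train_best_pipeline=True, optimize_thresholds=False)
automl.search()
automl.best_pipeline.score(X_test, y_test, objectives=['Accuracy Binary'])

# With threshold tuning disabled, a retrained clone should score the same.
clone = automl.best_pipeline.clone()
clone.fit(X_train, y_train)
clone.score(X_test, y_test, objectives=['Accuracy Binary'])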

The next question then is: do we want this type of behavior / is this intentional? I can see why we would want it, but I question whether it's confusing for users. Only the best pipeline is trained with the tuned threshold, right?

  1. Is it confusing that a "cloned" version of the pipeline external to automl search would not produce the same results? How do we make this clearer?
  2. If a user wanted to choose the pipeline with id=0, there's no good way to do so without digging into automl search and the tuned threshold parameter the same way we do for the best pipeline, right? I think it's confusing / unclear that grabbing the best pipeline gives you one fitted using the tuned threshold, while grabbing any other pipeline and manually training it will not use a tuned threshold.

TL;DR: the behavior seems a bit inconsistent and can be confusing or lead to misunderstandings. Is this what we're okay with, or should we clarify it somehow?

freddyaboulton (Contributor) commented:

@angela97lin Great points. I think this behavior is intentional; it was introduced in these PRs:

#1943 Made it so that we tuned thresholds by default
#2320 Added a secondary objective that we use for tuning when the primary objective is not tunable.

Only the best pipeline is trained with the tuned threshold, right?

Every pipeline during automl search is trained with threshold tuning because of the default values set in #1943 and #2320.

So the problem is that AutoMLSearch tunes thresholds by default, but there's no way for users to know that or recreate it outside of search if they want to "export" pipelines out of AutoMLSearch.

Would one of these two options fix the problem?

  1. Store the optimal threshold value for each pipeline in AutoMLSearch.results and make sure get_pipeline sets the threshold value correctly? (A rough sketch of what this could look like follows below.)
  2. Make threshold optimization a step in the pipeline?
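
As a rough sketch of option 1 (hypothetical: the tuned_threshold key does not exist in AutoMLSearch.results today, and get_pipeline currently returns an untrained pipeline with no threshold set):

# Hypothetical workflow if the tuned threshold were stored per pipeline id.
pipeline = automl.get_pipeline(0)   # untrained pipeline with id=0
pipeline.fit(X_train, y_train)
# 'tuned_threshold' is a hypothetical key illustrating option 1; it is not stored today.
pipeline.threshold = automl.results['pipeline_results'][0].get('tuned_threshold')
pipeline.score(X_test, y_test, objectives=['Accuracy Binary'])

With option 2, the threshold would instead travel with the pipeline itself, so exported or cloned pipelines would reproduce search-time behavior without any extra bookkeeping.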

dsherry (Contributor) commented Sep 29, 2021:

Discussion looks good. Whoever picks this up will need to decide what to do next!

Questions from refinement:

  • Does cloning copy the binary classification threshold? (In refinement we think it does not; see the quick check sketched after this list.)
  • If not, why not?
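
A quick check for the first question (a sketch, assuming the tuned decision boundary lives on the pipeline's threshold attribute):

# After a search that tuned thresholds:
print(automl.best_pipeline.threshold)          # tuned value set during search
print(automl.best_pipeline.clone().threshold)  # expected None if clone() does not copy it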
