Best pipeline trained by AutoMLSearch gets different score than cloned version trained on X_train #2844

Closed
angela97lin opened this issue Sep 24, 2021 · 4 comments · Fixed by #2948

angela97lin (Contributor) commented Sep 24, 2021:

Repro:


# dataset and modification come from the vowpal wabbit example
import numpy as np
import evalml
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.make_hastie_10_2(n_samples=15000, random_state=1)
X = X.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=256)

automl = evalml.automl.AutoMLSearch(X_train=X_train, y_train=y_train, 
                                    problem_type='binary', verbose=True, max_batches=3,
                                    train_best_pipeline=True)
automl.search()
automl.best_pipeline.score(X_test, y_test, objectives=['Accuracy Binary'])

This score (0.933) is different from:

clone = automl.best_pipeline.clone()
clone.fit(X_train, y_train)
clone.score(X_test, y_test, objectives=['Accuracy Binary'])

This feels like a bug to me. If we set train_best_pipeline to False and then fit on X_train and y_train, we get the same result as the clone:

automl = evalml.automl.AutoMLSearch(X_train=X_train, y_train=y_train, 
                                    problem_type='binary', verbose=True, max_batches=3,
                                    train_best_pipeline=False)
automl.search()
automl.best_pipeline.fit(X_train, y_train)
automl.best_pipeline.score(X_test, y_test, objectives=['Accuracy Binary'])

Is it possible that the data we're using to train the best pipeline is not the whole X_train/y_train? Or is some data transformation happening?

angela97lin added the bug label on Sep 24, 2021.
freddyaboulton (Contributor) commented:

@angela97lin Does it happen for regression problems? Could it be threshold tuning? I believe AutoMLSearch's train_best_pipeline will tune the threshold by default.
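
A quick way to check this hypothesis (a sketch, assuming the tuned decision boundary is exposed via the pipeline's threshold attribute, as on evalml's BinaryClassificationPipeline):

# If threshold tuning is the cause, the best pipeline trained by AutoMLSearch
# should carry a tuned threshold rather than the default of None (plain 0.5 cutoff).
print(automl.best_pipeline.threshold)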

angela97lin (Contributor, Author) commented:

@freddyaboulton Nope, it doesn't happen for regression problems, so I think you're correct. Setting optimize_thresholds to False (it defaults to True) also changes the behavior such that the two results are the same.
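
For reference, a minimal sketch of that check, reusing the repro setup above (optimize_thresholds is an existing AutoMLSearch parameter):

automl = evalml.automl.AutoMLSearch(X_train=X_train, y_train=y_train,
                                    problem_type='binary', verbose=True, max_batches=3,
                                    train_best_pipeline=True, optimize_thresholds=False)
automl.search()
automl.best_pipeline.score(X_test, y_test, objectives=['Accuracy Binary'])

# With threshold tuning disabled, a retrained clone should score the same.
clone = automl.best_pipeline.clone()
clone.fit(X_train, y_train)
clone.score(X_test, y_test, objectives=['Accuracy Binary'])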

The next question then is: do we want this type of behavior / is this intentional? I can see why we would want it, but I question whether it's confusing for users. Only the best pipeline is trained with the tuned threshold, right?

  1. Is it confusing that a "cloned" version of the pipeline external to automl search would not produce the same results? How do we make this clearer?
  2. If a user wanted to choose the pipeline with id=0, there's no good way to do so without digging into automl search and the tuned threshold parameter the same way we do for the best pipeline, right? I think it's confusing / unclear that grabbing the best pipeline gives you one fitted using the tuned threshold, while grabbing any other pipeline and manually training it will not use a tuned threshold.

TL;DR: the behavior seems a bit inconsistent and can be confusing or lead to misunderstandings. Is this what we're okay with, or should we clarify it somehow?

freddyaboulton (Contributor) commented:

@angela97lin Great points. I think this behavior is intentional; it was introduced in these PRs:

#1943 Made it so that we tuned thresholds by default
#2320 Added a secondary objective that we use for tuning when the primary objective is not tunable.

Only the best pipeline is trained with the tuned threshold, right?

Every pipeline during automl search is trained with threshold tuning because of the default values set in #1943 and #2320.

So the problem is that AutoMLSearch tunes thresholds by default, but there's no way for users to know that or recreate it outside of search if they want to "export" pipelines out of AutoMLSearch.

Would one of these two options fix the problem?

  1. Store the optimal threshold value for each pipeline in AutoMLSearch.results and make sure get_pipeline sets the threshold value correctly? (A rough sketch of what this could look like follows below.)
  2. Make threshold optimization a step in the pipeline?
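
As a rough sketch of option 1 (hypothetical: the tuned_threshold key does not exist in AutoMLSearch.results today, and get_pipeline currently returns an untrained pipeline with no threshold set):

# Hypothetical workflow if the tuned threshold were stored per pipeline id.
pipeline = automl.get_pipeline(0)   # untrained pipeline with id=0
pipeline.fit(X_train, y_train)
# 'tuned_threshold' is a hypothetical key illustrating option 1; it is not stored today.
pipeline.threshold = automl.results['pipeline_results'][0].get('tuned_threshold')
pipeline.score(X_test, y_test, objectives=['Accuracy Binary'])

With option 2, the threshold would instead travel with the pipeline itself, so exported or cloned pipelines would reproduce search-time behavior without any extra bookkeeping.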

dsherry (Contributor) commented Sep 29, 2021:

Discussion looks good. Whoever picks this up will need to decide what to do next!

Questions from refinement:

  • Does cloning copy the binary classification threshold? (In refinement we think it does not; see the quick check sketched after this list.)
  • If not, why not?
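
A quick check for the first question (a sketch, assuming the tuned decision boundary lives on the pipeline's threshold attribute):

# After a search that tuned thresholds:
print(automl.best_pipeline.threshold)          # tuned value set during search
print(automl.best_pipeline.clone().threshold)  # expected None if clone() does not copy it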
