Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update pipeline API to accept component graph and other class attributes as parameters #2091

Merged
merged 98 commits into from Apr 24, 2021

Conversation

angela97lin
Copy link
Contributor

@angela97lin angela97lin commented Apr 6, 2021

Closes #1956

Would also close #1984 and close #652 :)
Pros:

Cons:

  • To use in AutoML, if init is overriden then we need to override new/clone too.
  • Conceptually a bit odd: parameters should only be used for first pipeline, then hyperparameter tuning happens. Maybe this is okay, since we do this internally anyways? What should the parameters of a pipeline mean?
  • For time series, gap and max_delay need to be specified for input pipelines.

General notes:

  • allowed_pipeline now takes in pipeline instances.
  • For pipeline instances to be accepted by AutoMLSearch, must implement their own clone method? Maybe kwargs? Override init --> need to override new, clone
  • Keeping track of pipeline instances via _pipelines_searched on the AutoMLSearch object. Can't add to _results because we deepcopy results often, and deepcopying LightGBM models causes a lot of printout 馃槵
  • Added concept of frozen_parameters to IterativeAlgorithm which represents parameters that should not be changed throughout AutoMLSearch.

Defining a pipeline class:

class LogisticRegressionBinaryPipeline(BinaryClassificationPipeline):
    component_graph = ['Imputer', 'One Hot Encoder', 'Standard Scaler', 'Logistic Regression Classifier']
    def __init__(self, parameters, random_seed=0):
        return super().__init__(self.component_graph, None, {}, None, random_seed)

    def new(self, parameters, random_seed):
        """Constructs a new pipeline with the same components, parameters, and random state.

        Returns:
            A new instance of this pipeline with identical components, parameters, and random state.
        """
        return self.__class__(parameters, random_seed)
    def clone(self):
        """Constructs a new pipeline with the same components, parameters, and random state.

        Returns:
            A new instance of this pipeline with identical components, parameters, and random state.
        """
        # return self.__class__(self.parameters, random_seed=self.random_seed)
        return self.__class__(self.parameters, self.random_seed)

LogisticRegressionBinaryPipeline({})


automl = AutoMLSearch(X, y, problem_type="binary", error_callback=raise_error_callback, max_batches=20, 
                     allowed_pipelines=[LogisticRegressionBinaryPipeline({})])

Without defining a new pipeline class:

pipeline = BinaryClassificationPipeline(["One Hot Encoder", "Random Forest Classifier"],
                                        "some custom name",
                                       parameters={},
                                       custom_hyperparameters={},
                                       random_seed=0)


pipeline.fit(X, y)
pipeline._component_graph
pipeline.estimator
# X , y = load_breast_cancer(return_pandas=True)

pipeline.score(X, y, ['log loss binary'])

pipeline.predict(X).to_series()
pipeline.predict_proba(X).to_dataframe()

Pickling:

Saving in one notebook:
image


with open("best.pkl", 'wb') as f:
    pickle.dump(automl.best_pipeline, f)

Loading in another notebook:
image

import pickle

best_pipeline = None
with open('best.pkl', 'rb') as f:
    best_pipeline = pickle.load(f)
best_pipeline.fit(X, y)

To discuss:

  • Should pipeline instances be aware of hyperparameters? Conceptually, there's no reason for them to be, AutoML should be, but each pipeline instance should only care about the parameters to use, not the hyperparameter range.
  • A reason why it should be on the pipeline could be so that the same components in different pipelines can have different hyperparameter ranges. Ex: imputer on one pipeline has impute strategy of [mean, most_frequent], imputer on another has just [mean]. However, we currently use names as a way of identifying pipelines (no duplicate names allowed) so if we could store this on automl via pipeline, this could suffice? For now, working on cleaning things up. We could keep hyperparameters on pipelines in this PR if it requires too much change, and then clean up in separate PR.
  • Pipeline has random seed, and is passed into AutoMLSearch as an allowed_pipeline. Do we use AutoML random seed? OR pipeline random seed?
  • Right now, pipeline_params passed in as custom_hyperparameters but also used to instantiate what parameters the first pipeline should use.
  • generate_pipeline_code won't work well with overridden __init__; this is true even in main now, but overriding init makes a lot more sense for custom pipeline classes now. But do we still need something like generate_pipeline_code?
  • There are quite a few doc changes that would need to happen, too. Here, I've just updated the docs to pass but I think it's a better idea to file a separate issue since this PR is getting quite out of hand 馃槄
  • Pipeline utils make_pipeline_from_components is really no longer necessary, but will keep for now, can file something else to remove.

Post-discussion with @freddyaboulton @chukarsten @dsherry:

  • There are certainly some flaws or gaps to a more ideal API here: We agreed that it's odd to pass in pipeline instances as allowed_pipelines when we don't use the parameters. We only use component_graph, name/custom_name, and custom_hyperparameters. Perhaps a better API would be to pass in component graphs to AutoMLSearch instead:
automl = AutoMLSearch(X, y, problem_type="binary", allowed_component_graphs = {"Some name for component graph": ComponentGraph(...),})```

This would make it more clear that we are not using the parameters from the pipeline instance passed in

  • We currently have two ways of specifying custom_hyperparameters for pipelines in AutoML: via the pipeline class' custom_hyperparameters and via pipeline_params. We will need to revisit whether it makes sense for the custom hyperparameter ranges to be stored on the pipeline. It makes sense for each component to be aware of the hyperparameter ranges.
  • Don't use parameters on pipeline instance (ex: random_seed), use what is specified by AutoMLSearch. Pipeline instances are just the wrapper to grab component graph and name :d

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angela97lin I left a couple questions. Will review tests shortly and then approve!

docs/source/user_guide/pipelines.ipynb Show resolved Hide resolved
evalml/automl/automl_algorithm/iterative_algorithm.py Outdated Show resolved Hide resolved
parameters = copy.copy(self.pipeline_parameters)
if self.problem_configuration:
parameters.update({'pipeline': self.problem_configuration})
self._frozen_pipeline_parameters.update({'pipeline': self.problem_configuration})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angela97lin why is this necessary? IterativeAlgorithm._transform_parameters currently handles adding pipeline-level parameters. Unless I'm missing some details I think we should either use that approach or this approach, but not both. I'm not saying the solution in _transform_parameters is the best, haha, but I think it does the same thing as this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the logic right now in main:

We pass the pipeline parameters, noticeably without the drop-columns parameters, to make_pipelines. This means that there are no custom hyperparameters set for the drop-columns component. Then, we use IterativeAlgo to handle all the wrangling of this specific case. In main, in we move the creation of allowed_pipelines from L297 to L302, we would error out:

ValueError: Default parameters for components in pipeline Decision Tree Classifier w/ Imputer + Drop Columns Transformer not in the hyperparameter ranges: Point (['most_frequent', 'mean', ['index_col'], 'gini', 'auto', 6]) is not within the bounds of the space ([('most_frequent',), ('mean', 'median', 'most_frequent'), ('index_col',), ('gini', 'entropy'), ('auto', 'sqrt', 'log2'), (4, 10)]).

(https://github.com/alteryx/evalml/blob/main/evalml/automl/automl_search.py).

Here, I had passed in parameters (rather than a copy) to the custom hyperparameters, which caused the same error. It's still a little odd to me that this logic is dependent on us not passing extra information for the drop columns transformer here and then passing that information to the IterativeAlgorithm to handle later, and I was hoping to ease some of that, because if we have more components that take in lists as parameters, we'll have to hardcode more code... but also okay with not doing this for now if it convolutes the logic even further!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDIT: Actually, running some more tests, I think this might still be a good idea. Before, allowed_pipelines created classes. Now, we create instances. For time-series problems, we need to pass in the parameters. These parameters are found via pipeline_parameters. That's why this block of code was moved up, so that we can pass in the appropriate parameters to the time series pipeline instances. However, this logic that we currently have, pipeline_parameters also potentially contains smaller custom hyperparameter spaces, which will error out as parameters.

I noticed this when test_automl_pipeline_params_kwargs failed 馃槗 Open to filing an issue to somehow clean this up, or maybe this is the cleanup we need to be aware of as we restructure and rethink IterativeAlgorithm

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, understood. If this looks good to you, I am on board. Yep, I hope we can revisit this after this PR and find a good way to clarify how we pass parameters to the automl algo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'll take another pass at this before merging, and filed #2187 to revisit this!

evalml/automl/automl_search.py Show resolved Hide resolved
evalml/pipelines/pipeline_base.py Show resolved Hide resolved
evalml/pipelines/pipeline_base.py Show resolved Hide resolved
evalml/pipelines/pipeline_base.py Outdated Show resolved Hide resolved
Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angela97lin ok this looks ready to go! 馃憦馃憦馃憦

I left some comments but nothing blocking merge.

cat_laser_pickle

evalml/tests/automl_tests/dask_test_utils.py Show resolved Hide resolved
evalml/tests/automl_tests/test_automl.py Show resolved Hide resolved
evalml/tests/automl_tests/test_automl.py Show resolved Hide resolved
evalml/tests/conftest.py Show resolved Hide resolved
@@ -173,6 +174,9 @@ def test_fast_permutation_importance_matches_sklearn_output(mock_supports_fast_i
class PipelineWithDimReduction(BinaryClassificationPipeline):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good

For #2184 I'd suggest we set the custom_name field!

evalml/tests/pipeline_tests/test_pipelines.py Show resolved Hide resolved
evalml/tests/pipeline_tests/test_pipelines.py Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants