Replace allowed_pipelines with allowed_component_graphs #2364

Merged

merged 98 commits into main from Replace-AllowedPipelines-ComponentGraphs, Jun 18, 2021

Conversation

@ParthivNaresh ParthivNaresh commented Jun 10, 2021

Fixes #2159 and #2166

Here are the docs

Here's an example of the new implementation:

from pprint import pp

from evalml.automl import AutoMLSearch
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100, n_features=20, n_informative=2, n_redundant=2, random_state=0
)

allowed_component_graph = {
    "Logistic Regression Binary Pipeline": [
        "Imputer",
        "One Hot Encoder",
        "Standard Scaler",
        "Logistic Regression Classifier",
    ]
}
automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    max_iterations=3,
    allowed_component_graphs=[allowed_component_graph],
    ensembling=True,
)
automl.search()

for pipe in automl.full_rankings['parameters']:
    pp(pipe)

================= OUTPUT ==================
{'Imputer': {'categorical_impute_strategy': 'most_frequent',
             'numeric_impute_strategy': 'mean',
             'categorical_fill_value': None,
             'numeric_fill_value': None},
 'One Hot Encoder': {'top_n': 10,
                     'features_to_encode': None,
                     'categories': None,
                     'drop': 'if_binary',
                     'handle_unknown': 'ignore',
                     'handle_missing': 'error'},
 'Logistic Regression Classifier': {'penalty': 'l2',
                                    'C': 1.0,
                                    'n_jobs': -1,
                                    'multi_class': 'auto',
                                    'solver': 'lbfgs'}}
{'Imputer': {'categorical_impute_strategy': 'most_frequent',
             'numeric_impute_strategy': 'most_frequent',
             'categorical_fill_value': None,
             'numeric_fill_value': None},
 'One Hot Encoder': {'top_n': 10,
                     'features_to_encode': None,
                     'categories': None,
                     'drop': 'if_binary',
                     'handle_unknown': 'ignore',
                     'handle_missing': 'error'},
 'Logistic Regression Classifier': {'penalty': 'l2',
                                    'C': 8.474044870453413,
                                    'n_jobs': -1,
                                    'multi_class': 'auto',
                                    'solver': 'lbfgs'}}
{'Baseline Classifier': {'strategy': 'mode'}}

codecov bot commented Jun 10, 2021

Codecov Report

Merging #2364 (81a0067) into main (8c51031) will decrease coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #2364     +/-   ##
=======================================
- Coverage   99.7%   99.7%   -0.0%     
=======================================
  Files        281     281             
  Lines      25014   25057     +43     
=======================================
+ Hits       24917   24957     +40     
- Misses        97     100      +3     
Impacted Files Coverage Δ
evalml/tests/automl_tests/test_engine_base.py 100.0% <ø> (ø)
...lml/automl/automl_algorithm/iterative_algorithm.py 99.2% <100.0%> (-<0.1%) ⬇️
evalml/automl/automl_search.py 99.4% <100.0%> (-<0.1%) ⬇️
evalml/automl/utils.py 100.0% <100.0%> (ø)
...valml/pipelines/time_series_regression_pipeline.py 100.0% <100.0%> (ø)
.../tests/automl_tests/dask_tests/test_automl_dask.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 99.8% <100.0%> (+0.2%) ⬆️
.../automl_tests/test_automl_search_classification.py 100.0% <100.0%> (ø)
...ests/automl_tests/test_automl_search_regression.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl_utils.py 100.0% <100.0%> (ø)
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8c51031...81a0067.


@bchen1116 bchen1116 left a comment


Didn't finish yet, but I left a few questions that would help me understand. Looking good so far!

(Resolved review threads on: evalml/automl/automl_algorithm/iterative_algorithm.py, evalml/automl/automl_search.py, docs/source/user_guide/automl.ipynb, evalml/automl/utils.py)
    objective=objective,
    additional_objectives=[],
    optimize_thresholds=False,
    n_jobs=1,
)
automl._automl_algorithm = IterativeAlgorithm(

Did you need to define this since we're patching IterativeAlgo? Why did we not need to do it before?

(Resolved review thread on evalml/tests/automl_tests/test_automl.py)
objective="log loss binary",
additional_objectives=["f1"],
)
automl._automl_algorithm = IterativeAlgorithm(

Why is this needed?

ParthivNaresh (Author) replied:

Because the DummyPipeline score is being mocked for the purpose of this test (which is something ComponentGraph doesn't have), I have to pass that pipeline into IterativeAlgorithm directly, since we can't pass it through AutoMLSearch anymore. This is the same reason it was added for the other tests as well. Luckily, it doesn't look like this increases the execution time.
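For context, a minimal sketch of the pattern being described (DummyPipeline is the test's mocked fixture; the keyword arguments mirror the IterativeAlgorithm call quoted later in this thread, and the exact values are assumptions):

from evalml.automl import AutoMLSearch
from evalml.automl.automl_algorithm import IterativeAlgorithm
from evalml.tuners import SKOptTuner

# Assumes X, y and DummyPipeline (the mocked test pipeline) from the test setup.
automl = AutoMLSearch(X_train=X, y_train=y, problem_type="binary", max_iterations=1)
# Hand the mocked pipeline straight to IterativeAlgorithm, since pipeline
# objects can no longer be passed through AutoMLSearch itself.
automl._automl_algorithm = IterativeAlgorithm(
    max_iterations=1,
    allowed_pipelines=[DummyPipeline(parameters={})],
    tuner_class=SKOptTuner,
    random_seed=0,
    n_jobs=1,
    number_features=X.shape[1],
    pipelines_per_batch=5,
    ensembling=False,
    text_in_ensembling=False,
    pipeline_params={},
    custom_hyperparameters=None,
)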


@freddyaboulton freddyaboulton left a comment


@ParthivNaresh Looks good! I have some minor comments I'd like to discuss before merge though.

(Resolved review threads on: evalml/automl/utils.py, docs/source/user_guide/automl.ipynb, evalml/automl/automl_search.py)
)
unique_names = set()
for graph in allowed_component_graphs:
    unique_names.add(list(graph.keys())[0])

I'm wondering if we should change the expected type of allowed_component_graphs from a list of dicts to just a dict.

So from

            allowed_component_graphs=[
                {"Name_0": [dummy_classifier_estimator_class]},
                {"Name_1": [dummy_classifier_estimator_class]},
                {"Name_2": [dummy_classifier_estimator_class]},
            ],

to

            allowed_component_graphs={
                "Name_0": [dummy_classifier_estimator_class],
                "Name_1": [dummy_classifier_estimator_class],
                "Name_2": [dummy_classifier_estimator_class],
            },

I see there's some duplicated code in this PR to unpack the list, e.g. list(graph.keys())[0] and next(iter(component_graph)); making this change may make things clearer and reduce code?
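To make the duplication concrete, a toy sketch (not code from this PR):

graphs_as_list = [{"Name_0": ["Imputer"]}, {"Name_1": ["Imputer"]}]
# Two equivalent idioms for pulling the single name out of each one-key dict:
names = [list(g.keys())[0] for g in graphs_as_list]
names = [next(iter(g)) for g in graphs_as_list]

# With a single dict, no unpacking is needed; the names are just the keys:
graphs_as_dict = {"Name_0": ["Imputer"], "Name_1": ["Imputer"]}
names = list(graphs_as_dict)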

chukarsten (Collaborator) replied:

I'll go a step further...is there any reason to not just expect a list or dictionary of ComponentGraph objects?

ParthivNaresh (Author) replied:

Originally the format was set up that way to support random_seed, but since we're not passing that anymore, this would definitely be a better alternative.
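For illustration, a sketch of what the call could look like with the dict format (assuming the rest of the signature stays as in the example at the top of this PR):

automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    max_iterations=3,
    allowed_component_graphs={
        "Logistic Regression Binary Pipeline": [
            "Imputer",
            "One Hot Encoder",
            "Standard Scaler",
            "Logistic Regression Classifier",
        ]
    },
)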

(Resolved review thread on evalml/automl/automl_algorithm/iterative_algorithm.py)
"mean",
}
if row["pipeline_name"] == "Name_linear":
assert row["parameters"]["Imputer"]["numeric_impute_strategy"] == "mean"

What's causing this diff? Is it the deletion of _frozen_pipeline_parameters?

ParthivNaresh (Author) replied:

Yes, I think so. Since the custom hyperparameter is set to a constant value, all pipelines will return "mean" for the Imputer.
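For reference, a sketch of what pinning that hyperparameter to a constant looks like (Categorical is skopt's search-space type, which evalml uses for hyperparameter ranges; the exact test values are assumptions):

from skopt.space import Categorical

# A one-element Categorical leaves the tuner no choice, so every proposed
# pipeline ends up with numeric_impute_strategy == "mean".
custom_hyperparameters = {
    "Imputer": {"numeric_impute_strategy": Categorical(["mean"])}
}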

(Resolved review thread on evalml/tests/conftest.py)

@chukarsten chukarsten left a comment


Wow, that was a lot of work. Good job. I think we should discuss whether we want to preserve the two different ways of passing in the allowed component graphs. To me, the way to do it would be to reduce the complexity and only pass in a dict of component graphs (if they need to be named). The handfuls of instances where IterativeAlgorithm is set in the tests are also weird, but I don't think those are blocking for me.

(Resolved review threads on: evalml/automl/automl_algorithm/iterative_algorithm.py, evalml/automl/automl_search.py)

Comment on lines 167 to 179
automl._automl_algorithm = IterativeAlgorithm(
    max_iterations=4,
    allowed_pipelines=pipelines,
    tuner_class=SKOptTuner,
    random_seed=0,
    n_jobs=-1,
    number_features=X.shape[1],
    pipelines_per_batch=5,
    ensembling=False,
    text_in_ensembling=False,
    pipeline_params={},
    custom_hyperparameters=None,
)

All these instances of having to set the IterativeAlgorithm are kinda weird.

@ParthivNaresh ParthivNaresh merged commit cbc2799 into main Jun 18, 2021
@chukarsten chukarsten mentioned this pull request Jun 22, 2021
@freddyaboulton freddyaboulton deleted the Replace-AllowedPipelines-ComponentGraphs branch May 13, 2022 15:35
Successfully merging this pull request may close these issues.

Update AutoMLSearch to accept allowed_component_graphs instead of allowed_pipelines