
Warn users when pipeline parameters aren't used when initializing pipelines #2564

Merged
merged 12 commits into from Jul 29, 2021

Conversation

bchen1116
Contributor

@bchen1116 bchen1116 commented Jul 27, 2021

fix #1839

I opted to issue warnings rather than raise errors, since pipeline initialization also occurs during AutoMLSearch. When users pass in a pipeline_param that is valid for some pipelines but not others, we don't want to raise an error, because we pass the entirety of pipeline_parameters to every pipeline we initialize.

automl = AutoMLSearch(X_train=X, y_train=y, problem_type='binary',
                      pipeline_parameters={"CatBoost Classifier": {'n_estimators': 20}},
                      max_batches=2)

It would also add too much logic to AutoMLSearch to filter pipeline parameters down to only the components each pipeline actually needs (i.e., determining which components appear in which pipelines before make_pipelines, so we could trim and pass only the relevant parameters).

In this scenario, raising an error would prevent AutoMLSearch from running at all.

Raising a warning instead allows the search to continue while still giving the user feedback on whether the component parameters are being used.
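As a sketch of that behavior (the class and function names below are illustrative stand-ins, not evalml's actual implementation), initializing a pipeline with a parameter key that matches no component can emit a warning and keep going:

```python
import warnings

# Illustrative stand-ins for the behavior described above; not evalml's
# actual classes or function signatures.
class ParameterNotUsedWarning(UserWarning):
    """Warns about pipeline parameters that match no component in the pipeline."""

def init_pipeline_params(component_names, pipeline_parameters):
    """Return per-component parameters, warning (not raising) on unused keys."""
    unused = set(pipeline_parameters) - set(component_names)
    if unused:
        # Warn instead of raising so a search that shares one
        # pipeline_parameters dict across many pipelines can continue.
        warnings.warn(ParameterNotUsedWarning(
            f"Parameters for components {sorted(unused)} will not be used"))
    return {name: pipeline_parameters.get(name, {}) for name in component_names}
```

A pipeline without CatBoost still initializes; the user just sees the warning instead of a failed search.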

@bchen1116 bchen1116 self-assigned this Jul 27, 2021
@codecov

codecov bot commented Jul 27, 2021

Codecov Report

Merging #2564 (0f35646) into main (9e8a42d) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@           Coverage Diff           @@
##            main   #2564     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        287     287             
  Lines      26338   26402     +64     
=======================================
+ Hits       26304   26368     +64     
  Misses        34      34             
Impacted Files Coverage Δ
evalml/exceptions/__init__.py 100.0% <ø> (ø)
evalml/automl/automl_search.py 99.9% <100.0%> (+0.1%) ⬆️
evalml/exceptions/exceptions.py 100.0% <100.0%> (ø)
evalml/pipelines/component_graph.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 99.8% <100.0%> (+0.1%) ⬆️
.../automl_tests/test_automl_search_classification.py 100.0% <100.0%> (ø)
...valml/tests/pipeline_tests/test_component_graph.py 100.0% <100.0%> (ø)


@@ -466,11 +467,11 @@ def __init__(
self.sampler_method,
self.sampler_balanced_ratio,
)
-        if self._sampler_name not in parameters:
+        if self._sampler_name not in parameters and self._sampler_name is not None:
Contributor Author

We use this to make sure we don't add the sampling ratio to the param dictionary when self._sampler_name is None.
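A minimal sketch of why the extra check matters (the helper name and argument names are hypothetical, not evalml's):

```python
# Hypothetical helper mirroring the guard above: only attach the sampler's
# balanced ratio when a sampler is actually configured.
def add_sampler_ratio(parameters, sampler_name, sampler_balanced_ratio):
    if sampler_name not in parameters and sampler_name is not None:
        parameters[sampler_name] = {"sampler_balanced_ratio": sampler_balanced_ratio}
    return parameters
```

Without the `is not None` check, a pipeline with no sampler would pick up a useless `{None: {...}}` entry in its parameter dict, which the new unused-parameter warning would then flag on every pipeline.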

Contributor

Was this a bug?

Contributor Author

@freddyaboulton I think so. It wasn't doing anything except adding {None: {sampler_ratio: 0.25}} to the param dictionary, which is harmless but unnecessary and could cause problems in the future (like with this PR). It should be fixed now!

@@ -627,6 +636,27 @@ def __init__(
custom_hyperparameters=custom_hyperparameters,
)

def _catch_warnings(self, warning_list):
Contributor Author

Allows us to catch all warnings thrown during pipeline initialization and only throw the warnings for the component names that appear in all warnings. For instance, if we had pipelines:

p1 = ['Imputer', 'Standard Scaler', 'Logistic Regression Classifier']
p2 = ['Imputer', 'Random Forest Classifier']
params_passed_in = {"Logistic Regression Classifier": {"some value": 2},
                    "Undersampler": {"another val": 1}}

then we would only want to raise a warning for Undersampler, since it isn't used in any of the pipelines
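The filtering described above can be sketched as a set intersection (a standalone illustration, not the PR's exact code): only parameter keys unused by *every* pipeline deserve a warning.

```python
def unused_in_all_pipelines(pipelines, pipeline_parameters):
    """Return the parameter keys that no pipeline's components use.

    A key used by even one pipeline is fine; only keys unused
    everywhere warrant a warning.
    """
    unused_per_pipeline = [set(pipeline_parameters) - set(components)
                           for components in pipelines]
    return set.intersection(*unused_per_pipeline) if unused_per_pipeline else set()

p1 = ["Imputer", "Standard Scaler", "Logistic Regression Classifier"]
p2 = ["Imputer", "Random Forest Classifier"]
params = {"Logistic Regression Classifier": {"some value": 2},
          "Undersampler": {"another val": 1}}
# "Logistic Regression Classifier" appears in p1, so only "Undersampler"
# is unused by all pipelines.
```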

Contributor

Tbh when I first read the issue I didn't foresee we would need this in AutoMLSearch but I think it's the right call!

"pipeline_parameters,set_values",
[
({"Logistic Regression Classifier": {"penalty": "l1"}}, {}),
({"Logistic Regression": {"penalty": "l1"}}, {"Logistic Regression"}),
Contributor Author

This should issue a warning since the component name we've defined in the graph is Logistic Regression Classifier rather than Logistic Regression

@bchen1116 bchen1116 requested a review from a team July 28, 2021 14:53
@@ -627,6 +636,27 @@ def __init__(
custom_hyperparameters=custom_hyperparameters,
)

def _catch_warnings(self, warning_list):
if len(warning_list) == len(self.allowed_pipelines) and len(warning_list) > 0:
Contributor

@bchen1116 I think we can simplify this implementation a bit if we raise our own custom warning that subclasses UserWarning.

class ParameterNotUsedWarning(UserWarning):

    def __init__(self, components):
        self.components = components

        msg = f"Parameters for components {components} will not be used to instantiate the pipeline since they don't appear in the pipeline"
        super().__init__(msg)

ComponentGraph now becomes:

        diff = param_set.difference(set(self.component_instances.keys()))
        if len(diff):
            warnings.warn(ParameterNotUsedWarning(diff))

_catch_warnings now becomes:

    def _catch_warnings(self, warning_list):
        if len(warning_list) == len(self.allowed_pipelines) and len(warning_list) > 0:
            # we find the value(s) that we must throw the warning for
            final_message = set([])
            for idx, msg in enumerate(warning_list):
                if idx == 0:
                    final_message = final_message.union(msg.message.components)
                else:
                    final_message = final_message.intersection(msg.message.components)
            warnings.warn(ParameterNotUsedWarning(list(final_message)))

I think this is cleaner because it defines an api for accessing the components we want to warn about rather than messing with the string. I worry that'll be harder to maintain if we ever decide to change the message. I think the implementation is easier to understand as well. We can also explicitly filter for our custom warning in AutoMLSearch which seems cleaner too.

Something like:

            with warnings.catch_warnings(record=True) as w:
                warnings.filterwarnings("always", category=ParameterNotUsedWarning)

Let me know what you think!
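The filtering pattern suggested above is runnable standalone; here ParameterNotUsedWarning is defined locally as a stand-in for the class that would live in evalml.exceptions:

```python
import warnings

# Local stand-in for the custom warning class in evalml.exceptions.
class ParameterNotUsedWarning(UserWarning):
    pass

with warnings.catch_warnings(record=True) as w:
    # Ensure our custom category is always recorded inside this block.
    warnings.filterwarnings("always", category=ParameterNotUsedWarning)
    warnings.warn(ParameterNotUsedWarning("Undersampler not used"))
    warnings.warn(UserWarning("unrelated warning"))

# Keep only our category; other warnings may also have been recorded.
param_warnings = [rec for rec in w
                  if issubclass(rec.category, ParameterNotUsedWarning)]
```

Because the custom class subclasses UserWarning, existing filters on UserWarning still apply to it, while the `issubclass` check lets AutoMLSearch pick out exactly the warnings it wants to aggregate.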

Contributor Author

This is a good idea! Yeah, I can change the impl here

Contributor Author

@freddyaboulton added the suggestion you raised. Thanks!

Contributor

Thanks @bchen1116 I will take a look!

@@ -87,3 +87,11 @@ class ObjectiveCreationError(Exception):

class NoPositiveLabelException(Exception):
"""Exception when a particular classification label for the 'positive' class cannot be found in the column index or unique values"""


class ParameterNotUsedWarning(UserWarning):

@freddyaboulton freddyaboulton left a comment (Contributor)

@bchen1116 This looks great to me!

@chukarsten chukarsten left a comment (Collaborator)

Awesome, Bryan. I am sorry this scope crept up so much. Thanks for being flexible and talking me through it.

for estimator in allowed_estimators
]
self._catch_warnings(w)
else:
Collaborator

I'm not sure why this addition is so confusing to me. This else branch is intended for if self.allowed_component_graphs is None, yes?

Contributor Author

yep!

@bchen1116 bchen1116 merged commit 577753c into main Jul 29, 2021
@chukarsten chukarsten mentioned this pull request Aug 3, 2021
@freddyaboulton freddyaboulton deleted the bc_1839_parameters branch May 13, 2022 15:02
Successfully merging this pull request may close these issues.

Pipeline Parameters: Warn/Raise on Extra Parameters