Add method to convert actions to a preprocessing pipeline #2968
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2968 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 307 307
Lines 29265 29283 +18
=======================================
+ Hits 29174 29192 +18
Misses 91 91
Continue to review full report at Codecov.
@@ -23,7 +23,10 @@ def __init__(self, indices_to_drop=None, random_seed=0):
        ):
            raise ValueError("All input indices must be unique.")
        self.indices_to_drop = indices_to_drop
        super().__init__(parameters=None, component_obj=None, random_seed=random_seed)
        parameters = {"indices_to_drop": self.indices_to_drop}
Adding indices to parameters so they can be accessed when creating a pipeline :)
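A minimal sketch of the change being described, using a stand-in base class (the real evalml `ComponentBase` differs in detail; class and argument names here are illustrative):

```python
class ComponentBase:
    """Stand-in for a component base class that records parameters."""

    def __init__(self, parameters=None, component_obj=None, random_seed=0):
        self.parameters = parameters or {}
        self.random_seed = random_seed


class DropRowsTransformer(ComponentBase):
    """Drops the given row indices and exposes them via `parameters`."""

    def __init__(self, indices_to_drop=None, random_seed=0):
        if indices_to_drop is not None and len(set(indices_to_drop)) != len(
            indices_to_drop
        ):
            raise ValueError("All input indices must be unique.")
        self.indices_to_drop = indices_to_drop
        # Storing the indices in `parameters` (instead of passing parameters=None)
        # makes them retrievable later, when the component is assembled
        # into a pipeline.
        parameters = {"indices_to_drop": self.indices_to_drop}
        super().__init__(
            parameters=parameters, component_obj=None, random_seed=random_seed
        )
```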
@@ -389,10 +413,14 @@ def _make_component_list_from_actions(actions):
                TargetImputer(impute_strategy=metadata["impute_strategy"])
            )
        elif action.action_code == DataCheckActionCode.DROP_ROWS:
            indices = action.metadata["indices"]
            components.append(DropRowsTransformer(indices_to_drop=indices))
            indices_to_drop.extend(action.metadata["indices"])
Some cleanup here: updating code to just return one Drop Rows Transformer, similar to Drop Columns.
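The consolidation can be sketched like this: merge the indices from every `DROP_ROWS` action first, then build one transformer from the merged list (actions are modeled as plain dicts here for illustration; the real `DataCheckAction` objects differ):

```python
from enum import Enum


class DataCheckActionCode(Enum):
    """Toy subset of the action codes; only DROP_ROWS matters here."""

    DROP_ROWS = "drop_rows"


def collect_drop_rows_indices(actions):
    """Merge the indices from every DROP_ROWS action into one sorted list,
    so a single row-dropping transformer can be built instead of one per action."""
    indices_to_drop = []
    for action in actions:
        if action["action_code"] == DataCheckActionCode.DROP_ROWS:
            indices_to_drop.extend(action["metadata"]["indices"])
    # Deduplicate and sort so repeated indices across actions collapse cleanly.
    return sorted(set(indices_to_drop))
```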
@@ -491,8 +491,7 @@ def test_invalid_target_data_check_initialize_with_none_objective():
    )


@pytest.mark.parametrize("problem_type", ["regression"])
def test_invalid_target_data_check_regression_problem_nonnumeric_data(problem_type):
def test_invalid_target_data_check_regression_problem_nonnumeric_data():
No need to parametrize one value 😛
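The before/after shape of that cleanup, with toy test bodies standing in for the real data-check assertions:

```python
import pytest


# Before: parametrizing over a single value adds indirection
# without adding any extra coverage...
@pytest.mark.parametrize("problem_type", ["regression"])
def test_regression_nonnumeric_data_parametrized(problem_type):
    assert problem_type == "regression"


# ...after: a plain test runs the same single case more directly.
def test_regression_nonnumeric_data():
    problem_type = "regression"
    assert problem_type == "regression"
```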
@@ -35,11 +35,11 @@
)
from evalml.pipelines.utils import (
    _get_pipeline_base_class,
    _make_component_list_from_actions,
Replacing tests of private method with our new method!
@@ -166,6 +166,7 @@ def _get_preprocessing_components(


def _get_pipeline_base_class(problem_type):
    """Returns pipeline base class for problem_type."""
    problem_type = handle_problem_types(problem_type)
Interesting catch. Did this not cause us problems before?
Nope--but only because it's a private method that we use in tests 😂
It just happened that everywhere we used it, we passed the ProblemTypes enum, so it never caused problems before; I only noticed because I tried using strings!
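The normalization being discussed can be sketched like this (a simplified stand-in; evalml's real `handle_problem_types` and `ProblemTypes` have more members and checks):

```python
from enum import Enum


class ProblemTypes(Enum):
    """Toy subset of evalml's problem-type enum."""

    BINARY = "binary"
    MULTICLASS = "multiclass"
    REGRESSION = "regression"


def handle_problem_types(problem_type):
    """Return the matching ProblemTypes member, accepting either a member
    or a (case-insensitive) string such as "regression"."""
    if isinstance(problem_type, ProblemTypes):
        return problem_type
    try:
        return ProblemTypes(problem_type.lower())
    except ValueError:
        raise ValueError(f"Unknown problem type: {problem_type!r}")
```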
for component in component_list:
    parameters[component.name] = component.parameters
component_dict = PipelineBase._make_component_dict_from_component_list(
    [component.name for component in component_list]
Is there a specific reason we can't pass the `component_list` in directly here? `_make_component_dict_from_component_list` calls `handle_component_class`, so if I understand correctly it should be able to handle the list directly.
The slight difference, I think, is that passing in a list of components directly will create a pipeline where the component class is the key value, whereas this uses the name instead!
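The keying difference can be shown with a toy example (this is an illustration of the distinction, not the real `_make_component_dict_from_component_list`):

```python
class DropRowsTransformer:
    """Toy component with a human-readable `name`, like evalml components."""

    name = "Drop Rows Transformer"


component_list = [DropRowsTransformer]

# Passing the classes themselves would key the resulting graph by class...
keyed_by_class = {component: [component] for component in component_list}

# ...while passing names keys it by the human-readable component name.
keyed_by_name = {component.name: [component.name] for component in component_list}
```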
Hey Angela! Love this PR, as I love when we use our own code to do things, and do them intuitively, which I think this does. I'm approving even though the description mentions merging the preprocessing pipeline with the standard pipelines; I'm assuming that work got split out somewhere else. If it were something we wanted to tackle in this PR, though, I'd expect tests showing the intended functionality of the helper function, plus documentation in the user guide showing how we intend users to use DataCheckActions. If this is happening in a subsequent PR, cool beans!
)
@pytest.mark.parametrize("problem_type", ["binary", "multiclass", "regression"]) |
Is this PR supposed to encompass the helper function to merge the two pipelines together?
Good catch! I filed #2997 for this :')
…ml into 2058_preprocessing_pipeline
Closes #2058