Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add method to convert actions to a preprocessing pipeline #2968

Merged
merged 11 commits into from
Oct 31, 2021

Conversation

angela97lin
Copy link
Contributor

Closes #2058

@angela97lin angela97lin self-assigned this Oct 27, 2021
@codecov
Copy link

codecov bot commented Oct 27, 2021

Codecov Report

Merging #2968 (e194778) into main (b95f40b) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #2968     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        307     307             
  Lines      29265   29283     +18     
=======================================
+ Hits       29174   29192     +18     
  Misses        91      91             
Impacted Files Coverage Δ
evalml/automl/utils.py 100.0% <ø> (ø)
...ransformers/preprocessing/drop_rows_transformer.py 100.0% <100.0%> (ø)
evalml/pipelines/utils.py 99.5% <100.0%> (+0.1%) ⬆️
...ests/component_tests/test_drop_rows_transformer.py 100.0% <100.0%> (ø)
...ta_checks_tests/test_invalid_targets_data_check.py 100.0% <100.0%> (ø)
evalml/tests/pipeline_tests/test_pipeline_utils.py 99.7% <100.0%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b95f40b...e194778. Read the comment docs.

@@ -23,7 +23,10 @@ def __init__(self, indices_to_drop=None, random_seed=0):
):
raise ValueError("All input indices must be unique.")
self.indices_to_drop = indices_to_drop
super().__init__(parameters=None, component_obj=None, random_seed=random_seed)
parameters = {"indices_to_drop": self.indices_to_drop}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding indices to parameters so they can be accessed when creating a pipeline :)

@@ -389,10 +413,14 @@ def _make_component_list_from_actions(actions):
TargetImputer(impute_strategy=metadata["impute_strategy"])
)
elif action.action_code == DataCheckActionCode.DROP_ROWS:
indices = action.metadata["indices"]
components.append(DropRowsTransformer(indices_to_drop=indices))
indices_to_drop.extend(action.metadata["indices"])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some cleanup here: updating code to just return one Drop Rows Transformer, similar to Drop Columns.

@@ -491,8 +491,7 @@ def test_invalid_target_data_check_initialize_with_none_objective():
)


@pytest.mark.parametrize("problem_type", ["regression"])
def test_invalid_target_data_check_regression_problem_nonnumeric_data(problem_type):
def test_invalid_target_data_check_regression_problem_nonnumeric_data():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to parametrize one value 😛

@@ -35,11 +35,11 @@
)
from evalml.pipelines.utils import (
_get_pipeline_base_class,
_make_component_list_from_actions,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replacing tests of private method with our new method!

@@ -166,6 +166,7 @@ def _get_preprocessing_components(

def _get_pipeline_base_class(problem_type):
"""Returns pipeline base class for problem_type."""
problem_type = handle_problem_types(problem_type)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting catch. Did this not cause us problems before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope--but only because it's a private method that we use in tests 😂

Just happened to be that everywhere where we used it, we used the ProblemTypes enum so this didn't cause problems before but I noticed it since I tried using strings!

evalml/pipelines/utils.py Outdated Show resolved Hide resolved
for component in component_list:
parameters[component.name] = component.parameters
component_dict = PipelineBase._make_component_dict_from_component_list(
[component.name for component in component_list]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason we can't pass the component_list in directly here? _make_component_dict_from_component_list calls handle_component_class, so if I understand correctly it should be able to handle the list directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The slight difference, I think, is that passing in a list of components directly will create a pipeline where the component class is the key value, whereas this uses the name instead!

Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey Angela! Love this PR as I Iove when we use our own code to do things and do them intuitively, which I think this does. I am approving even though I see the description mentions the merging of the pre-processing pipeline with the standard pipelines. I am just assuming that that work got split out somewhere else! If it was something we wanted to tackle in this PR, though, I would expect some tests indicating the intended functionality of the helper function and documentation support in the user guide to show the way we intend the user to use DataCheckActions! If this is happening in a subsequent PR, cool beans!

)


@pytest.mark.parametrize("problem_type", ["binary", "multiclass", "regression"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this PR supposed to encompass the helper function to merge the two pipelines together?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I filed #2997 for this :')

@angela97lin angela97lin merged commit 6d16f96 into main Oct 31, 2021
@angela97lin angela97lin deleted the 2058_preprocessing_pipeline branch October 31, 2021 06:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add method to convert actions to a preprocessing pipeline
3 participants