Add method to convert actions to a preprocessing pipeline #2968

Merged
angela97lin merged 11 commits into main from 2058_preprocessing_pipeline
Oct 31, 2021

Conversation

@angela97lin
Contributor

Closes #2058

@angela97lin angela97lin self-assigned this Oct 27, 2021
@codecov

codecov bot commented Oct 27, 2021

Codecov Report

Merging #2968 (e194778) into main (b95f40b) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2968     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        307     307             
  Lines      29265   29283     +18     
=======================================
+ Hits       29174   29192     +18     
  Misses        91      91             
| Impacted Files | Coverage Δ |
|---|---|
| evalml/automl/utils.py | 100.0% <ø> (ø) |
| ...ransformers/preprocessing/drop_rows_transformer.py | 100.0% <100.0%> (ø) |
| evalml/pipelines/utils.py | 99.5% <100.0%> (+0.1%) ⬆️ |
| ...ests/component_tests/test_drop_rows_transformer.py | 100.0% <100.0%> (ø) |
| ...ta_checks_tests/test_invalid_targets_data_check.py | 100.0% <100.0%> (ø) |
| evalml/tests/pipeline_tests/test_pipeline_utils.py | 99.7% <100.0%> (+0.1%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

  raise ValueError("All input indices must be unique.")
  self.indices_to_drop = indices_to_drop
- super().__init__(parameters=None, component_obj=None, random_seed=random_seed)
+ parameters = {"indices_to_drop": self.indices_to_drop}
Contributor Author

Adding indices to parameters so they can be accessed when creating a pipeline :)
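
For readers following along, here is a minimal sketch of the pattern described in this comment: a component records its constructor arguments in a `parameters` dict so that pipeline-building code can read them back later. `SketchComponent` and `SketchDropRows` are hypothetical stand-ins written for illustration, not evalml's actual base class or `DropRowsTransformer`.

```python
class SketchComponent:
    """Hypothetical stand-in for a component base class (not evalml's)."""

    name = "Sketch Component"

    def __init__(self, parameters=None, random_seed=0):
        # Copy so later changes to the caller's dict do not leak in.
        self.parameters = dict(parameters or {})
        self.random_seed = random_seed


class SketchDropRows(SketchComponent):
    """Hypothetical stand-in mirroring the shape of the change in this PR."""

    name = "Drop Rows Transformer"

    def __init__(self, indices_to_drop=None, random_seed=0):
        if indices_to_drop is not None and len(set(indices_to_drop)) != len(
            indices_to_drop
        ):
            raise ValueError("All input indices must be unique.")
        self.indices_to_drop = indices_to_drop
        # Passing the indices through `parameters` exposes them to any code
        # that later assembles a pipeline from component instances.
        parameters = {"indices_to_drop": self.indices_to_drop}
        super().__init__(parameters=parameters, random_seed=random_seed)


component = SketchDropRows(indices_to_drop=[0, 3])
# A pipeline-level parameters dict can now be built from the instance alone.
pipeline_parameters = {component.name: component.parameters}
```

With `parameters=None`, as in the removed line, `component.parameters` would carry no trace of the indices, so nothing downstream could recover them.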

  elif action.action_code == DataCheckActionCode.DROP_ROWS:
- indices = action.metadata["indices"]
- components.append(DropRowsTransformer(indices_to_drop=indices))
+ indices_to_drop.extend(action.metadata["indices"])
Contributor Author

Some cleanup here: updating code to just return one Drop Rows Transformer, similar to Drop Columns.
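
A rough sketch of the "collect first, emit one step" idea from this comment follows. `ActionCode`, `Action`, and `collect_drop_steps` are hypothetical names, and the `"column"` metadata key and the sort and de-duplicate step are assumptions made for the example; only the `"indices"` key appears in the diff above.

```python
from dataclasses import dataclass, field
from enum import Enum


class ActionCode(Enum):
    """Hypothetical stand-in for DataCheckActionCode."""

    DROP_COL = "drop_col"
    DROP_ROWS = "drop_rows"


@dataclass
class Action:
    """Hypothetical stand-in for a data check action."""

    action_code: ActionCode
    metadata: dict = field(default_factory=dict)


def collect_drop_steps(actions):
    """Return at most one (step_name, values) pair per kind of drop action."""
    cols_to_drop, indices_to_drop = [], []
    for action in actions:
        if action.action_code == ActionCode.DROP_COL:
            cols_to_drop.append(action.metadata["column"])  # assumed key
        elif action.action_code == ActionCode.DROP_ROWS:
            indices_to_drop.extend(action.metadata["indices"])
    steps = []
    if cols_to_drop:
        steps.append(("drop_columns", sorted(set(cols_to_drop))))
    if indices_to_drop:
        # Assumption: de-duplicate, since two checks may flag the same row.
        steps.append(("drop_rows", sorted(set(indices_to_drop))))
    return steps


actions = [
    Action(ActionCode.DROP_ROWS, {"indices": [4, 1]}),
    Action(ActionCode.DROP_COL, {"column": "id"}),
    Action(ActionCode.DROP_ROWS, {"indices": [1, 7]}),
]
steps = collect_drop_steps(actions)
```

Two separate drop-rows actions end up as a single step, which is the behavior the comment compares to Drop Columns.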


- @pytest.mark.parametrize("problem_type", ["regression"])
- def test_invalid_target_data_check_regression_problem_nonnumeric_data(problem_type):
+ def test_invalid_target_data_check_regression_problem_nonnumeric_data():
Contributor Author

No need to parametrize one value 😛

)
from evalml.pipelines.utils import (
_get_pipeline_base_class,
_make_component_list_from_actions,
Contributor Author

Replacing tests of private method with our new method!

@angela97lin angela97lin marked this pull request as ready for review October 28, 2021 02:47

  def _get_pipeline_base_class(problem_type):
  """Returns pipeline base class for problem_type."""
+ problem_type = handle_problem_types(problem_type)
Contributor

Interesting catch. Did this not cause us problems before?

Contributor Author

Nope--but only because it's a private method that we use in tests 😂

It just so happened that everywhere we used it, we passed the ProblemTypes enum, so this never caused problems before. I only noticed it because I tried using strings!
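
To make the failure mode in this thread concrete, here is a small sketch of why normalizing first matters. `SketchProblemTypes`, `normalize_problem_type`, and `pick_base` are hypothetical stand-ins; they are not evalml's `ProblemTypes` or `handle_problem_types`, and the returned labels are placeholders.

```python
from enum import Enum


class SketchProblemTypes(Enum):
    """Hypothetical stand-in for a problem type enum."""

    BINARY = "binary"
    MULTICLASS = "multiclass"
    REGRESSION = "regression"


def normalize_problem_type(problem_type):
    """Accept an enum member or its string value; return the enum member."""
    if isinstance(problem_type, SketchProblemTypes):
        return problem_type
    if isinstance(problem_type, str):
        try:
            return SketchProblemTypes(problem_type.lower())
        except ValueError:
            raise ValueError(f"Unknown problem type: {problem_type!r}") from None
    raise TypeError("problem_type must be a string or SketchProblemTypes member")


def pick_base(problem_type, normalize=True):
    """Return a placeholder label for the base class matching problem_type."""
    if normalize:
        problem_type = normalize_problem_type(problem_type)
    # A plain string never compares equal to an enum member, so without
    # normalization every string input falls through to the last branch.
    if problem_type == SketchProblemTypes.REGRESSION:
        return "regression-base"
    if problem_type == SketchProblemTypes.BINARY:
        return "binary-base"
    return "multiclass-base"


with_fix = pick_base("regression")
without_fix = pick_base("regression", normalize=False)
```

Callers that always pass enum members never see the difference, which matches the explanation given in the reply above.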

for component in component_list:
parameters[component.name] = component.parameters
component_dict = PipelineBase._make_component_dict_from_component_list(
[component.name for component in component_list]
Contributor

Is there a specific reason we can't pass the component_list in directly here? _make_component_dict_from_component_list calls handle_component_class, so if I understand correctly it should be able to handle the list directly.

Contributor Author

The slight difference, I think, is that passing in a list of components directly will create a pipeline where the component class is the key, whereas this uses the name instead!
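
As a generic illustration of the distinction drawn here (it does not reproduce what `_make_component_dict_from_component_list` actually does internally), the sketch below shows why keying by name lines up with a parameters dict that is also keyed by `component.name`. `DropRowsSketch` is a hypothetical class.

```python
class DropRowsSketch:
    """Hypothetical component with a human-readable name."""

    name = "Drop Rows Transformer"

    def __init__(self, indices_to_drop=None):
        self.parameters = {"indices_to_drop": indices_to_drop}


component_list = [DropRowsSketch(indices_to_drop=[2, 5])]

# Parameters are keyed by name, as in the loop shown in the diff above.
parameters = {component.name: component.parameters for component in component_list}

# Two possible ways to key the component mapping:
keyed_by_name = {component.name: type(component) for component in component_list}
keyed_by_class = {type(component): type(component) for component in component_list}

# Only the name-keyed mapping shares its keys with `parameters`.
name_keys_match = all(key in parameters for key in keyed_by_name)
class_keys_match = any(key in parameters for key in keyed_by_class)
```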

Contributor

@chukarsten chukarsten left a comment

Hey Angela! Love this PR, as I love when we use our own code to do things and do them intuitively, which I think this does. I am approving even though I see the description mentions merging the preprocessing pipeline with the standard pipelines; I am assuming that work got split out somewhere else. If it was something we wanted to tackle in this PR, though, I would expect some tests indicating the intended functionality of the helper function, plus documentation in the user guide showing how we intend users to work with DataCheckActions. If this is happening in a subsequent PR, cool beans!

)


@pytest.mark.parametrize("problem_type", ["binary", "multiclass", "regression"])
Contributor

Is this PR supposed to encompass the helper function to merge the two pipelines together?

Contributor Author

Good catch! I filed #2997 for this :')

@angela97lin angela97lin merged commit 6d16f96 into main Oct 31, 2021
@angela97lin angela97lin deleted the 2058_preprocessing_pipeline branch October 31, 2021 06:48

Development

Successfully merging this pull request may close these issues.

Add method to convert actions to a preprocessing pipeline

3 participants