Adds recommended actions for InvalidTargetDataCheck and update _make_component_list_from_actions to address this action #1989

Merged
merged 47 commits into from
Mar 31, 2021

Conversation

angela97lin
Contributor

@angela97lin angela97lin commented Mar 16, 2021

Closes #1881

This suddenly became a much bigger PR, so happy to split this up or explain more if it's too confusing 😅

  • Add actions to InvalidTargetDataCheck to impute target with missing values.
  • Add TargetImputer component that can impute target with missing values
  • Update _make_component_list_from_actions to handle new code, IMPUTE_COL
  • Update _retain_custom_types_and_initalize_woodwork to handle DataColumns
  • Update InvalidTargetDataCheck to separate out when target is fully null vs has nulls with two different DataCheckMessageCodes (TARGET_HAS_NULL vs TARGET_IS_EMPTY_OR_FULLY_NULL). Only add an action when TARGET_HAS_NULL
    • For the fully null or empty case, we return early rather than letting the other checks run. I think this makes more sense than also returning other warnings (ex: not having two unique values), since the null target is the immediate issue.
  • Cleanup: updated InvalidTargetDataCheck to return TARGET_BINARY_NOT_TWO_UNIQUE_VALUES for time series binary problems as well
  • Cleanup: updated InvalidTargetDataCheck to return TARGET_MULTICLASS_NOT_TWO_EXAMPLES_PER_CLASS for time series multiclass as well
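
The null-handling flow described in the list above can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the evalml implementation: `check_target_nulls` and `is_null` are hypothetical names, plain dicts stand in for `DataCheckError` and `DataCheckAction`, and only the message codes, the error message, and the action payload are taken from this PR.

```python
import math


def is_null(value):
    """Treat None and float NaN as missing (assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def check_target_nulls(y):
    """Hypothetical stand-in for the null handling in InvalidTargetDataCheck.validate."""
    results = {"errors": [], "warnings": [], "actions": []}
    null_flags = [is_null(v) for v in y]

    # Empty or fully null target: report it and return early,
    # so no other target checks add noise and no action is suggested.
    if all(null_flags):
        results["errors"].append({
            "message": "Target values are either empty or fully null.",
            "code": "TARGET_IS_EMPTY_OR_FULLY_NULL",
        })
        return results

    # Some (but not all) values missing: report it and recommend imputation.
    if any(null_flags):
        num_null = sum(null_flags)
        results["errors"].append({
            "code": "TARGET_HAS_NULL",
            "details": {"num_null_rows": num_null,
                        "pct_null_rows": num_null / len(y) * 100},
        })
        results["actions"].append({
            "code": "IMPUTE_COL",
            "details": {"column": None,
                        "impute_strategy": "most_frequent",
                        "is_target": True},
        })
    return results


partly_null_result = check_target_nulls([None, 1, None, 0])
fully_null_result = check_target_nulls([float("nan"), None])
```

Only the `TARGET_HAS_NULL` branch attaches an action, since there is nothing to impute from when every value is missing.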

ANGE TODO / to check:

  • TargetImputer in pipelines, then fit and score. Make sure no errors.

@angela97lin angela97lin self-assigned this Mar 16, 2021
@angela97lin angela97lin changed the title from "Adds recommended actions for InvalidTargetDataCheck and update _make_component_from_actions to address this action" to "Adds recommended actions for InvalidTargetDataCheck and update _make_component_list_from_actions to address this action" Mar 17, 2021
@codecov

codecov bot commented Mar 18, 2021

Codecov Report

Merging #1989 (f0cdcbd) into main (c335c4e) will decrease coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #1989     +/-   ##
=========================================
- Coverage   100.0%   100.0%   -0.0%     
=========================================
  Files         282      284      +2     
  Lines       23004    23271    +267     
=========================================
+ Hits        22995    23261    +266     
- Misses          9       10      +1     
Impacted Files Coverage Δ
evalml/pipelines/components/__init__.py 100.0% <ø> (ø)
evalml/data_checks/data_check_action_code.py 100.0% <100.0%> (ø)
evalml/data_checks/data_check_message_code.py 100.0% <100.0%> (ø)
evalml/data_checks/invalid_targets_data_check.py 100.0% <100.0%> (ø)
...alml/pipelines/components/transformers/__init__.py 100.0% <100.0%> (ø)
...lines/components/transformers/imputers/__init__.py 100.0% <100.0%> (ø)
...components/transformers/imputers/target_imputer.py 100.0% <100.0%> (ø)
evalml/pipelines/utils.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_components.py 100.0% <100.0%> (ø)
...valml/tests/component_tests/test_simple_imputer.py 100.0% <100.0%> (ø)
... and 10 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@angela97lin angela97lin marked this pull request as ready for review March 21, 2021 04:00
messages = invalid_targets_check.validate(X, y)
assert messages == {

expected = {
Contributor Author

Just cleaning up duplicate expected values :)

@@ -57,7 +59,7 @@ def validate(self, X, y):
"code": "TARGET_HAS_NULL",\
"details": {"num_null_rows": 2, "pct_null_rows": 50}}],\
"warnings": [],\
- "actions": []}
+ "actions": [{'code': 'IMPUTE_COL', 'details': {'column': None, 'impute_strategy': 'most_frequent', 'is_target': True}}]}
Contributor Author

Wanted a way to specify that we want to impute the target without relying on the name of the column

Contributor

Makes sense!
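
To illustrate why the `is_target` flag is useful, here is a rough sketch of how an action list could be turned into components without knowing the target's column name. It is an assumption-laden stand-in, not the real `_make_component_list_from_actions`: `TargetImputerStub` and `components_from_actions` are hypothetical names, targets are plain lists with `None` for missing values, and only the `most_frequent` strategy is handled.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TargetImputerStub:
    """Hypothetical stand-in for the TargetImputer component added in this PR."""
    impute_strategy: str = "most_frequent"

    def transform(self, y):
        if self.impute_strategy != "most_frequent":
            raise NotImplementedError(self.impute_strategy)
        # Assumes at least one observed value; a fully null target gets no action.
        observed = [v for v in y if v is not None]
        fill = Counter(observed).most_common(1)[0][0]
        return [fill if v is None else v for v in y]


def components_from_actions(actions):
    """Map IMPUTE_COL actions flagged with is_target to a target imputer."""
    components = []
    for action in actions:
        details = action.get("details", {})
        # Dispatch on the is_target flag, not on the column name (which is None here).
        if action["code"] == "IMPUTE_COL" and details.get("is_target"):
            components.append(TargetImputerStub(details["impute_strategy"]))
        # Other action codes are out of scope for this sketch.
    return components


actions = [{"code": "IMPUTE_COL",
            "details": {"column": None,
                        "impute_strategy": "most_frequent",
                        "is_target": True}}]
components = components_from_actions(actions)
imputed = components[0].transform([1, None, 1, 0])
```

Because the action carries `is_target`, the consumer never has to guess whether an unnamed or renamed series is the target.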

@@ -69,37 +69,45 @@ def _convert_woodwork_types_wrapper(pd_data):
return pd_data


- def _retain_custom_types_and_initalize_woodwork(old_datatable, new_dataframe, ltypes_to_ignore=None):
+ def _retain_custom_types_and_initalize_woodwork(old_woodwork_data, new_pandas_data, ltypes_to_ignore=None):
Contributor Author

Updated to handle DataColumns/Series

Contributor

@bchen1116 bchen1116 left a comment

Nice! These tests are super thorough, and big fan of the cleanup! I left a few nitpicks and documentation fix comments, but nothing blocking!

@angela97lin angela97lin dismissed freddyaboulton’s stale review March 25, 2021 20:05

Addressed all changes :)

@CLAassistant

CLAassistant commented Mar 26, 2021

CLA assistant check
All committers have signed the CLA.

Contributor

@chukarsten chukarsten left a comment

Indeed, this was a doozy. I think the only thing that I'd be a little iffy on is what we do when the user specifies a constant imputation of one type but the column is full of data of the other type. I don't really see anything blocking, but would like to address that! Great job!

@@ -82,18 +90,27 @@ def validate(self, X, y):
details={"unsupported_type": y.logical_type.type_string}).to_dict())
y_df = _convert_woodwork_types_wrapper(y.to_series())
null_rows = y_df.isnull()
if null_rows.any():
if null_rows.all():
results["errors"].append(DataCheckError(message="Target values are either empty or fully null.",
Contributor

Nit: are "empty" and "fully null" different? If they're not I'd just go with "Target values are fully null."

Contributor Author

Yeah, they're different in that empty refers to len(y) == 0, and fully null is len(y) != 0 but all nan values 😢
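
A small illustration of the distinction, using plain Python lists as a stand-in for the target series (`null_mask` and `classify_target` are hypothetical helpers, not evalml functions). It also shows why a single `.all()` test can catch both cases: `all()` over an empty sequence is vacuously true, and to my understanding pandas' `Series.all()` behaves the same way on an empty Series.

```python
import math


def null_mask(y):
    """Flag None and float NaN entries as missing."""
    return [v is None or (isinstance(v, float) and math.isnan(v)) for v in y]


def classify_target(y):
    """Separate the 'empty' case (len(y) == 0) from the 'fully null' case."""
    mask = null_mask(y)
    if len(y) == 0:
        return "empty"
    if all(mask):
        return "fully null"
    return "has nulls" if any(mask) else "no nulls"


# all() over an empty mask is vacuously True, so one all() test covers both cases.
both_caught_by_all = all(null_mask([])) and all(null_mask([float("nan"), None]))
```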

@pytest.mark.parametrize("fill_value, y, y_expected", [(None, pd.Series([np.nan, 0, 5]), pd.Series([0, 0, 5])),
(None, pd.Series([np.nan, "a", "b"]), pd.Series(["missing_value", "a", "b"]).astype("category")),
(3, pd.Series([np.nan, 0, 5]), pd.Series([3, 0, 5])),
(3, pd.Series([np.nan, "a", "b"]), pd.Series([3, "a", "b"]).astype("category"))])
Contributor

This last parametrized test case is a very interesting one. Do we want to match types? Like if the integer 3 is put in, do we want it filling with the integer 3? Or the string 3? Do we want to allow cross-type imputation? Or perhaps raise a value error.

Contributor Author

Yeah, that's a good question! This follows the behavior in SimpleImputer / Imputer right now, but I think it's okay because the type of the series is category, which allows for mixed-type categories:

image

Happy to file an issue if you think this is worth a greater discussion though!
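
For discussion purposes, the two options raised in this thread can be sketched side by side. This is a stdlib-only illustration on plain lists, not evalml or scikit-learn code: `impute_constant` is a hypothetical helper, and its `strict` flag is an invented parameter that models the "raise a value error" alternative. The default fill values (`0` for numeric targets, `"missing_value"` otherwise) mirror the parametrized test cases above.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def impute_constant(y, fill_value=None, strict=False):
    """Fill missing entries with a constant; optionally reject cross-type fills."""
    observed = [v for v in y if not is_missing(v)]
    numeric_target = all(is_number(v) for v in observed)

    if fill_value is None:
        # Defaults mirror the test cases: 0 for numeric, "missing_value" otherwise.
        fill_value = 0 if numeric_target else "missing_value"
    elif strict and is_number(fill_value) != numeric_target:
        # Hypothetical stricter behavior: refuse to mix types.
        raise ValueError("fill_value type does not match the target's values")

    return [fill_value if is_missing(v) else v for v in y]


nan = float("nan")
permissive = impute_constant([nan, "a", "b"], fill_value=3)  # mixed-type result
```

With `strict=False` the result mixes an integer with strings, which matches the current permissive behavior described above; `strict=True` would surface the mismatch to the user instead.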

@angela97lin angela97lin merged commit 2f46b6a into main Mar 31, 2021
@angela97lin angela97lin deleted the 1881_fill_in_actions_cont branch March 31, 2021 16:30
@chukarsten chukarsten mentioned this pull request Apr 6, 2021
Development

Successfully merging this pull request may close these issues.

Include recommended actions for each data check's set of actions
5 participants