-
Notifications
You must be signed in to change notification settings - Fork 92
Adds recommended actions for InvalidTargetDataCheck and update _make_component_list_from_actions to address this action #1989
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1989 +/- ##
=========================================
- Coverage 100.0% 100.0% -0.0%
=========================================
Files 282 284 +2
Lines 23004 23271 +267
=========================================
+ Hits 22995 23261 +266
- Misses 9 10 +1
Continue to review full report at Codecov.
|
… into 1881_fill_in_actions_cont
| messages = invalid_targets_check.validate(X, y) | ||
| assert messages == { | ||
|
|
||
| expected = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just cleaning up duplicate expected values :)
| "details": {"num_null_rows": 2, "pct_null_rows": 50}}],\ | ||
| "warnings": [],\ | ||
| "actions": []} | ||
| "actions": [{'code': 'IMPUTE_COL', 'details': {'column': None, 'impute_strategy': 'most_frequent', 'is_target': True}}]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wanted a way to specify that we want to impute the target without relying on the name of the column
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense!
|
|
||
|
|
||
| def _retain_custom_types_and_initalize_woodwork(old_datatable, new_dataframe, ltypes_to_ignore=None): | ||
| def _retain_custom_types_and_initalize_woodwork(old_woodwork_data, new_pandas_data, ltypes_to_ignore=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to handle DataColumns/Series
bchen1116
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! These tests are super thorough, and big fan of the cleanup! I left a few nitpicks and documentation fix comments, but nothing blocking!
evalml/pipelines/components/transformers/imputers/target_imputer.py
Outdated
Show resolved
Hide resolved
evalml/pipelines/components/transformers/imputers/target_imputer.py
Outdated
Show resolved
Hide resolved
evalml/pipelines/components/transformers/imputers/target_imputer.py
Outdated
Show resolved
Hide resolved
chukarsten
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, this was a doozy. I think the only thing that I'd be a little iffy on is what we do when the user specifies a constant imputation of one type but the column is full of data of the other type. I don't really see anything blocking, but would like to address that! Great job!
| null_rows = y_df.isnull() | ||
| if null_rows.any(): | ||
| if null_rows.all(): | ||
| results["errors"].append(DataCheckError(message="Target values are either empty or fully null.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: are "empty" and "fully null" different? If they're not I'd just go with "Target values are fully null."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, they're different in that empty refers to len(y) == 0, and fully null is len(y) != 0 but all nan values 😢
| @pytest.mark.parametrize("fill_value, y, y_expected", [(None, pd.Series([np.nan, 0, 5]), pd.Series([0, 0, 5])), | ||
| (None, pd.Series([np.nan, "a", "b"]), pd.Series(["missing_value", "a", "b"]).astype("category")), | ||
| (3, pd.Series([np.nan, 0, 5]), pd.Series([3, 0, 5])), | ||
| (3, pd.Series([np.nan, "a", "b"]), pd.Series([3, "a", "b"]).astype("category"))]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This last parametrized test case is a very interesting one. Do we want to match types? Like if the integer 3 is put in, do we want it filling with the integer 3? Or the string 3? Do we want to allow cross-type imputation? Or perhaps raise a value error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
… into 1881_fill_in_actions_cont

Closes #1881
This suddenly became a much bigger PR, so happy to split this up or explain more if it's too confusing 😅
InvalidTargetDataCheckto impute target with missing values.TargetImputercomponent that can impute target with missing values_make_component_list_from_actionsto handle new code,IMPUTE_COL_retain_custom_types_and_initalize_woodworkto handle DataColumnsInvalidTargetDataCheckto separate out when target is fully null vs has nulls with two different DataCheckMessageCodes (TARGET_HAS_NULL vs TARGET_IS_EMPTY_OR_FULLY_NULL). Only add an action when TARGET_HAS_NULLInvalidTargetDataCheck to returnTARGET_BINARY_NOT_TWO_UNIQUE_VALUES` for time series binary problems as wellInvalidTargetDataCheckto returnTARGET_BINARY_NOT_TWO_EXAMPLES_PER_CLASSfor time series multiclass as wellANGE TODO / to check: