-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support custom woodwork types in components #1784
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1784 +/- ##
=========================================
+ Coverage 100.0% 100.0% +0.1%
=========================================
Files 252 255 +3
Lines 20250 20564 +314
=========================================
+ Hits 20242 20556 +314
Misses 8 8
Continue to review full report at Codecov.
|
…yx/evalml into 1662_custom_woodwork_in_components
@@ -572,3 +572,5 @@ General Utils | |||
drop_rows_with_nans | |||
infer_feature_types | |||
save_plot | |||
is_all_numeric |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated change but adding to API ref :)
@@ -17,8 +17,6 @@ | |||
|
|||
logger = get_logger(__file__) | |||
|
|||
numeric_and_boolean_ww = [ww.logical_types.Integer, ww.logical_types.Double, ww.logical_types.Boolean] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved all Woodwork related utils to woodwork_utils.py
:)
@@ -117,7 +117,7 @@ def test_simple_imputer_all_bool_return_original(data_type, make_data_type): | |||
|
|||
@pytest.mark.parametrize("data_type", ['pd', 'ww']) | |||
def test_simple_imputer_bool_dtype_object(data_type, make_data_type): | |||
X = pd.DataFrame([True, np.nan, False, np.nan, True], dtype=object) | |||
X = pd.DataFrame([True, np.nan, False, np.nan, True], dtype='boolean') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This update (and many of the others made in the tests) was made because now we try to convert the columns back to the original types. Before, the original dtype was specified as object, so the column would have been stored as a categorical col and we would have expected X_expected_arr to return a categorical col. We need to explicitly set to boolean for the expected outcome to be boolean.
I think these conversions make sense over previous behavior--user input is boolean, so output is boolean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed! Glad we don't have to do this anymore.
@@ -34,17 +35,17 @@ def transform(self, X, y=None): | |||
Returns: | |||
ww.DataTable: Transformed X | |||
""" | |||
X_ww = _convert_to_woodwork_structure(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to move these out of the try/except because I think it's more clear what the try/except is for. If it errors out while doing Woodwork conversions (as it did while I was testing stuff), it shouldn't throw a MethodPropertyNotFoundError
:)
@@ -271,153 +268,6 @@ def test_rename_column_names_to_numeric(): | |||
assert X_renamed.logical_types == {0: ww.logical_types.Categorical, 1: ww.logical_types.Categorical} | |||
|
|||
|
|||
def test_convert_woodwork_types_wrapper_with_nan(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to test_woodwork_utils.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@angela97lin Looks good to me! I have some questions on the unit tests but nothing blocking.
evalml/pipelines/components/transformers/imputers/simple_imputer.py
Outdated
Show resolved
Hide resolved
@@ -117,7 +117,7 @@ def test_simple_imputer_all_bool_return_original(data_type, make_data_type): | |||
|
|||
@pytest.mark.parametrize("data_type", ['pd', 'ww']) | |||
def test_simple_imputer_bool_dtype_object(data_type, make_data_type): | |||
X = pd.DataFrame([True, np.nan, False, np.nan, True], dtype=object) | |||
X = pd.DataFrame([True, np.nan, False, np.nan, True], dtype='boolean') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed! Glad we don't have to do this anymore.
transformed = dft.transform(X, y) | ||
assert isinstance(transformed, ww.DataTable) | ||
if logical_type == Integer or logical_type == Double or logical_type == Categorical: | ||
assert transformed.logical_types == {0: logical_type, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should modify the delayed feature transformer to carry over the ints. I'd be happy to file an issue if you are on board!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that sounds good! I'm guessing shift
just returns floats which is why we end up with Doubles right now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh man. Great work getting through all this and testing it all...what a feat of endurance. Just a few questions I had here and there. If you want to get back to me on them, just ping me on Slack so I can go check it out, but nothing to really stop merge.
X = _convert_to_woodwork_structure(X) | ||
X = infer_feature_types(X) | ||
if y is not None: | ||
y = _convert_to_woodwork_structure(y) | ||
y = infer_feature_types(y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's my boy right there! I DREAM about this code chunk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, I've seen this so often I swear I do too 😆
X_ww = infer_feature_types(X) | ||
X = _convert_woodwork_types_wrapper(X_ww.to_dataframe()) | ||
if y is not None: | ||
y = infer_feature_types(y) | ||
y = _convert_woodwork_types_wrapper(y.to_series()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There it is <3
X = _convert_to_woodwork_structure(X) | ||
cat_cols = list(X.select('category').columns) | ||
X = infer_feature_types(X) | ||
cat_cols = list(X.select(['category', 'boolean']).columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What prompted this? This seems like a fine change...but just curious. This comment is also a bookmark so I can go make sure this is tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can double check we're testing this but IIRC, it's so that we treat boolean columns like categorical cols. We don't want to pass it to our numeric_imputer which might try to use "mean" as the impute strategy and end up with nonsense for the imputed values.
evalml/pipelines/components/transformers/imputers/simple_imputer.py
Outdated
Show resolved
Hide resolved
evalml/pipelines/components/transformers/preprocessing/delayed_feature_transformer.py
Show resolved
Hide resolved
evalml/pipelines/components/transformers/preprocessing/drop_null_columns.py
Show resolved
Hide resolved
try: | ||
X = ww.DataTable(X_df, logical_types={0: logical_type}) | ||
except TypeError: | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm tired, but what is the intent behind this chunk here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah so basically, I'm trying every possible logical type conversion that's allowed (see desc of PR for allowed vs error), but rather than manually keeping track of all possible conversions, I'm just trying. Ex: Int --> Nat Lang is valid conversion and therefore a valid override a user might provide to EvalML for us to test. Nat Lang --> Int might not always be (and will raise a TypeError). If its not, don't try to pass to EvalML to test.
Closes #1662. Another large-ish PR but not too much logic change--just replacing the final line to convert to WW to new util method, and writing a lot of tests. I initially wanted to iterate through all the components and test like the pattern we already have in
test_components.py
but realized that each component has its own nuanced behavior, so decided against it and added the tests to their individual files. The parameterization is similar in many cases though. 😬woodwork_utils.py
and move tests totest_woodwork_utils.py
_convert_to_woodwork_structure
withinfer_feature_types
by combining the two implementations.Notes on compatible and incompatible types. Mostly for my own understanding, but also helped inform test cases.
Datetime
Categorical
Integer
Double
**Natural Language **
Natural Language --> Datetime: errors
Natural Language --> Double: works, sometimes
Natural Language --> Categorical: works
Natural Language --> Integer: works, sometimes
There's no good way of confirming when it is safe to convert floats to category or when it is not without trying. For example: can convert ['1', '2'] (categories) to float, but not ['a', 'b']. Hence, current implementation will try to set the type. If it fails, then don't do anything. If it succeeds, add to logical type dictionary to pass along to the initialization of Woodwork DataTable, so that Woodwork does not have to do type inference, saving us some time.
Look into: featuretools component which changed the columns of the original input DF?