Support custom woodwork types in components#1784
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1784 +/- ##
=========================================
+ Coverage 100.0% 100.0% +0.1%
=========================================
Files 252 255 +3
Lines 20250 20564 +314
=========================================
+ Hits 20242 20556 +314
Misses 8 8
Continue to review full report at Codecov.
|
…yx/evalml into 1662_custom_woodwork_in_components
| drop_rows_with_nans | ||
| infer_feature_types | ||
| save_plot | ||
| is_all_numeric |
There was a problem hiding this comment.
Unrelated change but adding to API ref :)
|
|
||
| logger = get_logger(__file__) | ||
|
|
||
| numeric_and_boolean_ww = [ww.logical_types.Integer, ww.logical_types.Double, ww.logical_types.Boolean] |
There was a problem hiding this comment.
Moved all Woodwork related utils to woodwork_utils.py :)
| @pytest.mark.parametrize("data_type", ['pd', 'ww']) | ||
| def test_simple_imputer_bool_dtype_object(data_type, make_data_type): | ||
| X = pd.DataFrame([True, np.nan, False, np.nan, True], dtype=object) | ||
| X = pd.DataFrame([True, np.nan, False, np.nan, True], dtype='boolean') |
There was a problem hiding this comment.
This update (and many of the others made in the tests) was made because now we try to convert the columns back to the original types. Before, the original dtype was specified as object, so the column would have been stored as a categorical col and we would have expected X_expected_arr to return a categorical col. We need to explicitly set to boolean for the expected outcome to be boolean.
I think these conversions make sense over previous behavior--user input is boolean, so output is boolean.
There was a problem hiding this comment.
Agreed! Glad we don't have to do this anymore.
| Returns: | ||
| ww.DataTable: Transformed X | ||
| """ | ||
| X_ww = _convert_to_woodwork_structure(X) |
There was a problem hiding this comment.
I decided to move these out of the try/except because I think it's more clear what the try/except is for. If it errors out while doing Woodwork conversions (as it did while I was testing stuff), it shouldn't throw a MethodPropertyNotFoundError :)
| assert X_renamed.logical_types == {0: ww.logical_types.Categorical, 1: ww.logical_types.Categorical} | ||
|
|
||
|
|
||
| def test_convert_woodwork_types_wrapper_with_nan(): |
There was a problem hiding this comment.
Moved to test_woodwork_utils.py
freddyaboulton
left a comment
There was a problem hiding this comment.
@angela97lin Looks good to me! I have some questions on the unit tests but nothing blocking.
| @pytest.mark.parametrize("data_type", ['pd', 'ww']) | ||
| def test_simple_imputer_bool_dtype_object(data_type, make_data_type): | ||
| X = pd.DataFrame([True, np.nan, False, np.nan, True], dtype=object) | ||
| X = pd.DataFrame([True, np.nan, False, np.nan, True], dtype='boolean') |
There was a problem hiding this comment.
Agreed! Glad we don't have to do this anymore.
| transformed = dft.transform(X, y) | ||
| assert isinstance(transformed, ww.DataTable) | ||
| if logical_type == Integer or logical_type == Double or logical_type == Categorical: | ||
| assert transformed.logical_types == {0: logical_type, |
There was a problem hiding this comment.
I think we should modify the delayed feature transformer to carry over the ints. I'd be happy to file an issue if you are on board!
There was a problem hiding this comment.
Sure, that sounds good! I'm guessing shift just returns floats which is why we end up with Doubles right now?
chukarsten
left a comment
There was a problem hiding this comment.
Oh man. Great work getting through all this and testing it all...what a feat of endurance. Just a few questions I had here and there. If you want to get back to me on them, just ping me on Slack so I can go check it out, but nothing to really stop merge.
| X = _convert_to_woodwork_structure(X) | ||
| X = infer_feature_types(X) | ||
| if y is not None: | ||
| y = _convert_to_woodwork_structure(y) | ||
| y = infer_feature_types(y) |
There was a problem hiding this comment.
That's my boy right there! I DREAM about this code chunk.
There was a problem hiding this comment.
Honestly, I've seen this so often I swear I do too 😆
| X_ww = infer_feature_types(X) | ||
| X = _convert_woodwork_types_wrapper(X_ww.to_dataframe()) | ||
| if y is not None: | ||
| y = infer_feature_types(y) | ||
| y = _convert_woodwork_types_wrapper(y.to_series()) |
| X = _convert_to_woodwork_structure(X) | ||
| cat_cols = list(X.select('category').columns) | ||
| X = infer_feature_types(X) | ||
| cat_cols = list(X.select(['category', 'boolean']).columns) |
There was a problem hiding this comment.
What prompted this? This seems like a fine change...but just curious. This comment is also a bookmark so I can go make sure this is tested.
There was a problem hiding this comment.
I can double check we're testing this but IIRC, it's so that we treat boolean columns like categorical cols. We don't want to pass it to our numeric_imputer which might try to use "mean" as the impute strategy and end up with nonsense for the imputed values.
| try: | ||
| X = ww.DataTable(X_df, logical_types={0: logical_type}) | ||
| except TypeError: | ||
| continue |
There was a problem hiding this comment.
I think I'm tired, but what is the intent behind this chunk here?
There was a problem hiding this comment.
Ah so basically, I'm trying every possible logical type conversion that's allowed (see desc of PR for allowed vs error), but rather than manually keeping track of all possible conversions, I'm just trying. Ex: Int --> Nat Lang is valid conversion and therefore a valid override a user might provide to EvalML for us to test. Nat Lang --> Int might not always be (and will raise a TypeError). If its not, don't try to pass to EvalML to test.
Closes #1662. Another large-ish PR but not too much logic change--just replacing the final line to convert to WW to new util method, and writing a lot of tests. I initially wanted to iterate through all the components and test like the pattern we already have in
test_components.pybut realized that each component has its own nuanced behavior, so decided against it and added the tests to their individual files. The parameterization is similar in many cases though. 😬woodwork_utils.pyand move tests totest_woodwork_utils.py_convert_to_woodwork_structurewithinfer_feature_typesby combining the two implementations.Notes on compatible and incompatible types. Mostly for my own understanding, but also helped inform test cases.
Datetime
Categorical
Integer
Double
**Natural Language **
Natural Language --> Datetime: errors
Natural Language --> Double: works, sometimes
Natural Language --> Categorical: works
Natural Language --> Integer: works, sometimes
There's no good way of confirming when it is safe to convert floats to category or when it is not without trying. For example: can convert ['1', '2'] (categories) to float, but not ['a', 'b']. Hence, current implementation will try to set the type. If it fails, then don't do anything. If it succeeds, add to logical type dictionary to pass along to the initialization of Woodwork DataTable, so that Woodwork does not have to do type inference, saving us some time.
Look into: featuretools component which changed the columns of the original input DF?