Update AutoMLSearch to support WoodWork DataTables#1299
Conversation
evalml/utils/gen_utils.py
Outdated
| Arguments: | ||
| df | ||
| """ | ||
| nullable_to_numpy_mapping = {pd.Int64Dtype: 'int64'} |
There was a problem hiding this comment.
@angela97lin Here are the additional pandas dtypes.
https://github.com/pandas-dev/pandas/blob/0846dc1fdd8751492787f66b2e51cc1b168b5f20/pandas/__init__.py#L53
- Out of these, Woodwork uses CategoricalDtype, StringDtype, and BooleanDtype
- We will use Float64Dtype once the pandas releases 1.2.0 (which is when it will come out).
There was a problem hiding this comment.
Thanks, @gsheni! This was super helpful :)
Codecov Report
@@ Coverage Diff @@
## main #1299 +/- ##
=======================================
Coverage 99.94% 99.94%
=======================================
Files 213 213
Lines 13385 13436 +51
=======================================
+ Hits 13378 13429 +51
Misses 7 7
Continue to review full report at Codecov.
|
freddyaboulton
left a comment
There was a problem hiding this comment.
@angela97lin I think this is great! We should handle the conversion of the NaturalLanguage woodwork type to pandas dtype before we merge to not introduce a breaking change on the fraud dataset.
bchen1116
left a comment
There was a problem hiding this comment.
LGTM! Agree with the questions Freddy brought up.
|
Holding merge until Woodwork is added to conda forge. |
| pd.StringDtype: 'object'} | ||
| if isinstance(pd_data, pd.api.extensions.ExtensionArray): | ||
| pd_data = pd.Series(pd_data) | ||
| if isinstance(pd_data, pd.Series) and type(pd_data.dtype) in nullable_to_numpy_mapping: |
There was a problem hiding this comment.
@angela97lin is there a reason to do
type(pd_data.dtype) in nullable_to_numpy_mappingover
pd_data.dtype in nullable_to_numpy_mapping?
There was a problem hiding this comment.
There's a slight nuance. For example:
>> pd_data.dtype
Int64Dtype()
>> type(pd_data.dtype)
<class 'pandas.core.arrays.integer.Int64Dtype'>
So alternatively, I could update the nullable_to_numpy_mapping to hold instances (pd.Int64Dtype()) instead of classes (pd.Int64Dtype) to make this work. I don't have a preference for one over the other!
| for col_name, col in pd_data.iteritems(): | ||
| if type(col.dtype) in nullable_to_numpy_mapping: | ||
| pd_data[col_name] = pd_data[col_name].astype(nullable_to_numpy_mapping[type(col.dtype)]) | ||
| return pd_data |
There was a problem hiding this comment.
We should never reach this line, right? My suggestion is to have this raise an exception
There was a problem hiding this comment.
The way things are currently written, we end up here if we have a pd.api.extensions.ExtensionArray that doesn't use a nullable type or a dataframe!
| """ | ||
| nullable_to_numpy_mapping = {pd.Int64Dtype: 'int64', | ||
| pd.BooleanDtype: 'bool', | ||
| pd.StringDtype: 'object'} |
There was a problem hiding this comment.
So we don't need an entry for floats?
There was a problem hiding this comment.
We will later, but Float64Dtype is not out yet! I think @gsheni mentioned it'll be released in pandas 1.2.0 (#1299 (comment))
dsherry
left a comment
There was a problem hiding this comment.
@angela97lin nice work!! Looks solid!
I left some notes, but nothing blocking, except updating the release notes.
Closes #1286, closes #1142, closes #1124