-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update AutoMLSearch to support WoodWork DataTables #1299
Conversation
evalml/utils/gen_utils.py
Outdated
Arguments: | ||
df | ||
""" | ||
nullable_to_numpy_mapping = {pd.Int64Dtype: 'int64'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@angela97lin Here are the additional pandas dtypes.
https://github.com/pandas-dev/pandas/blob/0846dc1fdd8751492787f66b2e51cc1b168b5f20/pandas/__init__.py#L53
- Out of these, Woodwork uses CategoricalDtype, StringDtype, and BooleanDtype
- We will use Float64Dtype once the pandas releases 1.2.0 (which is when it will come out).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @gsheni! This was super helpful :)
Codecov Report
@@ Coverage Diff @@
## main #1299 +/- ##
=======================================
Coverage 99.94% 99.94%
=======================================
Files 213 213
Lines 13385 13436 +51
=======================================
+ Hits 13378 13429 +51
Misses 7 7
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@angela97lin I think this is great! We should handle the conversion of the NaturalLanguage
woodwork type to pandas dtype before we merge to not introduce a breaking change on the fraud dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Agree with the questions Freddy brought up.
Holding merge until Woodwork is added to conda forge. |
pd.StringDtype: 'object'} | ||
if isinstance(pd_data, pd.api.extensions.ExtensionArray): | ||
pd_data = pd.Series(pd_data) | ||
if isinstance(pd_data, pd.Series) and type(pd_data.dtype) in nullable_to_numpy_mapping: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@angela97lin is there a reason to do
type(pd_data.dtype) in nullable_to_numpy_mapping
over
pd_data.dtype in nullable_to_numpy_mapping
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a slight nuance. For example:
>> pd_data.dtype
Int64Dtype()
>> type(pd_data.dtype)
<class 'pandas.core.arrays.integer.Int64Dtype'>
So alternatively, I could update the nullable_to_numpy_mapping
to hold instances (pd.Int64Dtype()
) instead of classes (pd.Int64Dtype
) to make this work. I don't have a preference for one over the other!
for col_name, col in pd_data.iteritems(): | ||
if type(col.dtype) in nullable_to_numpy_mapping: | ||
pd_data[col_name] = pd_data[col_name].astype(nullable_to_numpy_mapping[type(col.dtype)]) | ||
return pd_data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should never reach this line, right? My suggestion is to have this raise an exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way things are currently written, we end up here if we have a pd.api.extensions.ExtensionArray that doesn't use a nullable type or a dataframe!
""" | ||
nullable_to_numpy_mapping = {pd.Int64Dtype: 'int64', | ||
pd.BooleanDtype: 'bool', | ||
pd.StringDtype: 'object'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we don't need an entry for floats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will later, but Float64Dtype is not out yet! I think @gsheni mentioned it'll be released in pandas 1.2.0 (#1299 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@angela97lin nice work!! Looks solid!
I left some notes, but nothing blocking, except updating the release notes.
Closes #1286, closes #1142, closes #1124