Update AutoMLSearch to support WoodWork DataTables #1299

Merged
merged 31 commits into from
Oct 21, 2020

@angela97lin angela97lin commented Oct 12, 2020

Closes #1286, closes #1142, closes #1124

Arguments:
df
"""
nullable_to_numpy_mapping = {pd.Int64Dtype: 'int64'}

@gsheni gsheni Oct 14, 2020

@angela97lin Here are the additional pandas dtypes.
https://github.com/pandas-dev/pandas/blob/0846dc1fdd8751492787f66b2e51cc1b168b5f20/pandas/__init__.py#L53

  • Out of these, Woodwork uses CategoricalDtype, StringDtype, and BooleanDtype
  • We will use Float64Dtype once pandas releases 1.2.0, which is when that dtype becomes available.

Contributor Author

Thanks, @gsheni! This was super helpful :)

codecov bot commented Oct 19, 2020

Codecov Report

Merging #1299 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1299   +/-   ##
=======================================
  Coverage   99.94%   99.94%           
=======================================
  Files         213      213           
  Lines       13385    13436   +51     
=======================================
+ Hits        13378    13429   +51     
  Misses          7        7           

Impacted Files                                      Coverage Δ
evalml/automl/automl_search.py                      99.60% <100.00%> (+<0.01%) ⬆️
evalml/data_checks/invalid_targets_data_check.py    100.00% <100.00%> (ø)
evalml/tests/automl_tests/test_automl.py            100.00% <100.00%> (ø)
evalml/tests/utils_tests/test_cli_utils.py          100.00% <100.00%> (ø)
evalml/utils/__init__.py                            100.00% <100.00%> (ø)
evalml/utils/gen_utils.py                           100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2ed93c7...5ab6c9f.

@angela97lin angela97lin self-assigned this Oct 19, 2020
@angela97lin angela97lin added this to the October 2020 milestone Oct 19, 2020
@angela97lin angela97lin marked this pull request as ready for review October 19, 2020 21:25

@freddyaboulton freddyaboulton left a comment

@angela97lin I think this is great! Before we merge, we should handle the conversion of the Woodwork NaturalLanguage type to a pandas dtype, so that we don't introduce a breaking change on the fraud dataset.

@bchen1116 bchen1116 left a comment

LGTM! Agree with the questions Freddy brought up.

@angela97lin
Contributor Author

Holding merge until Woodwork is added to conda forge.

                                 pd.StringDtype: 'object'}
    if isinstance(pd_data, pd.api.extensions.ExtensionArray):
        pd_data = pd.Series(pd_data)
    if isinstance(pd_data, pd.Series) and type(pd_data.dtype) in nullable_to_numpy_mapping:

Contributor


@angela97lin is there a reason to do

type(pd_data.dtype) in nullable_to_numpy_mapping

over

pd_data.dtype in nullable_to_numpy_mapping

?

@angela97lin angela97lin Oct 21, 2020


There's a slight nuance. For example:

>>> pd_data.dtype
Int64Dtype()
>>> type(pd_data.dtype)
<class 'pandas.core.arrays.integer.Int64Dtype'>

So alternatively, I could update the nullable_to_numpy_mapping to hold instances (pd.Int64Dtype()) instead of classes (pd.Int64Dtype) to make this work. I don't have a preference for one over the other!
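
The difference between the two membership checks can be verified directly. This is a minimal sketch, assuming a pandas version that provides `pd.Int64Dtype`; the names `class_keyed` and `instance_keyed` are illustrative and do not appear in the PR.

```python
import pandas as pd

pd_data = pd.Series([1, 2, 3], dtype="Int64")

# Mapping keyed by dtype classes, as in the diff under review.
class_keyed = {pd.Int64Dtype: "int64"}
# Hypothetical alternative keyed by dtype instances.
instance_keyed = {pd.Int64Dtype(): "int64"}

# pd_data.dtype is an instance of Int64Dtype, so it only matches the
# class-keyed mapping after type() is applied to it.
class_lookup_with_type = type(pd_data.dtype) in class_keyed
class_lookup_without_type = pd_data.dtype in class_keyed
instance_lookup = pd_data.dtype in instance_keyed

print(class_lookup_with_type, class_lookup_without_type, instance_lookup)
```

Either keying works as long as the lookup uses the matching form; mixing an instance lookup with class keys silently skips the conversion.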

        for col_name, col in pd_data.iteritems():
            if type(col.dtype) in nullable_to_numpy_mapping:
                pd_data[col_name] = pd_data[col_name].astype(nullable_to_numpy_mapping[type(col.dtype)])
    return pd_data

Contributor


We should never reach this line, right? My suggestion is to have this raise an exception

Contributor Author


The way things are currently written, we end up here in two cases: when the input is a dataframe, or when it is a pd.api.extensions.ExtensionArray that doesn't use a nullable type!
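
The control flow described in this thread can be sketched end to end. This is an approximation assembled from the diff fragments quoted in the conversation, not the merged implementation: the function name `convert_nullable_dtypes` is hypothetical, the body of the Series branch is an assumption (the thread does not show it), and `.items()` stands in for the `.iteritems()` call in the diff because newer pandas releases removed `iteritems`.

```python
import pandas as pd

NULLABLE_TO_NUMPY = {pd.Int64Dtype: "int64",
                     pd.BooleanDtype: "bool",
                     pd.StringDtype: "object"}


def convert_nullable_dtypes(pd_data):
    """Hypothetical stand-in for the helper under review."""
    if isinstance(pd_data, pd.api.extensions.ExtensionArray):
        pd_data = pd.Series(pd_data)
    if isinstance(pd_data, pd.Series) and type(pd_data.dtype) in NULLABLE_TO_NUMPY:
        # Assumption: the Series branch converts with astype and returns early.
        return pd_data.astype(NULLABLE_TO_NUMPY[type(pd_data.dtype)])
    if isinstance(pd_data, pd.DataFrame):
        for col_name, col in pd_data.items():
            if type(col.dtype) in NULLABLE_TO_NUMPY:
                pd_data[col_name] = col.astype(NULLABLE_TO_NUMPY[type(col.dtype)])
    # Reached by DataFrames and by inputs that carry no nullable dtype.
    return pd_data


df = pd.DataFrame({"a": pd.array([1, 2], dtype="Int64"),
                   "b": pd.array([True, False], dtype="boolean"),
                   "c": [0.5, 1.5]})
converted = convert_nullable_dtypes(df)

# A categorical ExtensionArray has no entry in the mapping, so it falls
# through to the final return as an unchanged-dtype Series.
categorical = convert_nullable_dtypes(pd.Categorical(["x", "y"]))
```

The final `return` is therefore a legitimate exit rather than dead code, which is why raising an exception there would break the DataFrame path.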

"""
    nullable_to_numpy_mapping = {pd.Int64Dtype: 'int64',
                                 pd.BooleanDtype: 'bool',
                                 pd.StringDtype: 'object'}

Contributor


So we don't need an entry for floats?

Contributor Author


We will later, but Float64Dtype is not out yet! I think @gsheni mentioned it'll be released in pandas 1.2.0 (#1299 (comment))
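
One way the float entry could be added later without breaking older pandas installs is to register it only when the class exists. This is a sketch under assumptions, not something the PR does: the `'float64'` target string and the `getattr` guard are both illustrative choices.

```python
import pandas as pd

nullable_to_numpy_mapping = {pd.Int64Dtype: "int64",
                             pd.BooleanDtype: "bool",
                             pd.StringDtype: "object"}

# Assumption: Float64Dtype would map to 'float64'; the thread does not say.
# getattr returns None on pandas versions that predate Float64Dtype.
float64_dtype = getattr(pd, "Float64Dtype", None)
if float64_dtype is not None:
    nullable_to_numpy_mapping[float64_dtype] = "float64"
```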

@dsherry dsherry left a comment


@angela97lin nice work!! Looks solid!

I left some notes, but nothing blocking, except updating the release notes.

@angela97lin angela97lin merged commit 1e4307a into main Oct 21, 2020
@angela97lin angela97lin deleted the 1286_datatables_in_automl branch October 21, 2020 19:58
@dsherry dsherry mentioned this pull request Oct 29, 2020