Update AutoMLSearch to support Woodwork DataTables #1299

angela97lin · 2020-10-12T20:36:52Z

Closes #1286, closes #1142, closes #1124

… into 1286_datatables_in_automl

gsheni · 2020-10-14T22:24:13Z

evalml/utils/gen_utils.py

+    Arguments:
+        df 
+    """
+    nullable_to_numpy_mapping = {pd.Int64Dtype: 'int64'}


@angela97lin Here are the additional pandas dtypes.
https://github.com/pandas-dev/pandas/blob/0846dc1fdd8751492787f66b2e51cc1b168b5f20/pandas/__init__.py#L53

Out of these, Woodwork uses CategoricalDtype, StringDtype, and BooleanDtype.

We will use Float64Dtype once pandas 1.2.0 is released, which is the version that introduces it.

Thanks, @gsheni! This was super helpful :)

codecov · 2020-10-19T17:55:35Z

Codecov Report

Merging #1299 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1299   +/-   ##
=======================================
  Coverage   99.94%   99.94%           
=======================================
  Files         213      213           
  Lines       13385    13436   +51     
=======================================
+ Hits        13378    13429   +51     
  Misses          7        7

Impacted Files                                      Coverage Δ
evalml/automl/automl_search.py                      99.60% <100.00%> (+<0.01%) ⬆️
evalml/data_checks/invalid_targets_data_check.py    100.00% <100.00%> (ø)
evalml/tests/automl_tests/test_automl.py            100.00% <100.00%> (ø)
evalml/tests/utils_tests/test_cli_utils.py          100.00% <100.00%> (ø)
evalml/utils/__init__.py                            100.00% <100.00%> (ø)
evalml/utils/gen_utils.py                           100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2ed93c7...5ab6c9f.

freddyaboulton

@angela97lin I think this is great! Before we merge, we should handle the conversion of the Woodwork NaturalLanguage type to a pandas dtype so that we don't introduce a breaking change on the fraud dataset.

evalml/utils/gen_utils.py

evalml/automl/automl_search.py

evalml/data_checks/invalid_targets_data_check.py

core-requirements.txt

evalml/automl/automl_search.py

bchen1116

LGTM! Agree with the questions Freddy brought up.

evalml/utils/gen_utils.py

core-requirements.txt

evalml/utils/gen_utils.py

evalml/automl/automl_search.py

angela97lin · 2020-10-21T15:12:35Z

Holding merge until Woodwork is added to conda forge.

dsherry · 2020-10-21T18:20:41Z

evalml/utils/gen_utils.py

+                                 pd.StringDtype: 'object'}
+    if isinstance(pd_data, pd.api.extensions.ExtensionArray):
+        pd_data = pd.Series(pd_data)
+    if isinstance(pd_data, pd.Series) and type(pd_data.dtype) in nullable_to_numpy_mapping:


@angela97lin is there a reason to do

type(pd_data.dtype) in nullable_to_numpy_mapping

over

pd_data.dtype in nullable_to_numpy_mapping

?

There's a slight nuance. For example:

>>> pd_data.dtype
Int64Dtype()
>>> type(pd_data.dtype)
<class 'pandas.core.arrays.integer.Int64Dtype'>

So alternatively, I could update the nullable_to_numpy_mapping to hold instances (pd.Int64Dtype()) instead of classes (pd.Int64Dtype) to make this work. I don't have a preference for one over the other!
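To make the nuance above concrete, here is a minimal sketch of how the two membership checks behave. The variable names `series`, `class_keyed`, and `instance_keyed` are illustrative assumptions and do not appear in the PR.

```python
import pandas as pd

# A Series backed by the nullable integer extension dtype.
series = pd.Series([1, 2, 3], dtype="Int64")

# Mapping keyed by dtype classes, as in the diff under review.
class_keyed = {pd.Int64Dtype: "int64"}

# Alternative mapping keyed by dtype instances.
instance_keyed = {pd.Int64Dtype(): "int64"}

# The dtype attribute is an instance, so its class must be looked up
# when the mapping keys are classes.
print(type(series.dtype) in class_keyed)  # True
print(series.dtype in class_keyed)        # False: an instance is not the class
print(series.dtype in instance_keyed)     # True: instances compare equal
```

Either keying works as long as the lookup matches it. One possible consideration, offered as an assumption rather than something raised in the thread: class keys match any instance of the dtype class, whereas instance keys depend on how that dtype defines equality.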

dsherry · 2020-10-21T18:25:21Z

evalml/utils/gen_utils.py

+        for col_name, col in pd_data.iteritems():
+            if type(col.dtype) in nullable_to_numpy_mapping:
+                pd_data[col_name] = pd_data[col_name].astype(nullable_to_numpy_mapping[type(col.dtype)])
+    return pd_data


We should never reach this line, right? My suggestion is to have this raise an exception.

The way things are currently written, we end up here if we have a DataFrame, or a pd.api.extensions.ExtensionArray that doesn't use a nullable type!
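One way to reconcile the two comments above is to give each input kind its own branch and raise only for inputs that are neither a Series, an ExtensionArray, nor a DataFrame. The following is a rough sketch under that assumption, not the code that was merged. The function name `convert_nullable_to_numpy` and the constant `NULLABLE_TO_NUMPY` are hypothetical, and the mapping is copied from the diff.

```python
import pandas as pd

# Mapping copied from the diff under review; keys are dtype classes.
NULLABLE_TO_NUMPY = {
    pd.Int64Dtype: "int64",
    pd.BooleanDtype: "bool",
    pd.StringDtype: "object",
}


def convert_nullable_to_numpy(pd_data):
    """Hypothetical helper: cast nullable pandas dtypes to NumPy dtypes."""
    # Wrap a bare extension array so it can be handled like a Series.
    if isinstance(pd_data, pd.api.extensions.ExtensionArray):
        pd_data = pd.Series(pd_data)

    if isinstance(pd_data, pd.Series):
        target = NULLABLE_TO_NUMPY.get(type(pd_data.dtype))
        # Series with a non-nullable dtype pass through unchanged.
        return pd_data if target is None else pd_data.astype(target)

    if isinstance(pd_data, pd.DataFrame):
        converted = pd_data.copy()  # avoid mutating the caller's frame
        for col_name in converted.columns:
            target = NULLABLE_TO_NUMPY.get(type(converted[col_name].dtype))
            if target is not None:
                converted[col_name] = converted[col_name].astype(target)
        return converted

    # Anything else is unexpected, so fail loudly instead of returning it.
    raise TypeError(f"Unsupported input type: {type(pd_data).__name__}")
```

Note that casting an `Int64` column that contains missing values to `int64` raises a `ValueError` in pandas, so a caller would need to handle missing values before using a helper like this.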

dsherry · 2020-10-21T18:25:48Z

evalml/utils/gen_utils.py

+    """
+    nullable_to_numpy_mapping = {pd.Int64Dtype: 'int64',
+                                 pd.BooleanDtype: 'bool',
+                                 pd.StringDtype: 'object'}


So we don't need an entry for floats?

We will later, but Float64Dtype is not out yet! I think @gsheni mentioned it'll be released in pandas 1.2.0 (#1299 (comment))
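If the float entry were added before pandas 1.2.0 became the minimum supported version, one option would be to register it only when the installed pandas exposes `Float64Dtype`. This is a sketch of that idea, not something proposed in the thread; the `'float64'` target string is an assumption that follows the pattern of the existing entries.

```python
import pandas as pd

# Entries copied from the diff under review.
nullable_to_numpy_mapping = {
    pd.Int64Dtype: "int64",
    pd.BooleanDtype: "bool",
    pd.StringDtype: "object",
}

# Float64Dtype only exists from pandas 1.2.0 onward, so look it up
# defensively instead of referencing pd.Float64Dtype directly.
float64_dtype = getattr(pd, "Float64Dtype", None)
if float64_dtype is not None:
    nullable_to_numpy_mapping[float64_dtype] = "float64"
```

On older pandas versions the mapping simply keeps its three original entries.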

evalml/automl/automl_search.py

evalml/data_checks/invalid_targets_data_check.py

docs/source/release_notes.rst

evalml/tests/automl_tests/test_automl.py

dsherry

@angela97lin nice work!! Looks solid!

I left some notes, but nothing blocking, except updating the release notes.

angela97lin added 8 commits October 12, 2020 16:11

init?

0d6e557

fix typo

7b84e7e

testing

0ed259f

minor cleanup

bb6dc9f

remove one pdb

063ae26

Merge branch 'main' into 1286_datatables_in_automl

9811963

add util method

9764b26

Merge branch '1286_datatables_in_automl' of github.com:alteryx/evalml…

847b4a6

… into 1286_datatables_in_automl

gsheni reviewed Oct 14, 2020

angela97lin added 2 commits October 19, 2020 13:08

fix some tests

3897ca9

fix more tests

7be1355

angela97lin added 6 commits October 19, 2020 16:07

clean up docstr

0f44ae6

fix rename

dd4402f

Merge branch 'main' into 1286_datatables_in_automl

a46ef8d

adding test

ccb7127

Merge branch '1286_datatables_in_automl' of github.com:alteryx/evalml…

1aa7f60

… into 1286_datatables_in_automl

linting

9ec7e42

angela97lin self-assigned this Oct 19, 2020

angela97lin added this to the October 2020 milestone Oct 19, 2020

angela97lin marked this pull request as ready for review October 19, 2020 21:25

angela97lin requested review from gsheni, dsherry, freddyaboulton, eccabay, bchen1116 and jeremyliweishih October 19, 2020 21:29

update docstr

72f953a

freddyaboulton approved these changes Oct 20, 2020

bchen1116 approved these changes Oct 20, 2020

evalml/utils/gen_utils.py (resolved)

angela97lin added 2 commits October 20, 2020 13:36

clean up

d143b92

Merge branch 'main' into 1286_datatables_in_automl

a7e7f3c

gsheni reviewed Oct 20, 2020

core-requirements.txt (resolved)

evalml/utils/gen_utils.py (resolved)

evalml/utils/gen_utils.py (outdated, resolved)

angela97lin added 4 commits October 20, 2020 14:27

update to remove mapping and do check instead

48aebdc

Merge branch '1286_datatables_in_automl' of github.com:alteryx/evalml…

7028e64

… into 1286_datatables_in_automl

update tests

a806679

add logging warning

e6bd629

gsheni reviewed Oct 20, 2020

evalml/automl/automl_search.py (outdated, resolved)

fix logger warning

ac44992

angela97lin mentioned this pull request Oct 21, 2020

Integrate Woodwork DataTables into EvalML #1229

Closed

Merge branch 'main' into 1286_datatables_in_automl

6c7b33e

Merge branch 'main' into 1286_datatables_in_automl

dc655b0