Update components to accept Woodwork inputs #1423

Merged 58 commits into main from 1288_ww_components on Nov 19, 2020
Conversation

@angela97lin (Contributor) commented Nov 10, 2020

Continuation of work for #1288

Relevant: pandas-dev/pandas#30038

Had to tweak _convert_woodwork_types_wrapper to properly handle how NaN was converted from nullable array --> pd.Series/DataFrame.

Ultimately, I updated it so that functionality is preserved as it would be for plain pandas Series/DataFrames.

A column with booleans + nulls using nullable types will be inferred as the 'boolean' type. However, this causes a problem when converting to the 'bool' type, since NaN values are not allowed and the cast will error out. Converting to bool by handling the values manually would turn NaN into True, which is undesirable. Hence, we convert to object type instead (and let SimpleImputer fill in the missing values!)

import numpy as np
import pandas as pd

s = pd.Series([True, np.nan, False, np.nan, True], dtype='bool')
# NaN is coerced to True: [True, True, False, True, True]
s = pd.Series([True, np.nan, False, np.nan, True])
# dtype is object; the NaN values are left in place
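For illustration, here is a minimal sketch of the fallback described above; the helper name is hypothetical and this is not the actual _convert_woodwork_types_wrapper implementation:

import pandas as pd

def _nullable_boolean_to_object(series):
    # pandas' nullable 'boolean' dtype can't be cast to plain 'bool'
    # when missing values are present (astype(bool) raises), so fall
    # back to object dtype and let SimpleImputer fill in the values.
    if isinstance(series.dtype, pd.BooleanDtype):
        return series.astype(object)
    return series

s = pd.Series([True, pd.NA, False], dtype="boolean")
print(_nullable_boolean_to_object(s).dtype)  # object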

https://stackoverflow.com/questions/36825925/expressions-with-true-and-is-true-give-different-results

https://app.circleci.com/pipelines/github/alteryx/evalml/7300/workflows/0d609f82-10ba-4cf8-84b2-01e9ecf7864f/jobs/81615

@angela97lin added this to the November 2020 milestone Nov 10, 2020
@angela97lin self-assigned this Nov 10, 2020

codecov bot commented Nov 10, 2020

Codecov Report

Merging #1423 (c00809f) into main (db658b9) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #1423     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         220      220             
  Lines       14582    14672     +90     
=========================================
+ Hits        14575    14665     +90     
  Misses          7        7             
Impacted Files Coverage Δ
evalml/pipelines/classification_pipeline.py 100.0% <ø> (ø)
...components/ensemble/stacked_ensemble_classifier.py 100.0% <ø> (ø)
evalml/pipelines/regression_pipeline.py 100.0% <ø> (ø)
evalml/tests/component_tests/test_lsa.py 100.0% <ø> (ø)
evalml/utils/__init__.py 100.0% <ø> (ø)
evalml/pipelines/components/component_base.py 100.0% <100.0%> (ø)
...lines/components/ensemble/stacked_ensemble_base.py 100.0% <100.0%> (ø)
...ents/estimators/classifiers/baseline_classifier.py 100.0% <100.0%> (ø)
...ents/estimators/classifiers/catboost_classifier.py 100.0% <100.0%> (ø)
...ents/estimators/classifiers/lightgbm_classifier.py 100.0% <100.0%> (ø)
... and 24 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@freddyaboulton (Contributor) left a comment

@angela97lin I think this looks good! My main comment is that in all components we follow the formula data -> woodwork -> pandas, even though in a lot of cases we don't need to go through woodwork for the component to work as expected.

I realize this may be a premature performance optimization, but I've seen the pandas-to-woodwork conversion take a while, and it would kill users to have to do that same conversion in every component if they passed in pandas data. I think it would be best to do the woodwork conversion on an "as-needed" basis. Thoughts?

Also, I don't think our component unit tests pass in ww data. Is that coming in a separate PR?
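As a rough illustration of the "as-needed" idea (a hypothetical sketch against the Woodwork DataTable API referenced in this thread, not code from this PR):

import woodwork as ww

def _maybe_to_woodwork(X):
    # Only pay the pandas-to-Woodwork conversion cost when the input
    # is not already a Woodwork structure.
    if isinstance(X, ww.DataTable):
        return X
    return ww.DataTable(X)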

On this code:

X_t = X
if isinstance(X, np.ndarray):
    return X
if isinstance(X, ww.DataTable):
Review comment (Contributor):

ww has a rename method on DataTables. I think that would be better than converting to pandas, renaming, then converting back to ww.

Also, as this method is written now, would the types passed in by the user be lost? Just wondering.

Reply from @angela97lin (Contributor, Author):

Ah, the rename method isn't in the latest release yet, so this will have to do for now :(

But yes, I think you're right--updating to preserve types, since that's important!
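For reference, a rough sketch of the rename-and-preserve-types workaround being discussed, assuming the older Woodwork DataTable API referenced above (the helper name is illustrative):

import woodwork as ww

def _rename_preserving_types(dt, name_map):
    # Convert to pandas, rename, then rebuild the DataTable while
    # carrying the original logical types over so they aren't lost.
    df = dt.to_dataframe().rename(columns=name_map)
    logical_types = {name_map.get(col, col): ltype
                     for col, ltype in dt.logical_types.items()}
    return ww.DataTable(df, logical_types=logical_types)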

(Resolved review thread on evalml/pipelines/components/component_base.py)
@@ -77,6 +77,6 @@ def test_datetime_featurizer_no_datetime_cols():

 def test_datetime_featurizer_numpy_array_input():
     datetime_transformer = DateTimeFeaturizer()
-    X = np.array(['2007-02-03', '2016-06-07', '2020-05-19'], dtype='datetime64')
+    X = np.array([['2007-02-03'], ['2016-06-07'], ['2020-05-19']], dtype='datetime64')
Review comment (Contributor):

Why does this need to be 2d now?

Reply from @angela97lin (Contributor, Author):

Previously, we just had code that wrapped this in a DataFrame. Now, the conversion code checks the dimensionality to determine whether to create a Series or a DataFrame (since the same conversion code is used for both), so a 1D numpy array would have been converted to a Series instead, which isn't what we want.
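A minimal sketch of the dimensionality check being described (illustrative only, not the exact conversion code in this PR):

import numpy as np
import pandas as pd

def _numpy_to_pandas(arr):
    # 1D arrays become a Series; 2D arrays become a DataFrame.
    if arr.ndim == 1:
        return pd.Series(arr)
    return pd.DataFrame(arr)

_numpy_to_pandas(np.array(['2007-02-03'], dtype='datetime64'))    # -> Series
_numpy_to_pandas(np.array([['2007-02-03']], dtype='datetime64'))  # -> DataFrame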

@@ -260,6 +267,10 @@ def test_more_top_n_unique_values_large():
    encoder = OneHotEncoder(top_n=3, random_state=random_seed)
    encoder.fit(X)
    X_t = encoder.transform(X)

    # Conversion changes the resulting dataframe dtype, resulting in a different random state, so we need to make the conversion here too
Review comment (Contributor):

What does converting to woodwork have to do with the random state? 😨

Reply from @angela97lin (Contributor, Author):

@freddyaboulton I believe the conversion to woodwork can update some dtypes (object --> category) which then results in differences when we sample. 😓 Check this out:
[screenshot: the dtypes differ after converting to Woodwork]
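For context, a small sketch of the kind of dtype change being described, assuming the older Woodwork DataTable API referenced in this thread:

import pandas as pd
import woodwork as ww

df = pd.DataFrame({"col": ["a", "b", "a", "c"]})
print(df["col"].dtype)                 # object
dt = ww.DataTable(df)
print(dt.to_dataframe()["col"].dtype)  # category, since Woodwork infers a Categorical logical type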

@bchen1116 (Contributor) left a comment

Things look good to me! I made a few comments for clarifications, but generally agree that if a conversion is unnecessary, we should leave it out.

@dsherry (Contributor) left a comment

@angela97lin this rocks!!! It's so cool that we've made it and have woodwork fully integrated into our components, pipelines and automl 🎊

I left a few questions, including some on the gen_utils data transformation methods (_convert_woodwork_types_wrapper etc.). The only other thing I had is that we should make sure we have good test coverage of those methods, trying them with each of the various input types. I remember we added some coverage along these lines, but I'm blanking on the specifics, so if you're able to respond to that, that would be great :)

Otherwise, looking ready to 🚢💨!

(Resolved review threads on evalml/pipelines/classification_pipeline.py, evalml/pipelines/components/component_base.py, and evalml/utils/gen_utils.py)
@angela97lin (Contributor, Author) commented:

@freddyaboulton You're right that there are certainly some situations where the conversion to Woodwork structures doesn't add value here, but I think it's fine to add for consistency: it allows automl to pass Woodwork structures down to the components without worrying about the data structure. Currently we're in a weird intermediate state where that's not true, but once the work in #1289 is done to pass Woodwork data structures directly to the pipelines (which pass them to the components), the conversion code will basically be a no-op! Having the conversion handled in these util methods also lets us get rid of the if isinstance(X, pd.DataFrame) (or the Woodwork equivalent) checks, because it's handled in a centralized place :)

I see your concern with the conversion though, and it's difficult to fully see how time consuming or how much more memory this takes. If this gets out of hand, we can revisit and optimize? 😁
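To illustrate the "centralized place" idea (a hypothetical sketch; the helper below is not the actual evalml util, and it uses the older Woodwork DataTable API referenced in this thread):

import numpy as np
import pandas as pd
import woodwork as ww

def _standardize_input(X):
    # Accept numpy, pandas, or Woodwork input and always hand back a
    # pandas DataFrame, so individual components don't need their own
    # isinstance checks. Once automl passes Woodwork structures directly
    # (#1289), the Woodwork branch becomes the common path.
    if isinstance(X, ww.DataTable):
        return X.to_dataframe()
    if isinstance(X, np.ndarray):
        return pd.DataFrame(X)
    return X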

@dsherry (Contributor) commented Nov 19, 2020

> I see your concern with the conversion though, and it's difficult to fully see how time consuming or how much more memory this takes. If this gets out of hand, we can revisit and optimize? 😁

Yep that sounds good to me. If we can get #1289 done quickly, this won't matter, because automl search will handle the conversion to woodwork types first before the pipelines/components do their thing. And if not, and we do see problems with automl resulting from this, we can do another release quickly once #1289 is merged.

One thing to check before merging is that the unit tests don't take significantly longer to run than on main. If they do, that could be evidence of a problem. This is unlikely, though, because we don't use large datasets in our tests.

Given that we're talking about performance, @angela97lin how would you feel about running some perf tests on this? I think it would be fine to merge this and test after the fact, since we know this is an intermediate step along the way, but it would help us decide when to release.

@angela97lin (Contributor, Author) commented:

@dsherry Yup, that sounds good! I'll merge this in first, and if we don't have time to merge in the automl change before the next release, I'll run some perf tests to make sure the conversions aren't taking too long.

@angela97lin merged commit f54abd3 into main Nov 19, 2020
@angela97lin deleted the 1288_ww_components branch November 19, 2020 01:12
@dsherry mentioned this pull request Nov 24, 2020