Update components to accept Woodwork inputs #1423

Merged 58 commits into main from 1288_ww_components on Nov 19, 2020
Conversation

@angela97lin (Contributor) commented Nov 10, 2020

Continuation of work for #1288

Relevant: pandas-dev/pandas#30038

Had to tweak _convert_woodwork_types_wrapper to properly handle how NaN was converted from nullable array --> pd.Series/DataFrame.

Ultimately, I updated it so that functionality is preserved as it would be for plain pandas Series/DataFrames.

A column with booleans + nulls using nullable types will be inferred as the 'boolean' type. However, this causes a problem when converting to the 'bool' type, since NaN values are not allowed and the cast will error out. Converting to bool by handling the values manually would turn NaN into True, which is undesirable. Hence, we convert to object type instead (and let SimpleImputer fill in the missing values!)

import numpy as np
import pandas as pd

s = pd.Series([True, np.nan, False, np.nan, True], dtype='bool')
# NaN is coerced to True: [True, True, False, True, True]
s = pd.Series([True, np.nan, False, np.nan, True])
# dtype is object; the NaN values are left in place
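For illustration, here is a minimal sketch of the fallback described above; the helper name is hypothetical and this is not the actual _convert_woodwork_types_wrapper implementation:

import pandas as pd

def _nullable_boolean_to_object(series):
    # pandas' nullable 'boolean' dtype can't be cast to plain 'bool'
    # when missing values are present (astype(bool) raises), so fall
    # back to object dtype and let SimpleImputer fill in the values.
    if isinstance(series.dtype, pd.BooleanDtype):
        return series.astype(object)
    return series

s = pd.Series([True, pd.NA, False], dtype="boolean")
print(_nullable_boolean_to_object(s).dtype)  # object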

https://stackoverflow.com/questions/36825925/expressions-with-true-and-is-true-give-different-results

https://app.circleci.com/pipelines/github/alteryx/evalml/7300/workflows/0d609f82-10ba-4cf8-84b2-01e9ecf7864f/jobs/81615

@angela97lin added this to the November 2020 milestone Nov 10, 2020
@angela97lin self-assigned this Nov 10, 2020

codecov bot commented Nov 10, 2020

Codecov Report

Merging #1423 (c00809f) into main (db658b9) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #1423     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         220      220             
  Lines       14582    14672     +90     
=========================================
+ Hits        14575    14665     +90     
  Misses          7        7             
Impacted Files Coverage Δ
evalml/pipelines/classification_pipeline.py 100.0% <ø> (ø)
...components/ensemble/stacked_ensemble_classifier.py 100.0% <ø> (ø)
evalml/pipelines/regression_pipeline.py 100.0% <ø> (ø)
evalml/tests/component_tests/test_lsa.py 100.0% <ø> (ø)
evalml/utils/__init__.py 100.0% <ø> (ø)
evalml/pipelines/components/component_base.py 100.0% <100.0%> (ø)
...lines/components/ensemble/stacked_ensemble_base.py 100.0% <100.0%> (ø)
...ents/estimators/classifiers/baseline_classifier.py 100.0% <100.0%> (ø)
...ents/estimators/classifiers/catboost_classifier.py 100.0% <100.0%> (ø)
...ents/estimators/classifiers/lightgbm_classifier.py 100.0% <100.0%> (ø)
... and 24 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@freddyaboulton (Contributor) left a comment

@angela97lin I think this looks good! My main comment is that in all components we follow the formula data -> woodwork -> pandas, even though in a lot of cases we don't need to go through woodwork for the component to work as expected.

I realize this may be a premature performance optimization, but I've seen the pandas-to-woodwork conversion take a while, and it would kill users to have to do that same conversion in every component if they passed in pandas data. I think it would be best to do the woodwork conversion on an "as-needed" basis. Thoughts?

Also, I don't think our component unit tests pass in ww data. Is that coming in a separate PR?
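As a rough illustration of the "as-needed" idea (a hypothetical sketch against the Woodwork DataTable API referenced in this thread, not code from this PR):

import woodwork as ww

def _maybe_to_woodwork(X):
    # Only pay the pandas-to-Woodwork conversion cost when the input
    # is not already a Woodwork structure.
    if isinstance(X, ww.DataTable):
        return X
    return ww.DataTable(X)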

On this code:

X_t = X
if isinstance(X, np.ndarray):
    return X
if isinstance(X, ww.DataTable):
Review comment (Contributor):

ww has a rename method on DataTables. I think that would be better than converting to pandas, renaming, then converting back to ww.

Also, as this method is written now, would the types passed in by the user be lost? Just wondering.

Reply from @angela97lin (Contributor, Author):

Ah, the rename method isn't in the latest release yet, so this will have to do for now :(

But yes, I think you're right--updating to preserve types, since that's important!
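For reference, a rough sketch of the rename-and-preserve-types workaround being discussed, assuming the older Woodwork DataTable API referenced above (the helper name is illustrative):

import woodwork as ww

def _rename_preserving_types(dt, name_map):
    # Convert to pandas, rename, then rebuild the DataTable while
    # carrying the original logical types over so they aren't lost.
    df = dt.to_dataframe().rename(columns=name_map)
    logical_types = {name_map.get(col, col): ltype
                     for col, ltype in dt.logical_types.items()}
    return ww.DataTable(df, logical_types=logical_types)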

(Resolved review thread on evalml/pipelines/components/component_base.py)
@@ -77,6 +77,6 @@ def test_datetime_featurizer_no_datetime_cols():

 def test_datetime_featurizer_numpy_array_input():
     datetime_transformer = DateTimeFeaturizer()
-    X = np.array(['2007-02-03', '2016-06-07', '2020-05-19'], dtype='datetime64')
+    X = np.array([['2007-02-03'], ['2016-06-07'], ['2020-05-19']], dtype='datetime64')
Review comment (Contributor):

Why does this need to be 2d now?

Reply from @angela97lin (Contributor, Author):

Previously, we just had code that wrapped this in a DataFrame. Now, the conversion code checks the dimensionality to determine whether to create a Series or a DataFrame (since the same conversion code is used for both), so a 1D numpy array would have been converted to a Series instead, which isn't what we want.
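A minimal sketch of the dimensionality check being described (illustrative only, not the exact conversion code in this PR):

import numpy as np
import pandas as pd

def _numpy_to_pandas(arr):
    # 1D arrays become a Series; 2D arrays become a DataFrame.
    if arr.ndim == 1:
        return pd.Series(arr)
    return pd.DataFrame(arr)

_numpy_to_pandas(np.array(['2007-02-03'], dtype='datetime64'))    # -> Series
_numpy_to_pandas(np.array([['2007-02-03']], dtype='datetime64'))  # -> DataFrame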

@@ -260,6 +267,10 @@ def test_more_top_n_unique_values_large():
    encoder = OneHotEncoder(top_n=3, random_state=random_seed)
    encoder.fit(X)
    X_t = encoder.transform(X)

    # Conversion changes the resulting dataframe dtype, resulting in a different random state, so we need to make the conversion here too
Review comment (Contributor):

What does converting to woodwork have to do with the random state? 😨

Reply from @angela97lin (Contributor, Author):

@freddyaboulton I believe the conversion to woodwork can update some dtypes (object --> category) which then results in differences when we sample. 😓 Check this out:
[screenshot: the dtypes differ after converting to Woodwork]
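For context, a small sketch of the kind of dtype change being described, assuming the older Woodwork DataTable API referenced in this thread:

import pandas as pd
import woodwork as ww

df = pd.DataFrame({"col": ["a", "b", "a", "c"]})
print(df["col"].dtype)                 # object
dt = ww.DataTable(df)
print(dt.to_dataframe()["col"].dtype)  # category, since Woodwork infers a Categorical logical type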

@bchen1116 (Contributor) left a comment

Things look good to me! I made a few comments for clarifications, but generally agree that if a conversion is unnecessary, we should leave it out.

@dsherry (Contributor) left a comment

@angela97lin this rocks!!! It's so cool that we've made it and have woodwork fully integrated into our components, pipelines and automl 🎊

I left a few questions, including some on the gen_utils data transformation methods (_convert_woodwork_types_wrapper etc.). The only other thing I had is that we should make sure we have good test coverage of those methods, trying them with each of the various input types. I remember we added some coverage along these lines, but I'm blanking on the specifics, so if you're able to respond to that, that would be great :)

Otherwise, looking ready to 🚢💨!

(Resolved review threads on evalml/pipelines/classification_pipeline.py, evalml/pipelines/components/component_base.py, and evalml/utils/gen_utils.py)
@angela97lin (Contributor, Author) commented:

@freddyaboulton You're right that there are certainly some situations where the conversion to Woodwork structures doesn't add value here, but I think it's fine to add for consistency: it allows automl to pass Woodwork structures down to the components without worrying about the data structure. Currently we're in a weird intermediate state where that's not true, but once the work in #1289 is done to pass Woodwork data structures directly to the pipelines (which pass them to the components), the conversion code will basically be a no-op! Having the conversion handled in these util methods also lets us get rid of the if isinstance(X, pd.DataFrame) (or the Woodwork equivalent) checks, because it's handled in a centralized place :)

I see your concern with the conversion though, and it's difficult to fully see how time consuming or how much more memory this takes. If this gets out of hand, we can revisit and optimize? 😁
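To illustrate the "centralized place" idea (a hypothetical sketch; the helper below is not the actual evalml util, and it uses the older Woodwork DataTable API referenced in this thread):

import numpy as np
import pandas as pd
import woodwork as ww

def _standardize_input(X):
    # Accept numpy, pandas, or Woodwork input and always hand back a
    # pandas DataFrame, so individual components don't need their own
    # isinstance checks. Once automl passes Woodwork structures directly
    # (#1289), the Woodwork branch becomes the common path.
    if isinstance(X, ww.DataTable):
        return X.to_dataframe()
    if isinstance(X, np.ndarray):
        return pd.DataFrame(X)
    return X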

@dsherry (Contributor) commented Nov 19, 2020

> I see your concern with the conversion though, and it's difficult to fully see how time consuming or how much more memory this takes. If this gets out of hand, we can revisit and optimize? 😁

Yep that sounds good to me. If we can get #1289 done quickly, this won't matter, because automl search will handle the conversion to woodwork types first before the pipelines/components do their thing. And if not, and we do see problems with automl resulting from this, we can do another release quickly once #1289 is merged.

One thing to check before merging is that the unit tests don't take significantly longer to run than on main. If they do, that could be evidence of a problem. This is unlikely, though, because we don't use large datasets in our tests.

Given that we're talking about performance, @angela97lin how would you feel about running some perf tests on this? I think it would be fine to merge this and test after the fact, since we know this is an intermediate step along the way, but it would help us decide when to release.

@angela97lin (Contributor, Author) commented:

@dsherry Yup, that sounds good! I'll merge this in first, and if we don't have time to merge in the automl change before the next release, I'll run some perf tests to make sure the conversions aren't taking too long.

@angela97lin merged commit f54abd3 into main Nov 19, 2020
@angela97lin deleted the 1288_ww_components branch November 19, 2020 01:12
@dsherry mentioned this pull request Nov 24, 2020