Update components to accept Woodwork inputs #1423
Conversation
Codecov Report
```
@@           Coverage Diff            @@
##            main    #1423     +/-   ##
=========================================
+ Coverage   100.0%   100.0%    +0.1%
=========================================
  Files         220      220
  Lines       14582    14672      +90
=========================================
+ Hits        14575    14665      +90
  Misses          7        7
```
Continue to review full report at Codecov.
@angela97lin I think this looks good! My main comment is that in all components we follow the formula data -> woodwork -> pandas, when in a lot of cases we don't need to go through woodwork at all for the component to work as expected.

I realize this may be a premature performance optimization, but I've seen the pandas-to-woodwork conversion take a while, and it would kill users to have to repeat that conversion in every component if they passed in pandas data. I think it would be best to do the woodwork conversion on an "as-needed" basis (see the sketch after this comment). Thoughts?

Also, I don't think our component unit tests pass in ww data. Is that coming in a separate PR?
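For illustration only, a rough sketch of what an "as-needed" conversion could look like; the helper name `_ensure_dataframe` is made up here, and the real evalml utils may be organized differently:

```python
import pandas as pd
import woodwork as ww


def _ensure_dataframe(X):
    """Hypothetical helper: get a pandas DataFrame without a forced woodwork round-trip.

    Components that only need pandas can use the input directly when it is already
    a DataFrame, and only call .to_dataframe() when handed a ww.DataTable, instead
    of always doing data -> woodwork -> pandas.
    """
    if isinstance(X, ww.DataTable):
        return X.to_dataframe()
    if isinstance(X, pd.DataFrame):
        # Skip the potentially slow pandas -> woodwork -> pandas detour.
        return X
    # numpy arrays, lists, etc.
    return pd.DataFrame(X)
```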
```python
X_t = X
if isinstance(X, np.ndarray):
    return X
if isinstance(X, ww.DataTable):
```
ww has a rename method on DataTables. I think that would be better than converting to pandas, renaming, then converting back to ww.
Also, as this method is written now, would the types passed in by the user be lost? Just wondering.
Ah, the `rename` method isn't in the latest woodwork release yet, so this will have to do for now :(

But yes, I think you're right--updating to preserve types, since that's important!
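For reference, a rough sketch of the interim workaround, assuming the older woodwork API where `DataTable` exposes `logical_types` and `to_dataframe()` (the helper name is illustrative, not the exact PR code):

```python
import woodwork as ww


def _rename_preserving_types(X_dt, name_map):
    """Illustrative workaround while DataTable.rename isn't in a released woodwork version.

    Round-trips through pandas to rename, then rebuilds the DataTable with the
    original logical types so user-specified types aren't silently dropped.
    """
    original_types = X_dt.logical_types  # assumed: {column_name: LogicalType}
    renamed_df = X_dt.to_dataframe().rename(columns=name_map)
    renamed_types = {name_map.get(col, col): ltype for col, ltype in original_types.items()}
    return ww.DataTable(renamed_df, logical_types=renamed_types)
```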
```diff
@@ -77,6 +77,6 @@ def test_datetime_featurizer_no_datetime_cols():


 def test_datetime_featurizer_numpy_array_input():
     datetime_transformer = DateTimeFeaturizer()
-    X = np.array(['2007-02-03', '2016-06-07', '2020-05-19'], dtype='datetime64')
+    X = np.array([['2007-02-03'], ['2016-06-07'], ['2020-05-19']], dtype='datetime64')
```
Why does this need to be 2d now?
Previously, we just had code that wrapped this in a DataFrame. Now, the conversion code checks the dimensionality of the input to determine whether to create a Series or a DataFrame (since the same conversion code is used for both), so a 1D numpy array would have been converted to a Series instead, which isn't what we want.
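Roughly, the dimensionality check looks something like this (an illustrative sketch, not the exact evalml helper):

```python
import numpy as np
import pandas as pd


def _to_pandas(data):
    """Illustrative: choose Series vs DataFrame based on input dimensionality."""
    arr = np.asarray(data)
    if arr.ndim == 1:
        # A 1D input (like the old 1D datetime array in the test) becomes a Series.
        return pd.Series(arr)
    # 2D inputs become a DataFrame, which is what DateTimeFeaturizer expects,
    # hence the test input is now shaped (3, 1).
    return pd.DataFrame(arr)
```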
```diff
@@ -260,6 +267,10 @@ def test_more_top_n_unique_values_large():
     encoder = OneHotEncoder(top_n=3, random_state=random_seed)
     encoder.fit(X)
     X_t = encoder.transform(X)
+
+    # Conversion changes the resulting dataframe dtype, resulting in a different random state, so we need to make the conversion here too
```
What does converting to woodwork have to do with the random state?
@freddyaboulton I believe the conversion to woodwork can update some dtypes (object --> category) which then results in differences when we sample.
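A quick illustration of that dtype change, assuming the older woodwork behavior where object columns are inferred as Categorical:

```python
import pandas as pd
import woodwork as ww

df = pd.DataFrame({"col": ["a", "b", "a", "c"]})
print(df["col"].dtype)  # object

# Woodwork may infer the column as Categorical, so the round-trip back to pandas
# yields a 'category' dtype. Sampling on a category column can then consume the
# random state differently than sampling on the original object column would.
dt = ww.DataTable(df)
print(dt.to_dataframe()["col"].dtype)  # category (under the assumed inference)
```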
Things look good to me! I made a few comments for clarifications, but generally agree that if a conversion is unnecessary, we should leave it out.
@angela97lin this rocks!!! It's so cool that we've made it and have woodwork fully integrated into our components, pipelines and automl!

I left a few questions, including some on the gen_utils data transformation methods (`_convert_woodwork_types_wrapper`, etc.). The only other thing I had is that we should make sure we have good test coverage of those methods, trying them with each of the various different types as input (a sketch of what that could look like is below). I remember we added some coverage along these lines but I'm blanking on the specifics, so if you're able to respond to that, that would be great :)

Otherwise, looking ready to go!
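For example, coverage along these lines (a sketch only; the import path and helper name are assumptions, and the real tests may already be structured differently):

```python
import numpy as np
import pandas as pd
import pytest
import woodwork as ww

# Assumed location of the conversion helper; adjust to wherever it actually lives.
from evalml.utils.gen_utils import _convert_to_woodwork_structure


@pytest.mark.parametrize("data", [
    [1, 2, 3],                                      # plain list
    np.array([1, 2, 3]),                            # 1D numpy array
    np.array([[1], [2], [3]]),                      # 2D numpy array
    pd.Series([1, 2, 3]),                           # pandas Series
    pd.DataFrame({"a": [1, 2, 3]}),                 # pandas DataFrame
    ww.DataTable(pd.DataFrame({"a": [1, 2, 3]})),   # already a woodwork DataTable
])
def test_conversion_handles_each_input_type(data):
    converted = _convert_to_woodwork_structure(data)
    assert isinstance(converted, (ww.DataColumn, ww.DataTable))
```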
@freddyaboulton You're right that there are certainly some situations where the conversion to Woodwork structures doesn't add value, but I think it is fine to add for consistency: it lets automl pass the woodwork structures down to the components without worrying about the data structure. Currently, we're in this weird intermediate state where that's not true, but once the work in #1289 is done to pass the Woodwork data structures directly to the pipelines (which pass them to the components), the conversion code will basically be a no-op! Having the conversion code handled in these util methods also allows us to get rid of the …

I see your concern with the conversion though, and it's difficult to fully see how time-consuming it is or how much more memory it takes. If this gets out of hand, we can revisit and optimize?
Yep, that sounds good to me. If we can get #1289 done quickly, this won't matter, because automl search will handle the conversion to woodwork types before the pipelines/components do their thing. And if not, and we do see problems with automl resulting from this, we can do another release quickly once #1289 is merged. One thing to check before merging is that the unit tests don't take significantly longer to run than on main.

Given that we're talking about performance, @angela97lin how would you feel about running some perf tests on this? I think it would be fine to merge this and test after the fact, since we know this is an intermediate step along the way, but it would help us decide when to release.
@dsherry Yup, that sounds good! I'll merge this in first, and if we don't have time to merge in the automl change before the next release, I'll run some perf tests to make sure the conversions aren't taking too long.
Continuation of work for #1288
Relevant: pandas-dev/pandas#30038
Had to tweak `_convert_woodwork_types_wrapper` to properly handle how NaN was converted from nullable array --> pd.Series/DataFrame. Ultimately, I updated it s.t. functionality is preserved as it would be for plain Series/DataFrames.
A column with booleans + nulls using nullable types will be inferred as 'boolean' type. However, this causes a problem when converting to 'bool' type, since NaN values are not allowed and will error out. Converting to `bool` type by handling it manually would convert NaN values to True, which is undesirable. Hence, we convert to `object` type instead (and allow SimpleImputer to fill in the values!).
https://stackoverflow.com/questions/36825925/expressions-with-true-and-is-true-give-different-results
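A minimal illustration of the behavior described above (pandas nullable boolean vs plain bool):

```python
import numpy as np
import pandas as pd

s = pd.Series([True, False, None], dtype="boolean")  # nullable BooleanDtype; the null becomes pd.NA

# Casting straight to 'bool' errors out because missing values can't be represented:
#   s.astype(bool)  # raises ValueError due to the missing value

# Manually filling with NaN and casting would silently turn the missing value into
# True, because the truthiness of NaN is True -- the undesirable behavior above:
print(bool(np.nan))  # True

# Keeping the column as 'object' preserves the missing value so an imputer can fill it later:
print(s.astype(object).tolist())  # [True, False, <NA>]
```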
https://app.circleci.com/pipelines/github/alteryx/evalml/7300/workflows/0d609f82-10ba-4cf8-84b2-01e9ecf7864f/jobs/81615