
Initialize woodwork with partial schemas #2774

Merged: 24 commits merged into main on Sep 21, 2021

Conversation

@eccabay (Contributor) commented Sep 13, 2021

Closes #2744

codecov bot commented Sep 13, 2021

Codecov Report

Merging #2774 (550bd3d) into main (b48ff30) will decrease coverage by 0.1%.
The diff coverage is 92.7%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #2774     +/-   ##
=======================================
- Coverage   99.8%   99.8%   -0.0%     
=======================================
  Files        297     297             
  Lines      27744   27720     -24     
=======================================
- Hits       27676   27643     -33     
- Misses        68      77      +9     
Impacted Files Coverage Δ
evalml/utils/__init__.py 100.0% <ø> (ø)
...valml/tests/pipeline_tests/test_component_graph.py 99.9% <50.0%> (-0.1%) ⬇️
evalml/tests/pipeline_tests/test_pipeline_utils.py 99.6% <50.0%> (-0.4%) ⬇️
evalml/tests/pipeline_tests/test_pipelines.py 99.8% <50.0%> (-<0.1%) ⬇️
evalml/tests/component_tests/test_components.py 99.3% <70.0%> (-0.7%) ⬇️
...nents/transformers/dimensionality_reduction/lda.py 100.0% <100.0%> (ø)
...nents/transformers/dimensionality_reduction/pca.py 100.0% <100.0%> (ø)
...components/transformers/encoders/target_encoder.py 100.0% <100.0%> (ø)
...transformers/feature_selection/feature_selector.py 100.0% <100.0%> (ø)
...elines/components/transformers/imputers/imputer.py 100.0% <100.0%> (ø)
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b48ff30...550bd3d. Read the comment docs.

return _retain_custom_types_and_initalize_woodwork(
X_ww.ww.logical_types, X_t_df, ltypes_to_ignore=[Categorical]
)
columns = X_ww.ww.select(exclude="categorical").columns
Contributor:

This creates a new copy of the dataset, which seems excessive if we only care about column names.

Can you use return_schema=True?

Contributor Author:

Ah, that's an interesting detail, thanks for pointing that out! Will do.

@eccabay eccabay marked this pull request as ready for review September 13, 2021 20:54
@eccabay eccabay requested a review from a team September 13, 2021 20:54
@chukarsten (Contributor) left a comment:

Nice job Becca. Love this. Anything that moves us towards a more natural usage of our dependency libraries and cuts down on our "compatibility libraries" is a good thing!

X_ww.ww.logical_types, X_t_df, ltypes_to_ignore=[Categorical]
)
columns = X_ww.ww.select(exclude="categorical", return_schema=True).columns
X_t_df.ww.init(schema=X_ww.ww.schema._get_subset_schema(columns))
Contributor:

Is this equivalent to getting the full schema (isn't that basically a dict?) and subsetting down to the columns manually? I'm a little iffy on using WW's private function, but that may be worth a request to them to make this function public.

@freddyaboulton (Contributor) left a comment:

@eccabay Thank you for doing this! I agree with @chukarsten that this is a great step forward in how we use woodwork.

I left some comments that I'd like to resolve before merging. Since we're touching a lot of components that are used in AutoMLSearch, I think it's important we run perf tests on this branch to make sure we're not accidentally converting types or something. Last time we touched how logical types are preserved, the perf tests helped us catch a bug before merging: #2297

@@ -58,7 +54,8 @@ def transform(self, X, y=None):
index=X_ww.index,
columns=[f"component_{i}" for i in range(X_t.shape[1])],
)
return _retain_custom_types_and_initalize_woodwork(X_ww, X_t)
X_t.ww.init()
Contributor:

Yep this makes sense to me. The dimensionality reduction components create an entirely new set of features, all numeric, so there are no logical types to "retain".

That being said, one of the goals we're working towards is to preserve as much of the ww schema throughout the pipeline as possible. Just want to make it clear that won't always be possible.

return _retain_custom_types_and_initalize_woodwork(
X_ww.ww.logical_types, X_t_df, ltypes_to_ignore=[Categorical]
)
columns = X_ww.ww.select(exclude="categorical", return_schema=True).columns
Contributor:

Cool use of exclude here.

Contributor:

The string "categorical" will exclude just the Categorical logical type, but the string "category" will exclude all logical types that have the "category" standard tag (including PostalCode, CountryCode, etc.).
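A toy model of this selection semantics (not woodwork's implementation; the column names, logical types, and tag sets below are made up):

```python
# Toy model of woodwork's select/exclude semantics (NOT woodwork's
# implementation): each column has a logical type name plus a set of
# standard tags. Excluding by the logical type name "categorical"
# matches only Categorical columns, while excluding by the standard
# tag "category" also catches PostalCode, CountryCode, etc.
columns = {
    "color": ("categorical", {"category"}),
    "zip": ("postal_code", {"category"}),
    "country": ("country_code", {"category"}),
    "price": ("double", {"numeric"}),
}

def exclude(cols, key):
    """Keep columns whose logical type name and tags both miss `key`."""
    return [name for name, (ltype, tags) in cols.items()
            if ltype != key and key not in tags]

print(exclude(columns, "categorical"))  # ['zip', 'country', 'price']
print(exclude(columns, "category"))     # ['price']
```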

X_ww.ww.logical_types, X_t_df, ltypes_to_ignore=[Categorical]
)
columns = X_ww.ww.select(exclude="categorical", return_schema=True).columns
X_t_df.ww.init(schema=X_ww.ww.schema._get_subset_schema(columns))
Contributor:

Let's file a woodwork issue to make _get_subset_schema public!

Contributor Author:

Filed woodwork issue #1143!

Contributor:

@eccabay and @freddyaboulton So the X_ww.ww.select call above with return_schema=True should be returning the same subset schema of X_ww that you're getting with the _get_subset_schema call!

You could just do lines 84 and 85 as:

no_cat_schema = X_ww.ww.select(exclude="categorical", return_schema=True)
X_t_df.ww.init(schema=no_cat_schema)

(If this doesn't work, definitely let me know!)

Also, I think it's a coincidence that the .columns attribute worked for getting the subset schema here. TableSchema.columns is actually a dictionary of ColumnSchema objects; it just so happens that the way we reference the columns in _get_subset_schema works for a dictionary whose keys are the column names 😅
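The coincidence can be sketched in plain Python (schema_columns and subset are made-up stand-ins, not woodwork code): iterating a dict, or testing membership with `in`, only ever sees its keys, so a dict of ColumnSchema objects behaves like a list of column names here.

```python
# Why passing TableSchema.columns (a dict mapping column name ->
# ColumnSchema) "worked" where a list of column names was expected:
# membership tests on a dict only ever look at its keys.
schema_columns = {"age": "<ColumnSchema: Integer>",
                  "name": "<ColumnSchema: NaturalLanguage>"}

def subset(all_columns, wanted):
    # `wanted` is meant to be a list of names, but a dict works too,
    # because `c in wanted` checks the dict's keys.
    return [c for c in all_columns if c in wanted]

print(subset(["age", "name", "salary"], wanted=schema_columns))  # ['age', 'name']
```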

Contributor:

Thank you @tamargrey !

original_ltypes, X_no_all_null
)

return X_no_all_null
@freddyaboulton (Contributor) commented Sep 14, 2021:

I'm concerned that this only works because the only types that can be nullable and supported by our imputer are Categorical and Double. Since we don't allow nullable types, a column with ints and None ([1, 2, None]) can only be Double and [True, False, None] can only be Categorical. That means that the data returned by the SimpleImputer will have the same logical types.

Here is a snippet of how this implementation will produce a broken ww schema if nullable types are allowed.

[screenshot: snippet showing the resulting broken ww schema]

Nullable types are out of scope of this issue. I'm just bringing this up here because we're thinking of supporting nullable types soon (#2745) so maybe this is just an FYI for @chukarsten and @ParthivNaresh who are assigned to that issue.
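The pandas behavior underlying this reasoning can be sketched without woodwork:

```python
import pandas as pd

# Without pandas' nullable extension dtypes, missing values force a
# dtype change: integers with None become float64 (hence ww Double),
# and booleans with None fall back to object (hence Categorical in
# evalml's setup). That is what makes "the dtypes always match" safe.
ints_with_none = pd.Series([1, 2, None])
bools_with_none = pd.Series([True, False, None])
print(ints_with_none.dtype)   # float64
print(bools_with_none.dtype)  # object

# With a nullable extension dtype, the integer kind survives nulls,
# which is exactly when that assumption stops holding:
nullable_ints = pd.Series([1, 2, None], dtype="Int64")
print(nullable_ints.dtype)    # Int64
```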

return _retain_custom_types_and_initalize_woodwork(
X_ww.ww.logical_types, X_t_df
)
X_t_df.ww.init(schema=X_ww.ww.schema)
Contributor:

I don't think we should do this. Since this is the base class, there's no way to know how the transformer is modifying the features. X_t_df could have fewer columns than X_ww, which could result in an error; or if they have the same columns, the types could be getting converted.

I added assert False before lines 53 and 72 in this file and ran pytest -n 8 evalml/tests/component_tests. Only one test failed, and that was because a MockTransformer relied on the implementation of transform in the base class.

That tells me we can get rid of transform in the base class, because every component we test (except a mock one used in tests) implements transform. We should mark it as an abstract method. It also tells me that all the components we test rely on self.fit(X, y).transform(X, y), since none of the tests triggered the assertion in fit_transform. We should get rid of the try in fit_transform.
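A minimal sketch of the refactor being suggested, with simplified stand-in classes (these are not evalml's actual signatures):

```python
from abc import ABC, abstractmethod

# Simplified stand-ins for evalml's classes, sketching the suggested
# refactor: transform becomes abstract and fit_transform drops the
# try/except fallback.
class Transformer(ABC):
    def fit(self, X, y=None):
        return self

    @abstractmethod
    def transform(self, X, y=None):
        """Every concrete component must implement this."""

    def fit_transform(self, X, y=None):
        # Always fit, then transform; no fallback path to hide bugs.
        return self.fit(X, y).transform(X, y)

class DropFirst(Transformer):
    def transform(self, X, y=None):
        return X[1:]

print(DropFirst().fit_transform([1, 2, 3]))  # [2, 3]

try:
    Transformer()  # the base class can no longer be instantiated
except TypeError as err:
    print("TypeError:", err)
```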

@@ -218,55 +208,6 @@ def test_infer_feature_types_raises_invalid_schema_error():
infer_feature_types(df)


def test_ordinal_retains_order_min():
@freddyaboulton (Contributor) commented Sep 14, 2021:

@chukarsten Is this the only place we had coverage for this bug: #2456 ?

If so @eccabay let's add coverage somewhere else for this case.

Contributor Author:

How important is maintaining this test? This PR moves the typing responsibility from us (through _retain_custom_types_and_initalize_woodwork) to woodwork itself and its init functionality. Therefore any testing we do here is technically testing third-party behavior rather than our own. Is there a specific spot you'd like to see this sort of test?

@freddyaboulton (Contributor) commented Sep 15, 2021:

I think it's important to make sure our components/pipelines preserve the ordinal logical type whenever possible. Before, we had coverage that if a logical type was ordinal when it was passed to a component, it would be ordinal when it came out. I agree that initializing via schema should handle that for us, but it'd be easy to accidentally refactor our code in such a way that ordinal gets cast to categorical by mistake:

[screenshot: example where ordinal is accidentally cast to categorical]

Let's file an issue for adding coverage for ordinal in the #2443 epic.

@@ -134,6 +131,5 @@ def transform(self, X, y=None):
X_categorical = X.ww[self._categorical_cols.tolist()]
imputed = self._categorical_imputer.transform(X_categorical)
X_no_all_null[X_categorical.columns] = imputed
Contributor:

@thehomebrewnerd @tuethan1999 @tamargrey, how come this works?

imputed has a valid ww schema, X_no_all_null has a valid ww schema, and even though we're updating columns of X_no_all_null by not going through the ww accessor, X_no_all_null has a valid woodwork schema after the update.

Is this because imputed and X_no_all_null always have all the same logical types? Is there a way to do this bulk column update by going through the ww accessor?

Contributor:

Does imputed have a valid ww schema because one is initialized in the transform call? Or is it the same object as X_categorical?

Either way, as long as imputed doesn't change the dtype (and having the same logical types would mean the dtypes were the same), it wouldn't invalidate the schema when it gets added to X_no_all_null (this is assuming you guys still aren't using a woodwork index).

But if the columns in X_categorical are not a subset of those in X_no_all_null that could also invalidate the schema (maybe if a categorical column was somehow fully null?).

This operation can happen outside of Woodwork and not invalidate the schema, but that's more of a coincidence than anything else. So if there's ever a chance the woodwork schema could be invalidated in either of the ways described above, you'd want a way of going through woodwork. One option would be to do the bulk column update with a pandas method that woodwork hasn't overwritten, like X_no_all_null.ww.update(imputed) or something.
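The invariant being relied on can be sketched in plain pandas: assigning a block of columns back into a DataFrame leaves the dtypes, and therefore a previously valid ww schema, intact as long as the imputed columns keep their dtypes. The data below is made up:

```python
import pandas as pd

# Stand-ins for X_no_all_null and the imputer's output: made-up data,
# same columns and same dtypes (category stays category).
X_no_all_null = pd.DataFrame({
    "a": pd.Series(["x", None, "y"], dtype="category"),
    "b": [1.0, 2.0, None],
})
imputed = pd.DataFrame({
    "a": pd.Series(["x", "x", "y"], dtype="category"),
    "b": [1.0, 2.0, 1.5],
})

# The bulk column update from the imputer's transform, done outside
# any accessor. Because the incoming dtypes match, the frame's dtypes
# (and so a previously valid ww schema) are left untouched.
X_no_all_null[imputed.columns] = imputed
print(X_no_all_null.dtypes.to_dict())
```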

@freddyaboulton (Contributor) commented Sep 15, 2021:

Thank you @tamargrey for the thoughtful response! Yes, imputed has a valid ww schema because one is initialized in transform.

@eccabay Can you try using ww.update here to see if it works?

@eccabay (Contributor Author) commented Sep 15, 2021:

Interestingly enough, X_no_all_null.ww.update(imputed) wipes out the woodwork initialization for X_no_all_null. Re-initializing X_no_all_null afterward works, but I'm not sure if it's any better than what we currently have. Using update instead also causes woodwork to throw a lot of warnings ("Operation performed by update has invalidated the Woodwork typing information"), which may get annoying. Thoughts?

@tamargrey (Contributor) commented Sep 15, 2021:

That means that the update call invalidated the schema. Can you see the TypingInfoMismatchWarning message that's being produced (there should be something that indicates why the schema has been invalidated)?

The fact that the schema is invalidated is either a side effect of pandas' update, or it's an indication that the resulting X_no_all_null dataframe never had a valid schema but that wasn't caught because none of the changes happened inside of woodwork. If you revert back to the way it was here (so not going through woodwork) but add an assertion that ww.is_schema_valid(X_no_all_null, X_no_all_null.ww.schema) and the assertion passes, then it's likely the update call. Otherwise, it's that the schema was never valid, so going through woodwork will be the route to take.

Contributor Author:

The full warning is this:

evalml/tests/component_tests/test_imputer.py::test_imputer_does_not_erase_ww_info
  Operation performed by update has invalidated the Woodwork typing information:
   dtype mismatch for column b between DataFrame dtype, object, and Categorical dtype, category.
   Please initialize Woodwork with DataFrame.ww.init

When I added the ww.is_schema_valid assertion, a single test failed. However, that test mocked SimpleImputer.transform, in which case imputed was not a woodwork object (which it's otherwise guaranteed to be).

@tamargrey (Contributor) commented Sep 15, 2021:

Okay, that sounds to me like update is just changing the dtypes to object, so it's probably not worth going through woodwork for the update call. I'll leave it up to you whether you're comfortable assuming the schema is valid after these changes happen outside of woodwork, or whether you want to formally call init_with_full_schema as a sanity check that that's actually true.

Contributor:

@tamargrey Thank you for your thoughts on this! I think for now we can assume that the dtypes will always match. That's because the features without missing values will not change and the features with missing values can only be Double or Categorical in which case the imputation will not change the dtype. Of course, that's only because nullable types are currently not allowed.

@bchen1116 (Contributor) left a comment:

Great job with this! I left a nit and a question, but generally agree with Freddy's cleanup statements. Otherwise, looks good to go

return _retain_custom_types_and_initalize_woodwork(
X_ww.ww.logical_types, feature_matrix
)
typed_columns = set(X_ww.columns).intersection(set(feature_matrix.columns))
Contributor:

Do we need this set intersection? It seems like it should be fine with just feature_matrix.columns?

Contributor Author:

We do need the set intersection: calculate_feature_matrix generates new columns that didn't exist in X_ww, but the schema these columns get passed into comes from X_ww.
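A small sketch of why the intersection matters (the column names below are invented for illustration):

```python
# Featuretools' calculate_feature_matrix can emit brand-new feature
# columns that X_ww's schema knows nothing about, so only columns
# present in both frames can safely carry their types over.
X_ww_columns = ["age", "zipcode", "income"]
feature_matrix_columns = ["age", "zipcode", "income", "MEAN(orders.total)"]

typed_columns = set(X_ww_columns).intersection(set(feature_matrix_columns))
print(sorted(typed_columns))  # ['age', 'income', 'zipcode']
```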

@tamargrey (Contributor) left a comment:

Some more woodwork usage comments. Very cool to see the usage of partial schema init here!

return _retain_custom_types_and_initalize_woodwork(
X_ww.ww.logical_types, X_t
)
X_t.ww.init(schema=X_ww.ww.schema)
Contributor:

This could be a good location to use X_t.ww.init_with_full_schema, which is stricter than init in that it would require X_t to already be valid for the schema (meaning none of its dtypes need to be updated at init, for example).

I'm not sure whether that's actually true here, but if it's something that you'd expect (say, none of the dtypes have changed from X_ww during the fit_transform call to create X_t), using init_with_full_schema will alert you to situations where that's not the case instead of silently re-updating the dtype.

In general, it might be worth looking over the places where you're using init(schema=schema) across evalml and changing to init_with_full_schema in any locations where you'd expect the dataframe to already be valid for that schema and where it would be problematic if it wasn't.
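A toy illustration of the init vs init_with_full_schema distinction (this is not woodwork's API; strict_init and expected_dtypes are hypothetical stand-ins for schema validation):

```python
import pandas as pd

# Toy version of a "strict init" that raises on any dtype mismatch
# instead of silently re-casting, the way init(schema=...) would.
# `expected_dtypes` stands in for a ww schema.
def strict_init(df, expected_dtypes):
    for col, dtype in expected_dtypes.items():
        actual = str(df[col].dtype)
        if actual != dtype:
            raise TypeError(
                f"column {col!r} has dtype {actual}, expected {dtype}")

df = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})
strict_init(df, {"a": "float64", "b": "object"})  # valid: passes silently

try:
    strict_init(df, {"a": "int64", "b": "object"})
except TypeError as err:
    print(err)  # the mismatch surfaces instead of being silently fixed
```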

@freddyaboulton (Contributor) left a comment:

Thank you for making changes @eccabay ! I think this is good to merge pending perf test results!

@@ -6,6 +6,8 @@ Release Notes
* Fixed bug where ``calculate_permutation_importance`` was not calculating the right value for pipelines with target transformers :pr:`2782`
* Fixed bug where transformed target values were not used in ``fit`` for time series pipelines :pr:`2780`
* Changes
* Changed woodwork initialization to use partial schemas :pr:`2774`
* Made ``Transformer.transform()`` an abstract method :pr:`2744`
Contributor:

Let's list this as a breaking change?


@@ -92,6 +92,22 @@ def fit(self, X, y):
mock_feature_selector.fit_transform(pd.DataFrame())


def test_feature_selectors_drop_columns_maintains_woodwork():
Contributor:

Thank you for adding this!

@eccabay (Contributor Author) commented Sep 17, 2021:

Ran some perf tests, nothing seems to have changed dramatically with this update!

Perf tests on Confluence found here

@freddyaboulton (Contributor):

@eccabay Awesome! Also, we recently started attaching perf test results to a confluence page so that they're all in the same place.

@eccabay eccabay merged commit 4c51a67 into main Sep 21, 2021
@eccabay eccabay deleted the 2744_ww_partial_schemas branch September 21, 2021 20:01
@chukarsten chukarsten mentioned this pull request Oct 1, 2021
Successfully merging this pull request may close these issues.

Modify our code to initialize woodwork with partial schemas
5 participants