Initialize woodwork with partial schemas #2774

eccabay · 2021-09-13T14:59:10Z

codecov · 2021-09-13T15:03:27Z

Codecov Report

Merging #2774 (550bd3d) into main (b48ff30) will decrease coverage by 0.1%.
The diff coverage is 92.7%.

@@           Coverage Diff           @@
##            main   #2774     +/-   ##
=======================================
- Coverage   99.8%   99.8%   -0.0%     
=======================================
  Files        297     297             
  Lines      27744   27720     -24     
=======================================
- Hits       27676   27643     -33     
- Misses        68      77      +9

Impacted Files	Coverage Δ
evalml/utils/__init__.py	`100.0% <ø> (ø)`
...valml/tests/pipeline_tests/test_component_graph.py	`99.9% <50.0%> (-0.1%)`	⬇️
evalml/tests/pipeline_tests/test_pipeline_utils.py	`99.6% <50.0%> (-0.4%)`	⬇️
evalml/tests/pipeline_tests/test_pipelines.py	`99.8% <50.0%> (-<0.1%)`	⬇️
evalml/tests/component_tests/test_components.py	`99.3% <70.0%> (-0.7%)`	⬇️
...nents/transformers/dimensionality_reduction/lda.py	`100.0% <100.0%> (ø)`
...nents/transformers/dimensionality_reduction/pca.py	`100.0% <100.0%> (ø)`
...components/transformers/encoders/target_encoder.py	`100.0% <100.0%> (ø)`
...transformers/feature_selection/feature_selector.py	`100.0% <100.0%> (ø)`
...elines/components/transformers/imputers/imputer.py	`100.0% <100.0%> (ø)`
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b48ff30...550bd3d. Read the comment docs.

freddyaboulton · 2021-09-13T17:18:53Z

evalml/pipelines/components/transformers/encoders/target_encoder.py

-        return _retain_custom_types_and_initalize_woodwork(
-            X_ww.ww.logical_types, X_t_df, ltypes_to_ignore=[Categorical]
-        )
+        columns = X_ww.ww.select(exclude="categorical").columns


This creates a new copy of the dataset, which seems excessive if we only care about column names.

Can you use return_schema=True?

Ah, that's an interesting detail, thanks for pointing out! Will do.

chukarsten

Nice job Becca. Love this. Anything that moves us towards a more natural usage of our dependency libraries and cuts down on our "compatibility libraries" is a good thing!

chukarsten · 2021-09-14T14:42:23Z

evalml/pipelines/components/transformers/encoders/target_encoder.py

-            X_ww.ww.logical_types, X_t_df, ltypes_to_ignore=[Categorical]
-        )
+        columns = X_ww.ww.select(exclude="categorical", return_schema=True).columns
+        X_t_df.ww.init(schema=X_ww.ww.schema._get_subset_schema(columns))


Is this the equivalent of getting the full schema (isn't that like a dict?) and subsetting down to the columns manually? I'm a little iffy on using WW's private func, but that may be a request to them to publicize this function.

freddyaboulton

@eccabay Thank you for doing this! I agree with @chukarsten that this is a great step forward in how we use woodwork.

I left some comments that I'll like to resolve before merging. Since we're touching a lot of components that are used in AutoMLSearch I think it's important we run perf tests on this branch to make sure we're not accidentally converting types or something. Last time we touched how logical types are preserved, the perf tests helped us catch a bug before merging: #2297

freddyaboulton · 2021-09-14T14:03:20Z

evalml/pipelines/components/transformers/dimensionality_reduction/lda.py

@@ -58,7 +54,8 @@ def transform(self, X, y=None):
            index=X_ww.index,
            columns=[f"component_{i}" for i in range(X_t.shape[1])],
        )
-        return _retain_custom_types_and_initalize_woodwork(X_ww, X_t)
+        X_t.ww.init()


Yep this makes sense to me. The dimensionality reduction components create an entirely new set of features, all numeric, so there are no logical types to "retain".

That being said, one of the goals we're working towards is to preserve as much of the ww schema throughout the pipeline as possible. Just want to make it clear that won't always be possible.

freddyaboulton · 2021-09-14T14:03:44Z

evalml/pipelines/components/transformers/encoders/target_encoder.py

-        return _retain_custom_types_and_initalize_woodwork(
-            X_ww.ww.logical_types, X_t_df, ltypes_to_ignore=[Categorical]
-        )
+        columns = X_ww.ww.select(exclude="categorical", return_schema=True).columns


Cool use of exclude here.

The string "categorical" will exclude just the Categorical logical type, but the string 'category' exclude all logical types that have the 'category' standard tag (including PostalCode, CountryCode, etc).

freddyaboulton · 2021-09-14T14:04:23Z

evalml/pipelines/components/transformers/encoders/target_encoder.py

-            X_ww.ww.logical_types, X_t_df, ltypes_to_ignore=[Categorical]
-        )
+        columns = X_ww.ww.select(exclude="categorical", return_schema=True).columns
+        X_t_df.ww.init(schema=X_ww.ww.schema._get_subset_schema(columns))


Let's file a woodwork issue to make _get_subset_schema public!

Filed woodwork issue #1143!

@eccabay and @freddyaboulton So the X_ww.ww.select call above with return_schema=True should be returning the same subset schema of X_ww that you're getting with the _get_subset_schema call!

You could just do line 84 and 85 as:

no_cat_schema = X_ww.ww.select(exclude="categorical", return_schema=True) X_t_df.ww.init(schema=no_cat_schema)

(If this doesn't work, definitely let me know!)

Also, I think it's a coincidence that the .columns attribute worked in getting the subset schema here. TableSchema.columns is actually a dictionary of ColumnSchema objects, but it just so happens that the way we reference the columns in _get_subset_gchema works for a dictionary where the keys are columns 😅

Thank you @tamargrey !

evalml/pipelines/components/transformers/feature_selection/feature_selector.py

freddyaboulton · 2021-09-14T14:26:00Z

evalml/pipelines/components/transformers/imputers/imputer.py

-            original_ltypes, X_no_all_null
-        )
+
+        return X_no_all_null


I'm concerned that this only works because the only types that can be nullable and supported by our imputer are Categorical and Double. Since we don't allow nullable types, a column with ints and None ([1, 2, None]) can only be Double and [True, False, None] can only be Categorical. That means that the data returned by the SimpleImputer will have the same logical types.

Here is a snippet of how this implementation will produce a broken ww schema if nullable types are allowed.

Nullable types are out of scope of this issue. I'm just bringing this up here because we're thinking of supporting nullable types soon (#2745) so maybe this is just an FYI for @chukarsten and @ParthivNaresh who are assigned to that issue.

freddyaboulton · 2021-09-14T15:05:35Z

evalml/pipelines/components/transformers/transformer.py

-        return _retain_custom_types_and_initalize_woodwork(
-            X_ww.ww.logical_types, X_t_df
-        )
+        X_t_df.ww.init(schema=X_ww.ww.schema)


I don't think we should do this. Since this is the base class, there's no way to know how the transformer is modifying the features. X_t_df could have fewer columns than X_w, which could result in an error or if they have the same columns the types could be getting converted.

I added assert False before line 53 and 72 in this file and ran pytest -n 8 evalml/tests/component_tests. Only one test failed and it's because a MockTransformer relied on the implementation of transform in the base class.

That tells me that we can get rid of transform in the base class because every component we test (except a mock one used in tests) implements transform. We should mark that as an abstract method. The other thing that tells me is that all components we test rely on self.fit(X, y).transform(X, y) since none of the tests triggered the assertion in fit_transform. We should get rid of the try in fit_transform.

freddyaboulton · 2021-09-14T15:07:19Z

evalml/tests/utils_tests/test_woodwork_utils.py

@@ -218,55 +208,6 @@ def test_infer_feature_types_raises_invalid_schema_error():
        infer_feature_types(df)


-def test_ordinal_retains_order_min():


@chukarsten Is this the only place we had coverage for this bug: #2456 ?

If so @eccabay let's add coverage somewhere else for this case.

How important is maintaining this test? This PR is moving the typing responsibility from us (through _retain_custom_types_and_initialize_woodwork) to woodwork itself and their init functionality. Therefore any testing we do on this now is technically just testing third party behavior rather than our own. Is there a specific spot you'd like to see this sort of test?

I think it's important to make sure our components/pipelines preserve the ordinal logical type whenever possible. Before we had coverage that if a logical type was ordinal when it was passed to a component, it would be ordinal when it came out. I agree that initializing via schema should handle that for us but It'd be easy to accidentally refactor our code in such a way that ordinal gets cast to categorical by mistake:

Let's file an issue for adding coverage for ordinal in the #2443 epic.

freddyaboulton · 2021-09-14T15:50:30Z

evalml/pipelines/components/transformers/imputers/imputer.py

@@ -134,6 +131,5 @@ def transform(self, X, y=None):
            X_categorical = X.ww[self._categorical_cols.tolist()]
            imputed = self._categorical_imputer.transform(X_categorical)
            X_no_all_null[X_categorical.columns] = imputed


@thehomebrewnerd @tuethan1999 @tamargrey, how come this works?

imputed has a valid ww schema, X_no_all_null has a valid ww schema, and even though we're updating columns of X_no_all_null by not going through the ww accessor, X_no_all_null has a valid woodwork schema after the update.

Is this because imputed and X_no_all_null always have all the same logical types? Is there a way to do this bulk column update by going through the ww accessor?

Does imputed have a valid ww schema because one is initialized in the transform call? Or is it the same object as X_categorical?

Either way, as long as imputed doesn't change the dtype (and having the same logical types would mean the dtypes were the same), it wouldn't invalidate the schema when it gets added to X_no_all_null (this is assuming you guys still aren't using a woodwork index).

But if the columns in X_categorical are not a subset of those in X_no_all_null that could also invalidate the schema (maybe if a categorical column was somehow fully null?).

This operation can happen outside of Woodwork and not invalidate the schema, but it's more of a coincidence than any thing else. So if there's ever a chance the woodwork schema could be invalidated in either of the ways described above, you'd want a way of going through woodwork. One option would be to get the bulk column update with a pandas method that woodwork hasn't overwritten--like X_no_all_null.ww.update(imputed) or something

Thank you @tamargrey for the thoughtful response! Yea imputed has a valid ww schema because it is initialized in transform.

@eccabay Can you try using ww.update here to see if it works?

Interestingly enough, X_no_all_null.ww.update(imputed) wipes out the woodwork initialization for X_no_all_null. Re-initializing X_no_all_null afterward works, but I'm not sure if it's any better than what we currently have. Using update instead also causes woodwork to throw a lot of warnings ("Operation performed by update has invalidated the Woodwork typing information"), which may get annoying. Thoughts?

That means that the update call invalidated the schema. Can you see the TypingInfoMismatchWarning message that's being produced (there should be something that indicates why the schema has been invalidated)?

The fact that the schema is invalidated is either a side effect of pandas' update, or it's an indication that the resulting X_no_all_null dataframe never had a valid schema but that wasn't caught because none of the changes happened inside of woodwork. If you revert back to the way it was here (so not going through woodwork) but add an assertion that ww.is_schema_valid(X_no_all_null, X_no_all_null.ww.schema) and the assertion passes, then it's likely the update call. Otherwise, it's that the schema was never valid, so going through woodwork will be the route to take.

The full warning is this:

evalml/tests/component_tests/test_imputer.py::test_imputer_does_not_erase_ww_info Operation performed by update has invalidated the Woodwork typing information: dtype mismatch for column b between DataFrame dtype, object, and Categorical dtype, category. Please initialize Woodwork with DataFrame.ww.init

When I set the assertion of ww.is_schema_valid, a single test failed. However, that test mocked the SimpleImputer.transform, in which case imputed was not a woodwork object (which it's otherwise guaranteed to be).

okay, that sounds to me like update is just changing the dtypes to object. So probably not worth going through woodwork for the update call. I'll leave it up to you guys whether you're comfortable with assuming the schema is valid after these changes happen outside of woodwork or whether you want to formally call init_with_full_schema as a sanity check that that's actually true

@tamargrey Thank you for your thoughts on this! I think for now we can assume that the dtypes will always match. That's because the features without missing values will not change and the features with missing values can only be Double or Categorical in which case the imputation will not change the dtype. Of course, that's only because nullable types are currently not allowed.

bchen1116

Great job with this! I left a nit and a question, but generally agree with Freddy's cleanup statements. Otherwise, looks good to go

evalml/pipelines/components/transformers/imputers/target_imputer.py

bchen1116 · 2021-09-14T21:03:40Z

evalml/pipelines/components/transformers/preprocessing/featuretools.py

-        return _retain_custom_types_and_initalize_woodwork(
-            X_ww.ww.logical_types, feature_matrix
-        )
+        typed_columns = set(X_ww.columns).intersection(set(feature_matrix.columns))


Do we need this set intersection? It seem it should be ok with just feature_matrix.columns?

We need the set, since calculate_feature_matrix generates new columns that didn't exist in X_ww, whose schema these columns are passed into.

tamargrey

Some more woodwork usage comments. Very cool to see the usage of partial schema init here!

evalml/pipelines/components/transformers/imputers/per_column_imputer.py

tamargrey · 2021-09-15T14:08:45Z

evalml/pipelines/components/transformers/encoders/target_encoder.py

-        return _retain_custom_types_and_initalize_woodwork(
-            X_ww.ww.logical_types, X_t_df, ltypes_to_ignore=[Categorical]
-        )
+        columns = X_ww.ww.select(exclude="categorical", return_schema=True).columns


The string "categorical" will exclude just the Categorical logical type, but the string 'category' exclude all logical types that have the 'category' standard tag (including PostalCode, CountryCode, etc).

evalml/pipelines/components/transformers/scalers/standard_scaler.py

tamargrey · 2021-09-15T14:17:21Z

evalml/pipelines/components/transformers/transformer.py

-            return _retain_custom_types_and_initalize_woodwork(
-                X_ww.ww.logical_types, X_t
-            )
+            X_t.ww.init(schema=X_ww.ww.schema)


This could be a good location to use X_t.ww.init_with_full_schema, which is stricter than init in that it would require X_t to already be valid for the schema (meaning none of its dtypes need to be updated at init, for example).

I'm not sure whether that's actually true here, but if it's something that you'd expect (say, none of the dtypes have changed from X_ww during the fit_transform call to create X_t), using init_with_full_schema will alert you to situations where that's not the case instead of silently re-updating the dtype.

In general, it might be worth looking over the places where you're using init(schema=schema) across evalML and changing to init_with_full_schema in any locations where you would expect the dataframe to already be valid for that schema and where it'd be problematic if it wasn't.

evalml/tests/dependency_update_check/minimum_core_requirements.txt

freddyaboulton

Thank you for making changes @eccabay ! I think this is good to merge pending perf test results!

freddyaboulton · 2021-09-16T14:27:52Z

docs/source/release_notes.rst

@@ -6,6 +6,8 @@ Release Notes
        * Fixed bug where ``calculate_permutation_importance`` was not calculating the right value for pipelines with target transformers :pr:`2782`
        * Fixed bug where transformed target values were not used in ``fit`` for time series pipelines :pr:`2780`
    * Changes
+        * Changed woodwork initialization to use partial schemas :pr:`2774`
+        * Made ``Transformer.transform()`` an abstract method :pr:`2744`


Let's list this as a breaking change?

freddyaboulton · 2021-09-17T14:19:34Z

evalml/pipelines/components/transformers/imputers/imputer.py

@@ -134,6 +131,5 @@ def transform(self, X, y=None):
            X_categorical = X.ww[self._categorical_cols.tolist()]
            imputed = self._categorical_imputer.transform(X_categorical)
            X_no_all_null[X_categorical.columns] = imputed


@tamargrey Thank you for your thoughts on this! I think for now we can assume that the dtypes will always match. That's because the features without missing values will not change and the features with missing values can only be Double or Categorical in which case the imputation will not change the dtype. Of course, that's only because nullable types are currently not allowed.

freddyaboulton · 2021-09-17T14:33:27Z

evalml/tests/component_tests/test_feature_selectors.py

@@ -92,6 +92,22 @@ def fit(self, X, y):
        mock_feature_selector.fit_transform(pd.DataFrame())


+def test_feature_selectors_drop_columns_maintains_woodwork():


Thank you for adding this!

eccabay · 2021-09-17T20:38:55Z

Ran some perf tests, nothing seems to have changed dramatically with this update!

Perf tests on Confluence found here

freddyaboulton · 2021-09-17T21:06:08Z

@eccabay Awesome! Also, recently we started attaching perf tests results to a confluence page so that they're all in the same place.

eccabay added 4 commits September 13, 2021 09:19

Replace _retain_custom_types with init(partial_schema)

3873228

remove _retain_custom_types_and_initialize_woodwork entirely

9d1603b

Merge branch 'main' into 2744_ww_partial_schemas

ed67b5d

Update release notes

c97145a

Update ww version to 0.7.0

d7f3e00

freddyaboulton reviewed Sep 13, 2021

View reviewed changes

eccabay added 4 commits September 13, 2021 14:49

add return_schema=True

91e9d67

Fix ww typo

af798ae

Merge branch 'main' into 2744_ww_partial_schemas

bd14392

Update woodwork version in all places

450555c

eccabay marked this pull request as ready for review September 13, 2021 20:54

auto-assign bot assigned eccabay Sep 13, 2021

eccabay requested a review from a team September 13, 2021 20:54

Merge branch 'main' into 2744_ww_partial_schemas

a9f79cf

chukarsten approved these changes Sep 14, 2021

View reviewed changes

freddyaboulton suggested changes Sep 14, 2021

View reviewed changes

bchen1116 approved these changes Sep 14, 2021

View reviewed changes

eccabay mentioned this pull request Sep 15, 2021

Make _get_subset_schema public alteryx/woodwork#1143

Closed

Add feature selector test reducing number of cols

9f57f06

tamargrey reviewed Sep 15, 2021

View reviewed changes

eccabay added 7 commits September 15, 2021 10:26

Move Transformer.transform() to abstract method

8a014fa

Address PR comments

08c3573

Merge branch 'main' into 2744_ww_partial_schemas

01d59e7

Fix reference before assignment error

3c7fba3

Merge branch 'main' into 2744_ww_partial_schemas

34f5ba7

Fix docs issue

44b405c

Merge branch 'main' into 2744_ww_partial_schemas

9241a71

eccabay requested a review from freddyaboulton September 16, 2021 13:08

Merge branch 'main' into 2744_ww_partial_schemas

d5f129b

freddyaboulton approved these changes Sep 17, 2021

View reviewed changes

Merge branch 'main' into 2744_ww_partial_schemas

3d8212e

Merge branch 'main' into 2744_ww_partial_schema

1f7f398

eccabay added 3 commits September 20, 2021 10:00

Merge branch 'main' into 2744_ww_partial_schemas

1e0e261

Merge branch 'main' into 2744_ww_partial_schemas

53b1cdf

Merge branch 'main' into 2744_ww_partial_schemas

550bd3d

eccabay merged commit 4c51a67 into main Sep 21, 2021

eccabay deleted the 2744_ww_partial_schemas branch September 21, 2021 20:01

angela97lin mentioned this pull request Sep 22, 2021

Error in infer_feature_types for seoul bike dataset #2827

Closed

chukarsten mentioned this pull request Oct 1, 2021

Release v0.34.0 #2864

Merged

		@@ -218,55 +208,6 @@ def test_infer_feature_types_raises_invalid_schema_error():
		infer_feature_types(df)


		def test_ordinal_retains_order_min():

		@@ -92,6 +92,22 @@ def fit(self, X, y):
		mock_feature_selector.fit_transform(pd.DataFrame())


		def test_feature_selectors_drop_columns_maintains_woodwork():

Initialize woodwork with partial schemas #2774

Initialize woodwork with partial schemas #2774

Conversation

eccabay commented Sep 13, 2021

codecov bot commented Sep 13, 2021 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chukarsten left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

freddyaboulton left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

freddyaboulton Sep 14, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

freddyaboulton Sep 14, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

freddyaboulton Sep 15, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

freddyaboulton Sep 15, 2021 • edited Loading

Choose a reason for hiding this comment

eccabay Sep 15, 2021 • edited Loading

Choose a reason for hiding this comment

tamargrey Sep 15, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tamargrey Sep 15, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bchen1116 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tamargrey left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

freddyaboulton left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eccabay commented Sep 17, 2021 • edited Loading

freddyaboulton commented Sep 17, 2021

codecov bot commented Sep 13, 2021 •

edited

Loading

freddyaboulton Sep 14, 2021 •

edited

Loading

freddyaboulton Sep 14, 2021 •

edited

Loading

freddyaboulton Sep 15, 2021 •

edited

Loading

freddyaboulton Sep 15, 2021 •

edited

Loading

eccabay Sep 15, 2021 •

edited

Loading

tamargrey Sep 15, 2021 •

edited

Loading

tamargrey Sep 15, 2021 •

edited

Loading

eccabay commented Sep 17, 2021 •

edited

Loading