Initialize woodwork with partial schemas #2774
Conversation
Codecov Report
@@           Coverage Diff           @@
##            main   #2774     +/-  ##
=======================================
- Coverage   99.8%   99.8%    -0.0%
=======================================
  Files        297     297
  Lines      27744   27720      -24
=======================================
- Hits       27676   27643      -33
- Misses        68      77       +9
Continue to review full report at Codecov.
return _retain_custom_types_and_initalize_woodwork(
    X_ww.ww.logical_types, X_t_df, ltypes_to_ignore=[Categorical]
)
columns = X_ww.ww.select(exclude="categorical").columns
This creates a new copy of the dataset, which seems excessive if we only care about column names.
Can you use return_schema=True?
Ah, that's an interesting detail, thanks for pointing it out! Will do.
Nice job Becca. Love this. Anything that moves us towards a more natural usage of our dependency libraries and cuts down on our "compatibility libraries" is a good thing!
    X_ww.ww.logical_types, X_t_df, ltypes_to_ignore=[Categorical]
)
columns = X_ww.ww.select(exclude="categorical", return_schema=True).columns
X_t_df.ww.init(schema=X_ww.ww.schema._get_subset_schema(columns))
Is this the equivalent of getting the full schema (isn't that like a dict?) and subsetting down to the columns manually? I'm a little iffy on using WW's private func, but that may be a request to them to publicize this function.
@eccabay Thank you for doing this! I agree with @chukarsten that this is a great step forward in how we use woodwork.
I left some comments that I'd like to resolve before merging. Since we're touching a lot of components that are used in AutoMLSearch, I think it's important we run perf tests on this branch to make sure we're not accidentally converting types or something. Last time we touched how logical types are preserved, the perf tests helped us catch a bug before merging: #2297
@@ -58,7 +54,8 @@ def transform(self, X, y=None):
    index=X_ww.index,
    columns=[f"component_{i}" for i in range(X_t.shape[1])],
)
return _retain_custom_types_and_initalize_woodwork(X_ww, X_t)
X_t.ww.init()
Yep this makes sense to me. The dimensionality reduction components create an entirely new set of features, all numeric, so there are no logical types to "retain".
That being said, one of the goals we're working towards is to preserve as much of the ww schema throughout the pipeline as possible. Just want to make it clear that won't always be possible.
return _retain_custom_types_and_initalize_woodwork(
    X_ww.ww.logical_types, X_t_df, ltypes_to_ignore=[Categorical]
)
columns = X_ww.ww.select(exclude="categorical", return_schema=True).columns
Cool use of exclude here.
The string "categorical" will exclude just the Categorical logical type, but the string "category" will exclude all logical types that have the "category" standard tag (including PostalCode, CountryCode, etc.).
    X_ww.ww.logical_types, X_t_df, ltypes_to_ignore=[Categorical]
)
columns = X_ww.ww.select(exclude="categorical", return_schema=True).columns
X_t_df.ww.init(schema=X_ww.ww.schema._get_subset_schema(columns))
Let's file a woodwork issue to make _get_subset_schema public!
Filed woodwork issue #1143!
@eccabay and @freddyaboulton So the X_ww.ww.select call above with return_schema=True should be returning the same subset schema of X_ww that you're getting with the _get_subset_schema call!
You could just do lines 84 and 85 as:
no_cat_schema = X_ww.ww.select(exclude="categorical", return_schema=True)
X_t_df.ww.init(schema=no_cat_schema)
(If this doesn't work, definitely let me know!)
Also, I think it's a coincidence that the .columns attribute worked in getting the subset schema here. TableSchema.columns is actually a dictionary of ColumnSchema objects, but it just so happens that the way we reference the columns in _get_subset_schema works for a dictionary where the keys are column names 😅
Thank you @tamargrey !
evalml/pipelines/components/transformers/feature_selection/feature_selector.py
    original_ltypes, X_no_all_null
)

return X_no_all_null
I'm concerned that this only works because the only types that can be nullable and supported by our imputer are Categorical and Double. Since we don't allow nullable types, a column with ints and None ([1, 2, None]) can only be Double, and [True, False, None] can only be Categorical. That means that the data returned by the SimpleImputer will have the same logical types.
Here is a snippet of how this implementation will produce a broken ww schema if nullable types are allowed.
Nullable types are out of scope for this issue. I'm just bringing it up here because we're thinking of supporting nullable types soon (#2745), so maybe this is just an FYI for @chukarsten and @ParthivNaresh, who are assigned to that issue.
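The original snippet isn't reproduced above; as a stand-in, here's a pandas-only sketch of the dtype hazard being described (toy data of my own):

```python
import pandas as pd

# Today: ints with a missing value are stored as float64 ("Double"),
# so mean imputation leaves the dtype unchanged.
as_double = pd.Series([1, 2, None])  # dtype: float64
imputed = as_double.fillna(as_double.mean())
assert imputed.dtype == as_double.dtype

# With nullable types, the same data could be Int64. The mean, 1.5,
# cannot live in an integer column, so an imputer has to cast to
# float64 first; the result no longer matches a schema that still
# records an integer logical type.
as_nullable = pd.Series([1, 2, None], dtype="Int64")
imputed_nullable = as_nullable.astype("float64").fillna(as_nullable.mean())
assert imputed_nullable.dtype != as_nullable.dtype
```

A ww schema initialized for the original nullable column would be invalidated by the imputed result's new dtype.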
return _retain_custom_types_and_initalize_woodwork(
    X_ww.ww.logical_types, X_t_df
)
X_t_df.ww.init(schema=X_ww.ww.schema)
I don't think we should do this. Since this is the base class, there's no way to know how the transformer is modifying the features. X_t_df could have fewer columns than X_ww, which could result in an error; or, if they have the same columns, the types could be getting converted.
I added assert False before lines 53 and 72 in this file and ran pytest -n 8 evalml/tests/component_tests. Only one test failed, and that was because a MockTransformer relied on the implementation of transform in the base class.
That tells me we can get rid of transform in the base class, because every component we test (except a mock one used in tests) implements transform. We should mark it as an abstract method. It also tells me that all the components we test rely on self.fit(X, y).transform(X, y), since none of the tests triggered the assertion in fit_transform. We should get rid of the try in fit_transform.
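For illustration, the suggestion might look roughly like this (simplified names, not the actual evalml class hierarchy):

```python
from abc import ABC, abstractmethod


class Transformer(ABC):
    """Base class: transform is abstract, fit_transform composes the two."""

    def fit(self, X, y=None):
        return self

    @abstractmethod
    def transform(self, X, y=None):
        """Subclasses must implement this; there is no base fallback."""

    def fit_transform(self, X, y=None):
        # No try/except fallback: every concrete component implements
        # transform, so fit(...).transform(...) is always safe.
        return self.fit(X, y).transform(X, y)


class DropNothing(Transformer):
    """Trivial concrete component used for demonstration."""

    def transform(self, X, y=None):
        return X
```

Instantiating Transformer directly would now raise a TypeError, which is exactly the guard against components forgetting to implement transform.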
@@ -218,55 +208,6 @@ def test_infer_feature_types_raises_invalid_schema_error():
    infer_feature_types(df)


def test_ordinal_retains_order_min():
@chukarsten Is this the only place we had coverage for this bug: #2456? If so, @eccabay, let's add coverage somewhere else for this case.
How important is maintaining this test? This PR is moving the typing responsibility from us (through _retain_custom_types_and_initialize_woodwork) to woodwork itself and its init functionality. Therefore any testing we do on this now is technically just testing third-party behavior rather than our own. Is there a specific spot you'd like to see this sort of test?
I think it's important to make sure our components/pipelines preserve the ordinal logical type whenever possible. Before, we had coverage that if a logical type was ordinal when it was passed to a component, it would be ordinal when it came out. I agree that initializing via schema should handle that for us, but it'd be easy to accidentally refactor our code in such a way that ordinal gets cast to categorical by mistake.
Let's file an issue for adding coverage for ordinal in the #2443 epic.
@@ -134,6 +131,5 @@ def transform(self, X, y=None):
    X_categorical = X.ww[self._categorical_cols.tolist()]
    imputed = self._categorical_imputer.transform(X_categorical)
    X_no_all_null[X_categorical.columns] = imputed
@thehomebrewnerd @tuethan1999 @tamargrey, how come this works? imputed has a valid ww schema, X_no_all_null has a valid ww schema, and even though we're updating columns of X_no_all_null without going through the ww accessor, X_no_all_null still has a valid woodwork schema after the update.
Is this because imputed and X_no_all_null always have all the same logical types? Is there a way to do this bulk column update by going through the ww accessor?
Does imputed have a valid ww schema because one is initialized in the transform call? Or is it the same object as X_categorical?
Either way, as long as imputed doesn't change the dtype (and having the same logical types would mean the dtypes were the same), it wouldn't invalidate the schema when it gets added to X_no_all_null (this is assuming you guys still aren't using a woodwork index).
But if the columns in X_categorical are not a subset of those in X_no_all_null, that could also invalidate the schema (maybe if a categorical column was somehow fully null?).
This operation can happen outside of Woodwork and not invalidate the schema, but it's more of a coincidence than anything else. So if there's ever a chance the woodwork schema could be invalidated in either of the ways described above, you'd want a way of going through woodwork. One option would be to do the bulk column update with a pandas method that woodwork hasn't overwritten, like X_no_all_null.ww.update(imputed) or something.
Thank you @tamargrey for the thoughtful response! Yea, imputed has a valid ww schema because it is initialized in transform.
@eccabay Can you try using ww.update here to see if it works?
Interestingly enough, X_no_all_null.ww.update(imputed) wipes out the woodwork initialization for X_no_all_null. Re-initializing X_no_all_null afterward works, but I'm not sure it's any better than what we currently have. Using update instead also causes woodwork to throw a lot of warnings ("Operation performed by update has invalidated the Woodwork typing information"), which may get annoying. Thoughts?
That means that the update call invalidated the schema. Can you see the TypingInfoMismatchWarning message that's being produced? There should be something that indicates why the schema has been invalidated.
The fact that the schema is invalidated is either a side effect of pandas' update, or an indication that the resulting X_no_all_null dataframe never had a valid schema but that wasn't caught because none of the changes happened inside of woodwork. If you revert to the way it was here (not going through woodwork) but add an assertion that ww.is_schema_valid(X_no_all_null, X_no_all_null.ww.schema), and the assertion passes, then it's likely the update call. Otherwise, the schema was never valid, and going through woodwork will be the route to take.
The full warning is this:
evalml/tests/component_tests/test_imputer.py::test_imputer_does_not_erase_ww_info
  Operation performed by update has invalidated the Woodwork typing information:
    dtype mismatch for column b between DataFrame dtype, object, and Categorical dtype, category.
  Please initialize Woodwork with DataFrame.ww.init
When I set the assertion of ww.is_schema_valid, a single test failed. However, that test mocked SimpleImputer.transform, in which case imputed was not a woodwork object (which it's otherwise guaranteed to be).
Okay, that sounds to me like update is just changing the dtypes to object, so it's probably not worth going through woodwork for the update call. I'll leave it up to you guys whether you're comfortable with assuming the schema is valid after these changes happen outside of woodwork, or whether you want to formally call init_with_full_schema as a sanity check that it's actually true.
@tamargrey Thank you for your thoughts on this! I think for now we can assume that the dtypes will always match. That's because the features without missing values will not change and the features with missing values can only be Double or Categorical in which case the imputation will not change the dtype. Of course, that's only because nullable types are currently not allowed.
Great job with this! I left a nit and a question, but generally agree with Freddy's cleanup statements. Otherwise, looks good to go
evalml/pipelines/components/transformers/imputers/target_imputer.py
return _retain_custom_types_and_initalize_woodwork(
    X_ww.ww.logical_types, feature_matrix
)
typed_columns = set(X_ww.columns).intersection(set(feature_matrix.columns))
Do we need this set intersection? It seems it should be OK with just feature_matrix.columns?
We need the intersection, since calculate_feature_matrix generates new columns that didn't exist in X_ww, and it's X_ww's schema that these columns are passed into.
Some more woodwork usage comments. Very cool to see the usage of partial schema init here!
evalml/pipelines/components/transformers/imputers/per_column_imputer.py
return _retain_custom_types_and_initalize_woodwork(
    X_ww.ww.logical_types, X_t_df, ltypes_to_ignore=[Categorical]
)
columns = X_ww.ww.select(exclude="categorical", return_schema=True).columns
The string "categorical" will exclude just the Categorical logical type, but the string "category" will exclude all logical types that have the "category" standard tag (including PostalCode, CountryCode, etc.).
evalml/pipelines/components/transformers/scalers/standard_scaler.py
return _retain_custom_types_and_initalize_woodwork(
    X_ww.ww.logical_types, X_t
)
X_t.ww.init(schema=X_ww.ww.schema)
This could be a good location to use X_t.ww.init_with_full_schema, which is stricter than init in that it requires X_t to already be valid for the schema (meaning none of its dtypes need to be updated at init, for example).
I'm not sure whether that's actually true here, but if it's something you'd expect (say, none of the dtypes have changed from X_ww during the fit_transform call that creates X_t), using init_with_full_schema will alert you to situations where that's not the case instead of silently re-updating the dtype.
In general, it might be worth looking over the places where you're using init(schema=schema) across evalml and changing to init_with_full_schema in any locations where you'd expect the dataframe to already be valid for that schema and where it'd be problematic if it wasn't.
evalml/tests/dependency_update_check/minimum_core_requirements.txt
Thank you for making changes @eccabay ! I think this is good to merge pending perf test results!
@@ -6,6 +6,8 @@ Release Notes
    * Fixed bug where ``calculate_permutation_importance`` was not calculating the right value for pipelines with target transformers :pr:`2782`
    * Fixed bug where transformed target values were not used in ``fit`` for time series pipelines :pr:`2780`
    * Changes
    * Changed woodwork initialization to use partial schemas :pr:`2774`
    * Made ``Transformer.transform()`` an abstract method :pr:`2744`
Let's list this as a breaking change?
@@ -92,6 +92,22 @@ def fit(self, X, y):
    mock_feature_selector.fit_transform(pd.DataFrame())


def test_feature_selectors_drop_columns_maintains_woodwork():
Thank you for adding this!
Ran some perf tests, nothing seems to have changed dramatically with this update! Perf tests on Confluence found here
@eccabay Awesome! Also, we recently started attaching perf test results to a Confluence page so that they're all in the same place.
Closes #2744