Component graph and text featurizer: preserve user-specified woodwork types #2297
freddyaboulton merged 19 commits into main
Conversation
evalml/pipelines/component_graph.py
Outdated
```diff
         y (ww.DataColumn, pd.Series): The target training data of length [n_samples]
         """
         X = infer_feature_types(X)
-        X = _convert_woodwork_types_wrapper(X.to_dataframe())
```
There's no need to convert to pandas here. That's all handled in _compute_features. In fact, calling this here is a bug.
Symptom: calls to pipeline fit always use whatever types woodwork infers for all the columns, overriding all user-specified values.
Explanation: the first thing we do in _compute_features is call X = infer_feature_types(X). So, by converting to pandas here and then immediately converting back to woodwork first thing in _compute_features, we were effectively resetting the input to whatever woodwork infers by default.
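A pandas-only sketch of that failure mode (using `category` as a stand-in for a user-specified woodwork logical type; this is not the evalml code path itself, just the round-trip hazard it ran into):

```python
import pandas as pd

# A user explicitly marks an integer-valued column as categorical
# (standing in for a user-specified woodwork logical type).
X_typed = pd.DataFrame({"col": [1, 2, 3]}).astype({"col": "category"})
assert str(X_typed["col"].dtype) == "category"

# Round-tripping through plain values and re-running inference from scratch
# discards the override: inference sees plain integers again.
X_plain = pd.DataFrame({"col": X_typed["col"].tolist()})
assert str(X_plain["col"].dtype) != "category"  # the user's choice was lost
```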
@dsherry this is very interesting and I wonder if it's part of what I was seeing in my investigation with Chris.
```diff
         X_nlp_primitives.fillna(0, inplace=True)

-        X_lsa = self._lsa.transform(X[self._text_columns]).to_dataframe()
+        X_lsa = self._lsa.transform(X_ww[self._text_columns]).to_dataframe()
```
This was the first fix required for this bug. We need to pass the woodwork type info to the LSA component, otherwise it will run woodwork inference again from scratch and the newly reset types will be used by all the downstream components!
```diff
-    X = pd.DataFrame({'column_1': ['a', 'b', 'c', 'd', 'a', 'a', 'b', 'c', 'b'],
-                      'column_2': [1, 2, 3, 4, 5, 6, 5, 4, 3]})
+    X = pd.DataFrame({'column_1': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'],
+                      'column_2': [1, 2, 3, 3, 4, 4, 5, 5, 6]})
```
No functional change, just reordered to make this test easier to understand
```python
                                                 'column_2_1', 'column_2_2', 'column_2_3', 'column_2_4', 'column_2_5', 'column_2_6']
    assert input_feature_names['Elastic Net'] == ['column_3', 'column_1_a', 'column_1_b', 'column_1_c', 'column_1_d',
                                                  'column_2_1', 'column_2_2', 'column_2_3', 'column_2_4', 'column_2_5', 'column_2_6']
    assert input_feature_names['Text'] == ['column_3', 'column_5', 'column_1_a', 'column_1_b', 'column_1_c', 'column_1_d',
```
Woodwork would infer the data in "column_5" to be datetime, but I set it to natural language above. We check here that it passes through the datetime featurizer untouched and enters the text featurizer. On the next assert, we check that it gets transformed into natural language features.
```python
    X['column_4'] = [str((datetime(2021, 5, 21, 12, 0, 0) + timedelta(minutes=5 * x))) for x in range(len(X))]
    X['column_5'] = X['column_4']
    y = pd.Series([1, 0, 1, 0, 1, 1, 0, 0, 0])
    X = infer_feature_types(X, {"column_2": "categorical"})
```
The confusing thing about this test is that it was passing before my changes. That's weird because here we set an integer column, "column_2", to have categorical type, and saw that, as expected, one-hot-encoded features based on "column_2" showed up in the datetime component's input.
```diff
     y = pd.Series(y)
     graph = {'Imputer': [Imputer], 'OHE': [OneHotEncoder, 'Imputer.x', 'Imputer.y']}
-    expected_x = ww.DataTable(pd.DataFrame(index=X.index, columns=X.index).fillna(1))
+    expected_x = ww.DataTable(pd.DataFrame(index=X.index, columns=X.columns).fillna(1))
```
This isn't a "bug" per se, but the imputer will return one column for each input column, not more columns, which is what this test was expecting.
```diff
     component_graph = ComponentGraph(graph).instantiate({})
     component_graph.fit(X, y)
-    expected_x_df = expected_x.to_dataframe().astype("Int64")
+    expected_x_df = expected_x.to_dataframe().astype("float64")
```
If you look at X_y_binary, all 20 columns in X have type float64.
evalml/pipelines/component_graph.py
Outdated
```diff
         if y_input is not None:
             return_y = y_input
-        return_x = infer_feature_types(return_x)
+        return_x = _retain_custom_types_and_initalize_woodwork(X, return_x)
```
This method, _consolidate_inputs, currently only gets called inside _compute_features. If you look at the call site, you'll see that x_inputs is always a list of pandas dataframes. Therefore, after concatenating those pandas dataframes representing input features from the parent components in the graph, we need to call _retain_custom_types_and_initalize_woodwork to preserve the types of any columns which appeared in the original woodwork input.
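A minimal pandas-only sketch of what that retention step has to do after the concat. The `retain_custom_types` helper here is hypothetical, with pandas dtypes standing in for woodwork logical types; it is not the actual `_retain_custom_types_and_initalize_woodwork` implementation:

```python
import pandas as pd

def retain_custom_types(original: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Re-apply the original dtypes (standing in for woodwork logical types)
    to any column of `new` that also appeared in `original`."""
    overrides = {col: original[col].dtype for col in new.columns if col in original.columns}
    return new.astype(overrides)

# The user marked "a" as categorical in the original input...
X = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}).astype({"a": "category"})

# ...but parent components hand back plain pandas frames, and concatenating
# them re-infers the dtypes from scratch: "a" becomes int64 again.
parts = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"b": ["x", "y"]})]
combined = pd.concat(parts, axis=1)

restored = retain_custom_types(X, combined)  # "a" is category once more
assert str(restored["a"].dtype) == "category"
```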
I added test_component_graph_types_multi_input_1 and test_component_graph_types_multi_input_2 to check this is working properly. Both fail without this change.
I think the limitation we still face is that the types specified during pipeline evaluation are not necessarily preserved, only the types specified before pipeline evaluation. We don't make much use of this yet, but we may want to in the future (one thing that comes to mind is preserving semantic tags). I was initially worried about how the OHE sets boolean types, but we're OK for now because there's a one-to-one mapping between the boolean physical type and the boolean logical type, so the inference will always work out in our favor.
I don't think this should block your PR, because I'm not sure there's a way to prevent this from happening until alteryx/woodwork#884 is released, but I just want to call out that we're still not 100% there.
```python
from evalml.pipelines.components import Transformer
from evalml.pipelines import RegressionPipeline
from evalml.utils import infer_feature_types, _retain_custom_types_and_initalize_woodwork
import woodwork as ww
import pandas as pd
import pytest


class ZipCodeExtractor(Transformer):
    name = "Zip Code Extractor"

    def fit(self, X, y):
        return self

    def transform(self, X, y):
        X = infer_feature_types(X)
        X = X.select(["address"])
        X_df = X.to_dataframe()
        X_df['zip_code'] = pd.Series(["02101", "02139", "02152"])
        X_df.drop(columns=X.columns, inplace=True)
        new_X = _retain_custom_types_and_initalize_woodwork(X, X_df)
        new_X = new_X.set_types({'zip_code': "ZipCode"})
        return new_X


class ZipCodeToAveragePrice(Transformer):
    name = "Check Zip Code Preserved"

    def fit(self, X, y):
        return self

    def transform(self, X, y):
        X = infer_feature_types(X)
        X = X.select(["ZipCode"])
        assert len(X.columns) > 0, "No Zip Code!"
        X_df = X.to_dataframe()
        X_df['average_apartment_price'] = pd.Series([1000, 2000, 3000])
        new_X = _retain_custom_types_and_initalize_woodwork(X, X_df)
        return new_X


X = pd.DataFrame({"postal_address": ["address-1", "address-2", "address-3"]})
X = ww.DataTable(X, semantic_tags={"postal_address": ['address']})
y = pd.Series([1500, 2500, 35000])
apartment_price_predictor = RegressionPipeline([ZipCodeExtractor, ZipCodeToAveragePrice, "Random Forest Regressor"])
with pytest.raises(AssertionError, match="No Zip Code!"):
    apartment_price_predictor.fit(X, y)
```

Thanks for this @freddyaboulton. I'll try to file something for this before we merge this PR. It's a great point, and yep, something we'll run into eventually.
| """ | ||
| X = infer_feature_types(X) | ||
| X = _convert_woodwork_types_wrapper(X.to_dataframe()) | ||
| y = infer_feature_types(y) |
This, and the similar change in _compute_features, are important. Otherwise, if the input provided to fit is not a woodwork structure, fit will fail!
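A sketch of the normalize-at-the-boundary idea. The `to_frame` helper is a simplified, hypothetical stand-in for `infer_feature_types`, not the real implementation:

```python
import numpy as np
import pandas as pd

def to_frame(X) -> pd.DataFrame:
    """Accept whatever the caller passes (DataFrame, ndarray, list of rows)
    and normalize it once, at the entry point, so every downstream helper
    can rely on a single representation instead of failing on raw input."""
    if isinstance(X, pd.DataFrame):
        return X
    return pd.DataFrame(np.asarray(X))

# All three call styles now reach the same representation.
frames = [to_frame(pd.DataFrame({"a": [1, 2]})),
          to_frame(np.array([[1], [2]])),
          to_frame([[1], [2]])]
assert all(isinstance(f, pd.DataFrame) for f in frames)
```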
evalml/pipelines/component_graph.py
Outdated
| """ | ||
| if len(x_inputs) == 0: | ||
| return_x = X | ||
| return_x = X.to_dataframe() |
Since we're now expecting that _compute_features will normalize input to woodwork up the stack, we need to convert to pandas here. You can see that this X input is guaranteed to be woodwork by the code in _compute_features.
Ah, woodwork conversions
Haha yep. This is necessary because there's no concatenate feature in woodwork 0.0.11. I'm pretty sure this is handled on @freddyaboulton's accessor feature branch.
Yep, in the ww update branch we just do `return_x = X` hehe, but FYI there still isn't a concatenate feature in ww.
Codecov Report

```
@@           Coverage Diff            @@
##            main    #2297     +/-   ##
========================================
+ Coverage   89.4%    99.9%    +10.5%
========================================
  Files        281      281
  Lines      24671    24760      +89
========================================
+ Hits       22054    24731    +2677
+ Misses      2617       29    -2588
```

Continue to review full report at Codecov.
```python
    assert all([s._is_fitted for s in par_pipelines])

    # Ensure the scores in parallel and sequence are same
    assert set(par_scores) == set(seq_scores)
```
For those following along, this is still ready for review, but I'm waiting for #2181 to merge (woodwork accessor), then I'll rebase this and get it in.
bchen1116
left a comment
Nice! The tests look pretty thorough!
```python
    0.1666874487986626: 1,
    0.13357573073236878: 1,
    0.06778096366056789: 1,
    0.19149451286750388: 154,
```
I had posted a previous comment explaining why this change is necessary, but it got wiped from the diff view with the black PR.
In summary, since we're preserving types we were previously not preserving (bool/int), the predicted probabilities are different, which means the partial dependence is different.
Disregard this and the previous comment! This was actually a symptom of a bug hehe
Perf tests here. Our linear estimators are doing worse, so I think we should hold off on merging until we know why.
```python
        # Because of that, the for-loop below is sufficient.

        logical_types = {}
        logical_types.update(X.ww.logical_types)
```
Before, we would do this before the for-loop, which was causing the performance regression in linear estimators.
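A minimal sketch of why the ordering matters, with plain dicts standing in for the logical-type mappings (hypothetical illustration, not the actual evalml code):

```python
# Types specified by the user up front, and types a component emits later
# for the same column (e.g. Standard Scaler converting it to Double).
user_types = {"col_1": "Categorical"}
scaler_types = {"col_1": "Double"}

# The fix: collect the user's types first, then let component output win.
fixed = dict(user_types)
fixed.update(scaler_types)

# The regression: merging the user's types last stomps the scaler's output.
regressed = dict(scaler_types)
regressed.update(user_types)

assert fixed["col_1"] == "Double"
assert regressed["col_1"] == "Categorical"
```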
| ["column_2", "column_1_a", "column_1_b", "column_1_c", "column_1_d", "column_3"] | ||
| ) | ||
| assert mock_rf_fit.call_args[0][0].ww.logical_types["column_3"] == Integer | ||
| assert mock_rf_fit.call_args[0][0].ww.logical_types["column_2"] == Double |
Adding this to make sure we don't overwrite the types created by the standard scaler.
I've updated the performance tests here! I was able to trace the issue to a bug where we were overriding the types set by the standard scaler. I updated the implementation, so this should be good for review now!
@dsherry I can't add you as a reviewer 😂 But your feedback would be appreciated!
dsherry
left a comment
@freddyaboulton well done on this! Tricky stuff, particularly those scaler types. The tests look comprehensive. GH isn't letting me "approve" because I am still listed as the PR author, haha. But let's do it! ✅
```python
        # update them with the types created by components (if they are different).
        # Components are not expected to create features with the same names,
        # so the only possible clash is between the types selected by the user and the types selected by a component.
        # Because of that, the for-loop below is sufficient.
```
@freddyaboulton thanks for this comment, I think it's warranted given the complexity of preserving types here.
I wonder if woodwork could help manage this for us. I guess adding a concatenate ability would do it.
@dsherry I think you're right that ww.concat would definitely make this easier! Looks like it'll be in the next release so I'm really excited to try it out!
```diff
-        X_lsa = self._lsa.transform(X_ww[self._text_columns])
         X_nlp_primitives.set_index(X_ww.index, inplace=True)
+        X_lsa = self._lsa.transform(X_ww.ww[self._text_columns])
```
Does this change matter? Or could we back it out?
Ohhhh. @freddyaboulton explained offline: the woodwork schema won't be copied by the default pandas indexing operator; you have to use the woodwork namespace. Thanks @freddyaboulton!
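A hypothetical sketch of the hazard, using plain pandas plus a side dict standing in for the woodwork schema (not woodwork itself): the typing info lives alongside the frame, and plain `X[columns]` returns a new frame that never consults it.

```python
import pandas as pd

# Typing info kept next to the frame, like a woodwork schema.
logical_types = {"text_col": "NaturalLanguage", "num_col": "Integer"}
X = pd.DataFrame({"text_col": ["a", "b"], "num_col": [1, 2]})

def select_with_types(X, types, columns):
    """Stand-in for what the .ww accessor provides: select columns
    AND carry the matching type info along with the selection."""
    return X[columns], {c: types[c] for c in columns}

# Plain X[["text_col"]] would drop the schema; the helper keeps it in sync.
subset, subset_types = select_with_types(X, logical_types, ["text_col"])
assert subset_types == {"text_col": "NaturalLanguage"}
```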
| "LSA(col_1)[0]": Double, | ||
| "LSA(col_1)[1]": Double, | ||
| } | ||
| assert X_t.ww.logical_types == expected_logical_types |
```python
    check_feature_names(component_graph.input_feature_names)
    component_graph.input_feature_names = {}
    component_graph.predict(X)
    check_feature_names(component_graph.input_feature_names)
```
Wow, good call adding this, pretty slick. I forgot that predict would set input_feature_names, because I guess it always gets set in fit first.
Fix #2296
Repro and root cause are on the issue.
There was more to the problem than I had originally thought. In addition to the text featurizer not passing along user-specified woodwork types to the LSA component (which is used inside the text featurizer), our component graph also wasn't preserving custom types during fit. I shored up our unit test coverage of this case a bit.
@freddyaboulton I bet the component graph changes will cause conflicts with your woodwork accessor feature branch, but hopefully they will be minor. I tried to write the tests in a way that should let you avoid huge conflicts there, at least.