Component graph and text featurizer: preserve user-specified woodwork types #2297

Merged: 19 commits into main from ds_2296_datetime_nl_woodwork_fix, Jun 8, 2021

Conversation

@dsherry (Contributor) commented May 21, 2021

Fix #2296

Repro and root cause are on the issue.

There was more to the problem than I had originally thought. In addition to the text featurizer not passing along user-specified woodwork types to the LSA component (which is used inside the text featurizer), our component graph also wasn't preserving custom types during fit. I shored up our unit test coverage of this case a bit.

@freddyaboulton I bet the component graph changes will cause conflicts with your woodwork accessor feature branch, but hopefully they will be minor. I tried to write the tests in a way that should help you avoid huge conflicts there, at least.

@@ -120,7 +121,7 @@ def fit(self, X, y):
y (ww.DataColumn, pd.Series): The target training data of length [n_samples]
"""
     X = infer_feature_types(X)
-    X = _convert_woodwork_types_wrapper(X.to_dataframe())
@dsherry (Contributor, Author) commented May 21, 2021:

There's no need to convert to pandas here. That's all handled in _compute_features. In fact, calling this here is a bug.

Symptom: calls to pipeline fit always use whatever types woodwork infers for all the columns, overriding all user-specified values.

Explanation: the first thing we do in _compute_features is call X = infer_feature_types(X). So, by converting to pandas here and then immediately converting back to woodwork first thing in _compute_features, we were effectively resetting the input to whatever woodwork infers by default.
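
To illustrate (a minimal sketch, assuming the era's ww.DataTable API and evalml's infer_feature_types; the column name is illustrative):

import pandas as pd
from evalml.utils import infer_feature_types

# The user explicitly marks an integer column as categorical.
X = pd.DataFrame({"id": [1, 2, 3]})
X_ww = infer_feature_types(X, {"id": "categorical"})

# Converting back to pandas drops the woodwork schema...
X_pd = X_ww.to_dataframe()

# ...so the next inference pass resets "id" to whatever woodwork infers
# by default (an integer type), silently discarding the user's choice.
X_reinferred = infer_feature_types(X_pd)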

Contributor:

@dsherry this is very interesting and I wonder if it's part of what I was seeing in my investigation with Chris.

@@ -119,7 +119,7 @@ def transform(self, X, y=None):
if X_nlp_primitives.isnull().any().any():
X_nlp_primitives.fillna(0, inplace=True)

-        X_lsa = self._lsa.transform(X[self._text_columns]).to_dataframe()
+        X_lsa = self._lsa.transform(X_ww[self._text_columns]).to_dataframe()
Contributor Author:

This was the first fix required for this bug. We need to pass the woodwork type info to the LSA component; otherwise it will run woodwork inference again from scratch, and the newly reset types will be used by all the downstream components!

-    X = pd.DataFrame({'column_1': ['a', 'b', 'c', 'd', 'a', 'a', 'b', 'c', 'b'],
-                      'column_2': [1, 2, 3, 4, 5, 6, 5, 4, 3]})
+    X = pd.DataFrame({'column_1': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'],
+                      'column_2': [1, 2, 3, 3, 4, 4, 5, 5, 6]})
Contributor Author:

No functional change, just reordered to make this test easier to understand

'column_2_1', 'column_2_2', 'column_2_3', 'column_2_4', 'column_2_5', 'column_2_6']
assert input_feature_names['Elastic Net'] == ['column_3', 'column_1_a', 'column_1_b', 'column_1_c', 'column_1_d',
'column_2_1', 'column_2_2', 'column_2_3', 'column_2_4', 'column_2_5', 'column_2_6']
assert input_feature_names['Text'] == ['column_3', 'column_5', 'column_1_a', 'column_1_b', 'column_1_c', 'column_1_d',
Contributor Author:

Woodwork would infer the data in "column_5" to be datetime, but I set it to natural language above. We check here that it passes through the datetime featurizer untouched and enters the text featurizer. On the next assert, we check that it gets transformed into natural language features.
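
For reference, the override described above looks roughly like this (a minimal sketch with illustrative values; the real data lives in the test file):

import pandas as pd
from evalml.utils import infer_feature_types

X = pd.DataFrame({"column_5": ["2020-01-01", "2020-02-14", "2020-03-03"]})
# Left alone, woodwork would infer Datetime for "column_5"; forcing
# NaturalLanguage routes the column through the text featurizer instead.
X = infer_feature_types(X, {"column_5": "NaturalLanguage"})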

y = pd.Series([1, 0, 1, 0, 1, 1, 0, 0, 0])
X = infer_feature_types(X, {"column_2": "categorical"})
Contributor Author:

The confusing thing about this test is that it was passing before my changes. That's weird because here we set the integer column "column_2" to categorical type and saw, as expected, one-hot encoded features based on "column_2" show up in the datetime component's input.

@@ -367,13 +369,13 @@ def test_fit_correct_inputs(mock_ohe_fit_transform, mock_imputer_fit_transform,
X = pd.DataFrame(X)
y = pd.Series(y)
graph = {'Imputer': [Imputer], 'OHE': [OneHotEncoder, 'Imputer.x', 'Imputer.y']}
-    expected_x = ww.DataTable(pd.DataFrame(index=X.index, columns=X.index).fillna(1))
+    expected_x = ww.DataTable(pd.DataFrame(index=X.index, columns=X.columns).fillna(1))
Contributor Author:

This isn't a "bug" per se, but the imputer will return one column for each input column, not more, which is what this test was previously setting up.

expected_y = ww.DataColumn(pd.Series(index=y.index).fillna(0))
mock_imputer_fit_transform.return_value = tuple((expected_x, expected_y))
mock_ohe_fit_transform.return_value = expected_x
component_graph = ComponentGraph(graph).instantiate({})
component_graph.fit(X, y)
expected_x_df = expected_x.to_dataframe().astype("Int64")
expected_x_df = expected_x.to_dataframe().astype("float64")
Contributor Author:

If you look at X_y_binary, all 20 columns in X have type float64.

else:
return_x = pd.concat(x_inputs, axis=1)
return_y = y
if y_input is not None:
return_y = y_input
-        return_x = infer_feature_types(return_x)
+        return_x = _retain_custom_types_and_initalize_woodwork(X, return_x)
Contributor Author:

This method _consolidate_inputs currently only gets called inside _compute_features. If you look at the call site, you'll see that x_inputs is always a list of pandas dataframes. Therefore, after concatenating those pandas dataframes representing input features from the parent components in the graph, we need to call _retain_custom_types_and_initalize_woodwork to preserve the custom types of any columns which appeared in the original woodwork input.
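
Conceptually, that retention step behaves something like the sketch below (assumed behavior, not evalml's actual implementation; retain_custom_types_sketch is a hypothetical name):

import woodwork as ww

def retain_custom_types_sketch(original_ww, new_df):
    # Keep the user's logical types for any column that survived the
    # transformations; woodwork infers types for brand-new columns.
    retained = {
        col: logical_type
        for col, logical_type in original_ww.logical_types.items()
        if col in new_df.columns
    }
    return ww.DataTable(new_df, logical_types=retained)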

@dsherry (Contributor, Author) commented May 21, 2021:

I added test_component_graph_types_multi_input_1 and test_component_graph_types_multi_input_2 to check this is working properly. Both fail without this change.

Contributor:

I think the limitation we still face is that types specified during pipeline evaluation are not necessarily preserved; only types specified before pipeline evaluation are. We don't make much use of this yet, but we may want to in the future (one thing that comes to mind is preserving semantic tags). I was initially worried about how the OHE sets boolean types, but we're OK for now because there's a one-to-one mapping between the boolean physical type and the boolean logical type, so the inference will always work out in our favor.

I don't think this should block your PR, because I'm not sure there's a way to prevent this from happening until alteryx/woodwork#884 is released, but I just want to call out that we're still not 100% there.

from evalml.pipelines.components import Transformer
from evalml.pipelines import RegressionPipeline
from evalml.utils import infer_feature_types, _retain_custom_types_and_initalize_woodwork
import woodwork as ww
import pandas as pd
import pytest

class ZipCodeExtractor(Transformer):
    name = "Zip Code Extractor"

    def fit(self, X, y):
        return self

    def transform(self, X, y):
        X = infer_feature_types(X)
        X = X.select(["address"])
        X_df = X.to_dataframe()
        X_df['zip_code'] = pd.Series(["02101", "02139", "02152"])
        X_df.drop(columns=X.columns, inplace=True)
        new_X = _retain_custom_types_and_initalize_woodwork(X, X_df)
        # Sets a custom logical type *during* pipeline evaluation; this is
        # exactly the kind of type the component graph doesn't yet preserve.
        new_X = new_X.set_types({'zip_code': "ZipCode"})
        return new_X


class ZipCodeToAveragePrice(Transformer):
    name = "Check Zip Code Preserved"

    def fit(self, X, y):
        return self

    def transform(self, X, y):
        X = infer_feature_types(X)
        # Select by the custom logical type set by the upstream component.
        X = X.select(["ZipCode"])
        assert len(X.columns) > 0, "No Zip Code!"
        X_df = X.to_dataframe()
        X_df['average_apartment_price'] = pd.Series([1000, 2000, 3000])
        new_X = _retain_custom_types_and_initalize_woodwork(X, X_df)
        return new_X


X = pd.DataFrame({"postal_address": ["address-1", "address-2", "address-3"]})
X = ww.DataTable(X, semantic_tags={"postal_address": ['address']})
y = pd.Series([1500, 2500, 35000])

apartment_price_predictor = RegressionPipeline([ZipCodeExtractor, ZipCodeToAveragePrice, "Random Forest Regressor"])

# The ZipCode type set mid-pipeline is lost between components, so the
# downstream select comes up empty and the assertion fires.
with pytest.raises(AssertionError, match="No Zip Code!"):
    apartment_price_predictor.fit(X, y)

Contributor Author:

Thanks for this @freddyaboulton. I'll try to file something for this before we merge this PR. It's a great point, and yep, something we'll run into eventually.

@@ -120,7 +121,7 @@ def fit(self, X, y):
y (ww.DataColumn, pd.Series): The target training data of length [n_samples]
"""
     X = infer_feature_types(X)
-    X = _convert_woodwork_types_wrapper(X.to_dataframe())
+    y = infer_feature_types(y)
Contributor Author:

This, and the similar change in _compute_features, are important. Otherwise, if the input provided to fit is not a woodwork structure, fit will fail!
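
A minimal sketch of the failure mode this avoids (assumed input; infer_feature_types normalizes plain pandas input to woodwork before any woodwork-specific calls happen downstream):

import pandas as pd
from evalml.utils import infer_feature_types

X = pd.DataFrame({"a": [1, 2, 3]})  # caller passes plain pandas, not woodwork
y = pd.Series([0, 1, 0])

X = infer_feature_types(X)  # now a woodwork structure either way
y = infer_feature_types(y)  # downstream code can safely assume woodwork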

@@ -319,13 +322,13 @@ def _consolidate_inputs(x_inputs, y_input, X, y):
ww.DataTable, ww.DataColumn: The X and y transformed values to evaluate a component with
"""
if len(x_inputs) == 0:
-        return_x = X
+        return_x = X.to_dataframe()
Contributor Author:

Since we're now expecting _compute_features to normalize input to woodwork up the stack, we need to convert to pandas here. You can see from the code in _compute_features that this X input is guaranteed to be woodwork.

Contributor:

Ah, woodwork conversions

Contributor Author:

Haha, yep. This is necessary because there's no concatenate feature in woodwork 0.0.11. I'm pretty sure this is handled on @freddyaboulton's accessor feature branch.

Contributor:

Yep, in the ww update branch we just do return_x = X, hehe. But FYI, there still isn't a concatenate feature in ww.

codecov bot commented May 21, 2021:

Codecov Report

Merging #2297 (7fdae8a) into main (a318afa) will increase coverage by 10.5%.
The diff coverage is 100.0%.


@@           Coverage Diff            @@
##            main   #2297      +/-   ##
========================================
+ Coverage   89.4%   99.9%   +10.5%     
========================================
  Files        281     281              
  Lines      24671   24760      +89     
========================================
+ Hits       22054   24731    +2677     
+ Misses      2617      29    -2588     
Impacted Files Coverage Δ
...del_understanding_tests/test_partial_dependence.py 100.0% <ø> (+42.8%) ⬆️
evalml/pipelines/component_graph.py 100.0% <100.0%> (+5.6%) ⬆️
...ents/transformers/preprocessing/text_featurizer.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_dask_engine.py 100.0% <100.0%> (ø)
...alml/tests/component_tests/test_text_featurizer.py 100.0% <100.0%> (ø)
...valml/tests/pipeline_tests/test_component_graph.py 100.0% <100.0%> (+3.5%) ⬆️
evalml/automl/automl_search.py 99.9% <0.0%> (+0.2%) ⬆️
evalml/tests/utils_tests/test_logger.py 100.0% <0.0%> (+0.6%) ⬆️
evalml/pipelines/utils.py 100.0% <0.0%> (+1.0%) ⬆️
evalml/tests/conftest.py 100.0% <0.0%> (+1.1%) ⬆️
... and 46 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a318afa...7fdae8a.

@dsherry dsherry marked this pull request as ready for review May 21, 2021 02:44
@@ -154,8 +154,9 @@ def eval_pipelines(pipelines, engine):
assert all([s._is_fitted for s in par_pipelines])

# Ensure the scores in parallel and sequence are same
assert set(par_scores) == set(seq_scores)
Contributor Author:

This test was flaking on me in the 17th decimal place, but only for Linux/Windows Python 3.8 😆 Not sure why exactly this PR would trigger that.

evalml/tests/component_tests/test_text_featurizer.py (outdated review thread, resolved)
@dsherry (Contributor, Author) commented May 24, 2021:

For those following along, this is still ready for review, but I'm waiting for #2181 to merge (woodwork accessor), then I'll rebase this and get it in.

@bchen1116 (Contributor) left a comment:

Nice! The tests look pretty thorough!

@dsherry dsherry force-pushed the ds_2296_datetime_nl_woodwork_fix branch from 1c376ad to 7dfebd3 May 24, 2021 21:01
@dsherry dsherry marked this pull request as draft May 28, 2021 15:31
@dsherry dsherry force-pushed the ds_2296_datetime_nl_woodwork_fix branch from 7dfebd3 to b5e532d May 28, 2021 15:31
0.1666874487986626: 1,
0.13357573073236878: 1,
0.06778096366056789: 1,
0.19149451286750388: 154,
Contributor:

I had posted a previous comment explaining why this change is necessary, but it got wiped from the diff view with the Black PR.

In summary, since we're now preserving types we previously were not preserving (bool/int), the predicted probabilities are different, which means the partial dependence is different.

Contributor:

Disregard this and the previous comment! This was actually a symptom of a bug hehe

@freddyaboulton (Contributor):
Perf tests here

Our linear estimators are doing worse, so I think we should hold off on merging until we know why.

# Because of that, the for-loop below is sufficient.

logical_types = {}
logical_types.update(X.ww.logical_types)
Contributor:

Before, we would do this before the for-loop, which was causing the performance regression in linear estimators.
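
In other words, the intended ordering is roughly the following (a hedged sketch; component_outputs is a hypothetical iterable standing in for the graph's per-component results):

logical_types = {}
logical_types.update(X.ww.logical_types)  # user-specified types, seeded once
for component_output in component_outputs:
    # Later updates win, so types created by components (e.g. the standard
    # scaler's Double columns) are not clobbered by the user-specified types.
    logical_types.update(component_output.ww.logical_types)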

["column_2", "column_1_a", "column_1_b", "column_1_c", "column_1_d", "column_3"]
)
assert mock_rf_fit.call_args[0][0].ww.logical_types["column_3"] == Integer
assert mock_rf_fit.call_args[0][0].ww.logical_types["column_2"] == Double
Contributor:

Adding this to make sure we don't overwrite the types created by the standard scaler.

@freddyaboulton (Contributor):
I've updated the performance tests here!

I was able to trace the issue to a bug where we were overriding the types set by the standard scaler. I updated the implementation so this should be good for review now!

@freddyaboulton freddyaboulton marked this pull request as ready for review June 4, 2021 20:53
@freddyaboulton freddyaboulton assigned dsherry and unassigned dsherry Jun 4, 2021
@freddyaboulton (Contributor):
@dsherry I can't add you as a reviewer 😂 But your feedback would be appreciated!

@dsherry dsherry assigned dsherry and unassigned dsherry Jun 4, 2021
@dsherry (Contributor, Author) left a comment:

@freddyaboulton well done on this! Tricky stuff, particularly those scaler types. The tests look comprehensive. GH isn't letting me "approve" because I am still listed as the PR author, haha. But let's do it! ✅

# update them with the types created by components (if they are different).
# Components are not expected to create features with the same names
# so the only possible clash is between the types selected by the user and the types selected by a component.
# Because of that, the for-loop below is sufficient.
Contributor Author:

@freddyaboulton thanks for this comment, I think it's warranted given the complexity of preserving types here.

Contributor Author:

I wonder if woodwork could help manage this for us. I guess adding a concatenate ability would do it.

Contributor:

@dsherry I think you're right that ww.concat would definitely make this easier! Looks like it'll be in the next release so I'm really excited to try it out!

alteryx/woodwork#932
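
For reference, a sketch of how that might simplify this kind of code once released (assuming the API ships as ww.concat_columns, per the linked issue; it was not yet available at the time of this PR):

import pandas as pd
import woodwork as ww

df1 = pd.DataFrame({"a": [1, 2, 3]})
df1.ww.init(logical_types={"a": "Categorical"})
df2 = pd.DataFrame({"b": [0.1, 0.2, 0.3]})
df2.ww.init()

# Schemas are combined, so user-specified types ride along for free.
combined = ww.concat_columns([df1, df2])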

X_nlp_primitives.set_index(X_ww.index, inplace=True)
X_lsa = self._lsa.transform(X_ww.ww[self._text_columns])
Contributor Author:

Does this change matter? Or could we back it out?

Contributor Author:

Ohhhh, @freddyaboulton explained offline: the woodwork schema won't be copied by the default pandas indexing operator; you have to use the woodwork namespace. Thanks @freddyaboulton!
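
A small sketch of the distinction (column name assumed):

import pandas as pd
import woodwork as ww  # importing woodwork registers the .ww accessor

X = pd.DataFrame({"text_col": ["a b c", "d e f"]})
X.ww.init(logical_types={"text_col": "NaturalLanguage"})

lost = X[["text_col"]]     # plain pandas indexing: woodwork schema not carried over
kept = X.ww[["text_col"]]  # ww accessor indexing: NaturalLanguage preserved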

"LSA(col_1)[0]": Double,
"LSA(col_1)[1]": Double,
}
assert X_t.ww.logical_types == expected_logical_types
Contributor Author:

This is great.

check_feature_names(component_graph.input_feature_names)
component_graph.input_feature_names = {}
component_graph.predict(X)
check_feature_names(component_graph.input_feature_names)
Contributor Author:

Wow, good call adding this, pretty slick. I forgot that predict would set input_feature_names, because I guess it always gets set in fit first.

@freddyaboulton freddyaboulton merged commit 96dc245 into main Jun 8, 2021
@freddyaboulton freddyaboulton deleted the ds_2296_datetime_nl_woodwork_fix branch June 8, 2021 00:12
@chukarsten chukarsten mentioned this pull request Jun 9, 2021
@chukarsten chukarsten mentioned this pull request Jun 22, 2021
Successfully merging this pull request may close these issues:

Converting Datetime to Natural Language produces TextFeaturizer error during pipeline fit