Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding inverse transform method to pipelines #2256

Merged
merged 15 commits into from
Jun 7, 2021

Conversation

freddyaboulton
Copy link
Contributor

@freddyaboulton freddyaboulton commented May 11, 2021

Pull Request Description

Fixes #2046

Follows the recommendation and implementation plan outlined in the design document


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

@@ -95,6 +95,8 @@ def predict(self, X, y=None, objective=None):
if self.estimator.predict_uses_y:
y_arg = y
predictions = self.estimator.predict(features_no_nan, y_arg).to_series()
predictions.index = y.index
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The polynomial detrender needs the predictions to have an index of the same type as the data it was fit on.
I think this is another reason to do #1639

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this line equivalent to using this function? It seems nit picky and trivial, but I feel like I've personally been bitten with weird behavior when I don't use pandas' established API

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think unfortunately in this case reindex_like is not avaliable for series, but even if I'm wrong it'll fill NaNs in the places where the indices don't match. I think reindex does the same thing.

@@ -73,3 +75,11 @@ def fit_transform(self, X, y=None):

def _get_feature_provenance(self):
return {}


class TargetTransformer(Transformer):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need this abstraction to make it clear which components we expect to be able to invert. It also makes the implementation of ComponentGraph.from_list much easier (imo) without having to make every component return both X and y.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed

@codecov
Copy link

codecov bot commented May 12, 2021

Codecov Report

Merging #2256 (a67528c) into main (432f876) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #2256     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        281     281             
  Lines      24607   24719    +112     
=======================================
+ Hits       24578   24690    +112     
  Misses        29      29             
Impacted Files Coverage Δ
evalml/pipelines/component_graph.py 100.0% <100.0%> (ø)
evalml/pipelines/components/component_base_meta.py 100.0% <100.0%> (ø)
...transformers/preprocessing/polynomial_detrender.py 97.6% <100.0%> (ø)
...l/pipelines/components/transformers/transformer.py 100.0% <100.0%> (ø)
evalml/pipelines/pipeline_base.py 99.7% <100.0%> (+0.1%) ⬆️
evalml/pipelines/pipeline_meta.py 100.0% <100.0%> (ø)
...valml/pipelines/time_series_regression_pipeline.py 100.0% <100.0%> (ø)
...tests/component_tests/test_polynomial_detrender.py 100.0% <100.0%> (ø)
...valml/tests/pipeline_tests/test_component_graph.py 100.0% <100.0%> (ø)
.../tests/pipeline_tests/test_time_series_pipeline.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 432f876...a67528c. Read the comment docs.

@freddyaboulton freddyaboulton force-pushed the 2046-inverse-transform branch 2 times, most recently from 0c52447 to 2c640ca Compare May 13, 2021 14:09
@@ -95,6 +95,8 @@ def predict(self, X, y=None, objective=None):
if self.estimator.predict_uses_y:
y_arg = y
predictions = self.estimator.predict(features_no_nan, y_arg).to_series()
predictions.index = y.index
predictions = self.inverse_transform(predictions).to_series()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now the time series regression pipeline is the only pipeline that uses the inverse_transform method.

As we add more target transformers (target encoder as a component, #914), I expect the other pipeline classes to call the inverse_transform method.

return infer_feature_types(y / 2)


class SubsetData(Transformer):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To simulate a transformer that modifies the target but is not a target transformer, e.g. a sampler

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to add this to the class' docstring so that if someone needs to write a new test, they might be able to reuse this.

Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like great work!

Comment on lines 105 to 106
sampler_index = next(iter([i for i, tup in enumerate(without_inverse_transform) if "sampler" in tup[0]]), -1)
component_dict[without_inverse_transform[sampler_index][0]].append(f"{with_inverse_transform[-1][0]}.y")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hahaha, oh my dude. This is some big brain stuff.

def _get_parent_y(self, component_name):
"""Helper for inverse_transform method."""
parents = self.get_parents(component_name)
return next(iter(p[:-2] for p in parents if '.y' in p), None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the -2 so that we capture the string of the parent? I'm a little about what happens if we change all these ".y's"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I think our component graph would break but it would be covered by unit tests at least. I think if a user does something crazy like .z or something, an exception about the graph not being connected should be raised. Maybe we can make that clearer in the future.

@@ -73,3 +75,11 @@ def fit_transform(self, X, y=None):

def _get_feature_provenance(self):
return {}


class TargetTransformer(Transformer):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed

@@ -95,6 +95,8 @@ def predict(self, X, y=None, objective=None):
if self.estimator.predict_uses_y:
y_arg = y
predictions = self.estimator.predict(features_no_nan, y_arg).to_series()
predictions.index = y.index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this line equivalent to using this function? It seems nit picky and trivial, but I feel like I've personally been bitten with weird behavior when I don't use pandas' established API

return infer_feature_types(y / 2)


class SubsetData(Transformer):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to add this to the class' docstring so that if someone needs to write a new test, they might be able to reuse this.

Comment on lines 904 to 978
component_graphs = [(ComponentGraph({"Imputer": [Imputer],
"Log": [LogTransform],
"Random Forest": ["Random Forest Regressor", "Imputer.x", "Log.y"]}),
lambda y: infer_feature_types(np.exp(y.to_series()))),
(ComponentGraph({"Imputer": [Imputer],
"Log": [LogTransform],
"Double": [DoubleTransform, "Log.y"],
"Random Forest": ["Random Forest Regressor", "Imputer.x", "Double.y"]}),
lambda y: infer_feature_types(np.exp(y.to_series() / 2))),
(ComponentGraph({"Imputer": [Imputer],
"Log": [LogTransform, "Imputer.x"],
"Double": [DoubleTransform, "Log.x", "Log.y"],
"Random Forest": ["Random Forest Regressor", "Double.x", "Double.y"]}),
lambda y: infer_feature_types(np.exp(y.to_series() / 2))),
(ComponentGraph({"Imputer": [Imputer],
"OneHot": [OneHotEncoder, "Imputer.x"],
"DateTime": [DateTimeFeaturizer, "OneHot.x"],
"Log": [LogTransform],
"Double": [DoubleTransform, "Log.y"],
"Random Forest": ["Random Forest Regressor", "DateTime.x", "Double.y"]}),
lambda y: infer_feature_types(np.exp(y.to_series() / 2))),
(ComponentGraph({"Imputer": [Imputer],
"OneHot": [OneHotEncoder, "Imputer.x"],
"DateTime": [DateTimeFeaturizer, "OneHot.x"],
"Log": [LogTransform],
"Double": [DoubleTransform, "Log.y"],
"Double2": [DoubleTransform, "Double.y"],
"Random Forest": ["Random Forest Regressor", "DateTime.x", "Double2.y"]}),
lambda y: infer_feature_types(np.exp(y.to_series() / 4))),
(ComponentGraph({"Imputer": ['Imputer'],
"Double": [DoubleTransform],
"DateTime 1": ["DateTime Featurization Component", "Imputer"],
"ET": ["Extra Trees Regressor", "DateTime 1.x", "Double.y"],
"Double 2": [DoubleTransform],
"DateTime 2": ["DateTime Featurization Component", "Imputer"],
"Double 3": [DoubleTransform, "Double 2.y"],
"RandomForest": ["Random Forest Regressor", "DateTime 2.x", "Double 3.y"],
"DateTime 3": ["DateTime Featurization Component", "Imputer"],
"Double 4": [DoubleTransform],
"Linear": ["Linear Regressor", "DateTime 3.x", "Double 4.y"],
"Logistic Regression": ["Linear Regressor", "Linear", "RandomForest", "ET", 'Double 3.y']}),
lambda y: infer_feature_types(y.to_series() / 4)),
(ComponentGraph({"Imputer": [Imputer],
"OneHot": [OneHotEncoder, "Imputer.x"],
"DateTime": [DateTimeFeaturizer, "OneHot.x"],
"Log": [LogTransform],
"Double": [DoubleTransform, "Log.y"],
"Double2": [DoubleTransform, "Double.y"],
"Subset": [SubsetData, "DateTime.x", "Double2.y"],
"Random Forest": ["Random Forest Regressor", "Subset.x", "Subset.y"]}),
lambda y: infer_feature_types(np.exp(y.to_series() / 4))),
(ComponentGraph({"Imputer": [Imputer],
"Random Forest": ["Random Forest Regressor", "Imputer.x"]}),
lambda y: y),
(ComponentGraph({"Imputer": [Imputer],
"DateTime": [DateTimeFeaturizer, "Imputer.x"],
"OneHot": [OneHotEncoder, "DateTime.x"],
"Random Forest": ["Random Forest Regressor", "OneHot.x"]}),
lambda y: y),
(ComponentGraph({"Random Forest": ["Random Forest Regressor"]}),
lambda y: y),
(ComponentGraph({"Imputer": ['Imputer'],
"Double": [DoubleTransform],
"DateTime 1": ["DateTime Featurization Component", "Imputer"],
"ET": ["Extra Trees Regressor", "DateTime 1.x", "Double.y"],
"Double 2": [DoubleTransform],
"DateTime 2": ["DateTime Featurization Component", "Imputer"],
"Double 3": [DoubleTransform, "Double 2.y"],
"RandomForest": ["Random Forest Regressor", "DateTime 2.x", "Double 3.y"],
"DateTime 3": ["DateTime Featurization Component", "Imputer"],
"Double 4": [DoubleTransform],
"Linear": ["Linear Regressor", "DateTime 3.x", "Double 4.y"],
"Logistic Regression": ["Linear Regressor", "Linear", "RandomForest", "ET"]}),
lambda y: y)
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol, my dude...nice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

holy potates

Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments and questions just for my own clarification purposes. The tests look great!

@@ -20,7 +20,7 @@ def _check_for_fit(self, X=None, y=None):
if not self._is_fitted and self.needs_fitting:
raise ComponentNotYetFittedError(f'This {klass} is not fitted yet. You must fit {klass} before calling {method.__name__}.')
elif method.__name__ == 'inverse_transform':
return method(self, X, y)
return method(self, X)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't we return method(self, y)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since inverse transform only takes one argument, the y is actually "called" X in this piece of code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow that's trippy... Any chance you can leave that as a comment for future notes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do!

Components that implement inverse_transform are PolynomialDetrender, LabelEncoder (tbd).

Arguments:
y: (pd.Series, ww.DataTable): Final component features
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit pick (no colon after y):
y (pd.Series, ww.DataTable): Final component features

@@ -23,6 +23,8 @@ def _check_for_fit(self, X=None, objective=None):
return method(self, X)
elif method.__name__ == 'predict':
return method(self, X, objective)
elif method.__name__ == 'inverse_transform':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above?


@pytest.mark.parametrize("component_list,answer", lists)
def test_from_list_with_target_transformers(component_list, answer):
assert ComponentGraph.from_list(component_list).component_dict == answer.component_dict
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are 👌

* Fixes
* Changes
* Updated logging information in ``AutoMLSearch.__init__`` to clarify pipeline generation :pr:`2263`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ParthivNaresh I'm moving this from the 0.24.1 release to the future release!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton Sure thing!

Copy link
Contributor

@ParthivNaresh ParthivNaresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent job with the tests, looks great!

* Fixes
* Changes
* Updated logging information in ``AutoMLSearch.__init__`` to clarify pipeline generation :pr:`2263`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton Sure thing!

# If there are target transformers present, connect them together and then pass the final output
# to the sampler (if present) or the final estimator
with_inverse_transform = list(filter(lambda tup: issubclass(tup[1], TargetTransformer), component_list))
without_inverse_transform = list(filter(lambda tup: not issubclass(tup[1], TargetTransformer), component_list))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely done

Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@freddyaboulton freddyaboulton force-pushed the 2046-inverse-transform branch 3 times, most recently from 69dfa01 to 94b690d Compare May 19, 2021 14:32
Copy link
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some nitpicky comments but otherwise, really cool stuff!!

Components that implement inverse_transform are PolynomialDetrender, LabelEncoder (tbd).

Arguments:
y: (pd.Series, ww.DataTable): Final component features
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo, DataColumn?

"""Removes trends from time series by fitting a polynomial to the data."""
name = "Polynomial Detrender"
_defines_inverse_transform = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this used for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to get rid of this!

"""A component that transforms the target."""

@abstractmethod
def inverse_transform(self, y):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be useful in docstring here to specify arguments + what we expect to return for when people override :)

Components that implement inverse_transform are PolynomialDetrender, LabelEncoder (tbd).

Arguments:
y (pd.Series, ww.DataTable): Final component features
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataColumn? :O

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

Comment on lines 904 to 978
component_graphs = [(ComponentGraph({"Imputer": [Imputer],
"Log": [LogTransform],
"Random Forest": ["Random Forest Regressor", "Imputer.x", "Log.y"]}),
lambda y: infer_feature_types(np.exp(y.to_series()))),
(ComponentGraph({"Imputer": [Imputer],
"Log": [LogTransform],
"Double": [DoubleTransform, "Log.y"],
"Random Forest": ["Random Forest Regressor", "Imputer.x", "Double.y"]}),
lambda y: infer_feature_types(np.exp(y.to_series() / 2))),
(ComponentGraph({"Imputer": [Imputer],
"Log": [LogTransform, "Imputer.x"],
"Double": [DoubleTransform, "Log.x", "Log.y"],
"Random Forest": ["Random Forest Regressor", "Double.x", "Double.y"]}),
lambda y: infer_feature_types(np.exp(y.to_series() / 2))),
(ComponentGraph({"Imputer": [Imputer],
"OneHot": [OneHotEncoder, "Imputer.x"],
"DateTime": [DateTimeFeaturizer, "OneHot.x"],
"Log": [LogTransform],
"Double": [DoubleTransform, "Log.y"],
"Random Forest": ["Random Forest Regressor", "DateTime.x", "Double.y"]}),
lambda y: infer_feature_types(np.exp(y.to_series() / 2))),
(ComponentGraph({"Imputer": [Imputer],
"OneHot": [OneHotEncoder, "Imputer.x"],
"DateTime": [DateTimeFeaturizer, "OneHot.x"],
"Log": [LogTransform],
"Double": [DoubleTransform, "Log.y"],
"Double2": [DoubleTransform, "Double.y"],
"Random Forest": ["Random Forest Regressor", "DateTime.x", "Double2.y"]}),
lambda y: infer_feature_types(np.exp(y.to_series() / 4))),
(ComponentGraph({"Imputer": ['Imputer'],
"Double": [DoubleTransform],
"DateTime 1": ["DateTime Featurization Component", "Imputer"],
"ET": ["Extra Trees Regressor", "DateTime 1.x", "Double.y"],
"Double 2": [DoubleTransform],
"DateTime 2": ["DateTime Featurization Component", "Imputer"],
"Double 3": [DoubleTransform, "Double 2.y"],
"RandomForest": ["Random Forest Regressor", "DateTime 2.x", "Double 3.y"],
"DateTime 3": ["DateTime Featurization Component", "Imputer"],
"Double 4": [DoubleTransform],
"Linear": ["Linear Regressor", "DateTime 3.x", "Double 4.y"],
"Logistic Regression": ["Linear Regressor", "Linear", "RandomForest", "ET", 'Double 3.y']}),
lambda y: infer_feature_types(y.to_series() / 4)),
(ComponentGraph({"Imputer": [Imputer],
"OneHot": [OneHotEncoder, "Imputer.x"],
"DateTime": [DateTimeFeaturizer, "OneHot.x"],
"Log": [LogTransform],
"Double": [DoubleTransform, "Log.y"],
"Double2": [DoubleTransform, "Double.y"],
"Subset": [SubsetData, "DateTime.x", "Double2.y"],
"Random Forest": ["Random Forest Regressor", "Subset.x", "Subset.y"]}),
lambda y: infer_feature_types(np.exp(y.to_series() / 4))),
(ComponentGraph({"Imputer": [Imputer],
"Random Forest": ["Random Forest Regressor", "Imputer.x"]}),
lambda y: y),
(ComponentGraph({"Imputer": [Imputer],
"DateTime": [DateTimeFeaturizer, "Imputer.x"],
"OneHot": [OneHotEncoder, "DateTime.x"],
"Random Forest": ["Random Forest Regressor", "OneHot.x"]}),
lambda y: y),
(ComponentGraph({"Random Forest": ["Random Forest Regressor"]}),
lambda y: y),
(ComponentGraph({"Imputer": ['Imputer'],
"Double": [DoubleTransform],
"DateTime 1": ["DateTime Featurization Component", "Imputer"],
"ET": ["Extra Trees Regressor", "DateTime 1.x", "Double.y"],
"Double 2": [DoubleTransform],
"DateTime 2": ["DateTime Featurization Component", "Imputer"],
"Double 3": [DoubleTransform, "Double 2.y"],
"RandomForest": ["Random Forest Regressor", "DateTime 2.x", "Double 3.y"],
"DateTime 3": ["DateTime Featurization Component", "Imputer"],
"Double 4": [DoubleTransform],
"Linear": ["Linear Regressor", "DateTime 3.x", "Double 4.y"],
"Logistic Regression": ["Linear Regressor", "Linear", "RandomForest", "ET"]}),
lambda y: y)
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

holy potates

@freddyaboulton freddyaboulton force-pushed the 2046-inverse-transform branch 3 times, most recently from b17e826 to df40b9a Compare May 25, 2021 16:11
@freddyaboulton freddyaboulton force-pushed the 2046-inverse-transform branch 2 times, most recently from 8705adf to d9e484d Compare June 7, 2021 16:59
@freddyaboulton freddyaboulton merged commit 9c414d3 into main Jun 7, 2021
@freddyaboulton freddyaboulton deleted the 2046-inverse-transform branch June 7, 2021 21:53
This was referenced Jun 9, 2021
@chukarsten chukarsten mentioned this pull request Jun 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PipelineBase inverse_transform method
5 participants