Skip to content

WW 0.2.0 Update: Updating pipelines#2205

Merged
freddyaboulton merged 9 commits into2035-use-ww-accessorfrom
2035-update-pipelines
May 7, 2021
Merged

WW 0.2.0 Update: Updating pipelines#2205
freddyaboulton merged 9 commits into2035-use-ww-accessorfrom
2035-update-pipelines

Conversation

@freddyaboulton
Copy link
Copy Markdown
Contributor

@freddyaboulton freddyaboulton commented Apr 29, 2021

Pull Request Description

Update pipelines and stacked ensemble unit tests.

See the roadmap here


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

X = pd.DataFrame()
# Normalize the data into pandas objects
X_ww = infer_feature_types(X)
X_ww = infer_feature_types(X).ww.copy()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a mistake I made in my components pr. Before we use assign which creates a new dataframe. Since we're not doing that, we need to create the copy ourselves. This is important so that

pipeline.predict(X, y)
pipeline.predict(X, y)

works. Pipeline tests caught this but I added a unit test in the delayed_feature_transformer unit tests as well.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, so this makes sure that we don't change the original X input. Did we need to do this in the other components as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question @bchen1116 ! I am adding coverage for the datetime featurizer and the text featurizer.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 29, 2021

Codecov Report

Merging #2205 (ec66b2d) into 2035-use-ww-accessor (90578c7) will increase coverage by 16.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@                   Coverage Diff                   @@
##           2035-use-ww-accessor   #2205      +/-   ##
=======================================================
+ Coverage                  53.2%   69.2%   +16.1%     
=======================================================
  Files                       280     280              
  Lines                     24038   24012      -26     
=======================================================
+ Hits                      12770   16609    +3839     
+ Misses                    11268    7403    -3865     
Impacted Files Coverage Δ
evalml/tests/component_tests/test_components.py 100.0% <ø> (+2.6%) ⬆️
evalml/pipelines/binary_classification_pipeline.py 100.0% <100.0%> (+54.2%) ⬆️
.../pipelines/binary_classification_pipeline_mixin.py 100.0% <100.0%> (+67.8%) ⬆️
evalml/pipelines/classification_pipeline.py 100.0% <100.0%> (+69.0%) ⬆️
evalml/pipelines/component_graph.py 98.8% <100.0%> (+57.3%) ⬆️
...rmers/preprocessing/delayed_feature_transformer.py 100.0% <100.0%> (ø)
evalml/pipelines/pipeline_base.py 95.8% <100.0%> (+58.8%) ⬆️
evalml/pipelines/regression_pipeline.py 100.0% <100.0%> (+58.9%) ⬆️
.../pipelines/time_series_classification_pipelines.py 98.1% <100.0%> (+77.1%) ⬆️
...valml/pipelines/time_series_regression_pipeline.py 98.0% <100.0%> (+79.4%) ⬆️
... and 47 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 90578c7...ec66b2d. Read the comment docs.

@freddyaboulton freddyaboulton marked this pull request as ready for review April 29, 2021 18:35
Copy link
Copy Markdown
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job getting this crunched out so quick! LGTM, just left a quick question

X = pd.DataFrame()
# Normalize the data into pandas objects
X_ww = infer_feature_types(X)
X_ww = infer_feature_types(X).ww.copy()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, so this makes sure that we don't change the original X input. Did we need to do this in the other components as well?

@freddyaboulton freddyaboulton changed the title Updating pipelines WW 0.2.0 Update: Updating pipelines May 3, 2021
Copy link
Copy Markdown
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.

Copy link
Copy Markdown
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Great catch with needing to copy the data, loving how much we're able to continue to clean up :)) Just left a few comments but nothing blocking!

proba = self.estimator.predict_proba(X).to_dataframe()
proba.columns = self._encoder.classes_
proba = self.estimator.predict_proba(X)
proba = proba.ww.rename(columns={col: new_col for col, new_col in zip(proba.columns, self._encoder.classes_)})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, cool that we can use this!!

Comment thread evalml/pipelines/component_graph.py Outdated
Comment on lines +222 to +223
if isinstance(parent_x, pd.Series):
parent_x = pd.Series(parent_x, name=parent_input)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Omega nitpick but is it possible to just use rename? :o

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think you can even just do parent_x.name = parent_input right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Comment thread evalml/pipelines/utils.py
pp_components.append(DropNullColumns)
input_logical_types = set(X.logical_types.values())
types_imputer_handles = {logical_types.Boolean, logical_types.Categorical, logical_types.Double, logical_types.Integer}
input_logical_types = set(X.ww.logical_types.values())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're calling .ww here, do we need to make sure that we've called X.ww.init first?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think since this is a private method and is only called in make_pipeline, its ok to assume that infer_feature_types was called up the stack i.e. in make_pipeline. Our test coverage for make_pipeline can enforce this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep agreed @dsherry !

def test_serialization(X_y_binary, ts_data, tmpdir, helper_functions):
path = os.path.join(str(tmpdir), 'component.pkl')
for component_class in all_components():
if component_class in {StackedEnsembleClassifier, StackedEnsembleRegressor}:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

'0_day_of_week': {'Saturday': 6, 'Tuesday': 2}}


def test_datetime_featurizer_does_not_modify_input_data():
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests look good! Not necessary here but I wonder if this is something we'd want for all of our components 🤔

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Not required to add in this PR.

expected_x_df = expected_x.to_dataframe().astype("Int64")
assert_frame_equal(expected_x_df, mock_ohe_fit_transform.call_args[0][0].to_dataframe())
assert_series_equal(expected_y.to_series(), mock_ohe_fit_transform.call_args[0][1].to_series())
expected_x_df = expected_x.astype("int64")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question I've had before but do we need to specify astype("int64") anymore? Is this just left for clarity? :o

Copy link
Copy Markdown
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is so beautiful I feel like this

john_cena

if parent_output is not None:
final_component_inputs.append(parent_output)
concatted = pd.concat([component_input.to_dataframe() for component_input in final_component_inputs], axis=1)
concatted = pd.concat([component_input for component_input in final_component_inputs], axis=1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way we can preserve woodwork info in the concatenation here and avoid having to call infer_feature_types at the end? AKA pd.ww.concat? That call to infer_feature_types(concatted) should not be doing any inference, right?

When we finish the upgrade and merge the feature branch, it may be worthwhile to revisit component graph evaluation specifically, and to make sure we know exactly when inference is occurring. (Maybe you already know this; I don't think I do 100% yet!)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that ww.concat would be ideal but it's currently in development! alteryx/woodwork#884

Also agree on auditing the ComponentGraph so we can streamline inference once the dust settles.

Comment thread evalml/pipelines/component_graph.py Outdated
Comment on lines +222 to +223
if isinstance(parent_x, pd.Series):
parent_x = pd.Series(parent_x, name=parent_input)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think you can even just do parent_x.name = parent_input right?

Comment thread evalml/pipelines/utils.py
pp_components.append(DropNullColumns)
input_logical_types = set(X.logical_types.values())
types_imputer_handles = {logical_types.Boolean, logical_types.Categorical, logical_types.Double, logical_types.Integer}
input_logical_types = set(X.ww.logical_types.values())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think since this is a private method and is only called in make_pipeline, its ok to assume that infer_feature_types was called up the stack i.e. in make_pipeline. Our test coverage for make_pipeline can enforce this.

'0_day_of_week': {'Saturday': 6, 'Tuesday': 2}}


def test_datetime_featurizer_does_not_modify_input_data():
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Not required to add in this PR.

@freddyaboulton freddyaboulton merged commit 0e65c58 into 2035-use-ww-accessor May 7, 2021
@freddyaboulton freddyaboulton deleted the 2035-update-pipelines branch May 7, 2021 00:40
freddyaboulton added a commit that referenced this pull request May 25, 2021
* WW Accessor: Update demo datasets, preprocessing, and utils (#2172)

* Update demos, utils, and preprocessing

* Updating make test commands

* Update ww requirement

* Add test that infer_feature_types preserves schema

* Add test that infer_feature_types raises errors with invalid schema

* load_data always returns woodwork info. Deleted return_pandas

* Updating docstrings

* Release notes for first PR

* Deleting redundant tests

* update test name

* WW 0.2.0: Update Data Checks (#2182)

* Updating data checks

* Undo accidental edit

* Fixing typo where we assign to ww.init

* WW Accessor: Updating objectives (#2185)

* Updating objectives

* Deleting superfluous pd.Series

* WW Accessor: Updating Components (#2191)

* Update components - first commit

* Update delayed_features_transformer to not use assign

* Fixing tests

* Skipping Boolean with Nan test in imputers

* Fixing base sampler _prepare_data

* Fixing target imputer null bool test

* Fix test skips

* Addressing comments

* Clean up sampler tests

* Editing docstrings

* Update to ww 0.3.0

* Fixing components that didn't have merge conflicts

* WW 0.2.0 Update: Updating pipelines (#2205)

* Updating pipelines

* Fixing docstrings

* Fix stacked test

* Add tests to check input data not modified

* Use rename in component graph

* WW Accessor: Update model understanding (#2247)

* Update model understanding module

* Removing unused import

* WW Accessor: Update Automl (#2243)

* Updating pipelines

* Fixing docstrings

* Fix stacked test

* Update automl

* Update to preserve schema

* Preserve schema in train_pipelines/score_pipelines

* No ww drop

* Update standard scaler

* Use weak ref feture branch

* Add tests to check input data not modified

* Use weak-ref branch

* Use weak ref branch

* Fixing tests

* Set ww back to 0.3.0

* Adding objective tests that use automl

* Add back minimal-dependencies-flag

* Fix test command

* Updating docstrings adding dask test to check schema is sent

* Fixing import order in test_dask_engine

* Fixing last remaining docstrings

* Fix typo in ignore command

* Use ww init series

* No doctest modules yet

* Fix docstrings & use schemas stored in automl_config

* WW Update: Update docs and docstrings (#2292)

* Fixing docs and docstrings

* Update docstring in highly_null_data_check

* Delete warning from init

* Remove print(X.ww)

* Fixing notebook python version :(

* Update release notes

* Delete _convert_woodwork_types_wrapper completely

* Fixing coverage

* Add back standard scaler try

* Remove _convert_woodwork_types_wrapper from docs'

* Linting docs

* Merging main

* Fix release notes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants