WW 0.2.0 Update: Updating pipelines #2205
freddyaboulton merged 9 commits into 2035-use-ww-accessor
Conversation
  X = pd.DataFrame()
  # Normalize the data into pandas objects
- X_ww = infer_feature_types(X)
+ X_ww = infer_feature_types(X).ww.copy()
This is a mistake I made in my components PR. Previously we used assign, which creates a new dataframe. Since we're no longer doing that, we need to create the copy ourselves. This is important so that pipeline.predict(X, y) works. Pipeline tests caught this, but I added a unit test in the delayed_feature_transformer unit tests as well.
Awesome, so this makes sure that we don't change the original X input. Did we need to do this in the other components as well?
Great question @bchen1116 ! I am adding coverage for the datetime featurizer and the text featurizer.
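The copy matters because a transformer that adds columns directly to its input mutates the caller's dataframe. A minimal pandas-only sketch of the failure mode being fixed (the `add_delayed_feature` function is hypothetical, standing in for the delayed-feature transformer's behavior):

```python
import pandas as pd

def add_delayed_feature(X):
    # Copy first so the shifted column is not written into the caller's frame.
    # Dropping this line is exactly the bug the diff fixes.
    X = X.copy()
    X["feature_delay_1"] = X["feature"].shift(1)
    return X

X = pd.DataFrame({"feature": [1, 2, 3]})
transformed = add_delayed_feature(X)

# The caller's input is untouched; only the returned frame gained a column.
assert list(X.columns) == ["feature"]
assert "feature_delay_1" in transformed.columns
```

Without the copy, a second call like `pipeline.predict(X, y)` would see an `X` already polluted with delayed-feature columns from the first call.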
Codecov Report
@@ Coverage Diff @@
## 2035-use-ww-accessor #2205 +/- ##
=======================================================
+ Coverage 53.2% 69.2% +16.1%
=======================================================
Files 280 280
Lines 24038 24012 -26
=======================================================
+ Hits 12770 16609 +3839
+ Misses 11268 7403 -3865
Continue to review full report at Codecov.
bchen1116
left a comment
Great job getting this crunched out so quickly! LGTM, just left a quick question.
chukarsten
left a comment
This looks good to me.
angela97lin
left a comment
LGTM! Great catch with needing to copy the data, loving how much we're able to continue to clean up :)) Just left a few comments but nothing blocking!
- proba = self.estimator.predict_proba(X).to_dataframe()
- proba.columns = self._encoder.classes_
+ proba = self.estimator.predict_proba(X)
+ proba = proba.ww.rename(columns={col: new_col for col, new_col in zip(proba.columns, self._encoder.classes_)})
Amazing, cool that we can use this!!
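The rename maps the estimator's positional probability columns onto the encoder's class names. A plain-pandas sketch of the same pattern (`ww.rename` mirrors `DataFrame.rename`; note that `dict(zip(...))` is a shorter equivalent of the dict comprehension in the diff — the `classes` values here are made up for illustration):

```python
import pandas as pd

proba = pd.DataFrame([[0.7, 0.3], [0.2, 0.8]], columns=[0, 1])
classes = ["cat", "dog"]  # hypothetical encoder classes

# Map positional column labels (0, 1, ...) to class names.
proba = proba.rename(columns=dict(zip(proba.columns, classes)))

assert list(proba.columns) == ["cat", "dog"]
```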
  if isinstance(parent_x, pd.Series):
      parent_x = pd.Series(parent_x, name=parent_input)
Omega nitpick but is it possible to just use rename? :o
Yeah I think you can even just do parent_x.name = parent_input right?
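All three options discussed here produce a series with the desired name; they differ in whether a new object is allocated. A small sketch of the alternatives:

```python
import pandas as pd

parent_x = pd.Series([1, 2, 3])

# Option 1 (the diff): re-wrap in a new Series with the name set.
renamed = pd.Series(parent_x, name="parent_input")

# Option 2 (rename suggestion): Series.rename with a scalar sets the name
# and returns a renamed copy.
renamed2 = parent_x.rename("parent_input")

# Option 3 (attribute suggestion): assign .name in place, mutating the
# existing Series rather than creating a new one.
parent_x.name = "parent_input"

assert renamed.name == renamed2.name == parent_x.name == "parent_input"
```

The in-place option is the cheapest but mutates shared state, which is worth avoiding given that this very PR fixes an input-mutation bug elsewhere.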
  pp_components.append(DropNullColumns)
- input_logical_types = set(X.logical_types.values())
+ input_logical_types = set(X.ww.logical_types.values())
  types_imputer_handles = {logical_types.Boolean, logical_types.Categorical, logical_types.Double, logical_types.Integer}
If we're calling .ww here, do we need to make sure that we've called X.ww.init first?
I think since this is a private method and is only called in make_pipeline, it's OK to assume that infer_feature_types was called up the stack, i.e. in make_pipeline. Our test coverage for make_pipeline can enforce this.
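The set comparison being built here decides whether the default imputer covers every logical type in the input. A sketch of that check with plain Python classes standing in for the woodwork logical types (the class names are stand-ins, not the real woodwork API):

```python
# Stand-ins for woodwork logical type classes.
class Boolean: pass
class Categorical: pass
class Double: pass
class Integer: pass
class NaturalLanguage: pass

types_imputer_handles = {Boolean, Categorical, Double, Integer}

# e.g. what set(X.ww.logical_types.values()) might yield for a frame
# containing a numeric column and a text column.
input_logical_types = {Double, NaturalLanguage}

# Extra preprocessing components are needed only when some input type
# falls outside what the imputer handles.
needs_extra_components = not input_logical_types.issubset(types_imputer_handles)
assert needs_extra_components
```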
  def test_serialization(X_y_binary, ts_data, tmpdir, helper_functions):
      path = os.path.join(str(tmpdir), 'component.pkl')
      for component_class in all_components():
          if component_class in {StackedEnsembleClassifier, StackedEnsembleRegressor}:
      '0_day_of_week': {'Saturday': 6, 'Tuesday': 2}}


  def test_datetime_featurizer_does_not_modify_input_data():
These tests look good! Not necessary here but I wonder if this is something we'd want for all of our components 🤔
Agreed. Not required to add in this PR.
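If this check were generalized to all components, it could take the shape of a reusable helper that snapshots the input before transforming and compares afterwards. A hedged sketch (the `transform` function here is a toy stand-in for a datetime featurizer, not the real component):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def transform(X):
    # Toy featurizer: copies its input, then derives a year column.
    X = X.copy()
    X["year"] = pd.to_datetime(X["date"]).dt.year
    return X

def check_does_not_modify_input(transform_fn, X):
    expected = X.copy(deep=True)
    transform_fn(X)
    # Fails if the component wrote anything into the caller's frame.
    assert_frame_equal(X, expected)

X = pd.DataFrame({"date": ["2021-01-01", "2021-06-15"]})
check_does_not_modify_input(transform, X)
```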
- expected_x_df = expected_x.to_dataframe().astype("Int64")
- assert_frame_equal(expected_x_df, mock_ohe_fit_transform.call_args[0][0].to_dataframe())
- assert_series_equal(expected_y.to_series(), mock_ohe_fit_transform.call_args[0][1].to_series())
+ expected_x_df = expected_x.astype("int64")
Same question I've had before but do we need to specify astype("int64") anymore? Is this just left for clarity? :o
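Context for the question: capitalized "Int64" is pandas' nullable extension dtype, while lowercase "int64" is the plain numpy dtype, and the two do not compare equal in `assert_frame_equal`. Whether the explicit cast is still needed depends on what dtype the test data infers on its own. A quick sketch of the distinction:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Plain numpy-backed dtype: lowercase "int64".
assert str(s.astype("int64").dtype) == "int64"

# Nullable extension dtype: capitalized "Int64"; it can hold pd.NA.
nullable = s.astype("Int64")
assert str(nullable.dtype) == "Int64"
assert pd.isna(pd.Series([1, None], dtype="Int64")[1])
```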
  if parent_output is not None:
      final_component_inputs.append(parent_output)
- concatted = pd.concat([component_input.to_dataframe() for component_input in final_component_inputs], axis=1)
+ concatted = pd.concat([component_input for component_input in final_component_inputs], axis=1)
Is there a way we can preserve woodwork info in the concatenation here and avoid having to call infer_feature_types at the end? AKA pd.ww.concat? That call to infer_feature_types(concatted) should not be doing any inference, right?
When we finish the upgrade and merge the feature branch, it may be worthwhile to revisit component graph evaluation specifically, and to make sure we know exactly when inference is occurring. (Maybe you already know this; I don't think I do 100% yet!)
Agreed that ww.concat would be ideal but it's currently in development! alteryx/woodwork#884
Also agree on auditing the ComponentGraph so we can streamline inference once the dust settles.
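For reference, the plain `pd.concat` in question joins the component outputs column-wise on their index; it returns an ordinary dataframe with no woodwork schema attached, which is why the follow-up `infer_feature_types(concatted)` call exists. A minimal sketch (the feature names are made up):

```python
import pandas as pd

features_a = pd.DataFrame({"f1": [1, 2]})
features_b = pd.DataFrame({"f2": [3.0, 4.0]})

# Column-wise concatenation of the component outputs, aligned on index.
concatted = pd.concat([features_a, features_b], axis=1)

assert list(concatted.columns) == ["f1", "f2"]
assert concatted.shape == (2, 2)
```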
Squashed commits:

* WW Accessor: Update demo datasets, preprocessing, and utils (#2172)
  * Update demos, utils, and preprocessing
  * Updating make test commands
  * Update ww requirement
  * Add test that infer_feature_types preserves schema
  * Add test that infer_feature_types raises errors with invalid schema
  * load_data always returns woodwork info. Deleted return_pandas
  * Updating docstrings
  * Release notes for first PR
  * Deleting redundant tests
  * Update test name
* WW 0.2.0: Update Data Checks (#2182)
  * Updating data checks
  * Undo accidental edit
  * Fixing typo where we assign to ww.init
* WW Accessor: Updating objectives (#2185)
  * Updating objectives
  * Deleting superfluous pd.Series
* WW Accessor: Updating Components (#2191)
  * Update components - first commit
  * Update delayed_features_transformer to not use assign
  * Fixing tests
  * Skipping Boolean with NaN test in imputers
  * Fixing base sampler _prepare_data
  * Fixing target imputer null bool test
  * Fix test skips
  * Addressing comments
  * Clean up sampler tests
  * Editing docstrings
  * Update to ww 0.3.0
  * Fixing components that didn't have merge conflicts
* WW 0.2.0 Update: Updating pipelines (#2205)
  * Updating pipelines
  * Fixing docstrings
  * Fix stacked test
  * Add tests to check input data not modified
  * Use rename in component graph
* WW Accessor: Update model understanding (#2247)
  * Update model understanding module
  * Removing unused import
* WW Accessor: Update Automl (#2243)
  * Updating pipelines
  * Fixing docstrings
  * Fix stacked test
  * Update automl
  * Update to preserve schema
  * Preserve schema in train_pipelines/score_pipelines
  * No ww drop
  * Update standard scaler
  * Use weak ref feature branch
  * Add tests to check input data not modified
  * Use weak-ref branch
  * Use weak ref branch
  * Fixing tests
  * Set ww back to 0.3.0
  * Adding objective tests that use automl
  * Add back minimal-dependencies flag
  * Fix test command
  * Updating docstrings; adding dask test to check schema is sent
  * Fixing import order in test_dask_engine
  * Fixing last remaining docstrings
  * Fix typo in ignore command
  * Use ww init series
  * No doctest modules yet
  * Fix docstrings & use schemas stored in automl_config
* WW Update: Update docs and docstrings (#2292)
  * Fixing docs and docstrings
  * Update docstring in highly_null_data_check
  * Delete warning from init
  * Remove print(X.ww)
  * Fixing notebook python version :(
  * Update release notes
  * Delete _convert_woodwork_types_wrapper completely
  * Fixing coverage
  * Add back standard scaler try
  * Remove _convert_woodwork_types_wrapper from docs
  * Linting docs
  * Merging main
  * Fix release notes

Pull Request Description
Update pipelines and stacked ensemble unit tests.
See the roadmap here
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:`123`.