WW 0.2.0 Update: Updating pipelines #2205
freddyaboulton merged 9 commits into 2035-use-ww-accessor
Conversation
  X = pd.DataFrame()
  # Normalize the data into pandas objects
- X_ww = infer_feature_types(X)
+ X_ww = infer_feature_types(X).ww.copy()
This is a mistake I made in my components PR. Previously we used assign, which creates a new dataframe. Since we're no longer doing that, we need to create the copy ourselves. This is important so that pipeline.predict(X, y) works. Pipeline tests caught this, but I added a unit test in the delayed_feature_transformer unit tests as well.
Awesome, so this makes sure that we don't change the original X input. Did we need to do this in the other components as well?
Great question @bchen1116 ! I am adding coverage for the datetime featurizer and the text featurizer.
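The copy matters because a transformer that adds columns directly to its input mutates the caller's dataframe. A minimal pandas-only sketch of the failure mode being fixed (the `add_delayed_feature` function is hypothetical, standing in for the delayed-feature transformer's behavior):

```python
import pandas as pd

def add_delayed_feature(X):
    # Copy first so the shifted column is not written into the caller's frame.
    # Dropping this line is exactly the bug the diff fixes.
    X = X.copy()
    X["feature_delay_1"] = X["feature"].shift(1)
    return X

X = pd.DataFrame({"feature": [1, 2, 3]})
transformed = add_delayed_feature(X)

# The caller's input is untouched; only the returned frame gained a column.
assert list(X.columns) == ["feature"]
assert "feature_delay_1" in transformed.columns
```

Without the copy, a second call like `pipeline.predict(X, y)` would see an `X` already polluted with delayed-feature columns from the first call.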
Codecov Report
@@ Coverage Diff @@
## 2035-use-ww-accessor #2205 +/- ##
=======================================================
+ Coverage 53.2% 69.2% +16.1%
=======================================================
Files 280 280
Lines 24038 24012 -26
=======================================================
+ Hits 12770 16609 +3839
+ Misses 11268 7403 -3865
Continue to review full report at Codecov.
bchen1116
left a comment
Great job getting this crunched out so quickly! LGTM, just left a quick question.
chukarsten
left a comment
This looks good to me.
angela97lin
left a comment
LGTM! Great catch with needing to copy the data, loving how much we're able to continue to clean up :)) Just left a few comments but nothing blocking!
- proba = self.estimator.predict_proba(X).to_dataframe()
- proba.columns = self._encoder.classes_
+ proba = self.estimator.predict_proba(X)
+ proba = proba.ww.rename(columns={col: new_col for col, new_col in zip(proba.columns, self._encoder.classes_)})
Amazing, cool that we can use this!!
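The rename maps the estimator's positional probability columns onto the encoder's class names. A plain-pandas sketch of the same pattern (`ww.rename` mirrors `DataFrame.rename`; note that `dict(zip(...))` is a shorter equivalent of the dict comprehension in the diff — the `classes` values here are made up for illustration):

```python
import pandas as pd

proba = pd.DataFrame([[0.7, 0.3], [0.2, 0.8]], columns=[0, 1])
classes = ["cat", "dog"]  # hypothetical encoder classes

# Map positional column labels (0, 1, ...) to class names.
proba = proba.rename(columns=dict(zip(proba.columns, classes)))

assert list(proba.columns) == ["cat", "dog"]
```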
  if isinstance(parent_x, pd.Series):
      parent_x = pd.Series(parent_x, name=parent_input)
Omega nitpick but is it possible to just use rename? :o
Yeah I think you can even just do parent_x.name = parent_input right?
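All three options discussed here produce a series with the desired name; they differ in whether a new object is allocated. A small sketch of the alternatives:

```python
import pandas as pd

parent_x = pd.Series([1, 2, 3])

# Option 1 (the diff): re-wrap in a new Series with the name set.
renamed = pd.Series(parent_x, name="parent_input")

# Option 2 (rename suggestion): Series.rename with a scalar sets the name
# and returns a renamed copy.
renamed2 = parent_x.rename("parent_input")

# Option 3 (attribute suggestion): assign .name in place, mutating the
# existing Series rather than creating a new one.
parent_x.name = "parent_input"

assert renamed.name == renamed2.name == parent_x.name == "parent_input"
```

The in-place option is the cheapest but mutates shared state, which is worth avoiding given that this very PR fixes an input-mutation bug elsewhere.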
  pp_components.append(DropNullColumns)
- input_logical_types = set(X.logical_types.values())
+ input_logical_types = set(X.ww.logical_types.values())
  types_imputer_handles = {logical_types.Boolean, logical_types.Categorical, logical_types.Double, logical_types.Integer}
If we're calling .ww here, do we need to make sure that we've called X.ww.init first?
I think since this is a private method and is only called in make_pipeline, it's OK to assume that infer_feature_types was called up the stack, i.e. in make_pipeline. Our test coverage for make_pipeline can enforce this.
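The set comparison being built here decides whether the default imputer covers every logical type in the input. A sketch of that check with plain Python classes standing in for the woodwork logical types (the class names are stand-ins, not the real woodwork API):

```python
# Stand-ins for woodwork logical type classes.
class Boolean: pass
class Categorical: pass
class Double: pass
class Integer: pass
class NaturalLanguage: pass

types_imputer_handles = {Boolean, Categorical, Double, Integer}

# e.g. what set(X.ww.logical_types.values()) might yield for a frame
# containing a numeric column and a text column.
input_logical_types = {Double, NaturalLanguage}

# Extra preprocessing components are needed only when some input type
# falls outside what the imputer handles.
needs_extra_components = not input_logical_types.issubset(types_imputer_handles)
assert needs_extra_components
```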
  def test_serialization(X_y_binary, ts_data, tmpdir, helper_functions):
      path = os.path.join(str(tmpdir), 'component.pkl')
      for component_class in all_components():
          if component_class in {StackedEnsembleClassifier, StackedEnsembleRegressor}:
      '0_day_of_week': {'Saturday': 6, 'Tuesday': 2}}


  def test_datetime_featurizer_does_not_modify_input_data():
These tests look good! Not necessary here but I wonder if this is something we'd want for all of our components 🤔
Agreed. Not required to add in this PR.
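If this check were generalized to all components, it could take the shape of a reusable helper that snapshots the input before transforming and compares afterwards. A hedged sketch (the `transform` function here is a toy stand-in for a datetime featurizer, not the real component):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def transform(X):
    # Toy featurizer: copies its input, then derives a year column.
    X = X.copy()
    X["year"] = pd.to_datetime(X["date"]).dt.year
    return X

def check_does_not_modify_input(transform_fn, X):
    expected = X.copy(deep=True)
    transform_fn(X)
    # Fails if the component wrote anything into the caller's frame.
    assert_frame_equal(X, expected)

X = pd.DataFrame({"date": ["2021-01-01", "2021-06-15"]})
check_does_not_modify_input(transform, X)
```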
- expected_x_df = expected_x.to_dataframe().astype("Int64")
- assert_frame_equal(expected_x_df, mock_ohe_fit_transform.call_args[0][0].to_dataframe())
- assert_series_equal(expected_y.to_series(), mock_ohe_fit_transform.call_args[0][1].to_series())
+ expected_x_df = expected_x.astype("int64")
Same question I've had before but do we need to specify astype("int64") anymore? Is this just left for clarity? :o
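Context for the question: capitalized "Int64" is pandas' nullable extension dtype, while lowercase "int64" is the plain numpy dtype, and the two do not compare equal in `assert_frame_equal`. Whether the explicit cast is still needed depends on what dtype the test data infers on its own. A quick sketch of the distinction:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Plain numpy-backed dtype: lowercase "int64".
assert str(s.astype("int64").dtype) == "int64"

# Nullable extension dtype: capitalized "Int64"; it can hold pd.NA.
nullable = s.astype("Int64")
assert str(nullable.dtype) == "Int64"
assert pd.isna(pd.Series([1, None], dtype="Int64")[1])
```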
  if parent_output is not None:
      final_component_inputs.append(parent_output)
- concatted = pd.concat([component_input.to_dataframe() for component_input in final_component_inputs], axis=1)
+ concatted = pd.concat([component_input for component_input in final_component_inputs], axis=1)
Is there a way we can preserve woodwork info in the concatenation here and avoid having to call infer_feature_types at the end? AKA pd.ww.concat? That call to infer_feature_types(concatted) should not be doing any inference, right?
When we finish the upgrade and merge the feature branch, it may be worthwhile to revisit component graph evaluation specifically, and to make sure we know exactly when inference is occurring. (Maybe you already know this; I don't think I do 100% yet!)
Agreed that ww.concat would be ideal but it's currently in development! alteryx/woodwork#884
Also agree on auditing the ComponentGraph so we can streamline inference once the dust settles.
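For reference, the plain `pd.concat` in question joins the component outputs column-wise on their index; it returns an ordinary dataframe with no woodwork schema attached, which is why the follow-up `infer_feature_types(concatted)` call exists. A minimal sketch (the feature names are made up):

```python
import pandas as pd

features_a = pd.DataFrame({"f1": [1, 2]})
features_b = pd.DataFrame({"f2": [3.0, 4.0]})

# Column-wise concatenation of the component outputs, aligned on index.
concatted = pd.concat([features_a, features_b], axis=1)

assert list(concatted.columns) == ["f1", "f2"]
assert concatted.shape == (2, 2)
```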
Squashed commits:

* WW Accessor: Update demo datasets, preprocessing, and utils (#2172)
  * Update demos, utils, and preprocessing
  * Updating make test commands
  * Update ww requirement
  * Add test that infer_feature_types preserves schema
  * Add test that infer_feature_types raises errors with invalid schema
  * load_data always returns woodwork info. Deleted return_pandas
  * Updating docstrings
  * Release notes for first PR
  * Deleting redundant tests
  * Update test name
* WW 0.2.0: Update Data Checks (#2182)
  * Updating data checks
  * Undo accidental edit
  * Fixing typo where we assign to ww.init
* WW Accessor: Updating objectives (#2185)
  * Updating objectives
  * Deleting superfluous pd.Series
* WW Accessor: Updating Components (#2191)
  * Update components - first commit
  * Update delayed_features_transformer to not use assign
  * Fixing tests
  * Skipping Boolean with NaN test in imputers
  * Fixing base sampler _prepare_data
  * Fixing target imputer null bool test
  * Fix test skips
  * Addressing comments
  * Clean up sampler tests
  * Editing docstrings
  * Update to ww 0.3.0
  * Fixing components that didn't have merge conflicts
* WW 0.2.0 Update: Updating pipelines (#2205)
  * Updating pipelines
  * Fixing docstrings
  * Fix stacked test
  * Add tests to check input data not modified
  * Use rename in component graph
* WW Accessor: Update model understanding (#2247)
  * Update model understanding module
  * Removing unused import
* WW Accessor: Update Automl (#2243)
  * Updating pipelines
  * Fixing docstrings
  * Fix stacked test
  * Update automl
  * Update to preserve schema
  * Preserve schema in train_pipelines/score_pipelines
  * No ww drop
  * Update standard scaler
  * Use weak ref feature branch
  * Add tests to check input data not modified
  * Use weak-ref branch
  * Use weak ref branch
  * Fixing tests
  * Set ww back to 0.3.0
  * Adding objective tests that use automl
  * Add back minimal-dependencies flag
  * Fix test command
  * Updating docstrings; adding dask test to check schema is sent
  * Fixing import order in test_dask_engine
  * Fixing last remaining docstrings
  * Fix typo in ignore command
  * Use ww init series
  * No doctest modules yet
  * Fix docstrings & use schemas stored in automl_config
* WW Update: Update docs and docstrings (#2292)
  * Fixing docs and docstrings
  * Update docstring in highly_null_data_check
  * Delete warning from init
  * Remove print(X.ww)
  * Fixing notebook python version :(
  * Update release notes
  * Delete _convert_woodwork_types_wrapper completely
  * Fixing coverage
  * Add back standard scaler try
  * Remove _convert_woodwork_types_wrapper from docs
  * Linting docs
  * Merging main
  * Fix release notes

Pull Request Description
Update pipelines and stacked ensemble unit tests.
See the roadmap here
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:`123`.