WW Accessor: Update demo datasets, preprocessing, and utils #2172
Codecov Report
@@ Coverage Diff @@
## 2035-use-ww-accessor #2172 +/- ##
=======================================================
- Coverage 100.0% 21.0% -79.0%
=======================================================
Files 295 295
Lines 24389 24373 -16
=======================================================
- Hits 24379 5095 -19284
- Misses 10 19278 +19268
|
| .PHONY: git-test | ||
| git-test: | ||
| pytest evalml/ -n 4 --doctest-modules --cov=evalml --junitxml=test-reports/junit.xml --doctest-continue-on-failure | ||
| pytest evalml -n 4 --cov=evalml --junitxml=test-reports/junit.xml --doctest-continue-on-failure \ |
Removing doctest because it will test modules that have not been updated yet, e.g. data checks
| X_test = X.iloc[test] | ||
| y_train = y.iloc[train] | ||
| y_test = y.iloc[test] | ||
| X_train = X.ww.iloc[train] |
Use .ww.iloc to preserve the schema
|
|
||
| _raise_value_error_if_nullable_types_detected(data) | ||
|
|
||
| def _convert_woodwork_types_wrapper(pd_data): |
Don't need convert_woodwork_types_wrapper anymore! I've tested this out on another branch that updates the rest of the codebase to use the latest woodwork.
_convert_woodwork_types_wrapper()
2021-2021
"Beloved private function. Devoted partner to infer_feature_types(). Rest in Power."
| Converts a pandas data structure that may have extension or nullable dtypes to dtypes that numpy can understand and handle. | ||
| if isinstance(data, pd.Series): | ||
| if data.ww.schema is not None: | ||
| ww_data = ww.init_series(ww_data, logical_type=data.ww.logical_type, |
Can't use series.ww.init(..) because the expectation is that infer_feature_types should change the logical type if needed. In order to do that with a series, you need to use ww.init_series.
https://woodwork.alteryx.com/en/stable/start.html#Using-Woodwork-with-a-Series
|
|
||
| def _retain_custom_types_and_initalize_woodwork(old_woodwork_data, new_pandas_data, ltypes_to_ignore=None): | ||
|
|
||
| def _convert_woodwork_types_wrapper(): |
Keeping this just so that I don't have to edit all the files that import _convert_woodwork_types_wrapper.
My plan is to edit the files in each PR so they no longer use _convert_woodwork_types_wrapper, though.
| return ww.init_series(new_dataframe, old_logical_types) | ||
| if ltypes_to_ignore is None: | ||
| ltypes_to_ignore = [] | ||
| col_intersection = set(old_woodwork_data.columns).intersection(set(new_pandas_data.columns)) |
Since we're no longer converting from nullable to non-nullable types, I think we can simplify the implementation.
I tested this on my branch that updates the entire repo and I think this implementation meets our requirements.
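Roughly, the simplified logic boils down to the sketch below. This is not the actual evalml implementation: `retained_logical_types` is a hypothetical helper name, and logical types are written as plain strings so the snippet runs without Woodwork. Only `col_intersection` and `ltypes_to_ignore` come from the diff; in the real function the resulting dict would presumably be passed to `new_dataframe.ww.init(logical_types=...)`.

```python
def retained_logical_types(old_logical_types, new_columns, ltypes_to_ignore=None):
    """Hypothetical helper: choose which previously set logical types to re-apply."""
    if ltypes_to_ignore is None:
        ltypes_to_ignore = []
    # Only columns present both before and after the transformation can keep their type.
    col_intersection = set(old_logical_types.keys()).intersection(set(new_columns))
    return {
        col: ltype
        for col, ltype in old_logical_types.items()
        if col in col_intersection and ltype not in ltypes_to_ignore
    }


# Made-up column names and types for illustration.
old = {"amount": "Double", "country": "Categorical", "note": "NaturalLanguage"}
new_cols = ["amount", "country", "amount_scaled"]

retained = retained_logical_types(old, new_cols, ltypes_to_ignore=["Double"])
```

Here `note` is dropped because it no longer exists, `amount` is skipped because its type is in the ignore list, and `amount_scaled` is left for type inference.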
| --ignore evalml/tests/component_tests \ | ||
| --ignore evalml/tests/pipeline_tests --ignore evalml/tests/automl_tests --ignore evalml/tests/data_checks_tests \ | ||
| --ignore evalml/tests/model_understanding_tests --ignore evalml/tests/objective_tests \ | ||
| -k "not test_save" |
Telling pytest not to run the tests whose names contain test_save (-k matches on substrings, not only prefixes). These are plotting tests, but they use a test fixture that trains decision trees. Since I haven't updated the components on the feature branch yet, those tests will fail.
| return ww.DataColumn(new_pandas_data) | ||
|
|
||
| retained_logical_types = {} | ||
| if isinstance(new_dataframe, pd.Series): |
This will be covered when we add support for components
bchen1116 left a comment
Nice job getting these changes in! It's cool to see that the changes are fairly minimal in this PR, although I'm sure it'll get much bigger later 😅
Left a nitpick and a testing question, but everything else looks good to me!
| X_dt.ww.init() | ||
| pd.testing.assert_frame_equal(X_dt, infer_feature_types(X_dt)) | ||
|
|
||
| y = _convert_woodwork_types_wrapper(pd.array([1, 2, None], dtype="Int64")) |
Do we have any tests that still cover having np.nan or None in the dataset without having to cast the logical types to be some nullable type?
Yea I think this will be clearer when I get the components PR up but [1, 2, None] will always be treated as Double.
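For context, the pandas side of this can be sketched without Woodwork. The snippet only shows the dtypes pandas assigns; that Woodwork then infers Double for the first case is the claim from the comment above, not something the snippet checks:

```python
import pandas as pd

# Built from a plain list: None becomes NaN and the dtype is float64.
y_plain = pd.Series([1, 2, None])

# Built from a nullable extension array: the missing value is pd.NA and the dtype is Int64.
y_nullable = pd.Series(pd.array([1, 2, None], dtype="Int64"))

plain_dtype = str(y_plain.dtype)
nullable_dtype = str(y_nullable.dtype)
missing_mask = y_plain.isna().tolist()
```

So a dataset with np.nan or None in an otherwise integer column already arrives as a float column unless the caller explicitly opts into a nullable dtype.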
| return ww.DataTable(ww_data, logical_types=feature_types) | ||
| ww_data = data.copy() | ||
|
|
||
| _raise_value_error_if_nullable_types_detected(data) |
super nit, but what if we switch this with the line before, just so we can raise the error if necessary before we do any copying for the world's smallest time save 😅 not necessary tho!
Super great suggestion!!
angela97lin left a comment
Looks good! Curious about your thoughts regarding whether or not we still need the return_pandas parameter, but otherwise just left nitpicky comments about docstrings and testing :)
| shap>=0.35.0 | ||
| texttable>=1.6.2 | ||
| woodwork==0.0.11 | ||
| woodwork==0.2.0 |
| X.ww.init() | ||
| y = ww.init_series(y) |
My understanding is that load_data used to return WW types, so we didn't need to do the conversion here; are we updating load_data to return WW types in a separate PR?
| # target distribution | ||
| print(target_distribution(y)) | ||
|
|
||
| X = infer_feature_types(X) |
RE comment for demo datasets, are we returning WW for load_data? If not, we should update the docstring but I feel like we should 😁
| X_test = X.iloc[test] | ||
| y_train = y.iloc[train] | ||
| y_test = y.iloc[test] | ||
| X_train = X.ww.iloc[train] |
| @@ -23,8 +25,9 @@ def load_fraud(n_rows=None, verbose=True, return_pandas=False): | |||
| target="fraud", | |||
We should update docstrings for these demo dataset methods! :D
Great suggestion!
| assert isinstance(X, pd.DataFrame) | ||
| assert isinstance(y, pd.Series) |
Do we want to test that we have WW initialized?
|
|
||
| if return_pandas: | ||
| return X.to_dataframe(), y.to_series() | ||
| return X, y |
JW, do you think we still need the return_pandas parameter now? Is it okay to always return a WW init version since users can still just use the pandas data structures if they wanted to? :o
| X_expected = pd.DataFrame({0: pd.Series([1, 2], dtype="category"), 1: pd.Series([2, 4], dtype="category")}) | ||
| pd.testing.assert_frame_equal(X_renamed.to_dataframe(), X_expected) | ||
| assert X_renamed.logical_types == {0: ww.logical_types.Categorical, 1: ww.logical_types.Categorical} | ||
| pd.testing.assert_frame_equal(X_renamed, X_expected) |
|
|
||
| Arguments: | ||
| dt (ww.DataTable): The DataTable to check data types of. | ||
| dt (pd.DataFrame): The DataTable to check data types of. |
nitpick: "The dataframe to check data types of." Also, maybe update dt --> df?
67f5f03 to bb0c9f2
chukarsten left a comment
This is solid work. I think some of the tests have proved to be redundant post-refactor and should be deleted. But that's looking good.
| X = pd.DataFrame(data.data, columns=data.feature_names) | ||
| y = pd.Series(data.target) | ||
| y = y.map(lambda x: data["target_names"][x]) | ||
| if return_pandas: |
There go the training wheels
| from evalml.preprocessing import load_data | ||
|
|
||
|
|
||
| def load_fraud(n_rows=None, verbose=True, return_pandas=False): |
I like that this change hides our shame from not putting return_pandas in the docstring to begin with 😂
| X, y = demos.load_wine() | ||
| assert X.shape == (178, 13) | ||
| assert y.shape == (178,) | ||
| assert isinstance(X, pd.DataFrame) | ||
| assert isinstance(y, pd.Series) | ||
| assert X.ww.schema is not None | ||
| assert y.ww.schema is not None |
We should snipe this test.
- snipe this test
| X, y = demos.load_breast_cancer() | ||
| assert X.shape == (569, 30) | ||
| assert y.shape == (569,) | ||
| assert isinstance(X, pd.DataFrame) | ||
| assert isinstance(y, pd.Series) | ||
| assert X.ww.schema is not None | ||
| assert y.ww.schema is not None |
I think we can snipe this one too.
- snipe this test, too
| X, y = demos.load_diabetes() | ||
| assert X.shape == (442, 10) | ||
| assert y.shape == (442,) | ||
| assert isinstance(X, pd.DataFrame) | ||
| assert isinstance(y, pd.Series) | ||
| assert X.ww.schema is not None | ||
| assert y.ww.schema is not None |
| pd.testing.assert_frame_equal(_rename_column_names_to_numeric(X), pd.DataFrame({0: [1, 2], 1: [2, 4]})) | ||
|
|
||
| X = ww.DataTable(pd.DataFrame({"<>": [1, 2], ">>": [2, 4]}), logical_types={"<>": "categorical", ">>": "categorical"}) | ||
| X = pd.DataFrame({"<>": [1, 2], ">>": [2, 4]}) |
Maybe I'm missing something, do we need X redefined here?
Nope, good eye!
|
|
||
| y = _convert_woodwork_types_wrapper(pd.array([1, 2, None], dtype="Int64")) | ||
| pd.testing.assert_series_equal(y, pd.Series([1, 2, np.nan], dtype="float64")) | ||
| X_dc = ww.init_series(pd.Series([1, 2, 3, 4])) |
So is this testing that infer_feature_types leaves the data table unchanged?
Yea, I'll change the name to test_infer_feature_types_no_type_change to make it clearer
| X_pd = pd.Series([1, 2, 3, 4], dtype="int64") | ||
| pd.testing.assert_series_equal(X_pd, infer_feature_types(X_pd)) |
I think this has become redundant now
You're right!!
| except (ValueError, TypeError): | ||
| pass | ||
| return ww.DataTable(new_pandas_data, logical_types=retained_logical_types) | ||
| col_intersection = set(old_logical_types.keys()).intersection(set(new_dataframe.columns)) |
Nothing makes me happier than when set intersection or symmetric_difference is used.
* WW Accessor: Update demo datasets, preprocessing, and utils (#2172)
  * Update demos, utils, and preprocessing
  * Updating make test commands
  * Update ww requirement
  * Add test that infer_feature_types preserves schema
  * Add test that infer_feature_types raises errors with invalid schema
  * load_data always returns woodwork info. Deleted return_pandas
  * Updating docstrings
  * Release notes for first PR
  * Deleting redundant tests
  * update test name
* WW 0.2.0: Update Data Checks (#2182)
  * Updating data checks
  * Undo accidental edit
  * Fixing typo where we assign to ww.init
* WW Accessor: Updating objectives (#2185)
  * Updating objectives
  * Deleting superfluous pd.Series
* WW Accessor: Updating Components (#2191)
  * Update components - first commit
  * Update delayed_features_transformer to not use assign
  * Fixing tests
  * Skipping Boolean with Nan test in imputers
  * Fixing base sampler _prepare_data
  * Fixing target imputer null bool test
  * Fix test skips
  * Addressing comments
  * Clean up sampler tests
  * Editing docstrings
  * Update to ww 0.3.0
  * Fixing components that didn't have merge conflicts
* WW 0.2.0 Update: Updating pipelines (#2205)
  * Updating pipelines
  * Fixing docstrings
  * Fix stacked test
  * Add tests to check input data not modified
  * Use rename in component graph
* WW Accessor: Update model understanding (#2247)
  * Update model understanding module
  * Removing unused import
* WW Accessor: Update Automl (#2243)
  * Updating pipelines
  * Fixing docstrings
  * Fix stacked test
  * Update automl
  * Update to preserve schema
  * Preserve schema in train_pipelines/score_pipelines
  * No ww drop
  * Update standard scaler
  * Use weak ref feture branch
  * Add tests to check input data not modified
  * Use weak-ref branch
  * Use weak ref branch
  * Fixing tests
  * Set ww back to 0.3.0
  * Adding objective tests that use automl
  * Add back minimal-dependencies-flag
  * Fix test command
  * Updating docstrings adding dask test to check schema is sent
  * Fixing import order in test_dask_engine
  * Fixing last remaining docstrings
  * Fix typo in ignore command
  * Use ww init series
  * No doctest modules yet
  * Fix docstrings & use schemas stored in automl_config
* WW Update: Update docs and docstrings (#2292)
  * Fixing docs and docstrings
  * Update docstring in highly_null_data_check
  * Delete warning from init
  * Remove print(X.ww)
  * Fixing notebook python version :(
  * Update release notes
  * Delete _convert_woodwork_types_wrapper completely
  * Fixing coverage
  * Add back standard scaler try
  * Remove _convert_woodwork_types_wrapper from docs'
  * Linting docs
  * Merging main
  * Fix release notes

Pull Request Description
Update demo datasets, preprocessing, and utils.
Roadmap for other prs is here
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.