WW Accessor: Update demo datasets, preprocessing, and utils by freddyaboulton · Pull Request #2172 · alteryx/evalml

freddyaboulton · 2021-04-21T22:18:24Z

Pull Request Description

Update demo datasets, preprocessing, and utils.

The roadmap for the other PRs is here.

After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

codecov · 2021-04-21T22:20:50Z

Codecov Report

Merging #2172 (124a179) into 2035-use-ww-accessor (1762f83) will decrease coverage by 79.1%.
The diff coverage is 90.2%.

@@                   Coverage Diff                   @@
##           2035-use-ww-accessor   #2172      +/-   ##
=======================================================
- Coverage                 100.0%   21.0%   -79.0%     
=======================================================
  Files                       295     295              
  Lines                     24389   24373      -16     
=======================================================
- Hits                      24379    5095   -19284     
- Misses                       10   19278   +19268

Impacted Files	Coverage Δ
evalml/tests/automl_tests/test_automl.py	`0.0% <0.0%> (-100.0%)`	⬇️
...assification_pipeline_tests/test_classification.py	`0.0% <0.0%> (-100.0%)`	⬇️
...tests/regression_pipeline_tests/test_regression.py	`0.0% <0.0%> (-100.0%)`	⬇️
evalml/tests/pipeline_tests/test_pipelines.py	`0.0% <0.0%> (-100.0%)`	⬇️
evalml/utils/__init__.py	`100.0% <ø> (ø)`
evalml/tests/conftest.py	`41.5% <40.0%> (-58.5%)`	⬇️
evalml/utils/gen_utils.py	`67.9% <71.5%> (-31.6%)`	⬇️
evalml/utils/woodwork_utils.py	`83.7% <82.4%> (-16.3%)`	⬇️
evalml/demos/breast_cancer.py	`100.0% <100.0%> (ø)`
evalml/demos/churn.py	`100.0% <100.0%> (ø)`
... and 208 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1762f83...124a179.

freddyaboulton · 2021-04-21T22:34:37Z

 .PHONY: git-test
 git-test:
-	pytest evalml/ -n 4 --doctest-modules --cov=evalml --junitxml=test-reports/junit.xml --doctest-continue-on-failure
+	pytest evalml -n 4 --cov=evalml --junitxml=test-reports/junit.xml --doctest-continue-on-failure \


Removing doctest because it will test modules that have not been updated yet, e.g. data checks

freddyaboulton · 2021-04-21T23:32:56Z

-    X_test = X.iloc[test]
-    y_train = y.iloc[train]
-    y_test = y.iloc[test]
+    X_train = X.ww.iloc[train]


Use .ww.iloc to preserve the schema

freddyaboulton · 2021-04-21T23:36:57Z


+    _raise_value_error_if_nullable_types_detected(data)

-def _convert_woodwork_types_wrapper(pd_data):


Don't need convert_woodwork_types_wrapper anymore! I've tested this out on another branch that updates the rest of the codebase to use the latest woodwork.

_convert_woodwork_types_wrapper()
2021-2021
"Beloved private function. Devoted partner to infer_feature_types(). Rest in Power."

freddyaboulton · 2021-04-21T23:37:52Z

-    Converts a pandas data structure that may have extension or nullable dtypes to dtypes that numpy can understand and handle.
+    if isinstance(data, pd.Series):
+        if data.ww.schema is not None:
+            ww_data = ww.init_series(ww_data, logical_type=data.ww.logical_type,


Can't use series.ww.init(..) because the expectation is that infer_feature_types should change the logical type if needed. In order to do that with a series, you need to use ww.init_series.

https://woodwork.alteryx.com/en/stable/start.html#Using-Woodwork-with-a-Series

freddyaboulton · 2021-04-21T23:39:22Z

-
-def _retain_custom_types_and_initalize_woodwork(old_woodwork_data, new_pandas_data, ltypes_to_ignore=None):
+
+def _convert_woodwork_types_wrapper():


Keeping this just so that I don't have to edit all the files that import _convert_woodwork_types_wrapper.

My plan is to edit the files in each PR to not use _convert_woodwork_types_wrapper though.

freddyaboulton · 2021-04-21T23:42:16Z

+        return ww.init_series(new_dataframe, old_logical_types)
    if ltypes_to_ignore is None:
        ltypes_to_ignore = []
-    col_intersection = set(old_woodwork_data.columns).intersection(set(new_pandas_data.columns))


Since we're no longer converting from nullable to non-nullable types, I think we can simplify the implementation.

I tested this on my branch that updates the entire repo and I think this implementation meets our requirements.
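
One way to picture the simplified logic, using plain dictionaries and lists in place of Woodwork schemas. The function name `retain_logical_types` and the type names below are hypothetical; this is a sketch of the idea, not the code in the PR.

```python
def retain_logical_types(old_logical_types, new_columns, ltypes_to_ignore=None):
    """Keep the old logical type for every column that still exists.

    old_logical_types: dict mapping column name -> logical type name
    new_columns: iterable of column names in the transformed data
    ltypes_to_ignore: logical type names that should be re-inferred
    """
    if ltypes_to_ignore is None:
        ltypes_to_ignore = []
    # Only columns present both before and after the transformation matter.
    col_intersection = set(old_logical_types.keys()).intersection(set(new_columns))
    return {
        col: ltype
        for col, ltype in old_logical_types.items()
        if col in col_intersection and ltype not in ltypes_to_ignore
    }


old_types = {"age": "Integer", "city": "Categorical", "notes": "NaturalLanguage"}
new_cols = ["age", "city", "city_encoded"]  # "notes" dropped, one column added

retained = retain_logical_types(old_types, new_cols, ltypes_to_ignore=["Integer"])
print(retained)  # {'city': 'Categorical'}
```

Columns that are new (`city_encoded`) or whose old type is ignored (`age`) are left out of the result, so their types would be inferred afresh.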

freddyaboulton · 2021-04-21T23:44:17Z

+	--ignore evalml/tests/component_tests \
+	--ignore evalml/tests/pipeline_tests --ignore evalml/tests/automl_tests --ignore evalml/tests/data_checks_tests \
+	--ignore evalml/tests/model_understanding_tests --ignore evalml/tests/objective_tests \
+	-k "not test_save"


Telling pytest to skip the tests whose names contain test_save (-k matches on substrings, not just prefixes). These are plotting tests, but they use a test fixture that trains decision trees. Since I haven't updated the components on the feature branch yet, those tests would fail.

freddyaboulton · 2021-04-22T15:06:55Z

-        return ww.DataColumn(new_pandas_data)
-
-    retained_logical_types = {}
+    if isinstance(new_dataframe, pd.Series):


This will be covered when we add support for components

bchen1116

Nice job getting these changes in! It's cool to see that the changes are fairly minimal in this PR, although I'm sure it'll get much bigger later 😅

Left a nitpick and a testing question, but everything else looks good to me!

bchen1116 · 2021-04-22T17:54:13Z

+    X_dt.ww.init()
+    pd.testing.assert_frame_equal(X_dt, infer_feature_types(X_dt))

-    y = _convert_woodwork_types_wrapper(pd.array([1, 2, None], dtype="Int64"))


Do we have any tests that still cover having np.nan or None in the dataset without having to cast the logical types to be some nullable type?

Yea I think this will be clearer when I get the components PR up but [1, 2, None] will always be treated as Double.
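
For context, this follows from how pandas builds the series in the first place. The snippet below is a small pandas-only illustration; the Woodwork `Double` inference is as stated in the reply above and is not re-verified here. Without an explicit nullable dtype, `None` becomes `NaN` and the integers are upcast to `float64`.

```python
import pandas as pd

y_default = pd.Series([1, 2, None])                  # no dtype given
y_nullable = pd.Series([1, 2, None], dtype="Int64")  # explicit nullable integer

print(y_default.dtype)            # float64
print(y_nullable.dtype)           # Int64
print(y_default.isna().tolist())  # [False, False, True]
```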

bchen1116 · 2021-04-22T18:23:23Z

-    return ww.DataTable(ww_data, logical_types=feature_types)
+    ww_data = data.copy()

+    _raise_value_error_if_nullable_types_detected(data)


super nit, but what if we switch this with the line before, just so we can raise the error if necessary before we do any copying for the world's smallest time save 😅 not necessary tho!

Super great suggestion!!
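
The suggested ordering, sketched with a hypothetical stand-in for `_raise_value_error_if_nullable_types_detected`. The real helper's checks and error message are not shown in this thread, so the dtype list and the message below are assumptions. Validation runs first, so a rejected input never pays for the copy.

```python
import pandas as pd

# Assumed set of nullable extension dtypes to reject; illustrative only.
NULLABLE_DTYPES = {"Int64", "boolean", "string"}


def raise_if_nullable_types_detected(data):
    """Hypothetical check: raise ValueError if any column uses a nullable dtype."""
    dtypes = [data.dtype] if isinstance(data, pd.Series) else list(data.dtypes)
    found = sorted({str(dt) for dt in dtypes} & NULLABLE_DTYPES)
    if found:
        raise ValueError(f"Nullable dtypes are not supported: {found}")


def prepare(data):
    raise_if_nullable_types_detected(data)  # fail fast, before any copying
    return data.copy()


ok = prepare(pd.DataFrame({"a": [1, 2, 3]}))

try:
    prepare(pd.Series([1, 2, None], dtype="Int64"))
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```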

angela97lin

Looks good! Curious about your thoughts regarding whether or not we still need the return_pandas parameter, but otherwise just left nitpicky comments about docstrings and testing :)

angela97lin · 2021-04-22T19:31:16Z

 shap>=0.35.0
 texttable>=1.6.2
-woodwork==0.0.11
+woodwork==0.2.0


👏 👏 👏 👏

angela97lin · 2021-04-22T19:33:59Z

+    X.ww.init()
+    y = ww.init_series(y)


My understanding is that load_data used to return WW types, so we didn't need to do the conversion here; are we updating load_data to return WW types in a separate PR?

angela97lin · 2021-04-22T19:34:52Z

        # target distribution
        print(target_distribution(y))

-    X = infer_feature_types(X)


RE comment for demo datasets, are we returning WW for load_data? If not, we should update the docstring but I feel like we should 😁

angela97lin · 2021-04-22T19:35:19Z

-    X_test = X.iloc[test]
-    y_train = y.iloc[train]
-    y_test = y.iloc[test]
+    X_train = X.ww.iloc[train]


angela97lin · 2021-04-22T19:36:53Z

@@ -23,8 +25,9 @@ def load_fraud(n_rows=None, verbose=True, return_pandas=False):
                     target="fraud",


We should update docstrings for these demo dataset methods! :D

Great suggestion!

angela97lin · 2021-04-22T19:37:46Z

+    assert isinstance(X, pd.DataFrame)
+    assert isinstance(y, pd.Series)


Do we want to test that we have WW initialized?

angela97lin · 2021-04-22T19:38:51Z

-
    if return_pandas:
-        return X.to_dataframe(), y.to_series()
+        return X, y


JW, do you think we still need the return_pandas parameter now? Is it okay to always return a WW init version since users can still just use the pandas data structures if they wanted to? :o

angela97lin · 2021-04-22T19:39:34Z

    X_expected = pd.DataFrame({0: pd.Series([1, 2], dtype="category"), 1: pd.Series([2, 4], dtype="category")})
-    pd.testing.assert_frame_equal(X_renamed.to_dataframe(), X_expected)
-    assert X_renamed.logical_types == {0: ww.logical_types.Categorical, 1: ww.logical_types.Categorical}
+    pd.testing.assert_frame_equal(X_renamed, X_expected)


This is beautiful.

angela97lin · 2021-04-22T19:50:04Z


    Arguments:
-        dt (ww.DataTable): The DataTable to check data types of.
+        dt (pd.DataFrame): The DataTable to check data types of.


nitpick: "The dataframe to check data types of.", also maybe update dt --> df?

chukarsten

This is solid work. I think some of the tests have proved to be redundant post-refactor and should be deleted. But that's looking good.

chukarsten · 2021-04-22T21:17:41Z

    X = pd.DataFrame(data.data, columns=data.feature_names)
    y = pd.Series(data.target)
    y = y.map(lambda x: data["target_names"][x])
-    if return_pandas:


There go the training wheels

chukarsten · 2021-04-22T21:20:04Z

 from evalml.preprocessing import load_data


-def load_fraud(n_rows=None, verbose=True, return_pandas=False):


I like that this change hides our shame from not putting return_pandas in the docstring to begin with 😂

chukarsten · 2021-04-22T21:39:08Z

+    X, y = demos.load_wine()
    assert X.shape == (178, 13)
    assert y.shape == (178,)
    assert isinstance(X, pd.DataFrame)
    assert isinstance(y, pd.Series)
+    assert X.ww.schema is not None
+    assert y.ww.schema is not None


We should snipe this test.

chukarsten · 2021-04-22T21:39:36Z

+    X, y = demos.load_breast_cancer()
    assert X.shape == (569, 30)
    assert y.shape == (569,)
    assert isinstance(X, pd.DataFrame)
    assert isinstance(y, pd.Series)
+    assert X.ww.schema is not None
+    assert y.ww.schema is not None


I think we can snipe this one too.

chukarsten · 2021-04-22T21:41:15Z

+    X, y = demos.load_diabetes()
    assert X.shape == (442, 10)
    assert y.shape == (442,)
    assert isinstance(X, pd.DataFrame)
    assert isinstance(y, pd.Series)
+    assert X.ww.schema is not None
+    assert y.ww.schema is not None


command-shift-4 to bring up the sniping tool and snipe it

chukarsten · 2021-04-22T21:44:01Z

    pd.testing.assert_frame_equal(_rename_column_names_to_numeric(X), pd.DataFrame({0: [1, 2], 1: [2, 4]}))

-    X = ww.DataTable(pd.DataFrame({"<>": [1, 2], ">>": [2, 4]}), logical_types={"<>": "categorical", ">>": "categorical"})
+    X = pd.DataFrame({"<>": [1, 2], ">>": [2, 4]})


Maybe I'm missing something, do we need X redefined here?

Nope, good eye!

chukarsten · 2021-04-22T22:00:37Z


-    y = _convert_woodwork_types_wrapper(pd.array([1, 2, None], dtype="Int64"))
-    pd.testing.assert_series_equal(y, pd.Series([1, 2, np.nan], dtype="float64"))
+    X_dc = ww.init_series(pd.Series([1, 2, 3, 4]))


So is this testing that infer_feature_types leaves the data table unchanged?

Yea, I'll change the name to test_infer_feature_types_no_type_change to make it clearer

chukarsten · 2021-04-22T22:02:30Z

+    X_pd = pd.Series([1, 2, 3, 4], dtype="int64")
+    pd.testing.assert_series_equal(X_pd, infer_feature_types(X_pd))


I think this has become redundant now

You're right!!

chukarsten · 2021-04-22T22:40:30Z


+    _raise_value_error_if_nullable_types_detected(data)

-def _convert_woodwork_types_wrapper(pd_data):


_convert_woodwork_types_wrapper()
2021-2021
"Beloved private function. Devoted partner to infer_feature_types(). Rest in Power."

chukarsten · 2021-04-22T22:42:32Z

-            except (ValueError, TypeError):
-                pass
-    return ww.DataTable(new_pandas_data, logical_types=retained_logical_types)
+    col_intersection = set(old_logical_types.keys()).intersection(set(new_dataframe.columns))


Nothing makes me happier than when set intersection or symmetric_difference is used.

* WW Accessor: Update demo datasets, preprocessing, and utils (#2172)
  * Update demos, utils, and preprocessing
  * Updating make test commands
  * Update ww requirement
  * Add test that infer_feature_types preserves schema
  * Add test that infer_feature_types raises errors with invalid schema
  * load_data always returns woodwork info. Deleted return_pandas
  * Updating docstrings
  * Release notes for first PR
  * Deleting redundant tests
  * update test name
* WW 0.2.0: Update Data Checks (#2182)
  * Updating data checks
  * Undo accidental edit
  * Fixing typo where we assign to ww.init
* WW Accessor: Updating objectives (#2185)
  * Updating objectives
  * Deleting superfluous pd.Series
* WW Accessor: Updating Components (#2191)
  * Update components - first commit
  * Update delayed_features_transformer to not use assign
  * Fixing tests
  * Skipping Boolean with Nan test in imputers
  * Fixing base sampler _prepare_data
  * Fixing target imputer null bool test
  * Fix test skips
  * Addressing comments
  * Clean up sampler tests
  * Editing docstrings
  * Update to ww 0.3.0
  * Fixing components that didn't have merge conflicts
* WW 0.2.0 Update: Updating pipelines (#2205)
  * Updating pipelines
  * Fixing docstrings
  * Fix stacked test
  * Add tests to check input data not modified
  * Use rename in component graph
* WW Accessor: Update model understanding (#2247)
  * Update model understanding module
  * Removing unused import
* WW Accessor: Update Automl (#2243)
  * Updating pipelines
  * Fixing docstrings
  * Fix stacked test
  * Update automl
  * Update to preserve schema
  * Preserve schema in train_pipelines/score_pipelines
  * No ww drop
  * Update standard scaler
  * Use weak ref feture branch
  * Add tests to check input data not modified
  * Use weak-ref branch
  * Use weak ref branch
  * Fixing tests
  * Set ww back to 0.3.0
  * Adding objective tests that use automl
  * Add back minimal-dependencies-flag
  * Fix test command
  * Updating docstrings adding dask test to check schema is sent
  * Fixing import order in test_dask_engine
  * Fixing last remaining docstrings
  * Fix typo in ignore command
  * Use ww init series
  * No doctest modules yet
  * Fix docstrings & use schemas stored in automl_config
* WW Update: Update docs and docstrings (#2292)
  * Fixing docs and docstrings
  * Update docstring in highly_null_data_check
  * Delete warning from init
  * Remove print(X.ww)
  * Fixing notebook python version :(
  * Update release notes
  * Delete _convert_woodwork_types_wrapper completely
  * Fixing coverage
  * Add back standard scaler try
  * Remove _convert_woodwork_types_wrapper from docs'
  * Linting docs
  * Merging main
  * Fix release notes

freddyaboulton commented Apr 21, 2021

freddyaboulton commented Apr 22, 2021

freddyaboulton marked this pull request as ready for review April 22, 2021 15:07

auto-assign Bot assigned freddyaboulton Apr 22, 2021

freddyaboulton requested review from ParthivNaresh, angela97lin, bchen1116, chukarsten, dsherry and jeremyliweishih April 22, 2021 15:07

freddyaboulton added 4 commits April 22, 2021 14:58

Update demos, utils, and preprocessing

ab2f50c

Updating make test commands

9c0e37b

Update ww requirement

1b46a7b

Add test that infer_feature_types preserves schema

5f0e720

bchen1116 approved these changes Apr 22, 2021

angela97lin approved these changes Apr 22, 2021

Add test that infer_feature_types raises errors with invalid schema

bb0c9f2

freddyaboulton force-pushed the 2035-update-preprocessing-demos-and-utils branch from 67f5f03 to bb0c9f2 Compare April 22, 2021 20:29

freddyaboulton added 2 commits April 22, 2021 16:52

load_data always returns woodwork info. Deleted return_pandas

fb588a0

Updating docstrings

124a179

freddyaboulton merged commit 8117248 into 2035-use-ww-accessor Apr 22, 2021

chukarsten suggested changes Apr 22, 2021

freddyaboulton deleted the 2035-update-preprocessing-demos-and-utils branch June 3, 2021 14:02

