Skip to content

WW Accessor: Update demo datasets, preprocessing, and utils#2172

Merged
freddyaboulton merged 7 commits into2035-use-ww-accessorfrom
2035-update-preprocessing-demos-and-utils
Apr 22, 2021
Merged

WW Accessor: Update demo datasets, preprocessing, and utils#2172
freddyaboulton merged 7 commits into2035-use-ww-accessorfrom
2035-update-preprocessing-demos-and-utils

Conversation

@freddyaboulton
Copy link
Copy Markdown
Contributor

@freddyaboulton freddyaboulton commented Apr 21, 2021

Pull Request Description

Update demo datasets, preprocessing, and utils.

Roadmap for other prs is here


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 21, 2021

Codecov Report

Merging #2172 (124a179) into 2035-use-ww-accessor (1762f83) will decrease coverage by 79.1%.
The diff coverage is 90.2%.

Impacted file tree graph

@@                   Coverage Diff                   @@
##           2035-use-ww-accessor   #2172      +/-   ##
=======================================================
- Coverage                 100.0%   21.0%   -79.0%     
=======================================================
  Files                       295     295              
  Lines                     24389   24373      -16     
=======================================================
- Hits                      24379    5095   -19284     
- Misses                       10   19278   +19268     
Impacted Files Coverage Δ
evalml/tests/automl_tests/test_automl.py 0.0% <0.0%> (-100.0%) ⬇️
...assification_pipeline_tests/test_classification.py 0.0% <0.0%> (-100.0%) ⬇️
...tests/regression_pipeline_tests/test_regression.py 0.0% <0.0%> (-100.0%) ⬇️
evalml/tests/pipeline_tests/test_pipelines.py 0.0% <0.0%> (-100.0%) ⬇️
evalml/utils/__init__.py 100.0% <ø> (ø)
evalml/tests/conftest.py 41.5% <40.0%> (-58.5%) ⬇️
evalml/utils/gen_utils.py 67.9% <71.5%> (-31.6%) ⬇️
evalml/utils/woodwork_utils.py 83.7% <82.4%> (-16.3%) ⬇️
evalml/demos/breast_cancer.py 100.0% <100.0%> (ø)
evalml/demos/churn.py 100.0% <100.0%> (ø)
... and 208 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1762f83...124a179. Read the comment docs.

Comment thread Makefile
.PHONY: git-test
git-test:
pytest evalml/ -n 4 --doctest-modules --cov=evalml --junitxml=test-reports/junit.xml --doctest-continue-on-failure
pytest evalml -n 4 --cov=evalml --junitxml=test-reports/junit.xml --doctest-continue-on-failure \
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing doctest because it will test modules that have not been updated yet, e.g. data checks

X_test = X.iloc[test]
y_train = y.iloc[train]
y_test = y.iloc[test]
X_train = X.ww.iloc[train]
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use .ww.iloc to preserve the schema

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!!


_raise_value_error_if_nullable_types_detected(data)

def _convert_woodwork_types_wrapper(pd_data):
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need convert_woodwork_types_wrapper anymore! I've tested this out on another branch that updates the rest of the codebase to use the latest woodwork.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_convert_woodwork_types_wrapper()
2021-2021
"Beloved private function. Devoted partner to infer_feature_types(). Rest in Power."

Converts a pandas data structure that may have extension or nullable dtypes to dtypes that numpy can understand and handle.
if isinstance(data, pd.Series):
if data.ww.schema is not None:
ww_data = ww.init_series(ww_data, logical_type=data.ww.logical_type,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't use series.ww.init(..) because is the expectation is that infer_feature_types should change the logical type if needed. In order to do that with a series, you need to use ww.init_series.

https://woodwork.alteryx.com/en/stable/start.html#Using-Woodwork-with-a-Series


def _retain_custom_types_and_initalize_woodwork(old_woodwork_data, new_pandas_data, ltypes_to_ignore=None):

def _convert_woodwork_types_wrapper():
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping this just so that I don't have to edit all the files that import conver_woodwork_types_wrapper.

My plan is to edit the files in each PR to not update _convert_woodwork_types_wrapper though.

return ww.init_series(new_dataframe, old_logical_types)
if ltypes_to_ignore is None:
ltypes_to_ignore = []
col_intersection = set(old_woodwork_data.columns).intersection(set(new_pandas_data.columns))
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're no longer converting from nullable to non-nullable types, I think we can simplify the implementation.

I tested this on my branch that updates the entire repo and I think this implementation meets our requirements.

Comment thread Makefile
--ignore evalml/tests/component_tests \
--ignore evalml/tests/pipeline_tests --ignore evalml/tests/automl_tests --ignore evalml/tests/data_checks_tests \
--ignore evalml/tests/model_understanding_tests --ignore evalml/tests/objective_tests \
-k "not test_save"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Telling pytest to not run the tests that start with test_save. These are plotting tests but they use a test fixture that trains decision trees. Since I haven't updated the components on the feature branch yet, those tests will fail.

return ww.DataColumn(new_pandas_data)

retained_logical_types = {}
if isinstance(new_dataframe, pd.Series):
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be covered when we add support for components

Copy link
Copy Markdown
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice jobs getting these changes in! It's cool to see that the changes are fairly minimal in this PR, although I'm sure it'll get much bigger later 😅

Left a nitpick and a testing question, but everything else looks good to me!

X_dt.ww.init()
pd.testing.assert_frame_equal(X_dt, infer_feature_types(X_dt))

y = _convert_woodwork_types_wrapper(pd.array([1, 2, None], dtype="Int64"))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any tests that still cover having np.nan or None in the dataset without having to cast the logical types to be some nullable type?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I think this will be clearer when I get the components PR up but [1, 2, None] will always be treated as Double.

return ww.DataTable(ww_data, logical_types=feature_types)
ww_data = data.copy()

_raise_value_error_if_nullable_types_detected(data)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit, but what if we switch this with the line before, just so we can raise the error if necessary before we do any copying for the world's smallest time save 😅 not necessary tho!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super great suggestion!!

Copy link
Copy Markdown
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Curious about your thoughts regarding whether or not we still need the return_pandas parameter, but otherwise just left nitpicky comments about docstrings and testing :)

Comment thread core-requirements.txt
shap>=0.35.0
texttable>=1.6.2
woodwork==0.0.11
woodwork==0.2.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏 👏 👏 👏

Comment thread evalml/demos/fraud.py Outdated
Comment on lines +31 to +32
X.ww.init()
y = ww.init_series(y)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that load_data used to return WW types, so we didn't need to do the conversion here; are we updating load_data to return WW types in a separate PR?

# target distribution
print(target_distribution(y))

X = infer_feature_types(X)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RE comment for demo datasets, are we returning WW for load_data? If not, we should update the docstring but I feel like we should 😁

X_test = X.iloc[test]
y_train = y.iloc[train]
y_test = y.iloc[test]
X_train = X.ww.iloc[train]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!!

Comment thread evalml/demos/fraud.py
@@ -23,8 +25,9 @@ def load_fraud(n_rows=None, verbose=True, return_pandas=False):
target="fraud",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should update docstrings for these demo dataset methods! :D

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion!

Comment on lines +10 to +11
assert isinstance(X, pd.DataFrame)
assert isinstance(y, pd.Series)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to test that we have WW initialized?

Comment thread evalml/demos/fraud.py Outdated

if return_pandas:
return X.to_dataframe(), y.to_series()
return X, y
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JW, do you think we still need the return_pandas parameter now? Is it okay to always return a WW init version since users can still just use the pandas data structures if they wanted to? :o

X_expected = pd.DataFrame({0: pd.Series([1, 2], dtype="category"), 1: pd.Series([2, 4], dtype="category")})
pd.testing.assert_frame_equal(X_renamed.to_dataframe(), X_expected)
assert X_renamed.logical_types == {0: ww.logical_types.Categorical, 1: ww.logical_types.Categorical}
pd.testing.assert_frame_equal(X_renamed, X_expected)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is beautiful.

Comment thread evalml/utils/gen_utils.py Outdated

Arguments:
dt (ww.DataTable): The DataTable to check data types of.
dt (pd.DataFrame): The DataTable to check data types of.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: The dataframe to check data types of., also maybe update dt --> df?

@freddyaboulton freddyaboulton force-pushed the 2035-update-preprocessing-demos-and-utils branch from 67f5f03 to bb0c9f2 Compare April 22, 2021 20:29
@freddyaboulton freddyaboulton merged commit 8117248 into 2035-use-ww-accessor Apr 22, 2021
Copy link
Copy Markdown
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is solid work. I think some of the tests have proved to be redundant post-refactor and should be deleted. But that's looking good.

X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
y = y.map(lambda x: data["target_names"][x])
if return_pandas:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There go the training wheels

Comment thread evalml/demos/fraud.py
from evalml.preprocessing import load_data


def load_fraud(n_rows=None, verbose=True, return_pandas=False):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that this change hides our shame from not putting return_pandas in the docstring to begin with 😂

Comment on lines +41 to +47
X, y = demos.load_wine()
assert X.shape == (178, 13)
assert y.shape == (178,)
assert isinstance(X, pd.DataFrame)
assert isinstance(y, pd.Series)
assert X.ww.schema is not None
assert y.ww.schema is not None
Copy link
Copy Markdown
Contributor

@chukarsten chukarsten Apr 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should snipe this test.

  • snipe this test

Comment on lines +59 to +65
X, y = demos.load_breast_cancer()
assert X.shape == (569, 30)
assert y.shape == (569,)
assert isinstance(X, pd.DataFrame)
assert isinstance(y, pd.Series)
assert X.ww.schema is not None
assert y.ww.schema is not None
Copy link
Copy Markdown
Contributor

@chukarsten chukarsten Apr 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can snipe this one too.

  • snipe this test, too

Comment on lines +77 to +83
X, y = demos.load_diabetes()
assert X.shape == (442, 10)
assert y.shape == (442,)
assert isinstance(X, pd.DataFrame)
assert isinstance(y, pd.Series)
assert X.ww.schema is not None
assert y.ww.schema is not None
Copy link
Copy Markdown
Contributor

@chukarsten chukarsten Apr 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

  • command-shift-4 to bring up the sniping tool and snipe it

pd.testing.assert_frame_equal(_rename_column_names_to_numeric(X), pd.DataFrame({0: [1, 2], 1: [2, 4]}))

X = ww.DataTable(pd.DataFrame({"<>": [1, 2], ">>": [2, 4]}), logical_types={"<>": "categorical", ">>": "categorical"})
X = pd.DataFrame({"<>": [1, 2], ">>": [2, 4]})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAybe I'm missing something, do we need X redefined here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, good eye!


y = _convert_woodwork_types_wrapper(pd.array([1, 2, None], dtype="Int64"))
pd.testing.assert_series_equal(y, pd.Series([1, 2, np.nan], dtype="float64"))
X_dc = ww.init_series(pd.Series([1, 2, 3, 4]))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is this testing that infer_feature_types leaves the data table unchanged?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I'll change the name to test_infer_feature_types_no_type_change to make it clearer

Comment on lines +21 to +22
X_pd = pd.Series([1, 2, 3, 4], dtype="int64")
pd.testing.assert_series_equal(X_pd, infer_feature_types(X_pd))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this has become redundant now

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right!!


_raise_value_error_if_nullable_types_detected(data)

def _convert_woodwork_types_wrapper(pd_data):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_convert_woodwork_types_wrapper()
2021-2021
"Beloved private function. Devoted partner to infer_feature_types(). Rest in Power."

except (ValueError, TypeError):
pass
return ww.DataTable(new_pandas_data, logical_types=retained_logical_types)
col_intersection = set(old_logical_types.keys()).intersection(set(new_dataframe.columns))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing makes me happier than when set intersection or symmetric_difference is used.

freddyaboulton added a commit that referenced this pull request May 25, 2021
* WW Accessor: Update demo datasets, preprocessing, and utils (#2172)

* Update demos, utils, and preprocessing

* Updating make test commands

* Update ww requirement

* Add test that infer_feature_types preserves schema

* Add test that infer_feature_types raises errors with invalid schema

* load_data always returns woodwork info. Deleted return_pandas

* Updating docstrings

* Release notes for first PR

* Deleting redundant tests

* update test name

* WW 0.2.0: Update Data Checks (#2182)

* Updating data checks

* Undo accidental edit

* Fixing typo where we assign to ww.init

* WW Accessor: Updating objectives (#2185)

* Updating objectives

* Deleting superfluous pd.Series

* WW Accessor: Updating Components (#2191)

* Update components - first commit

* Update delayed_features_transformer to not use assign

* Fixing tests

* Skipping Boolean with Nan test in imputers

* Fixing base sampler _prepare_data

* Fixing target imputer null bool test

* Fix test skips

* Addressing comments

* Clean up sampler tests

* Editing docstrings

* Update to ww 0.3.0

* Fixing components that didn't have merge conflicts

* WW 0.2.0 Update: Updating pipelines (#2205)

* Updating pipelines

* Fixing docstrings

* Fix stacked test

* Add tests to check input data not modified

* Use rename in component graph

* WW Accessor: Update model understanding (#2247)

* Update model understanding module

* Removing unused import

* WW Accessor: Update Automl (#2243)

* Updating pipelines

* Fixing docstrings

* Fix stacked test

* Update automl

* Update to preserve schema

* Preserve schema in train_pipelines/score_pipelines

* No ww drop

* Update standard scaler

* Use weak ref feture branch

* Add tests to check input data not modified

* Use weak-ref branch

* Use weak ref branch

* Fixing tests

* Set ww back to 0.3.0

* Adding objective tests that use automl

* Add back minimal-dependencies-flag

* Fix test command

* Updating docstrings adding dask test to check schema is sent

* Fixing import order in test_dask_engine

* Fixing last remaining docstrings

* Fix typo in ignore command

* Use ww init series

* No doctest modules yet

* Fix docstrings & use schemas stored in automl_config

* WW Update: Update docs and docstrings (#2292)

* Fixing docs and docstrings

* Update docstring in highly_null_data_check

* Delete warning from init

* Remove print(X.ww)

* Fixing notebook python version :(

* Update release notes

* Delete _convert_woodwork_types_wrapper completely

* Fixing coverage

* Add back standard scaler try

* Remove _convert_woodwork_types_wrapper from docs'

* Linting docs

* Merging main

* Fix release notes
@freddyaboulton freddyaboulton deleted the 2035-update-preprocessing-demos-and-utils branch June 3, 2021 14:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants