Skip to content

WW 0.2.0: Update Data Checks#2182

Merged
freddyaboulton merged 3 commits into2035-use-ww-accessorfrom
2035-update-data-checks
Apr 23, 2021
Merged

WW 0.2.0: Update Data Checks#2182
freddyaboulton merged 3 commits into2035-use-ww-accessorfrom
2035-update-data-checks

Conversation

@freddyaboulton
Copy link
Copy Markdown
Contributor

Pull Request Description

Update data checks to use the new woodwork accessor


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

@freddyaboulton freddyaboulton changed the base branch from main to 2035-use-ww-accessor April 22, 2021 21:50
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 22, 2021

Codecov Report

Merging #2182 (77077c0) into 2035-use-ww-accessor (0d2b5ae) will increase coverage by 6.6%.
The diff coverage is 100.0%.

Impacted file tree graph

@@                  Coverage Diff                   @@
##           2035-use-ww-accessor   #2182     +/-   ##
======================================================
+ Coverage                  20.8%   27.4%   +6.6%     
======================================================
  Files                       295     295             
  Lines                     24335   24336      +1     
======================================================
+ Hits                       5057    6657   +1600     
+ Misses                    19278   17679   -1599     
Impacted Files Coverage Δ
evalml/data_checks/data_check.py 100.0% <ø> (+33.4%) ⬆️
evalml/data_checks/class_imbalance_data_check.py 100.0% <100.0%> (+86.9%) ⬆️
evalml/data_checks/data_checks.py 100.0% <100.0%> (+80.4%) ⬆️
evalml/data_checks/datetime_nan_data_check.py 100.0% <100.0%> (+60.0%) ⬆️
evalml/data_checks/highly_null_data_check.py 100.0% <100.0%> (+77.3%) ⬆️
evalml/data_checks/id_columns_data_check.py 100.0% <100.0%> (+80.0%) ⬆️
evalml/data_checks/invalid_targets_data_check.py 100.0% <100.0%> (+87.7%) ⬆️
evalml/data_checks/multicollinearity_data_check.py 100.0% <100.0%> (+73.7%) ⬆️
...lml/data_checks/natural_language_nan_data_check.py 100.0% <100.0%> (+60.0%) ⬆️
evalml/data_checks/no_variance_data_check.py 100.0% <100.0%> (+76.5%) ⬆️
... and 52 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0d2b5ae...77077c0. Read the comment docs.

@freddyaboulton freddyaboulton force-pushed the 2035-update-data-checks branch from 1f71e50 to 6039340 Compare April 23, 2021 00:24
Copy link
Copy Markdown
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, looks like this update was nice and clean! 🥂

y = ww.DataColumn(y)
y_long = ww.DataColumn(y_long)
y_balanced = ww.DataColumn(y_balanced)
X.ww.init()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

O interesting, do we need to set X = X.ww.init() here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Just wondering between the difference between this and line 114)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, my guess is that the assignment in X = X.ww.init() is redundant but harmless

Copy link
Copy Markdown
Contributor Author

@freddyaboulton freddyaboulton Apr 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question! We can't assign the result of ww.init to anything because ww.init returns None. Just fixed a typo about this in this file that I missed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting that ww.init() modifies in place while the equivalent for series returns a new series

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get what you mean! There is series.ww.init() the problem is that if the logical type doesn't match the physical type of the series, it will error out. ww.init_series will actually do the type conversion if needed.

image

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks for the explanation!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question! We can't assign the result of ww.init to anything because ww.init returns None. Just fixed a typo about this in this file that I missed.

@freddyaboulton So, I'm a little late to the party here, but I'm wondering if we should actually return self from our Woodwork init method. If I'm thinking about this correctly, doing so would allow for chaining operations like df.ww.init().ww.describe() and also make the re-assignment situation valid. Thoughts?

Copy link
Copy Markdown
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢 ✨

Comment thread Makefile
pytest evalml -n 4 --cov=evalml --junitxml=test-reports/junit.xml --doctest-continue-on-failure \
--ignore evalml/tests/component_tests \
--ignore evalml/tests/pipeline_tests --ignore evalml/tests/automl_tests --ignore evalml/tests/data_checks_tests \
--ignore evalml/tests/pipeline_tests --ignore evalml/tests/automl_tests \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever! :)

id_cols = {col: 0.95 for col in cols_named_id}

X = X.select(include=['Integer', 'Categorical'])
X = _convert_woodwork_types_wrapper(X.to_dataframe())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏 👏 👏 yessss

Copy link
Copy Markdown
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Left a couple of questions, but great job cranking this out so quickly!

Comment thread evalml/tests/data_checks_tests/test_class_imbalance_data_check.py Outdated
@@ -1,6 +1,5 @@
import numpy as np
import pandas as pd
import woodwork as ww
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question, but why do we no longer need to import woodwork, even if we're using ww.init() and other ww methods?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great observation! This is something I hadn't noticed actually.

In order to use the accessor, woodwork needs to be present in the namespace but it doesn't need to be imported in the same file that you use it. Since our evaml/__init__.py file imports ww by way of importing our submodules, ww is imported when our tests run so we can use the accessor.

I suspect I originally kept the import but then deleted it because of lint since it's not "needed".

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's interesting. Thanks for the clarification!

@freddyaboulton freddyaboulton merged commit 6a8207d into 2035-use-ww-accessor Apr 23, 2021
@freddyaboulton freddyaboulton deleted the 2035-update-data-checks branch April 23, 2021 16:23
freddyaboulton added a commit that referenced this pull request May 25, 2021
* WW Accessor: Update demo datasets, preprocessing, and utils (#2172)

* Update demos, utils, and preprocessing

* Updating make test commands

* Update ww requirement

* Add test that infer_feature_types preserves schema

* Add test that infer_feature_types raises errors with invalid schema

* load_data always returns woodwork info. Deleted return_pandas

* Updating docstrings

* Release notes for first PR

* Deleting redundant tests

* update test name

* WW 0.2.0: Update Data Checks (#2182)

* Updating data checks

* Undo accidental edit

* Fixing typo where we assign to ww.init

* WW Accessor: Updating objectives (#2185)

* Updating objectives

* Deleting superfluous pd.Series

* WW Accessor: Updating Components (#2191)

* Update components - first commit

* Update delayed_features_transformer to not use assign

* Fixing tests

* Skipping Boolean with Nan test in imputers

* Fixing base sampler _prepare_data

* Fixing target imputer null bool test

* Fix test skips

* Addressing comments

* Clean up sampler tests

* Editing docstrings

* Update to ww 0.3.0

* Fixing components that didn't have merge conflicts

* WW 0.2.0 Update: Updating pipelines (#2205)

* Updating pipelines

* Fixing docstrings

* Fix stacked test

* Add tests to check input data not modified

* Use rename in component graph

* WW Accessor: Update model understanding (#2247)

* Update model understanding module

* Removing unused import

* WW Accessor: Update Automl (#2243)

* Updating pipelines

* Fixing docstrings

* Fix stacked test

* Update automl

* Update to preserve schema

* Preserve schema in train_pipelines/score_pipelines

* No ww drop

* Update standard scaler

* Use weak ref feture branch

* Add tests to check input data not modified

* Use weak-ref branch

* Use weak ref branch

* Fixing tests

* Set ww back to 0.3.0

* Adding objective tests that use automl

* Add back minimal-dependencies-flag

* Fix test command

* Updating docstrings adding dask test to check schema is sent

* Fixing import order in test_dask_engine

* Fixing last remaining docstrings

* Fix typo in ignore command

* Use ww init series

* No doctest modules yet

* Fix docstrings & use schemas stored in automl_config

* WW Update: Update docs and docstrings (#2292)

* Fixing docs and docstrings

* Update docstring in highly_null_data_check

* Delete warning from init

* Remove print(X.ww)

* Fixing notebook python version :(

* Update release notes

* Delete _convert_woodwork_types_wrapper completely

* Fixing coverage

* Add back standard scaler try

* Remove _convert_woodwork_types_wrapper from docs'

* Linting docs

* Merging main

* Fix release notes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants