
Adding error if columns are all Nan in partial dependence #2120

Merged

Conversation

freddyaboulton
Contributor

Pull Request Description

Fixes #2073, #2119



if not pipeline._is_fitted:
    raise ValueError("Pipeline to calculate partial dependence for must be fitted")
if pipeline.model_family == ModelFamily.BASELINE:
    raise ValueError("Partial dependence plots are not supported for Baseline pipelines")

feature_list = []
if isinstance(features, int):
freddyaboulton (Contributor, Author):

Since we only check for int or str we miss the two-way case, which caused #2119

f"partial dependence cannot be computed: {', '.join(all_nan)}")


def _raise_value_error_if_mostly_one_value(df, percentile):
freddyaboulton (Contributor, Author):

This is to fix #2119. May be scope-creep but it felt weird to have the all-NaN check work for two-way without fixing this warning. The implementation is pretty simple, we basically use the existing code for the one-way check on every column the user passed in, so I thought it was appropriate.

chukarsten (Contributor):

Is it weird not using our own datachecks to accomplish this type of thing? This seems very much like a uniqueness and a no variance datacheck.

freddyaboulton (Contributor, Author):

Good suggestion! There's definitely overlap in the logic. I'll file a spike issue.

freddyaboulton changed the title from "Adding error if columns are all Nan." to "Adding error if columns are all Nan in partial dependence" on Apr 9, 2021
codecov bot commented Apr 9, 2021

Codecov Report

Merging #2120 (7515f19) into main (0db51e3) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #2120     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         291      291             
  Lines       23809    23853     +44     
=========================================
+ Hits        23799    23843     +44     
  Misses         10       10             
Impacted Files Coverage Δ
evalml/model_understanding/graphs.py 100.0% <100.0%> (ø)
...del_understanding_tests/test_partial_dependence.py 100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0db51e3...7515f19.

bchen1116 (Contributor) left a comment

Nice! LGTM

        values.append(str(normalized_counts.index[0]))

    if one_value:
        raise ValueError(f"Features {', '.join(one_value)} are mostly one value, ({', '.join(values)}), "
Contributor:

nit-pick, but can we put parentheses around the feature names too?

f"Features ({', '.join(one_value)}) are mostly

I think it's cleaner when we have multiple features
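The suggested change is a one-character tweak to the f-string; with multiple offending features the two messages compare as:

```python
# Sample values mirroring what the helper collects for two degenerate columns.
one_value = ["'random_col'", "'random_col_2'"]
values = ["1", "1"]

# Current wording vs. the suggested parenthesized wording.
current = f"Features {', '.join(one_value)} are mostly one value, ({', '.join(values)})"
suggested = f"Features ({', '.join(one_value)}) are mostly one value, ({', '.join(values)})"

print(current)    # Features 'random_col', 'random_col_2' are mostly one value, (1, 1)
print(suggested)  # Features ('random_col', 'random_col_2') are mostly one value, (1, 1)
```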

chukarsten (Contributor):

I think this is also a good time to add a Raises section to the docstring of the partial dependence function, so people can see the linkage between this ValueError and the upper percentile. It took me a second, so I'm sure it'll cause a bit of head-scratching.
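For illustration, a Raises section along these lines might read as follows (a sketch with an assumed signature, not evalml's final wording):

```python
def partial_dependence(pipeline, X, features, percentiles=(0.05, 0.95), grid_resolution=100):
    """Calculate one- or two-way partial dependence (abridged docstring sketch).

    Raises:
        ValueError: If the pipeline is not fitted or is a Baseline pipeline.
        ValueError: If any requested feature column holds all NaN values.
        ValueError: If a requested feature is mostly one value, i.e. its most
            common value occurs in more than the upper percentile
            (``percentiles[1]``) of rows; raising the upper percentile
            relaxes this check.
    """
    raise NotImplementedError("docstring illustration only")
```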

freddyaboulton (Contributor, Author):

Good point @chukarsten. I'll add that now!

    partial_dependence(pipeline, X, features=2, percentiles=(0.01, 0.955), grid_resolution=20)
with pytest.raises(ValueError, match="Features 'random_col' are mostly one value, \\(1\\), and cannot be"):
    partial_dependence(pipeline, X, features=('A', "random_col"), percentiles=(0.01, 0.955), grid_resolution=20)
with pytest.raises(ValueError, match="Features 'random_col', 'random_col_2' are mostly one value, \\(1, 1\\), and cannot be"):
Contributor:

👏

chukarsten (Contributor) left a comment

Nice work. Just a suggestion about adding "Raises" to the docstring of the partial dependence function to clarify the relationship of the upper percentile limit to the "mostly one value" ValueError. I also made a comment about using our datachecks to do these seeming data checks, but I don't know how clean or reasonable that is.


ParthivNaresh (Contributor) left a comment

Overall looks great, excellent addition of helper functions

freddyaboulton force-pushed the 2073-partial-dependence-exception-for-all-null-features branch 2 times, most recently from 81d5fd6 to d053099, on April 13, 2021 16:49
freddyaboulton force-pushed the 2073-partial-dependence-exception-for-all-null-features branch from d053099 to 7515f19 on April 13, 2021 17:32
freddyaboulton merged commit b92dc41 into main on Apr 13, 2021
freddyaboulton deleted the 2073-partial-dependence-exception-for-all-null-features branch on April 13, 2021 18:10
chukarsten added a commit that referenced this pull request Apr 20, 2021
…skEngine`` #1975.

- Added optional ``engine`` argument to ``AutoMLSearch`` #1975
- Added a warning about how time series support is still in beta when a user passes in a time series problem to ``AutoMLSearch`` #2118
- Added ``NaturalLanguageNaNDataCheck`` data check #2122
- Added ValueError to ``partial_dependence`` to prevent users from computing partial dependence on columns with all NaNs #2120
- Added standard deviation of cv scores to rankings table #2154
- Fixed ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, and ``BalancedClassificationSampler`` to use ``minority:majority`` ratio instead of ``majority:minority`` #2077
- Fixed bug where two-way partial dependence plots with categorical variables were not working correctly #2117
- Fixed bug where ``hyperparameters`` were not displaying properly for pipelines with a list ``component_graph`` and duplicate components #2133
- Fixed bug where ``pipeline_parameters`` argument in ``AutoMLSearch`` was not applied to pipelines passed in as ``allowed_pipelines`` #2133
- Fixed bug where ``AutoMLSearch`` was not applying custom hyperparameters to pipelines with a list ``component_graph`` and duplicate components #2133
- Removed ``hyperparameter_ranges`` from Undersampler and renamed ``balanced_ratio`` to ``sampling_ratio`` for samplers #2113
- Renamed ``TARGET_BINARY_NOT_TWO_EXAMPLES_PER_CLASS`` data check message code to ``TARGET_MULTICLASS_NOT_TWO_EXAMPLES_PER_CLASS`` #2126
- Modified one-way partial dependence plots of categorical features to display data with a bar plot #2117
- Renamed ``score`` column for ``automl.rankings`` as ``mean_cv_score`` #2135
- Fixed ``conf.py`` file #2112
- Added a sentence to the automl user guide stating that our support for time series problems is still in beta. #2118
- Fixed documentation demos #2139
- Update test badge in README to use GitHub Actions #2150
- Fixed ``test_describe_pipeline`` for ``pandas`` ``v1.2.4`` #2129
- Added a GitHub Action for building the conda package #1870 #2148
.. warning::
- Renamed ``balanced_ratio`` to ``sampling_ratio`` for the ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, ``BalancedClassificationSampler``, and Undersampler #2113
- Deleted the "errors" key from automl results #1975
- Deleted the ``raise_and_save_error_callback`` and the ``log_and_save_error_callback`` #1975
- Fixed ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, and ``BalancedClassificationSampler`` to use minority:majority ratio instead of majority:minority #2077
chukarsten mentioned this pull request Apr 20, 2021
Development

Successfully merging this pull request may close these issues.

Handle exception for fully null columns in partial dependence