
Add error message to partial_dependence #1994

Merged

merged 11 commits into main from bc_1923_partial on Mar 19, 2021

Conversation

bchen1116
Contributor

fix #1923

@bchen1116 bchen1116 self-assigned this Mar 18, 2021
@@ -561,11 +563,15 @@ def partial_dependence(pipeline, X, features, grid_resolution=100):
if pipeline.model_family == ModelFamily.BASELINE:
raise ValueError("Partial dependence plots are not supported for Baseline pipelines")

if ((isinstance(features, int) and X.iloc[:, features].value_counts(normalize=True).values[0] + 0.01 > percentiles[1]) or
Contributor Author (bchen1116) commented:

We add 0.01 (1%) because scikit-learn doesn't allow the frequency of a single value to be too close to the upper percentile (for instance, if a feature is 95% one value, scikit-learn will still throw the error even if we pass 0.955 as the upper percentile). Beyond that, the 1% buffer was chosen somewhat arbitrarily (versus 0.0075 or another value).
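The guard described above can be sketched standalone as follows. This is a minimal illustration, not the evalml API: mostly_one_value is a hypothetical helper, and the default upper percentile of 0.95 matches scikit-learn's default percentiles=(0.05, 0.95).

```python
import pandas as pd

def mostly_one_value(column, upper_percentile=0.95, buffer=0.01):
    # Fraction of rows taken by the most common value in the column.
    top_fraction = column.value_counts(normalize=True).values[0]
    # The 0.01 buffer guards against sklearn rejecting features whose
    # dominant value sits just under the upper percentile.
    return top_fraction + buffer > upper_percentile

X = pd.DataFrame({'a': [2, 2, 2, 2, 1], 'b': [1, 2, 3, 4, 5]})
print(mostly_one_value(X['a']))                         # 0.8 + 0.01 <= 0.95
print(mostly_one_value(X['a'], upper_percentile=0.80))  # 0.81 > 0.80
```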

codecov bot commented Mar 18, 2021

Codecov Report

Merging #1994 (a8829a8) into main (50da6bc) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1994     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         274      274             
  Lines       22300    22325     +25     
=========================================
+ Hits        22294    22319     +25     
  Misses          6        6             
Impacted Files Coverage Δ
evalml/model_understanding/graphs.py 100.0% <100.0%> (ø)
...del_understanding_tests/test_partial_dependence.py 100.0% <100.0%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -233,8 +233,8 @@ def test_partial_dependence_not_fitted(X_y_binary, logistic_regression_binary_pi


 def test_partial_dependence_warning(logistic_regression_binary_pipeline_class):
-    X = pd.DataFrame({'a': [2, None, 2, 2], 'b': [1, 2, 2, 1]})
-    y = pd.Series([0, 1, 0, 1])
+    X = pd.DataFrame({'a': [1, 2, None, 2, 2], 'b': [1, 1, 2, 2, 1]})
Contributor Author (bchen1116) commented:
Added an extra value so the test data doesn't trigger the new ValueError.
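To see why the extra row matters: value_counts(normalize=True) drops NaN, so the old test data was 100% one value after excluding the null and would hit the new check. A small sketch, where triggers_error is a hypothetical stand-in for the guard in partial_dependence:

```python
import pandas as pd

# Hypothetical stand-in for the guard added in this PR.
def triggers_error(series, upper_percentile=0.95, buffer=0.01):
    # value_counts(normalize=True) ignores NaN, so [2, None, 2, 2]
    # counts as 100% the value 2.
    return series.value_counts(normalize=True).values[0] + buffer > upper_percentile

old_a = pd.Series([2, None, 2, 2])     # non-null values are all 2 -> fraction 1.0
new_a = pd.Series([1, 2, None, 2, 2])  # 2 is 3 of 4 non-null values -> 0.75

print(triggers_error(old_a))  # True: the old test data would now raise
print(triggers_error(new_a))  # False: the warning test runs unaffected
```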

@bchen1116 bchen1116 marked this pull request as ready for review March 18, 2021 19:00
@ParthivNaresh (Contributor) left a comment:

Looks good to me!

@@ -564,8 +566,12 @@ def partial_dependence(pipeline, X, features, grid_resolution=100):
if ((isinstance(features, int) and X.iloc[:, features].isnull().sum()) or (isinstance(features, str) and X[features].isnull().sum())):
warnings.warn("There are null values in the features, which will cause NaN values in the partial dependence output. Fill in these values to remove the NaN values.", NullsInColumnWarning)

if ((isinstance(features, int) and X.iloc[:, features].value_counts(normalize=True).values[0] + 0.01 > percentiles[1]) or
(isinstance(features, str) and X[features].value_counts(normalize=True).values[0] + 0.01 > percentiles[1])):
raise ValueError(f"Feature {features} is mostly one value and cannot be used to compute partial dependence. Try raising the upper percentage value.")
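The two checks in this hunk (warn on nulls, raise on a mostly-constant feature) can be exercised with a self-contained sketch. NullsInColumnWarning is redefined locally here for illustration, and check_feature is a hypothetical condensation of the diff, not evalml's actual function:

```python
import warnings
import pandas as pd

class NullsInColumnWarning(UserWarning):
    """Local stand-in for evalml's NullsInColumnWarning."""

def check_feature(X, features, upper_percentile=0.95):
    # Mirror of the hunk above: accept a column index (int) or name (str).
    col = X.iloc[:, features] if isinstance(features, int) else X[features]
    if col.isnull().sum():
        warnings.warn("There are null values in the features, which will cause "
                      "NaN values in the partial dependence output.",
                      NullsInColumnWarning)
    if col.value_counts(normalize=True).values[0] + 0.01 > upper_percentile:
        raise ValueError(f"Feature {features} is mostly one value and cannot be "
                         "used to compute partial dependence.")

X = pd.DataFrame({'a': [1.0, None, 2.0, 3.0], 'b': [5, 5, 5, 5]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_feature(X, 'a')  # has nulls but no dominant value -> warns only
print(len(caught))  # 1

try:
    check_feature(X, 'b')  # constant column -> ValueError
except ValueError as err:
    print(err)
```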
@ParthivNaresh (Contributor) commented:
Two questions:

  • Feature {features} will render as a number if features is an int, right? Should we use the feature's name in that case for easier debugging?
  • Should we include some information about the value causing the imbalance in the error message, e.g. The value {X.iloc[:, features].value_counts(normalize=True).values[0]} is present in feature {features} above the upper limit for partial dependence... etc.?

Contributor Author (bchen1116) replied:

Ah, good catch on that first bullet. Since the features variable is passed in by the user, I assumed it would be fine to echo that same value back, but I'll add quotation marks around it to make it clearer to the user.

I can also add a small clarification about the first value that's causing this.
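The clarified message could look like the following sketch. build_error_message is hypothetical, the exact string merged in the PR may differ:

```python
import pandas as pd

def build_error_message(X, features):
    # Hypothetical sketch of the clarified error message discussed above:
    # quote the user-supplied feature identifier and name the dominant value.
    col = X.iloc[:, features] if isinstance(features, int) else X[features]
    counts = col.value_counts(normalize=True)
    top_value, top_fraction = counts.index[0], counts.values[0]
    return (f"Feature '{features}' is mostly one value, {top_value} "
            f"({top_fraction:.0%} of rows), and cannot be used to compute "
            "partial dependence. Try raising the upper percentage value.")

X = pd.DataFrame({'a': [2, 2, 2, 1]})
print(build_error_message(X, 'a'))
# Feature 'a' is mostly one value, 2 (75% of rows), and cannot be used to
# compute partial dependence. Try raising the upper percentage value.
```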

Contributor Author (bchen1116) commented:

@ParthivNaresh I added a fix, let me know if it looks good to you!

@chukarsten (Contributor) left a comment:

Nice work, Bryan!

evalml/model_understanding/graphs.py (outdated review thread, resolved)
@bchen1116 bchen1116 merged commit 2cbfa34 into main Mar 19, 2021
@dsherry dsherry mentioned this pull request Mar 24, 2021
@freddyaboulton freddyaboulton deleted the bc_1923_partial branch May 13, 2022 15:01
Successfully merging this pull request may close these issues.

Partial dependence: better error handling for single-valued features
3 participants