Add error message to partial_dependence #1994
Conversation
evalml/model_understanding/graphs.py
Outdated
```diff
@@ -561,11 +563,15 @@ def partial_dependence(pipeline, X, features, grid_resolution=100):
     if pipeline.model_family == ModelFamily.BASELINE:
         raise ValueError("Partial dependence plots are not supported for Baseline pipelines")

+    if ((isinstance(features, int) and X.iloc[:, features].value_counts(normalize=True).values[0] + 0.01 > percentiles[1]) or
```
We add 0.01 (1%) since scikit-learn doesn't allow the most common value's frequency to be too close to the upper percentile (for instance, if a feature is 95% one value, scikit-learn will still throw the error even if we pass the upper percentile as 0.955). Beyond that constraint, 1% was chosen arbitrarily (compared to 0.0075 or another value). A sketch of the guard is below.
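For illustration, a minimal standalone sketch of the check this diff adds (the helper name is hypothetical, and `0.95` stands in for scikit-learn's default upper percentile):

```python
import pandas as pd

def mostly_one_value(col: pd.Series, upper_percentile: float = 0.95, buffer: float = 0.01) -> bool:
    # Fraction of non-null rows taken up by the column's most common value.
    top_fraction = col.value_counts(normalize=True).values[0]
    # scikit-learn can't build the percentile grid when the dominant value's
    # frequency gets too close to the upper percentile, hence the 1% buffer.
    return top_fraction + buffer > upper_percentile

print(mostly_one_value(pd.Series([2] * 4 + [1])))   # False: 0.80 + 0.01 <= 0.95
print(mostly_one_value(pd.Series([2] * 19 + [1])))  # True:  0.95 + 0.01 >  0.95
```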
Codecov Report

```
@@            Coverage Diff            @@
##             main    #1994    +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%
=========================================
  Files         274      274
  Lines       22300    22325     +25
=========================================
+ Hits        22294    22319     +25
  Misses          6        6
```

Continue to review full report at Codecov.
```diff
@@ -233,8 +233,8 @@ def test_partial_dependence_not_fitted(X_y_binary, logistic_regression_binary_pipeline_class):


 def test_partial_dependence_warning(logistic_regression_binary_pipeline_class):
-    X = pd.DataFrame({'a': [2, None, 2, 2], 'b': [1, 2, 2, 1]})
     y = pd.Series([0, 1, 0, 1])
+    X = pd.DataFrame({'a': [1, 2, None, 2, 2], 'b': [1, 1, 2, 2, 1]})
```
Added an extra value so we don't raise the new ValueError; a quick check is sketched below.
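To see why the old fixture would have tripped the new guard, note that `value_counts(normalize=True)` drops NaN by default, so the old column `'a'` was 100% the value 2 among its non-null rows:

```python
import pandas as pd

old_a = pd.Series([2, None, 2, 2])     # 'a' from the old fixture
new_a = pd.Series([1, 2, None, 2, 2])  # 'a' with one extra distinct value

# value_counts(normalize=True) ignores NaN, so the old column is entirely
# one value among its non-null rows and exceeds the 0.95 upper percentile.
print(old_a.value_counts(normalize=True).values[0])  # 1.0  -> 1.0  + 0.01 > 0.95, raises
print(new_a.value_counts(normalize=True).values[0])  # 0.75 -> 0.75 + 0.01 <= 0.95, passes
```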
Looks good to me!
evalml/model_understanding/graphs.py
Outdated
```diff
@@ -564,8 +566,12 @@ def partial_dependence(pipeline, X, features, grid_resolution=100):
     if ((isinstance(features, int) and X.iloc[:, features].isnull().sum()) or (isinstance(features, str) and X[features].isnull().sum())):
         warnings.warn("There are null values in the features, which will cause NaN values in the partial dependence output. Fill in these values to remove the NaN values.", NullsInColumnWarning)

+    if ((isinstance(features, int) and X.iloc[:, features].value_counts(normalize=True).values[0] + 0.01 > percentiles[1]) or
+            (isinstance(features, str) and X[features].value_counts(normalize=True).values[0] + 0.01 > percentiles[1])):
+        raise ValueError(f"Feature {features} is mostly one value and cannot be used to compute partial dependence. Try raising the upper percentage value.")
```
Two questions:

1. `Feature {features}` will return a number if `features` is an int, right? Should we return the name of the feature in that case for easier debugging?
2. Should we include some information about the value causing the imbalance in the error message, like `The value {X.iloc[:, features].value_counts(normalize=True).values[0]} is present in feature {features} above the upper limit for partial dependence`, etc.?
Ah, good catch on that first bullet. Since the `features` variable is passed in by the user, I assumed it would be fine to output that same value, but I'll add quotation marks around it to make it clearer to the user. I can also add a small clarification about the first value that's causing this, along the lines of the sketch below.
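A hypothetical sketch combining both suggestions (quoting the feature reference and naming the dominant value and its share); this is illustrative only, not the PR's exact final wording:

```python
import pandas as pd

# Toy data where one value dominates; 0.95 is the assumed upper percentile.
X = pd.DataFrame({'a': [2] * 19 + [1]})
features = 'a'

col = X.iloc[:, features] if isinstance(features, int) else X[features]
counts = col.value_counts(normalize=True)
if counts.values[0] + 0.01 > 0.95:
    # Hypothetical revised message: feature reference in quotes, plus the
    # dominant value and the fraction of rows it occupies.
    raise ValueError(
        f"Feature '{features}' is mostly one value: {counts.index[0]!r} makes up "
        f"{counts.values[0]:.0%} of rows, so partial dependence cannot be computed. "
        f"Try raising the upper percentage value."
    )
```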
@ParthivNaresh I added a fix, let me know if it looks good to you!
Nice work, Bryan!
fix #1923