Partial Dependence Scale Error#2455
Conversation
09523ad to
d27dedb
Compare
Codecov Report
@@ Coverage Diff @@
## main #2455 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 283 283
Lines 25539 25555 +16
=======================================
+ Hits 25437 25453 +16
Misses 102 102
Continue to review full report at Codecov.
|
| * Added support for showing a Individual Conditional Expectations plot when graphing Partial Dependence :pr:`2386` | ||
| * Updated Objectives API to allow for sample weighting :pr:`2433` | ||
| * Fixes | ||
| * Added custom exception to address features with scales to small for partial dependence :pr:`2455` |
bchen1116
left a comment
There was a problem hiding this comment.
LGTM! Left a comment on something that can be added if there is a solution to get PDP to work with small scales. Otherwise, 👌
| except ValueError as e: | ||
| if "percentiles are too close to each other" in str(e): | ||
| raise ValueError( | ||
| "The scale of these features is too small and results in" |
There was a problem hiding this comment.
Is there a way that users would be able to run PDP on this small data? ie change the percentiles? Or is it that if the data is too small/close together, partial dependence won't work at all?
If it's the former, I think it'd be nice if we can include a comment to give the user suggestions on how to change their params. Otherwise, this looks fine!
There was a problem hiding this comment.
I think you could artificially scale the feature by perhaps multiplying by 1E6 or something like that to defeat the absolute tolerance check. That might fall within the realm of our current transformation components, but I think the story was just for a more informative error message. Let me add a little bit to the error message.
| grid_resolution=grid_resolution, | ||
| kind=kind, | ||
| ) | ||
| except ValueError as e: |
There was a problem hiding this comment.
Thoughts on us doing the check rather than relying on sklearn? My reasoning is that with @bchen1116 's #2454 , we're going to be computing the grid ourselves for datetimes and I think we'd want to have this same check in place for dates as well?
Maybe the right thing to do is modify #2454 to compute the grid ourselves for all features rather than having separate branches for datetimes vs other features. That may make it easier to put the "can we compute a valid grid for these features?" checks in one place? Curious what you think!
There was a problem hiding this comment.
My first crack was reproducing the check myself, rather than just let sklearn's partial dependence fail. I decided against that, ultimately, as I didn't want to take the risk that sklearn changes the current criteria behind the check and I wanted the catch to be a little more resilient to that.
I'm not even sure how the same check that's failing within partial dependence would handle date times. I guess you could define a relative tolerance of 1E-5 and it mean something if you convert a datetime to like linux time or something.
I'm also not against rejecting this PR if you don't think catching the existing sklearn exception is still the right move. If we're more in favor of computing, for sure, a functional grid, then we can go that route and I can flush this sucker down the toilet.
White space
Updated the error message
ParthivNaresh
left a comment
There was a problem hiding this comment.
LGTM! Do you think that this warrants a potential DataCheck in the future? I get the feeling that uniqueness_data_check could have an extension to it that checks if the 5th and 95th percentiles are too close together.
Addresses #2336
The origin of the original error is within sklearn and is the result of
mquantile()being called on certain features whose scales are too close together. "Too close" is defined by a call of numpy'sallclose()function, which compares the values defining, in this case, the .05 and .95 percentiles. Ifnp.allclose(five_percentile, ninety_five_percentile), the ValueError for the original issue will be raised. The tolerance for theallclose()function that triggers the exception is, by default, 1E-08, so when features are closer together than this, we experience the error.