Partial Dependence Scale Error #2455

Merged: chukarsten merged 10 commits into main from part_dep_range_error on Jul 7, 2021
@chukarsten chukarsten commented Jun 28, 2021

Contributor

Addresses #2336

The original error comes from within sklearn and is raised after mquantiles() is called on features whose values are too close together. "Too close" is decided by a call to numpy's allclose() function, which compares the values at, in this case, the .05 and .95 percentiles. If np.allclose(five_percentile, ninety_five_percentile) is true, the ValueError from the original issue is raised. By default, allclose() uses an absolute tolerance of 1E-08 (plus a relative tolerance of 1E-05), so when the two percentile values are closer together than this, we hit the error.
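To make the condition above concrete, here is a minimal standard-library sketch of the closeness test. It is not sklearn's code: `percentile` and `percentiles_too_close` are hypothetical helpers, the percentile uses plain linear interpolation (which may differ slightly from what mquantiles() computes), and the comparison mirrors the inequality numpy documents for allclose(), `|a - b| <= atol + rtol * |b|`.

```python
import math


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 1]) of pre-sorted values."""
    pos = q * (len(sorted_values) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def percentiles_too_close(values, lower=0.05, upper=0.95, rtol=1e-5, atol=1e-8):
    """Return True if the lower and upper percentiles are within tolerance."""
    ordered = sorted(values)
    low = percentile(ordered, lower)
    high = percentile(ordered, upper)
    # Same inequality numpy documents for allclose(): |a - b| <= atol + rtol * |b|
    return abs(low - high) <= atol + rtol * abs(high)


# Illustrative data only: a feature spanning roughly 0 to 1e-8, and an ordinary one.
tiny = [i * 1e-10 for i in range(100)]
normal = [float(i) for i in range(100)]

print(percentiles_too_close(tiny))    # True
print(percentiles_too_close(normal))  # False
```

For the tiny feature, the two percentiles differ by about 8.9E-09, which is under the 1E-08 absolute tolerance, so the check reports them as too close.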

@chukarsten chukarsten force-pushed the part_dep_range_error branch from 09523ad to d27dedb Compare June 28, 2021 23:48
codecov Bot commented Jun 28, 2021


Codecov Report

Merging #2455 (2e041dd) into main (a3aa403) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2455     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        283     283             
  Lines      25539   25555     +16     
=======================================
+ Hits       25437   25453     +16     
  Misses       102     102             
Impacted Files Coverage Δ
evalml/model_understanding/graphs.py 100.0% <100.0%> (ø)
...del_understanding_tests/test_partial_dependence.py 100.0% <100.0%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@angela97lin angela97lin left a comment

Contributor

LGTM! 😁

Comment thread docs/source/release_notes.rst Outdated
* Added support for showing a Individual Conditional Expectations plot when graphing Partial Dependence :pr:`2386`
* Updated Objectives API to allow for sample weighting :pr:`2433`
* Fixes
* Added custom exception to address features with scales to small for partial dependence :pr:`2455`

Contributor

Typo: "to small" should be "too small".

@bchen1116 bchen1116 left a comment

Contributor

LGTM! Left a comment on something that can be added if there is a solution to get PDP to work with small scales. Otherwise, 👌

except ValueError as e:
    if "percentiles are too close to each other" in str(e):
        raise ValueError(
            "The scale of these features is too small and results in"

Contributor

Is there a way for users to run PDP on data this small, e.g. by changing the percentiles? Or is it that if the values are too small/close together, partial dependence won't work at all?

If it's the former, I think it'd be nice if we can include a comment to give the user suggestions on how to change their params. Otherwise, this looks fine!

Contributor Author

I think you could artificially scale the feature by perhaps multiplying by 1E6 or something like that to defeat the absolute tolerance check. That might fall within the realm of our current transformation components, but I think the story was just for a more informative error message. Let me add a little bit to the error message.
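As a rough illustration of the rescaling idea in this comment, the sketch below applies the inequality numpy documents for allclose() to a pair of made-up percentile values before and after multiplying by 1E6. `is_close` is a hypothetical helper, not a library call, and the numbers are illustrative only.

```python
def is_close(a, b, rtol=1e-5, atol=1e-8):
    """Scalar form of numpy's documented allclose() inequality."""
    return abs(a - b) <= atol + rtol * abs(b)


# Made-up 5th and 95th percentile values for a very small-scale feature.
p05, p95 = 4.95e-10, 9.405e-9
factor = 1e6  # the multiplier suggested in the comment

before = is_close(p05, p95)
after = is_close(p05 * factor, p95 * factor)
print(before, after)  # True False

# Rescaling only defeats the absolute tolerance. Values that agree to
# within the relative tolerance stay "close" at any scale.
still_close = is_close(1.0 * factor, 1.000001 * factor)
print(still_close)  # True
```

The last check shows the limit of the workaround: multiplying by a constant leaves the relative difference unchanged, so it helps only when the absolute tolerance is what triggers the error.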

Comment thread evalml/tests/model_understanding_tests/test_partial_dependence.py
        grid_resolution=grid_resolution,
        kind=kind,
    )
except ValueError as e:

Contributor

Thoughts on us doing the check rather than relying on sklearn? My reasoning is that with @bchen1116's #2454, we're going to be computing the grid ourselves for datetimes, and I think we'd want to have this same check in place for dates as well?

Maybe the right thing to do is modify #2454 to compute the grid ourselves for all features rather than having separate branches for datetimes vs other features. That may make it easier to put the "can we compute a valid grid for these features?" checks in one place? Curious what you think!

Contributor Author

My first crack was reproducing the check myself, rather than just letting sklearn's partial dependence fail. I ultimately decided against that, as I didn't want to take the risk that sklearn changes the current criteria behind the check, and I wanted the catch to be a little more resilient to that.
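The catch-and-re-raise approach described here can be sketched roughly as follows. This is not the PR's code: `compute_grid` is a stand-in for the library call, `safe_compute_grid` is a hypothetical wrapper, and the replacement message is illustrative wording only. The one detail taken from this thread is the matched substring, "percentiles are too close to each other".

```python
def compute_grid(low, high):
    """Stand-in for the library call; fails when the percentiles are too close."""
    if abs(low - high) <= 1e-8 + 1e-5 * abs(high):
        raise ValueError("percentiles are too close to each other")
    return (low, high)


def safe_compute_grid(low, high):
    """Hypothetical wrapper that translates the library error for users."""
    try:
        return compute_grid(low, high)
    except ValueError as e:
        # Match on the message text so unrelated ValueErrors pass through unchanged.
        if "percentiles are too close to each other" in str(e):
            raise ValueError(
                "Feature scale is too small to build a partial dependence grid; "
                "consider rescaling the feature."  # illustrative wording only
            ) from e
        raise


print(safe_compute_grid(0.0, 1.0))
```

The wrapper does not depend on the criteria behind the check, but it does depend on the wording of the library's message, so a change to that text would let the original error through untranslated.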

I'm not even sure how the same check that's failing within partial dependence would handle datetimes. I guess you could define a relative tolerance of 1E-5 and it would mean something if you convert a datetime to something like Unix time.
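To get a feel for what a relative tolerance of 1E-5 would mean on Unix timestamps, here is a small sketch. It is only an assumption about how such a check could look: `datetimes_too_close` is a hypothetical helper, and the dates are arbitrary examples.

```python
from datetime import datetime, timezone


def datetimes_too_close(low, high, rtol=1e-5, atol=1e-8):
    """Apply the allclose()-style inequality to POSIX timestamps in seconds."""
    a = low.timestamp()
    b = high.timestamp()
    return abs(a - b) <= atol + rtol * abs(b)


start = datetime(2021, 6, 28, tzinfo=timezone.utc)
one_hour_later = datetime(2021, 6, 28, 1, tzinfo=timezone.utc)
one_day_later = datetime(2021, 6, 29, tzinfo=timezone.utc)

# For dates in 2021, the relative term alone is about 16,000 seconds (~4.5 hours).
tolerance_seconds = 1e-5 * start.timestamp()
print(round(tolerance_seconds))

print(datetimes_too_close(start, one_hour_later))  # True
print(datetimes_too_close(start, one_day_later))   # False
```

Under these assumptions, a datetime feature whose 5th and 95th percentiles fall within a few hours of each other would count as "too close", while one spanning a day or more would not.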

I'm also not against rejecting this PR if you don't think catching the existing sklearn exception is still the right move. If we're more in favor of computing, for sure, a functional grid, then we can go that route and I can flush this sucker down the toilet.

@ParthivNaresh ParthivNaresh left a comment

Contributor

LGTM! Do you think that this warrants a potential DataCheck in the future? I get the feeling that uniqueness_data_check could have an extension to it that checks if the 5th and 95th percentiles are too close together.

@chukarsten chukarsten merged commit 143c83a into main Jul 7, 2021
@chukarsten chukarsten deleted the part_dep_range_error branch July 7, 2021 14:27
@chukarsten chukarsten mentioned this pull request Jul 22, 2021
5 participants