-
Notifications
You must be signed in to change notification settings - Fork 16
ENH: add quantile function with weights support
#494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
quantile function - method="linear", no weights
|
From scipy/scipy#23832
I am assuming all delegate packages do what numpy does, in terms of broadcasting? If that is the case we should probably follow unless we think scipy has better rules? |
quantile function - method="linear", no weightsquantile function with weights support
|
This PR is now ~95% finished, so marking it ready for review. Updated PR desc. |
| raise ValueError(msg) | ||
| else: | ||
| if method not in {"inverted_cdf", "averaged_inverted_cdf"}: | ||
| msg = f"`method` '{method}' not supported with weights." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only method x/ y support weights?
|
|
||
| methods = {"linear", "inverted_cdf", "averaged_inverted_cdf"} | ||
| if method not in methods: | ||
| msg = f"`method` must be one of {methods}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sort methods to get a deterministic output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's deterministic already. But do you mean declaring methods in the sorted order?
Like this: methods = {"averaged_inverted_cdf", "inverted_cdf", "linear"}
| raise ValueError(msg) | ||
| nan_policies = {"propagate", "omit"} | ||
| if nan_policy not in nan_policies: | ||
| msg = f"`nan_policy` must be one of {nan_policies}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>
|
@cakedev0 would you be able to fill out the PR description a little more to justify API decisions, primarily comparing how the implementation compares to what is available in NumPy and SciPy respectively? Is there a link to a PR which would use this in sklearn, or are we not at that stage yet? |
|
Added some comparisons with numpy/scipy in the PR description.
We are not at this stage yet, but I would be happy to open an "experiment PR" where I replace most calls to |
|
Thanks, yeah, that sounds like a good idea. |
Resolves #340
Specs:
"linear", "inverted_cdf", "averaged_inverted_cdf""inverted_cdf", "averaged_inverted_cdf"Comparison with NumPy:
Those specs matches NumPy
quantilefornan_policy="propagate"andnanquantilefornan_policy="omit". Differences are:nan_policy="propagate"|"omit"(inspired by scipy) instead of two functions like numpy (quantilefornan_policy="propagate"andnanquantile. Why? Less code/doc duplication, and clearer behavior IMO."averaged_inverted_cdf"method (only"inverted_cdf"method is supported by numpy): we need this in scikit-learn.nan_policy="omit": there are a few calls tonanpercentilein scikit-learn. Would be easy to implement, following what has been done in scipy.xpx.partionrelies on sorting when not delegating. This could be workedComparison with SciPy:
Main difference is the broadcasting/reduction behavior. We aligned on numpy for this one.
Also, SciPy doesn't support weights.
Implementation for the non-weighted case is more or less copy-pasted from SciPy, except that I've only kept support for 3 methods.
Future use of
xpx.quantilein scikit-learnThose specs should be enough to replace
sklearn.utils.stats._weighted_percentileand to replace most uses ofnp.quantile/percentilewhen rewriting numpy-based functions to support Array API standard.Note that the implementation for the weighted case differs a bit from sklearn's
_weighted_percentile: it's mostly the same approach (sortweightsbased onaand compute cumulative sum), but they differ in the way edge cases are handled (null weights mostly). I believe my implementation to be easier to read and to get right, and equivalent in terms of performance (dominated by argsort for big inputs).Still TODO:
_weighted_percentileto see if there is something I missed in my tests