
Conversation

@cakedev0
Contributor

@cakedev0 cakedev0 commented Oct 21, 2025

Resolves #340

Specs:

  • no weights: ND support, NaNs propagate; supported methods: "linear", "inverted_cdf", "averaged_inverted_cdf"
    • adding support for omitting NaNs shouldn't be too hard (if you assume NaNs sort to the end, which is what everyone does), but I don't think it's needed for now.
  • with weights: only 1D or 2D support, propagate or omit NaNs; supported methods: "inverted_cdf", "averaged_inverted_cdf"
  • Delegates to all major supported backends when possible (numpy, cupy, torch, and jax), except dask, as it was raising errors.
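
As a reference point for the three supported methods, here is how NumPy (>= 1.22), which uses the same method names and semantics, behaves on a small example:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# "linear" interpolates between the two nearest order statistics.
lin = np.quantile(x, 0.5, method="linear")                 # 2.5
# "inverted_cdf" picks the smallest value whose empirical CDF >= q.
inv = np.quantile(x, 0.5, method="inverted_cdf")           # 2.0
# "averaged_inverted_cdf" averages the two candidates when the CDF
# hits q exactly at a jump.
avg = np.quantile(x, 0.5, method="averaged_inverted_cdf")  # 2.5
```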

Comparison with NumPy:

These specs match NumPy's quantile for nan_policy="propagate" and nanquantile for nan_policy="omit". The differences are:

  • using nan_policy="propagate"|"omit" (inspired by scipy) instead of two functions like numpy (quantile for nan_policy="propagate" and nanquantile for nan_policy="omit"). Why? Less code/doc duplication and clearer behavior, IMO.
  • weighted case:
    • support for "averaged_inverted_cdf" method (only "inverted_cdf" method is supported by numpy): we need this in scikit-learn.
    • only up to 2D: we don't need more for now.
  • non-weighted case:
    • fewer methods. Why? We don't need more: I've only found one call in scikit-learn that passes a method, and it's a deprecated one.
    • no support for nan_policy="omit": there are a few calls to nanpercentile in scikit-learn. It would be easy to implement, following what has been done in scipy.
    • implementation: we sort instead of relying on partitioning like numpy does. A partitioning-based implementation would be quite a bit more complex and would not benefit performance in most cases, as xpx.partition currently relies on sorting when not delegating. This could be worked on later if needed.
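
To make the sort-based approach concrete, here is a minimal 1D sketch of the "linear" method (a hypothetical helper for illustration, not the actual implementation, which is ND and array-API-agnostic):

```python
import numpy as np

def linear_quantile(x, q):
    """Minimal sort-based 'linear' quantile for 1D input (sketch only)."""
    xs = np.sort(np.asarray(x, dtype=np.float64))
    n = xs.shape[0]
    # Virtual index into the sorted data: (n - 1) * q, split into an
    # integer part and a fractional part used for interpolation.
    virtual = (n - 1) * q
    lo = int(np.floor(virtual))
    hi = min(lo + 1, n - 1)
    frac = virtual - lo
    return xs[lo] * (1 - frac) + xs[hi] * frac
```

For example, `linear_quantile([3, 1, 4, 2], 0.5)` interpolates halfway between the 2nd and 3rd order statistics, matching `np.quantile` with `method="linear"`.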

Comparison with SciPy:

The main difference is the broadcasting/reduction behavior; we aligned with numpy for this one.

Also, SciPy doesn't support weights.

The implementation for the non-weighted case is more or less copy-pasted from SciPy, except that I've only kept support for three methods.

Future use of xpx.quantile in scikit-learn

These specs should be enough to replace sklearn.utils.stats._weighted_percentile and most uses of np.quantile/np.percentile when rewriting numpy-based functions to support the Array API standard.

Note that the implementation for the weighted case differs a bit from sklearn's _weighted_percentile: it's mostly the same approach (sort the weights along a and compute a cumulative sum), but they differ in how edge cases are handled (mostly null weights). I believe my implementation is easier to read and to get right, and equivalent in performance (dominated by argsort for large inputs).
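
For illustration, the "sort and cumulative sum" approach can be sketched as follows for the 1D "inverted_cdf" case (a hypothetical, simplified helper, not the PR's actual code):

```python
import numpy as np

def weighted_quantile(a, q, weights):
    """1D weighted 'inverted_cdf' quantile sketch: sort the weights along
    a, take the cumulative weight, and return the first value whose
    cumulative weight reaches q * total_weight."""
    a = np.asarray(a, dtype=np.float64)
    w = np.asarray(weights, dtype=np.float64)
    order = np.argsort(a)  # dominates the cost for large inputs
    a_sorted, w_sorted = a[order], w[order]
    cum_w = np.cumsum(w_sorted)
    # First index where the weighted CDF reaches the target mass;
    # values with zero weight contribute no mass and are skipped over.
    idx = np.searchsorted(cum_w, q * cum_w[-1], side="left")
    return a_sorted[idx]
```

With uniform weights this reduces to the unweighted "inverted_cdf" method, and a null weight simply removes that value's mass from the CDF.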


Still TODO:

  • decide whether to support dask. I don't think we should: it's very slow and doesn't match the purpose of dask as mentioned here. Supporting it would also slow down the CI quite a bit.
  • read scikit-learn's tests for _weighted_percentile to see if there is something I missed in my tests

@cakedev0 cakedev0 marked this pull request as draft October 21, 2025 10:05
@lucascolley lucascolley changed the title FEA: Add quantile function - method="linear", no weights ENH: add quantile function - method="linear", no weights Oct 21, 2025
@lucascolley lucascolley added enhancement New feature or request new function labels Oct 21, 2025
@lucascolley lucascolley mentioned this pull request Oct 21, 2025
@lucyleeow

From scipy/scipy#23832

There is a slight difference between NumPy gufunc and scipy.stats rules that has to do with prepending 1s to the shapes when the arguments have different dimensionality. Specifically, in scipy.stats, axis identifies the core dimension after 1s are prepended to match the dimensionalities.

I am assuming all delegate packages do what numpy does, in terms of broadcasting? If that is the case we should probably follow unless we think scipy has better rules?
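
For reference, NumPy's gufunc-style rule (which the delegating backends are assumed to follow) prepends the shape of q to the output and removes the reduced axis:

```python
import numpy as np

x = np.arange(12, dtype=np.float64).reshape(3, 4)
q = np.array([0.25, 0.75])

out = np.quantile(x, q, axis=1)
# The q axis is prepended and the reduced axis (length 4) is removed,
# giving shape (2, 3) rather than (3, 2).
```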

@cakedev0 cakedev0 changed the title ENH: add quantile function - method="linear", no weights ENH: add quantile function with weights support Oct 23, 2025
@cakedev0
Contributor Author

cakedev0 commented Oct 23, 2025

This PR is now ~95% finished, so I'm marking it ready for review. I've updated the PR description.

@cakedev0 cakedev0 marked this pull request as ready for review October 23, 2025 15:19
raise ValueError(msg)
else:
if method not in {"inverted_cdf", "averaged_inverted_cdf"}:
msg = f"`method` '{method}' not supported with weights."


only methods x/y support weights?


methods = {"linear", "inverted_cdf", "averaged_inverted_cdf"}
if method not in methods:
msg = f"`method` must be one of {methods}"


Sort methods to get a deterministic output?

Contributor Author


I think it's deterministic already. But do you mean declaring methods in the sorted order?

Like this: methods = {"averaged_inverted_cdf", "inverted_cdf", "linear"}

raise ValueError(msg)
nan_policies = {"propagate", "omit"}
if nan_policy not in nan_policies:
msg = f"`nan_policy` must be one of {nan_policies}"


ditto

@lucascolley
Member

@cakedev0 would you be able to fill out the PR description a little more to justify API decisions, primarily comparing how the implementation compares to what is available in NumPy and SciPy respectively?

Is there a link to a PR which would use this in sklearn, or are we not at that stage yet?

@cakedev0
Contributor Author

Added some comparisons with numpy/scipy in the PR description.

Is there a link to a PR which would use this in sklearn, or are we not at that stage yet?

We are not at that stage yet, but I would be happy to open an "experiment PR" where I replace most calls to np.percentile/np.quantile/_weighted_percentile with calls to xpx.quantile and check that the tests pass. This would help a lot in building confidence in this PR (both the API decisions and the implementation correctness).

@lucascolley
Member

Thanks, yeah, that sounds like a good idea.
