Discrete percentile aggregations deviate from Oracle definitions #5366

Closed
2 tasks done
rnc505 opened this issue Nov 15, 2022 · 9 comments · Fixed by #5442

rnc505 commented Nov 15, 2022

What happens?

Calculating discrete percentile values (via either the percentile_disc or quantile_disc API) appears to be buggy: the results do not align with the Oracle documentation for percentile_disc and diverge (at a minimum) from BigQuery's implementation. I believe the cause is the floor operation applied to the calculated percentile position.

Oracle's documentation defines percentile_disc as:

For a given percentile value P, PERCENTILE_DISC sorts the values of the expression in the ORDER BY clause and returns the value with the smallest CUME_DIST value (with respect to the same sort specification) that is greater than or equal to P.

In the Oracle examples for percentile_disc, the 50th percentile of the descending-ordered values 11000, 3100, 2900, 2800, 2600, 2500 should be 2900, because the cume_dist value for 2900 is 0.5, which is equal to the 50th percentile. However, the equivalent query in DuckDB gives the 50th percentile value as 2800, despite the cume_dist values matching the Oracle example.
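
For concreteness, the "smallest cume_dist greater than or equal to P" rule can be applied mechanically as a plain query over the Oracle example data. A minimal sketch (the subquery alias t is illustrative):

SELECT col
FROM (
  -- cume_dist over the same sort specification as the percentile
  SELECT col, cume_dist() OVER (ORDER BY col DESC) AS cd
  FROM VALUES (11000), (3100), (2900), (2800), (2600), (2500) AS tab(col)
) t
WHERE cd >= 0.5          -- P = 0.5
ORDER BY cd
LIMIT 1;
-- Per the definition this yields 2900, whose cume_dist is exactly 0.5.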

This is also very noticeable on even smaller inputs, such as n = 2. For all percentile values strictly less than 1.0, a discrete percentile over a list of two unique values will return the smaller of the two, whereas the larger value is returned only at the 100th percentile.


I was able to compare this to a query from BigQuery:

SELECT
  DISTINCT PERCENTILE_DISC(value, 0.1) OVER (),
  PERCENTILE_DISC(value, 0.32) OVER (),
  PERCENTILE_DISC(value, 0.33) OVER (),
  PERCENTILE_DISC(value, 0.34) OVER (),
  PERCENTILE_DISC(value, 0.49) OVER (),
  PERCENTILE_DISC(value, 0.50) OVER (),
  PERCENTILE_DISC(value, 0.51) OVER (),
  PERCENTILE_DISC(value, 0.75) OVER (),
  PERCENTILE_DISC(value, 0.9) OVER (),
  PERCENTILE_DISC(value, 0.999) OVER (),
  PERCENTILE_DISC(value, 1) OVER (),
FROM (
  SELECT
    0 AS value
  UNION ALL
  SELECT
    1 AS value
  UNION ALL
  SELECT
    2 AS value
  UNION ALL
  SELECT
    10 AS value )
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬──────┐
│ f0_ │ f1_ │ f2_ │ f3_ │ f4_ │ f5_ │ f6_ │ f7_ │ f8_ │ f9_ │ f10_ │
╞═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪══════╡
│   0 │   1 │   1 │   1 │   1 │   1 │   2 │   2 │  10 │  10 │   10 │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴──────┘

In DuckDB (edited to match the SQL flavor):

duckdb> SELECT quantile_disc(col, [0.1, 0.32, 0.33, 0.34, 0.49, .5, .51, 0.75, 0.9, 0.999, 1])
   ...>     FROM VALUES (0), (1), (2), (10) AS tab(col);
┌────────────────────────────────────────────────────────────────────────────────────────────────┐
│ quantile_disc(col, main.list_value(0.1, 0.32, 0.33, 0.34, 0.49, .5, .51, 0.75, 0.9, 0.999, 1)) │
╞════════════════════════════════════════════════════════════════════════════════════════════════╡
│ [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 10]                                                             │
└────────────────────────────────────────────────────────────────────────────────────────────────┘

To Reproduce

Example From Oracle

Link: https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions111.htm#sthref1776

duckdb>     SELECT percentile_disc(.5) WITHIN GROUP (order by col desc) FROM VALUES (11000), (3100), (2900), (2800), (2600), (2500) AS tab(col);
┌─────────────────────────────────────┐
│ quantile_disc(.5 ORDER BY col DESC) │
╞═════════════════════════════════════╡
│                                2800 │
└─────────────────────────────────────┘
duckdb>  SELECT col, cume_dist() over (ORDER BY col desc),  FROM VALUES (11000), (3100), (2900), (2800), (2600), (2500) AS tab(col);
┌───────┬──────────────────────────────────────┐
│ col   ┆ cume_dist() OVER (ORDER BY col DESC) │
╞═══════╪══════════════════════════════════════╡
│ 11000 ┆                  0.16666666666666666 │
│  3100 ┆                   0.3333333333333333 │
│  2900 ┆                                  0.5 │
│  2800 ┆                   0.6666666666666666 │
│  2600 ┆                   0.8333333333333334 │
│  2500 ┆                                    1 │
└───────┴──────────────────────────────────────┘
duckdb>     SELECT percentile_disc(.5) WITHIN GROUP (order by col) FROM VALUES (11000), (3100), (2900), (2800), (2600), (2500) AS tab(col);

┌────────────────────────────────┐
│ quantile_disc(.5 ORDER BY col) │
╞════════════════════════════════╡
│                           2800 │
└────────────────────────────────┘
duckdb>     SELECT col, cume_dist() over (ORDER BY col ), FROM VALUES (11000), (3100), (2900), (2800), (2600), (2500) AS tab(col);
┌───────┬─────────────────────────────────┐
│ col   ┆ cume_dist() OVER (ORDER BY col) │
╞═══════╪═════════════════════════════════╡
│  2500 ┆             0.16666666666666666 │
│  2600 ┆              0.3333333333333333 │
│  2800 ┆                             0.5 │
│  2900 ┆              0.6666666666666666 │
│  3100 ┆              0.8333333333333334 │
│ 11000 ┆                               1 │
└───────┴─────────────────────────────────┘

Two-number example:

duckdb> SELECT col, cume_dist() over (ORDER BY col ), FROM VALUES (11000), (3100) AS tab(col);
┌───────┬─────────────────────────────────┐
│ col   ┆ cume_dist() OVER (ORDER BY col) │
╞═══════╪═════════════════════════════════╡
│  3100 ┆                             0.5 │
│ 11000 ┆                               1 │
└───────┴─────────────────────────────────┘
duckdb>     SELECT 
   ...> percentile_disc([0.1, 0.25, 0.5, 0.75, 0.9, 1.0]) WITHIN GROUP (order by col)
   ...>     FROM VALUES (11000), (3100) AS tab(col);
┌─────────────────────────────────────────────────────────────────────────────┐
│ quantile_disc(main.list_value(0.1, 0.25, 0.5, 0.75, 0.9, 1.0) ORDER BY col) │
╞═════════════════════════════════════════════════════════════════════════════╡
│ [3100, 3100, 3100, 3100, 3100, 11000]                                       │
└─────────────────────────────────────────────────────────────────────────────┘

OS:

Mac OSX M1

DuckDB Version:

latest

DuckDB Client:

nodejs, shell

Full Name:

Robby Cohen

Affiliation:

Pave

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

rsund commented Nov 15, 2022

You are probably aware that there are at least nine algorithms for estimating quantiles from a sample in common use in (statistical) software packages. So maybe this is not a bug, but rather a (justified) decision to use a particular algorithm?


rnc505 commented Nov 15, 2022

I was not aware of the nine or more different algorithms (reviewing the Pandas documentation, I could only find 5), so thank you for the pointer.

The particular reason I cited the Oracle definition as the authoritative source is that the GitHub issue in which quantile_cont was added references the corresponding Oracle documentation. In fact, that issue contains similar comments about the number of implementations, and it appeared to me that the decision on interpolation for the discrete case was unresolved, or at least open for debate.

You may be correct that this was an explicit decision; however, navigating through git blame all the way back to the initial implementation of the quantiles, it does not appear that any such explicit decision was made.

Additionally, the documentation around this function doesn't provide any information about the decision either.

So you're correct, maybe this isn't a bug, but I think it's worth a clarification or discussion. At a minimum, it would be nice if the interpolation function were configurable, similar to Pandas, so we don't have to roll our own (potentially slow) implementation.


rsund commented Nov 15, 2022

Yes, I agree that it is a general nuisance that quantile estimates are not necessarily reproducible between packages without tuning the algorithms to match. In that sense, fast native implementations of the common algorithms would certainly be a great benefit.

On the other hand, I have not encountered a convincing real-world situation (beyond the aim of exactly reproducing earlier results) in which the choice among reasonable quantile algorithms really mattered; from a pragmatic point of view the results are similar enough to lead to the same decisions, even though they are not identical.

In any case, a brief justification in the documentation for the choice of algorithm, with a note on which systems share the same implementation (maybe Postgres or SQLite?), would certainly be useful information.

hawkfish self-assigned this Nov 16, 2022
@hawkfish

This is also the Postgres spec, so we should follow it.

@hawkfish

There are actually two issues here. One is that we are not computing the half-open intervals correctly, which is straightforward. The other is that the rewrite does not handle ORDER BY DESC correctly, which is more interesting...
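
For what it's worth, the two behaviours reported above are consistent with two different position formulas for a 0-based offset into the N sorted values. This is a reading of the outputs, not the actual DuckDB internals:

-- Sorted values [0, 1, 2, 10], N = 4, P = 0.32:
-- Oracle/Postgres rule (smallest cume_dist >= P), for P > 0:
--   offset = ceil(P * N) - 1   = ceil(1.28) - 1 = 1  -> value 1 (BigQuery's answer)
-- Floored position (matches the DuckDB output above):
--   offset = floor(P * (N - 1)) = floor(0.96)    = 0  -> value 0
SELECT ceil(0.32 * 4) - 1     AS closed_boundary_offset,  -- 1
       floor(0.32 * (4 - 1))  AS floored_offset;          -- 0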


rnc505 commented Nov 20, 2022

In case it's helpful, my use case depends on solving the half-open intervals bit more than the descending-ordering bit.


hawkfish commented Nov 20, 2022

I think I'm going to break the ORDER BY DESC piece out into a separate issue, as that is really a problem with the substitution rules (sorting DESC is implemented by reversing the quantile values, which does not respect half-open interval definitions). So this issue is now just about fixing the interval definition for quantile_disc.
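
To illustrate the substitution-rule problem with the Oracle example data from earlier in this issue: under the definition, the DESC quantile at q = 0.5 is 2900 (smallest descending cume_dist >= 0.5), while the substitution rewrites it as the ASC quantile at 1 - 0.5 = 0.5, which is 2800 (smallest ascending cume_dist >= 0.5). A minimal sketch of the two queries:

-- Per the definition, DESC at 0.5 should return 2900
-- (see the descending cume_dist table above).
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY col DESC)
FROM VALUES (11000), (3100), (2900), (2800), (2600), (2500) AS tab(col);

-- The substitution instead evaluates ASC at 1 - 0.5 = 0.5, which
-- returns 2800 (see the ascending cume_dist table above), so the
-- rewrite breaks on the closed boundary.
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY col)
FROM VALUES (11000), (3100), (2900), (2800), (2600), (2500) AS tab(col);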

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Nov 20, 2022
Change the discrete interval boundaries to match Oracle.

rnc505 commented Nov 21, 2022

@hawkfish cool solution! I pulled main and tried a naive round for FRN, and that did not work lol

@hawkfish

Unfortunately it seems to have a bizarre representation problem on 32-bit Linux. I'm going to look at using DECIMAL types for the fractions to see if that fixes it.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Nov 21, 2022
Modify or disable the tests on 32 bit
thanks to DECIMAL => DOUBLE conversion differences.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Nov 22, 2022
Add parentheses to try to fix strange Linux failures.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Nov 22, 2022
Use our AbsOperator instead of whatever Linux is using today.
Make ICU TZ sorting deterministic.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Nov 22, 2022
Maintain decimal types to avoid floating point scaling errors.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Nov 23, 2022
Fix the templating so Linux will swallow it.
Mytherin linked a pull request Nov 23, 2022 that will close this issue
Mytherin added a commit that referenced this issue Nov 23, 2022
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Nov 23, 2022
Use a copy constructor instead of double negation.
(As Bjarne intended...)
Tmonster pushed a commit to Tmonster/duckdb that referenced this issue Nov 24, 2022
Change the discrete interval boundaries to match Oracle.
Tmonster pushed a commit to Tmonster/duckdb that referenced this issue Nov 24, 2022
Modify or disable the tests on 32 bit
thanks to DECIMAL => DOUBLE conversion differences.