Discrete percentile aggregations deviate from Oracle definitions #5366

Closed
2 tasks done
rnc505 opened this issue Nov 15, 2022 · 9 comments · Fixed by #5442

rnc505 commented Nov 15, 2022

What happens?

Calculating discrete percentile values (via either the percentile_disc or quantile_disc API) appears to be buggy: the results do not align with the Oracle documentation for percentile_disc and diverge (at a minimum) from BigQuery's implementation. I believe the cause is the floor operation applied to the calculated percentile position.

Oracle's documentation defines percentile_disc as:

For a given percentile value P, PERCENTILE_DISC sorts the values of the expression in the ORDER BY clause and returns the value with the smallest CUME_DIST value (with respect to the same sort specification) that is greater than or equal to P.

In the Oracle examples for percentile_disc, the 50th percentile of the descending-ordered values 11000, 3100, 2900, 2800, 2600, 2500 should be 2900, because the cume_dist value for 2900 is 0.5, which is equal to the 50th percentile. However, the equivalent query in DuckDB gives the 50th percentile value as 2800, despite the cume_dist values matching the Oracle example.
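
For concreteness, the "smallest cume_dist greater than or equal to P" rule can be applied mechanically as a plain query over the Oracle example data. A minimal sketch (the subquery alias t is illustrative):

SELECT col
FROM (
  -- cume_dist over the same sort specification as the percentile
  SELECT col, cume_dist() OVER (ORDER BY col DESC) AS cd
  FROM VALUES (11000), (3100), (2900), (2800), (2600), (2500) AS tab(col)
) t
WHERE cd >= 0.5          -- P = 0.5
ORDER BY cd
LIMIT 1;
-- Per the definition this yields 2900, whose cume_dist is exactly 0.5.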

This is also very noticeable on even smaller inputs, such as n = 2. For all percentile values strictly less than 1.0, a discrete percentile over a list of two unique values will return the smaller of the two, whereas the larger value is returned only at the 100th percentile.


I was able to compare this to a query from BigQuery:

SELECT
  DISTINCT PERCENTILE_DISC(value, 0.1) OVER (),
  PERCENTILE_DISC(value, 0.32) OVER (),
  PERCENTILE_DISC(value, 0.33) OVER (),
  PERCENTILE_DISC(value, 0.34) OVER (),
  PERCENTILE_DISC(value, 0.49) OVER (),
  PERCENTILE_DISC(value, 0.50) OVER (),
  PERCENTILE_DISC(value, 0.51) OVER (),
  PERCENTILE_DISC(value, 0.75) OVER (),
  PERCENTILE_DISC(value, 0.9) OVER (),
  PERCENTILE_DISC(value, 0.999) OVER (),
  PERCENTILE_DISC(value, 1) OVER (),
FROM (
  SELECT
    0 AS value
  UNION ALL
  SELECT
    1 AS value
  UNION ALL
  SELECT
    2 AS value
  UNION ALL
  SELECT
    10 AS value )
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬──────┐
│ f0_ │ f1_ │ f2_ │ f3_ │ f4_ │ f5_ │ f6_ │ f7_ │ f8_ │ f9_ │ f10_ │
╞═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪══════╡
│   0 │   1 │   1 │   1 │   1 │   1 │   2 │   2 │  10 │  10 │   10 │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴──────┘

In DuckDB (edited to match the SQL flavor):

duckdb> SELECT quantile_disc(col, [0.1, 0.32, 0.33, 0.34, 0.49, .5, .51, 0.75, 0.9, 0.999, 1])
   ...>     FROM VALUES (0), (1), (2), (10) AS tab(col);
┌────────────────────────────────────────────────────────────────────────────────────────────────┐
│ quantile_disc(col, main.list_value(0.1, 0.32, 0.33, 0.34, 0.49, .5, .51, 0.75, 0.9, 0.999, 1)) │
╞════════════════════════════════════════════════════════════════════════════════════════════════╡
│ [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 10]                                                             │
└────────────────────────────────────────────────────────────────────────────────────────────────┘

To Reproduce

Example From Oracle

Link: https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions111.htm#sthref1776

duckdb>     SELECT percentile_disc(.5) WITHIN GROUP (order by col desc) FROM VALUES (11000), (3100), (2900), (2800), (2600), (2500) AS tab(col);
┌─────────────────────────────────────┐
│ quantile_disc(.5 ORDER BY col DESC) │
╞═════════════════════════════════════╡
│                                2800 │
└─────────────────────────────────────┘
duckdb>  SELECT col, cume_dist() over (ORDER BY col desc),  FROM VALUES (11000), (3100), (2900), (2800), (2600), (2500) AS tab(col);
┌───────┬──────────────────────────────────────┐
│ col   ┆ cume_dist() OVER (ORDER BY col DESC) │
╞═══════╪══════════════════════════════════════╡
│ 11000 ┆                  0.16666666666666666 │
│  3100 ┆                   0.3333333333333333 │
│  2900 ┆                                  0.5 │
│  2800 ┆                   0.6666666666666666 │
│  2600 ┆                   0.8333333333333334 │
│  2500 ┆                                    1 │
└───────┴──────────────────────────────────────┘
duckdb>     SELECT percentile_disc(.5) WITHIN GROUP (order by col) FROM VALUES (11000), (3100), (2900), (2800), (2600), (2500) AS tab(col);

┌────────────────────────────────┐
│ quantile_disc(.5 ORDER BY col) │
╞════════════════════════════════╡
│                           2800 │
└────────────────────────────────┘
duckdb>     SELECT col, cume_dist() over (ORDER BY col ), FROM VALUES (11000), (3100), (2900), (2800), (2600), (2500) AS tab(col);
┌───────┬─────────────────────────────────┐
│ col   ┆ cume_dist() OVER (ORDER BY col) │
╞═══════╪═════════════════════════════════╡
│  2500 ┆             0.16666666666666666 │
│  2600 ┆              0.3333333333333333 │
│  2800 ┆                             0.5 │
│  2900 ┆              0.6666666666666666 │
│  3100 ┆              0.8333333333333334 │
│ 11000 ┆                               1 │
└───────┴─────────────────────────────────┘

Two-number example:

duckdb> SELECT col, cume_dist() over (ORDER BY col ), FROM VALUES (11000), (3100) AS tab(col);
┌───────┬─────────────────────────────────┐
│ col   ┆ cume_dist() OVER (ORDER BY col) │
╞═══════╪═════════════════════════════════╡
│  3100 ┆                             0.5 │
│ 11000 ┆                               1 │
└───────┴─────────────────────────────────┘
duckdb>     SELECT 
   ...> percentile_disc([0.1, 0.25, 0.5, 0.75, 0.9, 1.0]) WITHIN GROUP (order by col)
   ...>     FROM VALUES (11000), (3100) AS tab(col);
┌─────────────────────────────────────────────────────────────────────────────┐
│ quantile_disc(main.list_value(0.1, 0.25, 0.5, 0.75, 0.9, 1.0) ORDER BY col) │
╞═════════════════════════════════════════════════════════════════════════════╡
│ [3100, 3100, 3100, 3100, 3100, 11000]                                       │
└─────────────────────────────────────────────────────────────────────────────┘

OS:

Mac OSX M1

DuckDB Version:

latest

DuckDB Client:

nodejs, shell

Full Name:

Robby Cohen

Affiliation:

Pave

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

rsund commented Nov 15, 2022

You are probably aware that there are at least nine algorithms for estimating quantiles from a sample in common use in (statistical) software packages. So maybe this is not a bug, but rather a (justified) decision to use a particular algorithm?


rnc505 commented Nov 15, 2022

I was not aware of the nine or more different algorithms (reviewing the Pandas documentation, I could only find 5), so thank you for the pointer.

The particular reason I cited the Oracle definition as the authoritative source is that the GitHub issue in which quantile_cont was added references the corresponding Oracle documentation. In fact, that issue contains similar comments about the number of implementations, and it appeared to me that the decision on interpolation for the discrete case was unresolved, or at least open for debate.

You may be correct that this was an explicit decision; however, navigating through git blame all the way back to the initial implementation of the quantiles, it does not appear that any such explicit decision was made.

Additionally, the documentation around this function doesn't provide any information about the decision either.

So you're correct, maybe this isn't a bug, but I think it's worth a clarification or discussion. At a minimum, it would be nice if the interpolation function were configurable, similar to Pandas, so we don't have to roll our own (potentially slow) implementation.


rsund commented Nov 15, 2022

Yes, I agree that it is a general nuisance that quantile estimates are not necessarily reproducible between packages without tuning the algorithms to match. In that sense, fast native implementations of the common algorithms would certainly be a great benefit.

On the other hand, I have not encountered a convincing real-world situation (beyond the aim of exactly reproducing earlier results) in which the choice among reasonable quantile algorithms really mattered; from a pragmatic point of view the results are similar enough to lead to the same decisions, even though they are not identical.

In any case, a brief justification in the documentation for the choice of algorithm, with a note on which systems share the same implementation (maybe Postgres or SQLite?), would certainly be useful information.

hawkfish self-assigned this Nov 16, 2022
@hawkfish

This is also the Postgres spec, so we should follow it.

@hawkfish

There are actually two issues here. One is that we are not computing the half-open intervals correctly, which is straightforward. The other is that the rewrite does not handle ORDER BY DESC correctly, which is more interesting...
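
For what it's worth, the two behaviours reported above are consistent with two different position formulas for a 0-based offset into the N sorted values. This is a reading of the outputs, not the actual DuckDB internals:

-- Sorted values [0, 1, 2, 10], N = 4, P = 0.32:
-- Oracle/Postgres rule (smallest cume_dist >= P), for P > 0:
--   offset = ceil(P * N) - 1   = ceil(1.28) - 1 = 1  -> value 1 (BigQuery's answer)
-- Floored position (matches the DuckDB output above):
--   offset = floor(P * (N - 1)) = floor(0.96)    = 0  -> value 0
SELECT ceil(0.32 * 4) - 1     AS closed_boundary_offset,  -- 1
       floor(0.32 * (4 - 1))  AS floored_offset;          -- 0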


rnc505 commented Nov 20, 2022

In case it's helpful, my use case depends on solving the half-open intervals bit more than the descending-ordering bit.


hawkfish commented Nov 20, 2022

I think I'm going to break the ORDER BY DESC piece out into a separate issue, as that is really a problem with the substitution rules (sorting DESC is implemented by reversing the quantile values, which does not respect half-open interval definitions). So this issue is now just about fixing the interval definition for quantile_disc.
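
To illustrate the substitution-rule problem with the Oracle example data from earlier in this issue: under the definition, the DESC quantile at q = 0.5 is 2900 (smallest descending cume_dist >= 0.5), while the substitution rewrites it as the ASC quantile at 1 - 0.5 = 0.5, which is 2800 (smallest ascending cume_dist >= 0.5). A minimal sketch of the two queries:

-- Per the definition, DESC at 0.5 should return 2900
-- (see the descending cume_dist table above).
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY col DESC)
FROM VALUES (11000), (3100), (2900), (2800), (2600), (2500) AS tab(col);

-- The substitution instead evaluates ASC at 1 - 0.5 = 0.5, which
-- returns 2800 (see the ascending cume_dist table above), so the
-- rewrite breaks on the closed boundary.
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY col)
FROM VALUES (11000), (3100), (2900), (2800), (2600), (2500) AS tab(col);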

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Nov 20, 2022
Change the discrete interval boundaries to match Oracle.

rnc505 commented Nov 21, 2022

@hawkfish cool solution! I pulled main and tried a naive round for FRN, and that did not work lol

@hawkfish

Unfortunately it seems to have a bizarre representation problem on 32-bit Linux. I'm going to look at using DECIMAL types for the fractions to see if that fixes it.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Nov 21, 2022
Modify or disable the tests on 32 bit
thanks to DECIMAL => DOUBLE conversion differences.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Nov 22, 2022
Add parentheses to try to fix strange Linux failures.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Nov 22, 2022
Use our AbsOperator instead of whatever Linux is using today.
Make ICU TZ sorting deterministic.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Nov 22, 2022
Maintain decimal types to avoid floating point scaling errors.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Nov 23, 2022
Fix the templating so Linux will swallow it.
Mytherin linked a pull request Nov 23, 2022 that will close this issue
Mytherin added a commit that referenced this issue Nov 23, 2022
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Nov 23, 2022
Use a copy constructor instead of double negation.
(As Bjarne intended...)
Tmonster pushed a commit to Tmonster/duckdb that referenced this issue Nov 24, 2022
Change the discrete interval boundaries to match Oracle.
Tmonster pushed a commit to Tmonster/duckdb that referenced this issue Nov 24, 2022
Modify or disable the tests on 32 bit
thanks to DECIMAL => DOUBLE conversion differences.