Discrete percentile aggregations deviate from Oracle definitions #5366
Comments
You are probably aware that there are at least nine algorithms for estimating quantiles from a sample commonly used in (statistical) software packages. So maybe this is not a bug, but just a (justified) decision to use a certain algorithm?
I was not aware of the nine-plus different algorithms (reviewing the Pandas documentation, I could only find 5), so thank you for the pointer. The reason in particular I referenced the Oracle definition as the authoritative source was this GitHub issue. You may be correct that this was an explicit decision; however, navigating through git blame all the way back to the initial implementation of quantiles, it doesn't appear that there was any other explicit decision. Additionally, the documentation around this function doesn't provide any information about the decision either. So you're correct, maybe this isn't a bug, but I think it's worth a clarification or discussion. At a minimum, it would be nice if the interpolation function were configurable, similar to Pandas, so we don't have to roll our own (potentially unperformant) implementation.
Yes, I agree that it is a general nuisance that quantile estimation is not necessarily reproducible between packages without tuning the algorithms to match. In this sense, fast native implementations of the common algorithms would certainly be a great benefit. On the other hand, I have not encountered any convincing real-world situation (beyond the aim of exactly reproducing earlier results) in which the choice among reasonable quantile algorithms really mattered; from a pragmatic point of view, the results are similar enough to lead to the same kind of decisions, even though they are not identical. In any case, a brief justification for the choice of algorithm, and a note in the documentation telling which systems have the same implementation (maybe Postgres or SQLite?), would certainly be useful information.
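For a concrete sense of how much the estimator choice moves the answer, here is a minimal sketch (not from the original thread) using DuckDB's two built-in quantile flavors on the six values discussed below; which row `quantile_disc` picks is exactly the behavior under dispute in this issue.

```sql
-- Sketch: the same six values, two quantile estimators.
-- quantile_cont interpolates between rows; quantile_disc must return
-- an actual row value, and which row it picks is the subject of this issue.
SELECT
    quantile_cont(amt, 0.5) AS continuous_median, -- 2850.0, interpolated
    quantile_disc(amt, 0.5) AS discrete_median    -- 2800 or 2900, depending on the interval convention
FROM (VALUES (11000), (3100), (2900), (2800), (2600), (2500)) t(amt);
```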
This is also the Postgres spec, so we should follow it.
There are actually two issues here. One is that we are not computing the half-open intervals correctly, which is straightforward. The other is that the rewrite does not handle descending ordering.
In case it's helpful, my use case relies on solving the half-open intervals bit more than the descending ordering bit.
I think I'm going to break out the…
Change the discrete interval boundaries to match Oracle.
@hawkfish cool solution - I pulled main and tried a naive…
Unfortunately it seems to have a bizarre representation problem on 32-bit Linux. I'm going to look at using…
Modify or disable the tests on 32 bit thanks to DECIMAL => DOUBLE conversion differences.
Add parentheses to try to fix strange Linux failures.
Use our AbsOperator instead of whatever Linux is using today.
Make ICU TZ sorting deterministic.
Maintain decimal types to avoid floating point scaling errors.
Fix the templating so Linux will swallow it.
Issue #5366: QUANTILE_DISC Intervals
Use a copy constructor instead of double negation. (As Bjarne intended...)
What happens?
Calculating discrete percentile values (regardless of the `percentile_disc` or `quantile_disc` API) appears to have a bug: the results don't align with the Oracle documentation for `percentile_disc` and diverge (at a minimum) from BigQuery's implementation. I believe it has to do with the floor operation on the calculated percentile position.

Oracle's documentation defines `percentile_disc` as:

In the Oracle examples for `percentile_disc`, the 50th percentile of the descending-ordered values 11000, 3100, 2900, 2800, 2600, 2500 should be 2900, because the `cume_dist` value for 2900 is 0.5, which is equal to the 50th percentile. However, a similar query in DuckDB gives the 50th percentile as 2800, despite the `cume_dist` values matching the Oracle example.

This is also very noticeable on even smaller inputs, such as n = 2. For all percentile values strictly less than 1.0, a discrete percentile over a list of two unique values will return the smaller of the two; the larger value is returned only when asking for the 100th percentile.
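As an illustration of the expected answer, one can compute `cume_dist` over the example values directly; this is a sketch, not a query from the original report:

```sql
-- In descending order, 2900 is the first value whose cumulative
-- distribution reaches 0.5, so by the Oracle definition
-- percentile_disc(0.5) should return 2900.
SELECT amt, cume_dist() OVER (ORDER BY amt DESC) AS cd
FROM (VALUES (11000), (3100), (2900), (2800), (2600), (2500)) t(amt)
ORDER BY cd;
```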
I was able to compare this to a query from BigQuery:
In DuckDB (edited to match the SQL flavor):
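The queries themselves were not preserved above; a hedged reconstruction of what the comparison might have looked like, using the values from the Oracle example, is:

```sql
-- BigQuery flavor (there PERCENTILE_DISC is an analytic function):
--   SELECT DISTINCT PERCENTILE_DISC(amt, 0.5) OVER () AS p50
--   FROM UNNEST([11000, 3100, 2900, 2800, 2600, 2500]) AS amt;
-- Per this report, BigQuery's result diverges from DuckDB's.

-- DuckDB flavor of the same question:
SELECT quantile_disc(amt, 0.5) AS p50
FROM (VALUES (11000), (3100), (2900), (2800), (2600), (2500)) t(amt);
-- Reportedly returns 2800.
```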
To Reproduce
Example From Oracle
Link: https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions111.htm#sthref1776
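The reproduction snippet is missing above; a sketch along the lines of the linked Oracle example (assuming DuckDB's `WITHIN GROUP` spelling, with the salary values from that page) would be:

```sql
-- Oracle's documentation computes this over employees.salary:
--   PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY salary DESC)
-- and documents the answer 2900 for these values.

-- The same shape of query in DuckDB:
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY salary DESC) AS p50
FROM (VALUES (11000), (3100), (2900), (2800), (2600), (2500)) t(salary);
-- Expected per Oracle: 2900. Observed per this report: 2800.
```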
Two number example:
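The two-number snippet is also missing; a minimal sketch consistent with the behavior described above (the values 100 and 200 are invented) would be:

```sql
-- With two distinct values, cume_dist(100) = 0.5 and cume_dist(200) = 1.0,
-- so by the Oracle definition any percentile above 0.5 should return 200.
SELECT quantile_disc(v, 0.99) AS p99,  -- reportedly returns 100 instead of 200
       quantile_disc(v, 1.00) AS p100  -- returns 200
FROM (VALUES (100), (200)) t(v);
```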
OS:
Mac OSX M1
DuckDB Version:
latest
DuckDB Client:
nodejs, shell
Full Name:
Robby Cohen
Affiliation:
Pave
Have you tried this on the latest `master` branch?
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?