Skip to content

Conversation

@Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Jul 30, 2024

Which issue does this PR close?

Closes #11720

Rationale for this change

Fix a bug in approx_percentile_cont when having nulls in input

What changes are included in this PR?

Filters input arrays on nulls if present

Are these changes tested?

Yes, additional sql test

Are there any user-facing changes?

Just the fix

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jul 30, 2024
@Dandandan Dandandan changed the title Respect nulls in approx_percentile_cont Handle nulls in approx_percentile_cont Jul 30, 2024

if args.ignore_nulls {
return not_impl_err!(
"IGNORE NULLS clause not yet supported for APPROX_PERCENTILE_CONT"
Copy link
Contributor Author

@Dandandan Dandandan Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Effectively this fix actually does ignore nulls, but we don't have a way to disable setting RESPECT NULLS for aggregate functions, I can file a follow-up ticket.
I believe we should also move DataFusion to SELECT percentile_cont(0.5) WITHIN GROUP(order by v) (used by PostgreSQL, others) syntax...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can trigger it if used as a window function PERCENTILE_CONT(...) OVER (PARTITION BY ... IGNORE NULLS) 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... for now I think it makes sense to ignore the ignore_nulls argument (like we did before) and remove any nulls which makes the most sense. Null handling is broken anyway (resulting in 0 instead of NULL), so no one will rely on percentile_cont on columns with nulls.
I'll file a ticket to discuss standardizing PERCENTILE_CONT and having it behave like PostgreSQL / Spark / Snowflake, etc....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Dandandan

query I
SELECT APPROX_PERCENTILE_CONT(v, 0.5) FROM (VALUES (1), (2), (3), (NULL), (NULL), (NULL)) as t (v);
----
2
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was 0 before (not even NULL)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

APPROX_PERCENTILE_CONT doesn't handle nulls and gives incorrect answer

2 participants