[SPARK-30707][SQL] Window function set partitionSpec as order spec when orderSpec is empty #27861
```diff
@@ -1691,7 +1691,19 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
   override def visitWindowDef(ctx: WindowDefContext): WindowSpecDefinition = withOrigin(ctx) {
     // CLUSTER BY ... | PARTITION BY ... ORDER BY ...
     val partition = ctx.partition.asScala.map(expression)
-    val order = ctx.sortItem.asScala.map(visitSortItem)
+    val order = if (ctx.sortItem.asScala.nonEmpty) {
+      ctx.sortItem.asScala.map(visitSortItem)
+    } else if (ctx.windowFrame != null &&
+        ctx.windowFrame().frameType.getType == SqlBaseParser.RANGE) {
+      // for RANGE window frame, we won't add default order spec
+      ctx.sortItem.asScala.map(visitSortItem)
+    } else {
+      // Same default behavior as Hive: when the order spec is empty,
+      // set the partition spec expressions as the order spec
+      ctx.partition.asScala.map { expr =>
+        SortOrder(expression(expr), Ascending, Ascending.defaultNullOrdering, Set.empty)
```
Member:
Wait ... why do we set the ordering column to the partition column? We should just leave it unspecified, so that only (non-window) aggregation functions work together with unbounded windows and the result is not affected by the order. This is what the Scala API does.
Contributor (Author):
Hmm, Hive does it like this... To me, when the user does not write an ORDER BY clause, it means they don't care about the result order. For RANGE frames we can't support this.
Member:
But the results will be useless. When can it be useful if the order is indeterministic for functions that depend on the order?
Contributor (Author):
In PostgreSQL, if we don't specify an order column, the result follows the partition column's default sort order.
Member:
I guess that's because PostgreSQL can keep the natural order; Spark can't. Is the PostgreSQL result deterministic?
Contributor (Author):
For the same SQL, the result is deterministic. And adding the partition column as the default ORDER BY column keeps the result deterministic. I hit this problem when migrating Hive SQL to Spark SQL.
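As a side note on the determinism debate: every row in a partition carries the same partition-key value, so sorting by the partition key alone cannot fix the relative order of rows inside a partition. A plain-Scala toy (not Spark code; `Row` and the two "runs" are made up for illustration) shows how an order-sensitive function like last(value) can still flip between runs:

```scala
// Toy demonstration: ordering by the partition key does not pin down the
// order of rows *within* a partition, so an order-sensitive function such
// as last(value) can still differ between runs. Plain Scala, not Spark.
case class Row(key: Int, value: String)

// The same partition's rows, arriving in two different physical orders.
val runA = Seq(Row(1, "x"), Row(1, "y"))
val runB = Seq(Row(1, "y"), Row(1, "x"))

// "ORDER BY key": a stable sort on a constant key preserves arrival order,
// so last() just reflects whatever order the rows happened to arrive in.
val lastA = runA.sortBy(_.key).last.value  // "y"
val lastB = runB.sortBy(_.key).last.value  // "x"
```

This is the Member's point: the derived order spec makes the plan compile, but it does not make order-sensitive results reproducible across shuffles.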
Member:
I think we should not fix it, because on the Spark side the results will be non-deterministic at the very least. I doubt it is good to add this support only for compatibility with other DBMSes when the output is expected to be useless. Disallowing it might be a better idea than finding another problem later caused by different and indeterministic data. Do you know of similar cases in other distributed DBMSs, such as Presto?
Contributor (Author):
But with my fix we add a default order spec, so the result will be deterministic. See spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala, lines 2773 to 2776 at a28ed86.
```diff
+      }
+    }

     // RANGE/ROWS BETWEEN ...
     val frameSpecOption = Option(ctx.windowFrame).map { frame =>
```
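The new branch is easier to see in isolation. Below is a minimal, self-contained Scala sketch of the decision logic, with stub `SortOrder`/`Ascending` types standing in for Catalyst's real classes (the names and signatures here are simplified assumptions, not the actual Catalyst API):

```scala
// Stub types standing in for Catalyst's SortOrder/Ascending (illustrative only).
sealed trait SortDirection { def defaultNullOrdering: String }
case object Ascending extends SortDirection {
  def defaultNullOrdering: String = "NULLS FIRST"
}
case class SortOrder(child: String, direction: SortDirection, nullOrdering: String)

// The PR's rule: keep an explicit ORDER BY; leave a RANGE frame's empty
// spec alone (the analyzer rejects it later); otherwise derive the order
// spec from the partition expressions, Hive-style.
def resolveOrderSpec(
    partitionSpec: Seq[String],
    orderSpec: Seq[SortOrder],
    isRangeFrame: Boolean): Seq[SortOrder] = {
  if (orderSpec.nonEmpty) {
    orderSpec
  } else if (isRangeFrame) {
    orderSpec
  } else {
    partitionSpec.map(e => SortOrder(e, Ascending, Ascending.defaultNullOrdering))
  }
}

// e.g. PARTITION BY four with no ORDER BY and no RANGE frame:
val derived = resolveOrderSpec(Seq("four"), Nil, isRangeFrame = false)
// derived == Seq(SortOrder("four", Ascending, "NULLS FIRST"))
```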
```diff
@@ -14,7 +14,7 @@ struct<>
 SELECT four, ten, SUM(SUM(four)) OVER (PARTITION BY four), AVG(ten) FROM tenk1
 GROUP BY four, ten ORDER BY four, ten
 -- !query schema
-struct<four:int,ten:int,sum(sum(CAST(four AS BIGINT))) OVER (PARTITION BY four ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING):bigint,avg(ten):double>
+struct<four:int,ten:int,sum(sum(CAST(four AS BIGINT))) OVER (PARTITION BY four ORDER BY four ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):bigint,avg(ten):double>
 -- !query output
 0	0	0	0.0
 0	2	0	2.0
```
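The expected output rows in these golden files are unchanged even though the printed frame changes: with the derived ORDER BY four, every row in a PARTITION BY four partition is a peer of every other row, and a RANGE frame ending at CURRENT ROW includes all peers, so it still spans the whole partition, just like ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. A small plain-Scala sketch of that equivalence for count(*) (assumed frame semantics, not Spark's implementation):

```scala
// For each row, compute the frame size under both frames, assuming the
// order key equals the partition key (so all rows in a partition are peers).
case class Row(four: Int)
val partition = Seq(Row(3), Row(3), Row(3), Row(3))  // one PARTITION BY four group

// ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: whole partition.
val rowsFrameCounts = partition.map(_ => partition.size)

// RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW with ORDER BY four:
// the frame ends at the last *peer* of the current row; since every row has
// the same `four`, all rows are peers and the frame is again the partition.
val rangeFrameCounts = partition.map { cur =>
  partition.count(r => r.four <= cur.four)  // equality keeps peers in frame
}
// rowsFrameCounts == rangeFrameCounts == Seq(4, 4, 4, 4)
```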
```diff
@@ -306,7 +306,7 @@ SELECT last(ten) OVER (PARTITION BY four), ten, four FROM
 (SELECT * FROM tenk1 WHERE unique2 < 10 ORDER BY four, ten)s
 ORDER BY four, ten
 -- !query schema
-struct<last(ten, false) OVER (PARTITION BY four ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING):int,ten:int,four:int>
+struct<last(ten, false) OVER (PARTITION BY four ORDER BY four ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):int,ten:int,four:int>
 -- !query output
 4	0	0
 4	0	0
```
```diff
@@ -341,8 +341,7 @@ struct<ten:int,two:int,gsum:bigint,wsum:bigint>
 -- !query
 SELECT count(*) OVER (PARTITION BY four), four FROM (SELECT * FROM tenk1 WHERE two = 1)s WHERE unique2 < 10
 -- !query schema
-struct<count(1) OVER (PARTITION BY four ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING):bigint,four:int>
--- !query output
+struct<count(1) OVER (PARTITION BY four ORDER BY four ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):bigint,four:int>-- !query output
 2	3
 2	3
 4	1
```
```diff
@@ -422,7 +421,7 @@ struct<ten:int,two:int,gsum:bigint,wsum:bigint>
 -- !query
 SELECT count(*) OVER (PARTITION BY four) FROM (SELECT * FROM tenk1 WHERE FALSE)s
```
Member:
Okay, now I completely get what you're trying to do. You want window functions to work without specifying the ordering, while non-window functions already work without an ordering (because their results are deterministic anyway). Yes, -1, for the same reason as @hvanhovell's comment.
```diff
 -- !query schema
-struct<count(1) OVER (PARTITION BY four ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING):bigint>
+struct<count(1) OVER (PARTITION BY four ORDER BY four ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):bigint>
 -- !query output
```
From the ANSI SQL spec:

"If WD has no window ordering clause, then the window ordering is implementation-dependent, and all rows are peers."

I don't think this is a bug fix but rather a new feature. We need to justify it: what's the behavior of other popular SQL systems like Presto, Snowflake, Redshift, etc.? And what's the benefit for end users?
I will check other SQL systems later.
In our production, one benefit is that we can migrate Hive SQL to Spark SQL more smoothly, without rewriting the SQL one by one.