
[SPARK-47572][SQL] Enforce Window partitionSpec is orderable. #45730

Closed

Conversation

@chenhao-db (Contributor) commented Mar 26, 2024

What changes were proposed in this pull request?

In the `Window` node, both `partitionSpec` and `orderSpec` must be orderable, but the current type check only verifies that `orderSpec` is orderable. This can cause an error in later optimization phases.

Given a query:

```
with t as (select id, map(id, id) as m from range(0, 10))
select rank() over (partition by m order by id) from t
```

Before the PR, it fails with an `INTERNAL_ERROR`:

```
org.apache.spark.SparkException: [INTERNAL_ERROR] grouping/join/window partition keys cannot be map type. SQLSTATE: XX000
at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
at org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers$.needNormalize(NormalizeFloatingNumbers.scala:103)
at org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers$.org$apache$spark$sql$catalyst$optimizer$NormalizeFloatingNumbers$$needNormalize(NormalizeFloatingNumbers.scala:94)
...
```

After the PR, it fails with an `EXPRESSION_TYPE_IS_NOT_ORDERABLE` error, which is expected:

```
  org.apache.spark.sql.catalyst.ExtendedAnalysisException: [EXPRESSION_TYPE_IS_NOT_ORDERABLE] Column expression "m" cannot be sorted because its type "MAP<BIGINT, BIGINT>" is not orderable. SQLSTATE: 42822; line 2 pos 53;
Project [RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4]
+- Project [id#1L, m#0, RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4, RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4]
   +- Window [rank(id#1L) windowspecdefinition(m#0, id#1L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4], [m#0], [id#1L ASC NULLS FIRST]
      +- Project [id#1L, m#0]
         +- SubqueryAlias t
            +- SubqueryAlias t
               +- Project [id#1L, map(id#1L, id#1L) AS m#0]
                  +- Range (0, 10, step=1, splits=None)
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
...
```

How was this patch tested?

Unit test.
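
For illustration only, a minimal sketch of what such a test could look like, assuming a ScalaTest suite backed by `SharedSparkSession`; the suite name, location, and exact assertions here are guesses, not taken from this PR:

```
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession

// Hypothetical suite name; the actual test added by this PR may live elsewhere.
class WindowPartitionSpecSuite extends QueryTest with SharedSparkSession {
  test("SPARK-47572: window partitionSpec must be orderable") {
    val query =
      """with t as (select id, map(id, id) as m from range(0, 10))
        |select rank() over (partition by m order by id) from t""".stripMargin
    // Analysis should now reject the map-typed partition key up front.
    val e = intercept[AnalysisException](sql(query).collect())
    assert(e.getErrorClass == "EXPRESSION_TYPE_IS_NOT_ORDERABLE")
  }
}
```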

@cloud-fan (Contributor):
I think ordering and equality are two different capabilities. For example, the map type is not orderable, but it can be used as a grouping key.
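
(A plain-Scala illustration of this distinction, not tied to Spark or to this PR: a type can support equality, and therefore grouping, without having any natural ordering.)

```
// A type with well-defined equality and hashing (usable as a grouping key)
// but no natural ordering (cannot be sorted without supplying one).
case class Point(x: Int, y: Int)

val points = Seq(Point(1, 2), Point(1, 2), Point(3, 4))

// Grouping only needs equality/hashing, which case classes provide:
val counts = points.groupBy(identity).map { case (p, ps) => p -> ps.size }
// counts: Map(Point(1,2) -> 2, Point(3,4) -> 1)

// Sorting needs an Ordering[Point]; this line would not compile as written:
// points.sorted
```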

@cloud-fan (Contributor):
What is the behavior today? Do we allow using the map type as a window partition key?

@chenhao-db (Contributor, Author) commented Mar 29, 2024

Things get a bit more complex after #45549.

> For example, the map type is not orderable, but it can be used as a grouping key.

It can only be used as a grouping key after #45549. However, with the map normalization and comparison code introduced in that PR, the map type could also be made orderable.

> What is the behavior today?

Before #45549, the SQL snippet in my PR description would fail with an `INTERNAL_ERROR`. After #45549, it would execute successfully. When a column is used as a window partition key, it is translated into a local `Sort` over a `HashPartitioning`. Although the current type check in `SortOrder` rejects the map type, that check doesn't run when the `SortOrder` is constructed from a window partition key, so the `Sort` node can successfully compare map inputs, because #45549 supports map comparison.

There are still some issues with sorting maps, because the map normalization code is only inserted for aggregates, not for sorts. But I think they are fixable, and the map type is indeed orderable. However, as long as a type is not considered orderable, it cannot be a window partition key, so the code change in this PR is still necessary.
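
For reference, a small snippet that reproduces the scenario above, e.g. in spark-shell; which of the three outcomes you see depends on whether #45549 and this PR are in the build:

```
// The query from the PR description; outcomes differ by build:
//   - before #45549: INTERNAL_ERROR from NormalizeFloatingNumbers at optimization time
//   - after #45549, without this PR: the query runs, sorting rows by the map column
//   - with this PR: analysis fails with EXPRESSION_TYPE_IS_NOT_ORDERABLE
val df = spark.sql(
  """with t as (select id, map(id, id) as m from range(0, 10))
    |select rank() over (partition by m order by id) from t""".stripMargin)
df.show()
```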

The review thread below is on this hunk from the diff:

```
partitionSpec.foreach { p =>
  if (!RowOrdering.isOrderable(p.dataType)) {
    p.dataTypeMismatch(p, TypeCheckResult.DataTypeMismatch(
      errorSubClass = "INVALID_ORDERING_TYPE",
```
Contributor:
Why not use the error class `EXPRESSION_TYPE_IS_NOT_ORDERABLE`?

Contributor Author:
I cannot really tell the difference between these two error classes, so I just picked one arbitrarily. I have now changed it to `EXPRESSION_TYPE_IS_NOT_ORDERABLE`.

@cloud-fan (Contributor):
@chenhao-db thanks for the analysis! Let's be strict first; we can relax the restriction on the map type later if needed.

@cloud-fan (Contributor):
thanks, merging to master!

@cloud-fan closed this in 609bd48 on Mar 29, 2024.
sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024
Closes apache#45730 from chenhao-db/SPARK-47572.

Authored-by: Chenhao Li <chenhao.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>