[SPARK-33338][SQL] GROUP BY using literal map should not fail #30246

dongjoon-hyun · 2020-11-04T09:26:57Z

What changes were proposed in this pull request?

This PR aims to make semanticEquals work correctly on GetMapValue expressions having literal maps with ArrayBasedMapData and GenericArrayData.

Why are the changes needed?

This is a regression from Apache Spark 1.6.x.

scala> sc.version
res1: String = 1.6.3

scala> sqlContext.sql("SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]").show
+---+
|_c0|
+---+
| v1|
+---+

Apache Spark 2.x ~ 3.0.1 raise RuntimeException for the following queries.

CREATE TABLE t USING ORC AS SELECT map('k1', 'v1') m, 'k1' k
SELECT map('k1', 'v1')[k] FROM t GROUP BY 1
SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]
SELECT map('k1', 'v1')[k] a FROM t GROUP BY a

BEFORE

Caused by: java.lang.RuntimeException: Couldn't find k#3 in [keys: [k1], values: [v1][k#3]#6]
	at scala.sys.package$.error(package.scala:27)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:85)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:79)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)

AFTER

spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY 1;
v1
Time taken: 1.278 seconds, Fetched 1 row(s)
spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k];
v1
Time taken: 0.313 seconds, Fetched 1 row(s)
spark-sql> SELECT map('k1', 'v1')[k] a FROM t GROUP BY a;
v1
Time taken: 0.265 seconds, Fetched 1 row(s)

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the CIs with the newly added test case.

dongjoon-hyun · 2020-11-04T09:29:57Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala

@@ -316,6 +316,8 @@ case class Literal (value: Any, dataType: DataType) extends LeafExpression {
      (value, o.value) match {
        case (null, null) => true
        case (a: Array[Byte], b: Array[Byte]) => util.Arrays.equals(a, b)
+        case (a: ArrayBasedMapData, b: ArrayBasedMapData) =>
+          a.keyArray == b.keyArray && a.valueArray == b.valueArray


GenericArrayData has equals.

Quick question: why don't we have equals in ArrayBasedMapData?

I also considered that way first, but I didn't do that because of this.

/**
 * This is an internal data representation for map type in Spark SQL. This should not implement
 * `equals` and `hashCode` because the type cannot be used as join keys, grouping keys, or
 * in equality tests. See SPARK-9415 and PR#13847 for the discussions.
 */
abstract class MapData extends Serializable

That's the reason why I focused on literal map equality only.
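
To illustrate the trade-off discussed in this thread, here is a minimal sketch of literal-only equality. `SimpleArrayData`, `SimpleMapData`, and `SimpleLiteral` are hypothetical, simplified stand-ins invented for this example; they are not Spark's actual classes, and the real `Literal.equals` handles more cases than shown here.

```scala
// Hypothetical stand-in for an array type that, like GenericArrayData,
// defines element-wise equality.
final class SimpleArrayData(val values: Seq[Any]) {
  override def equals(other: Any): Boolean = other match {
    case that: SimpleArrayData => values == that.values
    case _ => false
  }
  override def hashCode(): Int = values.hashCode()
}

// Hypothetical stand-in for a map type that, like MapData, deliberately
// does NOT override equals/hashCode, so == falls back to reference equality.
final class SimpleMapData(val keyArray: SimpleArrayData, val valueArray: SimpleArrayData)

// Hypothetical literal wrapper: equality for map values is decided here,
// by comparing the backing arrays, without touching SimpleMapData itself.
final case class SimpleLiteral(value: Any) {
  override def equals(other: Any): Boolean = other match {
    case o: SimpleLiteral =>
      (value, o.value) match {
        case (null, null) => true
        case (a: SimpleMapData, b: SimpleMapData) =>
          a.keyArray == b.keyArray && a.valueArray == b.valueArray
        case (a, b) => a == b
      }
    case _ => false
  }

  // Keep hashCode consistent with the equals defined above.
  override def hashCode(): Int = value match {
    case null => 0
    case m: SimpleMapData => 31 * m.keyArray.hashCode() + m.valueArray.hashCode()
    case v => v.hashCode()
  }
}
```

With these definitions, two separately constructed `SimpleMapData` instances holding the same entries are not equal to each other, while the two `SimpleLiteral` values wrapping them are. The map type stays unusable as a general equality key; only the literal comparison changes.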

SparkQA · 2020-11-04T10:11:48Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35197/

SparkQA · 2020-11-04T10:41:21Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35197/

dongjoon-hyun · 2020-11-04T10:55:07Z

Could you review this, @cloud-fan, @maropu, and @viirya?

SparkQA · 2020-11-04T11:33:53Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35198/

SparkQA · 2020-11-04T12:02:44Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35198/

SparkQA · 2020-11-04T14:01:33Z

Test build #130596 has finished for PR 30246 at commit 8c29c48.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-11-04T14:31:40Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

+      sql(s"CREATE TABLE t USING ORC LOCATION '${dir.toURI}' AS SELECT map('k1', 'v1') m, 'k1' k")
+      Seq(
+        "SELECT map('k1', 'v1')[k] FROM t GROUP BY 1",
+        "SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]",


how about SELECT map('k1', 'v1', 'k2', 'v2')[k] FROM t GROUP BY map('k2', 'v2', 'k1', 'v1')[k]?

Since we don't normalize the literal maps, they are not the same maps, @cloud-fan. We should not handle it here.
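
The ordering point can be sketched in plain Scala. `ArrayBackedMap` below is a hypothetical stand-in for a map stored as two parallel sequences; it is not a Spark class, and the sketch only illustrates why a position-wise comparison treats reordered entries as different maps.

```scala
// Hypothetical stand-in for an array-backed map literal: two parallel sequences.
final case class ArrayBackedMap(keys: Seq[String], values: Seq[String]) {
  // View the same entries as an unordered Scala Map, for comparison only.
  def toScalaMap: Map[String, String] = keys.zip(values).toMap
}

val a = ArrayBackedMap(Seq("k1", "k2"), Seq("v1", "v2"))
val b = ArrayBackedMap(Seq("k2", "k1"), Seq("v2", "v1"))

// Position-wise comparison of the backing sequences: a different entry
// order makes the two values unequal.
val sameAsArrays: Boolean = a == b

// An unordered comparison would treat them as the same map; this is the
// normalization step that is intentionally not performed.
val sameAsMaps: Boolean = a.toScalaMap == b.toScalaMap
```

Here `sameAsArrays` is `false` and `sameAsMaps` is `true`, which matches the reply above: without normalization, the two literals are simply different expressions.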

SparkQA · 2020-11-04T15:11:50Z

Test build #130597 has finished for PR 30246 at commit b7367d7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-11-04T16:34:53Z

Thank you, @HyukjinKwon and @cloud-fan .
Merged to master/3.0/2.4.

Closes #30246 from dongjoon-hyun/SPARK-33338.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 42c0b17)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

viirya

looks good. just a quick question #30246 (comment).

[SPARK-33338][SQL] semanticEquals should handle static GetMapValue co…

8c29c48

…rrectly

dongjoon-hyun commented Nov 4, 2020

Add more test cases

b7367d7

dongjoon-hyun changed the title ~~[SPARK-33338][SQL] semanticEquals should handle static GetMapValue correctly~~ [SPARK-33338][SQL] GROUP BY using literal map should not fail Nov 4, 2020

HyukjinKwon approved these changes Nov 4, 2020

cloud-fan reviewed Nov 4, 2020

dongjoon-hyun closed this in 42c0b17 Nov 4, 2020

dongjoon-hyun deleted the SPARK-33338 branch November 4, 2020 16:50

viirya reviewed Nov 4, 2020

c27kwan mentioned this pull request Sep 6, 2022

[SPARK-40315][SQL] Add equals() and hashCode() to ArrayBasedMapData #37771

Closed
