[SPARK-9415][SQL] Throw AnalysisException when using MapType on Join and Aggregate #7819

viirya · 2015-07-31T07:01:26Z

JIRA: https://issues.apache.org/jira/browse/SPARK-9415

Following up #7787. We shouldn't use MapType as grouping keys and join keys too.

viirya · 2015-07-31T07:01:53Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

              case _ => // OK
            }

            aggregateExprs.foreach(checkValidAggregateExpression)
-            aggregateExprs.foreach(checkValidGroupingExprs)
+            groupingExprs.foreach(checkValidGroupingExprs)


A bug in #7787.

SparkQA · 2015-07-31T07:46:46Z

Test build #39180 has finished for PR 7819 at commit 7463398.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-31T12:35:16Z

Test build #39215 has finished for PR 7819 at commit 005ee0c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2015-08-01T05:26:16Z

Thanks - I've merged this.

### What changes were proposed in this pull request? `hash()` and `xxhash64()` cannot be used on elements of `Maptype`. A new configuration `spark.sql.legacy.useHashOnMapType` is introduced to allow users to restore the previous behaviour. When `spark.sql.legacy.useHashOnMapType` is set to false: ``` scala> spark.sql("select hash(map())"); org.apache.spark.sql.AnalysisException: cannot resolve 'hash(map())' due to data type mismatch: input to function hash cannot contain elements of MapType; line 1 pos 7; 'Project [unresolvedalias(hash(map(), 42), None)] +- OneRowRelation ``` when `spark.sql.legacy.useHashOnMapType` is set to true : ``` scala> spark.sql("set spark.sql.legacy.useHashOnMapType=true"); res3: org.apache.spark.sql.DataFrame = [key: string, value: string] scala> spark.sql("select hash(map())").first() res4: org.apache.spark.sql.Row = [42] ``` ### Why are the changes needed? As discussed in Jira, SparkSql's map hashcodes depends on their order of insertion which is not consistent with the normal scala behaviour which might confuse users. Code snippet from JIRA : ``` val a = spark.createDataset(Map(1->1, 2->2) :: Nil) val b = spark.createDataset(Map(2->2, 1->1) :: Nil) // Demonstration of how Scala Map equality is unaffected by insertion order: assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode()) assert(Map(1->1, 2->2) == Map(2->2, 1->1)) assert(a.first() == b.first()) // In contrast, this will print two different hashcodes: println(Seq(a, b).map(_.selectExpr("hash(*)").first())) ``` Also `MapType` is prohibited for aggregation / joins / equality comparisons #7819 and set operations #17236. ### Does this PR introduce any user-facing change? Yes. Now users cannot use hash functions on elements of `mapType`. To restore the previous behaviour set `spark.sql.legacy.useHashOnMapType` to true. ### How was this patch tested? UT added. Closes #27580 from iRakson/SPARK-27619. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

### What changes were proposed in this pull request? `hash()` and `xxhash64()` cannot be used on elements of `Maptype`. A new configuration `spark.sql.legacy.useHashOnMapType` is introduced to allow users to restore the previous behaviour. When `spark.sql.legacy.useHashOnMapType` is set to false: ``` scala> spark.sql("select hash(map())"); org.apache.spark.sql.AnalysisException: cannot resolve 'hash(map())' due to data type mismatch: input to function hash cannot contain elements of MapType; line 1 pos 7; 'Project [unresolvedalias(hash(map(), 42), None)] +- OneRowRelation ``` when `spark.sql.legacy.useHashOnMapType` is set to true : ``` scala> spark.sql("set spark.sql.legacy.useHashOnMapType=true"); res3: org.apache.spark.sql.DataFrame = [key: string, value: string] scala> spark.sql("select hash(map())").first() res4: org.apache.spark.sql.Row = [42] ``` ### Why are the changes needed? As discussed in Jira, SparkSql's map hashcodes depends on their order of insertion which is not consistent with the normal scala behaviour which might confuse users. Code snippet from JIRA : ``` val a = spark.createDataset(Map(1->1, 2->2) :: Nil) val b = spark.createDataset(Map(2->2, 1->1) :: Nil) // Demonstration of how Scala Map equality is unaffected by insertion order: assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode()) assert(Map(1->1, 2->2) == Map(2->2, 1->1)) assert(a.first() == b.first()) // In contrast, this will print two different hashcodes: println(Seq(a, b).map(_.selectExpr("hash(*)").first())) ``` Also `MapType` is prohibited for aggregation / joins / equality comparisons apache#7819 and set operations apache#17236. ### Does this PR introduce any user-facing change? Yes. Now users cannot use hash functions on elements of `mapType`. To restore the previous behaviour set `spark.sql.legacy.useHashOnMapType` to true. ### How was this patch tested? UT added. Closes apache#27580 from iRakson/SPARK-27619. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit c913b9d) Signed-off-by: Wenchen Fan <wenchen@databricks.com>

### What changes were proposed in this pull request? `hash()` and `xxhash64()` cannot be used on elements of `Maptype`. A new configuration `spark.sql.legacy.useHashOnMapType` is introduced to allow users to restore the previous behaviour. When `spark.sql.legacy.useHashOnMapType` is set to false: ``` scala> spark.sql("select hash(map())"); org.apache.spark.sql.AnalysisException: cannot resolve 'hash(map())' due to data type mismatch: input to function hash cannot contain elements of MapType; line 1 pos 7; 'Project [unresolvedalias(hash(map(), 42), None)] +- OneRowRelation ``` when `spark.sql.legacy.useHashOnMapType` is set to true : ``` scala> spark.sql("set spark.sql.legacy.useHashOnMapType=true"); res3: org.apache.spark.sql.DataFrame = [key: string, value: string] scala> spark.sql("select hash(map())").first() res4: org.apache.spark.sql.Row = [42] ``` ### Why are the changes needed? As discussed in Jira, SparkSql's map hashcodes depends on their order of insertion which is not consistent with the normal scala behaviour which might confuse users. Code snippet from JIRA : ``` val a = spark.createDataset(Map(1->1, 2->2) :: Nil) val b = spark.createDataset(Map(2->2, 1->1) :: Nil) // Demonstration of how Scala Map equality is unaffected by insertion order: assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode()) assert(Map(1->1, 2->2) == Map(2->2, 1->1)) assert(a.first() == b.first()) // In contrast, this will print two different hashcodes: println(Seq(a, b).map(_.selectExpr("hash(*)").first())) ``` Also `MapType` is prohibited for aggregation / joins / equality comparisons apache#7819 and set operations apache#17236. ### Does this PR introduce any user-facing change? Yes. Now users cannot use hash functions on elements of `mapType`. To restore the previous behaviour set `spark.sql.legacy.useHashOnMapType` to true. ### How was this patch tested? UT added. Closes apache#27580 from iRakson/SPARK-27619. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

MapType can't be used as join keys, grouping keys.

7463398

viirya reviewed Jul 31, 2015
View reviewed changes

For comments.

005ee0c

asfgit closed this in 3320b0b Aug 1, 2015

iRakson mentioned this pull request Feb 14, 2020

[SPARK-27619][SQL]MapType should be prohibited in hash expressions #27580

Closed

viirya deleted the map_join_groupby branch December 27, 2023 18:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-9415][SQL] Throw AnalysisException when using MapType on Join and Aggregate #7819

[SPARK-9415][SQL] Throw AnalysisException when using MapType on Join and Aggregate #7819

Uh oh!

viirya commented Jul 31, 2015

Uh oh!

viirya Jul 31, 2015

Uh oh!

SparkQA commented Jul 31, 2015

Uh oh!

SparkQA commented Jul 31, 2015

Uh oh!

rxin commented Aug 1, 2015

Uh oh!

Uh oh!

[SPARK-9415][SQL] Throw AnalysisException when using MapType on Join and Aggregate #7819

[SPARK-9415][SQL] Throw AnalysisException when using MapType on Join and Aggregate #7819

Uh oh!

Conversation

viirya commented Jul 31, 2015

Uh oh!

viirya Jul 31, 2015

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Jul 31, 2015

Uh oh!

SparkQA commented Jul 31, 2015

Uh oh!

rxin commented Aug 1, 2015

Uh oh!

Uh oh!