-
Notifications
You must be signed in to change notification settings - Fork 28.2k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[SPARK-27619][SQL] MapType should be prohibited in hash expressions
### What changes were proposed in this pull request? `hash()` and `xxhash64()` cannot be used on elements of `Maptype`. A new configuration `spark.sql.legacy.useHashOnMapType` is introduced to allow users to restore the previous behaviour. When `spark.sql.legacy.useHashOnMapType` is set to false: ``` scala> spark.sql("select hash(map())"); org.apache.spark.sql.AnalysisException: cannot resolve 'hash(map())' due to data type mismatch: input to function hash cannot contain elements of MapType; line 1 pos 7; 'Project [unresolvedalias(hash(map(), 42), None)] +- OneRowRelation ``` when `spark.sql.legacy.useHashOnMapType` is set to true : ``` scala> spark.sql("set spark.sql.legacy.useHashOnMapType=true"); res3: org.apache.spark.sql.DataFrame = [key: string, value: string] scala> spark.sql("select hash(map())").first() res4: org.apache.spark.sql.Row = [42] ``` ### Why are the changes needed? As discussed in Jira, SparkSql's map hashcodes depends on their order of insertion which is not consistent with the normal scala behaviour which might confuse users. Code snippet from JIRA : ``` val a = spark.createDataset(Map(1->1, 2->2) :: Nil) val b = spark.createDataset(Map(2->2, 1->1) :: Nil) // Demonstration of how Scala Map equality is unaffected by insertion order: assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode()) assert(Map(1->1, 2->2) == Map(2->2, 1->1)) assert(a.first() == b.first()) // In contrast, this will print two different hashcodes: println(Seq(a, b).map(_.selectExpr("hash(*)").first())) ``` Also `MapType` is prohibited for aggregation / joins / equality comparisons #7819 and set operations #17236. ### Does this PR introduce any user-facing change? Yes. Now users cannot use hash functions on elements of `mapType`. To restore the previous behaviour set `spark.sql.legacy.useHashOnMapType` to true. ### How was this patch tested? UT added. Closes #27580 from iRakson/SPARK-27619. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
- Loading branch information
Showing
5 changed files
with
39 additions
and
17 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters