
[SPARK-19893][SQL] should not run DataFrame set operations with map type #17236

Closed · wants to merge 2 commits

Conversation


@cloud-fan cloud-fan commented Mar 10, 2017

What changes were proposed in this pull request?

In Spark SQL, the map type can't be used in equality tests or comparisons. Since `Intersect`/`Except`/`Distinct` require equality tests on all columns, we should not allow map type columns in `Intersect`/`Except`/`Distinct`.

How was this patch tested?

A new regression test.
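The check the PR adds has to look for map types recursively, since a map can be nested inside a struct or array column. The following is a simplified, pure-Scala model of that idea; the `DType` case classes and `hasMapType` helper are illustrative stand-ins, not Spark's actual `DataType` classes.

```scala
// Hypothetical, simplified model of a recursive "contains a map type" check.
// These types are illustrative only, not Spark's catalyst DataType hierarchy.
sealed trait DType
case object IntType extends DType
case class MapDType(key: DType, value: DType) extends DType
case class ArrayDType(element: DType) extends DType
case class StructDType(fields: Seq[DType]) extends DType

def hasMapType(dt: DType): Boolean = dt match {
  case MapDType(_, _)  => true
  case ArrayDType(e)   => hasMapType(e)
  case StructDType(fs) => fs.exists(hasMapType)
  case _               => false
}

// A map nested inside a struct must be rejected too, not just top-level maps.
assert(hasMapType(StructDType(Seq(IntType, MapDType(IntType, IntType)))))
assert(!hasMapType(ArrayDType(IntType)))
```

The recursive walk is the important part: a shallow check on the column's outermost type would miss maps hidden inside structs or arrays.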

@cloud-fan (Contributor Author)

cc @yhuai @sameeragarwal @hvanhovell


SparkQA commented Mar 10, 2017

Test build #74300 has started for PR 17236 at commit d670a11.

@cloud-fan changed the title from "[SPARK-xxxx][SQL] Cannot run intersect/except with map type" to "[SPARK-19893][SQL] Cannot run intersect/except with map type" on Mar 10, 2017
@cloud-fan (Contributor Author)

retest this please

```diff
@@ -319,7 +322,7 @@ trait CheckAnalysis extends PredicateHelper {
         // Check if the data types match.
         dataTypes(child).zip(ref).zipWithIndex.foreach { case ((dt1, dt2), ci) =>
           // SPARK-18058: we shall not care about the nullability of columns
-          if (TypeCoercion.findWiderTypeForTwo(dt1.asNullable, dt2.asNullable).isEmpty) {
+          if (!dt1.sameType(dt2)) {
```

We shouldn't change this. This makes sure we generate the correct error message for union. The problem is that an earlier pair of columns might not be the same but castable, while a later pair column is not the same and we can't cast them; the error should show the latter pair of columns and not the first one (which happens if we revert this).
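The ordering concern can be illustrated with a toy model of the widening check; the `widerType` function below is a hypothetical, much-simplified stand-in for `TypeCoercion.findWiderTypeForTwo`, and the string type names are illustrative only.

```scala
// Toy stand-in for the widening check: succeeds for castable pairs,
// fails for genuinely incompatible ones. Hypothetical, not Spark's API.
def widerType(a: String, b: String): Option[String] = (a, b) match {
  case (x, y) if x == y                  => Some(x)
  case ("int", "long") | ("long", "int") => Some("long")
  case _                                 => None
}

// Column 0: int vs long differ but are castable -> no error should be raised here.
// Column 1: int vs map differ and are NOT castable -> the error must point here.
val pairs = Seq(("int", "long"), ("int", "map"))
val firstBadColumn = pairs.indexWhere { case (a, b) => widerType(a, b).isEmpty }
assert(firstBadColumn == 1) // a plain same-type check would instead flag column 0
```

This is why reverting the check to a strict same-type comparison would misreport the first merely-castable column pair instead of the truly incompatible one.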


See #16882 for more information.

@hvanhovell (Contributor)

@cloud-fan can we also check Distinct? We rewrite this into an Aggregate in the optimizer, which has the same problem with MapType.


SparkQA commented Mar 10, 2017

Test build #74309 has finished for PR 17236 at commit d670a11.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan changed the title from "[SPARK-19893][SQL] Cannot run intersect/except with map type" to "[SPARK-19893][SQL] Cannot run intersect/except/distinct with map type" on Mar 10, 2017
@hvanhovell (Contributor)

LGTM pending jenkins


SparkQA commented Mar 10, 2017

Test build #74316 has finished for PR 17236 at commit 268a4a0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 10, 2017

Test build #74317 has finished for PR 17236 at commit eb49284.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
val e2 = intercept[AnalysisException](df.except(df))
assert(e2.message.contains(
  "Cannot have map type columns in DataFrame which calls intersect/except/distinct"))
val e3 = intercept[AnalysisException](df.distinct())
```
For some reason I don't understand, the code path for `df.distinct()` actually creates an Aggregate, whereas the SQL code path uses the Distinct operator. So we also need to issue a SQL statement here to cover that path.

@cloud-fan changed the title from "[SPARK-19893][SQL] Cannot run intersect/except/distinct with map type" to "[SPARK-19893][SQL] should not run DataFrame set operations with map type" on Mar 10, 2017

SparkQA commented Mar 10, 2017

Test build #74329 has finished for PR 17236 at commit 92a7efb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 10, 2017

Test build #74343 has finished for PR 17236 at commit 54f4ae1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor Author)

thanks for the review, merging to master!

@asfgit asfgit closed this in fb9beda Mar 11, 2017
asfgit pushed a commit that referenced this pull request Mar 11, 2017
In Spark SQL, the map type can't be used in equality tests or comparisons. Since `Intersect`/`Except`/`Distinct` require equality tests on all columns, we should not allow map type columns in `Intersect`/`Except`/`Distinct`.

new regression test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #17236 from cloud-fan/map.

(cherry picked from commit fb9beda)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asfgit pushed a commit that referenced this pull request Mar 11, 2017
@cloud-fan (Contributor Author)

backported to 2.1/2.0


srowen commented Mar 15, 2017

@cloud-fan I think branch-2.0 unfortunately fails to compile after this commit, such as in
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.0-test-maven-hadoop-2.4/990/

```
[info] Compiling 174 Scala sources and 19 Java sources to /home/jenkins/workspace/spark-branch-2.0-test-maven-hadoop-2.4/sql/core/target/scala-2.11/test-classes...
[error] /home/jenkins/workspace/spark-branch-2.0-test-maven-hadoop-2.4/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:1704: Missing closing brace `}' assumed here
[error] }
[error] ^
```

@cloud-fan (Contributor Author)

I'm very sorry about this; I've fixed it in branch-2.0. Thanks for pointing it out, @srowen!

@magic890

@cloud-fan @yhuai @sameeragarwal @hvanhovell Any plan to re-enable Intersect / Except / Distinct operations for Map type?
The current behaviour prevents even left_anti joins.

cloud-fan pushed a commit that referenced this pull request Feb 26, 2020
### What changes were proposed in this pull request?
`hash()` and `xxhash64()` can no longer be used on elements of `MapType`. A new configuration, `spark.sql.legacy.useHashOnMapType`, is introduced to allow users to restore the previous behaviour.

When `spark.sql.legacy.useHashOnMapType` is set to false:

```
scala> spark.sql("select hash(map())");
org.apache.spark.sql.AnalysisException: cannot resolve 'hash(map())' due to data type mismatch: input to function hash cannot contain elements of MapType; line 1 pos 7;
'Project [unresolvedalias(hash(map(), 42), None)]
+- OneRowRelation
```

When `spark.sql.legacy.useHashOnMapType` is set to true:

```
scala> spark.sql("set spark.sql.legacy.useHashOnMapType=true");
res3: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("select hash(map())").first()
res4: org.apache.spark.sql.Row = [42]

```

### Why are the changes needed?

As discussed in the JIRA, Spark SQL's map hash codes depend on insertion order, which is inconsistent with normal Scala behaviour and might confuse users.
Code snippet from the JIRA:
```
val a = spark.createDataset(Map(1->1, 2->2) :: Nil)
val b = spark.createDataset(Map(2->2, 1->1) :: Nil)

// Demonstration of how Scala Map equality is unaffected by insertion order:
assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode())
assert(Map(1->1, 2->2) == Map(2->2, 1->1))
assert(a.first() == b.first())

// In contrast, this will print two different hashcodes:
println(Seq(a, b).map(_.selectExpr("hash(*)").first()))
```

Also `MapType` is prohibited for aggregation / joins / equality comparisons #7819 and set operations #17236.

### Does this PR introduce any user-facing change?
Yes. Now users cannot use hash functions on elements of `mapType`. To restore the previous behaviour set `spark.sql.legacy.useHashOnMapType` to true.

### How was this patch tested?
UT added.

Closes #27580 from iRakson/SPARK-27619.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
tallamjr pushed a commit to tallamjr/spark that referenced this pull request Feb 26, 2020
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
@magic890

@iRakson @cloud-fan @tarekbecker @w4-sjcho @yhuai @sameeragarwal @hvanhovell
Any plan to re-enable Intersect / Except / Distinct operations for the map type?
The current behaviour prevents even left_anti joins, not only the hash operations.

@cloud-fan (Contributor Author)

You can work around it by turning the map into an array with the `map_entries` function. We don't plan to re-enable these operations unless we make the map type orderable.
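The idea behind the workaround can be mimicked in plain Scala (hypothetical data, no Spark needed for the sketch; in Spark you would apply `map_entries` to the column and sort the resulting array before the set operation):

```scala
// Pure-Scala analogue of the map_entries workaround: a map is not orderable,
// but its entries sorted by key form an equality-safe, deterministic sequence.
val rows = Seq(Map(1 -> "a", 2 -> "b"), Map(2 -> "b", 1 -> "a"))

// Convert each map to a sorted sequence of entries, as map_entries would expose,
// then distinct behaves deterministically on the ordered representation.
val asEntries = rows.map(_.toSeq.sortBy(_._1))
assert(asEntries.distinct.size == 1) // both rows represent the same map
```

Sorting by key is what makes the representation order-independent; without the sort, two equal maps could still produce differently ordered entry arrays.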
