
[SPARK-19893][SQL] should not run DataFrame set operations with map type #17236

Closed · wants to merge 2 commits

Conversation


@cloud-fan cloud-fan commented Mar 10, 2017

What changes were proposed in this pull request?

In Spark SQL, the map type can't be used in equality tests or comparisons. Since `Intersect`/`Except`/`Distinct` require equality tests on all columns, we should not allow map type columns in `Intersect`/`Except`/`Distinct`.

How was this patch tested?

A new regression test.
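The check the PR adds has to look for map types recursively, since a map can be nested inside a struct or array column. The following is a simplified, pure-Scala model of that idea; the `DType` case classes and `hasMapType` helper are illustrative stand-ins, not Spark's actual `DataType` classes.

```scala
// Hypothetical, simplified model of a recursive "contains a map type" check.
// These types are illustrative only, not Spark's catalyst DataType hierarchy.
sealed trait DType
case object IntType extends DType
case class MapDType(key: DType, value: DType) extends DType
case class ArrayDType(element: DType) extends DType
case class StructDType(fields: Seq[DType]) extends DType

def hasMapType(dt: DType): Boolean = dt match {
  case MapDType(_, _)  => true
  case ArrayDType(e)   => hasMapType(e)
  case StructDType(fs) => fs.exists(hasMapType)
  case _               => false
}

// A map nested inside a struct must be rejected too, not just top-level maps.
assert(hasMapType(StructDType(Seq(IntType, MapDType(IntType, IntType)))))
assert(!hasMapType(ArrayDType(IntType)))
```

The recursive walk is the important part: a shallow check on the column's outermost type would miss maps hidden inside structs or arrays.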

@cloud-fan (Contributor Author)

cc @yhuai @sameeragarwal @hvanhovell


SparkQA commented Mar 10, 2017

Test build #74300 has started for PR 17236 at commit d670a11.

@cloud-fan changed the title from "[SPARK-xxxx][SQL] Cannot run intersect/except with map type" to "[SPARK-19893][SQL] Cannot run intersect/except with map type" on Mar 10, 2017
@cloud-fan (Contributor Author)

retest this please

```diff
@@ -319,7 +322,7 @@ trait CheckAnalysis extends PredicateHelper {
         // Check if the data types match.
         dataTypes(child).zip(ref).zipWithIndex.foreach { case ((dt1, dt2), ci) =>
           // SPARK-18058: we shall not care about the nullability of columns
-          if (TypeCoercion.findWiderTypeForTwo(dt1.asNullable, dt2.asNullable).isEmpty) {
+          if (!dt1.sameType(dt2)) {
```

We shouldn't change this. This makes sure we generate the correct error message for union. The problem is that an earlier pair of columns might not be the same but castable, while a later pair column is not the same and we can't cast them; the error should show the latter pair of columns and not the first one (which happens if we revert this).
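The ordering concern can be illustrated with a toy model of the widening check; the `widerType` function below is a hypothetical, much-simplified stand-in for `TypeCoercion.findWiderTypeForTwo`, and the string type names are illustrative only.

```scala
// Toy stand-in for the widening check: succeeds for castable pairs,
// fails for genuinely incompatible ones. Hypothetical, not Spark's API.
def widerType(a: String, b: String): Option[String] = (a, b) match {
  case (x, y) if x == y                  => Some(x)
  case ("int", "long") | ("long", "int") => Some("long")
  case _                                 => None
}

// Column 0: int vs long differ but are castable -> no error should be raised here.
// Column 1: int vs map differ and are NOT castable -> the error must point here.
val pairs = Seq(("int", "long"), ("int", "map"))
val firstBadColumn = pairs.indexWhere { case (a, b) => widerType(a, b).isEmpty }
assert(firstBadColumn == 1) // a plain same-type check would instead flag column 0
```

This is why reverting the check to a strict same-type comparison would misreport the first merely-castable column pair instead of the truly incompatible one.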


See #16882 for more information.

@hvanhovell (Contributor)

@cloud-fan can we also check Distinct? We rewrite this into an Aggregate in the optimizer, which has the same problem with MapType.


SparkQA commented Mar 10, 2017

Test build #74309 has finished for PR 17236 at commit d670a11.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan changed the title from "[SPARK-19893][SQL] Cannot run intersect/except with map type" to "[SPARK-19893][SQL] Cannot run intersect/except/distinct with map type" on Mar 10, 2017
@hvanhovell (Contributor)

LGTM pending jenkins


SparkQA commented Mar 10, 2017

Test build #74316 has finished for PR 17236 at commit 268a4a0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 10, 2017

Test build #74317 has finished for PR 17236 at commit eb49284.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
val e2 = intercept[AnalysisException](df.except(df))
assert(e2.message.contains(
  "Cannot have map type columns in DataFrame which calls intersect/except/distinct"))
val e3 = intercept[AnalysisException](df.distinct())
```
For some reason I don't understand, the code path for `df.distinct()` actually creates an Aggregate, whereas the SQL code path uses the Distinct operator. So we also need to issue a SQL statement here to cover that path.

@cloud-fan changed the title from "[SPARK-19893][SQL] Cannot run intersect/except/distinct with map type" to "[SPARK-19893][SQL] should not run DataFrame set operations with map type" on Mar 10, 2017

SparkQA commented Mar 10, 2017

Test build #74329 has finished for PR 17236 at commit 92a7efb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 10, 2017

Test build #74343 has finished for PR 17236 at commit 54f4ae1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor Author)

thanks for the review, merging to master!

@asfgit asfgit closed this in fb9beda Mar 11, 2017
asfgit pushed a commit that referenced this pull request Mar 11, 2017
In Spark SQL, the map type can't be used in equality tests or comparisons. Since `Intersect`/`Except`/`Distinct` require equality tests on all columns, we should not allow map type columns in `Intersect`/`Except`/`Distinct`.

new regression test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #17236 from cloud-fan/map.

(cherry picked from commit fb9beda)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asfgit pushed a commit that referenced this pull request Mar 11, 2017
@cloud-fan (Contributor Author)

backported to 2.1/2.0


srowen commented Mar 15, 2017

@cloud-fan I think branch-2.0 unfortunately fails to compile after this commit, such as in
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.0-test-maven-hadoop-2.4/990/

```
[info] Compiling 174 Scala sources and 19 Java sources to /home/jenkins/workspace/spark-branch-2.0-test-maven-hadoop-2.4/sql/core/target/scala-2.11/test-classes...
[error] /home/jenkins/workspace/spark-branch-2.0-test-maven-hadoop-2.4/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:1704: Missing closing brace `}' assumed here
[error] }
[error] ^
```

@cloud-fan (Contributor Author)

I'm very sorry about this; I've fixed it in branch-2.0. Thanks for pointing it out, @srowen!

@magic890

@cloud-fan @yhuai @sameeragarwal @hvanhovell Any plan to re-enable Intersect / Except / Distinct operations for Map type?
The current behaviour prevents even left_anti joins.

cloud-fan pushed a commit that referenced this pull request Feb 26, 2020
### What changes were proposed in this pull request?
`hash()` and `xxhash64()` can no longer be used on elements of `MapType`. A new configuration, `spark.sql.legacy.useHashOnMapType`, is introduced to allow users to restore the previous behaviour.

When `spark.sql.legacy.useHashOnMapType` is set to false:

```
scala> spark.sql("select hash(map())");
org.apache.spark.sql.AnalysisException: cannot resolve 'hash(map())' due to data type mismatch: input to function hash cannot contain elements of MapType; line 1 pos 7;
'Project [unresolvedalias(hash(map(), 42), None)]
+- OneRowRelation
```

When `spark.sql.legacy.useHashOnMapType` is set to true:

```
scala> spark.sql("set spark.sql.legacy.useHashOnMapType=true");
res3: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("select hash(map())").first()
res4: org.apache.spark.sql.Row = [42]

```

### Why are the changes needed?

As discussed in the JIRA, Spark SQL's map hash codes depend on insertion order, which is inconsistent with normal Scala behaviour and might confuse users.
Code snippet from the JIRA:
```
val a = spark.createDataset(Map(1->1, 2->2) :: Nil)
val b = spark.createDataset(Map(2->2, 1->1) :: Nil)

// Demonstration of how Scala Map equality is unaffected by insertion order:
assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode())
assert(Map(1->1, 2->2) == Map(2->2, 1->1))
assert(a.first() == b.first())

// In contrast, this will print two different hashcodes:
println(Seq(a, b).map(_.selectExpr("hash(*)").first()))
```

Also `MapType` is prohibited for aggregation / joins / equality comparisons #7819 and set operations #17236.

### Does this PR introduce any user-facing change?
Yes. Now users cannot use hash functions on elements of `mapType`. To restore the previous behaviour set `spark.sql.legacy.useHashOnMapType` to true.

### How was this patch tested?
UT added.

Closes #27580 from iRakson/SPARK-27619.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
tallamjr pushed a commit to tallamjr/spark that referenced this pull request Feb 26, 2020
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
@magic890

@iRakson @cloud-fan @tarekbecker @w4-sjcho @yhuai @sameeragarwal @hvanhovell
Any plan to re-enable Intersect / Except / Distinct operations for the map type?
The current behaviour prevents even left_anti joins, not only the hash operations.

@cloud-fan (Contributor Author)

You can work around it by turning the map into an array with the `map_entries` function. We don't plan to re-enable these operations unless we make the map type orderable.
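The idea behind the workaround can be mimicked in plain Scala (hypothetical data, no Spark needed for the sketch; in Spark you would apply `map_entries` to the column and sort the resulting array before the set operation):

```scala
// Pure-Scala analogue of the map_entries workaround: a map is not orderable,
// but its entries sorted by key form an equality-safe, deterministic sequence.
val rows = Seq(Map(1 -> "a", 2 -> "b"), Map(2 -> "b", 1 -> "a"))

// Convert each map to a sorted sequence of entries, as map_entries would expose,
// then distinct behaves deterministically on the ordered representation.
val asEntries = rows.map(_.toSeq.sortBy(_._1))
assert(asEntries.distinct.size == 1) // both rows represent the same map
```

Sorting by key is what makes the representation order-independent; without the sort, two equal maps could still produce differently ordered entry arrays.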
