[SPARK-26812][SQL] Report correct nullability for complex datatypes in Union #23726
Conversation
Test build #102014 has finished for PR 23726 at commit
```scala
override def output: Seq[Attribute] =
  children.map(_.output).transpose.map(attrs =>
    attrs.head.withNullability(attrs.exists(_.nullable)))
```
Don't we need this logic in UnionExec too?
I'd say so, but I have not yet been able to find a use case, and therefore a UT, for that. If you check the failing case in the JIRA, for instance, the current change works: after analysis, all the child plans of the union have their attributes cast appropriately. So I am not sure that is really needed.
I still feel a bit weird about different output between Union/UnionExec... (But, I don't have any better solution...)
+1, it would be better to keep the output consistent between the logical and physical plans, although there is no direct benefit.
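For illustration, the top-level nullability merge that `Union.output` performs can be modeled outside Spark with a minimal sketch; the `Attr` case class here is a hypothetical stand-in for Catalyst's `Attribute`, not the real class:

```scala
// Minimal model of Union's output: per column position, take the first
// child's attribute and mark it nullable if ANY child's attribute at
// that position is nullable (the most permissive choice).
object UnionNullabilitySketch {
  case class Attr(name: String, nullable: Boolean) {
    def withNullability(n: Boolean): Attr = copy(nullable = n)
  }

  def output(children: Seq[Seq[Attr]]): Seq[Attr] =
    children.transpose.map { attrs =>
      attrs.head.withNullability(attrs.exists(_.nullable))
    }

  def main(args: Array[String]): Unit = {
    val left  = Seq(Attr("a", nullable = false))
    val right = Seq(Attr("a", nullable = true))
    // The union output is nullable because the right side is.
    assert(output(Seq(left, right)) == Seq(Attr("a", nullable = true)))
  }
}
```

The same per-position merge would apply in `UnionExec` if the physical plan adopted this logic.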
```scala
children.map(_.output).transpose.map { attrs =>
  val firstAttr = attrs.head
  val outAttr = if (attrs.exists(_.isInstanceOf[UnresolvedAttribute])) {
    firstAttr
```
Do we need this case? I recall we don't call `output` when the plan is unresolved.
I was not sure indeed. Let me remove it then. Thanks.
Test build #102022 has finished for PR 23726 at commit
```scala
firstAttr.withDataType(attrs.map(_.dataType).reduce(StructType.merge))
} catch {
  // If the data types are not compatible (eg. Decimals with different precision/scale)
  // return the first type
```
Shouldn't all type compatibility checks be in `resolved`?
Yes, right. Let me update the comment. Thanks.
Test build #102035 has finished for PR 23726 at commit
```scala
children.map(_.output).transpose.map { attrs =>
  val firstAttr = attrs.head
  val outAttr = try {
    firstAttr.withDataType(attrs.map(_.dataType).reduce(StructType.merge))
```
Adding `withDataType` to `Attribute` seems overkill to me. It is also weird to change an attribute's data type. Should we just create an attribute here manually?
This approach seemed cleaner and more consistent with similar changes, keeping all the attribute-copying logic in a single place, so I prefer it like this. Anyway, if others prefer to inline it here, I can change it.
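The try/catch fallback under discussion can be sketched with plain case classes; the types and `merge` below are hypothetical stand-ins for Catalyst's `DataType` hierarchy and `StructType.merge`, meant only to show the intended semantics:

```scala
// Model of merging column data types across Union children: merging
// succeeds for compatible types (taking the most permissive inner
// nullability), and falls back to the first child's type otherwise
// (e.g. decimals with different precision/scale).
object TypeMergeSketch {
  sealed trait DType
  case class DecimalT(precision: Int, scale: Int) extends DType
  case class MapT(valueContainsNull: Boolean) extends DType

  def merge(a: DType, b: DType): DType = (a, b) match {
    case (MapT(n1), MapT(n2)) => MapT(n1 || n2) // most permissive wins
    case _ if a == b          => a
    case _ => throw new IllegalArgumentException(s"incompatible: $a vs $b")
  }

  def mergedType(types: Seq[DType]): DType =
    try types.reduce(merge)
    catch { case _: IllegalArgumentException => types.head }

  def main(args: Array[String]): Unit = {
    assert(mergedType(Seq(MapT(false), MapT(true))) == MapT(true))
    // Incompatible decimals: fall back to the first child's type.
    assert(mergedType(Seq(DecimalT(10, 2), DecimalT(12, 4))) == DecimalT(10, 2))
  }
}
```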
...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
```scala
val testRelation1 = LocalRelation('a.map(MapType(StringType, StringType, true)))
val testRelation2 = LocalRelation('a.map(MapType(StringType, StringType, false)))
val query = Union(testRelation2, testRelation1)
assert(query.output.head.dataType == MapType(StringType, StringType, true))
```
Could you additionally cover more data types, e.g., Struct and Array?
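As a sketch of the expected semantics for those extra cases (modeled with hypothetical case classes, not Spark's actual `ArrayType`/`StructType`), the inner nullability should again be the most permissive across children:

```scala
object ComplexNullabilitySketch {
  case class ArrayT(containsNull: Boolean)
  case class Field(name: String, nullable: Boolean)
  case class StructT(fields: Seq[Field])

  // Most-permissive merge for array element nullability.
  def mergeArray(a: ArrayT, b: ArrayT): ArrayT =
    ArrayT(a.containsNull || b.containsNull)

  // Most-permissive merge for struct field nullability, field by field.
  def mergeStruct(a: StructT, b: StructT): StructT =
    StructT(a.fields.zip(b.fields).map { case (fa, fb) =>
      Field(fa.name, fa.nullable || fb.nullable)
    })

  def main(args: Array[String]): Unit = {
    assert(mergeArray(ArrayT(false), ArrayT(true)) == ArrayT(true))
    assert(mergeStruct(StructT(Seq(Field("x", false))),
                       StructT(Seq(Field("x", true)))) ==
           StructT(Seq(Field("x", true))))
  }
}
```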
Hi, @cloud-fan and @gatorsmile.

Test build #102146 has finished for PR 23726 at commit

shall we fix

I think it's a bug fix, as we may hit problems if a struct field is nullable but we report it as not nullable. But I'm not sure how serious the bug is; it sounds like it's hard to hit.

@cloud-fan I don't think we should fix

For

@cloud-fan what we might want to do as an improvement is the exact opposite of what we are doing here, IMHO, i.e. marking as not nullable data types which are nullable in the first plan and not nullable in the second. But this would mean creating a new method for doing this, and I am not sure the little gain we get from the optimization is worth the burden of maintaining it (it would apply only to complex data types with different nullabilities...).

Test build #102202 has finished for PR 23726 at commit

since we are touching

@cloud-fan that is an interesting point indeed. I wondered about that too, but I wanted to keep the scope of this fix as limited as possible. What about trying to do that in a follow-up?

any thoughts on the above point @cloud-fan @gatorsmile? Thanks.

any more comments on this? Thanks.

So this patch does fix the problem, but my concern is that picking the first child's output as the output is pretty tricky. Before adding more tricks to it, I'm thinking about whether there is a way to fix the problem entirely, by always using a new seq of attr IDs as the output.

@cloud-fan I am not sure I 100% get your comment. I don't see the relationship between creating new attribute IDs and reporting the correct nullability for nested fields. I think just creating new IDs/attributes doesn't solve, and is not related to, this issue, but maybe I am missing something...

so the data type resolving logic is still needed, but if we use new attributes, the

ok, let me do that, thanks @cloud-fan

Test build #104087 has finished for PR 23726 at commit

Test build #104109 has finished for PR 23726 at commit

Retest this please.

Test build #104152 has finished for PR 23726 at commit
```scala
children.map(_.output).transpose.map { attrs =>
  val firstAttr = attrs.head
  val nullable = attrs.exists(_.nullable)
  try {
```
This is a minor bug fix and I think no backport is needed. This try-catch looks like overkill now.
Based on https://issues.apache.org/jira/browse/SPARK-27685 it sounds like this can have a correctness impact for queries and it looks like a pretty straightforward fix. Given this, I think we should consider a 2.4.x backport.
sounds reasonable. @viirya can you send a new PR for branch 2.4? thanks!
@cloud-fan I can, but did you really mean to mention me and not @mgaido91?
ah sorry, my mistake :P
@mgaido91 can you send a backport PR please?
+1 for backporting this!
Sure, I am doing it, thanks.
Test build #104159 has finished for PR 23726 at commit
thanks, merging to master!
What changes were proposed in this pull request?

When there is a `Union`, the reported output data types are the ones of the first plan, and the nullability is updated according to all the plans. For complex types, though, the nullability of their elements is not updated using the types from the other plans. This means the nullability of the inner elements is the one of the first plan; if this is not compatible with that of the other plans, errors can happen (as reported in the JIRA). The PR proposes to update the nullability of the inner elements of complex data types according to the most permissive value across all the plans.

How was this patch tested?

Added a UT.

Closes apache#23726 from mgaido91/SPARK-26812.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
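Putting the pieces together, the fix can be illustrated end to end with a minimal model; all names here are hypothetical stand-ins for Catalyst's `MapType` and `Attribute`, not Spark APIs:

```scala
object UnionFixSketch {
  case class MapT(valueContainsNull: Boolean)
  case class Attr(name: String, dataType: MapT)

  // After the fix: inner nullability is merged across ALL children,
  // not just taken from the first plan.
  def mergeTypes(types: Seq[MapT]): MapT =
    MapT(types.exists(_.valueContainsNull))

  def unionOutput(children: Seq[Seq[Attr]]): Seq[Attr] =
    children.transpose.map { attrs =>
      attrs.head.copy(dataType = mergeTypes(attrs.map(_.dataType)))
    }

  def main(args: Array[String]): Unit = {
    val first  = Seq(Attr("a", MapT(valueContainsNull = false)))
    val second = Seq(Attr("a", MapT(valueContainsNull = true)))
    // Before the fix, the output would report valueContainsNull = false
    // (the first plan's value); now it is the permissive true.
    assert(unionOutput(Seq(first, second)).head.dataType == MapT(true))
  }
}
```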