
Conversation

@mgaido91
Contributor

@mgaido91 commented Feb 2, 2019

What changes were proposed in this pull request?

When there is a Union, the reported output datatypes are the ones of the first child plan, with the nullability updated according to all the plans. For complex types, though, the nullability of their inner elements is not updated using the types from the other plans, so it remains the one of the first plan. If this is not compatible with the nullability in the other plans, errors can occur (as reported in the JIRA).

The PR proposes to update the nullability of the inner elements of complex datatypes according to the most permissive value across all the plans.
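For illustration, a minimal sketch of the idea: recursively take the most permissive nullability at every level of a complex type. The `mergeNullability` helper below is hypothetical (the PR itself reuses `StructType.merge`), and it assumes the children's types are otherwise identical after analysis:

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: merge two compatible types, taking the most
// permissive nullability at every nesting level.
def mergeNullability(left: DataType, right: DataType): DataType = (left, right) match {
  case (ArrayType(le, ln), ArrayType(re, rn)) =>
    ArrayType(mergeNullability(le, re), ln || rn)
  case (MapType(lk, lv, ln), MapType(rk, rv, rn)) =>
    MapType(mergeNullability(lk, rk), mergeNullability(lv, rv), ln || rn)
  case (StructType(lf), StructType(rf)) =>
    StructType(lf.zip(rf).map { case (l, r) =>
      StructField(l.name, mergeNullability(l.dataType, r.dataType), l.nullable || r.nullable)
    })
  case _ => left // leaf types are assumed equal after analysis
}
```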

How was this patch tested?

added UT

@SparkQA

SparkQA commented Feb 2, 2019

Test build #102014 has finished for PR 23726 at commit dd8f1d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
-  override def output: Seq[Attribute] =
-    children.map(_.output).transpose.map(attrs =>
-      attrs.head.withNullability(attrs.exists(_.nullable)))
+  override def output: Seq[Attribute] = {
```
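Pieced together from the fragments quoted later in this thread, the new implementation looks roughly like this. This is a reconstruction for readability, not the verbatim merged code; the `AttributeReference` arguments follow the usual Catalyst shape:

```scala
override def output: Seq[Attribute] = {
  children.map(_.output).transpose.map { attrs =>
    val firstAttr = attrs.head
    val nullable = attrs.exists(_.nullable)
    val newDt = attrs.map(_.dataType).reduce(StructType.merge)
    if (firstAttr.dataType == newDt) {
      firstAttr.withNullability(nullable)
    } else {
      // The inner nullability differs across children: build a new attribute
      // carrying the merged (most permissive) data type.
      AttributeReference(firstAttr.name, newDt, nullable, firstAttr.metadata)(
        firstAttr.exprId, firstAttr.qualifier)
    }
  }
}
```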
Contributor

Don't we need this logic in UnionExec too?

Contributor Author

I'd say so, but I have not yet been able to find a use case, and therefore a UT, for that. If you check the failing case in the JIRA, for instance, the current change works, because after analysis all the child plans of the Union have their attributes cast appropriately. So I am not sure it is really needed.

Member

I still feel a bit uneasy about Union and UnionExec having different output... (But I don't have any better solution...)

Contributor

+1, it would be better to keep the output consistent between the logical and physical plans, although there is no direct benefit.

```scala
children.map(_.output).transpose.map { attrs =>
  val firstAttr = attrs.head
  val outAttr = if (attrs.exists(_.isInstanceOf[UnresolvedAttribute])) {
    firstAttr
```
Member

Do we need this case? I recall we don't call output when the plan is unresolved.

Contributor Author

I was not sure indeed. Let me remove it then. Thanks.

@SparkQA

SparkQA commented Feb 3, 2019

Test build #102022 has finished for PR 23726 at commit 0f903ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
firstAttr.withDataType(attrs.map(_.dataType).reduce(StructType.merge))
} catch {
  // If the data types are not compatible (e.g. Decimals with different precision/scale),
  // return the first type
```
Member

All type compatibility checks should be in `resolved`?

Contributor Author

Yes, right. Let me update the comment. Thanks.

@SparkQA

SparkQA commented Feb 4, 2019

Test build #102035 has finished for PR 23726 at commit 6e686ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
children.map(_.output).transpose.map { attrs =>
  val firstAttr = attrs.head
  val outAttr = try {
    firstAttr.withDataType(attrs.map(_.dataType).reduce(StructType.merge))
```
Member

Adding withDataType to Attribute seems overkill to me. It is also weird to change an attribute's data type. Should we just create an attribute here manually?

Contributor Author

This approach seemed cleaner and more consistent with similar changes, keeping all the attribute-copying logic in a single place, so I prefer it like this. Anyway, if others prefer to inline it here, I can change it.
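For context, a plausible shape for the helper under discussion, as a method on AttributeReference (a sketch only; withDataType was later dropped from this PR):

```scala
// Sketch: copy this attribute, swapping only its data type; all other
// fields (name, nullability, metadata, exprId, qualifier) are preserved.
def withDataType(newType: DataType): AttributeReference =
  AttributeReference(name, newType, nullable, metadata)(exprId, qualifier)
```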

```scala
val testRelation1 = LocalRelation('a.map(MapType(StringType, StringType, true)))
val testRelation2 = LocalRelation('a.map(MapType(StringType, StringType, false)))
val query = Union(testRelation2, testRelation1)
assert(query.output.head.dataType == MapType(StringType, StringType, true))
```
Member

Could you also cover more data types, e.g., Struct and Array?
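Hypothetical additional cases in the same style, assuming the catalyst DSL offers array/struct overloads analogous to the 'a.map(...) one used above:

```scala
// Hypothetical follow-up cases, mirroring the map test above: the output
// should take the most permissive inner nullability across the children.
val arrayRel1 = LocalRelation('a.array(ArrayType(StringType, containsNull = true)))
val arrayRel2 = LocalRelation('a.array(ArrayType(StringType, containsNull = false)))
assert(Union(arrayRel2, arrayRel1).output.head.dataType ==
  ArrayType(StringType, containsNull = true))

val structType1 = StructType(Seq(StructField("f", StringType, nullable = true)))
val structType2 = StructType(Seq(StructField("f", StringType, nullable = false)))
val structRel1 = LocalRelation('a.struct(structType1))
val structRel2 = LocalRelation('a.struct(structType2))
assert(Union(structRel2, structRel1).output.head.dataType == structType1)
```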

@dongjoon-hyun
Member

Hi, @cloud-fan and @gatorsmile.
Depending on the perspective, can we consider this a new feature (or an improvement)?

@SparkQA

SparkQA commented Feb 10, 2019

Test build #102146 has finished for PR 23726 at commit bc57f88.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

shall we fix Intersect as well?

@cloud-fan
Contributor

I think it's a bug fix, as we may hit problems if a struct field is nullable but we report it as not nullable. But I'm not sure how serious the bug is; it sounds like it's hard to hit.

@mgaido91
Contributor Author

@cloud-fan I don't think we should fix Intersect: in that case the data must be present on both sides, so there cannot be null items if the first plan (actually, either of them) has none.
I agree with this being a bug fix for a case which is probably not common.

@cloud-fan
Contributor

For Intersect, would it be an improvement rather than a bug fix?

@mgaido91
Contributor Author

@cloud-fan What we might want to do as an improvement is the exact opposite of what we are doing here, IMHO, i.e. marking as not nullable datatypes which are nullable in the first plan and not nullable in the second. But this would mean creating a new method for it, and I am not sure the little gain from the optimization is worth the burden of maintaining it (it would apply only to complex datatypes with different nullabilities...).
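A minimal sketch of that hypothetical opposite direction (not part of this PR): since Intersect only keeps values present on both sides, the least permissive nullability would be safe.

```scala
// Hypothetical: an inner element of an Intersect output can be null only if
// it can be null on *both* sides, so take the least permissive nullability.
def intersectNullability(left: ArrayType, right: ArrayType): ArrayType =
  left.copy(containsNull = left.containsNull && right.containsNull)
```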

@SparkQA

SparkQA commented Feb 11, 2019

Test build #102202 has finished for PR 23726 at commit 4921719.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

since we are touching Union, I'm wondering if we should just create new output attributes for Union, instead of reusing the first child's output. cc @gatorsmile

@mgaido91
Contributor Author

@cloud-fan That is an interesting point indeed. I wondered about that too, but I wanted to keep the scope of this fix as limited as possible. What about trying to do that in a follow-up?

@mgaido91
Contributor Author

Any thoughts on the above point, @cloud-fan @gatorsmile? Thanks.

@mgaido91
Contributor Author

any more comments on this? Thanks.

@cloud-fan
Contributor

So this patch does fix the problem, but my concern is that picking the first child's output as the output is pretty tricky. Before adding more tricks to it, I'm wondering whether there is a way to fix the problem entirely, by always using a new seq of attribute IDs as the output.

@mgaido91
Contributor Author

@cloud-fan I am not sure I 100% get your comment. I don't see the relationship between creating new attribute IDs and reporting the correct nullability for nested fields. I think just creating new IDs/attributes doesn't solve, and is not related to, this issue, but maybe I am missing something...

@cloud-fan
Contributor

So the data type resolving logic is still needed, but if we use new attributes, the withDataType API is not needed. How about we remove withDataType from this PR and move forward? We can explore the new-attributes idea later.
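For what it's worth, a rough sketch of that "new attributes" idea (purely hypothetical here; not what this PR implements):

```scala
// Hypothetical: give Union fresh output attributes instead of reusing the
// first child's; AttributeReference allocates a new ExprId by default.
override def output: Seq[Attribute] =
  children.map(_.output).transpose.map { attrs =>
    AttributeReference(
      attrs.head.name,
      attrs.map(_.dataType).reduce(StructType.merge),
      attrs.exists(_.nullable))()
  }
```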

@mgaido91
Contributor Author

ok, let me do that, thanks @cloud-fan

@SparkQA

SparkQA commented Mar 29, 2019

Test build #104087 has finished for PR 23726 at commit ea6cccc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 30, 2019

Test build #104109 has finished for PR 23726 at commit 3031cee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Apr 1, 2019

Test build #104152 has finished for PR 23726 at commit 3031cee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
children.map(_.output).transpose.map { attrs =>
  val firstAttr = attrs.head
  val nullable = attrs.exists(_.nullable)
  try {
```
Contributor

This is a minor bug fix and I think no backport is needed. This try-catch looks like overkill now.

Contributor

@JoshRosen commented May 14, 2019


Based on https://issues.apache.org/jira/browse/SPARK-27685 it sounds like this can have a correctness impact for queries and it looks like a pretty straightforward fix. Given this, I think we should consider a 2.4.x backport.

Contributor

sounds reasonable. @viirya can you send a new PR for branch 2.4? thanks!

Member

@cloud-fan I can, but did you really mean to mention me and not @mgaido91?

Contributor

ah sorry, my mistake :P

@mgaido91 can you send a backport PR please?

Member

+1 for backporting this!

Contributor Author

sure, I am doing it, thanks

@SparkQA

SparkQA commented Apr 1, 2019

Test build #104159 has finished for PR 23726 at commit 874ad6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 8012f55 Apr 1, 2019
mgaido91 added a commit to mgaido91/spark that referenced this pull request May 14, 2019
…n Union

## What changes were proposed in this pull request?

When there is a `Union`, the reported output datatypes are the ones of the first child plan, with the nullability updated according to all the plans. For complex types, though, the nullability of their inner elements is not updated using the types from the other plans, so it remains the one of the first plan. If this is not compatible with the nullability in the other plans, errors can occur (as reported in the JIRA).

The PR proposes to update the nullability of the inner elements of complex datatypes according to the most permissive value across all the plans.

## How was this patch tested?

added UT

Closes apache#23726 from mgaido91/SPARK-26812.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>