[SPARK-26812][SQL] Report correct nullability for complex datatypes in Union #23726
Conversation
Test build #102014 has finished for PR 23726 at commit
```scala
override def output: Seq[Attribute] =
  children.map(_.output).transpose.map(attrs =>
    attrs.head.withNullability(attrs.exists(_.nullable)))
```
Don't we need this logic in UnionExec too?
I'd say so, but I have not yet been able to find a use case, and therefore a UT, for that. If you check the failing case in the JIRA, for instance, the current change works: after analysis, all the child plans of the union have their attributes cast appropriately. So I am not sure that is really needed.
I still feel a bit weird about different output between Union/UnionExec... (But, I don't have any better solution...)
+1, it would be better to keep the output consistent between the logical and physical plans, although there is no direct benefit.
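For illustration, the top-level nullability merge that `Union.output` performs can be modeled outside Spark with a minimal sketch; the `Attr` case class here is a hypothetical stand-in for Catalyst's `Attribute`, not the real class:

```scala
// Minimal model of Union's output: per column position, take the first
// child's attribute and mark it nullable if ANY child's attribute at
// that position is nullable (the most permissive choice).
object UnionNullabilitySketch {
  case class Attr(name: String, nullable: Boolean) {
    def withNullability(n: Boolean): Attr = copy(nullable = n)
  }

  def output(children: Seq[Seq[Attr]]): Seq[Attr] =
    children.transpose.map { attrs =>
      attrs.head.withNullability(attrs.exists(_.nullable))
    }

  def main(args: Array[String]): Unit = {
    val left  = Seq(Attr("a", nullable = false))
    val right = Seq(Attr("a", nullable = true))
    // The union output is nullable because the right side is.
    assert(output(Seq(left, right)) == Seq(Attr("a", nullable = true)))
  }
}
```

The same per-position merge would apply in `UnionExec` if the physical plan adopted this logic.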
```scala
children.map(_.output).transpose.map { attrs =>
  val firstAttr = attrs.head
  val outAttr = if (attrs.exists(_.isInstanceOf[UnresolvedAttribute])) {
    firstAttr
```
Do we need this case? I recall we don't call `output` when the plan is unresolved.
I was not sure indeed. Let me remove it then. Thanks.
Test build #102022 has finished for PR 23726 at commit
```scala
firstAttr.withDataType(attrs.map(_.dataType).reduce(StructType.merge))
} catch {
  // If the data types are not compatible (eg. Decimals with different precision/scale)
  // return the first type
```
Shouldn't all type compatibility checks be in `resolved`?
Yes, right. Let me update the comment. Thanks.
Test build #102035 has finished for PR 23726 at commit
```scala
children.map(_.output).transpose.map { attrs =>
  val firstAttr = attrs.head
  val outAttr = try {
    firstAttr.withDataType(attrs.map(_.dataType).reduce(StructType.merge))
```
Adding `withDataType` to `Attribute` seems overkill to me. It is also weird to change an attribute's data type. Should we just create an attribute here manually?
This approach seemed cleaner and more consistent with similar changes, keeping all the attribute-copying logic in a single place, so I prefer it like this. Anyway, if others prefer to inline it here, I can change it.
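The try/catch fallback under discussion can be sketched with plain case classes; the types and `merge` below are hypothetical stand-ins for Catalyst's `DataType` hierarchy and `StructType.merge`, meant only to show the intended semantics:

```scala
// Model of merging column data types across Union children: merging
// succeeds for compatible types (taking the most permissive inner
// nullability), and falls back to the first child's type otherwise
// (e.g. decimals with different precision/scale).
object TypeMergeSketch {
  sealed trait DType
  case class DecimalT(precision: Int, scale: Int) extends DType
  case class MapT(valueContainsNull: Boolean) extends DType

  def merge(a: DType, b: DType): DType = (a, b) match {
    case (MapT(n1), MapT(n2)) => MapT(n1 || n2) // most permissive wins
    case _ if a == b          => a
    case _ => throw new IllegalArgumentException(s"incompatible: $a vs $b")
  }

  def mergedType(types: Seq[DType]): DType =
    try types.reduce(merge)
    catch { case _: IllegalArgumentException => types.head }

  def main(args: Array[String]): Unit = {
    assert(mergedType(Seq(MapT(false), MapT(true))) == MapT(true))
    // Incompatible decimals: fall back to the first child's type.
    assert(mergedType(Seq(DecimalT(10, 2), DecimalT(12, 4))) == DecimalT(10, 2))
  }
}
```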
...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
```scala
val testRelation1 = LocalRelation('a.map(MapType(StringType, StringType, true)))
val testRelation2 = LocalRelation('a.map(MapType(StringType, StringType, false)))
val query = Union(testRelation2, testRelation1)
assert(query.output.head.dataType == MapType(StringType, StringType, true))
```
Could you additionally cover more data types, e.g., Struct and Array?
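As a sketch of the expected semantics for those extra cases (modeled with hypothetical case classes, not Spark's actual `ArrayType`/`StructType`), the inner nullability should again be the most permissive across children:

```scala
object ComplexNullabilitySketch {
  case class ArrayT(containsNull: Boolean)
  case class Field(name: String, nullable: Boolean)
  case class StructT(fields: Seq[Field])

  // Most-permissive merge for array element nullability.
  def mergeArray(a: ArrayT, b: ArrayT): ArrayT =
    ArrayT(a.containsNull || b.containsNull)

  // Most-permissive merge for struct field nullability, field by field.
  def mergeStruct(a: StructT, b: StructT): StructT =
    StructT(a.fields.zip(b.fields).map { case (fa, fb) =>
      Field(fa.name, fa.nullable || fb.nullable)
    })

  def main(args: Array[String]): Unit = {
    assert(mergeArray(ArrayT(false), ArrayT(true)) == ArrayT(true))
    assert(mergeStruct(StructT(Seq(Field("x", false))),
                       StructT(Seq(Field("x", true)))) ==
           StructT(Seq(Field("x", true))))
  }
}
```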
Hi, @cloud-fan and @gatorsmile.

Test build #102146 has finished for PR 23726 at commit

shall we fix

I think it's a bug fix, as we may hit problems if a struct field is nullable but we report it as not nullable. But I'm not sure how serious the bug is; it sounds like it's hard to hit.

@cloud-fan I don't think we should fix

For

@cloud-fan what we might want to do as an improvement is the exact opposite of what we are doing here, IMHO, i.e. marking as not nullable data types which are nullable in the first plan and not nullable in the second. But this would mean creating a new method for doing this, and I am not sure the little gain we get from the optimization is worth the burden of maintaining it (it would apply only to complex data types with different nullabilities...).

Test build #102202 has finished for PR 23726 at commit

since we are touching

@cloud-fan that is an interesting point indeed. I wondered about that too, but I wanted to keep the scope of this fix as limited as possible. What about trying to do that in a follow-up?

any thoughts on the above point @cloud-fan @gatorsmile? Thanks.

any more comments on this? Thanks.

So this patch does fix the problem, but my concern is that picking the first child's output as the output is pretty tricky. Before adding more tricks to it, I'm thinking about whether there is a way to fix the problem entirely, by always using a new seq of attr IDs as the output.

@cloud-fan I am not sure I 100% get your comment. I don't see the relationship between creating new attribute IDs and reporting the correct nullability for nested fields. I think just creating new IDs/attributes doesn't solve, and is not related to, this issue, but maybe I am missing something...

so the data type resolving logic is still needed, but if we use new attributes, the

ok, let me do that, thanks @cloud-fan

Test build #104087 has finished for PR 23726 at commit

Test build #104109 has finished for PR 23726 at commit

Retest this please.

Test build #104152 has finished for PR 23726 at commit
```scala
children.map(_.output).transpose.map { attrs =>
  val firstAttr = attrs.head
  val nullable = attrs.exists(_.nullable)
  try {
```
This is a minor bug fix and I think no backport is needed. This try-catch looks like overkill now.
Based on https://issues.apache.org/jira/browse/SPARK-27685 it sounds like this can have a correctness impact for queries and it looks like a pretty straightforward fix. Given this, I think we should consider a 2.4.x backport.
sounds reasonable. @viirya can you send a new PR for branch 2.4? thanks!
@cloud-fan I can, but did you really mean to mention me and not @mgaido91?
ah sorry, my mistake :P
@mgaido91 can you send a backport PR please?
+1 for backporting this!
Sure, I am doing it, thanks.
Test build #104159 has finished for PR 23726 at commit
thanks, merging to master!
What changes were proposed in this pull request?

When there is a `Union`, the reported output data types are the ones of the first plan, and the nullability is updated according to all the plans. For complex types, though, the nullability of their elements is not updated using the types from the other plans. This means the nullability of the inner elements is the one of the first plan; if this is not compatible with that of the other plans, errors can happen (as reported in the JIRA). The PR proposes to update the nullability of the inner elements of complex data types according to the most permissive value across all the plans.

How was this patch tested?

Added a UT.

Closes apache#23726 from mgaido91/SPARK-26812.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
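Putting the pieces together, the fix can be illustrated end to end with a minimal model; all names here are hypothetical stand-ins for Catalyst's `MapType` and `Attribute`, not Spark APIs:

```scala
object UnionFixSketch {
  case class MapT(valueContainsNull: Boolean)
  case class Attr(name: String, dataType: MapT)

  // After the fix: inner nullability is merged across ALL children,
  // not just taken from the first plan.
  def mergeTypes(types: Seq[MapT]): MapT =
    MapT(types.exists(_.valueContainsNull))

  def unionOutput(children: Seq[Seq[Attr]]): Seq[Attr] =
    children.transpose.map { attrs =>
      attrs.head.copy(dataType = mergeTypes(attrs.map(_.dataType)))
    }

  def main(args: Array[String]): Unit = {
    val first  = Seq(Attr("a", MapT(valueContainsNull = false)))
    val second = Seq(Attr("a", MapT(valueContainsNull = true)))
    // Before the fix, the output would report valueContainsNull = false
    // (the first plan's value); now it is the permissive true.
    assert(unionOutput(Seq(first, second)).head.dataType == MapT(true))
  }
}
```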