New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-5278][SQL] Introduce UnresolvedGetField and complete the check of ambiguous reference to fields #4068
Conversation
Can one of the admins verify this patch? |
ping @marmbrus @liancheng |
ok to test |
Test build #25681 has started for PR 4068 at commit
|
Test build #25681 has finished for PR 4068 at commit
|
Test FAILed. |
The problem is that: Currently the So we need a way to indicate whether a For SPARK-3698, we can do this by searching required field with case sensitive, if success, we are done. if not, we still have chance if the resolver is case insensitive, so we can do the check in For SPARK-5278 here, it's more complicated. it seems to me that the only way is adding a flag to What do you think? @marmbrus @liancheng |
Test build #25747 has started for PR 4068 at commit
|
Test build #25747 has finished for PR 4068 at commit
|
Test FAILed. |
Test build #25750 has started for PR 4068 at commit
|
Test build #25750 has finished for PR 4068 at commit
|
Test FAILed. |
Test build #25754 has started for PR 4068 at commit
|
I had a new idea: Don't try to figure out whether a |
Test build #25754 has finished for PR 4068 at commit
|
Test PASSed. |
ping @marmbrus @liancheng @rxin |
@yhuai can you review this one first? |
val actualField = fields.filter(f => resolver(f.name, fieldName)) | ||
if (actualField.length == 0) { | ||
sys.error( | ||
s"No such struct field $fieldName in ${fields.map(_.name).mkString(", ")}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think CheckResolution
should catch it. If we cannot resolve it, just leave it unchanged. Can you see if there is a unit test for this? If not, can you add one? Maybe we can also log it like what LogicalPlan.resolve
does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we leave it unchanged, CheckResolution
can't catch it. The reason is that, we need Resolver
to check if a GetField
is resolved, but we can't get Resolver
inside GetField
.
Fortunately, we can catch it at runtime, as GetField
will report error if it can't find the required field.
Which way should we prefer? Leaving it unchanged or reporting error right away?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I see. With your change, the resolved
field will always be true as long as its child's resolved
is true (the existence of the desired field is not considered). Actually, we are breaking the semantic of resolved
at here. I do think checking ambiguity is necessary, but I think it is also necessary to follow the definition of resolved
.
I am sorry if I missed it. Can you explain why you prefer not to put this check in GetField
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said before, we need Resolver
to check the existence of desired field. However, we can't get it inside GetField
, that's why we also did this check in LogicalPlan.resolveNesting
.
And we build GetField
in SqlParser
, so it's hard to put Resolver
into GetField
during building.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have a well defined resolved
field. Actually, I feel UnresolvedGetField
is a better idea. I think it is reasonable to have a UnresolvedXXX
whenever we need to do lookup or name matching to resolve it. @marmbrus What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping @marmbrus
Test build #26290 has started for PR 4068 at commit
|
Test build #26290 has finished for PR 4068 at commit
|
Test PASSed. |
Hi @yhuai , I have updated this PR introducing |
Test build #26638 has started for PR 4068 at commit
|
@@ -17,6 +17,8 @@ | |||
|
|||
package org.apache.spark.sql | |||
|
|||
import org.apache.spark.sql.catalyst.analysis.UnresolvedGetField |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Import order
Okay, you have convinced me that this is cleaner. Two nits about formatting and then this LGTM. Can you also fix the description of the PR, as that becomes the commit message. Would be awesome to include this in 1.2 if you can update soon. Thanks for working on this! |
Test build #26815 has started for PR 4068 at commit
|
Test build #26815 has finished for PR 4068 at commit
|
Test PASSed. |
Hi @marmbrus , I have updated it, is it ready to go? Thanks. |
Test build #26833 has started for PR 4068 at commit
|
Test build #26833 has finished for PR 4068 at commit
|
Test FAILed. |
Test build #26837 has started for PR 4068 at commit
|
Test build #26837 has finished for PR 4068 at commit
|
Test PASSed. |
… of ambiguous reference to fields When the `GetField` chain(`a.b.c.d.....`) is interrupted by `GetItem` like `a.b[0].c.d....`, then the check of ambiguous reference to fields is broken. The reason is that: for something like `a.b[0].c.d`, we first parse it to `GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")`. Then in `LogicalPlan#resolve`, we resolve `"a.b"` and build a `GetField` chain from bottom(the relation). But for the 2 outer `GetFiled`, we have to resolve them in `Analyzer` or do it in `GetField` lazily, check data type of child, search needed field, etc. which is similar to what we have done in `LogicalPlan#resolve`. So in this PR, the fix is just copy the same logic in `LogicalPlan#resolve` to `Analyzer`, which is simple and quick, but I do suggest introduce `UnresolvedGetFiled` like I explained in #2405. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #4068 from cloud-fan/simple and squashes the following commits: a6857b5 [Wenchen Fan] fix import order 8411c40 [Wenchen Fan] use UnresolvedGetField (cherry picked from commit 4793c84) Signed-off-by: Michael Armbrust <michael@databricks.com>
Thanks! merged to master and 1.3 |
This PR introduced two subtle regressions in the way we handle nested fields in ORDER BY queries: |
Hi @marmbrus , I'm looking into it, will send a fix ASAP. |
Here is a partial solution: #4892 |
When the
GetField
chain(a.b.c.d.....
) is interrupted byGetItem
likea.b[0].c.d....
, then the check of ambiguous reference to fields is broken.The reason is that: for something like
a.b[0].c.d
, we first parse it toGetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")
. Then inLogicalPlan#resolve
, we resolve"a.b"
and build aGetField
chain from bottom(the relation). But for the 2 outerGetFiled
, we have to resolve them inAnalyzer
or do it inGetField
lazily, check data type of child, search needed field, etc. which is similar to what we have done inLogicalPlan#resolve
.