[SPARK-5278][SQL] Introduce UnresolvedGetField and complete the check of ambiguous reference to fields #4068

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Contributor

When a GetField chain (a.b.c.d...) is interrupted by a GetItem, as in a.b[0].c.d..., the check for ambiguous references to fields is broken.
The reason is that something like a.b[0].c.d is first parsed to GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d"). Then, in LogicalPlan#resolve, we resolve "a.b" and build a GetField chain from the bottom (the relation). But the 2 outer GetFields have to be resolved either in Analyzer or lazily in GetField itself: check the data type of the child, search for the needed field, etc., which is similar to what we already do in LogicalPlan#resolve.
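
To make the shape of the problem concrete, here is a minimal sketch of the parsed tree for a.b[0].c.d. The case classes are toy stand-ins written for this illustration (they are not the real Catalyst classes), and `fieldsAboveItem` is a hypothetical helper that lists the field accesses sitting above a GetItem, i.e. the ones that LogicalPlan#resolve never sees.

```scala
// Toy stand-ins for Catalyst expressions; NOT the real Spark classes.
object ParseSketch {
  sealed trait Expr
  case class UnresolvedAttribute(name: String) extends Expr
  case class GetItem(child: Expr, ordinal: Int) extends Expr
  case class GetField(child: Expr, fieldName: String) extends Expr

  // The tree the description gives for `a.b[0].c.d`.
  val parsed: Expr =
    GetField(GetField(GetItem(UnresolvedAttribute("a.b"), 0), "c"), "d")

  // True if a GetItem appears anywhere in this (linear) chain.
  def containsItem(e: Expr): Boolean = e match {
    case GetItem(_, _)          => true
    case GetField(child, _)     => containsItem(child)
    case UnresolvedAttribute(_) => false
  }

  // Hypothetical helper: names of GetField nodes that sit above a GetItem,
  // listed from the innermost access to the outermost one.
  def fieldsAboveItem(e: Expr): List[String] = e match {
    case GetField(child, name) =>
      if (containsItem(child)) fieldsAboveItem(child) :+ name
      else fieldsAboveItem(child)
    case GetItem(child, _)      => fieldsAboveItem(child)
    case UnresolvedAttribute(_) => Nil
  }
}
```

For the tree above, `ParseSketch.fieldsAboveItem(ParseSketch.parsed)` yields the two outer accesses, "c" and "d", which are exactly the ones that still need an ambiguity check.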

@AmplabJenkins

Can one of the admins verify this patch?

@cloud-fan
Contributor Author

ping @marmbrus @liancheng

@marmbrus
Contributor

ok to test

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25681 has started for PR 4068 at commit f2ab566.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25681 has finished for PR 4068 at commit f2ab566.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25681/
Test FAILed.

@cloud-fan
Contributor Author

The problem is that the GetField class is currently an operation that picks the first field whose name equals the required fieldName, case-sensitively. As I said before, we parse a.b[0].c.d to GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d"). For a.b, we can check anything we want before building the GetField, but for the 2 outer GetFields, we can only do the check in Analyzer (or we could expose the resolver to GetField, but that's not recommended).

So we need a way to indicate whether a GetField needs to be analysed or not.

For SPARK-3698, we can do this by searching for the required field case-sensitively; if that succeeds, we are done. If not, we still have a chance if the resolver is case-insensitive, so we can do the check in Analyzer as @marmbrus did in #3724.

For SPARK-5278 here, it's more complicated. It seems to me that the only way is to add a flag to GetField or to introduce UnresolvedGetField.

What do you think? @marmbrus @liancheng
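
As an illustration of why "pick the first match" is not enough, here is a rough sketch of a lookup that takes a resolver and refuses to choose when several fields match. It is a simplified model: fields are plain strings, `findField` is a hypothetical name, and the wording of the ambiguity message is an assumption rather than Spark's actual error text.

```scala
object AmbiguitySketch {
  // A resolver decides whether a field name matches the requested name.
  type Resolver = (String, String) => Boolean
  val caseSensitive: Resolver = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Hypothetical lookup: returns the ordinal of the single matching field,
  // or an error message when there is no match or more than one match.
  def findField(fields: Seq[String], wanted: String, resolver: Resolver): Either[String, Int] = {
    val matches = fields.zipWithIndex.filter { case (name, _) => resolver(name, wanted) }
    if (matches.isEmpty) {
      Left("No such struct field " + wanted + " in " + fields.mkString(", "))
    } else if (matches.size > 1) {
      // Assumed wording; the point is that we must not silently take the first one.
      Left("Ambiguous reference to fields " + matches.map(_._1).mkString(", "))
    } else {
      Right(matches.head._2)
    }
  }
}
```

With fields `a`, `A`, `b`, looking up `a` succeeds under the case-sensitive resolver but is reported as ambiguous under the case-insensitive one, which is the situation the check is meant to catch.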

@SparkQA

SparkQA commented Jan 19, 2015

Test build #25747 has started for PR 4068 at commit bfe069b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 19, 2015

Test build #25747 has finished for PR 4068 at commit bfe069b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25747/
Test FAILed.

@SparkQA

SparkQA commented Jan 19, 2015

Test build #25750 has started for PR 4068 at commit d8c1dc9.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 19, 2015

Test build #25750 has finished for PR 4068 at commit d8c1dc9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25750/
Test FAILed.

@SparkQA

SparkQA commented Jan 19, 2015

Test build #25754 has started for PR 4068 at commit 53048b6.

  • This patch merges cleanly.

@cloud-fan
Contributor Author

I have a new idea: don't try to figure out whether a GetField needs to be analysed or not; just analyse it anyway until we reach the fixed point.
That way we can do the check in Analyzer happily :)
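
The "analyse until the fixed point" idea can be sketched roughly as follows. This is a toy model, not the Analyzer's actual rule executor: the expression classes, `resolveOnce`, and `toFixedPoint` are all hypothetical, and the rule is deliberately written to resolve only one level of nesting per pass so that the repeated application is visible.

```scala
object FixedPointSketch {
  sealed trait Expr
  case class Resolved(path: String) extends Expr
  case class UnresolvedGetField(child: Expr, fieldName: String) extends Expr

  // Toy rule: a field access is resolved only when its child is already
  // resolved, so each pass makes progress on one level of nesting.
  def resolveOnce(e: Expr): Expr = e match {
    case UnresolvedGetField(Resolved(path), name) => Resolved(path + "." + name)
    case UnresolvedGetField(child, name)          => UnresolvedGetField(resolveOnce(child), name)
    case r: Resolved                              => r
  }

  // Re-apply the rule until the tree stops changing or the budget runs out.
  @scala.annotation.tailrec
  def toFixedPoint(e: Expr, maxIterations: Int = 100): Expr = {
    val next = resolveOnce(e)
    if (next == e || maxIterations <= 1) next
    else toFixedPoint(next, maxIterations - 1)
  }

  // The two outer field accesses of `a.b[0].c.d`, with `a.b[0]` already resolved.
  val start: Expr =
    UnresolvedGetField(UnresolvedGetField(Resolved("a.b[0]"), "c"), "d")
}
```

The appeal of this design is that no node has to announce whether it still needs analysis; the loop simply stops when a pass changes nothing.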

@SparkQA

SparkQA commented Jan 19, 2015

Test build #25754 has finished for PR 4068 at commit 53048b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25754/
Test PASSed.

@cloud-fan
Contributor Author

Hi @marmbrus, how about this fix? If it looks OK to you, I'll update #2405 in the same way.

@cloud-fan
Contributor Author

ping @marmbrus @liancheng @rxin

@rxin
Contributor

rxin commented Jan 24, 2015

@yhuai can you review this one first?

val actualField = fields.filter(f => resolver(f.name, fieldName))
if (actualField.length == 0) {
  sys.error(
    s"No such struct field $fieldName in ${fields.map(_.name).mkString(", ")}")
Contributor

I think CheckResolution should catch it. If we cannot resolve it, just leave it unchanged. Can you see if there is a unit test for this? If not, can you add one? Maybe we can also log it like what LogicalPlan.resolve does.

Contributor Author

If we leave it unchanged, CheckResolution can't catch it. The reason is that we need a Resolver to check whether a GetField is resolved, but we can't get a Resolver inside GetField.
Fortunately, we can catch it at runtime, as GetField will report an error if it can't find the required field.
Which way should we prefer: leaving it unchanged or reporting the error right away?

Contributor

OK, I see. With your change, a GetField's resolved will always be true as long as its child's resolved is true (the existence of the desired field is not considered). Actually, we are breaking the semantics of resolved here. I do think checking ambiguity is necessary, but I think it is also necessary to follow the definition of resolved.

I am sorry if I missed it. Can you explain why you prefer not to put this check in GetField?

Contributor Author

As I said before, we need a Resolver to check the existence of the desired field. However, we can't get one inside GetField, which is why we also did this check in LogicalPlan.resolveNesting.
And we build GetField in SqlParser, so it's hard to pass a Resolver into GetField at construction time.

Contributor

I think we should have a well-defined resolved field. Actually, I feel UnresolvedGetField is a better idea. I think it is reasonable to have an UnresolvedXXX whenever we need to do a lookup or name matching to resolve it. @marmbrus What do you think?
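
A minimal sketch of what a well-defined resolved could look like with a separate unresolved node is given below. These are simplified toy classes for illustration only: the real expressions carry data types and a StructField, and `Attr` is a hypothetical stand-in for an already resolved attribute.

```scala
object ResolvedSketch {
  sealed trait Expr { def resolved: Boolean }

  // Hypothetical stand-in for an attribute that has already been resolved.
  case class Attr(name: String) extends Expr {
    val resolved = true
  }

  // Still needs a name lookup, so it never reports itself as resolved.
  case class UnresolvedGetField(child: Expr, fieldName: String) extends Expr {
    val resolved = false
  }

  // Built only after the lookup has succeeded; it records the ordinal found,
  // so `resolved` depends on the child alone without breaking its meaning.
  case class GetField(child: Expr, fieldName: String, ordinal: Int) extends Expr {
    def resolved: Boolean = child.resolved
  }
}
```

With this split, a resolution check only has to look at `resolved`: any field access that has not been matched against the child's fields is still an UnresolvedGetField and is flagged without needing a Resolver inside GetField.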

Contributor Author

ping @marmbrus

@SparkQA

SparkQA commented Jan 29, 2015

Test build #26290 has started for PR 4068 at commit d9cf8fd.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 29, 2015

Test build #26290 has finished for PR 4068 at commit d9cf8fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26290/
Test PASSed.

@cloud-fan
Contributor Author

Hi @yhuai, I have updated this PR to introduce UnresolvedGetField to fix this issue. Do you have time to review it? Thanks!

@SparkQA

SparkQA commented Feb 3, 2015

Test build #26638 has started for PR 4068 at commit 1b8f924.

  • This patch does not merge cleanly.

@@ -17,6 +17,8 @@

package org.apache.spark.sql

import org.apache.spark.sql.catalyst.analysis.UnresolvedGetField
Contributor

Nit: Import order

@marmbrus
Contributor

marmbrus commented Feb 4, 2015

Okay, you have convinced me that this is cleaner. Two nits about formatting and then this LGTM. Can you also fix the description of the PR, as that becomes the commit message?

Would be awesome to include this in 1.2 if you can update soon.

Thanks for working on this!

@cloud-fan cloud-fan changed the title [SPARK-5278][SQL] complete the check of ambiguous reference to fields [SPARK-5278][SQL] Introduce UnresolvedGetFeld and complete the check of ambiguous reference to fields Feb 5, 2015
@cloud-fan cloud-fan changed the title [SPARK-5278][SQL] Introduce UnresolvedGetFeld and complete the check of ambiguous reference to fields [SPARK-5278][SQL] Introduce UnresolvedGetField and complete the check of ambiguous reference to fields Feb 5, 2015
@SparkQA

SparkQA commented Feb 5, 2015

Test build #26815 has started for PR 4068 at commit 8d32e44.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 5, 2015

Test build #26815 has finished for PR 4068 at commit 8d32e44.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class UnresolvedGetField(child: Expression, fieldName: String) extends UnaryExpression
    • case class GetField(child: Expression, field: StructField, ordinal: Int) extends UnaryExpression

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26815/
Test PASSed.

@cloud-fan
Contributor Author

Hi @marmbrus, I have updated it. Is it ready to go? Thanks.

@SparkQA

SparkQA commented Feb 5, 2015

Test build #26833 has started for PR 4068 at commit 085619c.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 5, 2015

Test build #26833 has finished for PR 4068 at commit 085619c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26833/
Test FAILed.

@SparkQA

SparkQA commented Feb 5, 2015

Test build #26837 has started for PR 4068 at commit a6857b5.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 5, 2015

Test build #26837 has finished for PR 4068 at commit a6857b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • * @param modelClass String name for model class (used for error messages)
    • case class Data(labels: Array[Double], pi: Array[Double], theta: Array[Array[Double]])
    • s" but class priors vector pi had $
    • s" but class conditionals array theta had $
    • case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
    • * @param modelClass String name for model class (used for error messages)
    • * @param modelClass String name for model class (used for error messages)
    • case class Data(weights: Vector, intercept: Double)
    • * @param modelClass String name for model class (used for error messages)
    • trait Saveable
    • trait Loader[M <: Saveable]
    • * @return (class name, version, metadata)
    • case class UnresolvedGetField(child: Expression, fieldName: String) extends UnaryExpression
    • case class GetField(child: Expression, field: StructField, ordinal: Int) extends UnaryExpression

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26837/
Test PASSed.

@asfgit asfgit closed this in 4793c84 Feb 6, 2015
asfgit pushed a commit that referenced this pull request Feb 6, 2015
… of ambiguous reference to fields

When the `GetField` chain (`a.b.c.d...`) is interrupted by `GetItem`, as in `a.b[0].c.d...`, the check for ambiguous references to fields is broken.
The reason is that something like `a.b[0].c.d` is first parsed to `GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")`. Then, in `LogicalPlan#resolve`, we resolve `"a.b"` and build a `GetField` chain from the bottom (the relation). But the 2 outer `GetField`s have to be resolved either in `Analyzer` or lazily in `GetField` itself: check the data type of the child, search for the needed field, etc., which is similar to what we already do in `LogicalPlan#resolve`.
So in this PR, the fix is just to copy the same logic from `LogicalPlan#resolve` to `Analyzer`, which is simple and quick, but I do suggest introducing `UnresolvedGetField` as I explained in #2405.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #4068 from cloud-fan/simple and squashes the following commits:

a6857b5 [Wenchen Fan] fix import order
8411c40 [Wenchen Fan] use UnresolvedGetField

(cherry picked from commit 4793c84)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@marmbrus
Contributor

marmbrus commented Feb 6, 2015

Thanks! merged to master and 1.3

@marmbrus
Contributor

marmbrus commented Mar 3, 2015

This PR introduced two subtle regressions in the way we handle nested fields in ORDER BY queries:
https://issues.apache.org/jira/browse/SPARK-6145

@cloud-fan
Contributor Author

Hi @marmbrus , I'm looking into it, will send a fix ASAP.

@marmbrus
Contributor

marmbrus commented Mar 4, 2015

Here is a partial solution: #4892
