[SPARK-5278][SQL] Introduce UnresolvedGetField and complete the check of ambiguous reference to fields #4068

cloud-fan · 2015-01-16T06:54:00Z

When a `GetField` chain (`a.b.c.d...`) is interrupted by a `GetItem`, as in `a.b[0].c.d...`, the check for ambiguous references to fields is broken.
The reason: an expression like `a.b[0].c.d` is first parsed to `GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")`. In `LogicalPlan#resolve` we then resolve `"a.b"` and build a `GetField` chain from the bottom (the relation). The two outer `GetField`s, however, have to be resolved in `Analyzer` or lazily inside `GetField`: check the child's data type, search for the requested field, and so on, which is similar to what `LogicalPlan#resolve` already does.
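
To make the parse shape concrete, here is a minimal sketch using toy stand-ins for the Catalyst expressions. The types `Expr`, `Unresolved`, `GetItem`, and `GetField` below are simplified illustrations, not the real Spark classes, and the helper `fieldsAboveItem` is hypothetical. It only shows which field accesses end up outside the part that `LogicalPlan#resolve` handles.

```scala
object ParseShapeSketch {
  // Toy expression tree; NOT the real Catalyst classes.
  sealed trait Expr
  case class Unresolved(name: String) extends Expr
  case class GetItem(child: Expr, ordinal: Int) extends Expr
  case class GetField(child: Expr, fieldName: String) extends Expr

  // The shape described above for a.b[0].c.d
  val parsed: Expr =
    GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")

  // True if a GetItem appears anywhere in this subtree.
  def containsItem(e: Expr): Boolean = e match {
    case GetItem(_, _)      => true
    case GetField(child, _) => containsItem(child)
    case Unresolved(_)      => false
  }

  // Hypothetical helper: names of the GetField nodes that sit above a GetItem,
  // outermost first. These are the accesses that are not covered when only
  // the leading dotted name ("a.b") is resolved.
  def fieldsAboveItem(e: Expr): List[String] = e match {
    case GetField(child, name) if containsItem(child) => name :: fieldsAboveItem(child)
    case GetField(child, _)                           => fieldsAboveItem(child)
    case GetItem(child, _)                            => fieldsAboveItem(child)
    case Unresolved(_)                                => Nil
  }
}
```

For the example tree, `ParseShapeSketch.fieldsAboveItem(ParseShapeSketch.parsed)` yields the two outer accesses, `d` and `c`.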

AmplabJenkins · 2015-01-16T06:57:09Z

Can one of the admins verify this patch?

cloud-fan · 2015-01-16T07:12:21Z

ping @marmbrus @liancheng

marmbrus · 2015-01-16T21:27:31Z

ok to test

SparkQA · 2015-01-16T21:32:43Z

Test build #25681 has started for PR 4068 at commit f2ab566.

This patch merges cleanly.

SparkQA · 2015-01-16T22:21:10Z

Test build #25681 has finished for PR 4068 at commit f2ab566.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-16T22:21:13Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25681/
Test FAILed.

cloud-fan · 2015-01-19T05:16:26Z

The problem is that GetField currently picks the first field whose name equals the requested fieldName, compared case-sensitively. As I said before, a.b[0].c.d is parsed to GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d"). For a.b we can check anything we want before building the GetField, but for the two outer GetFields we can only do the check in Analyzer (or we could expose the resolver to GetField, but that's not recommended).

So we need a way to indicate whether a GetField needs to be analyzed or not.

For SPARK-3698, we can do this by searching for the required field case-sensitively; if that succeeds, we are done. If not, we still have a chance when the resolver is case-insensitive, so we can do the check in Analyzer as @marmbrus did in #3724.

For SPARK-5278 here, it's more complicated. It seems to me that the only way is to add a flag to GetField or to introduce UnresolvedGetField.

What do you think? @marmbrus @liancheng

SparkQA · 2015-01-19T07:07:41Z

Test build #25747 has started for PR 4068 at commit bfe069b.

This patch merges cleanly.

SparkQA · 2015-01-19T07:16:51Z

Test build #25747 has finished for PR 4068 at commit bfe069b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-19T07:16:53Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25747/
Test FAILed.

SparkQA · 2015-01-19T07:37:42Z

Test build #25750 has started for PR 4068 at commit d8c1dc9.

This patch merges cleanly.

SparkQA · 2015-01-19T07:49:19Z

Test build #25750 has finished for PR 4068 at commit d8c1dc9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-19T07:49:22Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25750/
Test FAILed.

SparkQA · 2015-01-19T08:37:38Z

Test build #25754 has started for PR 4068 at commit 53048b6.

This patch merges cleanly.

cloud-fan · 2015-01-19T08:37:39Z

I have a new idea: don't try to figure out whether a GetField needs to be analyzed or not; just analyze it anyway until we reach the fixed point.
That way we can happily do the check in Analyzer :)
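
The "analyze anyway until the fixed point" idea can be sketched as below. This is a toy illustration, not Spark's rule executor: `toFixedPoint`, `structFields`, and `normalize` are hypothetical names, and the rule only rewrites requested field names to the schema's spelling. The point is that a rule which leaves already-checked input unchanged can safely be re-run, because the loop stops as soon as a pass changes nothing.

```scala
import scala.annotation.tailrec

object FixedPointSketch {
  // Apply `rule` repeatedly until the value stops changing
  // or the iteration budget is used up.
  @tailrec
  def toFixedPoint[A](current: A, remaining: Int = 100)(rule: A => A): A = {
    val next = rule(current)
    if (next == current || remaining <= 1) next
    else toFixedPoint(next, remaining - 1)(rule)
  }

  // Hypothetical schema: the field names of one struct.
  val structFields: Seq[String] = Seq("Name", "Age")

  // Toy rule: rewrite each requested name to the schema's spelling using a
  // case-insensitive match. Names that already match map to themselves,
  // so a second pass is a no-op.
  def normalize(requested: Seq[String]): Seq[String] =
    requested.map(r => structFields.find(_.equalsIgnoreCase(r)).getOrElse(r))

  val result: Seq[String] = toFixedPoint(Seq("name", "AGE"))(normalize)
}
```

Here the first pass rewrites the names and the second pass detects that nothing changed, so the loop ends without needing to know in advance which nodes still required analysis.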

SparkQA · 2015-01-19T09:45:34Z

Test build #25754 has finished for PR 4068 at commit 53048b6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-19T09:45:37Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25754/
Test PASSed.

cloud-fan · 2015-01-19T09:46:50Z

Hi @marmbrus, how about this fix? If it looks OK to you, I'll update #2405 in the same way.

cloud-fan · 2015-01-23T04:01:52Z

ping @marmbrus @liancheng @rxin

rxin · 2015-01-24T08:00:28Z

@yhuai can you review this one first?

yhuai · 2015-01-28T04:42:57Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+                val actualField = fields.filter(f => resolver(f.name, fieldName))
+                if (actualField.length == 0) {
+                  sys.error(
+                    s"No such struct field $fieldName in ${fields.map(_.name).mkString(", ")}")


I think CheckResolution should catch it. If we cannot resolve it, just leave it unchanged. Can you see if there is a unit test for this? If not, can you add one? Maybe we can also log it as LogicalPlan.resolve does.

cloud-fan · 2015-01-28

If we leave it unchanged, CheckResolution can't catch it. The reason is that we need a Resolver to check whether a GetField is resolved, but we can't get a Resolver inside GetField.
Fortunately, we can catch it at runtime, as GetField will report an error if it can't find the required field.
Which way should we prefer: leaving it unchanged or reporting the error right away?

yhuai · 2015-01-29

OK, I see. With your change, the resolved field will always be true as long as its child's resolved is true (the existence of the desired field is not considered). Actually, we are breaking the semantics of resolved here. I do think checking ambiguity is necessary, but I think it is also necessary to follow the definition of resolved.

I am sorry if I missed it. Can you explain why you prefer not to put this check in GetField?

cloud-fan · 2015-01-29

As I said before, we need a Resolver to check the existence of the desired field. However, we can't get one inside GetField, which is why we also did this check in LogicalPlan.resolveNesting.
And we build GetField in SqlParser, so it's hard to pass a Resolver into GetField when building it.

yhuai · 2015-01-29

I think we should have a well-defined resolved field. Actually, I feel UnresolvedGetField is a better idea. I think it is reasonable to have an UnresolvedXXX whenever we need to do a lookup or name matching to resolve it. @marmbrus What do you think?

cloud-fan · 2015-02-02

ping @marmbrus
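
As a rough illustration of the lookup discussed in this review thread, the sketch below resolves a field name against a struct's fields with a pluggable resolver and reports both the "no such field" and the ambiguous case. It is a simplified stand-in, not the code from this PR: `StructField` here is a toy class, `resolveField` is a hypothetical helper, and the wording of the ambiguity message is an assumption. Only the "No such struct field" message follows the diff excerpt above.

```scala
object ResolveFieldSketch {
  // A resolver decides whether two names refer to the same field.
  type Resolver = (String, String) => Boolean
  val caseSensitive: Resolver = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Toy stand-in for a struct field; NOT the real Spark class.
  case class StructField(name: String)

  // Hypothetical helper: find the single field matching `fieldName`.
  // Returns the field with its ordinal, or an error message when the
  // name matches no field or more than one field.
  def resolveField(
      fields: Seq[StructField],
      fieldName: String,
      resolver: Resolver): Either[String, (StructField, Int)] = {
    val matches = fields.zipWithIndex.filter { case (f, _) => resolver(f.name, fieldName) }
    matches match {
      case Seq() =>
        Left(s"No such struct field $fieldName in ${fields.map(_.name).mkString(", ")}")
      case Seq(single) =>
        Right(single)
      case many =>
        // Assumed wording for the ambiguity error.
        Left(s"Ambiguous reference to fields ${many.map(_._1.name).mkString(", ")}")
    }
  }
}
```

With fields `a`, `A`, and `b`, a case-sensitive lookup of `a` succeeds with ordinal 0, while a case-insensitive lookup of the same name is reported as ambiguous. Doing this lookup in the analyzer, where a resolver is available, is what lets an unresolved node stay unresolved until the check has actually run.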

SparkQA · 2015-01-29T04:27:46Z

Test build #26290 has started for PR 4068 at commit d9cf8fd.

This patch merges cleanly.

SparkQA · 2015-01-29T05:35:26Z

Test build #26290 has finished for PR 4068 at commit d9cf8fd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-29T05:35:29Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26290/
Test PASSed.

cloud-fan · 2015-02-03T08:37:28Z

Hi @yhuai , I have updated this PR introducing UnresolvedGetField to fix this issue. Do you have time to review it? Thanks!

SparkQA · 2015-02-03T08:37:50Z

Test build #26638 has started for PR 4068 at commit 1b8f924.

This patch does not merge cleanly.

marmbrus · 2015-02-04T23:30:18Z

sql/core/src/main/scala/org/apache/spark/sql/Column.scala

@@ -17,6 +17,8 @@

 package org.apache.spark.sql

+import org.apache.spark.sql.catalyst.analysis.UnresolvedGetField


Nit: Import order

marmbrus · 2015-02-04T23:32:31Z

Okay, you have convinced me that this is cleaner. Two nits about formatting and then this LGTM. Can you also fix the description of the PR, as that becomes the commit message?

Would be awesome to include this in 1.3 if you can update soon.

Thanks for working on this!

SparkQA · 2015-02-05T03:07:52Z

Test build #26815 has started for PR 4068 at commit 8d32e44.

This patch merges cleanly.

SparkQA · 2015-02-05T04:24:09Z

Test build #26815 has finished for PR 4068 at commit 8d32e44.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class UnresolvedGetField(child: Expression, fieldName: String) extends UnaryExpression
- case class GetField(child: Expression, field: StructField, ordinal: Int) extends UnaryExpression

AmplabJenkins · 2015-02-05T04:24:12Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26815/
Test PASSed.

cloud-fan · 2015-02-05T04:47:22Z

Hi @marmbrus , I have updated it, is it ready to go? Thanks.

SparkQA · 2015-02-05T08:17:55Z

Test build #26833 has started for PR 4068 at commit 085619c.

This patch merges cleanly.

SparkQA · 2015-02-05T08:24:00Z

Test build #26833 has finished for PR 4068 at commit 085619c.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-05T08:24:01Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26833/
Test FAILed.

SparkQA · 2015-02-05T09:02:58Z

Test build #26837 has started for PR 4068 at commit a6857b5.

This patch merges cleanly.

SparkQA · 2015-02-05T10:12:07Z

Test build #26837 has finished for PR 4068 at commit a6857b5.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- * @param modelClass String name for model class (used for error messages)
- case class Data(labels: Array[Double], pi: Array[Double], theta: Array[Array[Double]])
- s" but class priors vector pi had $
- s" but class conditionals array theta had $
- case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
- * @param modelClass String name for model class (used for error messages)
- * @param modelClass String name for model class (used for error messages)
- case class Data(weights: Vector, intercept: Double)
- * @param modelClass String name for model class (used for error messages)
- trait Saveable
- trait Loader[M <: Saveable]
- * @return (class name, version, metadata)
- case class UnresolvedGetField(child: Expression, fieldName: String) extends UnaryExpression
- case class GetField(child: Expression, field: StructField, ordinal: Int) extends UnaryExpression

AmplabJenkins · 2015-02-05T10:12:11Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26837/
Test PASSed.

… of ambiguous reference to fields

When the `GetField` chain (`a.b.c.d.....`) is interrupted by `GetItem` like `a.b[0].c.d....`, then the check of ambiguous reference to fields is broken.
The reason is that: for something like `a.b[0].c.d`, we first parse it to `GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")`. Then in `LogicalPlan#resolve`, we resolve `"a.b"` and build a `GetField` chain from bottom (the relation). But for the 2 outer `GetField`, we have to resolve them in `Analyzer` or do it in `GetField` lazily, check data type of child, search needed field, etc. which is similar to what we have done in `LogicalPlan#resolve`.
So in this PR, the fix is just copy the same logic in `LogicalPlan#resolve` to `Analyzer`, which is simple and quick, but I do suggest introduce `UnresolvedGetField` like I explained in #2405.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #4068 from cloud-fan/simple and squashes the following commits:

a6857b5 [Wenchen Fan] fix import order
8411c40 [Wenchen Fan] use UnresolvedGetField

(cherry picked from commit 4793c84)
Signed-off-by: Michael Armbrust <michael@databricks.com>

marmbrus · 2015-02-06T21:11:11Z

Thanks! merged to master and 1.3

marmbrus · 2015-03-03T22:48:05Z

This PR introduced two subtle regressions in the way we handle nested fields in ORDER BY queries:
https://issues.apache.org/jira/browse/SPARK-6145

cloud-fan · 2015-03-04T11:27:03Z

Hi @marmbrus , I'm looking into it, will send a fix ASAP.

marmbrus · 2015-03-04T18:48:09Z

Here is a partial solution: #4892

cloud-fan force-pushed the simple branch from bfe069b to d8c1dc9 Compare January 19, 2015 07:35

cloud-fan force-pushed the simple branch from d8c1dc9 to 53048b6 Compare January 19, 2015 08:33

yhuai reviewed Jan 28, 2015

cloud-fan force-pushed the simple branch from d9cf8fd to 1b8f924 Compare February 3, 2015 08:32

marmbrus reviewed Feb 4, 2015

cloud-fan changed the title ~~[SPARK-5278][SQL] complete the check of ambiguous reference to fields~~ [SPARK-5278][SQL] Introduce UnresolvedGetFeld and complete the check of ambiguous reference to fields Feb 5, 2015

cloud-fan changed the title ~~[SPARK-5278][SQL] Introduce UnresolvedGetFeld and complete the check of ambiguous reference to fields~~ [SPARK-5278][SQL] Introduce UnresolvedGetField and complete the check of ambiguous reference to fields Feb 5, 2015

cloud-fan force-pushed the simple branch from 8d32e44 to 085619c Compare February 5, 2015 08:13

cloud-fan added 2 commits February 5, 2015 16:58

use UnresolvedGetField

8411c40

fix import order

a6857b5

cloud-fan force-pushed the simple branch from 085619c to a6857b5 Compare February 5, 2015 08:58

asfgit closed this in 4793c84 Feb 6, 2015

cloud-fan deleted the simple branch February 7, 2015 07:47

cloud-fan mentioned this pull request Feb 7, 2015

[SPARK-2096][SQL] support dot notation on array of struct #2405

Closed
