[SPARK-2096][SQL] Correctly parse dot notations #2230

cloud-fan · 2014-09-01T12:33:25Z

First, let me write down the current projection grammar of Spark SQL:

expression                : orExpression
orExpression              : andExpression {"or" andExpression}
andExpression             : comparisonExpression {"and" comparisonExpression}
comparisonExpression      : termExpression | termExpression "=" termExpression | termExpression ">" termExpression | ...
termExpression            : productExpression {"+"|"-" productExpression}
productExpression         : baseExpression {"*"|"/"|"%" baseExpression}
baseExpression            : expression "[" expression "]" | ... | ident | ...
ident                     : identChar {identChar | digit} | delimiters | ...
identChar                 : letter | "_" | "."
delimiters                : "," | ";" | "(" | ")" | "[" | "]" | ...
projection                : expression [["AS"] ident]
projections               : projection { "," projection}

For something like a.b.c[1], the lexer treats a.b.c as a single ident (identChar includes "."), so it parses as an index into that one name.

But something like a[1].b can't be parsed correctly by the current grammar.
A simple solution is written in ParquetQuerySuite#NestedSqlParser; the changed grammar rules are:

delimiters                : "." | "," | ";" | "(" | ")" | "[" | "]" | ...
identChar                 : letter | "_"
baseExpression            : expression "[" expression "]" | expression "." ident | ... | ident | ...
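
Moving "." from identChar to delimiters changes how the input is tokenized. A minimal Python sketch of just that lexing difference, not the actual SqlParser lexer (the `tokenize` helper and its `dot_in_ident` flag are hypothetical names for illustration):

```python
import re

def tokenize(text, dot_in_ident):
    """Split text into identifier, number, and single-character tokens.

    dot_in_ident=True mimics identChar: letter | "_" | "." (current rule).
    dot_in_ident=False mimics identChar: letter | "_", with "." as a delimiter.
    """
    ident_start = r"[A-Za-z_.]" if dot_in_ident else r"[A-Za-z_]"
    ident_rest = r"[A-Za-z_.0-9]" if dot_in_ident else r"[A-Za-z_0-9]"
    # Try an identifier first, then a number, then any other single character.
    pattern = re.compile(rf"{ident_start}{ident_rest}*|\d+|\S")
    return pattern.findall(text)

print(tokenize("a.b.c[1]", dot_in_ident=True))
print(tokenize("a[1].b", dot_in_ident=True))
print(tokenize("a[1].b", dot_in_ident=False))
```

With the dot inside identifiers, the trailing .b comes out as one token, so no rule can see a field access after the index; with the dot as a delimiter, "." and b are separate tokens that the expression "." ident alternative can match.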

This works well, but it can't cover some corner cases, such as select t.a.b from table as t:

With this new grammar, t.a.b is parsed as GetField(GetField(UnResolved("t"), "a"), "b") instead of GetField(UnResolved("t.a"), "b").
However, we can't resolve t on its own, as it's not a field but the whole table. (If we could, then select t from table as t would be legal, which is unexpected.)
My solution is:

dotExpressionHeader       : ident "." ident
baseExpression            : expression "[" expression "]" | expression "." ident | ... | dotExpressionHeader  | ident | ...
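
The effect of dotExpressionHeader can be sketched with a tiny hand-written parser. This is an illustration in Python under simplifying assumptions, not the SqlParser code: the `parse` function and its `use_header` flag are hypothetical, only integer literal indexes are handled, and the tuple node names (including GetItem for indexing) are just labels for this sketch.

```python
import re

IDENT = r"[A-Za-z_][A-Za-z_0-9]*"

def parse(text, use_header):
    """Parse ident, then any number of '.ident' or '[n]' suffixes."""
    tokens = re.findall(rf"{IDENT}|\d+|\S", text)

    def is_ident(i):
        return i < len(tokens) and re.fullmatch(IDENT, tokens[i]) is not None

    if not is_ident(0):
        raise SyntaxError("expected identifier")
    # dotExpressionHeader: a leading ident "." ident stays one unresolved name,
    # so the resolver can later decide whether the first part is a table alias.
    if use_header and len(tokens) >= 3 and tokens[1] == "." and is_ident(2):
        node = ("UnResolved", tokens[0] + "." + tokens[2])
        pos = 3
    else:
        node = ("UnResolved", tokens[0])
        pos = 1
    while pos < len(tokens):
        if tokens[pos] == "." and is_ident(pos + 1):
            node = ("GetField", node, tokens[pos + 1])
            pos += 2
        elif (tokens[pos] == "[" and pos + 2 < len(tokens)
              and tokens[pos + 1].isdigit() and tokens[pos + 2] == "]"):
            node = ("GetItem", node, int(tokens[pos + 1]))
            pos += 3
        else:
            raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return node

print(parse("t.a.b", use_header=False))
print(parse("t.a.b", use_header=True))
print(parse("a[1].b", use_header=True))
```

Without the header rule, t.a.b bottoms out in UnResolved("t"); with it, the leaf is UnResolved("t.a"), while a[1].b still parses as a field access on an indexed expression.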

All test cases under sql pass locally, and I added a more complex case.
"arrayOfStruct.field1 to access all values of field1" is not supported yet. Since this PR has already changed a lot of code, I will open another PR for it.
I'm not familiar with the later optimization phase, so please correct me if I missed something.

AmplabJenkins · 2014-09-01T12:37:08Z

Can one of the admins verify this patch?

marmbrus · 2014-09-02T07:41:12Z

ok to test

SparkQA · 2014-09-02T07:44:16Z

QA tests have started for PR 2230 at commit de63082.

This patch merges cleanly.

SparkQA · 2014-09-02T07:45:13Z

QA tests have finished for PR 2230 at commit de63082.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2014-09-02T08:24:43Z

Sorry for the code style issues, fixed! Test again please.

SparkQA · 2014-09-02T08:29:05Z

QA tests have started for PR 2230 at commit 8420c84.

This patch merges cleanly.

SparkQA · 2014-09-02T10:05:52Z

QA tests have finished for PR 2230 at commit 8420c84.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

marmbrus · 2014-09-03T04:55:02Z

Thanks for working on this! The changes made to the parser seem reasonable to me. Thanks for the thorough explanation.

Can you explain your changes to LogicalPlan a little more and add some inline comments? That's a very crucial piece of code and I'm a little nervous about changing it. Also, it seems like we might be missing the distinct logic, based on the failing test case.

SparkQA · 2014-09-03T08:54:36Z

QA tests have started for PR 2230 at commit 5c70874.

This patch merges cleanly.

cloud-fan · 2014-09-03T09:03:16Z

@marmbrus Sorry for missing the distinct. Since we parse the dot in SqlParser now, the only possible formats of a name passed into LogicalPlan.resolve are "ident" and "ident.ident". So my change to LogicalPlan just simplifies the logic there. We can still roll back to the original LogicalPlan.resolve if you feel it's too radical :)
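
To illustrate what resolution could look like if only those two name shapes ever arrive, here is a rough Python sketch. It is not the code in LogicalPlan: the `resolve` function, the (qualifier, column) pair model, and the tuple node labels are all assumptions made for this example.

```python
def resolve(name, attributes):
    """Resolve `name` against `attributes`, a list of (qualifier, column) pairs.

    Hypothetical model: only the shapes "ident" and "ident.ident" are accepted,
    mirroring the assumption that the SQL parser already split longer chains.
    """
    parts = name.split(".")
    if len(parts) == 1:
        matches = [("Attribute", q, c) for q, c in attributes if c == parts[0]]
    elif len(parts) == 2:
        first, second = parts
        # Reading 1: "qualifier.column", e.g. t.a where t is a table alias.
        matches = [("Attribute", q, c) for q, c in attributes
                   if q == first and c == second]
        # Reading 2: "column.field", e.g. a.b where a is a struct column.
        matches += [("GetField", ("Attribute", q, c), second)
                    for q, c in attributes if c == first]
    else:
        raise ValueError(f"unexpected name shape: {name!r}")
    if len(matches) > 1:
        raise ValueError(f"ambiguous reference: {name!r}")
    return matches[0] if matches else None

attrs = [("t", "a"), ("t", "x")]
print(resolve("t.a", attrs))
print(resolve("a.b", attrs))
```

Note that this sketch rejects any name with more than one dot, so it only holds as long as every caller really does pre-split dotted chains.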

SparkQA · 2014-09-03T10:41:53Z

QA tests have finished for PR 2230 at commit 5c70874.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

marmbrus · 2014-09-05T02:06:32Z

Yeah, I'd like to simplify this, but unfortunately I think this version introduces a regression for Hive queries. I've made a PR (against your PR) that shows this regression: cloud-fan#1. It would be great if you could merge that and either roll back or propose an alternative. Thanks :)

cloud-fan · 2014-09-05T09:04:19Z

@marmbrus It seems the Hive parser will pass something like "a.b.c..." to LogicalPlan, so I have to roll back (and I changed dotExpressionHeader to ident "." ident {"." ident}). I have also done some work on GetField so that it supports not just StructType, but also array of struct, array of array of struct, and so on, to any nesting depth.
The idea is simple. If you want a.b to work, then a must be some level of nested array of struct (level 0 means just a StructType), and the result of a.b is the same level of nested array of b's type. In this way, we can handle nested arrays of struct and a simple struct in the same process.
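
The typing rule described here can be sketched as a short recursive function. This is a Python illustration of the idea only, not the GetField change itself: the `get_field_type` name and the tuple encoding of types are assumptions made for the example.

```python
def get_field_type(data_type, field_name):
    """Return the type of `data_type.field_name` under the nesting rule.

    Types are modeled as plain Python values (an assumption for this sketch):
      ("array", element_type)         an array
      ("struct", {name: field_type})  a struct
      any string, e.g. "int"          a primitive
    """
    kind = data_type[0] if isinstance(data_type, tuple) else None
    if kind == "struct":
        # Level 0: a plain struct, so the result is the field's own type.
        fields = data_type[1]
        if field_name not in fields:
            raise KeyError(f"no field {field_name!r}")
        return fields[field_name]
    if kind == "array":
        # Peel one array level, recurse, then wrap the result back up,
        # so the result keeps the same nesting depth as the input.
        return ("array", get_field_type(data_type[1], field_name))
    raise TypeError(f"cannot access field {field_name!r} on {data_type!r}")

struct = ("struct", {"b": "int"})
print(get_field_type(struct, "b"))
print(get_field_type(("array", ("array", struct)), "b"))
```

A plain struct yields the field type, and each enclosing array level is carried over to the result unchanged.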

cloud-fan · 2014-09-05T09:09:13Z

I'm not sure how to modify lazy val resolved in GetField, since it now handles more than just StructType. Currently I have just removed the type check. What do you think? @marmbrus

SparkQA · 2014-09-05T23:41:38Z

Can one of the admins verify this patch?

marmbrus · 2014-09-08T23:04:40Z

ok to test

marmbrus · 2014-09-08T23:07:56Z

Hmm, does Hive support using dot notation to access fields that are arbitrarily nested in arrays? If not I think it would be better to just support one level. Also, the code added for that feature uses a lot of mutable state and is a little hard to follow (in addition to removing the type check).

Since I'd really like to include your parser fixes and test cleanup, what do you think about breaking out the GetField on arrays change into another PR?

SparkQA · 2014-09-09T06:49:23Z

QA tests have started for PR 2230 at commit ca43e6d.

This patch merges cleanly.

cloud-fan · 2014-09-09T08:08:15Z

Actually, Hive doesn't support using dot notation to access fields of a nested array, even at one level. Anyway, I will put this support in another PR to keep this PR simple and clear :)

SparkQA · 2014-09-09T08:34:24Z

QA tests have finished for PR 2230 at commit ca43e6d.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2014-09-09T08:59:37Z

The failed test case seems to be a regression test for a new fix. I have rebased to include that fix. Test again please.

SparkQA · 2014-09-09T09:49:41Z

QA tests have started for PR 2230 at commit 5adb6bf.

This patch merges cleanly.

SparkQA · 2014-09-09T11:24:58Z

QA tests have finished for PR 2230 at commit 5adb6bf.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

marmbrus · 2014-09-10T02:11:58Z

Yeah I think the test failure was unrelated, though unfortunately this is out of date again. Mind updating one more time? Thanks!

cloud-fan · 2014-09-10T04:00:28Z

rebase done, test again please.

marmbrus · 2014-09-10T06:53:46Z

Jenkins, test this please

SparkQA · 2014-09-10T07:51:31Z

QA tests have started for PR 2230 at commit e1a8898.

This patch merges cleanly.

SparkQA · 2014-09-10T09:46:32Z

QA tests have finished for PR 2230 at commit e1a8898.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

marmbrus · 2014-09-10T19:57:13Z

Thanks for working on this! Merged to master.

cloud-fan changed the title ~~SPARK-2096 Correctly parse dot notations~~ [SPARK-2096][SQL] Correctly parse dot notations Sep 1, 2014

chuxi mentioned this pull request Sep 2, 2014

SPARK-2096 [SQL]: Correctly parse dot notations for accessing an array of structs #2082

Closed

cloud-fan force-pushed the dot branch from ca43e6d to 5adb6bf Compare September 9, 2014 08:57

cloud-fan added 3 commits September 10, 2014 11:33

SPARK-2096 Correctly parse dot notations

dc31698

split long line

95d733f

some enhance

16bc4c6

marmbrus and others added 3 commits September 10, 2014 11:33

add regression test for doubly nested data

a58df40

rollback LogicalPlan, support dot operation on nested array type

ee8a724

remove support for arbitrary nested arrays

e1a8898

cloud-fan force-pushed the dot branch from 5adb6bf to e1a8898 Compare September 10, 2014 03:55

asfgit closed this in e4f4886 Sep 10, 2014
