Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-2096][SQL] Correctly parse dot notations #2230

Closed
wants to merge 6 commits into from

Conversation

cloud-fan
Copy link
Contributor

First let me write down the current projections grammar of spark sql:

expression                : orExpression
orExpression              : andExpression {"or" andExpression}
andExpression             : comparisonExpression {"and" comparisonExpression}
comparisonExpression      : termExpression | termExpression "=" termExpression | termExpression ">" termExpression | ...
termExpression            : productExpression {"+"|"-" productExpression}
productExpression         : baseExpression {"*"|"/"|"%" baseExpression}
baseExpression            : expression "[" expression "]" | ... | ident | ...
ident                     : identChar {identChar | digit} | delimiters | ...
identChar                 : letter | "_" | "."
delimiters                : "," | ";" | "(" | ")" | "[" | "]" | ...
projection                : expression [["AS"] ident]
projections               : projection { "," projection}

For something like a.b.c[1], it will be parsed as:

But for something like a[1].b, the current grammar can't parse it correctly.
A simple solution is written in ParquetQuerySuite#NestedSqlParser, changed grammars are:

delimiters                : "." | "," | ";" | "(" | ")" | "[" | "]" | ...
identChar                 : letter | "_"
baseExpression            : expression "[" expression "]" | expression "." ident | ... | ident | ...

This works well, but can't cover some corner case like select t.a.b from table as t:

t.a.b parsed as GetField(GetField(UnResolved("t"), "a"), "b") instead of GetField(UnResolved("t.a"), "b") using this new grammar.
However, we can't resolve t as it's not a filed, but the whole table.(if we could do this, then select t from table as t is legal, which is unexpected)
My solution is:

dotExpressionHeader       : ident "." ident
baseExpression            : expression "[" expression "]" | expression "." ident | ... | dotExpressionHeader  | ident | ...

I passed all test cases under sql locally and add a more complex case.
"arrayOfStruct.field1 to access all values of field1" is not supported yet. Since this PR has changed a lot of code, I will open another PR for it.
I'm not familiar with the latter optimize phase, please correct me if I missed something.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@cloud-fan cloud-fan changed the title SPARK-2096 Correctly parse dot notations [SPARK-2096][SQL] Correctly parse dot notations Sep 1, 2014
@marmbrus
Copy link
Contributor

marmbrus commented Sep 2, 2014

ok to test

@SparkQA
Copy link

SparkQA commented Sep 2, 2014

QA tests have started for PR 2230 at commit de63082.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 2, 2014

QA tests have finished for PR 2230 at commit de63082.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

sorry for the code style, fixed! Test again please

@SparkQA
Copy link

SparkQA commented Sep 2, 2014

QA tests have started for PR 2230 at commit 8420c84.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 2, 2014

QA tests have finished for PR 2230 at commit 8420c84.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Copy link
Contributor

marmbrus commented Sep 3, 2014

Thanks for working on this! The changes made to the parser seem reasonable to me. Thanks for the thorough explanation.

Can you explain your changes to LogicalPlan a little more and add some inline comments. Thats a very crucial piece of code and I'm a little nervous about changing it. Also it seems like we might be missing the distinct logic based on the failing test case.

@SparkQA
Copy link

SparkQA commented Sep 3, 2014

QA tests have started for PR 2230 at commit 5c70874.

  • This patch merges cleanly.

@cloud-fan
Copy link
Contributor Author

@marmbrus Sorry for missing the distinct. Since we parse the dot in SqlParser now, the only possible formats of name passed into LogicalPlan.resolve are "ident" and "ident.ident". So my change to LogicalPlan is just simplify the logic there. We can still roll back to the origin LogicalPlan.resolve if you feel it too radical :)

@SparkQA
Copy link

SparkQA commented Sep 3, 2014

QA tests have finished for PR 2230 at commit 5c70874.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Copy link
Contributor

marmbrus commented Sep 5, 2014

Yeah, I'd like to simplify this, but unfortunately I think this version introduces a regression for hive queries. I've made a PR (against your PR) that shows this regression. cloud-fan#1 Would be great if you could merge that and either roll back or propose an alternative. Thanks :)

@cloud-fan
Copy link
Contributor Author

@marmbrus Seems hive parser will pass something like "a.b.c..." to LogicalPlan, so I have to roll back(and I changed dotExpressionHeader to ident "." ident {"." ident}). And I have done some work on GetField to let it support not just StructType, but also array of struct, or array of array of struct, or array of array of ... struct.
The idea is simple. If you want a.b to work, then a must be some level of nested array of struct(level 0 means just a StructType). And the result of a.b is same level of nested array of b-type. In this way, we can handle nested array of strcut and simple struct in same process.

@cloud-fan
Copy link
Contributor Author

I'm not sure how to modify lazy val resolved in GetField since it handles not only StructType now. Currently I just removed the type check. What do you think? @marmbrus

@SparkQA
Copy link

SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?

@marmbrus
Copy link
Contributor

marmbrus commented Sep 8, 2014

ok to test

@marmbrus
Copy link
Contributor

marmbrus commented Sep 8, 2014

Hmm, does Hive support using dot notation to access fields that are arbitrarily nested in arrays? If not I think it would be better to just support one level. Also, the code added for that feature uses a lot of mutable state and is a little hard to follow (in addition to removing the type check).

Since I'd really like to include your parser fixes and test cleanup, what do you think about breaking out the GetField on arrays change into another PR?

@SparkQA
Copy link

SparkQA commented Sep 9, 2014

QA tests have started for PR 2230 at commit ca43e6d.

  • This patch merges cleanly.

@cloud-fan
Copy link
Contributor Author

Actually hive doesn't support using dot notation to access fields of nested array, even one level. Anyway, I will put this support in another PR to keep this PR simple and clear :)

@SparkQA
Copy link

SparkQA commented Sep 9, 2014

QA tests have finished for PR 2230 at commit ca43e6d.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

The failed test case seems a regression test for a new fix. I have done rebase to include the new fix. Test again please.

@SparkQA
Copy link

SparkQA commented Sep 9, 2014

QA tests have started for PR 2230 at commit 5adb6bf.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 9, 2014

QA tests have finished for PR 2230 at commit 5adb6bf.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Copy link
Contributor

Yeah I think the test failure was unrelated, though unfortunately this is out of date again. Mind updating one more time? Thanks!

@cloud-fan
Copy link
Contributor Author

rebase done, test again please.

@marmbrus
Copy link
Contributor

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Sep 10, 2014

QA tests have started for PR 2230 at commit e1a8898.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 10, 2014

QA tests have finished for PR 2230 at commit e1a8898.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Copy link
Contributor

Thanks for working on this! Merged to master.

@asfgit asfgit closed this in e4f4886 Sep 10, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants