[SPARK-12719][SQL] SQL generation support for generators, including UDTF #11563

Closed
wants to merge 4 commits into from

dilipbiswal
Contributor

What changes were proposed in this pull request?

This PR adds SQL generation for analyzed logical plans that contain the Generate operator.

Sample plan:

GlobalLimit 3
+- LocalLimit 3
   +- Project [gencol2#204]
      +- Generate explode(gencol1#203), true, false, Some(gentab2), [gencol2#204]
         +- Generate explode(array(array(1, 2, 3))), true, false, Some(gentab1), [gencol1#203]
            +- MetastoreRelation default, t4, None

Generated Query:

SELECT `gentab2`.`gencol2` FROM `default`.`t4` LATERAL VIEW explode(array(array(1, 2, 3))) `gentab1` AS `gencol1` LATERAL VIEW explode(`gentab1`.`gencol1`) `gentab2` AS `gencol2` LIMIT 3
  • Generators can be specified either in the projection list or in the LATERAL VIEW clause.
  • The first case is handled in projectToSQL.
  • The second case is handled in generateToSQL.
  • This PR also contains the fix for SPARK-13698. As soon as that is merged, I will rebase and submit again.
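
As a rough illustration of the LATERAL VIEW case (not the code in this PR), the sketch below renders a chain of Generate nodes as a FROM clause. The `Plan`, `Table`, and `Generate` classes and the `LateralViewSketch` object are hypothetical stand-ins invented for this example; they are not Spark's Catalyst classes, and the generator expression is assumed to be already rendered as a SQL string.

```scala
// Toy stand-ins for plan nodes; these are NOT Spark's Catalyst classes.
sealed trait Plan
case class Table(database: String, name: String) extends Plan
// generatorSql is the already-rendered generator call,
// e.g. "explode(array(array(1, 2, 3)))".
case class Generate(
    generatorSql: String,
    outer: Boolean,
    qualifier: String,
    columns: Seq[String],
    child: Plan) extends Plan

object LateralViewSketch {
  private def quote(name: String): String = s"`$name`"

  // Renders the base table first, then one LATERAL VIEW per Generate node.
  // Recursing into the child before appending keeps the clauses in plan order,
  // so a later view can refer to columns produced by an earlier one.
  def fromClause(plan: Plan): String = plan match {
    case Table(db, name) =>
      s"${quote(db)}.${quote(name)}"
    case Generate(gen, outer, qualifier, columns, child) =>
      val outerKw = if (outer) " OUTER" else ""
      val cols = columns.map(quote).mkString(", ")
      s"${fromClause(child)} LATERAL VIEW$outerKw $gen ${quote(qualifier)} AS $cols"
  }
}
```

Building the two nested `Generate` nodes from the sample plan on top of `Table("default", "t4")` and passing them to `LateralViewSketch.fromClause` yields the FROM portion of the generated query shown above.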

How was this patch tested?

Added test cases in LogicalPlanToSQLSuite.
Also ran HiveCompatibilitySuite to check for failures in SQL generation for Generate plans.


@dilipbiswal
Contributor Author

cc @liancheng @gatorsmile
Thanks to @gatorsmile for letting me work on this :-)

@AmplabJenkins

Can one of the admins verify this patch?

}

Generate(generator, join = true, outer = outer, Some(alias.toLowerCase), attributes, child)
Generate(
generator,
Member

Contributor Author

@gatorsmile Thanks! In the doc, 4-space indentation applies to function declarations. Does it also apply here?

Member

nvm, you are right. This is part of #11538
@liancheng already reviewed it. Thanks!

@gatorsmile
Member

Overall, LGTM. Thanks!

@@ -445,4 +461,86 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils {
"f1", "b[0].f1", "f1", "c[foo]", "d[0]"
)
}

test("SQL generation for generate") {
Contributor

this is getting pretty long. i'd separate it into multiple logically separate test cases.

Contributor Author

@rxin ok Reynold. I will group them into different test cases.

Contributor Author

@rxin I have split the tests into 5 groups. Please let me know if it looks OK to you.

@rxin
Contributor

rxin commented Mar 7, 2016

Why do we need two separate cases here?

@dilipbiswal
Contributor Author

@rxin Hi Reynold,

We have two cases to handle.

SELECT explode(array(1,2,3)) FROM src
SELECT gentab2.* FROM t1 LATERAL VIEW explode(array(array(1,2,3))) gentab1 AS gencol1 LATERAL VIEW explode(gentab1.gencol1) gentab2 AS gencol2

Currently, I handle the first case in projToSql and the 2nd case in generateToSql,
as we wanted to generate SQLs which is closer to the original SQL.

A lateral view can also refer to columns from tables that precede it, so I felt it was
safer to generate SQL very close to the source SQL to reduce risk. I also thought
about treating the first case as a special case of LATERAL VIEW. In that case we
would have to generate a table alias, which is missing in case 1, and fix up
the projection list above to refer to it.

However, I went with the approach in this PR because it didn't seem too complex and it retained the layout of the original SQL. I could easily be overlooking something here and would appreciate your guidance. Please let me know.
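
For illustration only, here is a minimal sketch of the alternative mentioned above: rendering every generator as a LATERAL VIEW, synthesizing a table alias when the generator has none, and qualifying the projection with that alias. It is not code from this PR. The `Gen` class, the `UnifiedGeneratorSketch` object, and the `gen_subquery_N` alias scheme are all invented for this example, and the sketch assumes the query has a table in its FROM clause and projects only the last generator's columns.

```scala
// Hypothetical model of a generator; NOT Spark's Catalyst API.
// qualifier is None when the generator came from the projection list.
case class Gen(generatorSql: String, qualifier: Option[String], columns: Seq[String])

object UnifiedGeneratorSketch {
  private def quote(name: String): String = s"`$name`"

  // gens is ordered innermost first, matching the order of the FROM clause.
  def toSQL(tableSql: String, gens: Seq[Gen]): String = {
    require(gens.nonEmpty, "expected at least one generator")

    // Use the existing qualifier, or synthesize one (invented naming scheme).
    val aliased: Seq[(String, Gen)] = gens.zipWithIndex.map { case (g, i) =>
      (g.qualifier.getOrElse(s"gen_subquery_$i"), g)
    }

    val views = aliased.map { case (alias, g) =>
      val cols = g.columns.map(quote).mkString(", ")
      s"LATERAL VIEW ${g.generatorSql} ${quote(alias)} AS $cols"
    }

    // Fix up the projection so it refers to the alias that was actually used.
    val (lastAlias, lastGen) = aliased.last
    val projection =
      lastGen.columns.map(c => s"${quote(lastAlias)}.${quote(c)}").mkString(", ")

    s"SELECT $projection FROM $tableSql ${views.mkString(" ")}"
  }
}
```

With a single unqualified generator over `src`, this produces a LATERAL VIEW with the synthesized alias `gen_subquery_0`; with the two qualified generators over `t4`, the existing aliases `gentab1` and `gentab2` are kept.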

@rxin
Contributor

rxin commented Mar 8, 2016

"as we wanted to generate SQLs which is closer to the original SQL"

Why is this a goal? I worry about the fragility of these two cases, if we really only need one to satisfy correctness.

@dilipbiswal
Contributor Author

@rxin Thanks for the input. Let me try to work on it and see if I encounter any issues.

@cloud-fan
Contributor

Hi @dilipbiswal, could you describe the pattern of a logical plan containing a Generator? And does the pattern differ when the generator is in the projection list or in a LATERAL VIEW?

@cloud-fan
Contributor

I think making generated SQL closer to the original one is not our goal, instead the goal should be: make your approach as simple as possible, so that it's easy to reason about, verify the correctness, and review the code.

@dilipbiswal
Contributor Author

@cloud-fan Thanks. I understand now from Reynold's and your feedback. I wrongly assumed the goal was to make a best effort to retain the original SQL layout.

Regarding the pattern of logical plans in the two cases, here is
what comes to mind now. I will update if I remember more.

  • Projection list
    • The Generate node has no qualifier.
    • There can be only one generator in the projection list.
    • Join is not always true.
  • LATERAL VIEW
    • The Generate node has a qualifier (column names are optional).
    • There can be multiple generators, so we need to preserve the order of the FROM clause.
    • Join is always true, as there has to be a table before the LATERAL VIEW. This
      can cause ambiguity in column resolution if we express the projection list generators
      as lateral views and don't qualify the projection list properly.

Project [gencol2#204]
+- Generate explode(gencol1#203), true, false, Some(gentab2), [gencol2#204]
   +- Generate explode(array(array(1, 2, 3))), true, false, Some(gentab1), [gencol1#203]
      +- MetastoreRelation default, t4, None

In the above plan, we have two LATERAL VIEW clauses with qualifiers gentab1 and gentab2, respectively.

Project [value#53]
+- Generate explode(array(1, 2, 3)), false, false, None, [value#53]
   +- MetastoreRelation default, src, None

In the above case, the generator is in the projection list.

There is a test that uses a generator but has no table in the FROM clause, like the following:

select explode(array(1,2,3)) AS gencol

I suspect this may cause a problem if we need to express the generator as a LATERAL VIEW.

Please let me know if you need any other info.
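
The distinctions described above can be sketched as a small classifier. This is only an illustration under assumptions: the node types below are hypothetical stand-ins rather than Spark's Catalyst classes, `NoFrom` is an invented marker for a query without a table in the FROM clause, and the presence of a qualifier is assumed to be the only signal needed to tell the two shapes apart.

```scala
// Hypothetical plan nodes for illustration; NOT Spark's Catalyst classes.
sealed trait Node
case object NoFrom extends Node                  // query with no table in FROM
case class Relation(name: String) extends Node
case class GenerateNode(
    join: Boolean,
    outer: Boolean,
    qualifier: Option[String],
    child: Node) extends Node

sealed trait GeneratorShape
case object InLateralView extends GeneratorShape
case object InProjectionList extends GeneratorShape
case object InProjectionListNoFrom extends GeneratorShape

object GeneratorShapeSketch {
  // A qualifier marks a LATERAL VIEW; without one the generator came from
  // the projection list, possibly with no table underneath it.
  def classify(g: GenerateNode): GeneratorShape = g match {
    case GenerateNode(_, _, Some(_), _)   => InLateralView
    case GenerateNode(_, _, None, NoFrom) => InProjectionListNoFrom
    case GenerateNode(_, _, None, _)      => InProjectionList
  }
}
```

The `InProjectionListNoFrom` case is kept separate because it is the one suspected above to be a problem when the generator has to be expressed as a LATERAL VIEW.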

@dilipbiswal
Contributor Author

Closing this in favor of #11696.
