
[SPARK-11619][SQL] cannot use UDTF in DataFrame.selectExpr #9981

Closed
wants to merge 5 commits

Conversation

dilipbiswal
Contributor

Description of the problem from @cloud-fan

Actually this line: https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L689
When we use selectExpr, we pass an UnresolvedFunction to DataFrame.select and fall into the last case. A workaround is to do special handling for UDTFs like we did for explode (and json_tuple in 1.6), wrapping them with MultiAlias.
Another workaround is using expr, for example, df.select(expr("explode(a)").as(Nil)), I think selectExpr is no longer needed after we have the expr function....
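The core issue above can be mimicked outside Spark: a generator-style function produces several output columns, so the single-name alias applied by the catch-all case loses columns, while a multi-alias keeps one name per column. A minimal pure-Scala sketch with toy stand-ins (Gen, Alias, and MultiAlias here are illustrative, not Spark's real classes):

```scala
// Toy model of the aliasing problem: a generator-style function (like
// explode or json_tuple) yields several output columns, so a single-name
// alias collapses them while a multi-alias keeps one name per column.
sealed trait Expr { def outputNames: Seq[String] }

// A generator producing numOutputs columns, named c0, c1, ... by default.
case class Gen(numOutputs: Int) extends Expr {
  def outputNames: Seq[String] = (0 until numOutputs).map(i => s"c$i")
}

// Single-name alias: correct for scalar expressions, lossy for generators.
case class Alias(child: Expr, name: String) extends Expr {
  def outputNames: Seq[String] = Seq(name)
}

// Multi-alias: an empty name list means "use the generator's own names".
case class MultiAlias(child: Gen, names: Seq[String]) extends Expr {
  def outputNames: Seq[String] = if (names.isEmpty) child.outputNames else names
}

val g = Gen(3)
println(Alias(g, "explode(a)").outputNames)  // three columns collapsed to one name
println(MultiAlias(g, Nil).outputNames)      // all three column names survive
```

The special handling mentioned above amounts to routing generators to the MultiAlias shape instead of the lossy single Alias.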

@dilipbiswal
Contributor Author

@cloud-fan @yhuai Can you please take a look? Thanks in advance.

@dilipbiswal dilipbiswal changed the title [SPARK-11619] cannot use UDTF in DataFrame.selectExpr [SPARK-11619][SQL] cannot use UDTF in DataFrame.selectExpr Nov 25, 2015
@yhuai
Contributor

yhuai commented Nov 25, 2015

ok to test

@SparkQA

SparkQA commented Nov 26, 2015

Test build #46719 has finished for PR 9981 at commit 9620468.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 26, 2015

Test build #46737 has finished for PR 9981 at commit 9bee82b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

@cloud-fan Can you please help trigger a retest? It looks like an unrelated failure.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Nov 26, 2015

Test build #46747 has finished for PR 9981 at commit 9bee82b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

@cloud-fan It failed again in the same way. Looks like it's some intermittent issue, and it's quite fond of me :-). Wenchen, is there a way I can issue a retest request myself so that I bug you less? :-)

@cloud-fan
Contributor

You can also try commenting "retest this please".

@dilipbiswal
Contributor Author

@cloud-fan Actually, you had advised me to do that a while back; however, it does not seem to work for me...
Thanks for triggering the test. Do you have any idea about this intermittent failure? It seems to happen fairly frequently these days. Is it tied to the node that gets picked to run the test?

@cloud-fan
Contributor

Yeah, there are some flaky tests; we are working on it.

@SparkQA

SparkQA commented Nov 26, 2015

Test build #46762 has finished for PR 9981 at commit 9bee82b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

ping @cloud-fan @yhuai

@cloud-fan
Contributor

Thanks for working on it!

Actually, I'm not sure this is the right approach... We still handle explode and json_tuple specially; what's worse, we hardcode their names, which makes the code hard to maintain (it needs updating whenever we rename a UDTF or add a new one).
Since we have workarounds for this problem, I think we don't need to hurry; we can spend more time thinking of a better approach.

@dilipbiswal
Contributor Author

@cloud-fan Sure, I will close this PR for now. If you think of a better approach and need my help in any way, please let me know. Thanks for your feedback.

@dilipbiswal
Contributor Author

@cloud-fan Hi Wenchen, I was thinking about this and want to run an idea by you. Is it OK if we add the logic to inject the MultiAlias in our analyzer? As an experiment, I put the following code in ResolveGenerate and it seems to work; I am also running the test suites to make sure.

val newProj: Seq[NamedExpression] = projectList.map {
  case a @ Alias(g: Generator, name) if g.resolved &&
      g.elementTypes.size > 1 && name.equals(a.child.prettyString) =>
    MultiAlias(g, Nil)
  case e => e
}

One thing: name.equals(a.child.prettyString) does not really tell us whether the alias is user-specified or system-generated. We could flag an alias to differentiate if required. I just wanted to know what you think of this approach. Thanks in advance.

@cloud-fan
Contributor

This still looks hacky to me. How about this:
add an extra case in Column.named to delay the aliasing of UnresolvedFunction, like
case func: UnresolvedFunction => UnresolvedAlias(func); then in Analyzer.ResolveAliases, we can handle this UnresolvedAlias once the function has been resolved.
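The proposed flow can be sketched with pure-Scala toy stand-ins (illustrative names only, not Spark's actual classes): Column.named stops eagerly aliasing unresolved functions and defers the decision, and a ResolveAliases-style step later picks MultiAlias for generators and a plain Alias for everything else.

```scala
// Toy stand-ins for the delayed-aliasing idea (not Spark's real classes).
sealed trait Expr
case class UnresolvedFunction(name: String) extends Expr
case class UnresolvedAlias(child: Expr) extends Expr
case class Alias(child: Expr, name: String) extends Expr
case class MultiAlias(child: Expr, names: Seq[String]) extends Expr
case class ResolvedGenerator(name: String, numOutputs: Int) extends Expr

// Stand-in for the extra case in Column.named: delay the aliasing of
// unresolved functions instead of naming them eagerly.
def named(e: Expr): Expr = e match {
  case f: UnresolvedFunction => UnresolvedAlias(f)
  case other                 => Alias(other, "col")
}

// Stand-in for Analyzer.ResolveAliases: by this point the function can be
// resolved, so a generator gets a MultiAlias and a scalar a plain Alias.
def resolveAliases(e: Expr, resolve: String => Expr): Expr = e match {
  case UnresolvedAlias(f: UnresolvedFunction) =>
    resolve(f.name) match {
      case g: ResolvedGenerator => MultiAlias(g, Nil)
      case scalar               => Alias(scalar, f.name)
    }
  case other => other
}

val deferred = named(UnresolvedFunction("explode"))
println(resolveAliases(deferred, _ => ResolvedGenerator("explode", 2)))
```

The point of the deferral is that only after resolution do we know whether the function is a generator, which is exactly the information a correct aliasing decision needs.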

@dilipbiswal
Contributor Author

@cloud-fan Thank you very much. Actually, ResolveAliases already has the code to inject MultiAlias for resolved Generators, so we are good (thanks!!)

One question:
currently in Column.named, an UnresolvedFunction falls through to the last case, which uses prettyString to alias the column:
Alias(expr, expr.prettyString)()

In our new approach, at ResolveAliases time we fall into the other case and use a generated alias like c0. Is this OK? Or should we try to preserve the pretty-alias semantics by introducing new cases for different function expressions? Please let me know.

@cloud-fan
Contributor

Yeah, we should introduce a new case and add UnresolvedAlias only for UnresolvedFunction.

@dilipbiswal dilipbiswal reopened this Dec 4, 2015
@dilipbiswal
Contributor Author

@cloud-fan Hi Wenchen, I changed it as per your suggestion. Please take a look.

@cloud-fan
Contributor

ok to test

case jt: JsonTuple => MultiAlias(jt, Nil)

case func: UnresolvedFunction => UnresolvedAlias(func)
Contributor

I missed one thing: this will change the alias of functions which are not generators. Maybe we should add an optional name in UnresolvedAlias to use as the default alias, instead of c0, c1, etc.

Contributor Author

I was thinking exactly the same thing (when I posted my last question) @cloud-fan :-). Thank you. I will make the change.
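The optional-name idea can be sketched in isolation (hypothetical shapes mirroring the discussion, not the actual patch): the unresolved alias carries an optional default name, and resolution falls back to a generated c0-style name only when none was carried.

```scala
// Toy sketch: carry an optional default alias through resolution so that
// non-generator functions keep their "pretty" name instead of c0, c1, ...
// (illustrative names, not Spark's actual code).
case class UnresolvedAlias(childName: String, aliasName: Option[String] = None)

// Prefer the carried name; fall back to the generated one (e.g. "c0").
def finalName(u: UnresolvedAlias, generatedName: String): String =
  u.aliasName.getOrElse(generatedName)

println(finalName(UnresolvedAlias("f", Some("json_tuple(jstring, 'f1')")), "c0"))
println(finalName(UnresolvedAlias("f"), "c0"))
```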

@SparkQA

SparkQA commented Dec 4, 2015

Test build #47192 has finished for PR 9981 at commit af3963c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 5, 2015

Test build #47215 has finished for PR 9981 at commit 0e6d6d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental): case class UnresolvedAlias(child: Expression, aliasName: Option[String] = None)

@dilipbiswal
Contributor Author

@cloud-fan Hi Wenchen, can you please take a look and let me know what you think? Thanks.

@@ -149,12 +149,12 @@ class Analyzer(
     exprs.zipWithIndex.map {
       case (expr, i) =>
         expr transform {
-          case u @ UnresolvedAlias(child) => child match {
+          case u @ UnresolvedAlias(child, _) => child match {
Contributor

nit: you can match the aliasName here, like case u @ UnresolvedAlias(child, optionalAliasName)

@cloud-fan
Contributor

overall LGTM, some minor comments

@dilipbiswal
Contributor Author

@cloud-fan Thanks a lot. Addressed your comments.

@SparkQA

SparkQA commented Dec 5, 2015

Test build #47219 has finished for PR 9981 at commit 7674658.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental): case class UnresolvedAlias(child: Expression, aliasName: Option[String] = None)

@dilipbiswal
Contributor Author

@yhuai Hi Yin, Wenchen has looked over the changes. Can you please let me know what you think?

@@ -73,6 +73,10 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
checkAnswer(
df.select($"key", functions.json_tuple($"jstring", "f1", "f2", "f3", "f4", "f5")),
expected)

checkAnswer(
df.selectExpr("key", "json_tuple(jstring, 'f1', 'f2', 'f3', 'f4', 'f5')"),
Contributor

Just want to double check, after selectExpr, columns are key, f1, f2, f3, f4, f5, right?

Contributor Author

@yhuai Hi Yin, actually the f1-f5 columns are being reported as c0-c5. I am debugging now..

Contributor

Maybe it is good to also check the column names of functions.json_tuple($"jstring", "f1", "f2", "f3", "f4", "f5")?

Contributor Author

@yhuai Hi Yin, for the functions.json_tuple case the column names are also c0-c5. So did it always work like this?

Contributor Author

@yhuai Hello Yin, I just debugged the code a little and am trying to understand it. In the json_tuple function in jsonExpression.scala, we compute the elementTypes as follows:

override def elementTypes: Seq[(DataType, Boolean, String)] = fieldExpressions.zipWithIndex.map {
  case (_, idx) => (StringType, true, s"c$idx")
}

This name is then used when building the generator output in makeGeneratorOutput() in the Analyzer. Do you think we should change this?
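The naming pattern quoted above can be exercised in isolation: zipping the field expressions with their index yields c0, c1, ... regardless of the field names actually passed in. A minimal sketch (the field list here is made up):

```scala
// Illustrative only: the s"c$idx" default-naming pattern, applied to a
// stand-in list of field names. The literal field names are ignored.
val fields = Seq("f1", "f2", "f3")
val defaultNames = fields.zipWithIndex.map { case (_, idx) => s"c$idx" }
println(defaultNames)  // List(c0, c1, c2)
```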

Contributor

I checked with Hive: for select json_tuple('{"a":1}', 'a');, the output column is c0, which is different from when the UDTF is used in a lateral view.

Contributor Author

@cloud-fan Thanks a lot for trying it in Hive. Wenchen, I searched for "lateral view" in the Spark code and didn't find a test case; I wanted to debug it to learn more about it. Also, I made a change in the elementTypes computation like the following:

override def elementTypes: Seq[(DataType, Boolean, String)] = fieldExpressions.zipWithIndex.map {
  case (l @ Literal(value, _), idx) if value.toString() != "null" =>
    (StringType, true, value.toString())
  case (_, idx) => (StringType, true, s"c$idx")
}

I can now see the alias names correctly. I am not sure this is the right change, however. Do you have any thoughts? Thank you.

Contributor

For lateral view, I think column aliases are required, right? I am fine with using ci as the column name, as long as the json_tuple function and json_tuple in selectExpr have consistent behavior.

@dilipbiswal
Contributor Author

@yhuai Hi Yin, given we have consistent column naming (ci) in both the selectExpr and function cases, do the changes look OK to you?
If we want to change the column naming to use f1..fn, I can change it in another PR. Please let me know.

@yhuai
Contributor

yhuai commented Dec 17, 2015

test this please

@yhuai
Contributor

yhuai commented Dec 17, 2015

Let's trigger a new test run since the last one was several days ago.

@yhuai
Contributor

yhuai commented Dec 17, 2015

Looks good to me.

@yhuai
Contributor

yhuai commented Dec 17, 2015

test this please

@SparkQA

SparkQA commented Dec 17, 2015

Test build #47871 has finished for PR 9981 at commit 7674658.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental): case class UnresolvedAlias(child: Expression, aliasName: Option[String] = None)

@dilipbiswal
Contributor Author

@yhuai Hi Yin, the failure does not seem related to the change. Can we please retest?

@gatorsmile
Member

retest this please

@dilipbiswal
Contributor Author

@gatorsmile Thanks. Unfortunately this run also failed very early in the cycle :(

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Dec 17, 2015

Test build #47896 has finished for PR 9981 at commit 7674658.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental): case class UnresolvedAlias(child: Expression, aliasName: Option[String] = None)

@cloud-fan
Contributor

LGTM

@yhuai
Contributor

yhuai commented Dec 18, 2015

Merging to master

@asfgit asfgit closed this in ee444fe Dec 18, 2015
@dilipbiswal
Contributor Author

Many thanks @yhuai @cloud-fan
