
[SPARK-11619][SQL] cannot use UDTF in DataFrame.selectExpr #9981

Closed
wants to merge 5 commits

Conversation

dilipbiswal
Contributor

Description of the problem from @cloud-fan

Actually this line: https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L689
When we use selectExpr, we pass an UnresolvedFunction to DataFrame.select and fall into the last case. A workaround is to do special handling for UDTFs like we did for explode (and json_tuple in 1.6), wrapping them with MultiAlias.
Another workaround is using expr, for example, df.select(expr("explode(a)").as(Nil)), I think selectExpr is no longer needed after we have the expr function....
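The core issue above can be mimicked outside Spark: a generator-style function produces several output columns, so the single-name alias applied by the catch-all case loses columns, while a multi-alias keeps one name per column. A minimal pure-Scala sketch with toy stand-ins (Gen, Alias, and MultiAlias here are illustrative, not Spark's real classes):

```scala
// Toy model of the aliasing problem: a generator-style function (like
// explode or json_tuple) yields several output columns, so a single-name
// alias collapses them while a multi-alias keeps one name per column.
sealed trait Expr { def outputNames: Seq[String] }

// A generator producing numOutputs columns, named c0, c1, ... by default.
case class Gen(numOutputs: Int) extends Expr {
  def outputNames: Seq[String] = (0 until numOutputs).map(i => s"c$i")
}

// Single-name alias: correct for scalar expressions, lossy for generators.
case class Alias(child: Expr, name: String) extends Expr {
  def outputNames: Seq[String] = Seq(name)
}

// Multi-alias: an empty name list means "use the generator's own names".
case class MultiAlias(child: Gen, names: Seq[String]) extends Expr {
  def outputNames: Seq[String] = if (names.isEmpty) child.outputNames else names
}

val g = Gen(3)
println(Alias(g, "explode(a)").outputNames)  // three columns collapsed to one name
println(MultiAlias(g, Nil).outputNames)      // all three column names survive
```

The special handling mentioned above amounts to routing generators to the MultiAlias shape instead of the lossy single Alias.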

@dilipbiswal
Contributor Author

@cloud-fan @yhuai Can you please take a look? Thanks in advance.

@dilipbiswal dilipbiswal changed the title [SPARK-11619] cannot use UDTF in DataFrame.selectExpr [SPARK-11619][SQL] cannot use UDTF in DataFrame.selectExpr Nov 25, 2015
@yhuai
Contributor

yhuai commented Nov 25, 2015

ok to test

@SparkQA

SparkQA commented Nov 26, 2015

Test build #46719 has finished for PR 9981 at commit 9620468.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 26, 2015

Test build #46737 has finished for PR 9981 at commit 9bee82b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

@cloud-fan Can you please help trigger a retest? It looks like an unrelated failure.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Nov 26, 2015

Test build #46747 has finished for PR 9981 at commit 9bee82b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

@cloud-fan It failed again in the same way. Looks like it's some intermittent issue, and it's quite fond of me :-). Wenchen, is there a way I can issue a retest request myself so that I bug you less? :-)

@cloud-fan
Contributor

You can also try commenting "retest this please".

@dilipbiswal
Contributor Author

@cloud-fan Actually, you had advised me to do that a while back; however, it does not seem to work for me...
Thanks for triggering the test. Do you have any idea about this intermittent failure? It seems to happen fairly frequently these days. Is it tied to the node that gets picked to run the test?

@cloud-fan
Contributor

Yeah, there are some flaky tests; we are working on it.

@SparkQA

SparkQA commented Nov 26, 2015

Test build #46762 has finished for PR 9981 at commit 9bee82b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

ping @cloud-fan @yhuai

@cloud-fan
Contributor

Thanks for working on it!

Actually, I'm not sure this is the right approach... We still handle explode and json_tuple specially; what's worse, we hardcode their names, which makes the code hard to maintain (it needs updating whenever we rename a UDTF or add a new one).
Since we have workarounds for this problem, I think we don't need to hurry; we can spend more time thinking of a better approach.

@dilipbiswal
Contributor Author

@cloud-fan Sure, I will close this PR for now. If you think of a better approach and need my help in any way, please let me know. Thanks for your feedback.

@dilipbiswal
Contributor Author

@cloud-fan Hi Wenchen, I was thinking about this and want to run an idea by you. Is it OK if we add the logic to inject the MultiAlias in our analyzer? As an experiment, I put the following code in ResolveGenerate and it seems to work; I am also running the test suites to make sure.

val newProj: Seq[NamedExpression] = projectList.map {
  case a @ Alias(g: Generator, name) if g.resolved &&
      g.elementTypes.size > 1 && name.equals(a.child.prettyString) =>
    MultiAlias(g, Nil)
  case e => e
}

One thing: name.equals(a.child.prettyString) does not really tell us whether the alias is user-specified or system-generated. We could flag an alias to differentiate if required. I just wanted to know what you think of this approach. Thanks in advance.

@cloud-fan
Contributor

This still looks hacky to me. How about this:
add an extra case in Column.named to delay the aliasing of UnresolvedFunction, like
case func: UnresolvedFunction => UnresolvedAlias(func); then in Analyzer.ResolveAliases, we can handle this UnresolvedAlias once the function has been resolved.
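The proposed flow can be sketched with pure-Scala toy stand-ins (illustrative names only, not Spark's actual classes): Column.named stops eagerly aliasing unresolved functions and defers the decision, and a ResolveAliases-style step later picks MultiAlias for generators and a plain Alias for everything else.

```scala
// Toy stand-ins for the delayed-aliasing idea (not Spark's real classes).
sealed trait Expr
case class UnresolvedFunction(name: String) extends Expr
case class UnresolvedAlias(child: Expr) extends Expr
case class Alias(child: Expr, name: String) extends Expr
case class MultiAlias(child: Expr, names: Seq[String]) extends Expr
case class ResolvedGenerator(name: String, numOutputs: Int) extends Expr

// Stand-in for the extra case in Column.named: delay the aliasing of
// unresolved functions instead of naming them eagerly.
def named(e: Expr): Expr = e match {
  case f: UnresolvedFunction => UnresolvedAlias(f)
  case other                 => Alias(other, "col")
}

// Stand-in for Analyzer.ResolveAliases: by this point the function can be
// resolved, so a generator gets a MultiAlias and a scalar a plain Alias.
def resolveAliases(e: Expr, resolve: String => Expr): Expr = e match {
  case UnresolvedAlias(f: UnresolvedFunction) =>
    resolve(f.name) match {
      case g: ResolvedGenerator => MultiAlias(g, Nil)
      case scalar               => Alias(scalar, f.name)
    }
  case other => other
}

val deferred = named(UnresolvedFunction("explode"))
println(resolveAliases(deferred, _ => ResolvedGenerator("explode", 2)))
```

The point of the deferral is that only after resolution do we know whether the function is a generator, which is exactly the information a correct aliasing decision needs.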

@dilipbiswal
Contributor Author

@cloud-fan Thank you very much. Actually, ResolveAliases already has the code to inject MultiAlias for resolved Generators, so we are good (thanks!!)

One question:
currently in Column.named, an UnresolvedFunction falls through to the last case, which uses prettyString to alias the column:
Alias(expr, expr.prettyString)()

In our new approach, at ResolveAliases time we fall into the other case and use a generated alias like c0. Is this OK? Or should we try to preserve the pretty-alias semantics by introducing new cases for different function expressions? Please let me know.

@cloud-fan
Contributor

Yeah, we should introduce a new case and add UnresolvedAlias only for UnresolvedFunction.

@dilipbiswal dilipbiswal reopened this Dec 4, 2015
@dilipbiswal
Contributor Author

@cloud-fan Hi Wenchen, I changed it as per your suggestion. Please take a look.

@cloud-fan
Contributor

ok to test

case jt: JsonTuple => MultiAlias(jt, Nil)

case func: UnresolvedFunction => UnresolvedAlias(func)
Contributor

I missed one thing: this will change the alias of functions which are not generators. Maybe we should add an optional name in UnresolvedAlias to use as the default alias, instead of c0, c1, etc.

Contributor Author

I was thinking exactly the same thing (when I posted my last question) @cloud-fan :-). Thank you. I will make the change.
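The optional-name idea can be sketched in isolation (hypothetical shapes mirroring the discussion, not the actual patch): the unresolved alias carries an optional default name, and resolution falls back to a generated c0-style name only when none was carried.

```scala
// Toy sketch: carry an optional default alias through resolution so that
// non-generator functions keep their "pretty" name instead of c0, c1, ...
// (illustrative names, not Spark's actual code).
case class UnresolvedAlias(childName: String, aliasName: Option[String] = None)

// Prefer the carried name; fall back to the generated one (e.g. "c0").
def finalName(u: UnresolvedAlias, generatedName: String): String =
  u.aliasName.getOrElse(generatedName)

println(finalName(UnresolvedAlias("f", Some("json_tuple(jstring, 'f1')")), "c0"))
println(finalName(UnresolvedAlias("f"), "c0"))
```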

@SparkQA

SparkQA commented Dec 4, 2015

Test build #47192 has finished for PR 9981 at commit af3963c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 5, 2015

Test build #47215 has finished for PR 9981 at commit 0e6d6d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental): case class UnresolvedAlias(child: Expression, aliasName: Option[String] = None)

@dilipbiswal
Contributor Author

@cloud-fan Hi Wenchen, can you please take a look and let me know what you think? Thanks.

@@ -149,12 +149,12 @@ class Analyzer(
     exprs.zipWithIndex.map {
       case (expr, i) =>
         expr transform {
-          case u @ UnresolvedAlias(child) => child match {
+          case u @ UnresolvedAlias(child, _) => child match {
Contributor

nit: you can match the aliasName here, like case u @ UnresolvedAlias(child, optionalAliasName)

@cloud-fan
Contributor

overall LGTM, some minor comments

@dilipbiswal
Contributor Author

@cloud-fan Thanks a lot. Addressed your comments.

@SparkQA

SparkQA commented Dec 5, 2015

Test build #47219 has finished for PR 9981 at commit 7674658.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental): case class UnresolvedAlias(child: Expression, aliasName: Option[String] = None)

@dilipbiswal
Contributor Author

@yhuai Hi Yin, Wenchen has looked over the changes. Can you please let me know what you think?

@@ -73,6 +73,10 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
checkAnswer(
df.select($"key", functions.json_tuple($"jstring", "f1", "f2", "f3", "f4", "f5")),
expected)

checkAnswer(
df.selectExpr("key", "json_tuple(jstring, 'f1', 'f2', 'f3', 'f4', 'f5')"),
Contributor

Just want to double check, after selectExpr, columns are key, f1, f2, f3, f4, f5, right?

Contributor Author

@yhuai Hi Yin, actually the f1-f5 columns are being reported as c0-c5. I am debugging now..

Contributor

Maybe it is good to also check the column names of functions.json_tuple($"jstring", "f1", "f2", "f3", "f4", "f5")?

Contributor Author

@yhuai Hi Yin, for the functions.json_tuple case the column names are also c0-c5. So did it always work like this?

Contributor Author

@yhuai Hello Yin, I just debugged the code a little and am trying to understand it. In the json_tuple function in jsonExpression.scala, we compute the elementTypes as follows:

override def elementTypes: Seq[(DataType, Boolean, String)] = fieldExpressions.zipWithIndex.map {
  case (_, idx) => (StringType, true, s"c$idx")
}

This name is then used when building the generator output in makeGeneratorOutput() in the Analyzer. Do you think we should change this?
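The naming pattern quoted above can be exercised in isolation: zipping the field expressions with their index yields c0, c1, ... regardless of the field names actually passed in. A minimal sketch (the field list here is made up):

```scala
// Illustrative only: the s"c$idx" default-naming pattern, applied to a
// stand-in list of field names. The literal field names are ignored.
val fields = Seq("f1", "f2", "f3")
val defaultNames = fields.zipWithIndex.map { case (_, idx) => s"c$idx" }
println(defaultNames)  // List(c0, c1, c2)
```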

Contributor

I checked with Hive: for select json_tuple('{"a":1}', 'a');, the output column is c0, which is different from when the UDTF is used in a lateral view.

Contributor Author

@cloud-fan Thanks a lot for trying it in Hive. Wenchen, I searched for "lateral view" in the Spark code and didn't find a test case; I wanted to debug it to learn more about it. Also, I made a change in the elementTypes computation like the following:

override def elementTypes: Seq[(DataType, Boolean, String)] = fieldExpressions.zipWithIndex.map {
  case (l @ Literal(value, _), idx) if value.toString() != "null" =>
    (StringType, true, value.toString())
  case (_, idx) => (StringType, true, s"c$idx")
}

I can now see the alias names correctly. I am not sure this is the right change, however. Do you have any thoughts? Thank you.

Contributor

For lateral view, I think column aliases are required, right? I am fine with using ci as the column name, as long as the json_tuple function and json_tuple in selectExpr have consistent behavior.

@dilipbiswal
Contributor Author

@yhuai Hi Yin, given we have consistent column naming (ci) in both the selectExpr and function cases, do the changes look OK to you?
If we want to change the column naming to use f1..fn, I can change it in another PR. Please let me know.

@yhuai
Contributor

yhuai commented Dec 17, 2015

test this please

@yhuai
Contributor

yhuai commented Dec 17, 2015

Let's trigger a new test run since the last one was several days ago.

@yhuai
Contributor

yhuai commented Dec 17, 2015

Looks good to me.

@yhuai
Contributor

yhuai commented Dec 17, 2015

test this please

@SparkQA

SparkQA commented Dec 17, 2015

Test build #47871 has finished for PR 9981 at commit 7674658.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental): case class UnresolvedAlias(child: Expression, aliasName: Option[String] = None)

@dilipbiswal
Contributor Author

@yhuai Hi Yin, the failure does not seem related to the change. Can we please retest?

@gatorsmile
Member

retest this please

@dilipbiswal
Contributor Author

@gatorsmile Thanks. Unfortunately this run also failed very early in the cycle :(

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Dec 17, 2015

Test build #47896 has finished for PR 9981 at commit 7674658.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental): case class UnresolvedAlias(child: Expression, aliasName: Option[String] = None)

@cloud-fan
Contributor

LGTM

@yhuai
Contributor

yhuai commented Dec 18, 2015

Merging to master

@asfgit asfgit closed this in ee444fe Dec 18, 2015
@dilipbiswal
Contributor Author

Many thanks @yhuai @cloud-fan
