-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-11619][SQL] cannot use UDTF in DataFrame.selectExpr #9981
Conversation
@cloud-fan @yhuai Can you please take a look ? Thanks in advance. |
ok to test |
Test build #46719 has finished for PR 9981 at commit
|
Test build #46737 has finished for PR 9981 at commit
|
@cloud-fan can you please help trigger a retest. Looks like an unrelated failure |
retest this please |
Test build #46747 has finished for PR 9981 at commit
|
@cloud-fan failed again in the same way. Looks like its some intermittent issue and its more friendly to me :-). Wenchen, is there a way i can issue a retest request so that i keep bugging you less :-) |
you can also try comment "retest this please" |
@cloud-fan Actually you had advised me to do that a while back. However it does not seem to work... |
yea, there are some flaky tests, we are working on it. |
Test build #46762 has finished for PR 9981 at commit
|
ping @cloud-fan @yhuai |
Thanks for working on it! Actually I'm not sure if this is the right appraoch... We still hande |
@cloud-fan Sure.. I will close this PR for now. If you think of a better approach and need my help in anyway pl let me know. Thanks for your feedback. |
@cloud-fan Hi Wenchen, I was thinking about this and want to run an idea by you. Is it ok if we add the logic to inject the MultiAlias in our analyzer. As an experiment, i put the following code in ResolveGenerate() and it seems to work and also i am running the test suites to make sure. val newProj : Seq[NamedExpression] = projectList.map { expr => One thing is name.equals(a.child.prettyString) does not really make sure if its an user specified alias or a system generated one. We can flag an alias to differentiate if required. Just wanted to know what you thought about this approach. Thanks in advance. |
This still looks hacky to me. How about this: |
@cloud-fan Thank you very much. Actually the the ResolveAliases already has the code to inject MultiAlias for resolved Generators. So we are good (thanks !!) One question .. In our new approach, at ResolveAliases time, we fall into the other case and use a generated alias like c0. Is this ok ? Or should we try to preserve the pretty alias semantics by introducing new cases for diffferent function expressions ? Please let me know. |
yea, we should introduce a new case and add |
@cloud-fan Hi Wenchen, changed as per your suggestion. Please take a look. |
ok to test |
case jt: JsonTuple => MultiAlias(jt, Nil) | ||
|
||
case func: UnresolvedFunction => UnresolvedAlias(func) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed one thing, this will change the alias of functions which are not generators. Maybe we should add an optionName in UnresolvedAlias
as the default aliasing name instead of c0
, c1
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was exactly thinking the same thing (when i had posted my last question) @cloud-fan :-). Thank you. I will make the change.
Test build #47192 has finished for PR 9981 at commit
|
af3963c
to
0e6d6d9
Compare
Test build #47215 has finished for PR 9981 at commit
|
@cloud-fan Hi Wenchen, can you please take a look and let me know what you think ? Thanks. |
@@ -149,12 +149,12 @@ class Analyzer( | |||
exprs.zipWithIndex.map { | |||
case (expr, i) => | |||
expr transform { | |||
case u @ UnresolvedAlias(child) => child match { | |||
case u @ UnresolvedAlias(child, _) => child match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you can match the aliasName
here, like case u @ UnresolvedAlias(child, optionalAliasName)
overall LGTM, some minor comments |
@cloud-fan thanks a lot. addressed your comments. |
Test build #47219 has finished for PR 9981 at commit
|
@yhuai Hi Yin, Wenchen has looked over the changes. Can you please let me know what you think ? |
@@ -73,6 +73,10 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext { | |||
checkAnswer( | |||
df.select($"key", functions.json_tuple($"jstring", "f1", "f2", "f3", "f4", "f5")), | |||
expected) | |||
|
|||
checkAnswer( | |||
df.selectExpr("key", "json_tuple(jstring, 'f1', 'f2', 'f3', 'f4', 'f5')"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to double check, after selectExpr
, columns are key, f1, f2, f3, f4, f5
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yhuai Hi Yin, actually the f1-f5 columns are being reported as c0-c5. I am debugging now..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it is good to also check the column name of functions.json_tuple($"jstring", "f1", "f2", "f3", "f4", "f5")
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yhuai Hi Yin, for functions.json_tuple case i.e also the column names are c0-c5. So did it always work like this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yhuai Hello Yin, just debugged the code a little bit and trying hard to understand. In the json_tuple function in jsonExpression.scala, we compute the elementTypes as follows
override def elementTypes: Seq[(DataType, Boolean, String)] = fieldExpressions.zipWithIndex.map {
case (_, idx) => (StringType, true, s"c$idx")
}
This name is then used while making the generator output in makeGeneratorOutput() in Analyzer. Do you think we should change this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked with hive, select json_tuple('{"a":1}', 'a');
, the output column is c0
, which is different from when the UDTF is in lateral view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Thanks a lot for trying in hive. Wenchen, i searched for "lateral view" in spark code and didn't find a test case. I wanted to debug to study more about it. Also wenchen, i made a change in elementTypes computation like following
override def elementTypes: Seq[(DataType, Boolean, String)] = fieldExpressions.zipWithIndex.map {
case(l @ Literal(value, ), idx) if value.toString() != "null" =>
(StringType, true, value.toString())
case (, idx) => (StringType, true, s"c$idx")
}
I can now see the alias names correctly. I am not sure if this is the right change however. Do you have any thoughts ? Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For lateral view, I think column aliases are required, right? I am fine if we use ci
as the column table if using json_tuple
function and using json_tuple
in selectExpr have consistent behavior.
@yhuai Hi Yin, given we have a consistent column naming (ci) in both selectExpr and function case, do the changes look ok to you ? |
test this please |
Let's trigger a new test run since the last one was several days ago. |
Looks good to me. |
test this please |
Test build #47871 has finished for PR 9981 at commit
|
@yhuai Hi Yin, failure does not seem related to the change. Can we please retest ? |
retest this please |
@gatorsmile thanks.. unfortunately this run also failed very early on in the cycle :(:- |
retest this please |
Test build #47896 has finished for PR 9981 at commit
|
LGTM |
Merging to master |
Many thanks @yhuai @cloud-fan |
Description of the problem from @cloud-fan
Actually this line: https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L689
When we use
selectExpr
, we pass inUnresolvedFunction
toDataFrame.select
and fall in the last case. A workaround is to do special handling for UDTF like we did forexplode
(andjson_tuple
in 1.6), wrap it withMultiAlias
.Another workaround is using
expr
, for example,df.select(expr("explode(a)").as(Nil))
, I thinkselectExpr
is no longer needed after we have theexpr
function....