-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem with concat #19728
Conversation
Test build #83745 has finished for PR 19728 at commit
|
Test build #83746 has finished for PR 19728 at commit
|
Test build #83747 has finished for PR 19728 at commit
|
ctx.addMutableState("UTF8String[]", args, "") | ||
|
||
val inputs = evals.zipWithIndex.map { case (eval, index) => | ||
if (eval.isNull != "true") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If eval.isNull
is not a pre-evaluated constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If eval.isNull
is not a pre-evaluated constant, I expect that the following code at lines 73-74 will be generated. Only when we ensure it is true
, we can avoid assigning a value (use default null
value).
ev.copy(evals.map(_.code).mkString("\n") + s""" | ||
UTF8String ${ev.value} = UTF8String.concatWs($inputs); | ||
val inputs = evals.tail.zipWithIndex.map { case (eval, index) => | ||
if (eval.isNull != "true") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue as https://github.com/apache/spark/pull/19728/files#r150441287.
@@ -163,13 +190,18 @@ case class ConcatWs(children: Seq[Expression]) | |||
} | |||
}.unzip | |||
|
|||
ev.copy(evals.map(_.code).mkString("\n") + | |||
// ev.copy(evals.map(_.code).mkString("\n") + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forget to remove commented code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks :)
if (eval.isNull != "true") { | ||
s""" | ||
${eval.code} | ||
$args[$index] = ${eval.value}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If eval.isNull
is evaluated to null
at runtime, eval.value
is useless. We should assign null
in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch
if (eval.isNull != "true") { | ||
s""" | ||
${eval.code} | ||
$args[$index] = ${eval.value}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (${ev.value} == null) { | ||
${ev.isNull} = true; | ||
val argNums = evals.length | ||
val args = ctx.freshName("argLen") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
argLen? I think it's not length of args?
${eval.code} | ||
if (!${eval.isNull}) { | ||
$args[$index] = ${eval.value}; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are better to assign null too if eval.isNull == null
, because args
is global variable and we need to override all values for previous row.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, while args
is global, UTF8String[]
is allocated before executing these assignments. Thus, we can ensure all of elements in args
are null
.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is guaranteed before the first row processed. After we process rows, args
are updated for each row. E.g., args[0]
can be updated and assigned with a string for row 0. In next row, if eval.isNull
is evaluated to true, we don't override back to null, so arg[0]
is still the string value for row 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the next row is processed, $args = new UTF8String[$argNums]
is executed again in apply()
method. In other words, I think that current implementation does not reuse UTF8String[]
between different rows.
Do you mean UTF8String[]
is reused between different rows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. I didn't notice that args
is not globally initialized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for taking your time for my code.
It is not easy to imagine the generated code. I may have to put the generated code.
Test build #83770 has finished for PR 19728 at commit
|
Test build #83779 has finished for PR 19728 at commit
|
${ev.isNull} = true; | ||
val argNums = evals.length | ||
val args = ctx.freshName("args") | ||
ctx.addMutableState("UTF8String[]", args, "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can avoid defining this as a global variable, what do you think?
UTF8String ${ev.value} = UTF8String.concatWs($inputs); | ||
val argNums = evals.length | ||
val args = ctx.freshName("argLen") | ||
ctx.addMutableState("UTF8String[]", args, "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this also can be kept method local, IMHO
val inputs = evals.tail.zipWithIndex.map { case (eval, index) => | ||
if (eval.isNull != "true") { | ||
s""" | ||
${eval.code} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: in this and the next 4 lines an indentation space is missing if I am not wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch
|
||
ev.copy(evals.map(_.code).mkString("\n") + s""" | ||
UTF8String ${ev.value} = UTF8String.concatWs($inputs); | ||
val argNums = evals.length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be evals.length -1
, or even better, I'd suggest to declare two variables: sep
and strings
(or the name you prefer) to hold head
and tail
. This would help readability too IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry that I cannot understand your suggestion.
argNums
is referred as $argNums
to allocate an array. Are you suggesting to allocate an array as new UTF8String[$argNums + 1]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, I am saying the opposite. Let's consider an example, maybe I better can explain what I mean in this way. Let's assume that we are running concat_ws(',', 'a', 'b')
. Then, evals
would contain 3 elements. So here your argNums
would be 3. But here you would be using only $args[0]
and $args[1]
, because the first element (','
, the separator) is handled differently.
Thus, I would suggest to have something like:
val sep = evals.head
val strings = evals.tail
val argNums = strings.length // note that this is evals.length -1
...
I think that in this way the code would be much clearer (other than fixing this little bug).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your kindly explanation. I totally agree with you.
Test build #83787 has finished for PR 19728 at commit
|
Test build #83791 has finished for PR 19728 at commit
|
Test build #83795 has finished for PR 19728 at commit
|
UTF8String ${ev.value} = UTF8String.concat($inputs); | ||
if (${ev.value} == null) { | ||
${ev.isNull} = true; | ||
val argNums = evals.length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: numArgs
.
val args = ctx.freshName("args") | ||
|
||
val inputs = evals.zipWithIndex.map { case (eval, index) => | ||
if (eval.isNull != "true") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please do not mix in optimizations with bug fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I will remove this.
Test build #83944 has finished for PR 19728 at commit
|
if (${ev.value} == null) { | ||
${ev.isNull} = true; | ||
} | ||
val numArgs = evals.length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can inline it
UTF8String ${ev.value} = UTF8String.concatWs($inputs); | ||
val separator = evals.head | ||
val strings = evals.tail | ||
val argNums = strings.length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: numArgs
ev.copy(s""" | ||
UTF8String[] $args = new UTF8String[$argNums]; | ||
$codes | ||
UTF8String ${ev.value} = UTF8String.concatWs(${separator.value}, $args); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
val code = if (ctx.INPUT_ROW != null && ctx.currentVars == null)
...
ev.copy(code = s"""
UTF8String[] $args = new UTF8String[$argNums];
${separator.code}
...
""")
can you split it into 3 PRs? The approaches for these 3 expressions are quite different. |
Test build #83974 has finished for PR 19728 at commit
|
Test build #83983 has finished for PR 19728 at commit
|
thanks, merging to master/2.2! |
## What changes were proposed in this pull request? This PR changes `concat` code generation to place generated code for expression for arguments into separated methods if these size could be large. This PR resolved the case of `concat` with a lot of argument ## How was this patch tested? Added new test cases into `StringExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19728 from kiszk/SPARK-22498. (cherry picked from commit d54bfec) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request? This PR changes `concat` code generation to place generated code for expression for arguments into separated methods if these size could be large. This PR resolved the case of `concat` with a lot of argument ## How was this patch tested? Added new test cases into `StringExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes apache#19728 from kiszk/SPARK-22498. (cherry picked from commit d54bfec) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
This PR changes
concat
code generation to place generated code for expression for arguments into separated methods if these size could be large.This PR resolved the case of
concat
with a lot of argumentHow was this patch tested?
Added new test cases into
StringExpressionsSuite