Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem with concat #19728

Closed
wants to merge 10 commits into from

Conversation

kiszk
Copy link
Member

@kiszk kiszk commented Nov 12, 2017

What changes were proposed in this pull request?

This PR changes concat code generation to place generated code for expression for arguments into separated methods if these size could be large.
This PR resolved the case of concat with a lot of argument

How was this patch tested?

Added new test cases into StringExpressionsSuite

@SparkQA
Copy link

SparkQA commented Nov 12, 2017

Test build #83745 has finished for PR 19728 at commit 16ddb2f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 12, 2017

Test build #83746 has finished for PR 19728 at commit 756254b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 12, 2017

Test build #83747 has finished for PR 19728 at commit bc97d41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ctx.addMutableState("UTF8String[]", args, "")

val inputs = evals.zipWithIndex.map { case (eval, index) =>
if (eval.isNull != "true") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If eval.isNull is not a pre-evaluated constant?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If eval.isNull is not a pre-evaluated constant, I expect that the following code at lines 73-74 will be generated. Only when we ensure it is true, we can avoid assigning a value (use default null value).

ev.copy(evals.map(_.code).mkString("\n") + s"""
UTF8String ${ev.value} = UTF8String.concatWs($inputs);
val inputs = evals.tail.zipWithIndex.map { case (eval, index) =>
if (eval.isNull != "true") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -163,13 +190,18 @@ case class ConcatWs(children: Seq[Expression])
}
}.unzip

ev.copy(evals.map(_.code).mkString("\n") +
// ev.copy(evals.map(_.code).mkString("\n") +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forget to remove commented code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks :)

if (eval.isNull != "true") {
s"""
${eval.code}
$args[$index] = ${eval.value};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If eval.isNull is evaluated to null at runtime, eval.value is useless. We should assign null in that case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

if (eval.isNull != "true") {
s"""
${eval.code}
$args[$index] = ${eval.value};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (${ev.value} == null) {
${ev.isNull} = true;
val argNums = evals.length
val args = ctx.freshName("argLen")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

argLen? I think it's not length of args?

${eval.code}
if (!${eval.isNull}) {
$args[$index] = ${eval.value};
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are better to assign null too if eval.isNull == null, because args is global variable and we need to override all values for previous row.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, while args is global, UTF8String[] is allocated before executing these assignments. Thus, we can ensure all of elements in args are null.
What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is guaranteed before the first row processed. After we process rows, args are updated for each row. E.g., args[0] can be updated and assigned with a string for row 0. In next row, if eval.isNull is evaluated to true, we don't override back to null, so arg[0] is still the string value for row 0.

Copy link
Member Author

@kiszk kiszk Nov 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the next row is processed, $args = new UTF8String[$argNums] is executed again in apply() method. In other words, I think that current implementation does not reuse UTF8String[] between different rows.
Do you mean UTF8String[] is reused between different rows?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. I didn't notice that args is not globally initialized.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking your time for my code.
It is not easy to imagine the generated code. I may have to put the generated code.

@SparkQA
Copy link

SparkQA commented Nov 13, 2017

Test build #83770 has finished for PR 19728 at commit 0ed8d16.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk kiszk changed the title [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem with concat and concat_ws [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem with concat, concat_ws, and elt Nov 13, 2017
@SparkQA
Copy link

SparkQA commented Nov 13, 2017

Test build #83779 has finished for PR 19728 at commit 6fdb451.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

${ev.isNull} = true;
val argNums = evals.length
val args = ctx.freshName("args")
ctx.addMutableState("UTF8String[]", args, "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can avoid defining this as a global variable, what do you think?

UTF8String ${ev.value} = UTF8String.concatWs($inputs);
val argNums = evals.length
val args = ctx.freshName("argLen")
ctx.addMutableState("UTF8String[]", args, "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also can be kept method local, IMHO

val inputs = evals.tail.zipWithIndex.map { case (eval, index) =>
if (eval.isNull != "true") {
s"""
${eval.code}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: in this and the next 4 lines an indentation space is missing if I am not wrong.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch


ev.copy(evals.map(_.code).mkString("\n") + s"""
UTF8String ${ev.value} = UTF8String.concatWs($inputs);
val argNums = evals.length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be evals.length -1, or even better, I'd suggest to declare two variables: sep and strings (or the name you prefer) to hold head and tail. This would help readability too IMHO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry that I cannot understand your suggestion.
argNums is referred as $argNums to allocate an array. Are you suggesting to allocate an array as new UTF8String[$argNums + 1]?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, I am saying the opposite. Let's consider an example, maybe I better can explain what I mean in this way. Let's assume that we are running concat_ws(',', 'a', 'b'). Then, evals would contain 3 elements. So here your argNums would be 3. But here you would be using only $args[0] and $args[1], because the first element (',', the separator) is handled differently.
Thus, I would suggest to have something like:

val sep = evals.head
val strings = evals.tail
val argNums = strings.length // note that this is evals.length -1
...

I think that in this way the code would be much clearer (other than fixing this little bug).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your kindly explanation. I totally agree with you.

@SparkQA
Copy link

SparkQA commented Nov 13, 2017

Test build #83787 has finished for PR 19728 at commit e6b73ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 13, 2017

Test build #83791 has finished for PR 19728 at commit cd20410.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 13, 2017

Test build #83795 has finished for PR 19728 at commit e668b98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

UTF8String ${ev.value} = UTF8String.concat($inputs);
if (${ev.value} == null) {
${ev.isNull} = true;
val argNums = evals.length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: numArgs.

val args = ctx.freshName("args")

val inputs = evals.zipWithIndex.map { case (eval, index) =>
if (eval.isNull != "true") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please do not mix in optimizations with bug fix

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I will remove this.

@SparkQA
Copy link

SparkQA commented Nov 16, 2017

Test build #83944 has finished for PR 19728 at commit 555575e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (${ev.value} == null) {
${ev.isNull} = true;
}
val numArgs = evals.length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can inline it

UTF8String ${ev.value} = UTF8String.concatWs($inputs);
val separator = evals.head
val strings = evals.tail
val argNums = strings.length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: numArgs

ev.copy(s"""
UTF8String[] $args = new UTF8String[$argNums];
$codes
UTF8String ${ev.value} = UTF8String.concatWs(${separator.value}, $args);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

val code = if (ctx.INPUT_ROW != null && ctx.currentVars == null)
...
ev.copy(code = s"""
  UTF8String[] $args = new UTF8String[$argNums];
  ${separator.code}
  ...
""")

@cloud-fan
Copy link
Contributor

cloud-fan commented Nov 17, 2017

can you split it into 3 PRs? The approaches for these 3 expressions are quite different.

@SparkQA
Copy link

SparkQA commented Nov 17, 2017

Test build #83974 has finished for PR 19728 at commit 22019b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk kiszk changed the title [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem with concat, concat_ws, and elt [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem with concat Nov 18, 2017
@kiszk
Copy link
Member Author

kiszk commented Nov 18, 2017

Sure done.
#19777 for concat_ws. #19778 for elt

@SparkQA
Copy link

SparkQA commented Nov 18, 2017

Test build #83983 has finished for PR 19728 at commit b9ca4ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master/2.2!

asfgit pushed a commit that referenced this pull request Nov 18, 2017
## What changes were proposed in this pull request?

This PR changes `concat` code generation to place generated code for expression for arguments into separated methods if these size could be large.
This PR resolved the case of `concat` with a lot of argument

## How was this patch tested?

Added new test cases into `StringExpressionsSuite`

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #19728 from kiszk/SPARK-22498.

(cherry picked from commit d54bfec)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in d54bfec Nov 18, 2017
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
## What changes were proposed in this pull request?

This PR changes `concat` code generation to place generated code for expression for arguments into separated methods if these size could be large.
This PR resolved the case of `concat` with a lot of argument

## How was this patch tested?

Added new test cases into `StringExpressionsSuite`

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes apache#19728 from kiszk/SPARK-22498.

(cherry picked from commit d54bfec)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants