[SPARK-22543][SQL] fix java 64kb compile error for deeply nested expressions #19767

cloud-fan · 2017-11-16T17:42:51Z

What changes were proposed in this pull request?

A frequently reported Spark issue is the Java 64KB compile error. It occurs when Spark generates a method whose bytecode exceeds the JVM's 64KB per-method limit, and it usually has one of three causes:

1. a deep expression tree, e.g. a very complex filter condition
2. many individual expressions, e.g. expressions can have many children, and operators can have many expressions
3. a deep query plan tree (with whole-stage codegen)

This PR focuses on the first cause, a deep expression tree. Several patches (#15620, #18972, #18641) already try to fix this issue, and some of them have been merged. However, fixing it expression by expression is an endless job, because every non-leaf expression has this issue.

This PR proposes to fix this issue in Expression.genCode, to make sure the code for a single expression won't grow too big.
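
The idea can be sketched outside Spark as follows. This is a minimal illustration, not the code in this PR: `SplitSketch`, `ExprCode`, `wrapIfLarge`, and the generated names `expr_N` / `value_N` are hypothetical, and null handling is left out. It only shows the shape of the technique: when the generated source for one expression is longer than a threshold, move it into its own method and replace it at the call site with a single call.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical miniature of a string-based code generator; not Spark's CodegenContext. */
class SplitSketch {
    // Character-count threshold, the same rough estimate discussed in this PR.
    static final int THRESHOLD = 1024;

    // Extra methods that would be emitted into the generated class.
    final List<String> functions = new ArrayList<>();
    private int counter = 0;

    /** Generated statements plus the name of the variable that holds the result. */
    static final class ExprCode {
        final String code;
        final String value;

        ExprCode(String code, String value) {
            this.code = code;
            this.value = value;
        }
    }

    /** Returns ev unchanged if it is small; otherwise moves its code into a new method. */
    ExprCode wrapIfLarge(String javaType, ExprCode ev) {
        if (ev.code.trim().length() <= THRESHOLD) {
            return ev; // small enough to stay inline at the call site
        }
        String funcName = "expr_" + counter;
        String newValue = "value_" + counter;
        counter++;
        functions.add(
            "private " + javaType + " " + funcName + "(InternalRow i) {\n"
                + "  " + ev.code.trim() + "\n"
                + "  return " + ev.value + ";\n"
                + "}");
        // The caller now only sees one short statement, regardless of how deep the tree is.
        return new ExprCode(javaType + " " + newValue + " = " + funcName + "(i);", newValue);
    }
}
```

Because the check runs once per expression, a deeply nested tree ends up as a chain of small methods calling each other rather than one very large method.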

According to @maropu's benchmark, no regression was found with TPCDS (thanks @maropu!): https://docs.google.com/spreadsheets/d/1K3_7lX05-ZgxDXi9X_GleNnDjcnJIfoSlSCDZcL4gdg/edit?usp=sharing

How was this patch tested?

existing test

cloud-fan · 2017-11-16T17:43:42Z

cc @kiszk @rednaxelafx @maropu @gatorsmile

SparkQA · 2017-11-16T20:34:40Z

Test build #83943 has finished for PR 19767 at commit aaa5b6f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2017-11-17T02:22:44Z

I like this approach. Does this PR cover all the issues that #18641 describes, or is it orthogonal? Anyway, I checked the TPCDS performance with the current PR: https://docs.google.com/spreadsheets/d/1K3_7lX05-ZgxDXi9X_GleNnDjcnJIfoSlSCDZcL4gdg/edit?usp=sharing

viirya · 2017-11-17T02:47:42Z

This seems like a good approach; it saves us the effort of adding similar code to many expressions.

viirya · 2017-11-17T02:49:58Z

Also, from the numbers provided by @maropu, there appears to be no significant regression.

maropu · 2017-11-17T02:58:59Z

I think we should probably also track how the actual bytecode size statistics (e.g., maxCodeSize) change in TPCDS, so I'll check that later.

viirya · 2017-11-17T03:01:09Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala

+
+        ve.code = s"$funcFullName(${ctx.INPUT_ROW});"
+      }
+
      if (ve.code.nonEmpty) {
        // Add `this` in the comment.
        ve.copy(code = s"${ctx.registerComment(this.toString)}\n" + ve.code.trim)


Should we move the comment into the function?

I don't have a strong preference; it's OK to keep the comment on the caller side.

viirya · 2017-11-17T03:19:54Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala

+             |private void $funcName(InternalRow ${ctx.INPUT_ROW}) {
+             |  ${ve.code.trim}
+             |  $setValue
+             |  $setIsNull


nit: when isNull evaluates to true at runtime, we don't need to set the value.

Yes, that is already handled where setIsNull is defined.

cloud-fan · 2017-11-17T08:22:09Z

@maropu it partially covers #18641. One remaining problem: if each child of an expression generates code shorter than 1024 characters but the expression has many children, we still have an issue. CaseWhen is a little different because it can have at most 20 children (depending on spark.sql.codegen.maxCaseBranches). So we can still prevent compile failures, but the resulting method may be too large to be JIT-compiled.

kiszk · 2017-11-17T08:49:34Z

This looks like a good direction, as long as we do not see performance degradation.

kiszk · 2017-11-17T08:57:46Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala

+          s"""
+             |private void $funcName(InternalRow ${ctx.INPUT_ROW}) {
+             |  ${ve.code.trim}
+             |  $setValue


Can we always pass the value as the method's return value instead of returning void? That would reduce the number of global variables.

good suggestion! Actually, this is a general strategy which can be applied to more places. If there are only boolean global variables, it's very easy to fold them into one array.

Thanks.

IMHO, I am curious whether using one array to compact many boolean variables would really cause no performance degradation.
I am waiting for the updated result in this discussion, because the current benchmark code seems to measure interpreter performance due to the lack of warmup.

Would it be a bad idea to prepare a utility class that stores a (value, isNull) pair for these splitting cases? I feel class fields are a valuable resource.

Creating objects would be a big overhead. I think having a global boolean variable is better.

I originally thought we could avoid the overhead by using a thread-local singleton, but that is a bit weird, so the current code looks good.

The current code can pass the value as a local variable if this method is inlined, or pass it in the return-value register if the method is not inlined.
On the other hand, using an object will always introduce memory accesses to read and write its fields in both the caller and the callee.
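
To make the trade-off above concrete, here is a hand-written stand-in for what a split-out method and its call site might look like. It is only a sketch: `GeneratedSketch`, `expr0`, and `globalIsNull` are hypothetical names, and a boxed `Integer` input stands in for an `InternalRow`. The value travels through the return value, while the null flag, which cannot be returned at the same time without allocating an object, stays in a boolean field.

```java
/** Hand-written stand-in for a generated class; all names are hypothetical. */
class GeneratedSketch {
    // isNull cannot be returned alongside the value, so it is kept in a field.
    private boolean globalIsNull = false;

    // Split-out method: returns the value directly instead of writing a second field.
    private int expr0(Integer input) {
        boolean isNull = (input == null);
        int value = isNull ? -1 : input + 1;
        globalIsNull = isNull;
        return value;
    }

    /** Caller side: the value arrives as a local, isNull is read back from the field. */
    Integer eval(Integer input) {
        int value = expr0(input);
        boolean isNull = globalIsNull;
        return isNull ? null : value;
    }
}
```

Only one field is needed per split-out expression here; a (value, isNull) holder object would instead add field reads and writes on both sides of the call, which is the cost described above.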

SparkQA · 2017-11-17T15:02:02Z

Test build #83963 has finished for PR 19767 at commit 3dab5bd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-11-18T07:38:02Z

LGTM

felixcheung · 2017-11-18T23:55:20Z

should this go to 2.2?

viirya · 2017-11-19T01:38:08Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala

@@ -64,52 +64,22 @@ case class If(predicate: Expression, trueValue: Expression, falseValue: Expressi
    val trueEval = trueValue.genCode(ctx)
    val falseEval = falseValue.genCode(ctx)

-    // place generated code of condition, true value and false value in separate methods if
-    // their code combined is large
-    val combinedLength = condEval.code.length + trueEval.code.length + falseEval.code.length


Actually, I think this removed part is orthogonal to what this PR does. Even if the condition, true, and false expressions are each under the threshold individually, their combination can still exceed it. After this PR, we will pack them into one big method.

This PR deals with oversized generated code in deeply nested expressions, not with an oversized combination of code from the children.

I already explained it in #19767 (comment)

Mostly it's ok because the threshold is just an estimation, not a big deal to make it 2 times larger. CASE WHEN may be a problem and we can evaluate it in #18641 after this PR gets merged.

I see two problems with this. First, even if the code of two children does not exceed the threshold individually, a method that stays under 64KB but goes over 8KB is still big and bad for JIT. Second, we estimate size by code length, and I'm not sure that two children of length 1000 each can never generate a 64KB method in the end.

There is no way to guarantee it with the current string based codegen framework, even without this PR. 1000 length code may also generate 64kb byte code in the end.

1024 is not a good estimate at all; it looks fairly arbitrary to me. So multiplying it by 2 doesn't seem like a big issue. CASE WHEN may have an issue.

BTW if it's really an issue, we can add splitting logic in non-leaf/non-unary nodes. This is much less work than before because: 1. no need to care about unary nodes 2. the splitting logic can be simpler because all children are guaranteed to generate less than 1000 LOC.
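
As a rough illustration of what such splitting logic could look like, the sketch below greedily groups the children's code snippets so that each group stays within a character limit; each group would then become its own method. This is an assumption about one possible shape, not code from this PR or from Spark: `ChunkSketch` and `group` are hypothetical names, and like the threshold discussed above it measures source length, not bytecode size.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: groups child code snippets by cumulative source length. */
class ChunkSketch {
    /**
     * Greedily packs snippets, in order, into groups whose total length is at most limit.
     * A single snippet longer than limit still gets a group of its own.
     */
    static List<List<String>> group(List<String> snippets, int limit) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int size = 0;
        for (String snippet : snippets) {
            // Close the current group if adding this snippet would exceed the limit.
            if (!current.isEmpty() && size + snippet.length() > limit) {
                groups.add(current);
                current = new ArrayList<>();
                size = 0;
            }
            current.add(snippet);
            size += snippet.length();
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

If every child is already guaranteed to be under the per-expression threshold, as argued above, no group can be forced over the limit by a single oversized snippet, which is what keeps this logic simple.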

kiszk · 2017-11-19T08:33:36Z

Would it be better to replace "SPARK-...." in the test cases for #15620, #18972, and #18641 with this JIRA number, or to add this JIRA number to those test cases?

cloud-fan · 2017-11-21T22:16:58Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala

+        val funcName = ctx.freshName(nodeName)
+        val funcFullName = ctx.addNewFunction(funcName,
+          s"""
+             |private $javaType $funcName(InternalRow ${ctx.INPUT_ROW}) {


To continue the discussion in #19767 (comment)

I think more global variables can be eliminated by leveraging the method return value. However, in some cases we use global variables to avoid creating an object on each iteration, so we face a trade-off between GC overhead and global variable overhead. It would be great if Java had something like a C struct and could allocate objects on the method stack...

cc @rednaxelafx @mgaido91 too

thanks for this fix. I like your approach here.

What actually happens depends on the type of the variable. In most cases I think we reinitialize these objects anyway, so the only thing a global variable saves is the pointer, and I am not sure that is a big deal.

cloud-fan · 2017-11-22T00:35:28Z

@felixcheung I'd like to keep it in master only; it has a larger impact than the other related PRs.

SparkQA · 2017-11-22T00:51:49Z

Test build #84085 has finished for PR 19767 at commit d126977.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-11-22T02:32:36Z

will review it tonight.

gatorsmile · 2017-11-22T05:01:25Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala

+      if (ve.code.trim.length > 1024 && ctx.INPUT_ROW != null && ctx.currentVars == null) {
+        val setIsNull = if (ve.isNull != "false" && ve.isNull != "true") {
+          val globalIsNull = ctx.freshName("globalIsNull")
+          ctx.addMutableState("boolean", globalIsNull, s"$globalIsNull = false;")


-> ctx.JAVA_BOOLEAN

gatorsmile · 2017-11-22T06:15:29Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala

@@ -105,6 +105,36 @@ abstract class Expression extends TreeNode[Expression] {
      val isNull = ctx.freshName("isNull")
      val value = ctx.freshName("value")
      val ve = doGenCode(ctx, ExprCode("", isNull, value))
+
+      // TODO: support whole stage codegen too
+      if (ve.code.trim.length > 1024 && ctx.INPUT_ROW != null && ctx.currentVars == null) {


Could you change 1024 to 1? Just to ensure whether all the tests can pass and then change it back to 1024?

I think it won't work because of hitting other limitations, e.g. JVM constant pool.

I'll try something bigger, like 100

gatorsmile · 2017-11-22T06:21:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala

+           """.stripMargin)
+
+        ve.value = newValue
+        ve.code = s"$javaType $newValue = $funcFullName(${ctx.INPUT_ROW});"


Create a separate function for this?

gatorsmile · 2017-11-22T06:29:48Z

LGTM except the above comments.

SparkQA · 2017-11-22T08:46:53Z

Test build #84099 has finished for PR 19767 at commit c875329.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2017-11-22T12:53:46Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala

+      }
+
+      val javaType = ctx.javaType(dataType)
+      val newValue = ctx.freshName("value")


why is this needed? I think we can use eval.value instead of it

ev.value may be a global variable and here we need a local variable.

why do we strictly need a local variable here? Can't we simply assign ev.value to the generated function return value?

then how are we going to change this?
eval.code = s"$javaType $newValue = $funcFullName(${ctx.INPUT_ROW});"

Saving a local variable is nothing and I think we shouldn't complicate the code(check if a variable is global) because of this.

ah, do you mean just do eval.value = s"$funcFullName(${ctx.INPUT_ROW})"? Let me try

I meant:

eval.code = s"${eval.value} = $funcFullName(${ctx.INPUT_ROW});"

this won't work because ${eval.value} is not declared if it's not a global variable. I went with

eval.code = ""
eval.value = s"$funcFullName(${ctx.INPUT_ROW})"

I see, sorry, you're right. Then I think your previous solution is better: with this approach, if eval.value is used multiple times, we recompute the function every time. So your original implementation was fine; sorry for the bad comment.
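
The recomputation concern can be reproduced in plain Java. The sketch below is illustrative only (`RecomputeSketch`, `expr0`, and the call counter are hypothetical): variant A mimics setting the value to the call expression itself, so every textual use becomes another call, while variant B mimics assigning the call result to a fresh local once and reusing it.

```java
/** Hypothetical demo of the two ways to expose a split-out function's result. */
class RecomputeSketch {
    // Counts how often the split-out function runs.
    static int calls = 0;

    static int expr0(int input) {
        calls++;
        return input * 2;
    }

    // Variant A: the "value" is the call expression, substituted at each use site.
    static int inlineCallTwice(int input) {
        return expr0(input) + expr0(input);
    }

    // Variant B: the result is stored in a local variable once, then reused.
    static int localThenReuse(int input) {
        int value = expr0(input);
        return value + value;
    }
}
```

Both variants return the same result, but variant A invokes `expr0` once per use, which is why the thread settles on the local-variable form.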

SparkQA · 2017-11-22T13:38:37Z

Test build #84104 has finished for PR 19767 at commit 86cba3c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-22T14:28:07Z

Test build #84111 has finished for PR 19767 at commit e494844.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-22T16:01:06Z

Test build #84110 has finished for PR 19767 at commit 29188fe.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-22T17:09:57Z

Test build #84112 has finished for PR 19767 at commit c015d33.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-11-22T18:05:21Z

LGTM

gatorsmile · 2017-11-22T18:06:11Z

Thanks! Merged to master.

viirya reviewed Nov 17, 2017

kiszk reviewed Nov 17, 2017

cloud-fan changed the title from "[WIP][SPARK-22543][SQL] fix java 64kb compile error for deeply nested expressions" to "[SPARK-22543][SQL] fix java 64kb compile error for deeply nested expressions" Nov 17, 2017

viirya reviewed Nov 19, 2017

viirya mentioned this pull request Nov 19, 2017

[SPARK-22551][SQL] Prevent possible 64kb compile error for common expression types #19780

Closed

cloud-fan added 2 commits November 21, 2017 22:53

fix java 64kb compile error for deeply nested expressions

e63bb6e

address comment

d126977

cloud-fan force-pushed the codegen branch from 3dab5bd to d126977 Compare November 21, 2017 21:59

cloud-fan commented Nov 21, 2017

gatorsmile reviewed Nov 22, 2017

gatorsmile mentioned this pull request Nov 22, 2017

[SPARK-22520][SQL] Support code generation for large CaseWhen #19752

Closed

address comment

c875329

fix test

86cba3c

mgaido91 reviewed Nov 22, 2017

cloud-fan added 4 commits November 22, 2017 14:46

change back the threshold

29188fe

simplify the code

6bea161

set smaller threshold for testing

e494844

revert back...

c015d33

asfgit closed this in 0605ad7 Nov 22, 2017
