Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-22543][SQL] fix java 64kb compile error for deeply nested expressions #19767

Closed
wants to merge 8 commits into from

Conversation

cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Nov 16, 2017

What changes were proposed in this pull request?

A frequently reported issue of Spark is the Java 64kb compile error. This is because Spark generates a very big method and it's usually caused by 3 reasons:

  1. a deep expression tree, e.g. a very complex filter condition
  2. many individual expressions, e.g. expressions can have many children, operators can have many expressions.
  3. a deep query plan tree (with whole stage codegen)

This PR focuses on 1. There are already several patches(#15620 #18972 #18641) trying to fix this issue and some of them are already merged. However this is an endless job as every non-leaf expression has this issue.

This PR proposes to fix this issue in Expression.genCode, to make sure the code for a single expression won't grow too big.

According to @maropu 's benchmark, no regression is found with TPCDS (thanks @maropu !): https://docs.google.com/spreadsheets/d/1K3_7lX05-ZgxDXi9X_GleNnDjcnJIfoSlSCDZcL4gdg/edit?usp=sharing

How was this patch tested?

existing test

@cloud-fan
Copy link
Contributor Author

@SparkQA
Copy link

SparkQA commented Nov 16, 2017

Test build #83943 has finished for PR 19767 at commit aaa5b6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Nov 17, 2017

I like this approach. Does this pr cover all the issue that #18641 describes? or, orthogonal? Anyway, I checked the TPCDS perf with this current pr: https://docs.google.com/spreadsheets/d/1K3_7lX05-ZgxDXi9X_GleNnDjcnJIfoSlSCDZcL4gdg/edit?usp=sharing

@viirya
Copy link
Member

viirya commented Nov 17, 2017

Seems a good approach that saves us much effort to add similar codes for many expressions.

@viirya
Copy link
Member

viirya commented Nov 17, 2017

Also from the numbers provided by @maropu, looks no significant regression.

@maropu
Copy link
Member

maropu commented Nov 17, 2017

Probably, I feel we better track the changes of actual bytecode size statistics (e.g, maxCodeSize) in tpcds, so I'll check later.


ve.code = s"$funcFullName(${ctx.INPUT_ROW});"
}

if (ve.code.nonEmpty) {
// Add `this` in the comment.
ve.copy(code = s"${ctx.registerComment(this.toString)}\n" + ve.code.trim)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move the comment into the function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong preference, it's ok to have comment at function caller side.

|private void $funcName(InternalRow ${ctx.INPUT_ROW}) {
| ${ve.code.trim}
| $setValue
| $setIsNull
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: when isNull is evaluated to true at runtime, we don't need to set value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea it's already done when define setIsNull

@cloud-fan
Copy link
Contributor Author

@maropu it partially covers #18641 . One problem is that, for an expression, if its child generates code less than 1024, and it has many children, then we still have an issue. CaseWhen is a little different because it at most can have 20 children(depends on spark.sql.codegen.maxCaseBranches). So we can still prevent failures, but may not be able to JIT.

@kiszk
Copy link
Member

kiszk commented Nov 17, 2017

Looks good direction if we do not see performance degradation.

s"""
|private void $funcName(InternalRow ${ctx.INPUT_ROW}) {
| ${ve.code.trim}
| $setValue
Copy link
Member

@kiszk kiszk Nov 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we always pass value as a return value instead of void? It can reduce # of global variables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good suggestion! Actually, this is a general strategy which can be applied to more places. If there are only boolean global variables, it's very easy to fold them into one array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

IMHO, I am curious whether we will see no performance degradation by using one array to compact many boolean variables.
I am waiting for the updated result in this discussion. This is because the current code seems to measure performance of interpreter due to lack of warmup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that a bad idea to prepare some utility classes to store a pair (value, isNull) for this splitting cases? I feel class fields are valuable resources.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

creating objects will be a big overhead. I think having a global boolean variable is better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally thought we could avoid the overhead by using thread-local singleton? But, it's a bit weird, so the current code looks good.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code can pass the value as a local variable if this method is inlined, or can pass the value on register for return value if this method is not inlined.
On the other hand, to use a object will always introduce memory accesses to access fields in caller and callee.

@cloud-fan cloud-fan changed the title [WIP][SPARK-22543][SQL] fix java 64kb compile error for deeply nested expressions [SPARK-22543][SQL] fix java 64kb compile error for deeply nested expressions Nov 17, 2017
@SparkQA
Copy link

SparkQA commented Nov 17, 2017

Test build #83963 has finished for PR 19767 at commit 3dab5bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented Nov 18, 2017

LGTM

@felixcheung
Copy link
Member

should this go to 2.2?

@@ -64,52 +64,22 @@ case class If(predicate: Expression, trueValue: Expression, falseValue: Expressi
val trueEval = trueValue.genCode(ctx)
val falseEval = falseValue.genCode(ctx)

// place generated code of condition, true value and false value in separate methods if
// their code combined is large
val combinedLength = condEval.code.length + trueEval.code.length + falseEval.code.length
Copy link
Member

@viirya viirya Nov 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think this removed part is orthogonal to what this PR did. Even condition, true, and false expressions are not more than threshold individually, their combination is still more than the threshold. We will pack them into a big method after this PR.

This PR deals the oversize gen'd codes in deeply nested expressions, not oversize combination of codes from the children.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already explained it in #19767 (comment)

Mostly it's ok because the threshold is just an estimation, not a big deal to make it 2 times larger. CASE WHEN may be a problem and we can evaluate it in #18641 after this PR gets merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two problems I think for this. One is even the two childs' code don't exceed the threshold individually, a method not over 64k but over 8k is still big and bad for JIT. One is we estimate it with code length, I'm not sure if two 1000 length childs won't definitely generate 64k method in the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no way to guarantee it with the current string based codegen framework, even without this PR. 1000 length code may also generate 64kb byte code in the end.

1024 is not a good estimation at all, kind of random to me. So multiplying it with 2 doesn't seem a big issue. CASE WHEN may have issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW if it's really an issue, we can add splitting logic in non-leaf/non-unary nodes. This is much less work than before because: 1. no need to care about unary nodes 2. the splitting logic can be simpler because all children are guaranteed to generate less than 1000 LOC.

@kiszk
Copy link
Member

kiszk commented Nov 19, 2017

Would it be better to replace "SPARK-...." in test cases for (#15620 #18972 #18641) with this JIRA number or add this JIRA number to these test cases

val funcName = ctx.freshName(nodeName)
val funcFullName = ctx.addNewFunction(funcName,
s"""
|private $javaType $funcName(InternalRow ${ctx.INPUT_ROW}) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To continue the discussion in #19767 (comment)

I think there are more global variables can be eliminated by leveraging the method return value. However in some cases, we use global variables to avoid creating an object for each iteration, then we are facing a trade-off between GC overhead and global variable overhead. It would be great if java has something like C struct and can allocate objects on method stack...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for this fix. I like your approach here.

Actually what happens depends on the type of the variable and anyway I think that most of the time we are reinitializing anyway these objects, thus the only thing we are saving using global variables is the pointer and I am not sure if this is a big deal.

@cloud-fan
Copy link
Contributor Author

@felixcheung I'd like to keep it in master only, it has larger impaction than other related PRs.

@SparkQA
Copy link

SparkQA commented Nov 22, 2017

Test build #84085 has finished for PR 19767 at commit d126977.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

will review it tonight.

if (ve.code.trim.length > 1024 && ctx.INPUT_ROW != null && ctx.currentVars == null) {
val setIsNull = if (ve.isNull != "false" && ve.isNull != "true") {
val globalIsNull = ctx.freshName("globalIsNull")
ctx.addMutableState("boolean", globalIsNull, s"$globalIsNull = false;")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> ctx.JAVA_BOOLEAN

@@ -105,6 +105,36 @@ abstract class Expression extends TreeNode[Expression] {
val isNull = ctx.freshName("isNull")
val value = ctx.freshName("value")
val ve = doGenCode(ctx, ExprCode("", isNull, value))

// TODO: support whole stage codegen too
if (ve.code.trim.length > 1024 && ctx.INPUT_ROW != null && ctx.currentVars == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you change 1024 to 1? Just to ensure whether all the tests can pass and then change it back to 1024?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it won't work because of hitting other limitations, e.g. JVM constant pool.

I'll try something bigger, like 100

""".stripMargin)

ve.value = newValue
ve.code = s"$javaType $newValue = $funcFullName(${ctx.INPUT_ROW});"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create a separate function for this?

@gatorsmile
Copy link
Member

LGTM except the above comments.

@SparkQA
Copy link

SparkQA commented Nov 22, 2017

Test build #84099 has finished for PR 19767 at commit c875329.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

val javaType = ctx.javaType(dataType)
val newValue = ctx.freshName("value")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed? I think we can use eval.value instead of it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ev.value may be a global variable and here we need a local variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we strictly need a local variable here? Can't we simply assign ev.value to the generated function return value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then how are we going to change this?
eval.code = s"$javaType $newValue = $funcFullName(${ctx.INPUT_ROW});"

Saving a local variable is nothing and I think we shouldn't complicate the code(check if a variable is global) because of this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, do you mean just do eval.value = s"$funcFullName(${ctx.INPUT_ROW})"? Let me try

Copy link
Contributor

@mgaido91 mgaido91 Nov 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant:

eval.code = s"${eval.value} = $funcFullName(${ctx.INPUT_ROW});"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this won't work because ${eval.value} is not declared if it's not a global variable. I went with

eval.code = ""
eval.value = s"$funcFullName(${ctx.INPUT_ROW})"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, sorry, you're right. Then I think your previous solution is better: in this way if eval.value is used multiple times we are recomputing the function every time, thus your original implementation was fine, sorry for the bad comment.

@SparkQA
Copy link

SparkQA commented Nov 22, 2017

Test build #84104 has finished for PR 19767 at commit 86cba3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 22, 2017

Test build #84111 has finished for PR 19767 at commit e494844.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 22, 2017

Test build #84110 has finished for PR 19767 at commit 29188fe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 22, 2017

Test build #84112 has finished for PR 19767 at commit c015d33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

LGTM

@gatorsmile
Copy link
Member

Thanks! Merged to master.

@asfgit asfgit closed this in 0605ad7 Nov 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
8 participants