[SPARK-8443][SQL] Split GenerateMutableProjection Codegen due to JVM Code Size Limits #7076

saurfang · 2015-06-29T01:43:35Z

By grouping projection calls into multiple apply functions, we can raise the number of projections codegen can handle from ~1k to ~60k. I set the unit test to use 5k projections because 60k took 15s to complete.
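To illustrate the idea, here is a minimal sketch of grouping generated statements into several helper methods so that no single method body grows too large. It is not the code from this PR: `GroupedApplySketch`, `emit`, the `applyN(Object row)` method shape, and grouping by a fixed count are all assumptions made for the example.

```scala
object GroupedApplySketch {
  // Hypothetical helper (not Spark's API): wraps each group of generated
  // statements in its own private method and returns
  // (method definitions, call statements for the main apply body).
  def emit(statements: Seq[String], groupSize: Int): (String, String) = {
    val groups = statements.grouped(groupSize).toSeq
    val methods = groups.zipWithIndex.map { case (group, i) =>
      val body = group.mkString("\n")
      s"private void apply$i(Object row) {\n$body\n}"
    }
    // One call per helper method, in the original statement order.
    val calls = groups.indices.map(i => s"apply$i(row);")
    (methods.mkString("\n\n"), calls.mkString("\n"))
  }
}
```

With three statements and a group size of two, this yields two helper methods and two calls. The main apply body then contains only the calls, so its size grows with the number of groups rather than the number of projections.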

cloud-fan · 2015-06-29T04:21:19Z

Does GenerateProjection have this issue too?

sarutak · 2015-06-29T05:03:23Z

ok to test.

saurfang · 2015-06-30T01:35:03Z

@cloud-fan Possibly. A similar naive test on GenerateProjection breaks on equals functions. That being said, I wonder what would be a concrete case that will trigger this problem as it isn't apparent to me how GenerateProjection is used.

Should I go ahead and fix GenerateProjection as well? If so, can I put the helper function that splits the code in package.scala, or would you recommend creating a utility object under the codegen package?

maropu · 2015-07-10T06:23:55Z

This fix looks like a bit of a hack to me.
It'd be better to check the code size and, if it is over 64KB (the janino limitation), throw an exception
so that we fall back to InterpretedMutableProjection.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L193
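The fallback suggested here could look roughly like the sketch below. Everything in it is a stand-in chosen for illustration: `FallbackSketch`, `compile`, `interpreted`, and `newProjection` are hypothetical names, the limit is checked against source length only to keep the example self-contained, and the exception type is not the one janino throws.

```scala
import scala.util.control.NonFatal

object FallbackSketch {
  type Projection = Seq[Int] => Seq[Int]

  // Assumed stand-in for the 64KB limit, measured in source characters
  // purely for the sake of the example.
  val MaxCodeSize: Int = 64 * 1024

  // Hypothetical stand-in for codegen compilation: fails on oversized input.
  def compile(source: String): Projection = {
    if (source.length > MaxCodeSize) {
      throw new IllegalStateException(
        s"generated code is ${source.length} chars, over $MaxCodeSize")
    }
    row => row.map(_ * 2)
  }

  // Stand-in for the interpreted path: same result, nothing compiled.
  val interpreted: Projection = row => row.map(_ * 2)

  // Try the compiled path first; fall back to the interpreted one on failure.
  def newProjection(source: String): (String, Projection) =
    try {
      ("codegen", compile(source))
    } catch {
      case NonFatal(_) => ("interpreted", interpreted)
    }
}
```

The trade-off is that the fallback keeps queries working for any number of columns but gives up whatever depends on the generated path.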

saurfang · 2015-07-10T15:39:29Z

I agree with you that this fix is kind of a hack. However, I would argue that if a human were to write this code, they would also split the projection calls into blocks. The difference is that a human would break them into logical blocks (e.g. grouping columns that are similar to each other) for readability. A machine doesn't care and would just split them as long as things work. This is the suggestion I saw when I searched for this exception, which also occurs in other code generation frameworks.

That being said, if we were to fall back to InterpretedMutableProjection, would spark-unsafe not go into effect anymore? I understand unsafe is currently predicated on codegen and unsafe is kind of critical for our current use case.

I also think this kind of scenario where you have lots of columns that can be code generated is where codegen would likely shine the most?

maropu · 2015-07-11T14:06:17Z

Yes, falling back to normal expressions turns off the unsafe optimization.
I am concerned that this fix is less meaningful for most users
because they typically have few columns.
So I think this PR should fix the bug by throwing an exception in CodeGenerator#compile,
because that is trivial and prevents other codegen'd classes (e.g., GenerateOrdering and GenerateProjection) from having the same issue.
Then we can discuss your proposal in follow-up PRs.

rxin · 2015-07-15T23:26:52Z

In reality how many columns would it take to go over the limit?

rxin · 2015-07-17T05:04:45Z

I think it is fine to leave the number at 50. @saurfang do you have time to bring this up to date, and to avoid creating the extra methods if there is only one group?

chenghao-intel · 2015-07-17T06:42:32Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala

@@ -42,4 +42,8 @@ class CodeGenerationSuite extends SparkFunSuite {

    futures.foreach(Await.result(_, 10.seconds))
  }
+
+  test("SPARK-8443: code size limit") {
+    GenerateMutableProjection.generate(List.fill(5000)(EqualTo(Literal(1), Literal(1))))


Will codegen throw an exception if the code size is too large? Or do we at least need to execute the code once?

Yes. The exception triggers as soon as compile(code) is called.

saurfang · 2015-07-18T05:35:40Z

I have pushed a new commit: if only one block is generated, the projections are inlined as before. Can you please review? Or would you prefer that I take the other approach, which groups every, say, 50 calls rather than explicitly checking the code length?

rxin · 2015-07-18T06:48:20Z

...main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateMutableProjection.scala

+          case Nil => List(code)
+          case head::tail =>
+            // code size limit is 64kb and each char takes less or equal to 2 bytes
+            if (head.length < 32 * 1000) {


Can you give it a larger safety factor, e.g. use 16k?

@rxin

per @rxin suggestion to improve code readability
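For reference, the size-based splitting discussed in this thread can be sketched roughly as follows. This is an illustrative reconstruction rather than the merged code: `SizeSplitSketch`, `split`, and `MaxBlockChars` are hypothetical names. The threshold follows the reasoning in the diff comment (a 64KB limit and at most 2 bytes per char gives about 32k chars), halved to 16k as requested above.

```scala
import scala.collection.mutable.ArrayBuffer

object SizeSplitSketch {
  // Assumed threshold: 16k chars, i.e. half of the ~32k chars derived
  // from the 64KB limit at up to 2 bytes per char.
  val MaxBlockChars: Int = 16 * 1000

  // Accumulates projection code into blocks, starting a new block once the
  // current one has grown past maxChars.
  def split(projections: Seq[String], maxChars: Int = MaxBlockChars): Seq[String] = {
    val blocks = ArrayBuffer.empty[String]
    val builder = new StringBuilder
    for (projection <- projections) {
      if (builder.length > maxChars) {
        blocks += builder.toString()
        builder.clear()
      }
      builder.append(projection)
    }
    // Flush whatever is left so the last projections are not dropped.
    if (builder.nonEmpty) blocks += builder.toString()
    blocks.toSeq
  }
}
```

Because the length check runs before each append, a block can overshoot `maxChars` by up to one projection's worth of code, which is one reason to keep a generous margin below the hard limit.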

saurfang · 2015-07-18T22:52:04Z

Sounds good. Is this more like what you were looking for?

rxin · 2015-07-18T22:58:24Z

Yup - thanks.

Jenkins, ok to test.

rxin · 2015-07-18T22:58:28Z

Jenkins, ok to test.

SparkQA · 2015-07-19T00:33:45Z

Test build #37741 has finished for PR 7076 at commit adef95a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-07-19T01:31:53Z

...main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateMutableProjection.scala

+        projectionBlocks.append(blockBuilder.toString())
+        blockBuilder.clear()
+      }
+      blockBuilder.append(projection)


Should we insert newlines so that the generated code is slightly more readable?

The code itself already has a newline before and after. I looked at the debug output and the code looks reasonable. I'm happy to add an extra newline to be safe in case that assumption changes in the future. Just let me know.

We can add it later if it's a problem; this seems fine for now, but just wanted to check. Thanks for looking into this.

JoshRosen · 2015-07-19T01:34:03Z

This seems reasonable to me, although I have one quick question RE: newlines in the generated code.

rxin · 2015-07-19T01:35:33Z

@saurfang in addition to Josh's comments, can you update the test so that it also executes the generated code, rather than just compiling it?

JoshRosen · 2015-07-19T01:35:35Z

~~Actually, I think there's one piece missing: we should add a test that generates a projection that gets split and then actually executes the generated projection.~~

Edit: Reynold beat me to it :)

saurfang · 2015-07-19T02:19:23Z

Done. Let me know if this is sufficient.

SparkQA · 2015-07-19T03:49:58Z

Test build #37747 has finished for PR 7076 at commit b7a7635.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2015-07-19T04:05:26Z

Thanks - merging this in master!

maropu mentioned this pull request Jul 15, 2015

[SPARK-9058][SQL] Split projectionCode if it is too large for JVM #7418

Closed

chenghao-intel reviewed Jul 17, 2015

saurfang added 2 commits July 18, 2015 01:08

[SPARK-8443][SQL] split projection code by size limit

9405680

[SPARK-8443][SQL] inline execution if one block only

1b5aa7e

saurfang force-pushed the codegen_size_limit branch from 590a9f4 to 1b5aa7e Compare July 18, 2015 05:35

rxin reviewed Jul 18, 2015

[SPARK-8443][SQL] Use safer factor and rewrite splitting code

adef95a

per @rxin suggestion to improve code readability

JoshRosen reviewed Jul 19, 2015

[SPARK-8443][SQL] Execute and verify split projections in test

b7a7635

asfgit closed this in 6cb6096 Jul 19, 2015

This was referenced Feb 17, 2016

[SPARK-13242] [SQL] Generate one method per when clause #11221

Closed

[SPARK-13242] [SQL] Fall back to interpreting complex when expressions #11243

Closed

kiszk mentioned this pull request Mar 31, 2016

[Spark-14138][SQL] Fix generated SpecificColumnarIterator code can exceed JVM size limit for cached DataFrames #11984

Closed
