-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-8443][SQL] Split GenerateMutableProjection Codegen due to JVM Code Size Limits #7076
Conversation
Does |
ok to test. |
@cloud-fan Possibly. A similar naive test on Should I go ahead and fix |
This fix is kind of hack things to me. |
I agree with you this fix is kind of a hack. However I would argue if a human were to write these code, he would also split the projection calls into blocks. The difference is really a human would break them into logical blocks (e.g. groups columns that are similar to each other) for readability purposes. Machine doesn't care and would just split them as long as things work. This seems to be the suggestion I have seen when I searched for this exception which also happens in other code generation frameworks. That being said, if we were to fall back to I also think this kind of scenario where you have lots of columns that can be code generated is where codegen would likely shine the most? |
Yes, and falling back into normal expressions turns off unsafe optimization. |
In reality how many columns would it take to go over the limit? |
I think it is fine to leave the number at 50. @saurfang do you have time to bring this up to date, and don't create the extra methods if there is only one group? |
@@ -42,4 +42,8 @@ class CodeGenerationSuite extends SparkFunSuite { | |||
|
|||
futures.foreach(Await.result(_, 10.seconds)) | |||
} | |||
|
|||
test("SPARK-8443: code size limit") { | |||
GenerateMutableProjection.generate(List.fill(5000)(EqualTo(Literal(1), Literal(1)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will codegen cause exception if the code size is too large? Or we at least to execute the code once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. The exception triggers as soon as compile(code)
is called
I have pushed a new commit that if only one block is generated, then projections will be inlined as before. Can you please review? Or do you prefer I doing the other approach that groups every say 50 calls rather than explicitly checks the code length? |
590a9f4
to
1b5aa7e
Compare
case Nil => List(code) | ||
case head::tail => | ||
// code size limit is 64kb and each char takes less or equal to 2 bytes | ||
if (head.length < 32 * 1000) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you give it more factor of safety? e.g. use 16k?
per @rxin suggestion to improve code readability
Sounds good. Is this more like what you were looking for? |
Yup - thanks. Jenkins, ok to test. |
Jenkins, ok to test. |
Test build #37741 has finished for PR 7076 at commit
|
projectionBlocks.append(blockBuilder.toString()) | ||
blockBuilder.clear() | ||
} | ||
blockBuilder.append(projection) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we insert newlines so that the generated code is slightly more readable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code itself already has a new line before and after. I looked at debug results and the code look reasonably. I'm happy to add an extra newline to be safe in case that assumption changes in the future. Just let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add it later if it's a problem; this seems fine for now, but just wanted to check. Thanks for looking into this.
This seems reasonable to me, although I have one quick question RE: newlines in the generated code. |
@saurfang in addition to josh's comments, can you update the test so tries to execute the generated code, in addition to just compiling? |
Edit: Reynold beat me to it :) |
Done. Let me know if this is sufficient. |
Test build #37747 has finished for PR 7076 at commit
|
Thanks - merging this in master! |
By grouping projection calls into multiple apply function, we are able to push the number of projections codegen can handle from ~1k to ~60k. I have set the unit test to test against 5k as 60k took 15s for the unit test to complete.