[SPARK-8443][SQL] Split GenerateMutableProjection Codegen due to JVM Code Size Limits #7076

Closed
wants to merge 4 commits into from

saurfang
Contributor

By grouping projection calls into multiple apply functions, we can raise the number of projections codegen can handle from ~1k to ~60k. I have set the unit test to use 5k projections, since the 60k case took 15s to complete.

@cloud-fan
Contributor

Does GenerateProjection have this issue too?

@sarutak
Member

sarutak commented Jun 29, 2015

ok to test.

@saurfang
Contributor Author

@cloud-fan Possibly. A similar naive test on GenerateProjection breaks on the equals functions. That said, I wonder what concrete case would trigger this problem, as it isn't apparent to me how GenerateProjection is used.

Should I go ahead and fix GenerateProjection as well? If so, can I put the helper function that splits the code in package.scala, or do you recommend creating a utility object under the codegen package?

@maropu
Member

maropu commented Jul 10, 2015

This fix seems like something of a hack to me.
It'd be better to check the code size and, if it is over 64KB (the janino limitation), throw an exception
to fall back to InterpretedMutableProjection.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L193
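
The fallback suggested here can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: `Projection`, `compile`, and `projectionFor` are stand-ins invented for illustration and are not Spark's or Janino's APIs, and the size check on source length only approximates a compiler that rejects oversized methods.

```scala
import scala.util.control.NonFatal

object FallbackSketch {
  // Hypothetical stand-in for a projection: maps an input row to an output row.
  type Projection = Seq[Any] => Seq[Any]

  // Assumed limit for illustration; the thread cites 64KB.
  val MaxCodeSize: Int = 64 * 1024

  // Hypothetical "compiler": rejects source above the limit, otherwise
  // returns the supplied compiled projection unchanged.
  def compile(code: String, compiled: Projection): Projection = {
    if (code.length > MaxCodeSize) {
      throw new IllegalStateException(
        s"Generated code is ${code.length} chars, over the $MaxCodeSize limit")
    }
    compiled
  }

  // Try the code-generated path first; fall back to the interpreted
  // projection if compilation fails for any non-fatal reason.
  def projectionFor(code: String, compiled: Projection, interpreted: Projection): Projection =
    try compile(code, compiled)
    catch { case NonFatal(_) => interpreted }
}
```

The trade-off raised later in the thread still applies to a design like this: the fallback keeps queries running, but anything that depends on the code-generated path is lost for those queries.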

@saurfang
Contributor Author

I agree that this fix is kind of a hack. However, I would argue that if a human were to write this code, they would also split the projection calls into blocks. The difference is that a human would break them into logical blocks (e.g. grouping columns that are similar to each other) for readability. A machine doesn't care and would just split them as long as things work. This seems to be the suggestion I have seen when I searched for this exception, which also happens in other code generation frameworks.

That being said, if we were to fall back to InterpretedMutableProjection, would spark-unsafe no longer take effect? I understand unsafe is currently predicated on codegen, and unsafe is kind of critical for our current use case.

I also think this kind of scenario where you have lots of columns that can be code generated is where codegen would likely shine the most?

@maropu
Member

maropu commented Jul 11, 2015

Yes, and falling back to normal expressions turns off the unsafe optimization.
I am concerned that this fix is less meaningful for most users
because they tend to have few columns.
So I think this PR should fix the bug by throwing an exception in CodeGenerator#compile,
because that is trivial and prevents other codegen'd classes (e.g., GenerateOrdering and GenerateProjection) from having the same issue.
Then we can discuss your proposal in follow-up PRs.

@rxin
Contributor

rxin commented Jul 15, 2015

In reality how many columns would it take to go over the limit?

@rxin
Contributor

rxin commented Jul 17, 2015

I think it is fine to leave the number at 50. @saurfang do you have time to bring this up to date, and to avoid creating the extra methods if there is only one group?
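
A rough sketch of the count-based grouping discussed here, with the single-group case inlined. This is not the code from the PR: `groupProjections`, the `applyN` method names, and the `Object row` parameter are hypothetical choices made for illustration, and the projections are treated as opaque strings of generated Java.

```scala
object GroupSketch {
  // Hypothetical helper: returns (extra method definitions, body of the main apply).
  def groupProjections(projections: Seq[String], groupSize: Int = 50): (String, String) = {
    val groups = projections.grouped(groupSize).toSeq
    if (groups.size <= 1) {
      // Zero or one group: inline everything and emit no extra methods.
      ("", projections.mkString("\n"))
    } else {
      // One private method per group of projections.
      val methods = groups.zipWithIndex.map { case (group, i) =>
        val body = group.mkString("\n")
        s"private void apply$i(Object row) {\n$body\n}"
      }.mkString("\n\n")
      // The main apply just calls each generated method in order.
      val calls = groups.indices.map(i => s"apply$i(row);").mkString("\n")
      (methods, calls)
    }
  }
}
```

With 120 projections and the default group size, this would emit three helper methods and a main body of three calls; with 50 or fewer, the output is the same inlined code as before the split.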

@@ -42,4 +42,8 @@ class CodeGenerationSuite extends SparkFunSuite {

    futures.foreach(Await.result(_, 10.seconds))
  }

  test("SPARK-8443: code size limit") {
    GenerateMutableProjection.generate(List.fill(5000)(EqualTo(Literal(1), Literal(1))))
Contributor

Will codegen throw an exception if the code size is too large, or do we at least need to execute the code once?

Contributor Author

Yes. The exception is thrown as soon as compile(code) is called.

@saurfang
Contributor Author

I have pushed a new commit so that, if only one block is generated, the projections are inlined as before. Can you please review? Or would you prefer the other approach, which groups every, say, 50 calls rather than explicitly checking the code length?

case Nil => List(code)
case head :: tail =>
  // The code size limit is 64KB and each char takes at most 2 bytes
  if (head.length < 32 * 1000) {
Contributor

Can you give it a larger safety factor, e.g. use 16k?
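
For illustration, a length-based split with a configurable budget might look like the sketch below. It is an assumption-laden sketch rather than the PR's implementation: `splitByLength` and `maxChars` are hypothetical names. The JVM limits the compiled bytecode of a single method to under 64KB, and source length in characters is only a proxy for that, which is one reason a conservative budget such as 16k is safer than 32k.

```scala
import scala.collection.mutable.ArrayBuffer

object LengthSplitSketch {
  // Hypothetical helper. `maxChars` is a heuristic budget on source length,
  // not a measurement of the compiled bytecode size.
  def splitByLength(projections: Seq[String], maxChars: Int = 16 * 1000): Seq[String] = {
    val blocks = ArrayBuffer.empty[String]
    val current = new StringBuilder
    for (p <- projections) {
      // Start a new block when adding `p` would exceed the budget.
      // A single oversized projection still gets a block of its own.
      if (current.nonEmpty && current.length + p.length > maxChars) {
        blocks += current.toString()
        current.clear()
      }
      current.append(p)
    }
    // Flush whatever is left in the final block.
    if (current.nonEmpty) blocks += current.toString()
    blocks.toSeq
  }
}
```

Note that no budget on source length can guarantee that a block compiles under the limit; a single very large expression would still need a different remedy, such as the fallback discussed earlier in the thread.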

per @rxin's suggestion to improve code readability
@saurfang
Contributor Author

Sounds good. Is this more like what you were looking for?

@rxin
Contributor

rxin commented Jul 18, 2015

Yup - thanks.

Jenkins, ok to test.

@rxin
Contributor

rxin commented Jul 18, 2015

Jenkins, ok to test.

@SparkQA

SparkQA commented Jul 19, 2015

Test build #37741 has finished for PR 7076 at commit adef95a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    projectionBlocks.append(blockBuilder.toString())
    blockBuilder.clear()
  }
  blockBuilder.append(projection)
Contributor

Should we insert newlines so that the generated code is slightly more readable?

Contributor Author

The code itself already has a newline before and after. I looked at the debug output and the code looks reasonable. I'm happy to add an extra newline to be safe in case that assumption changes in the future. Just let me know.

Contributor

We can add it later if it's a problem; this seems fine for now, but just wanted to check. Thanks for looking into this.

@JoshRosen
Contributor

This seems reasonable to me, although I have one quick question RE: newlines in the generated code.

@rxin
Contributor

rxin commented Jul 19, 2015

@saurfang in addition to Josh's comments, can you update the test so it tries to execute the generated code, in addition to just compiling it?

@JoshRosen
Contributor

Actually, I think there's one piece missing: we should add a test that generates a projection that gets split and then actually executes the generated projection.

Edit: Reynold beat me to it :)

@saurfang
Contributor Author

Done. Let me know if this is sufficient.

@SparkQA

SparkQA commented Jul 19, 2015

Test build #37747 has finished for PR 7076 at commit b7a7635.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jul 19, 2015

Thanks - merging this in master!
