New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-18016][SQL][CATALYST] Code Generation: Constant Pool Limit #16648
Conversation
ok to test |
Test build #71730 has finished for PR 16648 at commit
|
This patch doesn't seem to be working: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass8 has grown past JVM limit of 0xFFFF |
Steps to replicate: val schema = StructType(
(0 to 8000).map(n ⇒ StructField(s"column_$n", StringType))
)
val values = schema.map(_ ⇒ null)
val rows = spark.sparkContext.parallelize(Seq(Row(values:_*)))
val frame = spark.sqlContext.createDataFrame(rows, schema)
frame
.write
.format("parquet")
.save("/tmp/test") |
Thanks for that other test case. The one you provide I would say falls in the same class of error, however, this patch is still capable of addressing some others that still exist. While class-splitting is capable of handling more complex schemas (ones that are reliant on object creation like for JavaBean's and Avro), there are still instances where the shear number of variables can still blow the constant pool limit. In particular, if an enormous amount of mutable state is kicked up into the outer class. In spark 2.0.x, it was previously the case that global mutable state was used more sparingly, however, (as is more directly the case for your test case) there are instances where conditional expressions produce an enormous amount of mutable state (see SPARK-18091 for a recent change that can produce a great degree of mutable state). In your test case, the shear amount of mutable state generated for conditional null-checks is already over 65,536. One strategy might be to create a cache of excess mutable state only when the volume of mutable state threatens to breach the constant pool limit. Some type of cache (perhaps even just a simple array in the outer class) still accessible to the outer and nested classes would allow us to both keep code between classes within limits, and also keep the amount of mutable state in the outer class manageable. I thought such a caching scheme was a bit out of the scope of this look into class-splitting. Other strategies may exist for addressing the constant pool limit for the existing code-generation scheme, but I don't see how they fit quite as well given Catalysts proclivity for generating a single large class for each of its operations. I'd be glad to look into an implementation of such an excess-mutable state caching scheme if it'd help this PR along by addressing a wider range of use cases. Thanks again for the case, I'm glad to hear thoughts on this issue. |
@bdrillard Is there a particular reason why this patch hasn't been looked at yet? I think you should CC some of the authors of the code you have changed to speed things up. |
85e81ed
to
abfd06f
Compare
Test build #74341 has finished for PR 16648 at commit
|
abfd06f
to
187d5c3
Compare
val frame = spark.sqlContext.createDataFrame(rows, schema) | ||
|
||
frame.show() | ||
// .write |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good to remove these commented lines before the merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I had just cleaned up that test to match automated the Scala stylechecks.
Test build #74345 has finished for PR 16648 at commit
|
a1e5937
to
2bf183e
Compare
Test build #74346 has finished for PR 16648 at commit
|
51ee31c
to
635535e
Compare
I've made some changes to this PR to address @mkiedys comments, and I'm using their test case, as it sets a higher bar for both class splitting and management of mutable state. Mutable state and its initialization seems to create a significant potential limitation for the size of schemas that can be marshaled to datasets. Not only is it possible for the amount of private variables required by mutable state to themselves grow beyond 2^16, but the initialization functions, which include references to that state, when inlined to the main outerclass, also puts significant strain towards the Constant Pool limit. The strategy I attempt to implement, including class splitting, as already mentioned above, is to 'compact' mutable state of primitives and simply-assigned objects into bounded arrays that can be initialized with simple loops rather than large init functions. |
final ${ctx.javaType(dataType)} ${ev.value} = $partitionMaskTerm + $countTerm; | ||
$countTerm++;""", isNull = "false") | ||
final ${ctx.javaType(dataType)} ${ev.value} = $partitionMaskTermAccessor + $countTermAccessor; | ||
$countTermAccessor++;""", isNull = "false") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Accessor suffixes to variable names add quite a bit of noise in this PR. What value do they add from your perspective?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having addMutableState
return an accessor string is an important part of addressing the manner in which mutable state can contribute to Constant Pool errors. Code that creates mutable state usually takes for granted that the symbol used to declare the state will be inlined as a private member variable to the class. However, for sufficiently complicated schemas, mutable state and its initialization alone can breach the Constant Pool limit. The strategy I settled on was to have mutable state potentially be compacted into arrays of like type and initialization, this way, we can reduce the number of references that would count to the constant pool limit. Of course, if the mutable state is stored in an array, rather than in a private variable named after the symbol, we need to return back the accessor for that index in the compacted mutable state array, hence the 'accessor' suffixes. I had also tried a class-based approach, in which excess mutable state could become static members of nested classes, initialization functions for the state could still exceed the constant pool limit.
This PR can be condensed to two core components to approach a solution to the (hard-and-fast) Constant Pool limit:
- split excess code among classes
- compact excess mutable state into arrays
I should mention, not all mutable state is compacted into arrays. Only primitives and collections of simply-assigned objects (null assigned, or no assignment). But this array compaction strategy reduces references enough to allow even complex schemas in which we would potentially generate much more state than 2^16 to still be converted to datasets.
Test build #74348 has finished for PR 16648 at commit
|
Test build #75399 has finished for PR 16648 at commit
|
Test build #75585 has finished for PR 16648 at commit
|
I encountered this Exception when handling a data frame with 3000+ columns. I hope this patch got resolved soon. |
@bdrillard if you don't have time to finish this up I am happy to update this to latest. I would really like to see this fixed since it's silly that you can't have more than 3k columns |
Just wanted to mention this is a blocker for using most of the pipeline transformers for wide data frames, which is sad since 3000 columns (my use case) is not really very large. |
gentle ping @bdrillard |
@HyukjinKwon @robert3005 I'll have some time soon to update this PR for the latest master. Thanks for the interest. It is a non-trivial change and would require a comprehensive code review. |
Yea, I just wanted to make sure this is on progress in any way. |
320db91
to
f5f8f5c
Compare
Test build #76816 has finished for PR 16648 at commit
|
Test build #76818 has finished for PR 16648 at commit
|
Test build #76822 has finished for PR 16648 at commit
|
1c08d1c
to
8ce47fa
Compare
Test build #76978 has started for PR 16648 at commit |
As written in the comment, this PR enables the following two features. Current generated code in the description seem to show only feature 1. Would it be possible to update code to include features 1 and 2?
|
@kiszk, I've updated the pull-request description to include example code generation for mutable state compaction as well (which comes from inspecting the generated code for the test case for SPARK-18016). |
Test build #77102 has finished for PR 16648 at commit
|
* variable is inlined to the class, or an array access if the variable is to be stored | ||
* in an array of variables of the same type and initialization. | ||
*/ | ||
def addMutableState( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for updating generated code. It makes clear.
When I see this method, addMutableState
seems to return an array element if a variable is primitive or simple object even if the number of mutable states are small. Does to always use array element lead to performance overhead compared to using instance variables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kiszk That's a good question. My initial reaction would be to say that any differences would be negligible, since Java will store both member variables of an Object and values of an array in heap, and one usually assumes O(1) access time for an array item (naturally we'd assume O(1) access for a member variable).
But I've tried to take time and do analysis of the bytecode generated under both scenarios, where all data is inlined as member variables, and where data is compacted into arrays. The conservative conclusion is that the difference in performance is negligible. The results and the testing classes can be seen here. Some folks with more knowledge of JVM architecture may catch a flaw in my experiment design, this is my shot at an informative benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that both of order is the same. Beyond that, I am interested in the difference of coefficients. Do you know them?
When I see the benchmark program, I have one question. Is there any reason to measure the performance by only one trial. IIUC, in general, Java performance is measured after warmup run due to the existence of Just-In-Time compiler.
Here is a good example.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the example on creating a more representative timing profile. I'll rework the experiment to programmatically run several trials and account for warmup, etc. I had run several trials each with similar results, but we may as well make a more robust profiling driver that can run far more trials.
As for calculating the coefficients of the algorithm directly, I have not yet done so. I'll have to perform analysis of the bytecode I generated (and included in the repo) to provide a more satisfying answer.
@bdrillard Can we split this PR into two smaller PRs?
IIUC, What do you think? |
@kiszk We could do that, definitely. Changes in Feature 1 (splitting excess code among classes) are limited to the I can either do two separate PRs, or I can refactor the commit history such that there are two commits that can be viewed separately in this same PR, one for each feature. The only reason I'd hesitate to open two separate PRs (and so try to get the commit history to separate the two major features as two commits), is that code generation and state proliferation are tightly coupled: the test case that I include, SPARK-18016, cannot be resolved with just splitting into nested sub-classes or by compacting mutable state -- both features are required. Any class that would require sub-classes would probably also require mutable state compaction (and vice-versa). So I'd suggest the strategy where we have a commit for Feature 1 and a commit for Feature 2. That way in the PR we can choose to only view one commit at a time during review, but the code is still tested against the more representative test case. Would that be alright? |
I see. I understand two facts.
In my personal opinion, it would be good to split this into two PRs while I understand Fact 2. IIUC, the review is performed based on each PR, not each commit. To split it into two PRs would make review easier. If you make the first PR, you could put the test case with smaller columns. Then, when you will make another PR, you could update it with larger columns. |
@kiszk Sure, I'm glad to help make this change easier to review. I'll first make a PR that focuses on code splitting into nested classes. There should be a test case with a number of columns that is high enough to induce at least one instance of generating a nested class, but still low enough to not trigger failure based on the amount of global mutable state. If that PR is reviewed and is considered acceptable, we can open a second to address global mutable state. |
So this PR introduces 2 approaches to work around the Constant Pool Limit: 1. put member variables to inner class 2: compact primitive declarations into arrays. It looks to me that either of them can solve the problem, do we really need both of them? |
@cloud-fan Good question, and I think we can resolve it by using different values for We should also note that #18075 does slightly more than putting just member variables into nested classes. While it is true that a significant degree of local state alone that would get inlined to the Outer Class gets inlined instead to nested classes instead, the patch leads to even more reductions in the size of the constant pool, since there are additional items that get inlined to nested classes that also count towards the limit (e.g. field references, method references, variable types, method types, etc, see Java class file, The Constant Pool). The second feature (included here, but not in #18075), is precisely as you describe: it takes simply-declared fields that would be inlined globally and compacts them into like-typed and like-declared arrays. However, if we set Conversely, if we only include mutable state compaction at Looking at both #18075 and this pull-request, I think the takeaway is that even if all we do for the moment is split excess code among nested classes, we can still make a significant gain in the number of columns a Dataset can hold, which gives #18075 merit on its own. If we want to increase that limit even more though, we'll have to address proliferation of global state as well, perhaps by opening a follow-up PR that focuses on it more closely, maybe using the compaction strategy I've attempted here in #16648, or by exploring another method. Thoughts on this? |
thanks, this makes a lot sense! I'll review #18075 |
…lass Splitting ## What changes were proposed in this pull request? This pull-request exclusively includes the class splitting feature described in #16648. When code for a given class would grow beyond 1600k bytes, a private, nested sub-class is generated into which subsequent functions are inlined. Additional sub-classes are generated as the code threshold is met subsequent times. This code includes 3 changes: 1. Includes helper maps, lists, and functions for keeping track of sub-classes during code generation (included in the `CodeGenerator` class). These helper functions allow nested classes and split functions to be initialized/declared/inlined to the appropriate locations in the various projection classes. 2. Changes `addNewFunction` to return a string to support instances where a split function is inlined to a nested class and not the outer class (and so must be invoked using the class-qualified name). Uses of `addNewFunction` throughout the codebase are modified so that the returned name is properly used. 3. Removes instances of the `this` keyword when used on data inside generated classes. All state declared in the outer class is by default global and accessible to the nested classes. However, if a reference to global state in a nested class is prepended with the `this` keyword, it would attempt to reference state belonging to the nested class (which would not exist), rather than the correct variable belonging to the outer class. ## How was this patch tested? Added a test case to the `GeneratedProjectionSuite` that increases the number of columns tested in various projections to a threshold that would previously have triggered a `JaninoRuntimeException` for the Constant Pool. Note: This PR does not address the second Constant Pool issue with code generation (also mentioned in #16648): excess global mutable state. A second PR may be opened to resolve that issue. Author: ALeksander Eskilson <alek.eskilson@cerner.com> Closes #18075 from bdrillard/class_splitting_only.
…lass Splitting ## What changes were proposed in this pull request? This pull-request exclusively includes the class splitting feature described in apache#16648. When code for a given class would grow beyond 1600k bytes, a private, nested sub-class is generated into which subsequent functions are inlined. Additional sub-classes are generated as the code threshold is met subsequent times. This code includes 3 changes: 1. Includes helper maps, lists, and functions for keeping track of sub-classes during code generation (included in the `CodeGenerator` class). These helper functions allow nested classes and split functions to be initialized/declared/inlined to the appropriate locations in the various projection classes. 2. Changes `addNewFunction` to return a string to support instances where a split function is inlined to a nested class and not the outer class (and so must be invoked using the class-qualified name). Uses of `addNewFunction` throughout the codebase are modified so that the returned name is properly used. 3. Removes instances of the `this` keyword when used on data inside generated classes. All state declared in the outer class is by default global and accessible to the nested classes. However, if a reference to global state in a nested class is prepended with the `this` keyword, it would attempt to reference state belonging to the nested class (which would not exist), rather than the correct variable belonging to the outer class. ## How was this patch tested? Added a test case to the `GeneratedProjectionSuite` that increases the number of columns tested in various projections to a threshold that would previously have triggered a `JaninoRuntimeException` for the Constant Pool. Note: This PR does not address the second Constant Pool issue with code generation (also mentioned in apache#16648): excess global mutable state. A second PR may be opened to resolve that issue. Author: ALeksander Eskilson <alek.eskilson@cerner.com> Closes apache#18075 from bdrillard/class_splitting_only.
ping @bdrillard for the 2nd part of this PR |
Thanks @kiszk, I'll work on preparing a PR for the second half of this issue. |
kindly ping @bdrillard |
1 similar comment
kindly ping @bdrillard |
@bdrillard gentle ping |
I'm blocking out time to prepare the part 2 PR for this issue starting today over this week, regarding compaction of excess primitive state. cc: @kiszk |
@bdrillard Thank you very much |
Can one of the admins verify this patch? |
This PR was addressed in #18075. |
[class_splitting] Code Generation: Constant Pool Limit
What changes were proposed in this pull request?
Supports code generation for large structures that would previously trigger a Constant Pool limit exception, as noted in SPARK-18016. In this fix, when the volume of generated code for the class would exceed 1600k bytes, a new private nested class is declared, and any new functions that would have been inlined to the outer class with an
addNewFunction
call are inlined to the new nested class instead.addNewFunction
also would now return the name of the function registered (class-qualified, if it would be inlined to a nested class), so that the caller of the function can call it even if inlined to a different class. Additional nested classes are generated if the threshold is met subsequent times. These nested classes are instantiated and declared at the bottom of the generated outer class.Because private nested classes have access to the outer class's global state, but their functions and local state do not count towards the outer class's Constant Pool, and that they can be instantiated in the same outer class without the need to declare additional classes and handle the dependency injection, they seem to be a good candidate to solve this particular issue.
One key quality of this patch is that the common path for code generation remains unaffected. The 1600k threshold necessary to split of a nested class should only be exceeded in scenarios where the schema is extremely large. Generated code for most use cases will still be inlined entirely to the single outer class.
One other feature of this patch is the compaction of primitive declarations and simply-assigned object declarations (objects initialized to null, or not initialized at all) into arrays. Code generation can sometimes produce an excess of global state, so we reduce the number of declarations that would count towards the Constant Limit by creating arrays such that for a given variable of common type and declaration, it can be mapped to a specific index of an array. Using arrays of mutable state, the number of variables in global state is less constrained by the constant pool limit, but by the maximum size of an array.
This patch splits code (only code registered through the
addNewFunction
call) among the outer class and nested classes, and creates arrays for compactable mutable state like below:How was this patch tested?
Added a new test to the DataframeComplexTypeSuite that tests converting a large structure to a dataset. Ran full regression tests across every module.