
[Spark-14138][SQL] Fix generated SpecificColumnarIterator code can exceed JVM size limit for cached DataFrames #11984

Closed
wants to merge 13 commits into from

Conversation

kiszk
Member

@kiszk kiszk commented Mar 27, 2016

What changes were proposed in this pull request?

This PR reduces the Java bytecode size of the methods in SpecificColumnarIterator using two approaches:

  1. Generate and call getTYPEColumnAccessor() only for each type that is actually used when instantiating accessors
  2. Group long runs of method calls (more than 4000) into separate methods
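Approach 2 — chunking a long run of generated statements into helper methods so that each method stays under the JVM's 64KB per-method bytecode limit — can be sketched in isolation. This is an illustrative sketch only; `splitStatements` is a hypothetical helper, not the actual codegen code in this PR:

```scala
// Hypothetical sketch of approach 2: split a long list of generated
// statements into private helper methods so that no single generated
// method exceeds the JVM's 64KB bytecode limit.
object SplitSketch {
  // Group `statements` into chunks of at most `threshold`, wrap each chunk
  // in its own private method, and return (method definitions, call sites).
  def splitStatements(statements: Seq[String], threshold: Int): (String, String) = {
    val groups = statements.grouped(threshold).zipWithIndex.toSeq
    val defs = groups.map { case (stmts, i) =>
      s"private void extractors$i() {\n  ${stmts.mkString("\n  ")}\n}"
    }.mkString("\n")
    val calls = groups.map { case (_, i) => s"extractors$i();" }.mkString("\n")
    (defs, calls)
  }
}
```

Given 10 statements and a threshold of 4, this yields three helper methods (`extractors0` through `extractors2`) and three corresponding one-line call sites.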

How was this patch tested?

Added a new unit test to InMemoryColumnarQuerySuite

Here is the generated code:

/* 033 */   private org.apache.spark.sql.execution.columnar.CachedBatch batch = null;
/* 034 */
/* 035 */   private org.apache.spark.sql.execution.columnar.IntColumnAccessor accessor;
/* 036 */   private org.apache.spark.sql.execution.columnar.IntColumnAccessor accessor1;
/* 037 */
/* 038 */   public SpecificColumnarIterator() {
/* 039 */     this.nativeOrder = ByteOrder.nativeOrder();
/* 040 */     this.mutableRow = new MutableUnsafeRow(rowWriter);
/* 041 */   }
/* 042 */
/* 043 */   public void initialize(Iterator input, DataType[] columnTypes, int[] columnIndexes,
/* 044 */     boolean columnNullables[]) {
/* 045 */     this.input = input;
/* 046 */     this.columnTypes = columnTypes;
/* 047 */     this.columnIndexes = columnIndexes;
/* 048 */   }
/* 049 */
/* 050 */
/* 051 */   private org.apache.spark.sql.execution.columnar.IntColumnAccessor getIntColumnAccessor(int idx) {
/* 052 */     byte[] buffer = batch.buffers()[columnIndexes[idx]];
/* 053 */     return new org.apache.spark.sql.execution.columnar.IntColumnAccessor(ByteBuffer.wrap(buffer).order(nativeOrder));
/* 054 */   }
/* 055 */
/* 056 */
/* 057 */
/* 058 */
/* 059 */
/* 060 */
/* 061 */   public boolean hasNext() {
/* 062 */     if (currentRow < numRowsInBatch) {
/* 063 */       return true;
/* 064 */     }
/* 065 */     if (!input.hasNext()) {
/* 066 */       return false;
/* 067 */     }
/* 068 */
/* 069 */     batch = (org.apache.spark.sql.execution.columnar.CachedBatch) input.next();
/* 070 */     currentRow = 0;
/* 071 */     numRowsInBatch = batch.numRows();
/* 072 */     accessor = getIntColumnAccessor(0);
/* 073 */     accessor1 = getIntColumnAccessor(1);
/* 074 */
/* 075 */     return hasNext();
/* 076 */   }
/* 077 */
/* 078 */   public InternalRow next() {
/* 079 */     currentRow += 1;
/* 080 */     bufferHolder.reset();
/* 081 */     rowWriter.zeroOutNullBytes();
/* 082 */     accessor.extractTo(mutableRow, 0);
/* 083 */     accessor1.extractTo(mutableRow, 1);
/* 084 */     unsafeRow.setTotalSize(bufferHolder.totalSize());
/* 085 */     return unsafeRow;
/* 086 */   }


@SparkQA

SparkQA commented Mar 27, 2016

Test build #54274 has finished for PR 11984 at commit fea2a52.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 27, 2016

Test build #54276 has finished for PR 11984 at commit e56406e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member Author

kiszk commented Mar 27, 2016

Jenkins, retest this please

@SparkQA

SparkQA commented Mar 27, 2016

Test build #54279 has finished for PR 11984 at commit 60f6719.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 27, 2016

Test build #54277 has finished for PR 11984 at commit 60f6719.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 27, 2016

Test build #54287 has finished for PR 11984 at commit 226bad5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 27, 2016

Test build #54288 has finished for PR 11984 at commit 9346793.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member Author

kiszk commented Mar 28, 2016

@davies , would it be possible for you to take a look at this?

}

/* 4000 = 64000 bytes / 16 (up to 16 bytes per call) */
val numberOfStatementsThreshold = 4000
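The arithmetic in the comment above can be checked directly. This is a sketch: 64000 approximates the JVM's 64KB per-method bytecode limit, and 16 is the PR's own estimate of the bytecode cost per generated call, not a measured value:

```scala
object ThresholdCheck {
  // 64KB per-method bytecode budget divided by the estimated
  // bytecode size of one generated call gives the statement threshold.
  val methodBytecodeLimit = 64000
  val bytesPerCall = 16
  val numberOfStatementsThreshold: Int = methodBytecodeLimit / bytesPerCall
}
```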
Contributor

A Java method will not be JITted if it's over 8K, so we may need a smaller threshold here. Could you also manually check that (for performance)?

Member Author

@davies , thank you for your comment. I did not know about that limitation. I have now confirmed that these methods are compiled, as follows:

hotspot_pid19296.log:<nmethod compile_id='10059' compiler='C1' level='3' entry='0x00007f03a9574500' size='3024' address='0x00007f03a95742d0' relocation_offset='296' insts_offset='560' stub_offset='2096' scopes_data_offset='2472' scopes_pcs_offset='2680' dependencies_offset='2984' nul_chk_table_offset='2992' oops_offset='2424' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator hasNext ()Z' bytes='92' count='384' iicount='384' stamp='25.140'/>
hotspot_pid19296.log:<nmethod compile_id='11143' compiler='C1' level='3' entry='0x00007f03a8ec0680' size='3656' address='0x00007f03a8ec0450' relocation_offset='296' insts_offset='560' stub_offset='2352' scopes_data_offset='2752' scopes_pcs_offset='3192' dependencies_offset='3592' nul_chk_table_offset='3600' oops_offset='2664' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator next ()Lorg/apache/spark/sql/catalyst/InternalRow;' bytes='88' count='384' iicount='384' stamp='34.011'/>
hotspot_pid19296.log:<nmethod compile_id='11144' compiler='C1' level='3' entry='0x00007f03a9a5dbc0' size='226544' address='0x00007f03a9a5a890' relocation_offset='296' insts_offset='13104' stub_offset='155280' scopes_data_offset='163488' scopes_pcs_offset='198456' dependencies_offset='222520' nul_chk_table_offset='222528' oops_offset='163432' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator extractors0$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='6867' count='391' iicount='391' stamp='34.105'/>
hotspot_pid19296.log:<nmethod compile_id='11255' compiler='C1' level='3' entry='0x00007f03aa327e00' size='226632' address='0x00007f03aa324ad0' relocation_offset='296' insts_offset='13104' stub_offset='155280' scopes_data_offset='163472' scopes_pcs_offset='198544' dependencies_offset='222608' nul_chk_table_offset='222616' oops_offset='163432' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator extractors2$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='7001' count='521' iicount='521' stamp='37.163'/>
hotspot_pid19296.log:<nmethod compile_id='11256' compiler='C1' level='3' entry='0x00007f03aa35f380' size='226664' address='0x00007f03aa35c050' relocation_offset='296' insts_offset='13104' stub_offset='155280' scopes_data_offset='163472' scopes_pcs_offset='198552' dependencies_offset='222632' nul_chk_table_offset='222640' oops_offset='163432' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator extractors4$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='7001' count='530' iicount='530' stamp='37.286'/>
hotspot_pid19296.log:<nmethod compile_id='11257' compiler='C1' level='3' entry='0x00007f03aa396900' size='226664' address='0x00007f03aa3935d0' relocation_offset='296' insts_offset='13104' stub_offset='155280' scopes_data_offset='163472' scopes_pcs_offset='198552' dependencies_offset='222632' nul_chk_table_offset='222640' oops_offset='163432' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator extractors5$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='7001' count='541' iicount='541' stamp='37.427'/>
hotspot_pid19296.log:<nmethod compile_id='11263' compiler='C1' level='3' entry='0x00007f03aa3cde80' size='226632' address='0x00007f03aa3cab50' relocation_offset='296' insts_offset='13104' stub_offset='155280' scopes_data_offset='163472' scopes_pcs_offset='198544' dependencies_offset='222608' nul_chk_table_offset='222616' oops_offset='163432' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator extractors1$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='7001' count='547' iicount='547' stamp='37.516'/>
hotspot_pid19296.log:<nmethod compile_id='11264' compiler='C1' level='3' entry='0x00007f03aa405400' size='226664' address='0x00007f03aa4020d0' relocation_offset='296' insts_offset='13104' stub_offset='155280' scopes_data_offset='163472' scopes_pcs_offset='198552' dependencies_offset='222632' nul_chk_table_offset='222640' oops_offset='163432' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator extractors3$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='7001' count='555' iicount='555' stamp='37.607'/>
hotspot_pid19296.log:<nmethod compile_id='10750' compiler='C1' level='3' entry='0x00007f03a9fe9dc0' size='340208' address='0x00007f03a9fe5790' relocation_offset='296' insts_offset='17968' stub_offset='182384' scopes_data_offset='193120' scopes_pcs_offset='304680' dependencies_offset='335480' handler_table_offset='335488' nul_chk_table_offset='337840' oops_offset='192920' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator accessors0$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='5367' count='231' iicount='231' stamp='29.373'/>
hotspot_pid19296.log:<nmethod compile_id='10751' compiler='C1' level='3' entry='0x00007f03aa03cec0' size='340528' address='0x00007f03aa038890' relocation_offset='296' insts_offset='17968' stub_offset='182608' scopes_data_offset='193296' scopes_pcs_offset='305000' dependencies_offset='335800' handler_table_offset='335808' nul_chk_table_offset='338160' oops_offset='193144' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator accessors1$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='5501' count='239' iicount='239' stamp='29.464'/>
hotspot_pid19296.log:<nmethod compile_id='11053' compiler='C1' level='3' entry='0x00007f03aa1c74c0' size='340528' address='0x00007f03aa1c2e90' relocation_offset='296' insts_offset='17968' stub_offset='182608' scopes_data_offset='193296' scopes_pcs_offset='305000' dependencies_offset='335800' handler_table_offset='335808' nul_chk_table_offset='338160' oops_offset='193144' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator accessors2$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='5501' count='367' iicount='367' stamp='32.552'/>
hotspot_pid19296.log:<nmethod compile_id='11056' compiler='C1' level='3' entry='0x00007f03aa227c40' size='340528' address='0x00007f03aa223610' relocation_offset='296' insts_offset='17968' stub_offset='182608' scopes_data_offset='193296' scopes_pcs_offset='305000' dependencies_offset='335800' handler_table_offset='335808' nul_chk_table_offset='338160' oops_offset='193144' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator accessors5$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='5501' count='370' iicount='370' stamp='32.699'/>
hotspot_pid19296.log:<nmethod compile_id='11054' compiler='C1' level='3' entry='0x00007f03aa27ec40' size='340528' address='0x00007f03aa27a610' relocation_offset='296' insts_offset='17968' stub_offset='182608' scopes_data_offset='193296' scopes_pcs_offset='305000' dependencies_offset='335800' handler_table_offset='335808' nul_chk_table_offset='338160' oops_offset='193144' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator accessors3$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='5501' count='370' iicount='370' stamp='32.948'/>
hotspot_pid19296.log:<nmethod compile_id='11055' compiler='C1' level='3' entry='0x00007f03aa2d1e80' size='340528' address='0x00007f03aa2cd850' relocation_offset='296' insts_offset='17968' stub_offset='182608' scopes_data_offset='193296' scopes_pcs_offset='305000' dependencies_offset='335800' handler_table_offset='335808' nul_chk_table_offset='338160' oops_offset='193144' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator accessors4$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='5501' count='369' iicount='369' stamp='33.083'/>

Member Author

I confirmed this by running the following program:

    val df = sc.parallelize(1 to 100).toDF()
    val aggr = (1 to 3000).map(colnum => avg(df.col("_1")).as(s"col_$colnum"))
    val res = df.groupBy("_1").agg(count("_1"), aggr: _*).cache()
    for (i <- 0 to 110)
      res.collect()

Contributor

They could be JITted after decreasing it to 500, right?

val shortCls = accessorCls.substring(accessorCls.lastIndexOf(".") + 1)
dt match {
case t if ctx.isPrimitiveType(dt) =>
s"$accessorName = get${accessorClasses.getOrElseUpdate(accessorCls, shortCls)}($index);"
Contributor

I think we should just call ColumnAccessor.apply() (making it accessible in generated java code)

Member Author

Could you please explain this in more detail? I do not understand your suggestion. Should we put the call to `ColumnAccessor.apply()` in this method or in the generated Java code?

Contributor

$accessorName = ($accessorCls) ColumnAccessor.apply(columnTypes[$index], ByteBuffer.wrap(batch.buffers()[columnIndexes[$index]]));

Contributor

This may change the number of bytes per column, so you need to redo the calculation and test it.

Member Author

For example, a generated method getIntColumnAccessor() still calls `ColumnAccessor.apply()`.

Do you want to call ColumnAccessor.apply() directly from hasNext() instead of calling it through getIntColumnAccessor()?

Contributor

It's better to call ColumnAccessor.apply() to avoid this complexity.

Member Author

I understand your motivation. I will revert my bytecode-size-reduction changes to avoid this complexity.

Member

When you directly call ColumnAccessor.apply, I think we don't need getXXXColumnAccessor anymore?

@SparkQA

SparkQA commented Mar 30, 2016

Test build #54538 has finished for PR 11984 at commit beb9840.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 31, 2016

Test build #54599 has finished for PR 11984 at commit 16cf602.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member Author

kiszk commented Mar 31, 2016

Again, I confirmed that these methods are compiled:

hotspot_pid30419.log:<nmethod compile_id='8066' compiler='C1' level='2' entry='0x00007fa2fa4d5a40' size='2480' address='0x00007fa2fa4d5850' relocation_offset='296' insts_offset='496' stub_offset='1328' scopes_data_offset='1664' scopes_pcs_offset='2008' dependencies_offset='2408' nul_chk_table_offset='2416' oops_offset='1624' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator hasNext ()Z' bytes='116' 
hotspot_pid30419.log:<nmethod compile_id='10941' compiler='C1' level='3' entry='0x00007fa2fad77b20' size='3552' address='0x00007fa2fad77910' relocation_offset='296' insts_offset='528' stub_offset='2288' scopes_data_offset='2672' scopes_pcs_offset='3104' dependencies_offset='3488' nul_chk_table_offset='3496' oops_offset='2584' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator next ()Lorg/apache/spark/sql/catalyst/InternalRow;' bytes='84' count='638' iicount='638' stamp='13.693'/>
hotspot_pid30419.log:<nmethod compile_id='10335' compiler='C1' level='3' entry='0x00007fa2fa9b8fe0' size='412608' address='0x00007fa2fa9b3e10' relocation_offset='296' insts_offset='20944' stub_offset='240080' scopes_data_offset='251056' scopes_pcs_offset='361672' dependencies_offset='404712' handler_table_offset='404720' nul_chk_table_offset='407792' oops_offset='250888' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator accessors0$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='5667' count='333' iicount='333' stamp='11.349'/>
hotspot_pid30419.log:<nmethod compile_id='10336' compiler='C1' level='3' entry='0x00007fa2f9ffe220' size='9792' address='0x00007fa2f9ffdf50' relocation_offset='296' insts_offset='720' stub_offset='5840' scopes_data_offset='6256' scopes_pcs_offset='8760' dependencies_offset='9624' handler_table_offset='9632' nul_chk_table_offset='9728' oops_offset='6120' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator accessors5$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='59' count='340' iicount='340' stamp='11.369'/>
hotspot_pid30419.log:<nmethod compile_id='10340' compiler='C1' level='3' entry='0x00007fa2faa1dbe0' size='412848' address='0x00007fa2faa18a10' relocation_offset='296' insts_offset='20944' stub_offset='240336' scopes_data_offset='251280' scopes_pcs_offset='361912' dependencies_offset='404952' handler_table_offset='404960' nul_chk_table_offset='408032' oops_offset='251144' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator accessors1$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='5801' count='368' iicount='368' stamp='11.481'/>
hotspot_pid30419.log:<nmethod compile_id='10341' compiler='C1' level='3' entry='0x00007fa2faa828a0' size='412848' address='0x00007fa2faa7d6d0' relocation_offset='296' insts_offset='20944' stub_offset='240336' scopes_data_offset='251280' scopes_pcs_offset='361912' dependencies_offset='404952' handler_table_offset='404960' nul_chk_table_offset='408032' oops_offset='251144' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator accessors2$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='5801' count='368' iicount='368' stamp='11.634'/>
hotspot_pid30419.log:<nmethod compile_id='10601' compiler='C1' level='3' entry='0x00007fa2fab22a60' size='412848' address='0x00007fa2fab1d890' relocation_offset='296' insts_offset='20944' stub_offset='240336' scopes_data_offset='251280' scopes_pcs_offset='361912' dependencies_offset='404952' handler_table_offset='404960' nul_chk_table_offset='408032' oops_offset='251144' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator accessors4$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='5801' count='459' iicount='459' stamp='12.269'/>
hotspot_pid30419.log:<nmethod compile_id='10602' compiler='C1' level='3' entry='0x00007fa2fab89360' size='412848' address='0x00007fa2fab84190' relocation_offset='296' insts_offset='20944' stub_offset='240336' scopes_data_offset='251280' scopes_pcs_offset='361912' dependencies_offset='404952' handler_table_offset='404960' nul_chk_table_offset='408032' oops_offset='251144' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator accessors3$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='5801' count='465' iicount='465' stamp='12.353'/>
hotspot_pid30419.log:<nmethod compile_id='10548' compiler='C1' level='3' entry='0x00007fa2fa477d60' size='1696' address='0x00007fa2fa477bd0' relocation_offset='296' insts_offset='400' stub_offset='1136' scopes_data_offset='1360' scopes_pcs_offset='1496' dependencies_offset='1656' nul_chk_table_offset='1664' oops_offset='1320' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator extractors5$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='29' count='385' iicount='385' stamp='11.913'/>
hotspot_pid30419.log:<nmethod compile_id='10550' compiler='C1' level='3' entry='0x00007fa2fa23f440' size='90160' address='0x00007fa2fa23df90' relocation_offset='296' insts_offset='5296' stub_offset='62256' scopes_data_offset='65648' scopes_pcs_offset='78872' dependencies_offset='88536' nul_chk_table_offset='88544' oops_offset='65608' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator extractors2$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='2801' count='388' iicount='388' stamp='11.935'/>
hotspot_pid30419.log:<nmethod compile_id='10552' compiler='C1' level='3' entry='0x00007fa2faae3840' size='90160' address='0x00007fa2faae2390' relocation_offset='296' insts_offset='5296' stub_offset='62256' scopes_data_offset='65648' scopes_pcs_offset='78872' dependencies_offset='88536' nul_chk_table_offset='88544' oops_offset='65608' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator extractors1$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='2801' count='395' iicount='395' stamp='11.958'/>
hotspot_pid30419.log:<nmethod compile_id='10553' compiler='C1' level='3' entry='0x00007fa2faaf9880' size='90160' address='0x00007fa2faaf83d0' relocation_offset='296' insts_offset='5296' stub_offset='62256' scopes_data_offset='65648' scopes_pcs_offset='78872' dependencies_offset='88536' nul_chk_table_offset='88544' oops_offset='65608' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator extractors3$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='2801' count='398' iicount='398' stamp='11.981'/>
hotspot_pid30419.log:<nmethod compile_id='10786' compiler='C1' level='3' entry='0x00007fa2facdf700' size='90160' address='0x00007fa2facde250' relocation_offset='296' insts_offset='5296' stub_offset='62256' scopes_data_offset='65648' scopes_pcs_offset='78872' dependencies_offset='88536' nul_chk_table_offset='88544' oops_offset='65608' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator extractors4$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='2801' count='542' iicount='542' stamp='12.974'/>
hotspot_pid30419.log:<nmethod compile_id='10943' compiler='C1' level='3' entry='0x00007fa2fad7e140' size='90064' address='0x00007fa2fad7cc90' relocation_offset='296' insts_offset='5296' stub_offset='62256' scopes_data_offset='65664' scopes_pcs_offset='78776' dependencies_offset='88440' nul_chk_table_offset='88448' oops_offset='65608' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator extractors0$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='2667' count='647' iicount='647' stamp='13.722'/>

* We should keep less than 8000
*/
val numberOfStatementsThreshold = 200
val (initializerAccessorFuncs, initializerAccessorCalls, extractorFuncs, extractorCalls) =
Contributor

we could use ctx.addFunction to simplify these
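The suggestion to route helpers through the codegen context can be sketched with a simplified mock. `MockCodegenContext` below only illustrates the pattern; it is not Spark's actual `CodegenContext` API, and the `addFunction` signature here is an assumption:

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for a codegen context: helper function bodies are
// registered on the context and emitted together later, while call sites
// only reference the function name.
class MockCodegenContext {
  private val funcs = ArrayBuffer.empty[String]
  // Register a helper function's source; return its name for call sites.
  def addFunction(name: String, code: String): String = {
    funcs += code
    name
  }
  def declaredFunctions: String = funcs.mkString("\n")
}

object AddFunctionSketch {
  // Emit one call per chunk of statements; the bodies accumulate in `ctx`.
  def emit(ctx: MockCodegenContext, chunks: Seq[Seq[String]]): String =
    chunks.zipWithIndex.map { case (stmts, i) =>
      val name = ctx.addFunction(s"extractors$i",
        s"private void extractors$i() { ${stmts.mkString(" ")} }")
      s"$name();"
    }.mkString("\n")
}
```

Each chunk becomes a registered helper plus a one-line call site, so the size of the calling method stays bounded regardless of the column count.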

@SparkQA

SparkQA commented Mar 31, 2016

Test build #54606 has finished for PR 11984 at commit a310bfc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 31, 2016

Test build #54607 has finished for PR 11984 at commit c1acf82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 31, 2016

Test build #54654 has finished for PR 11984 at commit 60cebd5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 31, 2016

Test build #54659 has finished for PR 11984 at commit 3a05ddf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor

davies commented Mar 31, 2016

LGTM, merging this into master, and 1.6 (if no conflict).

asfgit pushed a commit that referenced this pull request Mar 31, 2016
…xceed JVM size limit for cached DataFrames


Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #11984 from kiszk/SPARK-14138.
@davies
Contributor

davies commented Mar 31, 2016

@kiszk Could you create another patch for 1.6 ?

@kiszk
Member Author

kiszk commented Mar 31, 2016

@davies , thank you for merging it. Since you have merged this into 1.6, do you want me to create another patch for master?
I will do that.

@davies
Contributor

davies commented Mar 31, 2016

@kiszk Just realized that this PR is against 1.6 branch, could you always create PR for master first?

@davies
Contributor

davies commented Mar 31, 2016

@kiszk Please send a PR for master, and close this PR, thanks!

@kiszk
Member Author

kiszk commented Mar 31, 2016

@davies , sorry, I made a mistake. Next time I will create the PR against master first.
I will send a PR for this issue against master this afternoon (Japan time), and will then close this PR.

zzcclp pushed a commit to zzcclp/spark that referenced this pull request Apr 1, 2016
…xceed JVM size limit for cached DataFrames


(cherry picked from commit f12f11e)
@kiszk kiszk closed this Apr 1, 2016