[SPARK-14138][SQL] Fix generated SpecificColumnarIterator code can exceed JVM size limit for cached DataFrames #11984
Conversation
group a lot of calls into a method
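The core idea of this commit — chunking a long run of generated statements into numbered private helper methods so that no single method exceeds JVM size limits — can be sketched outside Spark's codegen as follows. The class name, helper naming scheme, and threshold below are illustrative, not the PR's actual code:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitExpressions {
    // Illustrative cap on statements per generated helper method; the PR
    // tunes this value against the 64KB method limit and the JIT threshold.
    static final int THRESHOLD = 200;

    // Emits numbered helper methods, each holding at most THRESHOLD
    // statements, followed by the sequence of calls that invokes them.
    public static String group(String name, List<String> statements) {
        StringBuilder funcs = new StringBuilder();
        StringBuilder calls = new StringBuilder();
        for (int i = 0; i * THRESHOLD < statements.size(); i++) {
            List<String> chunk = statements.subList(
                i * THRESHOLD, Math.min((i + 1) * THRESHOLD, statements.size()));
            funcs.append("private void ").append(name).append(i).append("() {\n");
            for (String stmt : chunk) {
                funcs.append("  ").append(stmt).append("\n");
            }
            funcs.append("}\n");
            calls.append(name).append(i).append("();\n");
        }
        return funcs.append(calls).toString();
    }

    public static void main(String[] args) {
        List<String> stmts = new ArrayList<>();
        for (int i = 0; i < 450; i++) {
            stmts.add("accessor" + i + ".extractTo(mutableRow, " + i + ");");
        }
        // 450 statements with a threshold of 200 yield three helper methods.
        System.out.println(SplitExpressions.group("extractors", stmts).contains("extractors2();")); // prints true
    }
}
```

Spark's actual generated class ends up with numbered methods such as `extractors0$`, `extractors1$`, …, which is what the JIT logs later in this conversation show.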
Test build #54274 has finished for PR 11984 at commit
Test build #54276 has finished for PR 11984 at commit
Jenkins, retest this please
Test build #54279 has finished for PR 11984 at commit
Test build #54277 has finished for PR 11984 at commit
Test build #54287 has finished for PR 11984 at commit
Test build #54288 has finished for PR 11984 at commit
@davies, would it be possible for you to take a look at this?
```scala
}

/* 4000 = 64000 bytes / 16 (up to 16 bytes per one call) */
val numberOfStatementsThreshold = 4000
```
A Java method will not be JITted if it's over 8K, so we may need a smaller threshold here. Could you also manually check that (for performance)?
@davies, thank you for your comment. I did not know about that limitation. I have now confirmed that these methods are compiled, as follows:
hotspot_pid19296.log:<nmethod compile_id='10059' compiler='C1' level='3' entry='0x00007f03a9574500' size='3024' address='0x00007f03a95742d0' relocation_offset='296' insts_offset='560' stub_offset='2096' scopes_data_offset='2472' scopes_pcs_offset='2680' dependencies_offset='2984' nul_chk_table_offset='2992' oops_offset='2424' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator hasNext ()Z' bytes='92' count='384' iicount='384' stamp='25.140'/>
hotspot_pid19296.log:<nmethod compile_id='11143' compiler='C1' level='3' entry='0x00007f03a8ec0680' size='3656' address='0x00007f03a8ec0450' relocation_offset='296' insts_offset='560' stub_offset='2352' scopes_data_offset='2752' scopes_pcs_offset='3192' dependencies_offset='3592' nul_chk_table_offset='3600' oops_offset='2664' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator next ()Lorg/apache/spark/sql/catalyst/InternalRow;' bytes='88' count='384' iicount='384' stamp='34.011'/>
hotspot_pid19296.log:<nmethod compile_id='11144' compiler='C1' level='3' entry='0x00007f03a9a5dbc0' size='226544' address='0x00007f03a9a5a890' relocation_offset='296' insts_offset='13104' stub_offset='155280' scopes_data_offset='163488' scopes_pcs_offset='198456' dependencies_offset='222520' nul_chk_table_offset='222528' oops_offset='163432' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator extractors0$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='6867' count='391' iicount='391' stamp='34.105'/>
hotspot_pid19296.log:<nmethod compile_id='11255' compiler='C1' level='3' entry='0x00007f03aa327e00' size='226632' address='0x00007f03aa324ad0' relocation_offset='296' insts_offset='13104' stub_offset='155280' scopes_data_offset='163472' scopes_pcs_offset='198544' dependencies_offset='222608' nul_chk_table_offset='222616' oops_offset='163432' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator extractors2$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='7001' count='521' iicount='521' stamp='37.163'/>
hotspot_pid19296.log:<nmethod compile_id='11256' compiler='C1' level='3' entry='0x00007f03aa35f380' size='226664' address='0x00007f03aa35c050' relocation_offset='296' insts_offset='13104' stub_offset='155280' scopes_data_offset='163472' scopes_pcs_offset='198552' dependencies_offset='222632' nul_chk_table_offset='222640' oops_offset='163432' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator extractors4$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='7001' count='530' iicount='530' stamp='37.286'/>
hotspot_pid19296.log:<nmethod compile_id='11257' compiler='C1' level='3' entry='0x00007f03aa396900' size='226664' address='0x00007f03aa3935d0' relocation_offset='296' insts_offset='13104' stub_offset='155280' scopes_data_offset='163472' scopes_pcs_offset='198552' dependencies_offset='222632' nul_chk_table_offset='222640' oops_offset='163432' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator extractors5$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='7001' count='541' iicount='541' stamp='37.427'/>
hotspot_pid19296.log:<nmethod compile_id='11263' compiler='C1' level='3' entry='0x00007f03aa3cde80' size='226632' address='0x00007f03aa3cab50' relocation_offset='296' insts_offset='13104' stub_offset='155280' scopes_data_offset='163472' scopes_pcs_offset='198544' dependencies_offset='222608' nul_chk_table_offset='222616' oops_offset='163432' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator extractors1$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='7001' count='547' iicount='547' stamp='37.516'/>
hotspot_pid19296.log:<nmethod compile_id='11264' compiler='C1' level='3' entry='0x00007f03aa405400' size='226664' address='0x00007f03aa4020d0' relocation_offset='296' insts_offset='13104' stub_offset='155280' scopes_data_offset='163472' scopes_pcs_offset='198552' dependencies_offset='222632' nul_chk_table_offset='222640' oops_offset='163432' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator extractors3$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='7001' count='555' iicount='555' stamp='37.607'/>
hotspot_pid19296.log:<nmethod compile_id='10750' compiler='C1' level='3' entry='0x00007f03a9fe9dc0' size='340208' address='0x00007f03a9fe5790' relocation_offset='296' insts_offset='17968' stub_offset='182384' scopes_data_offset='193120' scopes_pcs_offset='304680' dependencies_offset='335480' handler_table_offset='335488' nul_chk_table_offset='337840' oops_offset='192920' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator accessors0$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='5367' count='231' iicount='231' stamp='29.373'/>
hotspot_pid19296.log:<nmethod compile_id='10751' compiler='C1' level='3' entry='0x00007f03aa03cec0' size='340528' address='0x00007f03aa038890' relocation_offset='296' insts_offset='17968' stub_offset='182608' scopes_data_offset='193296' scopes_pcs_offset='305000' dependencies_offset='335800' handler_table_offset='335808' nul_chk_table_offset='338160' oops_offset='193144' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator accessors1$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='5501' count='239' iicount='239' stamp='29.464'/>
hotspot_pid19296.log:<nmethod compile_id='11053' compiler='C1' level='3' entry='0x00007f03aa1c74c0' size='340528' address='0x00007f03aa1c2e90' relocation_offset='296' insts_offset='17968' stub_offset='182608' scopes_data_offset='193296' scopes_pcs_offset='305000' dependencies_offset='335800' handler_table_offset='335808' nul_chk_table_offset='338160' oops_offset='193144' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator accessors2$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='5501' count='367' iicount='367' stamp='32.552'/>
hotspot_pid19296.log:<nmethod compile_id='11056' compiler='C1' level='3' entry='0x00007f03aa227c40' size='340528' address='0x00007f03aa223610' relocation_offset='296' insts_offset='17968' stub_offset='182608' scopes_data_offset='193296' scopes_pcs_offset='305000' dependencies_offset='335800' handler_table_offset='335808' nul_chk_table_offset='338160' oops_offset='193144' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator accessors5$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='5501' count='370' iicount='370' stamp='32.699'/>
hotspot_pid19296.log:<nmethod compile_id='11054' compiler='C1' level='3' entry='0x00007f03aa27ec40' size='340528' address='0x00007f03aa27a610' relocation_offset='296' insts_offset='17968' stub_offset='182608' scopes_data_offset='193296' scopes_pcs_offset='305000' dependencies_offset='335800' handler_table_offset='335808' nul_chk_table_offset='338160' oops_offset='193144' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator accessors3$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='5501' count='370' iicount='370' stamp='32.948'/>
hotspot_pid19296.log:<nmethod compile_id='11055' compiler='C1' level='3' entry='0x00007f03aa2d1e80' size='340528' address='0x00007f03aa2cd850' relocation_offset='296' insts_offset='17968' stub_offset='182608' scopes_data_offset='193296' scopes_pcs_offset='305000' dependencies_offset='335800' handler_table_offset='335808' nul_chk_table_offset='338160' oops_offset='193144' method='org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator accessors4$ (Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificColumnarIterator;)V' bytes='5501' count='369' iicount='369' stamp='33.083'/>
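For reference, compilation logs like the `hotspot_pid*.log` entries above can be produced with HotSpot's diagnostic logging flags. A sketch (`MyApp` is a placeholder, and flag availability depends on the JVM build):

```shell
# Enable HotSpot's XML compilation log; by default it is written to
# hotspot_pid<pid>.log in the working directory.
java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation MyApp

# Then check whether the generated methods were compiled:
grep 'SpecificColumnarIterator' hotspot_pid*.log
```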
I confirmed this by running the following program:
```scala
val df = sc.parallelize(1 to 100).toDF()
val aggr = (1 to 3000).map(colnum => avg(df.col("_1")).as(s"col_$colnum"))
val res = df.groupBy("_1").agg(count("_1"), aggr: _*).cache()
for (i <- 0 to 110) {
  res.collect()
}
```
They could be JITted after decreasing it to 500, right?
```scala
val shortCls = accessorCls.substring(accessorCls.lastIndexOf(".") + 1)
dt match {
  case t if ctx.isPrimitiveType(dt) =>
    s"$accessorName = get${accessorClasses.getOrElseUpdate(accessorCls, shortCls)}($index);"
```
I think we should just call `ColumnAccessor.apply()` (making it accessible in generated Java code).
Could you please explain this in more detail? I don't understand your suggestion. Should we put the `ColumnAccessor.apply()` call in this method or in the generated Java code?
```java
$accessorName = ($accessorCls) ColumnAccessor.apply(columnTypes[$index], ByteBuffer.wrap(batch.buffers()[columnIndexes[$index]]));
```
This may change the number of bytes per column, so you need to redo the calculation and test it.
For example, a generated method `getIntColumnAccessor()` still calls `ColumnAccessor.apply()`. Do you want to call `ColumnAccessor.apply()` directly from `hasNext()` instead of calling it through `getIntColumnAccessor()`?
It's better to call `ColumnAccessor.apply()` directly to avoid this complexity.
I understand your motivation. I will revert my changes for reducing bytecode size to avoid this complexity.
When you call `ColumnAccessor.apply` directly, I think we don't need `getXXXColumnAccessor` anymore?
Test build #54538 has finished for PR 11984 at commit
Test build #54599 has finished for PR 11984 at commit
Again, I confirmed that these methods are compiled.
```scala
 * We should keep less than 8000
 */
val numberOfStatementsThreshold = 200

val (initializerAccessorFuncs, initializerAccessorCalls, extractorFuncs, extractorCalls) =
```
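The two thresholds can be reconciled with a back-of-the-envelope calculation. The original 4000 came from budgeting 64000 bytes against the 64KB method-size limit at "up to 16 bytes per one call"; backing out the implied per-statement cost for the revised value of 200 against the ~8000-byte JIT cutoff gives about 40 bytes. The 40-byte figure is an inference, not stated in the patch:

```java
public class Thresholds {
    public static void main(String[] args) {
        int methodBudget = 64000; // the patch budgets 64000 bytes against the 64KB (65535-byte) JVM cap
        int jitCutoff = 8000;     // HotSpot skips JIT compilation of methods above ~8000 bytes

        // Original threshold: 64000-byte budget at an assumed 16 bytes per call.
        System.out.println(methodBudget / 16); // prints 4000

        // Revised threshold: 8000-byte JIT budget at ~40 bytes per statement (inferred).
        System.out.println(jitCutoff / 40);    // prints 200
    }
}
```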
We could use `ctx.addFunction` to simplify these.
Test build #54606 has finished for PR 11984 at commit
Test build #54607 has finished for PR 11984 at commit
Test build #54654 has finished for PR 11984 at commit
Test build #54659 has finished for PR 11984 at commit
LGTM, merging this into master, and 1.6 (if no conflict).
[SPARK-14138][SQL] Fix generated SpecificColumnarIterator code can exceed JVM size limit for cached DataFrames

## What changes were proposed in this pull request?

This PR reduces the Java bytecode size of methods in `SpecificColumnarIterator` by using two approaches:

1. Generate and call `getTYPEColumnAccessor()` for each type that is actually used, for instantiating accessors
2. Group a large number of method calls (more than 4000) into a method

## How was this patch tested?

Added a new unit test to `InMemoryColumnarQuerySuite`

Here is the generated code:

```java
/* 033 */ private org.apache.spark.sql.execution.columnar.CachedBatch batch = null;
/* 034 */
/* 035 */ private org.apache.spark.sql.execution.columnar.IntColumnAccessor accessor;
/* 036 */ private org.apache.spark.sql.execution.columnar.IntColumnAccessor accessor1;
/* 037 */
/* 038 */ public SpecificColumnarIterator() {
/* 039 */   this.nativeOrder = ByteOrder.nativeOrder();
/* 040 */   this.mutableRow = new MutableUnsafeRow(rowWriter);
/* 041 */ }
/* 042 */
/* 043 */ public void initialize(Iterator input, DataType[] columnTypes, int[] columnIndexes,
/* 044 */     boolean columnNullables[]) {
/* 045 */   this.input = input;
/* 046 */   this.columnTypes = columnTypes;
/* 047 */   this.columnIndexes = columnIndexes;
/* 048 */ }
/* 049 */
/* 050 */
/* 051 */ private org.apache.spark.sql.execution.columnar.IntColumnAccessor getIntColumnAccessor(int idx) {
/* 052 */   byte[] buffer = batch.buffers()[columnIndexes[idx]];
/* 053 */   return new org.apache.spark.sql.execution.columnar.IntColumnAccessor(ByteBuffer.wrap(buffer).order(nativeOrder));
/* 054 */ }
/* 055 */
/* 056 */
/* 057 */
/* 058 */
/* 059 */
/* 060 */
/* 061 */ public boolean hasNext() {
/* 062 */   if (currentRow < numRowsInBatch) {
/* 063 */     return true;
/* 064 */   }
/* 065 */   if (!input.hasNext()) {
/* 066 */     return false;
/* 067 */   }
/* 068 */
/* 069 */   batch = (org.apache.spark.sql.execution.columnar.CachedBatch) input.next();
/* 070 */   currentRow = 0;
/* 071 */   numRowsInBatch = batch.numRows();
/* 072 */   accessor = getIntColumnAccessor(0);
/* 073 */   accessor1 = getIntColumnAccessor(1);
/* 074 */
/* 075 */   return hasNext();
/* 076 */ }
/* 077 */
/* 078 */ public InternalRow next() {
/* 079 */   currentRow += 1;
/* 080 */   bufferHolder.reset();
/* 081 */   rowWriter.zeroOutNullBytes();
/* 082 */   accessor.extractTo(mutableRow, 0);
/* 083 */   accessor1.extractTo(mutableRow, 1);
/* 084 */   unsafeRow.setTotalSize(bufferHolder.totalSize());
/* 085 */   return unsafeRow;
/* 086 */ }
```

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #11984 from kiszk/SPARK-14138.
@kiszk Could you create another patch for 1.6?
@davies, thank you for merging it. Do you want me to create another patch for master, since you have merged this one into 1.6?
@kiszk I just realized that this PR is against the 1.6 branch; could you always create the PR for master first?
@kiszk Please send a PR for master, and close this PR, thanks!
@davies, sorry for the mistake. Next time I will create a PR for master first.
(cherry picked from commit f12f11e)