PARQUET-580: Switch int[] initialization in IntList to be lazy #339
Conversation
- public IntList() {
-   initSlab();
- }
+ public IntList() {}
We don't need this constructor if it doesn't do anything.
Thanks for looking into this! This looks good other than one minor issue.
@rdblue thanks for taking a look. I'll get rid of the ctor. What do you think about the other idea? A potential improvement would be to allocate these int[]s in IntList in a way that slowly ramps up their size. Rather than creating 64K arrays every time (which is potentially wasteful if a column only holds a few hundred values), we could start with, say, a 4K int[], then when it fills up allocate an 8K one, and so on until we reach 64K (at which point the behavior matches the current implementation).
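A rough sketch of that ramp-up scheme (hypothetical code for illustration only, not the actual parquet-mr IntList; the class name `LazyIntList` and its methods are made up here):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: slabs are allocated lazily on first add(), starting at 4K ints
// and doubling on each new slab until the 64K cap is reached.
class LazyIntList {
  static final int INITIAL_SLAB_SIZE = 4 * 1024;
  static final int MAX_SLAB_SIZE = 64 * 1024;

  private final List<int[]> slabs = new ArrayList<>();
  private int[] currentSlab;        // null until the first add() -- lazy init
  private int currentSlabSize = 0;  // size of the most recently allocated slab
  private int currentSlabPos = 0;   // next free index in currentSlab

  public void add(int value) {
    if (currentSlab == null || currentSlabPos == currentSlabSize) {
      allocateSlab();
    }
    currentSlab[currentSlabPos++] = value;
  }

  private void allocateSlab() {
    if (currentSlabSize == 0) {
      currentSlabSize = INITIAL_SLAB_SIZE;   // first slab: 4K ints
    } else if (currentSlabSize < MAX_SLAB_SIZE) {
      currentSlabSize *= 2;                  // 4K -> 8K -> 16K -> 32K -> 64K
    }                                        // at 64K, stay at 64K
    currentSlab = new int[currentSlabSize];
    slabs.add(currentSlab);
    currentSlabPos = 0;
  }

  public int size() {
    // Every slab before the current one was filled completely before we
    // allocated the next, so only the current slab is partially used.
    int total = 0;
    for (int[] slab : slabs) {
      total += (slab == currentSlab) ? currentSlabPos : slab.length;
    }
    return total;
  }

  public int currentSlabSize() { return currentSlabSize; }
}
```

With this shape, an IntList that never sees a value allocates nothing, and one that sees a few hundred values only ever pays for a single 4K slab.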
…small when data sets are small

One of the follow-up items from PR #339 was to slowly ramp up the size of the int[] created in IntList so we don't allocate 64K arrays right off the bat. This PR updates the code to start with a 4K array and keep doubling until 64K (staying at 64K after that).

Author: Piyush Narang <pnarang@twitter.com>

Closes #341 from piyushnarang/master and squashes the following commits:

0bc6b84 [Piyush Narang] Fix review comments - add spaces, check slab size, fix slab init
d1b4df1 [Piyush Narang] Make IntListTest values relative to constants in IntList
9617015 [Piyush Narang] Update IntList slab creation to keep bumping up size gradually
ebf1c58 [Piyush Narang] Merge branch 'master' of https://github.com/apache/parquet-mr
3ecc577 [Piyush Narang] Remove redundant IntList ctor
f7dfd5f [Piyush Narang] Switch int[] initialization in IntList to be lazy
Noticed that for a dataset we were trying to import with a lot of columns (a few thousand) that weren't being used, we ended up allocating a lot of unnecessary int arrays (each 64K in size). The heap footprint of all those int[]s turned out to be around 2GB (and resulted in some jobs OOMing). This seems unnecessary for columns that might not be used. The changes in this PR switch over to initializing the int[] only when it is first used.

Also wondering if 64K is the right size to start off with. A potential improvement would be to allocate these int[]s in IntList in a way that slowly ramps up their size: rather than creating 64K arrays every time (which is potentially wasteful if a column only holds a few hundred values), we could create, say, a 4K int[], then when it fills up an 8K one, and so on until we reach 64K (at which point the behavior matches the current implementation). If this sounds like a reasonable idea, I can update this PR to do that as well. Wasn't sure if there was some historical context around that.

Author: Piyush Narang <pnarang@twitter.com>

Closes apache#339 from piyushnarang/master and squashes the following commits:

3ecc577 [Piyush Narang] Remove redundant IntList ctor
f7dfd5f [Piyush Narang] Switch int[] initialization in IntList to be lazy
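For a sense of scale, the back-of-the-envelope math behind the ~2GB figure (the 8,000-column count is an assumption consistent with "a few thousand" columns; only the 64K slab size and 4 bytes per int come from the code):

```java
// Rough footprint of eagerly allocating one 64K int[] per unused column.
class EagerFootprint {
  public static void main(String[] args) {
    long columns = 8_000;       // assumed: "a few thousand" unused columns
    long slabInts = 64 * 1024;  // one eager 64K-int slab per column
    long bytesPerInt = 4;
    long totalBytes = columns * slabInts * bytesPerInt;
    System.out.println(totalBytes);                           // 2097152000
    System.out.println(totalBytes / (1024.0 * 1024 * 1024));  // ~1.95 GiB
  }
}
```

So a few thousand idle columns are enough to burn roughly 2GB of heap before a single value is written, which is what the lazy initialization avoids.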