
PARQUET-580: Switch int[] initialization in IntList to be lazy #339

Closed
wants to merge 2 commits

Conversation


@piyushnarang commented Apr 15, 2016

Noticed that for a dataset we were trying to import with a lot of columns (a few thousand) that weren't being used, we ended up allocating a lot of unnecessary int arrays (each 64K in size). The heap footprint of all those int[]s turned out to be around 2GB, and caused some jobs to OOM. This seems unnecessary for columns that might not be used. The changes in this PR initialize the int[] only when it is used for the first time.
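The lazy-initialization idea above can be sketched as follows; the class and member names here are illustrative, not the actual parquet-mr IntList code:

```java
// Minimal sketch of lazy slab allocation (illustrative, not the real IntList).
class LazyIntList {
    private static final int SLAB_SIZE = 64 * 1024; // 64K ints per slab

    private int[] currentSlab;      // stays null until the first add()
    private int currentSlabPos = 0;

    // The constructor allocates nothing, so unused columns cost no heap.
    LazyIntList() {}

    void add(int value) {
        if (currentSlab == null) {
            currentSlab = new int[SLAB_SIZE]; // allocate on first use only
        }
        // (handling of a full slab is omitted for brevity)
        currentSlab[currentSlabPos++] = value;
    }

    boolean isAllocated() {
        return currentSlab != null;
    }
}
```

With a few thousand unused columns, this turns a few thousand eager 256KB allocations (64K ints at 4 bytes each) into none.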

Also wondering if 64K is the right size to start with. A potential improvement would be to allocate these int[]s in IntList in a way that slowly ramps up their size: rather than creating 64K arrays every time (potentially wasteful if there are only a few hundred entries), we could create, say, a 4K int[], then an 8K one when it fills up, and so on until we reach 64K (at which point the behavior matches the current implementation). If this sounds like a reasonable idea, I can update this PR to do that as well. Wasn't sure if there was some historical context around the current size.

@piyushnarang piyushnarang changed the title Switch int[] initialization in IntList to be lazy PARQUET-580: Switch int[] initialization in IntList to be lazy Apr 15, 2016
Inline review comment on the constructor change from:

    public IntList() {
      initSlab();
    }

to:

    public IntList() {}

We don't need this constructor if it doesn't do anything.


rdblue commented Apr 15, 2016

Thanks for looking into this! This looks good other than one minor issue.

@piyushnarang (Author)

@rdblue thanks for taking a look. I'll get rid of the ctor.

What do you think about the other idea? A potential improvement would be to allocate these int[]s in IntList in a way that slowly ramps up their size: rather than creating 64K arrays every time (potentially wasteful if there are only a few hundred entries), we could create, say, a 4K int[], then an 8K one when it fills up, and so on until we reach 64K (at which point the behavior matches the current implementation).

@asfgit closed this in ac62c1c Apr 17, 2016
asfgit pushed a commit that referenced this pull request Apr 21, 2016
…small when data sets are small

One of the follow-up items from PR #339 was to slowly ramp up the size of the int[]s created in IntList so we don't allocate 64K arrays right off the bat. This PR updates the code to start with a 4K array and keep doubling up to 64K (staying at 64K after that).

Author: Piyush Narang <pnarang@twitter.com>

Closes #341 from piyushnarang/master and squashes the following commits:

0bc6b84 [Piyush Narang] Fix review comments - add spaces, check slab size, fix slab init
d1b4df1 [Piyush Narang] Make IntListTest values relative to constants in IntList
9617015 [Piyush Narang] Update IntList slab creation to keep bumping up size gradually
ebf1c58 [Piyush Narang] Merge branch 'master' of https://github.com/apache/parquet-mr
3ecc577 [Piyush Narang] Remove redundant IntList ctor
f7dfd5f [Piyush Narang] Switch int[] initialization in IntList to be lazy
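The ramp-up scheme in the commit above can be sketched as follows: slabs start at 4K ints and double on each new allocation until they reach 64K, then stay there. The sizes come from the PR description; the class and method names are illustrative, not the actual parquet-mr code:

```java
// Illustrative sketch of geometric slab growth: 4K -> 8K -> 16K -> 32K -> 64K.
class GrowingSlabAllocator {
    static final int INITIAL_SLAB_SIZE = 4 * 1024;  // first slab: 4K ints
    static final int MAX_SLAB_SIZE = 64 * 1024;     // cap: 64K ints

    private int nextSlabSize = INITIAL_SLAB_SIZE;

    int[] newSlab() {
        int[] slab = new int[nextSlabSize];
        if (nextSlabSize < MAX_SLAB_SIZE) {
            nextSlabSize *= 2; // double until the cap is reached
        }
        return slab;
    }
}
```

The geometric doubling keeps the total number of slab allocations logarithmic in the data size, while a small column that only ever writes a few thousand values never pays for a full 64K slab.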