
PARQUET-2429: Reduce direct input buffer churn #1270

Merged: 5 commits into apache:master, Apr 23, 2024

Conversation

@gianm (Contributor) commented Feb 8, 2024

Addresses https://issues.apache.org/jira/browse/PARQUET-2429.

Currently input buffers are grown one chunk at a time as the compressor or decompressor receives successive setInput calls. When decompressing a 64MB block using a 4KB chunk size, this leads to thousands of allocations and deallocations totaling GBs of memory. By growing the buffer 2x each time, we avoid this and instead use a modest number of allocations.
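A minimal sketch of the growth policy being described (hypothetical names; not the actual NonBlockedCompressor/NonBlockedDecompressor code): when the direct input buffer runs out of room, its capacity is at least doubled rather than extended by a single chunk, so N small setInput calls cause O(log N) reallocations instead of O(N).

import java.nio.ByteBuffer;

// Hypothetical illustration only; names do not come from the Parquet codebase.
final class InputBufferGrowthSketch {
  // Returns a buffer with room for `len` more bytes, doubling capacity when needed.
  static ByteBuffer ensureRoomFor(ByteBuffer inputBuffer, int len) {
    if (inputBuffer.remaining() >= len) {
      return inputBuffer; // enough space already, no reallocation
    }
    // Grow to at least double the current capacity (and at least enough to fit `len`),
    // so repeated small inputs trigger only a logarithmic number of reallocations.
    int newCapacity = Math.max(inputBuffer.position() + len, inputBuffer.capacity() * 2);
    ByteBuffer larger = ByteBuffer.allocateDirect(newCapacity);
    inputBuffer.flip();      // switch the existing buffer to read mode
    larger.put(inputBuffer); // copy over the bytes buffered so far
    return larger;
  }
}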

@gianm gianm changed the title Grow input buffers by doubling in NonBlockedCompressor and NonBlockedDecompressor. PARQUET-2429: Reduce direct input buffer churn Feb 8, 2024
@wgtmac (Member) commented Feb 18, 2024

> When decompressing a 64MB block using a 4KB chunk size, this leads to thousands of allocations and deallocations totaling GBs of memory.

Is that a real use case? Usually we don't expect a page to be as large as this.

@gianm (Contributor, Author) commented Mar 1, 2024

> > When decompressing a 64MB block using a 4KB chunk size, this leads to thousands of allocations and deallocations totaling GBs of memory.
>
> Is that a real use case? Usually we don't expect a page to be as large as this.

I did encounter this in the real world, on some Snappy-compressed Parquet files that were written by Spark. I don't have access to the Spark cluster or job info, though, so unfortunately I don't have more details than that.

@wgtmac (Member) commented Mar 3, 2024

Could you please make the CI happy?

cc @gszadovszky @shangxinli

@gszadovszky (Contributor) commented:

@gianm, I agree with @wgtmac's concern about the expected size. For compression/decompression we are targeting the page size. The page size is limited by two configs, parquet.page.size and parquet.page.row.count.limit. (See details here.) One may configure both to higher values but it does not really make sense to have 64M pages.
I would not use a hadoop config for the default size of compression buffers. Hadoop typically compresses whole files. Probably the default page size would be a better choice here.
I like the idea of keeping the last size in the codec so that next time you don't need the multiple re-allocations. The catch here might be the case of writing Parquet files with different page size configurations, so we might allocate more than actually required. But I don't think this would be a real-life scenario.
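A minimal sketch of what the last point (remembering the last size in the codec) could look like; the class and field names are hypothetical, not from the actual codecs:

import java.nio.ByteBuffer;

// Hypothetical sketch: remember the capacity the input buffer reached last time, so the
// next (de)compression can allocate that size up front and skip the repeated re-growing.
final class LastSizeMemoSketch {
  private int lastInputBufferSize = 4 * 1024; // assumed 4 KB starting point

  ByteBuffer newInputBuffer() {
    // start from where the previous use ended up
    return ByteBuffer.allocateDirect(lastInputBufferSize);
  }

  void remember(ByteBuffer inputBuffer) {
    // called when a buffer is released; keeps the largest size seen so far
    lastInputBufferSize = Math.max(lastInputBufferSize, inputBuffer.capacity());
  }
}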

@gianm (Contributor, Author) commented Mar 5, 2024

> I agree with @wgtmac's concern about the expected size. For compression/decompression we are targeting the page size. The page size is limited by two configs, parquet.page.size and parquet.page.row.count.limit. (See details here.) One may configure both to higher values but it does not really make sense to have 64M pages.

I did encounter these in the real world, although it's always possible that they were built with some abnormally large values for some reason.

> I would not use a hadoop config for the default size of compression buffers. Hadoop typically compresses whole files. Probably the default page size would be a better choice here.

I'm ok with doing whichever. FWIW, the setting io.file.buffer.size I used in the most recent patch (which was recommended here: #1270 (comment)) defaults to 4096 bytes. I am not really a Parquet expert so I am willing to use whatever y'all recommend. Is there another property that would be better?

@gszadovszky (Contributor) commented:

@gianm,
I wasn't questioning whether you faced such large pages in the real world, but whether they make sense. Anyway, our code needs to handle these cases for sure. What I want to avoid is occupying much larger buffers than actually needed.

Page size is managed by ParquetProperties.getPageSizeThreshold(), the default value is ParquetProperties.DEFAULT_PAGE_SIZE.

@gianm (Contributor, Author) commented Mar 27, 2024

@gszadovszky I'm trying to switch the codecs to use ParquetProperties#getPageSizeThreshold() as the initial buffer size but am running into some issues with seeing how to structure that. It looks like the various codecs (SnappyCodec, Lz4RawCodec) are stashed in a static final map called CODEC_BY_NAME in CodecFactory. Before they are stashed in the map, they are configured by a Hadoop Configuration object. Presumably that needs to be consistent across the entire classloader, since the configured codecs are getting stashed in a static final map.

I don't see a way to get the relevant ParquetProperties at the time the codecs are created. (I'm also not sure if it even really makes sense; is ParquetProperties something that is consistent across the entire classloader like a Hadoop Configuration would be?)

Any suggestions are welcome. I could also go back to the approach where the initial buffer size isn't configurable, and hard-code it at 4KB or 1MB or what seems most reasonable. With the doubling-every-allocation approach introduced in this patch, it isn't going to be the end of the world if the initial size is too small.

@gszadovszky (Contributor) commented:

> @gszadovszky I'm trying to switch the codecs to use ParquetProperties#getPageSizeThreshold() as the initial buffer size but am running into some issues with seeing how to structure that. It looks like the various codecs (SnappyCodec, Lz4RawCodec) are stashed in a static final map called CODEC_BY_NAME in CodecFactory. Before they are stashed in the map, they are configured by a Hadoop Configuration object. Presumably that needs to be consistent across the entire classloader, since the configured codecs are getting stashed in a static final map.
>
> I don't see a way to get the relevant ParquetProperties at the time the codecs are created. (I'm also not sure if it even really makes sense; is ParquetProperties something that is consistent across the entire classloader like a Hadoop Configuration would be?)
>
> Any suggestions are welcome. I could also go back to the approach where the initial buffer size isn't configurable, and hard-code it at 4KB or 1MB or what seems most reasonable. With the doubling-every-allocation approach introduced in this patch, it isn't going to be the end of the world if the initial size is too small.

In this case I wouldn't spend too much time on actually passing the configured value, and as you said, it might not even be possible because of the caching.
I think you are right to start with a small size and reach the target quickly.

@gianm (Contributor, Author) commented Mar 28, 2024

> In this case I wouldn't spend too much time on actually passing the configured value, and as you said, it might not even be possible because of the caching.
> I think you are right to start with a small size and reach the target quickly.

OK, thanks for the feedback. I have pushed up a change to start with the max of 4KB and the initial chunk passed to setInput, then reach the target through doubling.

if (inputBuffer.capacity() == 0) {
  // first input: start at 4 KB, or the incoming chunk size if larger
  newBufferSize = Math.max(INITIAL_INPUT_BUFFER_SIZE, len);
} else {
  // grow geometrically, but always enough to hold the buffered data plus the new chunk
  newBufferSize = Math.max(inputBuffer.position() + len, inputBuffer.capacity() * 2);
}
@wgtmac (Member) commented on the diff:

Should we set an upper bound to it instead of blindly doubling the capacity? In the new code, we may see much larger peak memory compared to the past.

@gianm (Contributor, Author) commented Mar 29, 2024:

Some analysis:

With doubling, peak memory usage could be up to about double the amount of memory actually required.

If the target size is 64MB (the abnormally large size that I encountered in the wild), starting at 4KB and doubling gets us there in 14 iterations, allocating and deallocating 134MB of total memory.

We could set an upper bound for each allocation at 1MB, so peak memory usage would be at most 1MB more than the amount of really required memory. If we start at 4KB and double up to 1MB, then go in 1MB increments, we get there in 71 iterations, allocating and deallocating 2GB of total memory.

We could also use * 1.2 instead of * 2, which would make peak memory usage at most 20% higher than the amount actually required. Starting at 4KB and increasing by 20% each allocation gets us there in 53 iterations, allocating and deallocating 380MB of total memory.

Perhaps 20% growth is a good balance, since it still gets us to target pretty quickly compared to using a 1MB cap, and peak memory usage is at most 20% higher than what is really needed. Please let me know what you think.
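A rough standalone sketch that approximately reproduces the iteration counts above (class and method names are made up for illustration; the memory totals it prints are approximate):

// Back-of-the-envelope simulation of the three growth strategies discussed above.
public final class GrowthCostSketch {
  // Grows a buffer from `start` to at least `target` bytes, multiplying the size by
  // `factor` each step but never adding more than `maxStep` bytes at once.
  static void simulate(String label, long start, long target, double factor, long maxStep) {
    long size = start;
    long totalAllocated = start; // every grow allocates a whole new buffer of the new size
    int iterations = 0;
    while (size < target) {
      long step = Math.min((long) Math.ceil(size * (factor - 1.0)), maxStep);
      size += step;
      totalAllocated += size;
      iterations++;
    }
    System.out.printf("%-14s %2d iterations, ~%d MB allocated in total%n",
        label, iterations, totalAllocated / 1_000_000);
  }

  public static void main(String[] args) {
    long target = 64L << 20; // 64 MB target, starting from a 4 KB buffer
    simulate("x2", 4096, target, 2.0, Long.MAX_VALUE);
    simulate("x2, 1 MB cap", 4096, target, 2.0, 1L << 20);
    simulate("x1.2", 4096, target, 1.2, Long.MAX_VALUE);
  }
}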

@wgtmac (Member) commented:

Thanks for the thorough analysis! * 1.2 sounds reasonable to me. Do you have numbers for the time spent with the different strides?

@gianm (Contributor, Author) commented:

I haven't measured it, but probably the time spent is a function of the number of iterations and the total amount of memory allocated and deallocated. Compared to what I was seeing without any minimum-increase factor at all, * 1.2 and * 2 are both really big improvements.

I just changed the patch to do * 1.2.
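A sketch of how the sizing rule from the hunk above might look with a 1.2 growth factor (hypothetical helper; the exact expression in the merged change may differ):

import java.nio.ByteBuffer;

// Sketch only, not the merged code: the same sizing rule as above with a 1.2 factor.
final class Grow12Sketch {
  static final int INITIAL_INPUT_BUFFER_SIZE = 4 * 1024; // assumed 4 KB floor

  static int newBufferSize(ByteBuffer inputBuffer, int len) {
    if (inputBuffer.capacity() == 0) {
      return Math.max(INITIAL_INPUT_BUFFER_SIZE, len);
    }
    // grow by at least 20%, but always enough to hold the buffered data plus the new chunk
    return Math.max(inputBuffer.position() + len, (int) (inputBuffer.capacity() * 1.2));
  }
}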

@wgtmac (Member) left a review comment:

Thanks! LGTM

@gianm (Contributor, Author) commented Apr 17, 2024

Hi, would it be possible to commit this, now that it's approved?

@wgtmac merged commit a89083c into apache:master on Apr 23, 2024 (9 checks passed)
@wgtmac (Member) commented Apr 23, 2024

Sorry for the delay. I just merged this. Thanks!

@gianm gianm deleted the buf-sizing-doubling branch April 24, 2024 04:18
@gianm (Contributor, Author) commented Apr 24, 2024

thank you!
