
PARQUET-2429: Reduce direct input buffer churn #1270

Merged: 5 commits into apache:master, Apr 23, 2024

Conversation

@gianm (Contributor) commented Feb 8, 2024

Addresses https://issues.apache.org/jira/browse/PARQUET-2429.

Currently input buffers are grown one chunk at a time as the compressor or decompressor receives successive setInput calls. When decompressing a 64MB block using a 4KB chunk size, this leads to thousands of allocations and deallocations totaling GBs of memory. By growing the buffer 2x each time, we avoid this and instead use a modest number of allocations.
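A minimal sketch of the growth policy being described (hypothetical names; not the actual NonBlockedCompressor/NonBlockedDecompressor code): when the direct input buffer runs out of room, its capacity is at least doubled rather than extended by a single chunk, so N small setInput calls cause O(log N) reallocations instead of O(N).

import java.nio.ByteBuffer;

// Hypothetical illustration only; names do not come from the Parquet codebase.
final class InputBufferGrowthSketch {
  // Returns a buffer with room for `len` more bytes, doubling capacity when needed.
  static ByteBuffer ensureRoomFor(ByteBuffer inputBuffer, int len) {
    if (inputBuffer.remaining() >= len) {
      return inputBuffer; // enough space already, no reallocation
    }
    // Grow to at least double the current capacity (and at least enough to fit `len`),
    // so repeated small inputs trigger only a logarithmic number of reallocations.
    int newCapacity = Math.max(inputBuffer.position() + len, inputBuffer.capacity() * 2);
    ByteBuffer larger = ByteBuffer.allocateDirect(newCapacity);
    inputBuffer.flip();      // switch the existing buffer to read mode
    larger.put(inputBuffer); // copy over the bytes buffered so far
    return larger;
  }
}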

@gianm gianm changed the title Grow input buffers by doubling in NonBlockedCompressor and NonBlockedDecompressor. PARQUET-2429: Reduce direct input buffer churn Feb 8, 2024
@wgtmac (Member) commented Feb 18, 2024

> When decompressing a 64MB block using a 4KB chunk size, this leads to thousands of allocations and deallocations totaling GBs of memory.

Is that a real use case? Usually we don't expect a page to be as large as this.

@gianm (Contributor, Author) commented Mar 1, 2024

> > When decompressing a 64MB block using a 4KB chunk size, this leads to thousands of allocations and deallocations totaling GBs of memory.
>
> Is that a real use case? Usually we don't expect a page to be as large as this.

I did encounter this in the real world, on some Snappy-compressed Parquet files that were written by Spark. I don't have access to the Spark cluster or job info, though, so unfortunately I don't have more details than that.

@wgtmac (Member) commented Mar 3, 2024

Could you please make the CI happy?

cc @gszadovszky @shangxinli

@gszadovszky (Contributor) commented:

@gianm, I agree with @wgtmac's concern about the expected size. For compression/decompression we are targeting the page size. The page size is limited by two configs, parquet.page.size and parquet.page.row.count.limit. (See details here.) One may configure both to higher values but it does not really make sense to have 64M pages.
I would not use a hadoop config for the default size of compression buffers. Hadoop typically compresses whole files. Probably the default page size would be a better choice here.
I like the idea of keeping the last size in the codec so that next time you don't need the multiple re-allocations. The catch here might be the case of writing Parquet files with different page size configurations, so we might allocate more than actually required. But I don't think this would be a real-life scenario.
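A minimal sketch of what the last point (remembering the last size in the codec) could look like; the class and field names are hypothetical, not from the actual codecs:

import java.nio.ByteBuffer;

// Hypothetical sketch: remember the capacity the input buffer reached last time, so the
// next (de)compression can allocate that size up front and skip the repeated re-growing.
final class LastSizeMemoSketch {
  private int lastInputBufferSize = 4 * 1024; // assumed 4 KB starting point

  ByteBuffer newInputBuffer() {
    // start from where the previous use ended up
    return ByteBuffer.allocateDirect(lastInputBufferSize);
  }

  void remember(ByteBuffer inputBuffer) {
    // called when a buffer is released; keeps the largest size seen so far
    lastInputBufferSize = Math.max(lastInputBufferSize, inputBuffer.capacity());
  }
}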

@gianm (Contributor, Author) commented Mar 5, 2024

> I agree with @wgtmac's concern about the expected size. For compression/decompression we are targeting the page size. The page size is limited by two configs, parquet.page.size and parquet.page.row.count.limit. (See details here.) One may configure both to higher values but it does not really make sense to have 64M pages.

I did encounter these in the real world, although it's always possible that they were built with some abnormally large values for some reason.

> I would not use a hadoop config for the default size of compression buffers. Hadoop typically compresses whole files. Probably the default page size would be a better choice here.

I'm ok with doing whichever. FWIW, the setting io.file.buffer.size I used in the most recent patch (which was recommended here: #1270 (comment)) defaults to 4096 bytes. I am not really a Parquet expert so I am willing to use whatever y'all recommend. Is there another property that would be better?

@gszadovszky (Contributor) commented:

@gianm,
I wasn't questioning whether you faced such large pages in the real world, but whether they make sense. Anyway, our code needs to handle these cases for sure. What I want to avoid is occupying much larger buffers than actually needed.

Page size is managed by ParquetProperties.getPageSizeThreshold(), the default value is ParquetProperties.DEFAULT_PAGE_SIZE.

@gianm (Contributor, Author) commented Mar 27, 2024

@gszadovszky I'm trying to switch the codecs to use ParquetProperties#getPageSizeThreshold() as the initial buffer size but am running into some issues with seeing how to structure that. It looks like the various codecs (SnappyCodec, Lz4RawCodec) are stashed in a static final map called CODEC_BY_NAME in CodecFactory. Before they are stashed in the map, they are configured by a Hadoop Configuration object. Presumably that needs to be consistent across the entire classloader, since the configured codecs are getting stashed in a static final map.

I don't see a way to get the relevant ParquetProperties at the time the codecs are created. (I'm also not sure if it even really makes sense; is ParquetProperties something that is consistent across the entire classloader like a Hadoop Configuration would be?)

Any suggestions are welcome. I could also go back to the approach where the initial buffer size isn't configurable, and hard-code it at 4KB or 1MB or what seems most reasonable. With the doubling-every-allocation approach introduced in this patch, it isn't going to be the end of the world if the initial size is too small.

@gszadovszky (Contributor) commented:

> @gszadovszky I'm trying to switch the codecs to use ParquetProperties#getPageSizeThreshold() as the initial buffer size but am running into some issues with seeing how to structure that. It looks like the various codecs (SnappyCodec, Lz4RawCodec) are stashed in a static final map called CODEC_BY_NAME in CodecFactory. Before they are stashed in the map, they are configured by a Hadoop Configuration object. Presumably that needs to be consistent across the entire classloader, since the configured codecs are getting stashed in a static final map.
>
> I don't see a way to get the relevant ParquetProperties at the time the codecs are created. (I'm also not sure if it even really makes sense; is ParquetProperties something that is consistent across the entire classloader like a Hadoop Configuration would be?)
>
> Any suggestions are welcome. I could also go back to the approach where the initial buffer size isn't configurable, and hard-code it at 4KB or 1MB or what seems most reasonable. With the doubling-every-allocation approach introduced in this patch, it isn't going to be the end of the world if the initial size is too small.

In this case I wouldn't spend too much time on actually passing the configured value, and as you said, it might not even be possible because of the caching.
I think you are right to start with a small size and reach the target quickly.

@gianm (Contributor, Author) commented Mar 28, 2024

> In this case I wouldn't spend too much time on actually passing the configured value, and as you said, it might not even be possible because of the caching.
> I think you are right to start with a small size and reach the target quickly.

OK, thanks for the feedback. I have pushed up a change to start with the max of 4KB and the initial chunk passed to setInput, then reach the target through doubling.

if (inputBuffer.capacity() == 0) {
  // first input: start at 4 KB, or the incoming chunk size if larger
  newBufferSize = Math.max(INITIAL_INPUT_BUFFER_SIZE, len);
} else {
  // grow geometrically, but always enough to hold the buffered data plus the new chunk
  newBufferSize = Math.max(inputBuffer.position() + len, inputBuffer.capacity() * 2);
}
@wgtmac (Member) commented on the diff:

Should we set an upper bound to it instead of blindly doubling the capacity? In the new code, we may see much larger peak memory compared to the past.

@gianm (Contributor, Author) commented Mar 29, 2024:

Some analysis:

With doubling, peak memory usage could be up to about double the amount of memory actually required.

If the target size is 64MB (the abnormally large size that I encountered in the wild), starting at 4KB and doubling gets us there in 14 iterations, allocating and deallocating 134MB of total memory.

We could set an upper bound for each allocation at 1MB, so peak memory usage would be at most 1MB more than the amount of really required memory. If we start at 4KB and double up to 1MB, then go in 1MB increments, we get there in 71 iterations, allocating and deallocating 2GB of total memory.

We could also use * 1.2 instead of * 2, which would make peak memory usage at most 20% higher than the amount actually required. Starting at 4KB and increasing by 20% each allocation gets us there in 53 iterations, allocating and deallocating 380MB of total memory.

Perhaps 20% growth is a good balance, since it still gets us to target pretty quickly compared to using a 1MB cap, and peak memory usage is at most 20% higher than what is really needed. Please let me know what you think.
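A rough standalone sketch that approximately reproduces the iteration counts above (class and method names are made up for illustration; the memory totals it prints are approximate):

// Back-of-the-envelope simulation of the three growth strategies discussed above.
public final class GrowthCostSketch {
  // Grows a buffer from `start` to at least `target` bytes, multiplying the size by
  // `factor` each step but never adding more than `maxStep` bytes at once.
  static void simulate(String label, long start, long target, double factor, long maxStep) {
    long size = start;
    long totalAllocated = start; // every grow allocates a whole new buffer of the new size
    int iterations = 0;
    while (size < target) {
      long step = Math.min((long) Math.ceil(size * (factor - 1.0)), maxStep);
      size += step;
      totalAllocated += size;
      iterations++;
    }
    System.out.printf("%-14s %2d iterations, ~%d MB allocated in total%n",
        label, iterations, totalAllocated / 1_000_000);
  }

  public static void main(String[] args) {
    long target = 64L << 20; // 64 MB target, starting from a 4 KB buffer
    simulate("x2", 4096, target, 2.0, Long.MAX_VALUE);
    simulate("x2, 1 MB cap", 4096, target, 2.0, 1L << 20);
    simulate("x1.2", 4096, target, 1.2, Long.MAX_VALUE);
  }
}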

@wgtmac (Member) commented:

Thanks for the thorough analysis! * 1.2 sounds reasonable to me. Do you have numbers for the time spent with the different strides?

@gianm (Contributor, Author) commented:

I haven't measured it, but probably the time spent is a function of the number of iterations and the total amount of memory allocated and deallocated. Compared to what I was seeing without any minimum-increase factor at all, * 1.2 and * 2 are both really big improvements.

I just changed the patch to do * 1.2.
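A sketch of how the sizing rule from the hunk above might look with a 1.2 growth factor (hypothetical helper; the exact expression in the merged change may differ):

import java.nio.ByteBuffer;

// Sketch only, not the merged code: the same sizing rule as above with a 1.2 factor.
final class Grow12Sketch {
  static final int INITIAL_INPUT_BUFFER_SIZE = 4 * 1024; // assumed 4 KB floor

  static int newBufferSize(ByteBuffer inputBuffer, int len) {
    if (inputBuffer.capacity() == 0) {
      return Math.max(INITIAL_INPUT_BUFFER_SIZE, len);
    }
    // grow by at least 20%, but always enough to hold the buffered data plus the new chunk
    return Math.max(inputBuffer.position() + len, (int) (inputBuffer.capacity() * 1.2));
  }
}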

@wgtmac (Member) left a review comment:

Thanks! LGTM

@gianm (Contributor, Author) commented Apr 17, 2024

Hi, would it be possible to commit this, now that it's approved?

@wgtmac merged commit a89083c into apache:master on Apr 23, 2024 (9 checks passed)
@wgtmac (Member) commented Apr 23, 2024

Sorry for the delay. I just merged this. Thanks!

@gianm gianm deleted the buf-sizing-doubling branch April 24, 2024 04:18
@gianm (Contributor, Author) commented Apr 24, 2024

thank you!
