
Huge memory taken for each field when exporting #1240

Open
LouisClt opened this issue Aug 30, 2022 · 14 comments
@LouisClt

Hello,
Using the arrow adapter, I noticed that the memory (RAM) footprint of the export (exporting an ORC file) is very large per field. For instance, exporting a table with 10000 fields can take up to 30 GB, even if there are only 10 records.
Even 100 fields can take 100 MB+.
The "issue" seems to come from here:

1 * 1024 * 1024,

When we create a writer with the "createWriter" (

orc/c++/src/Writer.cc

Lines 681 to 684 in 432a7aa

std::unique_ptr<Writer> createWriter(
const Type& type,
OutputStream* stream,
const WriterOptions& options) {
), a stream (compressor) is created for each field. Since we allocate a buffer of 1 * 1024 * 1024 bytes, we get a minimum of 1 MB of additional memory taken for each field.

Is there a reason the initial capacity of BufferedOutputStream is that high? I worked around the problem by lowering it to 1 KB (it didn't change performance much in my testing, but that may depend on the use case). Could a global (or static) variable be introduced to make this hard-coded parameter configurable?
Thanks

@dongjoon-hyun
Member

cc @wgtmac , @stiga-huang , @coderex2522

@coderex2522
Contributor

@LouisClt To support the zero-copy mechanism, class BufferedOutputStream holds an internal data buffer, and the default capacity of that buffer is 1 MB. This default capacity could be made configurable, but note that if the buffer capacity is set too small, the buffer may expand and trigger the memcpy function frequently.

@wgtmac
Member

wgtmac commented Sep 13, 2022

We may replace the DataBuffer with a new Buffer implementation with much smarter memory management that automatically grows and shrinks according to actual usage. This management can happen on a per-column basis.

@LouisClt
Author

Thanks everyone for your answers. I understand the possible performance issues linked to lowering the size of the buffer too much (though in my testing it was fine for my case).
The solution suggested by @wgtmac would work for me, and would be better than going through global variables, if it is feasible.

@wgtmac
Member

wgtmac commented Sep 27, 2022

I have created a JIRA to track the progress: https://issues.apache.org/jira/browse/ORC-1280

@coderex2522
Contributor

coderex2522 commented Sep 27, 2022

@dongjoon-hyun @wgtmac @LouisClt I will follow up on this issue (ORC-1280) and implement much smarter memory management.

@dongjoon-hyun
Member

Thank you, @coderex2522 .

dongjoon-hyun pushed a commit that referenced this issue Oct 26, 2022
…OutputStream

### What changes were proposed in this pull request?
This PR solves the huge memory taken by BufferedOutputStream and refactors the write-data logic in class CompressionBase.

### Why are the changes needed?
This patch uses BlockBuffer to replace the DataBuffer of class BufferedOutputStream in order to solve the [issue](#1240).

### How was this patch tested?
The UTs in TestBufferedOutputStream.cc and TestCompression.cc cover this patch. Added the TestBlockBuffer.write_to UT.

Closes #1275 from coderex2522/ORC-1280-PART2.

Authored-by: coderex2522 <rex6817@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@LouisClt
Author

Hello, it seems there were commits referencing this issue. Is this issue now fixed?

@wgtmac
Member

wgtmac commented Jan 24, 2023

> Hello, it seems there were commits referencing this issue. Is this issue now fixed?

@LouisClt Thanks for your follow-up.

We have implemented a block-based buffer called BlockBuffer (by @coderex2522) and used it to replace the output buffer in the CompressionStream. It can decrease the memory footprint to some extent.

IMO, the next step is to use it to replace the input buffer of the CompressionStream as well, which has a size of compressionBlockSize per stream.

@wgtmac
Member

wgtmac commented Feb 3, 2023

To be precise, the rawInputBuffer of every CompressionStream is fixed to the compression block size, which is 1 MB by default. A writer with many columns will suffer from a large memory footprint, and currently nothing can be done to alleviate it.

I have created a JIRA to track it: https://issues.apache.org/jira/browse/ORC-1365

cc @coderex2522

@LouisClt
Author

LouisClt commented Feb 6, 2023

Thanks for your reply @wgtmac, and for the implementation of the BlockBuffer.
I'll wait for the rawInputBuffer to be replaced by the BlockBuffer in every compression stream, then. Do you think it will take long?

@dongjoon-hyun
Member

Hi, @LouisClt. FYI, according to the Apache ORC release cycle, newly developed features will be delivered via v1.9.0 in September 2023 (if they are merged into Apache ORC before then).

@LouisClt
Author

LouisClt commented Feb 6, 2023

Understood, and thanks for your answer!

@luffy-zh
Contributor

luffy-zh commented Feb 7, 2023

I will work on it.

cxzl25 pushed a commit to cxzl25/orc that referenced this issue Jan 11, 2024