ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream #1394
c++/include/orc/Writer.hh
Outdated
@@ -262,6 +262,17 @@ namespace orc {
* @return if not set, the default is false
*/
bool getUseTightNumericVector() const;

/**
nit. Indentation?
This is detected by the CI format check.
Fixed.
}
}

TEST(WriterTest, setOutputBufferCapacity) {
Thank you for adding this.
c++/test/TestWriter.cc
Outdated
TEST(WriterTest, setOutputBufferCapacity) {
testSetOutputBufferCapacity(1024);
testSetOutputBufferCapacity(1024 * 1024);
testSetOutputBufferCapacity(1024 * 1024 * 1024);
If this is 1GB, I'd remove this from the unit test.
+1
The 1GB case has been deleted.
cc @wgtmac , @stiga-huang , @williamhyun , @guiyanakuang
Please check and fix the Windows failure.
unknown file: error: C++ exception with description "bad allocation" thrown in the test body.
[ FAILED ] WriterTest.setOutputBufferCapacity (44714 ms)
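To put the 1GB test case in numbers, here is a minimal sketch of the up-front memory that would be reserved if every buffered stream pre-allocated its full initial capacity. The stream count of 4 is a hypothetical value chosen for illustration, and eager allocation is an assumption rather than something this thread confirms.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kKiB = 1024;
constexpr uint64_t kMiB = 1024 * kKiB;
constexpr uint64_t kGiB = 1024 * kMiB;

// Bytes reserved up front if each of `streamCount` buffered streams
// pre-allocates `capacityPerStream` bytes. Both the eager allocation
// and the stream count are assumptions made for this sketch.
constexpr uint64_t upfrontBytes(uint64_t capacityPerStream, uint64_t streamCount) {
  return capacityPerStream * streamCount;
}

// The three capacities exercised by the test, with a hypothetical 4 streams.
static_assert(upfrontBytes(kKiB, 4) == 4 * kKiB, "1 KiB per stream");
static_assert(upfrontBytes(kMiB, 4) == 4 * kMiB, "1 MiB per stream");
static_assert(upfrontBytes(kGiB, 4) == 4 * kGiB, "1 GiB per stream");
```

Under these assumptions the 1GB case asks for several gigabytes before any row is written, which would be consistent with a "bad allocation" on a memory-constrained CI runner.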
c++/test/TestWriter.cc
Outdated
Type::buildTypeFromString("struct<col1:int,col2:int>"));
WriterOptions options;
options.setStripeSize(1024 * 1024)
.setCompressionBlockSize(1024)
Should we change setCompressionBlockSize accordingly? It seems that even though you have changed the output buffer size, the page is very small, so no resize will actually happen.
c++/include/orc/Writer.hh
Outdated
@@ -262,6 +262,19 @@ namespace orc {
* @return if not set, the default is false
*/
bool getUseTightNumericVector() const;

/**
* Set the initial buffer capacity of output stream.
- * Set the initial buffer capacity of output stream.
+ * Set the initial capacity to the output buffer of the compressed stream.
c++/include/orc/Writer.hh
Outdated
/**
* Set the initial buffer capacity of output stream.
* Each column contains multiple output streams with buffers,
- * Each column contains multiple output streams with buffers,
+ * Each column contains one or more compressed streams depending on its type,
CompressionStream is a subclass of BufferedOutputStream, which contains the buffer. BufferedOutputStream itself has no logic associated with the compression algorithm, so it may not be appropriate to describe it as a compressed stream.
Then it makes sense to me to use BufferedOutputStream explicitly. We have defined several output streams (e.g. one from OrcFile.hh). It should be clear to the user which one we are referring to.
c++/include/orc/Writer.hh
Outdated
/**
* Set the initial buffer capacity of output stream.
* Each column contains multiple output streams with buffers,
* and these buffers will automatically expand when memory is exhausted.
- * and these buffers will automatically expand when memory is exhausted.
+ * and these buffers will automatically expand when more memory is required.
Correct me if I am wrong here.
When the used size of the buffer reaches the initial capacity, the buffer will continue to expand in compression block size units.
In the CompressionStream, will the output buffer exceed the compression block size?
The outputBuffer (a char* variable) of the class CompressionStreamBase doesn't exceed the compression block size, but the dataBuffer (a BlockBuffer variable) of the class BufferedOutputStream can exceed the compression block size.
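The growth behaviour described in this thread can be sketched with a small stand-in class. `GrowingBlockBuffer` below is hypothetical and is not ORC's `BlockBuffer`; it only models the idea that the buffer starts at an initial capacity and then expands in block-size units once the used size reaches that capacity. Rounding the initial capacity up to whole blocks is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a block-based buffer; not ORC's BlockBuffer.
class GrowingBlockBuffer {
 public:
  GrowingBlockBuffer(uint64_t initialCapacity, uint64_t blockSize)
      : blockSize_(blockSize) {
    if (blockSize_ == 0) {
      throw std::invalid_argument("blockSize must be positive");
    }
    reserve(initialCapacity);
  }

  // Record `len` more used bytes, adding whole blocks only when needed.
  void append(uint64_t len) {
    if (size_ + len > capacity()) {
      reserve(size_ + len);
    }
    size_ += len;
  }

  uint64_t size() const { return size_; }
  uint64_t capacity() const { return blocks_.size() * blockSize_; }
  uint64_t blockCount() const { return blocks_.size(); }

 private:
  // Grow one block at a time until at least `wanted` bytes are available.
  void reserve(uint64_t wanted) {
    while (capacity() < wanted) {
      blocks_.push_back(std::make_unique<char[]>(blockSize_));
    }
  }

  uint64_t blockSize_;
  uint64_t size_ = 0;
  std::vector<std::unique_ptr<char[]>> blocks_;
};
```

In this model, a buffer created with an initial capacity of 1 MiB and a block size of 1 KiB performs no further allocation until more than 1 MiB has been written, and the next byte adds exactly one 1 KiB block.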
c++/include/orc/Writer.hh
Outdated
WriterOptions& setOutputBufferCapacity(uint64_t capacity);

/**
* Get the buffer capacity of output stream.
- * Get the buffer capacity of output stream.
+ * Get the initial capacity of output buffer in the class BufferedOutputStream.
Thanks for the explanation. My point is that we should explicitly state which buffer the new option applies to. Simply saying "output buffer" and "output stream" is rather vague to me.
+1 for @wgtmac 's advice. BTW,
It seems to be caused by a googletest package download failure. It compiled successfully after a retry.
Thanks! LGTM
+1, LGTM. Thank you, @coderex2522 and @wgtmac .
@coderex2522 I will leave it to you to merge it to test your permission.
Congratulations, @coderex2522 . :)
FYI, we provide a merge script, like the Apache Spark community does, which preserves the committer identities when you merge someone else's PR.
Hey Xin, I've made an official announcement on the ORC dev and user mailing lists welcoming you as our new committer. In addition, could you make a PR to update the ORC website to add yourself as a committer?
Thanks, I've subscribed to these mailing lists before. And I will submit a PR to update the ORC website as soon as possible.
…OutputStream (apache#1394) * ORC-1312: [C++] Support setting the capacity of output buffer in the class BufferedOutputStream
What changes were proposed in this pull request?
This PR provides the API to set the initial buffer capacity of BufferedOutputStream.
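As a rough illustration of how such a fluent option is typically used, the sketch below defines a stand-in `sketch::WriterOptions` rather than including the real `orc::WriterOptions` from Writer.hh. The setter signature follows the one shown in the review diff; the getter name `getOutputBufferCapacity` and both default values are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Stand-in for illustration only; the real class is orc::WriterOptions.
class WriterOptions {
 public:
  // Setters return *this so that calls can be chained, as in the test diff.
  WriterOptions& setCompressionBlockSize(uint64_t size) {
    compressionBlockSize_ = size;
    return *this;
  }

  WriterOptions& setOutputBufferCapacity(uint64_t capacity) {
    outputBufferCapacity_ = capacity;
    return *this;
  }

  uint64_t getCompressionBlockSize() const { return compressionBlockSize_; }

  // Getter name assumed for this sketch.
  uint64_t getOutputBufferCapacity() const { return outputBufferCapacity_; }

 private:
  // Placeholder defaults, not ORC's actual default values.
  uint64_t compressionBlockSize_ = 64 * 1024;
  uint64_t outputBufferCapacity_ = 1024 * 1024;
};

}  // namespace sketch
```

With this stand-in, `options.setCompressionBlockSize(1024).setOutputBufferCapacity(1024 * 1024)` configures both values in one chained expression, mirroring the style used in TestWriter.cc.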
Why are the changes needed?
It's convenient for users to adjust the buffer size of the BufferedOutputStream according to the written data.
How was this patch tested?
Added WriterTest.setOutputBufferCapacity to test different buffer capacities of BufferedOutputStream.
In addition, I did a simple test similar to this issue