ARROW-8823: [C++] Add total size of batch buffers to IPC write statistics #11872
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
Hi @ahmet-uyar , can you ensure the PR title is properly formatted? See the automated comment above for guidelines. Thank you!
Thanks for the submission @ahmet-uyar ! Here are some comments below. Feel free to ask questions if not everything is clear.
cpp/src/arrow/ipc/writer.h
Outdated
/// compression ratio for the body of all record batches serialized
/// this is equivalent to:
/// serialized_body_length / raw_body_length
Since this is equivalent, I don't think there is any value in exposing it. People can trivially calculate it themselves.
removed compression ratio from WriteStats.
cpp/src/arrow/ipc/writer.h
Outdated
/// initial and serialized (may be compressed) body lengths for record batches
/// these values show the total sizes of all record batch body lengths
int64_t raw_body_length = 0;
int64_t serialized_body_length = 0;
In the Arrow C++ APIs, "length" generally refers to the logical number of elements (see e.g. Array::length()), while "size" refers to the physical size in bytes (as in Buffer::size()). So I think this should be:
int64_t total_raw_body_size = 0;
int64_t total_compressed_body_size = 0;
Sometimes it is not compressed, only serialized. Should we name it as total_serialized_body_size?
Yes, why not.
cpp/src/arrow/ipc/read_write_test.cc
Outdated
ASSERT_OK_AND_ASSIGN(write_options2.codec, util::Codec::Create(Compression::LZ4_FRAME));

// pre-computed compression ratios for record batches with Compression::LZ4_FRAME
std::vector<float> comp_ratios{1.0f, 0.64f, 0.79924363f};
I don't think we want to hard-code this. The values can vary depending on the version of the lz4 library, or internal details of how we initialize the compressor. Just testing that some compression happens should be sufficient.
added raw record batch sizes instead.
cpp/src/arrow/ipc/read_write_test.cc
Outdated
// ARROW-8823: Calculating the compression ratio
FileWriterHelper helper;
IpcWriteOptions write_options1 = IpcWriteOptions::Defaults();
IpcWriteOptions write_options2 = IpcWriteOptions::Defaults();
Can you give these clearer names, e.g. options_uncompressed and options_compressed?
done.
cpp/src/arrow/ipc/writer.h
Outdated
@@ -75,6 +76,19 @@ struct WriteStats {
/// Number of replaced dictionaries (i.e. where a dictionary batch replaces
/// an existing dictionary with an unrelated new dictionary).
int64_t num_replaced_dictionaries = 0;

/// initial and serialized (may be compressed) body lengths for record batches
/// these values show the total sizes of all record batch body lengths
Will this also include the dictionary batches?
yes
cpp/src/arrow/ipc/read_write_test.cc
Outdated
batches[2] =
    RecordBatch::Make(schema, length, {rg.String(500, 0, 10, 0.1), dict_array});

for (size_t i = 0; i < batches.size(); ++i) {
As a suggestion for these tests, it would be nice to check that:
- the raw body size is accurate (it can be hard-coded, since it should be stable and deterministic)
- the compressed body size is equal to or smaller than the raw body size, depending on the compression parameter
Done as suggested.
But there is a slight difference. When a record batch is serialized, buffer sizes are padded to a multiple of 8. So when there is no compression, serialized record batch sizes can be slightly larger; in that case, raw sizes are less than or equal to the serialized size.
In addition, when compression is used and there is very little data (a few hundred bytes, say), the compressed size can actually be larger than the raw size. But I have not put this case into the test, so this is not a problem.
Ah, you're right, the padding can make the size slightly larger. Can you add a comment explaining this?
Done.
Looks good to me, just some minor nits now.
cpp/src/arrow/ipc/read_write_test.cc
Outdated
@@ -1727,6 +1727,61 @@ TEST(TestIpcFileFormat, FooterMetaData) {
  ASSERT_TRUE(out_metadata->Equals(*metadata));
}

TEST_F(TestWriteRecordBatch, CompressionRatio) {
  // ARROW-8823: Calculating the compression ratio
Nit: rename this test and update the comment now that we don't compute a ratio anymore?
Done.
ASSERT_OK(helper.WriteBatch(batches[i]));
ASSERT_OK(helper.Finish());
ASSERT_GE(helper.writer_->stats().total_raw_body_size,
          helper.writer_->stats().total_serialized_body_size);
Can this be ASSERT_GT instead, or will that fail the test?
Equality is needed, because one of the record batches has zero rows, so both total_raw_body_size and total_serialized_body_size are zero for it.
improving documentation Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Benchmark runs are scheduled for baseline = 9cf4275 and contender = 77722d9. 77722d9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Add statistics about both the original buffer sizes and the padded/compressed sizes.