
ARROW-8823: [C++] Add total size of batch buffers to IPC write statistics #11872

Closed · 15 commits

Conversation

@ahmet-uyar (Contributor) commented Dec 6, 2021

Adds statistics covering both the original (raw) buffer sizes and the padded or compressed serialized sizes.

@github-actions bot commented Dec 6, 2021

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the openness of the Apache Arrow project.

Could you then also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


@pitrou (Member) commented Dec 6, 2021

Hi @ahmet-uyar , can you ensure the PR title is properly formatted? See the automated comment above for guidelines. Thank you!

@pitrou (Member) left a review

Thanks for the submission @ahmet-uyar ! Here are some comments below. Feel free to ask questions if not everything is clear.


/// compression ratio for the body of all record batches serialized
/// this is equivalent to:
/// serialized_body_length / raw_body_length
@pitrou (Member):

Since this is equivalent, I don't think there is any value in exposing it. People can trivially calculate it themselves.

@ahmet-uyar (Contributor, Author):

Removed the compression ratio from WriteStats.

/// initial and serialized (may be compressed) body lengths for record batches
/// these values show the total sizes of all record batch body lengths
int64_t raw_body_length = 0;
int64_t serialized_body_length = 0;
@pitrou (Member):

In the Arrow C++ APIs, "length" generally points to the logical number of elements (see e.g. Array::length()), while "size" points to the physical size in bytes (as in Buffer::size()). So I think this should be:

int64_t total_raw_body_size = 0;
int64_t total_compressed_body_size = 0;

@ahmet-uyar (Contributor, Author):

Sometimes it is not compressed, only serialized. Should we name it total_serialized_body_size?

@pitrou (Member):

Yes, why not.

ASSERT_OK_AND_ASSIGN(write_options2.codec, util::Codec::Create(Compression::LZ4_FRAME));

// pre-computed compression ratios for record batches with Compression::LZ4_FRAME
std::vector<float> comp_ratios{1.0f, 0.64f, 0.79924363f};
@pitrou (Member):

I don't think we want to hard-code this. The values can vary depending on the version of the lz4 library, or internal details of how we initialize the compressor. Just testing that some compression happens should be sufficient.

@ahmet-uyar (Contributor, Author):

Added raw record batch sizes instead.

// ARROW-8823: Calculating the compression ratio
FileWriterHelper helper;
IpcWriteOptions write_options1 = IpcWriteOptions::Defaults();
IpcWriteOptions write_options2 = IpcWriteOptions::Defaults();
@pitrou (Member):

Can you give these clearer names, e.g. options_uncompressed and options_compressed?

@ahmet-uyar (Contributor, Author):

Done.

@@ -75,6 +76,19 @@ struct WriteStats {
/// Number of replaced dictionaries (i.e. where a dictionary batch replaces
/// an existing dictionary with an unrelated new dictionary).
int64_t num_replaced_dictionaries = 0;

/// initial and serialized (may be compressed) body lengths for record batches
/// these values show the total sizes of all record batch body lengths
@pitrou (Member):

Will this also include the dictionary batches?

@ahmet-uyar (Contributor, Author):

Yes.

batches[2] =
RecordBatch::Make(schema, length, {rg.String(500, 0, 10, 0.1), dict_array});

for (size_t i = 0; i < batches.size(); ++i) {
@pitrou (Member):

As a suggestion for these tests, it would be nice to check that:

  • the raw body size is accurate (it can be hard-coded, since it should be stable and deterministic)
  • the compressed body size is equal to or smaller than the raw body size, depending on the compression parameter

@ahmet-uyar (Contributor, Author):

Done as suggested.
But there is a slight difference: when a record batch is serialized, buffer sizes are padded to a multiple of 8, so when there is no compression the serialized record batch size can be slightly larger. In that case, the raw size is less than or equal to the serialized size.

In addition, when compression is used on very little data (a few hundred bytes), the compressed size can actually be larger than the raw size. I have not put this case into the test, so it is not a problem.

@pitrou (Member):

Ah, you're right, the padding can make the size slightly larger. Can you add a comment explaining this?

@ahmet-uyar (Contributor, Author):

Done.

@ahmet-uyar ahmet-uyar changed the title Arrow-8823 Calculating compression ratio ARROW-8823: [C++] Compute aggregate compression ratio when producing compressed IPC body messages Dec 6, 2021
@github-actions bot commented Dec 6, 2021

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@pitrou (Member) left a review

Looks good to me, just some minor nits now.

@@ -1727,6 +1727,61 @@ TEST(TestIpcFileFormat, FooterMetaData) {
ASSERT_TRUE(out_metadata->Equals(*metadata));
}

TEST_F(TestWriteRecordBatch, CompressionRatio) {
// ARROW-8823: Calculating the compression ratio
@pitrou (Member):

Nit: rename this test and update the comment now that we don't compute a ratio anymore?

@ahmet-uyar (Contributor, Author):

Done.

ASSERT_OK(helper.WriteBatch(batches[i]));
ASSERT_OK(helper.Finish());
ASSERT_GE(helper.writer_->stats().total_raw_body_size,
helper.writer_->stats().total_serialized_body_size);
@pitrou (Member):

Can this be ASSERT_GT instead or will it fail the test?

@ahmet-uyar (Contributor, Author):

Equality is needed because one of the record batches has zero rows, so both total_raw_body_size and total_serialized_body_size are zero.

Two further review threads on cpp/src/arrow/ipc/writer.h were marked outdated and resolved.
ahmet-uyar and others added 2 commits December 7, 2021 18:38
improving documentation

Co-authored-by: Antoine Pitrou <pitrou@free.fr>
@pitrou pitrou changed the title ARROW-8823: [C++] Compute aggregate compression ratio when producing compressed IPC body messages ARROW-8823: [C++] Add total written size of batch buffers to IPC write statistics Dec 7, 2021
@pitrou pitrou changed the title ARROW-8823: [C++] Add total written size of batch buffers to IPC write statistics ARROW-8823: [C++] Add total size of batch buffers to IPC write statistics Dec 7, 2021
@pitrou pitrou closed this in 77722d9 Dec 7, 2021
@ursabot commented Dec 7, 2021

Benchmark runs are scheduled for baseline = 9cf4275 and contender = 77722d9. 77722d9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Scheduled] ursa-i9-9960x
[Scheduled] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot commented Dec 8, 2021

Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️2.22% ⬆️0.74%] ursa-i9-9960x
[Finished ⬇️0.22% ⬆️0.18%] ursa-thinkcentre-m75q


4 participants