Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-1286: [C++] replace DataBuffer with BlockBuffer in class BufferedOutputStream #1275

Closed
wants to merge 9 commits into from

Conversation

coderex2522
Copy link
Contributor

@coderex2522 coderex2522 commented Oct 12, 2022

What changes were proposed in this pull request?

This PR can solve the huge memory taken by BufferedOutputStream and refactor the write data logic in class CompressionBase.

Why are the changes needed?

This patch use BlockBuffer to replace DataBuffer of class BufferedOutputStream in order to solve the issue.

How was this patch tested?

The UTs in TestBufferedOutputStream.cc and TestCompression.cc can cover this patch. Add the TestBlockBuffer.write_to UT.

@github-actions github-actions bot added the CPP label Oct 12, 2022
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderex2522 . Could you use a new JIRA instead of using PART 1 and PART II?
Apache community uses ORC JIRA ID for trace-ability.

@coderex2522 coderex2522 changed the title ORC-1280: [C++] replace DataBuffer with BlockBuffer in class BufferedOutputStream(PART II) ORC-1286: [C++] replace DataBuffer with BlockBuffer in class BufferedOutputStream Oct 12, 2022
@coderex2522
Copy link
Contributor Author

@coderex2522 . Could you use a new JIRA instead of using PART 1 and PART II? Apache community uses ORC JIRA ID for trace-ability.

@dongjoon-hyun Thank you for your suggestion. I create a new JIRA ORC-1286.

c++/src/Compression.cc Outdated Show resolved Hide resolved
c++/src/Compression.cc Outdated Show resolved Hide resolved
c++/src/Compression.cc Show resolved Hide resolved
c++/src/Compression.cc Outdated Show resolved Hide resolved
c++/src/io/OutputStream.cc Outdated Show resolved Hide resolved
c++/src/io/OutputStream.cc Outdated Show resolved Hide resolved
c++/src/io/OutputStream.cc Show resolved Hide resolved
@dongjoon-hyun
Copy link
Member

Thank you for updating the PR and creating JIRA, @coderex2522 .

c++/src/io/OutputStream.cc Show resolved Hide resolved
c++/src/io/OutputStream.cc Outdated Show resolved Hide resolved
if (mergeBlock.data + mergeBlock.size == curBlock.data) {
mergeBlock.size += curBlock.size;
} else {
outputStream->write(mergeBlock.data, mergeBlock.size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may also limit each write not to exceed outputStream->getNaturalWriteSize()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The writeTo() of BlockBuffer will limit each write not to exceed NaturalWriteSize of outputStream.

if (chunkSize == 0) {
throw std::logic_error("Natural write size cannot be zero");
}
char* chunk = memoryPool.malloc(chunkSize);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If blockNumber == 1, we should avoid this unnecessary allocation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if natural write size less than block size, here need to malloc chunk.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, however you can still get rid of the allocation when blockNumber == 1 && naturalWriteSize >= block.size. BTW, we can keep the current implementation for now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allocation can be avoided here when blockNumber == 1 && currentSize <= NaturalWriteSize.

SCOPED_STOPWATCH(metrics, IOBlockingLatencyUs, IOCount);
outputStream->write(dataBuffer->data(), dataSize);
SCOPED_STOPWATCH(metrics, IOBlockingLatencyUs, nullptr);
uint64_t IOCount = dataBuffer->writeTo(outputStream);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return IOCount is weird. Better to pass the pointer of metrics into the writeTo function as a parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if (chunkSize == 0) {
throw std::logic_error("Natural write size cannot be zero");
}
char* chunk = memoryPool.malloc(chunkSize);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, however you can still get rid of the allocation when blockNumber == 1 && naturalWriteSize >= block.size. BTW, we can keep the current implementation for now

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM +1

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM for Apache ORC 1.9.

@dongjoon-hyun dongjoon-hyun added this to the 1.9.0 milestone Oct 26, 2022
@dongjoon-hyun
Copy link
Member

Merged to main for Apache ORC 1.9.

cc @williamhyun , too

cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
…OutputStream

### What changes were proposed in this pull request?
This PR can solve the huge memory taken by BufferedOutputStream and refactor the write data logic in class CompressionBase.

### Why are the changes needed?
This patch use BlockBuffer to replace DataBuffer  of class BufferedOutputStream in order to solve the [issue](apache#1240).

### How was this patch tested?
The UTs in TestBufferedOutputStream.cc and TestCompression.cc can cover this patch. Add the TestBlockBuffer.write_to UT.

Closes apache#1275 from coderex2522/ORC-1280-PART2.

Authored-by: coderex2522 <rex6817@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants