
Add unified compression API and lz4_frame/lz4_raw/lz4_hadoop codec #7589

Open
wants to merge 5 commits into main

Conversation

marin-ma
Contributor

Initial implementation of the proposed unified compression API. This patch defines the compression Codec API, inspired by Apache Arrow, and adds missing functions used in Velox. It adds support for the LZ4_FRAME, LZ4_RAW, and LZ4_HADOOP codecs and includes unit tests.

Discussion: #7471

netlify bot commented Nov 15, 2023

Deploy Preview for meta-velox canceled.

Latest commit: 0e886a4
Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/662a23924b9cec0008a467bf

@facebook-github-bot added the CLA Signed label Nov 15, 2023
@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch 4 times, most recently from 70fd61a to 26dbdea Compare November 16, 2023 01:45
@marin-ma
Contributor Author

@mbasmanova Could you help to review? Thanks!

@FelixYBW
Contributor

velox/common/compression/v2/Compression.h (outdated review comments, resolved)
velox/common/compression/v2/HadoopCompressionFormat.cpp (outdated review comments, resolved)
velox/common/compression/v2/Lz4Compression.cpp (outdated review comments, resolved)

ret = LZ4F_createCompressionContext(&ctx_, LZ4F_VERSION);
if (LZ4F_isError(ret)) {
lz4Error(ret, "LZ4 init failed: ");
Collaborator

Any missing content for the error message?

Contributor Author

lz4Error will expand the error message for the return code. Switched the parameter order to make it clearer for the reader.
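For readers unfamiliar with the helper, here is a minimal sketch of what an lz4Error-style function does (an illustration under assumptions, not the PR's exact code; LZ4F_getErrorName is the lz4 library call that maps an error code to text):

#include <lz4frame.h>

#include <stdexcept>
#include <string>

// Expand the LZ4F error code into a readable message and raise it together
// with the caller-provided prefix.
void lz4Error(LZ4F_errorCode_t ret, const char* prefixMessage) {
  throw std::runtime_error(
      std::string(prefixMessage) + LZ4F_getErrorName(ret));
}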

velox/common/compression/v2/Lz4Compression.cpp (outdated review comments, resolved)
velox/common/compression/v2/tests/CompressionTest.cpp (outdated review comments, resolved)
@marin-ma
Contributor Author

@mbasmanova Could you help to review? Thanks!

@yaqi-zhao
Contributor

The asynchronous compression API is not in this PR, right?

@marin-ma
Contributor Author

The asynchronous compression API is not in this PR, right?

@yaqi-zhao Yes. As suggested in #7471 (comment), we shall add the synchronous API first.

@george-gu-2021

The asynchronous compression API is not in this PR, right?

@yaqi-zhao Yes. As suggested in #7471 (comment), we shall add the synchronous API first.

Hi @pedroerp, @mbasmanova, @marin-ma, may we prioritize the async-mode interface implementation at the same time? Some key partners are waiting for the optimization to do Velox PoC validation. Thanks!

@FelixYBW
Contributor

Hi @pedroerp, @mbasmanova, @marin-ma, may we prioritize the async-mode interface implementation at the same time? Some key partners are waiting for the optimization to do Velox PoC validation. Thanks!

I don't think so. Let's settle on a good design and then implement the features step by step, as we discussed. It's a bad idea to merge immature code first and refactor it later. It doesn't block the customer PoC if they want to use Yaqi's draft; currently the issue blocking the customer PoC is how to generate data that IAA can decompress, not the PRs.

@yaqi-zhao
Contributor

Hi @pedroerp, @mbasmanova, @marin-ma, may we prioritize the async-mode interface implementation at the same time? Some key partners are waiting for the optimization to do Velox PoC validation. Thanks!

I don't think so. Let's settle on a good design and then implement the features step by step, as we discussed. It's a bad idea to merge immature code first and refactor it later. It doesn't block the customer PoC if they want to use Yaqi's draft; currently the issue blocking the customer PoC is how to generate data that IAA can decompress, not the PRs.

@FelixYBW The blocking issue is not data generation; #7437 is merged and there is no blocker there.

@majetideepak
Collaborator

@marin-ma Some high-level comments.
Why create a new folder v2? Why not update the existing Compression.h/.cpp files?
Do we need all of the Arrow API? Where will we use APIs such as maximumCompressionLevel?
I feel it will be easier to start with a single codec, say Snappy, that both DWRF and Parquet/Arrow use, and consolidate with them. Doing just the decompress path first will be easier to review.
What do you think?

@marin-ma
Contributor Author

@marin-ma Some high-level comments. Why create a new folder v2? Why not update the existing Compression.h/.cpp files? Do we need all of the Arrow API? Where will we use APIs such as maximumCompressionLevel? I feel it will be easier to start with a single codec, say Snappy, that both DWRF and Parquet/Arrow use, and consolidate with them. Doing just the decompress path first will be easier to review. What do you think?

@majetideepak Thank you for the review.

Why create a new folder v2? Why not update the existing Compression.h/.cpp files?

This was based on the discussion in #7471 (comment). The new folder "v2" is intended to introduce the new API first. Next, I will replace the compression API used in the parquet and dwio modules with the new API. Meanwhile, common/compression will be replaced by common/compression/v2.

Do we need all of the Arrow API? Where will we use APIs such as maximumCompressionLevel?

This is a user-level API, which can be useful when users want to set different compression levels, such as in a Parquet writer. Given that the minimum and maximum compression levels can vary among compression codecs, maximumCompressionLevel gives users a boundary to ensure that a valid compression level is used. However, this approach also makes the API more complicated. If the APIs related to compression level are unnecessary, I can remove them, which would then require users to refer to the compression library's documentation for this information.
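As an illustration of the intended use, here is a hedged sketch: the minimumCompressionLevel/maximumCompressionLevel names follow the API discussed in this thread, while clampCompressionLevel itself is a hypothetical helper, not part of the PR.

#include <algorithm>
#include <cstdint>

// Clamp a user-supplied level to the codec's supported range so that a
// writer (e.g. a Parquet writer) never passes an invalid level to the codec.
int32_t clampCompressionLevel(Codec& codec, int32_t requested) {
  const int32_t minLevel = codec.minimumCompressionLevel();
  const int32_t maxLevel = codec.maximumCompressionLevel();
  return std::max(minLevel, std::min(requested, maxLevel));
}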

I feel it will be easier to start with a single codec, say Snappy, that both DWRF and Parquet/Arrow use, and consolidate with them. Doing just the decompress path first will be easier to review.

This should indeed make the review process more straightforward. However, I'm unsure about the practicality of replacing only one codec with the new API while keeping the original ones for the rest. Does this mean that we should temporarily disable other codecs until they can be integrated individually? Or perhaps you have a more efficient suggestion for this replacement process?

@majetideepak
Collaborator

This was based on the discussion in #7471 (comment)

Thanks for this pointer. Let's continue with the steps outlined in that issue. I will make a pass today.

@FelixYBW
Contributor

FelixYBW commented Dec 1, 2023

@pedroerp @mbasmanova Can you review the PR?

Collaborator

@majetideepak majetideepak left a comment

Some comments on the main Compression.h/cpp API.
I will look at the Hadoop and Lz4 compression code next.

uint64_t bytesWritten;
bool outputTooSmall;
};
struct FlushResult {
Collaborator

nit: add a new line above struct FlushResult and struct EndResult.

};

/// Compress some input.
/// If bytes_read is 0 on return, then a larger output buffer should be
Collaborator

CompressResult.bytesRead

uint8_t* output) = 0;

/// Flush part of the compressed output.
/// If outputTooSmall is true on return, flush() should be called again
Collaborator

FlushResult.outputTooSmall

virtual FlushResult flush(uint64_t outputLength, uint8_t* output) = 0;

/// End compressing, doing whatever is necessary to end the stream.
/// If outputTooSmall is true on return, end() should be called again
Collaborator

EndResult.outputTooSmall

/// If outputTooSmall is true on return, end() should be called again
/// with a larger buffer. Otherwise, the Compressor should not be used
/// anymore.
/// end() implies flush().
Collaborator

Can you clarify what end() implies flush() means?
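For context, a minimal sketch of how the streaming interface documented above could be driven, assuming a compress(inputLength, input, outputLength, output)-style signature and the CompressResult/EndResult fields shown earlier; per the doc comment, end() finalizes the stream and also flushes whatever the codec still buffers, so no separate flush() call is needed before it.

#include <cstdint>
#include <vector>

// Compress all of `input` into `output`, growing the output buffer whenever
// the compressor signals that it is too small.
void compressAll(
    Compressor& compressor,
    const std::vector<uint8_t>& input,
    std::vector<uint8_t>& output) {
  uint64_t inputOffset = 0;
  uint64_t outputOffset = 0;
  output.resize(1024);
  while (inputOffset < input.size()) {
    auto result = compressor.compress(
        input.size() - inputOffset,
        input.data() + inputOffset,
        output.size() - outputOffset,
        output.data() + outputOffset);
    inputOffset += result.bytesRead;
    outputOffset += result.bytesWritten;
    if (result.bytesRead == 0) {
      // Per the doc comment: supply a larger output buffer and retry.
      output.resize(output.size() * 2);
    }
  }
  // end() writes any trailing stream data; retry with a larger buffer while
  // it reports outputTooSmall.
  while (true) {
    auto endResult = compressor.end(
        output.size() - outputOffset, output.data() + outputOffset);
    outputOffset += endResult.bytesWritten;
    if (!endResult.outputTooSmall) {
      break;
    }
    output.resize(output.size() * 2);
  }
  output.resize(outputOffset);
}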

}
auto actualLength =
doGetUncompressedLength(inputLength, input, uncompressedLength);
if (actualLength) {
Collaborator

actualLength > 0

auto actualLength =
doGetUncompressedLength(inputLength, input, uncompressedLength);
if (actualLength) {
if (uncompressedLength) {
Collaborator

uncompressedLength > 0

VELOX_USER_CHECK_EQ(
*actualLength,
*uncompressedLength,
"Invalid uncompressed length: {}.",
Collaborator

clarify that expected uncompressed length {uncompressedLength} = {actualLength}

/// be written in this call will be written in subsequent calls to this
/// function. This is useful when fixed-size compression blocks are required
/// by the caller.
/// Note: Only Gzip and Zstd codec supports this function.
Collaborator

Should we have an API supportsPartialCompression?
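A sketch of the capability query being suggested (hypothetical; neither the method name nor its placement comes from the PR): codecs would report whether they implement compressPartial, and callers could fall back to one-shot compress() otherwise.

class Codec {
 public:
  virtual ~Codec() = default;

  // Only codecs that implement compressPartial (per the note above, Gzip and
  // Zstd) would override this to return true.
  virtual bool supportsCompressPartial() const {
    return false;
  }
};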

/// function. This is useful when fixed-size compression blocks are required
/// by the caller.
/// Note: Only Gzip and Zstd codec supports this function.
virtual uint64_t compressPartial(
Collaborator

The API name compressPartial is misleading since this is still a one-shot compression.
But I can't think of an alternative name either :)

@@ -76,6 +76,10 @@ std::string compressionKindToString(CompressionKind kind) {
return "lz4";
case CompressionKind_GZIP:
return "gzip";
case CompressionKind_LZ4RAW:
Collaborator

nit: can you please group these two with the lz4 above?

@@ -0,0 +1,29 @@
# Copyright (c) Facebook, Inc. and its affiliates.
Collaborator

Is the v2 folder supposed to replace its parent in the future? This structure is confusing: for example, you put Lz4Compression in this folder, while the original lzoDecompressor is in the parent directory. Also, how are you going to organize compressionKindToCodec, etc.? If the intention is to replace the current compress/decompress interface, it would be better to just make the changes in the parent (velox/common/compression) folder, so we can see what the structure of the interfaces is.

/// If bytes_read is 0 on return, then a larger output buffer should be
/// supplied.
virtual CompressResult compress(
uint64_t inputLength,
Collaborator

I think Velox convention is to have the array first, length second. Same for both input and output.
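For illustration, a sketch of the compress() signature with the parameter order the reviewer suggests (array first, length second). This is not the PR's final API; CompressResult is reproduced only to keep the snippet self-contained.

#include <cstdint>

struct CompressResult {
  uint64_t bytesRead;
  uint64_t bytesWritten;
  bool outputTooSmall;
};

class Compressor {
 public:
  virtual ~Compressor() = default;

  virtual CompressResult compress(
      const uint8_t* input,
      uint64_t inputLength,
      uint8_t* output,
      uint64_t outputLength) = 0;
};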

std::numeric_limits<int32_t>::min();

// Streaming compressor interface.
class Compressor {
Collaborator

I actually think the naming of Compressor and Codec is a bit confusing unless you are familiar with Arrow. In velox::common there is an "encode" folder which contains integer encoding/decoding. Compression codecs don't belong there, but they are also named codecs. And people won't expect "Compressor" to be the streaming-compression interface. With Velox naming conventions, I think it's better to name them StreamingCompressor and Compressor. Even in Arrow, a codec is referred to as a "decompressor" or "compressor", e.g. in column_reader:

decompressor_ = GetCodec(codec);

uint64_t compressedSize = 0;
compressed.resize(10);
bool doFlush = false;
// Generate small random input buffer size.
Collaborator

Add an empty line above all comments in this function to improve readability

uint64_t remaining = compressed.size();
uint64_t decompressedSize = 0;
decompressed.resize(10);
// Generate small random input buffer size.
Collaborator

Add an empty line above all comments in this function to improve readability


// Check the streaming compressor against one-shot decompression.
void checkStreamingCompressor(Codec* codec, const std::vector<uint8_t>& data) {
// Run streaming compression.
Collaborator

Add an empty line above all comments in this function to improve readability. Same for the next function

@yingsu00
Collaborator

yingsu00 commented Dec 4, 2023

This was based on the discussion in #7471 (comment). The new folder "v2" is intended to introduce the new API first. Next, I will replace the compression API used in the parquet and dwio modules with the new API. Meanwhile, common/compression will be replaced by common/compression/v2.

I actually find this approach not very clear. For one thing, we don't know which file will end up in which folder. Also, I think Compression.h/cpp should be merged with the ones in the velox/common/compression folder. I think you can achieve the same goal without disrupting the Velox code base by just putting the code where it belongs.

@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch 2 times, most recently from be335e4 to 6e1b203 Compare December 7, 2023 04:00

stale bot commented Mar 6, 2024

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@stale stale bot added the stale label Mar 6, 2024
@george-gu-2021

Has the PR become stale? Are there any further plans for it from the authors or reviewers? Thanks!

@stale stale bot removed the stale label Mar 7, 2024
@yingsu00
Collaborator

yingsu00 commented Apr 4, 2024

@marin-ma Are you still working on this PR?

@FelixYBW
Contributor

FelixYBW commented Apr 4, 2024

@marin-ma Are you still working on this PR?

Paused for a while to align with Pedro's changes. Since that isn't an issue now, Rong will pick this up again. Let's unify the codec.

@marin-ma marin-ma force-pushed the unify-compression-api-lz4 branch 2 times, most recently from f04ea23 to ecfc156 Compare April 22, 2024 06:37
@marin-ma
Contributor Author

@rui-mo @PHILO-HE Could you help to review first? Thanks!

Collaborator

@rui-mo rui-mo left a comment

Thanks.

velox/common/compression/Compression.cpp (review comments, resolved)
velox/common/compression/Compression.h (review comments, resolved)
@marin-ma
Contributor Author

@yingsu00 @majetideepak Could you help to review again? Thanks!


stale bot commented Aug 19, 2024

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@stale stale bot added the stale label Aug 19, 2024
@stale stale bot closed this Sep 2, 2024
@majetideepak majetideepak reopened this Oct 31, 2024
@stale stale bot removed stale labels Oct 31, 2024
Collaborator

@majetideepak majetideepak left a comment

@marin-ma Apologies for not reviewing this earlier. I will work with you on this. Can you please rebase?
I left a couple of comments. Thanks!


/// Propagate any non-successful Status wrapped in folly::Unexpected to the
/// caller.
#define VELOX_RETURN_UNEXPECTED_NOT_OK(status) \
Collaborator

Let's add this when we need it.

/// Common API for extracting Status from either Status or Result<T> (the latter
/// is defined in Result.h).
/// Useful for status check macros such as VELOX_RETURN_NOT_OK.
#define VELOX_RETURN_UNEXPECTED(expected) \
Collaborator

Do we use this anywhere? If not, let's add it when we need it.
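For reference, a hypothetical usage sketch (not code from this PR) of the kind of call site this macro is meant for, assuming VELOX_RETURN_UNEXPECTED early-returns to the caller when the folly::Expected holds an error Status:

#include <folly/Expected.h>

#include <algorithm>
#include <cstdint>

folly::Expected<int32_t, Status> pickCompressionLevel(CompressionKind kind) {
  // maximumCompressionLevel returns folly::Expected<int32_t, Status>.
  auto maxLevel = Codec::maximumCompressionLevel(kind);
  // Propagate the wrapped Status to the caller if the call failed.
  VELOX_RETURN_UNEXPECTED(maxLevel);
  // Otherwise cap the level at 9 (an arbitrary example value).
  return std::min<int32_t>(maxLevel.value(), 9);
}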


/// Return the largest supported compression level for the kind
/// Note: This function creates a temporary Codec instance.
static folly::Expected<int32_t, Status> maximumCompressionLevel(
Collaborator

It is odd that these static functions have to create an instance to get the result.
The Arrow PR that added this felt there would be a usage, but I don't see this being used outside of tests.
Velox will not be exposing the compression API as a public API.
Let's move these static functions to a test utility or remove some of them that are not being used.

@majetideepak
Collaborator

majetideepak commented Nov 4, 2024

@marin-ma, Are you able to continue this work? Would you mind if I pushed to this branch and addressed some of the comments?

@FelixYBW
Contributor

FelixYBW commented Nov 4, 2024

@marin-ma, Are you able to continue this work? Would you mind if I pushed to this branch and addressed some of the comments?

@majetideepak Go ahead and push changes.
