Add support to compress data in PrestoVectorSerializer #5544

Closed

@jinchengchenghh jinchengchenghh commented Jul 6, 2023

Compress the data with folly when flushing to the OutputStream, and decompress it during deserialization.
This will be used for spill compression.

Resolves: #5313

netlify bot commented Jul 6, 2023

Deploy Preview for meta-velox ready!

Name Link
🔨 Latest commit c48af6b
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/64bdd330b522500008b081a5
😎 Deploy Preview https://deploy-preview-5544--meta-velox.netlify.app
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 6, 2023
@jinchengchenghh jinchengchenghh changed the title Add support to compress buffer in VectorSerializer Add support to compress data in VectorSerializer Jul 7, 2023
@xiaoxmeng xiaoxmeng left a comment

Thanks for the change % minors!

case CompressionKind_LZ4:
return getCodec(folly::io::CodecType::LZ4);
}
VELOX_UNREACHABLE();
Contributor

Do we want to log out the unsupported kind in default case to help debugging? Thanks!

auto children = &(*result)->children();
auto childTypes = type->as<TypeKind::ROW>().children();
readColumns(source, pool, childTypes, children, useLosslessTimestamp);
if (codec->type() == folly::io::CodecType::NO_COMPRESSION) {
VELOX_CHECK(!isCompressedBitSet(pageCodecMarker));
Contributor

Shall we move the check above at l1845?

VELOX_CHECK_EQ(codec->type() == folly::io::CodecType::NO_COMPRESSION, !isCompressedBitSet(pageCodecMarker), "error message");
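
The invariant behind the suggested check can be sketched without Velox or folly. This is only an illustration: `kCompressedBitMask` (assumed here to be the lowest bit of the marker byte) and `markerMatchesCodec` are hypothetical names, not the identifiers used in the PR. The predicate is true exactly when the configured codec and the page header agree on whether the payload is compressed.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the lowest bit of the page codec marker flags a compressed page.
constexpr std::uint8_t kCompressedBitMask = 1;

inline bool isCompressedBitSet(std::uint8_t pageCodecMarker) {
  return (pageCodecMarker & kCompressedBitMask) == kCompressedBitMask;
}

// Hypothetical predicate: true when the reader's codec and the page header
// agree about compression. A caller would feed this into a check macro.
inline bool markerMatchesCodec(
    bool codecCompresses,
    std::uint8_t pageCodecMarker) {
  return codecCompresses == isCompressedBitSet(pageCodecMarker);
}
```

Checking the two sides for equality up front rejects both mismatch directions (a compressed page read with a no-op codec, and an uncompressed page read with a real codec) before any column data is decoded.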

for (auto& stream : streams_) {
stream->flush(&out);
}
int32_t uncompressedSize = out.tellp();
Contributor

const

crc = computeChecksum(listener, codec, numRows, compressedSize);
}
output->seekp(crcOffset);
writeInt64(output, crc); // Write zero checksum
Contributor

The comment is not valid?

@jinchengchenghh jinchengchenghh changed the title Add support to compress data in VectorSerializer Add support to compress data in PrestoVectorSerializer Jul 10, 2023
@jinchengchenghh

Addressed all the comments. Could you review again? @xiaoxmeng

@xiaoxmeng xiaoxmeng left a comment

Thanks for the update. LGTM. Could you draw a diagram to show the serialized page layout with and without compression in header?

TEST_F(PrestoSerializerTest, ioBufRoundTrip) {
serializer::presto::PrestoVectorSerde::registerVectorSerde();
TEST_P(PrestoSerializerTest, ioBufRoundTrip) {
if (!isRegisteredVectorSerde()) {
Contributor

Can you register once in SetUpTestCase? Thanks!

@@ -275,27 +297,31 @@ TEST_F(PrestoSerializerTest, multiPage) {
auto byteStream = toByteStream(bytes);

RowVectorPtr deserialized;
dwio::common::CompressionKind kind = GetParam();
Contributor

Could you add a method to the test class, getSerdeOptions(), which constructs the options based on the test parameter?

#include "velox/vector/VectorStream.h"

namespace facebook::velox::serializer::presto {
class PrestoVectorSerde : public VectorSerde {
public:
// Input options that the serializer recognizes.
struct PrestoOptions : VectorSerde::Options {
explicit PrestoOptions(bool useLosslessTimestamp)
: useLosslessTimestamp(useLosslessTimestamp) {}
explicit PrestoOptions() {}
Contributor

PrestoOptions() = default;

Do we need this?

Contributor Author

Yes, it lets us simplify the function below. In most cases we receive a nullptr Options, so we should construct PrestoOptions with its default values.

const PrestoVectorSerde::PrestoOptions toPrestoOptions(
    const VectorSerde::Options* options) {
  if (options == nullptr) {
    return PrestoVectorSerde::PrestoOptions();
  }
  return *(static_cast<const PrestoVectorSerde::PrestoOptions*>(options));
}

explicit PrestoOptions() {}

explicit PrestoOptions(
bool useLosslessTimestamp,
Contributor

s/useLosslessTimestamp/_useLosslessTimestamp/
s/compressionKind/_compressionKind/

@@ -136,6 +136,18 @@ std::string typeToEncodingName(const TypePtr& type) {
}
}

const PrestoVectorSerde::PrestoOptions toPrestoOptions(
Contributor

s/const PrestoVectorSerde::PrestoOptions/PrestoVectorSerde::PrestoOptions/

Drop the const as return by value?

private:
static const int32_t kSizeInBytesOffset{4 + 1};
static const int32_t kHeaderSize{kSizeInBytesOffset + 4 + 4 + 8};

const StreamArena* streamArena_;
Contributor

StreamArena* const streamArena_;
const std::unique_ptr<folly::io::Codec> codec_;

bool useLosslessTimestamp) {
bool useLosslessTimestamp,
dwio::common::CompressionKind compressionKind) {
streamArena_ = streamArena;
Contributor

The same for the streamArena_ init

@facebook-github-bot

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

jinchengchenghh commented Jul 13, 2023

> Thanks for the update. LGTM. Could you draw a diagram to show the serialized page layout with and without compression in header?

The header of the serialized page is the same with and without compression.
Only the data layout changes.
[image]

@xiaoxmeng xiaoxmeng left a comment

@jinchengchenghh Looks great. Thanks for iterations!

readColumns(source, pool, childTypes, children, useLosslessTimestamp);
if (!needCompression(*codec)) {
// skip number of columns
source->skip(4);
Contributor

Shall we add a check that the number of columns matches? We are currently seeing some serialized page corruption in Meta internal tests. Thanks!
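
A rough sketch of what validating the column count, rather than skipping the four bytes, could look like. It is written against a plain byte vector instead of Velox's ByteStream, it assumes the count is stored as a little-endian 32-bit integer, and `readInt32LE` and `numColumnsMatch` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Reads a little-endian int32 at `offset`; throws if the buffer is too short.
inline std::int32_t readInt32LE(
    const std::vector<std::uint8_t>& buf,
    std::size_t offset) {
  if (offset > buf.size() || buf.size() - offset < 4) {
    throw std::out_of_range("Truncated page: cannot read column count");
  }
  std::uint32_t value = 0;
  for (int i = 0; i < 4; ++i) {
    value |= static_cast<std::uint32_t>(buf[offset + i]) << (8 * i);
  }
  return static_cast<std::int32_t>(value);
}

// Hypothetical helper: compares the serialized column count with the number
// of children expected from the row type instead of skipping the field.
inline bool numColumnsMatch(
    const std::vector<std::uint8_t>& data,
    std::size_t offset,
    std::int32_t expectedColumns) {
  return readInt32LE(data, offset) == expectedColumns;
}
```

A mismatch here is a cheap early signal of a corrupted or misaligned page, which is easier to debug than a failure deep inside column decoding.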

ByteStream uncompressedSource;
uncompressedSource.resetInput({byteRange});
// skip number of columns
uncompressedSource.skip(4);
Contributor

ditto

@facebook-github-bot

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@xiaoxmeng

> > Thanks for the update. LGTM. Could you draw a diagram to show the serialized page layout with and without compression in header?
>
> The serialized page layout is same with and without compression in header. Just change the data layout. image

Thanks. I meant adding this diagram in the code, but we can do this as a follow-up if you'd like to help. Thanks!

ASSERT_EQ(
folly::io::CodecType::LZ4,
compressionKindToCodec(CompressionKind::CompressionKind_LZ4)->type());
EXPECT_THROW(
Contributor

Use VELOX_ASSERT_THROW

@@ -14,7 +14,7 @@
add_library(velox_presto_serializer PrestoSerializer.cpp
UnsafeRowSerializer.cpp)

target_link_libraries(velox_presto_serializer velox_vector)
target_link_libraries(velox_presto_serializer velox_dwio_common velox_vector)
Contributor

I'm not sure if it is a good idea to add this dependency. What do we need from dwio::common? Can we refactor to extract what we need?

Contributor Author

We need this file: https://github.com/facebookincubator/velox/blob/main/velox/dwio/common/Common.h. I would suggest renaming it to Compression.h and extracting it into velox_compression.

@jinchengchenghh jinchengchenghh force-pushed the compressup branch 2 times, most recently from d2cc78a to 0370416 Compare July 17, 2023 05:17
@facebook-github-bot

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

add_subdirectory(tests)
endif()

add_library(velox_dwio_common_compression Compression.cpp LzoDecompressor.cpp)
Contributor

Does this need to be under dwio? Looks like it will be used for non-DWIO code. Maybe move it somewhere else. It would be nice to do this refactoring in a separate PR.

Contributor Author

Where would you recommend putting it? Maybe velox/common/compression or velox/compression? I can help refactor it in a separate PR.

Contributor

velox/common/compression sounds good. Thanks.

Contributor

@jinchengchenghh I think you have moved it to the wrong location: it should be velox/common/compression/Compression.h, not velox/dwio/common/compression/Compression.h. @mbasmanova

Contributor Author

I will refactor in a separate PR as this discussion suggested. #5544 (comment)

Contributor

@jinchengchenghh Thank you. Just to clarify, we'll wait for a refactoring PR to proceed.

@xiaoxmeng xiaoxmeng left a comment

@jinchengchenghh could you rebase the PR to clear the ci test failure? Thanks!

@facebook-github-bot

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

case folly::io::CodecType::LZ4:
return CompressionKind_LZ4;
default:
VELOX_UNSUPPORTED("Not support folly codec type {}", type);
Contributor

type.toString()?

Contributor Author

This is an enum class, folly::io::CodecType; it does not have a toString function.
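
One common way to put a scoped enum into an error message when no toString exists is to cast it to its underlying integer type. The sketch below uses a hypothetical `DemoCodec` enum as a stand-in for folly::io::CodecType, and `toUnderlying` and `unsupportedCodecMessage` are illustrative names rather than anything from the PR.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stand-in for a scoped enum that has no toString().
enum class DemoCodec : int { kNone = 0, kLz4 = 1, kOther = 2 };

// Converts any enum value to its underlying integer type.
template <typename E>
constexpr auto toUnderlying(E value) noexcept {
  static_assert(std::is_enum<E>::value, "toUnderlying expects an enum type");
  return static_cast<typename std::underlying_type<E>::type>(value);
}

// Builds a message that names the offending value numerically.
inline std::string unsupportedCodecMessage(DemoCodec type) {
  return "Unsupported codec type " + std::to_string(toUnderlying(type));
}
```

The numeric value is less readable than a name, but it is always available and is enough to look up the enumerator when debugging.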

case CompressionKind_LZ4:
return "lz4";
}
return folly::to<std::string>("unknown - ", kind);
Contributor

Why not put inside switch default?

Contributor Author

This is old code; I just moved it to another directory.

@@ -15,20 +15,29 @@
*/
#pragma once
#include "velox/common/base/Crc.h"
#include "velox/dwio/common/compression/Compression.h"
#include "velox/vector/VectorStream.h"

namespace facebook::velox::serializer::presto {
class PrestoVectorSerde : public VectorSerde {
public:
// Input options that the serializer recognizes.
Contributor

I understand this is not within this PR, but can we make the comments triple slash as it's public? Thanks

Contributor Author

You could open a new PR to do that, since it is not a change relevant to this PR.

dwio::common::CompressionKind _compressionKind)
: useLosslessTimestamp(_useLosslessTimestamp),
compressionKind(_compressionKind) {}

// Currently presto only supports millisecond precision and the serializer
Contributor

ditto

PrestoVectorSerde::PrestoOptions toPrestoOptions(
const VectorSerde::Options* options) {
if (options == nullptr) {
return PrestoVectorSerde::PrestoOptions();
Contributor

Does returning NONE mean no compression or something wrong? I see a needCompression() method down there already can tell if compression is needed.

Contributor

+1
@jinchengchenghh I know this is due to current code behavior but could you help either fix all the code sites to pass non-null VectorSerde::Options or have a unit test to make sure PrestoVectorSerde::PrestoOptions() returns common::CompressionKind::CompressionKind_NONE.
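
The behavior being asked about can be pinned down with a small test-style sketch. All types here are hypothetical stand-ins (`Options`, `PrestoLikeOptions`, `CompressionKind`, `toPrestoLikeOptions`), and the default values are assumptions for illustration; the point is only that a null options pointer falls back to default-constructed options whose compression kind is "none".

```cpp
#include <cassert>

// Hypothetical stand-ins for the real Velox types.
enum class CompressionKind { kNone, kLz4 };

struct Options {
  virtual ~Options() = default;
};

struct PrestoLikeOptions : Options {
  PrestoLikeOptions() = default;
  PrestoLikeOptions(bool lossless, CompressionKind kind)
      : useLosslessTimestamp(lossless), compressionKind(kind) {}

  // Default member initializers define what a null options pointer means.
  bool useLosslessTimestamp{false};
  CompressionKind compressionKind{CompressionKind::kNone};
};

// Returns by value: defaults when `options` is null, otherwise a copy of
// the caller's options (which must really be a PrestoLikeOptions).
inline PrestoLikeOptions toPrestoLikeOptions(const Options* options) {
  if (options == nullptr) {
    return PrestoLikeOptions();
  }
  return *static_cast<const PrestoLikeOptions*>(options);
}
```

With defaults expressed as member initializers, "no options passed" and "no compression requested" are the same state, which is what a unit test for the null case would assert.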

@@ -1580,6 +1595,9 @@ class PrestoVectorSerializer : public VectorSerializer {
}
}

// The SerializedPage layout is:
// numRows(4) | codec(1) | uncompressedSize(4) | compressedSize(4) |
// checksum(8) | data
Contributor

Nice comments. Can we make them triple slash /// as it's public?
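
For reference, the header layout in that comment can be turned into byte offsets. In this sketch `kSizeInBytesOffset` and `kHeaderSize` mirror the constants quoted earlier in the review, while the other constant names are hypothetical and exist only to make the arithmetic explicit.

```cpp
#include <cassert>
#include <cstddef>

// Field widths in bytes, taken from the layout comment:
// numRows(4) | codec(1) | uncompressedSize(4) | compressedSize(4) | checksum(8)
constexpr std::size_t kNumRowsBytes = 4;
constexpr std::size_t kCodecBytes = 1;
constexpr std::size_t kUncompressedSizeBytes = 4;
constexpr std::size_t kCompressedSizeBytes = 4;
constexpr std::size_t kChecksumBytes = 8;

// Offsets of each field from the start of the page.
constexpr std::size_t kCodecOffset = kNumRowsBytes;
constexpr std::size_t kSizeInBytesOffset = kCodecOffset + kCodecBytes;
constexpr std::size_t kCompressedSizeOffset =
    kSizeInBytesOffset + kUncompressedSizeBytes;
constexpr std::size_t kChecksumOffset =
    kCompressedSizeOffset + kCompressedSizeBytes;
constexpr std::size_t kHeaderSize = kChecksumOffset + kChecksumBytes;

// These agree with the constants shown in the class under review.
static_assert(kSizeInBytesOffset == 4 + 1, "size fields start after 5 bytes");
static_assert(
    kHeaderSize == kSizeInBytesOffset + 4 + 4 + 8,
    "header is 21 bytes");
```

The header is the same 21 bytes with or without compression; only the bytes that follow it differ.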

@xiaoxmeng xiaoxmeng left a comment

@jinchengchenghh file move PR is now committed. Can you rebase? Thanks!

@facebook-github-bot

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@xiaoxmeng merged this pull request in e80df93.
