Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-1819: [C++] Refactor decoding #6685

Closed
wants to merge 5 commits into from

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Mar 23, 2020

Also add an additional size check before reading the length of a byte array.

Also add an additional size check before reading the length of a byte array.
@pitrou
Copy link
Member Author

pitrou commented Mar 23, 2020

cc @fsaintjacques @wesm

@github-actions
Copy link

@fsaintjacques
Copy link
Contributor

Could you keep this in the same file to simplify the review, and once it's approved, extract the decoding part in a new file in a subsequent commit? I'm not in the mood to read 1500 lines of parquet decoding.

@fsaintjacques fsaintjacques self-requested a review March 23, 2020 12:24
@pitrou
Copy link
Member Author

pitrou commented Mar 23, 2020

Ok, the diff should be much shorter now.

@pitrou
Copy link
Member Author

pitrou commented Mar 23, 2020

@pitrou
Copy link
Member Author

pitrou commented Mar 23, 2020

@fsaintjacques Would be nice if you could give this a quick review, so that I can move forward with #6690.

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, just a few comments

ParquetException::EofException();
}
const uint32_t len = arrow::util::SafeLoadAs<uint32_t>(data);
const int64_t increment = static_cast<int64_t>(4 + len);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const int64_t increment = static_cast<int64_t>(4 + len);
const int64_t consumed_length = static_cast<int64_t>(4 + len);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough.

Comment on lines +477 to +481
// Visit a null bitmap, in order, without overhead.
//
// The given `VisitFunc` should be a callable with either of these signatures:
// - void(bool is_valid)
// - Status(bool is_valid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem worthwhile given that we have internal::VisitBits and internal::VisitBitsUnrolled. The Status return signature is not useful in parquet:: since we can throw an exception to effect early termination

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this function also gracefully separates the case where null_count is 0 (and the null bitmap pointer potentially null), both for correctness and better performance.


auto decode_value = [&](bool is_valid) {
if (is_valid) {
if (ARROW_PREDICT_FALSE(len_ < 4)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coding style: please reintroduce the constant (named kPrefixSize maybe) or use

Suggested change
if (ARROW_PREDICT_FALSE(len_ < 4)) {
if (ARROW_PREDICT_FALSE(len_ < sizeof(int32_t))) {

rather than 4

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1. sizeof(int32_t) is pointless pedantry.

@@ -2486,6 +2304,8 @@ class ByteStreamSplitDecoder : public DecoderImpl, virtual public TypedDecoder<D

private:
int num_values_in_buffer{0U};

static constexpr size_t kNumStreams = sizeof(T);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


num_values_ -= values_decoded;
len_ -= sizeof(num_streams) * values_decoded;
len_ -= sizeof(T) * values_decoded;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
len_ -= sizeof(T) * values_decoded;
len_ -= kNumStreams * values_decoded;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sizeof(T) is logically more exact, though it's the same value (the length in bytes is decreased by the number of values decoded times the value size).

if (ARROW_PREDICT_FALSE(data_size < 4)) {
ParquetException::EofException();
}
const uint32_t len = arrow::util::SafeLoadAs<uint32_t>(data);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be int32_t instead of uint32_t with a positive check.

builder->UnsafeAppend(data_);
data_ += descr_->type_length();
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible that with this new lambda+visitor form, the compiler can't hoist descr_->type_length() (and maybe it didn't before because it's a double indirection). I'd say move this out of the loop just to be sure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds dubious to me.

int32_t index;
if (ARROW_PREDICT_FALSE(!idx_decoder_.Get(&index))) {
throw ParquetException("");
}
builder->UnsafeAppend(dict_values[index].ptr);
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related, to your change, but dict_values[index] should check for the index bounds.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, thank you.

int32_t index;
if (ARROW_PREDICT_FALSE(!idx_decoder_.Get(&index))) {
throw ParquetException("");
}
PARQUET_THROW_NOT_OK(builder->Append(dict_values[index].ptr));
} else {
PARQUET_THROW_NOT_OK(builder->AppendNull());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto with bounds.

int32_t index;
if (ARROW_PREDICT_FALSE(!idx_decoder_.Get(&index))) {
throw ParquetException("");
}
builder->UnsafeAppend(dict_values[index]);
} else {
builder->UnsafeAppendNull();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this long overdue improvement. I don't have anything major to add, but I'm going to check the benchmarks locally

@wesm
Copy link
Member

wesm commented Mar 24, 2020

Benchmarks show no perceptible difference based on a quick glance

@pitrou
Copy link
Member Author

pitrou commented Mar 24, 2020

+1, will merge if CI green.

@pitrou pitrou closed this in 4fb888f Mar 24, 2020
@pitrou pitrou deleted the PARQUET-1819-refactor branch March 24, 2020 10:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants