-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-1819: [C++] Refactor decoding #6685
Conversation
Also add an additional size check before reading the length of a byte array.
Could you keep this in the same file to simplify the review, and once it's approved, extract the decoding part in a new file in a subsequent commit? I'm not in the mood to read 1500 lines of parquet decoding. |
Ok, the diff should be much shorter now. |
AppVeyor build: https://ci.appveyor.com/project/pitrou/arrow/builds/31651804 |
8b06307
to
d24005d
Compare
@fsaintjacques Would be nice if you could give this a quick review, so that I can move forward with #6690. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, just a few comments
cpp/src/parquet/encoding.cc
Outdated
ParquetException::EofException(); | ||
} | ||
const uint32_t len = arrow::util::SafeLoadAs<uint32_t>(data); | ||
const int64_t increment = static_cast<int64_t>(4 + len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const int64_t increment = static_cast<int64_t>(4 + len); | |
const int64_t consumed_length = static_cast<int64_t>(4 + len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough.
// Visit a null bitmap, in order, without overhead. | ||
// | ||
// The given `VisitFunc` should be a callable with either of these signatures: | ||
// - void(bool is_valid) | ||
// - Status(bool is_valid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem worthwhile given that we have internal::VisitBits
and internal::VisitBitsUnrolled
. The Status return signature is not useful in parquet::
since we can throw an exception to effect early termination
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this function also gracefully separates the case where null_count
is 0 (and the null bitmap pointer potentially null), both for correctness and better performance.
|
||
auto decode_value = [&](bool is_valid) { | ||
if (is_valid) { | ||
if (ARROW_PREDICT_FALSE(len_ < 4)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coding style: please reintroduce the constant (named kPrefixSize
maybe) or use
if (ARROW_PREDICT_FALSE(len_ < 4)) { | |
if (ARROW_PREDICT_FALSE(len_ < sizeof(int32_t))) { |
rather than 4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1. sizeof(int32_t)
is pointless pedantry.
@@ -2486,6 +2304,8 @@ class ByteStreamSplitDecoder : public DecoderImpl, virtual public TypedDecoder<D | |||
|
|||
private: | |||
int num_values_in_buffer{0U}; | |||
|
|||
static constexpr size_t kNumStreams = sizeof(T); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
num_values_ -= values_decoded; | ||
len_ -= sizeof(num_streams) * values_decoded; | ||
len_ -= sizeof(T) * values_decoded; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
len_ -= sizeof(T) * values_decoded; | |
len_ -= kNumStreams * values_decoded; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sizeof(T)
is logically more exact, though it's the same value (the length in bytes is decreased by the number of values decoded times the value size).
cpp/src/parquet/encoding.cc
Outdated
if (ARROW_PREDICT_FALSE(data_size < 4)) { | ||
ParquetException::EofException(); | ||
} | ||
const uint32_t len = arrow::util::SafeLoadAs<uint32_t>(data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to be int32_t
instead of uint32_t
with a positive check.
builder->UnsafeAppend(data_); | ||
data_ += descr_->type_length(); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible that with this new lambda+visitor form, the compiler can't hoist descr_->type_length()
(and maybe it didn't before because it's a double indirection). I'd say move this out of the loop just to be sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds dubious to me.
int32_t index; | ||
if (ARROW_PREDICT_FALSE(!idx_decoder_.Get(&index))) { | ||
throw ParquetException(""); | ||
} | ||
builder->UnsafeAppend(dict_values[index].ptr); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related, to your change, but dict_values[index]
should check for the index bounds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do, thank you.
int32_t index; | ||
if (ARROW_PREDICT_FALSE(!idx_decoder_.Get(&index))) { | ||
throw ParquetException(""); | ||
} | ||
PARQUET_THROW_NOT_OK(builder->Append(dict_values[index].ptr)); | ||
} else { | ||
PARQUET_THROW_NOT_OK(builder->AppendNull()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto with bounds.
int32_t index; | ||
if (ARROW_PREDICT_FALSE(!idx_decoder_.Get(&index))) { | ||
throw ParquetException(""); | ||
} | ||
builder->UnsafeAppend(dict_values[index]); | ||
} else { | ||
builder->UnsafeAppendNull(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this long overdue improvement. I don't have anything major to add, but I'm going to check the benchmarks locally
Benchmarks show no perceptible difference based on a quick glance |
+1, will merge if CI green. |
Also add an additional size check before reading the length of a byte array.