PARQUET-1819: [C++] Refactor decoding #6685

pitrou · 2020-03-23T11:54:54Z

Also add an additional size check before reading the length of a byte array.

pitrou · 2020-03-23T11:55:48Z

github-actions · 2020-03-23T12:02:30Z

https://issues.apache.org/jira/browse/PARQUET-1819

fsaintjacques · 2020-03-23T12:24:36Z

Could you keep this in the same file to simplify the review, and once it's approved, extract the decoding part into a new file in a subsequent commit? I'm not in the mood to read 1500 lines of Parquet decoding.

pitrou · 2020-03-23T12:39:20Z

Ok, the diff should be much shorter now.

pitrou · 2020-03-23T13:17:35Z

AppVeyor build: https://ci.appveyor.com/project/pitrou/arrow/builds/31651804

pitrou · 2020-03-23T19:16:14Z

@fsaintjacques Would be nice if you could give this a quick review, so that I can move forward with #6690.

bkietz

This looks good, just a few comments.

bkietz · 2020-03-23T19:19:54Z

cpp/src/parquet/encoding.cc

+    ParquetException::EofException();
+  }
+  const uint32_t len = arrow::util::SafeLoadAs<uint32_t>(data);
+  const int64_t increment = static_cast<int64_t>(4 + len);


Suggested change

- const int64_t increment = static_cast<int64_t>(4 + len);
+ const int64_t consumed_length = static_cast<int64_t>(4 + len);

pitrou · 2020-03-23

Fair enough.

bkietz · 2020-03-23T19:25:04Z

cpp/src/arrow/visitor_inline.h

+// Visit a null bitmap, in order, without overhead.
+//
+// The given `VisitFunc` should be a callable with either of these signatures:
+// - void(bool is_valid)
+// - Status(bool is_valid)


This doesn't seem worthwhile given that we have internal::VisitBits and internal::VisitBitsUnrolled. The Status return signature is not useful in parquet::, since we can throw an exception to effect early termination.

pitrou · 2020-03-23

Well, this function also gracefully separates out the case where null_count is 0 (and the null bitmap pointer is potentially null), both for correctness and for better performance.
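For readers following the exchange, here is a minimal sketch of the fast path being described. It is not the helper from the PR: the name `VisitNullBitmapSketch` and its parameter list are assumptions made for illustration, and only the `void(bool is_valid)` callable form is handled. The one external fact relied on is that Arrow validity bitmaps are least-significant-bit first.

```cpp
#include <cstdint>

// Hypothetical stand-in for the helper under review (name and signature are
// assumptions). Calls func(is_valid) once per value, in order.
template <typename VisitFunc>
void VisitNullBitmapSketch(const uint8_t* valid_bits, int64_t offset,
                           int64_t num_values, int64_t null_count,
                           VisitFunc&& func) {
  if (null_count == 0) {
    // Fast path: every value is valid, so the bitmap is never dereferenced.
    // This is what makes a null `valid_bits` pointer safe here.
    for (int64_t i = 0; i < num_values; ++i) {
      func(true);
    }
    return;
  }
  // Slow path: test one bit per value (LSB-first within each byte).
  for (int64_t i = 0; i < num_values; ++i) {
    const int64_t bit = offset + i;
    func(((valid_bits[bit / 8] >> (bit % 8)) & 1) != 0);
  }
}
```

Separating the two loops keeps the per-bit test out of the common all-valid case and avoids touching a bitmap pointer that may legitimately be null.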

bkietz · 2020-03-23T19:29:00Z

cpp/src/parquet/encoding.cc


+    auto decode_value = [&](bool is_valid) {
+      if (is_valid) {
+        if (ARROW_PREDICT_FALSE(len_ < 4)) {


Coding style: rather than the literal 4, please reintroduce the constant (perhaps named kPrefixSize) or use sizeof(int32_t):

Suggested change

- if (ARROW_PREDICT_FALSE(len_ < 4)) {
+ if (ARROW_PREDICT_FALSE(len_ < sizeof(int32_t))) {

pitrou · 2020-03-23

-1. sizeof(int32_t) is pointless pedantry.

bkietz · 2020-03-23T19:29:43Z

cpp/src/parquet/encoding.cc

@@ -2486,6 +2304,8 @@ class ByteStreamSplitDecoder : public DecoderImpl, virtual public TypedDecoder<D

 private:
  int num_values_in_buffer{0U};
+
+  static constexpr size_t kNumStreams = sizeof(T);


bkietz · 2020-03-23T19:30:55Z

cpp/src/parquet/encoding.cc


  num_values_ -= values_decoded;
-  len_ -= sizeof(num_streams) * values_decoded;
+  len_ -= sizeof(T) * values_decoded;


Suggested change

- len_ -= sizeof(T) * values_decoded;
+ len_ -= kNumStreams * values_decoded;

pitrou · 2020-03-23

sizeof(T) is logically more exact, though it's the same value (the length in bytes is decreased by the number of values decoded times the value size).
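To make the two roles of `sizeof(T)` concrete, here is a minimal sketch of byte-stream-split decoding. It is not the `ByteStreamSplitDecoder` from the PR; the free function `DecodeByteStreamSplit` is a hypothetical name. The layout assumed is the standard one: the encoded buffer holds `sizeof(T)` streams of `num_values` bytes each, and stream `k` holds byte `k` of every value.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not the PR's decoder): reassembles `num_values` values
// of type T from a byte-stream-split buffer of sizeof(T) * num_values bytes.
template <typename T>
std::vector<T> DecodeByteStreamSplit(const uint8_t* data, size_t num_values) {
  // One stream per byte of T, which is why kNumStreams equals sizeof(T).
  constexpr size_t kNumStreams = sizeof(T);
  std::vector<T> out(num_values);
  for (size_t i = 0; i < num_values; ++i) {
    uint8_t bytes[kNumStreams];
    for (size_t stream = 0; stream < kNumStreams; ++stream) {
      // Byte `stream` of value `i` lives in stream number `stream`.
      bytes[stream] = data[stream * num_values + i];
    }
    std::memcpy(&out[i], bytes, sizeof(T));
  }
  return out;
}
```

The number of streams and the number of bytes consumed per value happen to coincide, which is the point of the reply above: when adjusting the remaining length, "bytes per value" is the quantity that is meant.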

fsaintjacques · 2020-03-23T19:52:21Z

cpp/src/parquet/encoding.cc

+  if (ARROW_PREDICT_FALSE(data_size < 4)) {
+    ParquetException::EofException();
+  }
+  const uint32_t len = arrow::util::SafeLoadAs<uint32_t>(data);


This needs to be int32_t rather than uint32_t, with a check that the value is not negative.
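A minimal sketch of what such a check could look like, under stated assumptions: `ReadByteArrayLength` is a hypothetical helper, `std::runtime_error` stands in for `ParquetException`, and `std::memcpy` stands in for `arrow::util::SafeLoadAs`. The load uses host byte order, so it matches the little-endian file format only on little-endian hosts.

```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>

// Hypothetical helper, not the code from the PR. Reads a 4-byte signed length
// prefix and returns the total bytes consumed (prefix plus payload).
inline int64_t ReadByteArrayLength(const uint8_t* data, int64_t data_size,
                                   int32_t* out_len) {
  constexpr int64_t kPrefixSize = 4;
  if (data_size < kPrefixSize) {
    throw std::runtime_error("EOF: not enough bytes for the length prefix");
  }
  int32_t len;
  std::memcpy(&len, data, sizeof(len));  // alignment-safe load, host byte order
  if (len < 0) {
    throw std::runtime_error("Invalid byte array length");
  }
  // Widen before adding so the sum cannot wrap around in 32-bit arithmetic.
  const int64_t consumed_length = kPrefixSize + static_cast<int64_t>(len);
  if (consumed_length > data_size) {
    throw std::runtime_error("EOF: byte array payload is truncated");
  }
  *out_len = len;
  return consumed_length;
}
```

Note that in an expression like `static_cast<int64_t>(4 + len)` with an unsigned 32-bit `len`, the addition is typically performed in 32 bits before the cast, so a very large length can wrap; widening first, as in the sketch, avoids that.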

fsaintjacques · 2020-03-23T19:59:09Z

cpp/src/parquet/encoding.cc

      builder->UnsafeAppend(data_);
      data_ += descr_->type_length();
+    } else {


It's possible that with this new lambda+visitor form, the compiler can't hoist descr_->type_length() (and maybe it didn't before because it's a double indirection). I'd say move this out of the loop just to be sure.

pitrou · 2020-03-24

That sounds dubious to me.

fsaintjacques · 2020-03-23T20:11:36Z

cpp/src/parquet/encoding.cc

      int32_t index;
      if (ARROW_PREDICT_FALSE(!idx_decoder_.Get(&index))) {
        throw ParquetException("");
      }
      builder->UnsafeAppend(dict_values[index].ptr);
+    } else {


Not related to your change, but the dict_values[index] lookup should bounds-check the index.

pitrou · 2020-03-24

Will do, thank you.
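A minimal sketch of the requested bounds check, assuming the dictionary is held in a `std::vector`. `CheckedDictLookup` is a hypothetical helper, and `std::out_of_range` stands in for `ParquetException`; the follow-up change may have done this differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper, not the code from the PR. Validates a decoded
// dictionary index before using it, since the index comes from file data.
template <typename T>
const T& CheckedDictLookup(const std::vector<T>& dict_values, int32_t index) {
  // Reject negative indices first, then compare against the size as unsigned.
  if (index < 0 || static_cast<size_t>(index) >= dict_values.size()) {
    throw std::out_of_range("Dictionary index out of bounds");
  }
  return dict_values[static_cast<size_t>(index)];
}
```

Because the index is signed, both ends of the range need checking; a corrupt file can produce a negative index as easily as one that is too large.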

fsaintjacques · 2020-03-23T20:11:50Z

cpp/src/parquet/encoding.cc

      int32_t index;
      if (ARROW_PREDICT_FALSE(!idx_decoder_.Get(&index))) {
        throw ParquetException("");
      }
      PARQUET_THROW_NOT_OK(builder->Append(dict_values[index].ptr));
+    } else {
+      PARQUET_THROW_NOT_OK(builder->AppendNull());


Ditto with bounds.

fsaintjacques · 2020-03-23T20:12:11Z

cpp/src/parquet/encoding.cc

      int32_t index;
      if (ARROW_PREDICT_FALSE(!idx_decoder_.Get(&index))) {
        throw ParquetException("");
      }
      builder->UnsafeAppend(dict_values[index]);
+    } else {
+      builder->UnsafeAppendNull();


wesm

Thanks for doing this long-overdue improvement. I don't have anything major to add, but I'm going to check the benchmarks locally.

wesm · 2020-03-24T00:54:45Z

Benchmarks show no perceptible difference, based on a quick glance.

pitrou · 2020-03-24T10:29:35Z

+1, will merge if CI green.

PARQUET-1819: [C++] Refactor decoding

30919e7

Also add an additional size check before reading the length of a byte array.

Try to fix MSVC failures

49b5be2

fsaintjacques self-requested a review March 23, 2020 12:24

Make diff nicer by not splitting source file

d24005d

pitrou force-pushed the PARQUET-1819-refactor branch from 8b06307 to d24005d Compare March 23, 2020 19:14

pitrou mentioned this pull request Mar 23, 2020

PARQUET-1824: [C++] Fix crashes and undefined behaviour on invalid input #6690

Closed

bkietz requested changes Mar 23, 2020

View reviewed changes

Nit

057c7af

fsaintjacques requested changes Mar 23, 2020

View reviewed changes

wesm reviewed Mar 24, 2020

View reviewed changes

Apply review comments

ae4e418

pitrou closed this in 4fb888f Mar 24, 2020

pitrou deleted the PARQUET-1819-refactor branch March 24, 2020 10:52

asfimport mentioned this pull request Jun 23, 2024

[C++][Parquet] Fix crashes on corrupt IPC input (OSS-Fuzz) #42940

Closed
