PARQUET-1095: [C++] Read and write Arrow decimal values #403

cpcloud · 2017-09-25T19:26:38Z

This depends on:

ARROW-1607
ARROW-1656
ARROW-1588
Add tests for writing different sizes of values

xhochy · 2017-09-25T19:33:31Z

src/parquet/arrow/schema.cc

@@ -565,12 +576,12 @@ Status FieldToNode(const std::shared_ptr<Field>& field,
      auto struct_type = std::static_pointer_cast<::arrow::StructType>(field->type());
      return StructToNode(struct_type, field->name(), field->nullable(), properties,
                          arrow_properties, out);
-    } break;


Accidental deletion?

No, the break isn't necessary because there's a return, right?

Ah, didn't see that in the quick review.
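
For anyone skimming this thread later, the point can be shown with a minimal sketch. `Kind` and `Describe` are hypothetical stand-ins, not code from this patch: when every statement path in a `case` block ends in `return`, a trailing `break` is unreachable and can be dropped.

```cpp
#include <cassert>
#include <string>

// Hypothetical enum standing in for the Arrow type ids switched on in FieldToNode.
enum class Kind { kStruct, kList, kOther };

// Each handled case returns directly, so no `break` is needed after the block.
std::string Describe(Kind kind) {
  switch (kind) {
    case Kind::kStruct: {
      return "struct";
    }  // a `break` here would be unreachable
    case Kind::kList: {
      return "list";
    }
    default:
      break;  // falls out of the switch to the final return
  }
  return "other";
}
```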

xhochy · 2017-09-25T19:33:39Z

src/parquet/arrow/schema.cc

    case ArrowType::LIST: {
      auto list_type = std::static_pointer_cast<::arrow::ListType>(field->type());
      return ListToNode(list_type, field->name(), field->nullable(), properties,
                        arrow_properties, out);
-    } break;


Accidental deletion?

cpcloud · 2017-10-02T22:04:00Z

I'm working on getting my Windows VM set up with parquet-cpp; I haven't forgotten about this.

wesm · 2017-09-28T23:39:11Z

src/parquet/arrow/arrow-reader-writer-test.cc

 const std::string test_traits<::arrow::StringType>::value("Test");              // NOLINT
 const std::string test_traits<::arrow::BinaryType>::value("\x00\x01\x02\x03");  // NOLINT
 const std::string test_traits<::arrow::FixedSizeBinaryType>::value("Fixed");    // NOLINT
+const ::arrow::Decimal128 test_traits<::arrow::DecimalType>::value(
+    "-83095209205923957.2323995");  // NOLINT


These static values are still an eyesore; we should try to generate unique values for all data types

wesm · 2017-09-28T23:43:04Z

src/parquet/arrow/reader.cc

+    // raw bytes that we can write to
+    uint8_t* out_ptr = data->mutable_data();
+
+    auto raw_bytes_to_decimal_bytes = [byte_width](const uint8_t* value,


Unclear whether this will inline, in case you care

wesm · 2017-09-28T23:44:06Z

src/parquet/arrow/reader.cc

+    if (null_count > 0) {
+      for (int64_t i = 0; i < length; ++i, out_ptr += type_length) {
+        if (!fixed_size_binary_array.IsNull(i)) {
+          raw_bytes_to_decimal_bytes(fixed_size_binary_array.GetValue(i), out_ptr);


Do we care that the unwritten slots will have undefined memory (as compared with a memset on the new buffer)?

The new buffer will have defined slots because I take a view on the bytes (int64_t* for high bits, uint64_t* for low bits) and assign values to *high/*low. That happens in BytesToIntegerPair.
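
A rough sketch of the kind of conversion described here, assuming the input is a big-endian two's complement value of 1 to 16 bytes (as in a Parquet FIXED_LEN_BYTE_ARRAY decimal). `Int128Parts` and `BigEndianBytesToParts` are illustrative names, and this is not the actual BytesToIntegerPair from the patch. Both output words are always assigned, so the slot that gets written never contains leftover memory.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative pair of 64-bit words making up a 128-bit two's complement value.
struct Int128Parts {
  int64_t high;
  uint64_t low;
};

// Convert `byte_width` big-endian two's complement bytes into high/low words.
Int128Parts BigEndianBytesToParts(const uint8_t* bytes, int32_t byte_width) {
  assert(byte_width >= 1 && byte_width <= 16);
  // Pre-fill with the sign: all ones if the top bit is set, zeros otherwise.
  const bool negative = (bytes[0] & 0x80) != 0;
  uint64_t high = negative ? ~uint64_t{0} : uint64_t{0};
  uint64_t low = high;
  // Shift the 128-bit accumulator left one byte at a time and append each byte.
  for (int32_t i = 0; i < byte_width; ++i) {
    high = (high << 8) | (low >> 56);
    low = (low << 8) | bytes[i];
  }
  // The unsigned-to-signed cast keeps the bit pattern on two's complement targets.
  return {static_cast<int64_t>(high), low};
}
```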

wesm · 2017-09-28T23:45:49Z

src/parquet/arrow/schema.cc

@@ -617,5 +629,73 @@ Status ToParquetSchema(const ::arrow::Schema* arrow_schema,
                         out);
 }

+int32_t DecimalSize(int32_t precision) {


Maybe add a comment to explain the origin of this monstrosity

Yep, will do.
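
One way to state where such a function comes from, as a hedged sketch rather than the patch's implementation: a signed two's complement integer of n bytes can hold any decimal of precision p when 2^(8n-1) >= 10^p, so the smallest n is ceil((p * log2(10) + 1) / 8). `MinBytesForPrecision` is a hypothetical name, and the 1 to 38 precision range is an assumption based on 128-bit decimals.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Smallest number of bytes whose signed two's complement range covers every
// decimal with `precision` digits: ceil((precision * log2(10) + 1) / 8).
int32_t MinBytesForPrecision(int32_t precision) {
  assert(precision >= 1 && precision <= 38);  // assumed range for 128-bit decimals
  const double bits = precision * std::log2(10.0) + 1.0;  // +1 for the sign bit
  return static_cast<int32_t>(std::ceil(bits / 8.0));
}
```

Under this formula, precision 9 fits in 4 bytes, precision 18 in 8 bytes, and precision 38 in 16 bytes.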

wesm · 2017-09-28T23:46:28Z

src/parquet/arrow/schema.h

@@ -85,6 +85,8 @@ ::arrow::Status PARQUET_EXPORT ToParquetSchema(const ::arrow::Schema* arrow_sche
                                               const WriterProperties& properties,
                                               std::shared_ptr<SchemaDescriptor>* out);

+int32_t PARQUET_EXPORT DecimalSize(int32_t precision);


Does this need to be exported in the DLL?

Only if dependents want to use it. We can leave it unexported for now, since it's only used internally.

wesm · 2017-09-28T23:47:10Z

src/parquet/arrow/test-util.h

+    }
+  }
+  return builder.Finish(out);
+}


We would do well to generate some random data (though we are not doing this enough elsewhere)

wesm · 2017-09-28T23:49:17Z

src/parquet/arrow/writer.cc

+      const uint8_t* raw_value = data.GetValue(i);
+      auto unsigned_64_bit = reinterpret_cast<const uint64_t*>(raw_value);
+      const uint64_t value[] = {::arrow::BitUtil::ToBigEndian(unsigned_64_bit[0]),
+                                ::arrow::BitUtil::ToBigEndian(unsigned_64_bit[1])};


Are we byte swapping on the way out?

Do you mean on the way in? These lines are performing the byte swapping on the way out.
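
To make the write-side swap concrete, here is a portable sketch that serializes a 128-bit value, given as high and low 64-bit words, into 16 big-endian bytes, the byte order Parquet uses for FIXED_LEN_BYTE_ARRAY decimals. `ToBigEndianBytes` is an illustrative helper, not the code in writer.cc, and it uses shifts instead of ::arrow::BitUtil::ToBigEndian so that it does not depend on host endianness.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Serialize (high, low) as 16 bytes, most significant byte first.
std::array<uint8_t, 16> ToBigEndianBytes(int64_t high, uint64_t low) {
  std::array<uint8_t, 16> out{};
  // The high word comes first in big-endian order, then the low word.
  const uint64_t words[2] = {static_cast<uint64_t>(high), low};
  for (int w = 0; w < 2; ++w) {
    for (int b = 0; b < 8; ++b) {
      // Extract bytes from the most significant end of each word.
      out[w * 8 + b] = static_cast<uint8_t>(words[w] >> (56 - 8 * b));
    }
  }
  return out;
}
```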

xhochy · 2017-10-09T07:09:53Z

cmake_modules/ThirdpartyToolchain.cmake

@@ -366,7 +366,7 @@ if (NOT ARROW_FOUND)
    -DARROW_BUILD_TESTS=OFF)

  if ("$ENV{PARQUET_ARROW_VERSION}" STREQUAL "")
-    set(ARROW_VERSION "8309556c7d2b0e14df1422baa574cf2de8c1bd3b")


Do we need more than 0.7.1 here?

If yes, I would like to make a parquet-cpp 1.3.1 release; otherwise, we can wait with merging this PR.

Yes, we should probably release 1.3.1 without this and then focus on getting Arrow 0.8.0 out by the end of the month. There is also https://issues.apache.org/jira/browse/PARQUET-1122 -- I am not sure if I want to block 1.3.1 over this but we will need to take care of that soon

This PR needs some very recent fixes in Arrow, so yes it needs more than 0.7.1. I'm fine waiting for arrow 0.8.0 then parquet-cpp 1.4.0

cpcloud · 2017-10-09T15:35:41Z

This patch now includes explicit support for reading decimals written by other systems as int32/int64, as per the Parquet spec. The patch includes small test datasets written with Spark.

wesm · 2017-10-25T16:15:49Z

Now that ARROW-1588 is merged, we can get this fixed up and then plan to release 1.4.0 with decimal support after Arrow 0.8.0 final is out (so this would be latter half of November)?

cpcloud · 2017-10-25T17:36:46Z

Sounds good to me. One last thing is to add tests for different byte width decimals (using random data generation).

xhochy · 2017-11-10T10:46:26Z

For the moment, this looks good. Re-ping me once this is ready to merge, then I can do a final pass.

wesm · 2017-11-10T14:40:17Z

Aside: I think we should rename DecimalArray to Decimal16Array (or Decimal128Array, perhaps, for consistency with using bits instead of bytes) in Arrow to permit smaller 4/8-byte decimals in the future without breaking the parquet-cpp API. I'll open a JIRA

wesm · 2017-11-10T14:42:09Z

see https://issues.apache.org/jira/browse/ARROW-1794

see related comments about this topic in the Kudu JIRA https://issues.apache.org/jira/browse/KUDU-721?focusedCommentId=16213209&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16213209

cpcloud · 2017-11-12T22:04:22Z

I'm working on ARROW-1794. Let's get that in before merging this patch.

wesm · 2017-11-11T22:57:25Z

src/parquet/arrow/arrow-reader-writer-test.cc

+  ASSERT_EQ(values->length(), expected.length());
+
+  // TODO(phillipc): Is there a better way to compare these two arrays?
+  // AssertArraysEqual requires the same type, but we only care about values in this case


We should create one, like "compare the array data"

wesm · 2017-11-11T22:58:09Z

src/parquet/arrow/arrow-reader-writer-test.cc

+    if (value_is_valid) {
+      uint32_t value = values->Value(i);
+      int64_t expected_value = expected.Value(i);
+      ASSERT_EQ(value, expected_value);


I'm surprised this doesn't cause a compiler warning, I guess equality-comparisons for signed-unsigned is ok?

It looks like this is standards-compliant behavior: http://en.cppreference.com/w/cpp/language/operator_arithmetic#Conversions. Specifically, this language:

Otherwise, if the signed operand's type can represent all values of the unsigned operand, the unsigned operand is converted to the signed operand's type

Since UINT32_MAX <= INT64_MAX, this is well-defined behavior.

That said, I'll add an explicit cast. I think it's more readable that way. The operands to ASSERT_EQ should also be reversed.
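
A small sketch of the conversion rule quoted above, with hypothetical helper names. Comparing a uint32_t with an int64_t converts the unsigned operand to int64_t, which preserves its value; the explicit cast just spells that out. The second helper shows, via an explicit cast, what the implicit conversion does when both operands are 32 bits wide, where the signed operand is converted to unsigned instead.

```cpp
#include <cassert>
#include <cstdint>

// uint32_t always fits in int64_t, so widening preserves the value.
bool SameValue(uint32_t actual, int64_t expected) {
  return static_cast<int64_t>(actual) == expected;
}

// With equal widths the signed operand is converted to unsigned, so -1
// compares equal to UINT32_MAX. The cast makes that conversion visible.
bool SameBitsAt32(uint32_t actual, int32_t expected) {
  return actual == static_cast<uint32_t>(expected);
}
```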

wesm · 2017-11-14T03:15:52Z

src/parquet/arrow/arrow-reader-writer-test.cc

+
+  std::shared_ptr<Array> expected_array;
+
+  ::arrow::DecimalBuilder builder(decimal_type, pool);


Do we also need to rename this builder?

Done in apache/arrow#1321

wesm · 2017-11-14T03:27:41Z

src/parquet/arrow/reader.cc

+    out_ptr_view[0] = ToLittleEndian(static_cast<uint64_t>(value));
+
+    // no need to byteswap here because we're either all ones or all zeros
+    out_ptr_view[1] = static_cast<uint64_t>(value < 0 ? -1 : 0);


The sign bit is in the same place in both cases (big vs little endian)? I'm clearly out of my depth on these details

Thanks for pointing this out. There are actually two bugs here, revisiting this after reading your comment and rereading my code.

I should be calling FromLittleEndian just before this line, not ToLittleEndian because parquet-mr writes all primitive values in little endian order. The current code works on little endian architectures but not on big endian.

value needs to be sign/zero extended if it's of type int32_t. That's done by simply upcasting to int64_t.

This particular line is performing sign/zero extension to the other 8 bytes of the 16 byte decimal value, which means it's either 64 ones if value is negative, or 64 zeros if it's zero or positive.

Let me know if this explanation doesn't make sense.

makes sense at a glance

I will add an example in the code to make this very concrete. It's not necessarily obvious unless one is familiar with the details of sign extension (which is a consequence of using a two's complement representation).
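
A concrete sketch of the sign extension described in this thread, under the assumption that the 128-bit value is represented as a low word followed by a high word. `Words128` and `SignExtendTo128` are illustrative names and this is not the code in reader.cc; it also leaves out the FromLittleEndian step, which only matters on big-endian hosts.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Illustrative layout: low 64 bits first, then high 64 bits.
struct Words128 {
  uint64_t low;
  uint64_t high;
};

// Widen an int32_t or int64_t to 128 bits by copying the sign into the high word.
template <typename T>
Words128 SignExtendTo128(T value) {
  static_assert(std::is_same<T, int32_t>::value || std::is_same<T, int64_t>::value,
                "expected a 32-bit or 64-bit signed integer");
  // Upcasting to int64_t sign-extends a 32-bit input.
  const int64_t widened = value;
  // The high word is 64 ones for negative values and 64 zeros otherwise.
  const uint64_t high = widened < 0 ? ~uint64_t{0} : uint64_t{0};
  return {static_cast<uint64_t>(widened), high};
}
```

For example, the int32_t value -5 becomes a low word of 0xFFFFFFFFFFFFFFFB and a high word of 0xFFFFFFFFFFFFFFFF, which together are the 128-bit two's complement encoding of -5.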

wesm · 2017-11-14T03:29:18Z

src/parquet/arrow/reader.cc

+        } break;
+        case ::parquet::Type::FIXED_LEN_BYTE_ARRAY: {
+          TRANSFER_DATA(::arrow::DecimalType, FLBAType);
+        } break;


Do any systems that we know of use BYTE_ARRAY?

I'm not aware of any. parquet-mr does support this, but both Hive and Spark always use FIXED_LEN_BYTE_ARRAY. (Spark optionally uses int32 or int64 in later versions.) I can implement this now, or in a follow-up patch.

wesm · 2017-11-14T03:31:10Z

src/parquet/arrow/schema.h

@@ -85,6 +85,8 @@ ::arrow::Status PARQUET_EXPORT ToParquetSchema(const ::arrow::Schema* arrow_sche
                                               const WriterProperties& properties,
                                               std::shared_ptr<SchemaDescriptor>* out);

+int32_t DecimalSize(int32_t precision);


I presume this does not need to be exported. It could go in a private header, also

No, this shouldn't be available to third parties.

wesm · 2017-11-14T03:35:36Z

src/parquet/arrow/writer.cc

+}
+
+template <>
+Status FileWriter::Impl::TypedWriteBatch<FLBAType, ::arrow::DecimalType>(


Decimal128Type

I didn't actually rename DecimalBuilder or DecimalType, just DecimalArray because I assumed we'd be using DecimalBuilder and DecimalType independent of the underlying storage. I think it makes sense to delimit these as well. I'll open a JIRA.

wesm · 2017-11-14T03:37:31Z

src/parquet/arrow/writer.cc

+
+  // TODO(phillipc): Look into whether our compilers will perform loop unswitching so we
+  // don't have to keep writing two loops to handle the case where we know there are no
+  // nulls


It seems like MSVC didn't use to do loop unswitching, but now maybe it does from MSVC 2013 onward (I guess this used to only be available in the Pro/Enterprise version of Visual C++). Did a bit of googling but didn't get a definitive answer

FWIW it looks like LLVM has a heuristic about whether it will unswitch: http://llvm.org/doxygen/LoopUnswitch_8cpp_source.html. I am not sure I like that. My inclination has generally been to unswitch by hand

Yeah, the heuristic is based on the number of basic blocks (points where control flow can take a different path depending on the value of a conditional) and the cost of instructions that might be generated (this seems difficult to predict without knowing the details of the cost of particular instructions) because loop unswitching doubles the number of loops every time there's an unswitching opportunity.

Since most or all of our unswitch opportunities are based on one condition and therefore double only once, I would be extremely surprised if we ever came close to the threshold.

In any event, it doesn't look like this optimization is implemented reliably enough across the compilers we want to support for us to start writing switched loops. I just wanted to note this here for posterity.
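
For reference, unswitching by hand looks roughly like the sketch below. `SumValid` is a hypothetical example, not code from this patch: the loop-invariant check on `null_count` is hoisted out of the loop, giving one loop that tests validity per element and one that does not.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum the non-null values, with the null_count test hoisted out of the loop.
int64_t SumValid(const std::vector<int64_t>& values, const std::vector<bool>& valid,
                 int64_t null_count) {
  assert(values.size() == valid.size());
  int64_t total = 0;
  if (null_count > 0) {
    // Slow path: some slots are null, so check validity for each element.
    for (std::size_t i = 0; i < values.size(); ++i) {
      if (valid[i]) total += values[i];
    }
  } else {
    // Fast path: no nulls, so the per-element branch disappears.
    for (std::size_t i = 0; i < values.size(); ++i) {
      total += values[i];
    }
  }
  return total;
}
```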

wesm · 2017-11-19T18:45:10Z

@cpcloud is this merge-ready once the build passes? I just opened https://issues.apache.org/jira/browse/ARROW-1836 about fixing the warning I just triaged here

cpcloud · 2017-11-19T19:17:20Z

@wesm Yep, was just about to ping. This is good to go.

cpcloud · 2017-11-19T22:28:47Z

@wesm I have one more change, which removes the loop from BytesToInteger and uses integer arithmetic operations only, even for byte sizes that are not powers of two.

wesm · 2017-11-19T22:36:44Z

Cool. +1, will merge when build passes

cpcloud · 2017-11-19T22:37:19Z

Thanks for fixing the warning!

wesm · 2017-11-20T04:19:58Z

Nice work! Now we can add some Python tests in Arrow and call it a wrap

xhochy reviewed Sep 25, 2017

xhochy approved these changes Sep 25, 2017

wesm reviewed Oct 3, 2017

xhochy reviewed Oct 9, 2017

wesm reviewed Nov 14, 2017

cpcloud added 14 commits November 19, 2017 17:29

PARQUET-1095: [C++] Read and write Arrow decimal values

2917a62

Do not use std::copy when reinterpret_cast will suffice

613255e

Clean up uint32 test

46dff15

Remove garbage values

028fb03

Checkpoint [ci skip]

3d243d5

Use arrow

1782da0

Proper dcheck call

5c9292b

Allocate scratch space to hold the byteswapped values

e162ca1

Fix deprecated API call

659fbc1

Bump arrow version

8808e4c

Remove specific randint call

1eee6a9

Remove specific template parameters

9ff7eb4

Use arrow random_decimals

30655d6

Parameterize on precision

7ab2e5c

cpcloud and others added 16 commits November 19, 2017 17:29

Reduce the number of decimal test cases

6c9e2a7

Copy from arrow for now

64748a8

IWYU

b2e0290

Update for ARROW-1794: rename DecimalArray to Decimal128Array

9f97c1d

Update arrow version

920832a

Cleanup iteration a bit

32a4abe

Fix issues

c5c4294

ARROW-1811

6036ca5

Reverse operand order and explicit cast

16935de

Update for ARROW-1811

da0a7eb

Fix reader writer test for unique kernel addition

e25c59b

Min commit that contains the unique kernel in arrow

51965cd

Add last_value_ init

83948ec

Refactor types.h

e4b02d3

Suppress C4996 due to arrow/util/variant.h

63018bc

Remove loop from BytesToInteger

8c3d222

wesm closed this in 6a2ed4f Nov 20, 2017

cpcloud deleted the PARQUET-1095 branch November 20, 2017 04:23

wjones127 mentioned this pull request Jan 19, 2023

MINOR: [C++][Parquet] Rephrase decimal annotation apache/arrow#33694

Merged

This was referenced Jun 23, 2024

[C++][Parquet] Read and write Arrow decimal values apache/arrow#42336

Closed

[C++][Parquet] Compatibility with C++ iterators apache/arrow#42774

Closed

