PARQUET-1095: [C++] Read and write Arrow decimal values #403
Conversation
@@ -565,12 +576,12 @@ Status FieldToNode(const std::shared_ptr<Field>& field,
      auto struct_type = std::static_pointer_cast<::arrow::StructType>(field->type());
      return StructToNode(struct_type, field->name(), field->nullable(), properties,
                          arrow_properties, out);
    } break;
Accidental deletion?
No, the `break` isn't necessary because there's a `return`, right?
Ah, didn't see that in the quick review.
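A minimal sketch of the pattern under discussion (the identifiers are illustrative, not the actual parquet-cpp code): when every `switch` case ends in a `return`, a trailing `break` is unreachable and can be dropped.

```cpp
#include <cassert>
#include <string>

// Illustrative only: each case returns, so no `break` is needed to leave
// the switch; control never reaches the end of the case block.
std::string Describe(int type_id) {
  switch (type_id) {
    case 0:
      return "struct";
    case 1:
      return "list";
    default:
      return "unknown";
  }
}
```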
case ArrowType::LIST: {
  auto list_type = std::static_pointer_cast<::arrow::ListType>(field->type());
  return ListToNode(list_type, field->name(), field->nullable(), properties,
                    arrow_properties, out);
} break;
Accidental deletion?
I'm working on getting my Windows VM set up with parquet-cpp; I haven't forgotten about this.
const std::string test_traits<::arrow::StringType>::value("Test");              // NOLINT
const std::string test_traits<::arrow::BinaryType>::value("\x00\x01\x02\x03");  // NOLINT
const std::string test_traits<::arrow::FixedSizeBinaryType>::value("Fixed");    // NOLINT
const ::arrow::Decimal128 test_traits<::arrow::DecimalType>::value(
    "-83095209205923957.2323995");  // NOLINT
These static values are still an eyesore; we should try to generate unique values for all data types
src/parquet/arrow/reader.cc
Outdated
// raw bytes that we can write to
uint8_t* out_ptr = data->mutable_data();

auto raw_bytes_to_decimal_bytes = [byte_width](const uint8_t* value,
Unclear whether this will inline, in case you care
src/parquet/arrow/reader.cc
Outdated
if (null_count > 0) {
  for (int64_t i = 0; i < length; ++i, out_ptr += type_length) {
    if (!fixed_size_binary_array.IsNull(i)) {
      raw_bytes_to_decimal_bytes(fixed_size_binary_array.GetValue(i), out_ptr);
Do we care that the unwritten slots will have undefined memory (as compared with a memset on the new buffer)?
The new buffer will have defined slots because I take a view on the bytes (`int64_t*` for high bits, `uint64_t*` for low bits) and assign values to `*high`/`*low`. That happens in `BytesToIntegerPair`.
src/parquet/arrow/schema.cc
Outdated
@@ -617,5 +629,73 @@ Status ToParquetSchema(const ::arrow::Schema* arrow_schema,
                       out);
}

int32_t DecimalSize(int32_t precision) {
Maybe add a comment to explain the origin of this monstrosity
Yep, will do.
Done
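For reference, the values behind a `DecimalSize`-style table can be derived from a simple bound (a hedged sketch, not the actual parquet-cpp implementation): the minimum byte width B must satisfy 2^(8B - 1) - 1 >= 10^precision - 1, with one bit reserved for the two's-complement sign.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hedged sketch: smallest byte width whose signed range covers all values
// of the given decimal precision. The +1 accounts for the sign bit; a
// lookup table would simply precompute these values.
int32_t MinDecimalBytes(int32_t precision) {
  return static_cast<int32_t>(
      std::ceil((precision * std::log2(10.0) + 1.0) / 8.0));
}
```

For example, precision 9 fits in 4 bytes (an int32 holds up to 2147483647), while precision 19 already needs 9 bytes because 10^19 - 1 exceeds INT64_MAX.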
src/parquet/arrow/schema.h
Outdated
@@ -85,6 +85,8 @@ ::arrow::Status PARQUET_EXPORT ToParquetSchema(const ::arrow::Schema* arrow_sche
                                               const WriterProperties& properties,
                                               std::shared_ptr<SchemaDescriptor>* out);

int32_t PARQUET_EXPORT DecimalSize(int32_t precision);
Does this need to be exported in the DLL?
Only if dependents want to use it. We can skip exporting it for now, since it's only used internally.
src/parquet/arrow/test-util.h
Outdated
  }
}
return builder.Finish(out);
}
We would do well to generate some random data (though we are not doing this enough elsewhere)
src/parquet/arrow/writer.cc
Outdated
const uint8_t* raw_value = data.GetValue(i);
auto unsigned_64_bit = reinterpret_cast<const uint64_t*>(raw_value);
const uint64_t value[] = {::arrow::BitUtil::ToBigEndian(unsigned_64_bit[0]),
                          ::arrow::BitUtil::ToBigEndian(unsigned_64_bit[1])};
Are we byte swapping on the way out?
Do you mean on the way in? These lines are performing the byte swapping on the way out.
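Since Parquet stores `FIXED_LEN_BYTE_ARRAY` decimals in big-endian byte order, a little-endian host must byte-swap each 64-bit word on the write path. A portable sketch of such a swap (illustrative only; the patch uses `::arrow::BitUtil::ToBigEndian` rather than this helper):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative byte swap: reverses the byte order of a 64-bit word.
// On a little-endian host this yields the big-endian representation.
uint64_t ByteSwap64(uint64_t v) {
  uint64_t out = 0;
  for (int i = 0; i < 8; ++i) {
    out = (out << 8) | ((v >> (i * 8)) & 0xFF);
  }
  return out;
}
```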
@@ -366,7 +366,7 @@ if (NOT ARROW_FOUND)
    -DARROW_BUILD_TESTS=OFF)

if ("$ENV{PARQUET_ARROW_VERSION}" STREQUAL "")
  set(ARROW_VERSION "8309556c7d2b0e14df1422baa574cf2de8c1bd3b")
Do we need more than 0.7.1 here?
If yes, I would like to make a parquet-cpp 1.3.1 release first; otherwise we can wait to merge this PR.
Yes, we should probably release 1.3.1 without this and then focus on getting Arrow 0.8.0 out by the end of the month. There is also https://issues.apache.org/jira/browse/PARQUET-1122 -- I am not sure if I want to block 1.3.1 over this but we will need to take care of that soon
This PR needs some very recent fixes in Arrow, so yes it needs more than 0.7.1. I'm fine waiting for arrow 0.8.0 then parquet-cpp 1.4.0
This patch now includes explicit support for reading decimals written in other systems as int32/int64, as per the Parquet spec. The patch includes small test datasets written in Spark.
Now that ARROW-1588 is merged, we can get this fixed up and then plan to release 1.4.0 with decimal support after Arrow 0.8.0 final is out (so this would be the latter half of November)?
Sounds good to me. One last thing is to add tests for different byte width decimals (using random data generation).
For the moment, this looks good. Re-ping me once this is ready to merge, then I can do a final pass.
Aside: I think we should call
See https://issues.apache.org/jira/browse/ARROW-1794 and the related comments about this topic in the Kudu JIRA: https://issues.apache.org/jira/browse/KUDU-721?focusedCommentId=16213209&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16213209
I'm working on ARROW-1794. Let's get that in before merging this patch.
ASSERT_EQ(values->length(), expected.length());

// TODO(phillipc): Is there a better way to compare these two arrays?
// AssertArraysEqual requires the same type, but we only care about values in this case
We should create one, like "compare the array data"
if (value_is_valid) {
  uint32_t value = values->Value(i);
  int64_t expected_value = expected.Value(i);
  ASSERT_EQ(value, expected_value);
I'm surprised this doesn't cause a compiler warning, I guess equality-comparisons for signed-unsigned is ok?
It looks like this is standards-compliant behavior: http://en.cppreference.com/w/cpp/language/operator_arithmetic#Conversions. Specifically, this language:

> Otherwise, if the signed operand's type can represent all values of the unsigned operand, the unsigned operand is converted to the signed operand's type

Since `UINT32_MAX <= INT64_MAX`, this is well-defined behavior.

That said, I'll add an explicit cast. I think it's more readable that way. The operands to `ASSERT_EQ` should also be reversed.
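A small sketch of the rule quoted above (the function name is illustrative): because `int64_t` can represent every `uint32_t` value, the unsigned operand converts to `int64_t` and the comparison is value-preserving; the explicit cast just makes that visible.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative: comparing uint32_t against int64_t. The unsigned operand
// is converted to int64_t (which can represent all uint32_t values), so
// the comparison is well defined; the cast makes the intent explicit.
bool Matches(uint32_t value, int64_t expected) {
  return static_cast<int64_t>(value) == expected;
}
```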
std::shared_ptr<Array> expected_array;

::arrow::DecimalBuilder builder(decimal_type, pool);
Do we also need to rename this builder?
Done in apache/arrow#1321
src/parquet/arrow/reader.cc
Outdated
out_ptr_view[0] = ToLittleEndian(static_cast<uint64_t>(value));

// no need to byteswap here because we're either all ones or all zeros
out_ptr_view[1] = static_cast<uint64_t>(value < 0 ? -1 : 0);
The sign bit is in the same place in both cases (big vs little endian)? I'm clearly out of my depth on these details
Thanks for pointing this out. There are actually two bugs here, revisiting this after reading your comment and rereading my code.

- I should be calling `FromLittleEndian` just before this line, not `ToLittleEndian`, because parquet-mr writes all primitive values in little-endian order. The current code works on little-endian architectures but not on big-endian ones.
- `value` needs to be sign/zero extended if it's of type `int32_t`. That's done by simply upcasting to `int64_t`.

This particular line performs sign/zero extension into the other 8 bytes of the 16-byte decimal value, which means it's either 64 ones if `value` is negative, or 64 zeros if it's zero or positive.

Let me know if this explanation doesn't make sense.
makes sense at a glance
I will add an example in the code to make this very concrete. It's not necessarily obvious unless one is familiar with the details of sign extension (which is a consequence of using a two's complement representation).
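A concrete sketch of the sign/zero extension described above (the names are hypothetical, not the patch's actual code): an `int32_t` decimal widens to the 16-byte two's-complement pair by upcasting the low word and filling the high word with all ones or all zeros.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: widen an INT32-stored decimal to (high, low)
// 64-bit words. Upcasting to int64_t sign-extends the value; the high
// word is then 64 ones for negative values and 64 zeros otherwise,
// which needs no byte swap because both patterns look the same in
// either byte order.
void WidenInt32Decimal(int32_t value, int64_t* high, uint64_t* low) {
  *low = static_cast<uint64_t>(static_cast<int64_t>(value));
  *high = value < 0 ? -1 : 0;
}
```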
src/parquet/arrow/reader.cc
Outdated
} break;
case ::parquet::Type::FIXED_LEN_BYTE_ARRAY: {
  TRANSFER_DATA(::arrow::DecimalType, FLBAType);
} break;
Do any systems that we know of use BYTE_ARRAY?
I'm not aware of any. parquet-mr does support this, but both Hive and Spark always use `FIXED_LEN_BYTE_ARRAY`. (Spark uses int32 or int64, optionally, in later versions.) I can implement this now, or in a follow-up patch.
@@ -85,6 +85,8 @@ ::arrow::Status PARQUET_EXPORT ToParquetSchema(const ::arrow::Schema* arrow_sche
                                               const WriterProperties& properties,
                                               std::shared_ptr<SchemaDescriptor>* out);

int32_t DecimalSize(int32_t precision);
I presume this does not need to be exported. It could also go in a private header.
No, this shouldn't be available to third parties.
src/parquet/arrow/writer.cc
Outdated
}

template <>
Status FileWriter::Impl::TypedWriteBatch<FLBAType, ::arrow::DecimalType>(
Decimal128Type
I didn't actually rename `DecimalBuilder` or `DecimalType`, just `DecimalArray`, because I assumed we'd be using `DecimalBuilder` and `DecimalType` independent of the underlying storage. I think it makes sense to delimit these as well. I'll open a JIRA.
// TODO(phillipc): Look into whether our compilers will perform loop unswitching so we
// don't have to keep writing two loops to handle the case where we know there are no
// nulls
It seems like MSVC didn't use to do loop unswitching, but maybe it does from MSVC 2013 onward (I guess this used to only be available in the Pro/Enterprise version of Visual C++). Did a bit of googling but didn't get a definitive answer.
FWIW, it looks like LLVM has a heuristic about whether it will unswitch: http://llvm.org/doxygen/LoopUnswitch_8cpp_source.html. I am not sure I like that; my inclination has generally been to unswitch by hand.
Yeah, the heuristic is based on the number of basic blocks (points where control flow can take a different path depending on the value of a conditional) and the cost of instructions that might be generated (this seems difficult to predict without knowing the details of the cost of particular instructions) because loop unswitching doubles the number of loops every time there's an unswitching opportunity.
Since most or all of our unswitch opportunities are based on one condition and therefore double only once, I would be extremely surprised if we ever came close to the threshold.
In any event it doesn't look like this optimization is reliably implemented in all of the compilers we want to support to start writing switched loops, I just wanted to note this here for posterity.
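The hand-unswitched shape discussed in this thread looks roughly like this (illustrative names, not the patch's code): hoist the null check out of the loop and write the two bodies separately instead of relying on the compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative manual loop unswitching: the `has_nulls` condition is
// loop-invariant, so testing it once outside the loops avoids a branch
// per element without depending on the compiler's unswitching pass.
int64_t SumValid(const std::vector<int64_t>& values,
                 const std::vector<bool>& valid, bool has_nulls) {
  int64_t sum = 0;
  if (has_nulls) {
    for (std::size_t i = 0; i < values.size(); ++i) {
      if (valid[i]) sum += values[i];
    }
  } else {
    for (std::size_t i = 0; i < values.size(); ++i) {
      sum += values[i];
    }
  }
  return sum;
}
```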
@cpcloud is this merge-ready once the build passes? I just opened https://issues.apache.org/jira/browse/ARROW-1836 about fixing the warning I just triaged here
@wesm Yep, was just about to ping. This is good to go.
@wesm I have one more change, which removes the loop from
Cool. +1, will merge when build passes
Thanks for fixing the warning!
Nice work! Now we can add some Python tests in Arrow and call it a wrap
This depends on: