Skip to content
This repository has been archived by the owner on May 10, 2024. It is now read-only.

PARQUET-1095: [C++] Read and write Arrow decimal values #403

Closed
wants to merge 30 commits into from
Closed

PARQUET-1095: [C++] Read and write Arrow decimal values #403

wants to merge 30 commits into from

Conversation

cpcloud
Copy link
Contributor

@cpcloud cpcloud commented Sep 25, 2017

This depends on:

@@ -565,12 +576,12 @@ Status FieldToNode(const std::shared_ptr<Field>& field,
auto struct_type = std::static_pointer_cast<::arrow::StructType>(field->type());
return StructToNode(struct_type, field->name(), field->nullable(), properties,
arrow_properties, out);
} break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidental deletion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the break isn't necessary because there's a return right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, didn't see that in the quick review.

case ArrowType::LIST: {
auto list_type = std::static_pointer_cast<::arrow::ListType>(field->type());
return ListToNode(list_type, field->name(), field->nullable(), properties,
arrow_properties, out);
} break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidental deletion?

@cpcloud
Copy link
Contributor Author

cpcloud commented Oct 2, 2017

I'm working on getting my windows VM setup with parquet-cpp, haven't forgotten about this.

const std::string test_traits<::arrow::StringType>::value("Test"); // NOLINT
const std::string test_traits<::arrow::BinaryType>::value("\x00\x01\x02\x03"); // NOLINT
const std::string test_traits<::arrow::FixedSizeBinaryType>::value("Fixed"); // NOLINT
const ::arrow::Decimal128 test_traits<::arrow::DecimalType>::value(
"-83095209205923957.2323995"); // NOLINT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These static values are still an eyesore; we should try to generate unique values for all data types

// raw bytes that we can write to
uint8_t* out_ptr = data->mutable_data();

auto raw_bytes_to_decimal_bytes = [byte_width](const uint8_t* value,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear whether this will inline, in case you care

if (null_count > 0) {
for (int64_t i = 0; i < length; ++i, out_ptr += type_length) {
if (!fixed_size_binary_array.IsNull(i)) {
raw_bytes_to_decimal_bytes(fixed_size_binary_array.GetValue(i), out_ptr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we care that the unwritten slots will have undefined memory (as compared with a memset on the new buffer)?

Copy link
Contributor Author

@cpcloud cpcloud Oct 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new buffer will have defined slots because I take a view on the bytes (int64_t* for high bits, uint64_t* for low bits) and assign values to *high/*low. That happens in BytesToIntegerPair.

@@ -617,5 +629,73 @@ Status ToParquetSchema(const ::arrow::Schema* arrow_schema,
out);
}

int32_t DecimalSize(int32_t precision) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment to explain the origin of this monstrosity

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, will do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -85,6 +85,8 @@ ::arrow::Status PARQUET_EXPORT ToParquetSchema(const ::arrow::Schema* arrow_sche
const WriterProperties& properties,
std::shared_ptr<SchemaDescriptor>* out);

int32_t PARQUET_EXPORT DecimalSize(int32_t precision);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be exported in the DLL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only if dependents want to use it. We can not export it for now as it's only used internally right now.

}
}
return builder.Finish(out);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would do well to generate some random data (though we are not doing this enough elsewhere)

const uint8_t* raw_value = data.GetValue(i);
auto unsigned_64_bit = reinterpret_cast<const uint64_t*>(raw_value);
const uint64_t value[] = {::arrow::BitUtil::ToBigEndian(unsigned_64_bit[0]),
::arrow::BitUtil::ToBigEndian(unsigned_64_bit[1])};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we byte swapping on the way out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean on the way in? These lines are performing the byte swapping on the way out.

@@ -366,7 +366,7 @@ if (NOT ARROW_FOUND)
-DARROW_BUILD_TESTS=OFF)

if ("$ENV{PARQUET_ARROW_VERSION}" STREQUAL "")
set(ARROW_VERSION "8309556c7d2b0e14df1422baa574cf2de8c1bd3b")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need more than 0.7.1 here?

If yes, I would like to make a parquet-cpp 1.3.1 release otherwise we can wait with this PR to be merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should probably release 1.3.1 without this and then focus on getting Arrow 0.8.0 out by the end of the month. There is also https://issues.apache.org/jira/browse/PARQUET-1122 -- I am not sure if I want to block 1.3.1 over this but we will need to take care of that soon

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR needs some very recent fixes in Arrow, so yes it needs more than 0.7.1. I'm fine waiting for arrow 0.8.0 then parquet-cpp 1.4.0

@cpcloud
Copy link
Contributor Author

cpcloud commented Oct 9, 2017

This patch now includes explicit support for reading Decimals written in other systems as int32/int64 as per the parquet spec. The patch includes small test datasets written in spark.

@wesm
Copy link
Member

wesm commented Oct 25, 2017

Now that ARROW-1588 is merged, we can get this fixed up and then plan to release 1.4.0 with decimal support after Arrow 0.8.0 final is out (so this would be latter half of November)?

@cpcloud
Copy link
Contributor Author

cpcloud commented Oct 25, 2017

Sounds good to me. One last thing is to add tests for different byte width decimals (using random data generation).

@xhochy
Copy link
Member

xhochy commented Nov 10, 2017

For the moment, this looks good. Reping me, once this is ready to merge, than I can do a final pass.

@wesm
Copy link
Member

wesm commented Nov 10, 2017

Aside: I think we should call DecimalArray instead Decimal16Array (or Decimal128Array, perhaps for consistency with using bits instead of bytes) in Arrow to permit smaller 4/8-byte decimals in the future without breaking the parquet-cpp API. I'll open a JIRA

@cpcloud
Copy link
Contributor Author

cpcloud commented Nov 12, 2017

I'm working on ARROW-1794. Let's get that in before merging this patch.

ASSERT_EQ(values->length(), expected.length());

// TODO(phillipc): Is there a better way to compare these two arrays?
// AssertArraysEqual requires the same type, but we only care about values in this case
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should create one, like "compare the array data"

if (value_is_valid) {
uint32_t value = values->Value(i);
int64_t expected_value = expected.Value(i);
ASSERT_EQ(value, expected_value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised this doesn't cause a compiler warning, I guess equality-comparisons for signed-unsigned is ok?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is standards-compliant behavior: http://en.cppreference.com/w/cpp/language/operator_arithmetic#Conversions. Specifically, this language:

Otherwise, if the signed operand's type can represent all values of the unsigned operand, the unsigned operand is converted to the signed operand's type

Since UINT32_MAX <= INT64_MAX then this is well defined behavior.

That said, I'll add an explicit cast. I think it's more readable that way. The operands to ASSERT_EQ should also be reversed.


std::shared_ptr<Array> expected_array;

::arrow::DecimalBuilder builder(decimal_type, pool);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also need to rename this builder?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out_ptr_view[0] = ToLittleEndian(static_cast<uint64_t>(value));

// no need to byteswap here because we're either all ones or all zeros
out_ptr_view[1] = static_cast<uint64_t>(value < 0 ? -1 : 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sign bit is in the same place in both cases (big vs little endian)? I'm clearly out of my depth on these details

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out. There are actually two bugs here, revisiting this after reading your comment and rereading my code.

  1. I should be calling FromLittleEndian just before this line, not ToLittleEndian because parquet-mr writes all primitive values in little endian order. The current code works on little endian architectures but not on big endian.
  2. value needs to be sign/zero extended if it's of type int32_t. That's done by simply upcasting to int64_t.

This particular line is performing sign/zero extension to the other 8 bytes of the 16 byte decimal value, which means it's either 64 ones if value is negative, or 64 zeros if it's zero or positive.

Let me know if this explanation doesn't make sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense at a glance

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add an example in the code to make this very concrete. It's not necessarily obvious unless one is familiar with the details of sign extension (which is a consequence of using a two's complement representation).

} break;
case ::parquet::Type::FIXED_LEN_BYTE_ARRAY: {
TRANSFER_DATA(::arrow::DecimalType, FLBAType);
} break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do any systems that we know of use BYTE_ARRAY?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not aware of any. parquet-mr does support this, but both hive and spark always use FIXED_LEN_BYTE_ARRAY. (spark uses int32 or int64, optionally, in later versions). I can implement this now, or in a follow up patch.

@@ -85,6 +85,8 @@ ::arrow::Status PARQUET_EXPORT ToParquetSchema(const ::arrow::Schema* arrow_sche
const WriterProperties& properties,
std::shared_ptr<SchemaDescriptor>* out);

int32_t DecimalSize(int32_t precision);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume this does not need to be exported. It could go in a private header, also

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this shouldn't be available to third parties.

}

template <>
Status FileWriter::Impl::TypedWriteBatch<FLBAType, ::arrow::DecimalType>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decimal128Type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't actually rename DecimalBuilder or DecimalType, just DecimalArray because I assumed we'd be using DecimalBuilder and DecimalType independent of the underlying storage. I think it makes sense to delimit these as well. I'll open a JIRA.


// TODO(phillipc): Look into whether our compilers will perform loop unswitching so we
// don't have to keep writing two loops to handle the case where we know there are no
// nulls
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like MSVC didn't used to do loop unswitching but now maybe it does from MSVC 2013 onward (I guess this used to only be available in the Pro/Enterprise version of visual C++). Did a bit of googling but didn't get a definitive answer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW it looks like LLVM has a heuristic about whether it will unswitch: http://llvm.org/doxygen/LoopUnswitch_8cpp_source.html. I am not sure I like that. my inclination has generally been to unswitch by hand

Copy link
Contributor Author

@cpcloud cpcloud Nov 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the heuristic is based on the number of basic blocks (points where control flow can take a different path depending on the value of a conditional) and the cost of instructions that might be generated (this seems difficult to predict without knowing the details of the cost of particular instructions) because loop unswitching doubles the number of loops every time there's an unswitching opportunity.

Since most or all of our unswitch opportunities are based on one condition and therefore double only once, I would be extremely surprised if we ever came close to the threshold.

In any event it doesn't look like this optimization is reliably implemented in all of the compilers we want to support to start writing switched loops, I just wanted to note this here for posterity.

@wesm
Copy link
Member

wesm commented Nov 19, 2017

@cpcloud is this merge-ready once the build passes? I just opened https://issues.apache.org/jira/browse/ARROW-1836 about fixing the warning I just triaged here

@cpcloud
Copy link
Contributor Author

cpcloud commented Nov 19, 2017

@wesm Yep, was just about to ping. This is good to go.

@cpcloud
Copy link
Contributor Author

cpcloud commented Nov 19, 2017

@wesm I have one more change, which removes the loop from BytesToInteger and uses integer arithmetic operations only, even for byte sizes that are not powers of two.

@wesm
Copy link
Member

wesm commented Nov 19, 2017

Cool. +1, will merge when build passes

@cpcloud
Copy link
Contributor Author

cpcloud commented Nov 19, 2017

Thanks for fixing the warning!

@wesm wesm closed this in 6a2ed4f Nov 20, 2017
@wesm
Copy link
Member

wesm commented Nov 20, 2017

Nice work! Now we can add some Python tests in Arrow and call it a wrap

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
3 participants