ARROW-655: [C++/Python] Implement DecimalArray #403

cpcloud · 2017-03-19T21:51:15Z

Adds Decimal support for C++ and Python.

TODOs:

Tighten up some of the GIL acquisition. E.g., we may not need to hold it when importing the decimal module if we acquire it where we import the decimal module.
Investigate FreeBSD issue (manifesting on OS X) where typeinfo symbols for __int128_t are not exported: https://bugs.llvm.org//show_bug.cgi?id=26156.
See if there's a better way to visit scalar decimals, rather than keeping extra state on the class. Seems like an unacceptable hack.

wesm · 2017-03-19T22:55:11Z

Nice! I will review this in more detail when I can. I did some refactoring in #404 which will introduce some minor rebase conflicts under arrow/ipc, but it shouldn't be too bad to fix.

wesm · 2017-03-19T22:57:44Z

cpp/src/arrow/util/decimal.h

+
+namespace arrow {
+
+typedef __int128_t int128_t;


MSVC doesn't seem to have int128_t, does boost have something to backfill this? https://msdn.microsoft.com/en-us/library/cc953fe1.aspx

When I first started this patch, I was using boost::multiprecision::int128_t, which would work across platforms. It's significantly more complex. Of course "more complex and working" is infinitely better than "less complex and not working at all" :)

gotcha, i reckon that can be tackled in a follow up patch. as long as this patch compiles in Appveyor we are good

cpcloud · 2017-03-20T04:37:12Z

Going to redo parts of the patch with boost. Lots of issues are showing up with __int128_t. Seems to only reliably work with GCC.

wesm · 2017-03-20T14:05:17Z

Sorry about the rebase conflicts. They shouldn't be too bad, so to summarize

ipc-adapter-test consolidated into ipc-read-write-test.cc
metadata-internal.cc consolidated into metadata.cc

tebeka · 2017-03-21T09:59:21Z

cpp/src/arrow/util/decimal.h

+  // TODO(phillipc): There's probably a more intelligent way to do this
+  size_t i = 0;
+  *out = static_cast<int128_t>(0);
+  for (auto iter = fractional.crbegin(); iter != fractional.crend(); ++iter) {


Create int128 atoi128(std::string s) utility function?

I'm going to refactor this to use boost multiprecision, which has conversion functions so there's no need for this.

cpcloud · 2017-04-02T04:32:46Z

Not quite ready for review, need to make sure I'm not including anything from previous commits and fix some of the styling crap.

cpcloud · 2017-04-02T04:47:12Z

We need to use at least boost 1.60 for this PR to work.

cpcloud · 2017-04-02T04:47:53Z

@wesm Is there a hard requirement to maintain compatibility with boost < 1.60?

cpcloud · 2017-04-02T05:50:52Z

Actually, I think I can work around it for now.

wesm · 2017-04-02T15:26:52Z

hm, boost on Ubuntu 14.04 is 1.57, which would be useful to support, but i understand there's been ongoing work in boost::multiprecision in recent releases

cpcloud · 2017-04-02T21:31:28Z

cpp/src/arrow/array-decimal-test.cc

+  int precision = GetParam();
+  std::vector<Decimal128> draw = {Decimal128(1), Decimal128(2), Decimal128(-1),
+      Decimal128(4), Decimal128(-1), Decimal128(1), Decimal128(2),
+      Decimal128("230342903942.234234"), Decimal128("-23049302932.235234")};


I need to redo this test, because we don't actually check here that values are all compatible with their respective types (i.e., same precision and scale).

cpcloud · 2017-04-02T21:34:19Z

cpp/src/arrow/builder.cc

+  // TODO(phillipc): calculate the proper storage size here (do we have a function to do
+  // this)?
+  // TODO(phillipc): Reserve number of elements
+  RETURN_NOT_OK(sign_bitmap_->Reserve(1));


I think this is wrong, and needs to be fixed. I need to poke around to see if there are some functions lying around to reserve memory for bitmaps.

I don't think so, but that would be useful -- there's other places in the builders where bitmaps need to get expanded

Yep, I'm right in the middle of writing them :)

wesm · 2017-04-03T22:07:02Z

This needs a rebase, sorry about that.

wesm

Wow, this was hairy, nice work. I mostly noted C++ style nitpicks, questions around symbol visibility (we'll see what Travis CI / Appveyor have to say about that), toolchain questions, with some error checking stuff

wesm · 2017-04-07T02:20:16Z

ci/travis_before_script_cpp.sh

@@ -31,7 +31,8 @@ CPP_DIR=$TRAVIS_BUILD_DIR/cpp

 CMAKE_COMMON_FLAGS="\
 -DARROW_BUILD_BENCHMARKS=ON \
-DCMAKE_INSTALL_PREFIX=$ARROW_CPP_INSTALL"
+-DCMAKE_INSTALL_PREFIX=$ARROW_CPP_INSTALL \
+-DCMAKE_CXX_STANDARD=11"


Does this do something other than -std=c++11?

Indeed it does: https://cmake.org/cmake/help/v3.1/prop_tgt/CXX_STANDARD.html#prop_tgt:CXX_STANDARD

OK, we should either

a) set this variable inside our CMakeLists.txt so it doesn't have to be specified on the command line
b) rely on the existing -std=c++11 being set in the CXX_COMMON_FLAGS https://github.com/apache/arrow/blob/master/cpp/cmake_modules/SetupCxxFlags.cmake#L47

but not both

agreed. I will remove it here and add a follow up patch to fix it more generically, and enforce that we must use c++11

wesm · 2017-04-07T02:20:52Z

ci/travis_script_python.sh

@@ -112,7 +113,7 @@ python_version_tests() {
  # Other stuff pip install
  pip install -r requirements.txt

-  python setup.py build_ext --inplace --with-parquet --with-jemalloc
+  python setup.py build_ext --inplace --with-parquet --with-jemalloc --extra-cmake-args="-DCMAKE_CXX_STANDARD=11"


Same questions as above, in theory this should be handled internally in the CMAKE_CXX_FLAGS

See the above link

removing, see above comment.

Hm I thought I removed this.

wesm · 2017-04-07T02:23:46Z

cpp/src/arrow/array-decimal-test.cc

+    }
+
+    std::vector<uint8_t> raw_bytes(data(draw, byte_width, expected_sign_bitmap_data));
+    auto expected_data(std::make_shared<Buffer>(raw_bytes.data(), size * byte_width));


auto expected_data = std::make_shared<Buffer>(...) is a bit more typical

wesm · 2017-04-07T02:24:07Z

cpp/src/arrow/array-decimal-test.cc

+
+    std::vector<uint8_t> raw_bytes(data(draw, byte_width, expected_sign_bitmap_data));
+    auto expected_data(std::make_shared<Buffer>(raw_bytes.data(), size * byte_width));
+    auto expected_null_bitmap(test::bytes_to_null_buffer(valid_bytes));


same with assignment here (auto foo = test::...)

wesm · 2017-04-07T02:24:30Z

cpp/src/arrow/array-decimal-test.cc

+    int64_t expected_null_count = test::null_count(valid_bytes);
+    auto expected(std::make_shared<DecimalArray>(type, size, expected_data,
+        expected_null_bitmap, expected_null_count, offset,
+        std::dynamic_pointer_cast<PoolBuffer>(expected_sign_bitmap)));


is this std::dynamic_pointer_cast needed?

nope, fixed.

wesm · 2017-04-07T03:01:01Z

cpp/src/arrow/util/decimal-test.cc

+  ToBytes(expected, &bytes);
+
+  Decimal32 result;
+  FromBytes(bytes, &result);


wesm · 2017-04-07T03:02:13Z

cpp/src/arrow/util/decimal.cc

+
+Status FromBytes(const uint8_t* bytes, bool is_negative, Decimal128* decimal) {
+  auto& decimal_value(decimal->value);
+  int128_t::backend_type& backend(decimal_value.backend());


const backend_type& backend = decimal_value.backend()?

i can't do that because i'm mutating the backend with resize and memcpy below.

wesm · 2017-04-07T03:03:09Z

cpp/src/arrow/util/decimal.cc

+}
+
+Status ToBytes(const Decimal64& value, uint8_t** bytes) {
+  std::memcpy(*bytes, reinterpret_cast<const uint8_t*>(&value.value), sizeof(int64_t));


you could also assign to *reinterpret_cast<int32_t*>(*bytes)

wesm · 2017-04-07T03:04:08Z

cpp/src/arrow/util/decimal.h

+
+typedef Decimal<int32_t> Decimal32;
+typedef Decimal<int64_t> Decimal64;
+typedef Decimal<int128_t> Decimal128;


use using instead of typedef

wesm · 2017-04-07T03:10:39Z

python/pyarrow/schema.pyx

+    cdef shared_ptr[CDataType] decimal_type
+    decimal_type.reset(new CDecimalType(precision, scale))
+    out.init(decimal_type)
+    return out


you could use box_data_type here

wesm · 2017-04-07T20:07:12Z

Oops, looks like 1c66097 broke glib, fixing now

wesm · 2017-04-08T13:31:26Z

cpp/src/arrow/CMakeLists.txt

 ADD_ARROW_BENCHMARK(column-benchmark)
+ARROW_BENCHMARK_LINK_LIBRARIES(column-benchmark ${BOOST_REGEX_LIBRARY})


You shouldn't have to link this library here, or in any library that links to libarrow. I think you only need to add boost_regex to these lists of link libs: https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L695

wesm · 2017-04-08T13:32:41Z

cpp/src/arrow/array.h

@@ -29,6 +29,7 @@
 #include "arrow/type_fwd.h"
 #include "arrow/type_traits.h"
 #include "arrow/util/bit-util.h"
+#include "arrow/util/decimal.h"


You're leaking boost/multiprecision headers in the public API here, can this be removed?

Yep I can fix that.

wesm · 2017-04-08T13:33:06Z

cpp/src/arrow/builder.h

@@ -30,6 +30,7 @@
 #include "arrow/type.h"
 #include "arrow/type_traits.h"
 #include "arrow/util/bit-util.h"
+#include "arrow/util/decimal.h"


Leaking boost headers in public API, we can't do that

wesm · 2017-04-08T13:33:42Z

cpp/src/arrow/ipc/CMakeLists.txt

@@ -181,4 +187,5 @@ endif()

 ADD_ARROW_BENCHMARK(ipc-read-write-benchmark)
 ARROW_BENCHMARK_LINK_LIBRARIES(ipc-read-write-benchmark
-  ${ARROW_IPC_TEST_LINK_LIBS})
+  ${ARROW_IPC_TEST_LINK_LIBS}
+  ${BOOST_REGEX_LIBRARY})


Shouldn't be needed in any of these linker statements

wesm · 2017-04-08T13:39:39Z

Based on whether ARROW_BOOST_USE_SHARED is set, you need to add appropriate link libraries to

https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L798

wesm · 2017-04-09T17:30:24Z

cpp/src/arrow/type.h

@@ -28,6 +28,7 @@

 #include "arrow/status.h"
 #include "arrow/type_fwd.h"
+#include "arrow/util/decimal.h"


Sadly, this is leaking boost/multiprecision

ah ok, will fix

cpcloud · 2017-04-09T18:12:45Z

Looks like I'm leaking in helpers.h as well.

wesm · 2017-04-09T18:13:38Z

indeed, I am OK with making helpers.h an internal header also, maybe remove it from api.h https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/api.h

cpcloud · 2017-04-09T18:24:56Z

cool, removing.

This reverts commit 7f6ea7e.

cpcloud · 2017-04-09T18:47:08Z

I'll remove the helpers.h header in a follow up patch. I'd like to move the C++ specific things that help with python interop that aren't used in Python into a header file that isn't in python/api.h all at once.

wesm

+1, really great to get this done. Will merge once the build passes

wesm · 2017-04-09T19:19:41Z

Build passed except for conda s3 timeout. Any idea why this keeps happening?

cpcloud · 2017-04-09T19:28:09Z

Hm, not sure. I wonder if this might be a symptom of rate limiting.

This depends on: - [x] [ARROW-1607](apache#1128) - [x] [ARROW-1656](apache#1184) - [x] [ARROW-1588](apache#1211) - [x] Add tests for writing different sizes of values Author: Phillip Cloud <cpcloud@gmail.com> Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#403 from cpcloud/PARQUET-1095 and squashes the following commits: 8c3d222 [Phillip Cloud] Remove loop from BytesToInteger 63018bc [Wes McKinney] Suppress C4996 due to arrow/util/variant.h e4b02d3 [Phillip Cloud] Refactor types.h 83948ec [Phillip Cloud] Add last_value_ init 51965cd [Phillip Cloud] Min commit that contains the unique kernel in arrow e25c59b [Phillip Cloud] Fix reader writer test for unique kernel addition da0a7eb [Phillip Cloud] Update for ARROW-1811 16935de [Phillip Cloud] Reverse operand order and explicit cast 6036ca5 [Phillip Cloud] ARROW-1811 c5c4294 [Phillip Cloud] Fix issues 32a4abe [Phillip Cloud] Cleanup iteration a bit 920832a [Phillip Cloud] Update arrow version 9f97c1d [Phillip Cloud] Update for ARROW-1794: rename DecimalArray to Decimal128Array b2e0290 [Phillip Cloud] IWYU 64748a8 [Phillip Cloud] Copy from arrow for now 6c9e2a7 [Phillip Cloud] Reduce the number of decimal test cases 7ab2e5c [Phillip Cloud] Parameterize on precision 30655d6 [Phillip Cloud] Use arrow random_decimals 9ff7eb4 [Phillip Cloud] Remove specific template parameters 1eee6a9 [Phillip Cloud] Remove specific randint call 8808e4c [Phillip Cloud] Bump arrow version 659fbc1 [Phillip Cloud] Fix deprecated API call e162ca1 [Phillip Cloud] Allocate scratch space to hold the byteswapped values 5c9292b [Phillip Cloud] Proper dcheck call 1782da0 [Phillip Cloud] Use arrow 3d243d5 [Phillip Cloud] Checkpoint [ci skip] 028fb03 [Phillip Cloud] Remove garbage values 46dff15 [Phillip Cloud] Clean up uint32 test 613255e [Phillip Cloud] Do not use std::copy when reinterpret_cast will suffice 2917a62 [Phillip Cloud] PARQUET-1095: [C++] Read and write Arrow decimal values

This depends on: - [x] [ARROW-1607](apache#1128) - [x] [ARROW-1656](apache#1184) - [x] [ARROW-1588](apache#1211) - [x] Add tests for writing different sizes of values Author: Phillip Cloud <cpcloud@gmail.com> Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#403 from cpcloud/PARQUET-1095 and squashes the following commits: 8c3d222 [Phillip Cloud] Remove loop from BytesToInteger 63018bc [Wes McKinney] Suppress C4996 due to arrow/util/variant.h e4b02d3 [Phillip Cloud] Refactor types.h 83948ec [Phillip Cloud] Add last_value_ init 51965cd [Phillip Cloud] Min commit that contains the unique kernel in arrow e25c59b [Phillip Cloud] Fix reader writer test for unique kernel addition da0a7eb [Phillip Cloud] Update for ARROW-1811 16935de [Phillip Cloud] Reverse operand order and explicit cast 6036ca5 [Phillip Cloud] ARROW-1811 c5c4294 [Phillip Cloud] Fix issues 32a4abe [Phillip Cloud] Cleanup iteration a bit 920832a [Phillip Cloud] Update arrow version 9f97c1d [Phillip Cloud] Update for ARROW-1794: rename DecimalArray to Decimal128Array b2e0290 [Phillip Cloud] IWYU 64748a8 [Phillip Cloud] Copy from arrow for now 6c9e2a7 [Phillip Cloud] Reduce the number of decimal test cases 7ab2e5c [Phillip Cloud] Parameterize on precision 30655d6 [Phillip Cloud] Use arrow random_decimals 9ff7eb4 [Phillip Cloud] Remove specific template parameters 1eee6a9 [Phillip Cloud] Remove specific randint call 8808e4c [Phillip Cloud] Bump arrow version 659fbc1 [Phillip Cloud] Fix deprecated API call e162ca1 [Phillip Cloud] Allocate scratch space to hold the byteswapped values 5c9292b [Phillip Cloud] Proper dcheck call 1782da0 [Phillip Cloud] Use arrow 3d243d5 [Phillip Cloud] Checkpoint [ci skip] 028fb03 [Phillip Cloud] Remove garbage values 46dff15 [Phillip Cloud] Clean up uint32 test 613255e [Phillip Cloud] Do not use std::copy when reinterpret_cast will suffice 2917a62 [Phillip Cloud] PARQUET-1095: [C++] Read and write Arrow decimal values Change-Id: Ibe81cd5a5961bbe86c66db811ec8b770ae48c38b

cpcloud force-pushed the decimal branch 2 times, most recently from 6e49399 to 49a5780 Compare March 19, 2017 22:38

wesm reviewed Mar 19, 2017

View reviewed changes

cpcloud force-pushed the decimal branch from 49a5780 to 983edde Compare March 20, 2017 00:45

tebeka reviewed Mar 21, 2017

View reviewed changes

cpcloud force-pushed the decimal branch 2 times, most recently from be209e1 to 390be34 Compare March 23, 2017 01:26

cpcloud force-pushed the decimal branch from 390be34 to 4e553d0 Compare April 2, 2017 04:31

cpcloud force-pushed the decimal branch from 4e553d0 to 8941d91 Compare April 2, 2017 04:40

cpcloud force-pushed the decimal branch 2 times, most recently from dbb08ef to 7fd7185 Compare April 2, 2017 21:27

cpcloud commented Apr 2, 2017

View reviewed changes

cpcloud force-pushed the decimal branch from 7fd7185 to d31c8ee Compare April 2, 2017 21:36

cpcloud force-pushed the decimal branch 4 times, most recently from 4efcfcd to 57ef5de Compare April 7, 2017 00:01

wesm reviewed Apr 7, 2017

View reviewed changes

cpcloud force-pushed the decimal branch from eecf060 to 8731ed0 Compare April 7, 2017 19:07

wesm reviewed Apr 8, 2017

View reviewed changes

cpcloud force-pushed the decimal branch 3 times, most recently from f6aed22 to ef5c159 Compare April 8, 2017 21:44

wesm reviewed Apr 9, 2017

View reviewed changes

cpcloud force-pushed the decimal branch from 6bf432b to 7deadf8 Compare April 9, 2017 18:23

cpcloud added 8 commits April 9, 2017 14:44

ARROW-655: [C++/Python] Implement DecimalArray

805bbac

Remove python extra cmake args

e63b766

Do not link boost regex to jemalloc

cb2c1ac

Move IsNegative to cc file

bf2a7ea

Transitively link static libs as well

a7109b2

Revert "Transitively link static libs as well"

f764156

This reverts commit 7f6ea7e.

Remove more boost leakage

07713a7

Remove unnecessary header in helpers.h

e5470fd

cpcloud force-pushed the decimal branch from 7deadf8 to e5470fd Compare April 9, 2017 18:48

wesm approved these changes Apr 9, 2017

View reviewed changes

asfgit closed this in 754bcce Apr 9, 2017

cpcloud deleted the decimal branch April 9, 2017 19:20

		ADD_ARROW_BENCHMARK(column-benchmark)
		ARROW_BENCHMARK_LINK_LIBRARIES(column-benchmark ${BOOST_REGEX_LIBRARY})

ARROW-655: [C++/Python] Implement DecimalArray #403

ARROW-655: [C++/Python] Implement DecimalArray #403

Conversation

cpcloud commented Mar 19, 2017 • edited Loading

wesm commented Mar 19, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cpcloud commented Mar 20, 2017

wesm commented Mar 20, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cpcloud commented Apr 2, 2017

cpcloud commented Apr 2, 2017

cpcloud commented Apr 2, 2017

cpcloud commented Apr 2, 2017

wesm commented Apr 2, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wesm commented Apr 3, 2017

wesm left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wesm commented Apr 7, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wesm commented Apr 8, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cpcloud commented Apr 9, 2017

wesm commented Apr 9, 2017

cpcloud commented Apr 9, 2017

cpcloud commented Apr 9, 2017

wesm left a comment

Choose a reason for hiding this comment

wesm commented Apr 9, 2017

cpcloud commented Apr 9, 2017

cpcloud commented Mar 19, 2017 •

edited

Loading