WIP ARROW-2193: [C++] Do not depend on Boost libraries at runtime in plasma_store#1711

Closed
wesm wants to merge 1590 commits into apache:master from wesm:ARROW-2193

Member

wesm commented Mar 6, 2018

This is sort of a hack; I wasn't sure how to deal with this more generally. Unfortunately, this only gets rid of the boost_system and boost_filesystem runtime dependencies. boost_regex is still pulled in as a transitive dependency somehow.

wesm and others added 30 commits December 12, 2017 17:22
Change-Id: I2a7909e2f0fa87780197270982ef941d89834cca
Author: Philipp Moritz <pcmoritz@gmail.com>

Closes apache#1420 from pcmoritz/revert-to-pickle-arg and squashes the following commits:

bfef3ae [Philipp Moritz] fix windows test
c156653 [Philipp Moritz] fix remote serialization test on windows
3f58d0d [Philipp Moritz] fix windows
6a2a83d [Philipp Moritz] add regression test
3eb9325 [Philipp Moritz] fix
518fb7d [Philipp Moritz] fix
b488586 [Philipp Moritz] revert to pickle=True argument for serialization
The option is used in building deb package.

`arrow/gpu/cuda_version.h` exists in build directory not source directory.

Author: Kouhei Sutou <kou@clear-code.com>

Closes apache#1429 from kou/glib-fix-build-error-with-arrow-cpp-build-dir and squashes the following commits:

879d6bc [Kouhei Sutou] [GLib] Fix build error with --with-arrow-cpp-build-dir
This is not a 0.8.0 blocker.

If 0.8.0 RC2 is dropped, I hope that 0.8.0 includes this.
If RC2 has no problem, I hope that 0.9.0 includes this.

Author: Kouhei Sutou <kou@clear-code.com>

Closes apache#1424 from kou/glib-add-timestamp-data-type-get-unit and squashes the following commits:

a88771b [Kouhei Sutou] [GLib] Add garrow_timestamp_data_type_get_unit()
Add JSON reader, as well as `js/bin/integration.js` script for running integration test validation

Author: Paul Taylor <paul.e.taylor@me.com>
Author: Brian Hulette <hulettbh@gmail.com>
Author: Brian Hulette <brian.hulette@ccri.com>

Closes apache#1343 from TheNeuralBit/json-reader and squashes the following commits:

bd6e80c [Paul Taylor] print correct error messages
f1a51bd [Paul Taylor] update example html file for new API
bb0059b [Paul Taylor] fix off by one error reading buffers from metadata v3 arrows
6da297c [Paul Taylor] update CI and JS integration scripts to invoke jest integration tests
03d82bd [Paul Taylor] add integration tests to gulp test task
937dbf8 [Paul Taylor] split out integration and unit tests
e0354a7 [Paul Taylor] quote enum keys to save from mangling, add JSON customMetadata map
c49f43a [Paul Taylor] use string indexers to protect JSON fields from closure compiler's mangler, fix es5 umd build
97f8e5e [Paul Taylor] really flatten buffers from json
7b6ea0a [Paul Taylor] fix a few json reader typos
fa352ea [Paul Taylor] update tests
f81dcb9 [Paul Taylor] move arrow2csv into src so it's distributed in the npm packages
b5f1470 [Paul Taylor] add Arrow types AST, refactor buffer + json reader to emit type AST nodes
85bf03c [Brian Hulette] linter fixes
793f9e5 [Brian Hulette] only use json-bignum in bin/integration
eaa5de4 [Brian Hulette] add dictionary-encoded vectors
12f99de [Brian Hulette] Add JSON support for Date/Time/Timestamp vectors
5349080 [Brian Hulette] move test data creation after integration_test.py
68c2349 [Paul Taylor] update npm script name in integration runner
f50356e [Paul Taylor] run the js build before integration.py
dcc85f9 [Brian Hulette] linter fixes
313cd58 [Brian Hulette] Add int-test to test-task
86b53b4 [Brian Hulette] Use Int128.fromString in JSON reader
148e997 [Brian Hulette] cleanup
a2befbb [Brian Hulette] Switch endianness of Int64/128
a1ea88f [Brian Hulette] Now uses Uint32 for all internal buffers
02a7838 [Brian Hulette] WIP Int64, Int128
645b844 [Brian Hulette] JS integration script uses new Table.from for JSON
c1f3f6a [Paul Taylor] move createTypedArray and createValidityArray to VectorReaderContext
1191a27 [Paul Taylor] refactor `Table.from()` to accept a JSON object or string
02ea8a6 [Paul Taylor] refactor traits to be compatible with closure compiler's full ES6 -> ES5
01de162 [Paul Taylor] move generated format to format/fb folder, fix closure compiler es5 build
ad41741 [Brian Hulette] Fix bug with zero-length vectors
b17367c [Brian Hulette] Add list,struct to JSON reader
e3d6d62 [Brian Hulette] linter fixes
1e64707 [Brian Hulette] Add JS integration script and integration_test.py JS runner
7e33b1c [Brian Hulette] Add JSON reader
cc @wesm , @jacques-n , @BryanCutler , @icexelloss

A small post on recent improvements in JAVA vectors. Suggestions are welcome :)

Author: siddharth <siddharth@dremio.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1419 from siddharthteotia/ARROW-1922 and squashes the following commits:

ebdd986 [Wes McKinney] Minor tweaks to post, add Dremio link
eaedd87 [siddharth] review comments
5705019 [siddharth] correct typo
c2af13c [siddharth] ARROW-1922: Blog post on JAVA vector changes
…checksum links, add verification instructions

Website fixes per ASF policies and feedback in ARROW-1935, ARROW-1936

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1434 from wesm/ARROW-1935 and squashes the following commits:

41f128e [Wes McKinney] Remove link to nightly builds. Fix signature / checksum links, add KEYS file, verification instructions
…ed table

This requires PARQUET-1092 apache/parquet-cpp#426

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1425 from wesm/ARROW-232 and squashes the following commits:

da8d999 [Wes McKinney] Add unit test to validate PARQUET-1092
I took a hack at this after poring through the changelog, others please let me know if you'd like to add or change anything. I need to incorporate Sidd's blog post and add a link to that here. I can publish all of this sometime tomorrow morning New York time and post to social media etc.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1432 from wesm/ARROW-1934 and squashes the following commits:

4aa6bc6 [Wes McKinney] Tweaks for publication
da9e65d [Wes McKinney] Start drafting 0.8.0 release blog post
cc @wesm

Author: siddharth <siddharth@dremio.com>

Closes apache#1436 from siddharthteotia/ARROW-1939 and squashes the following commits:

31f1be1 [siddharth] ARROW-1939: Correct links in release blog post
…or now

It's reasonably harmless to suppress these warnings for the time being. When we upgrade to a new release of googletest, we can remove this again

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1433 from wesm/ARROW-1931 and squashes the following commits:

09c3722 [Wes McKinney] Add tr1 define to CMAKE_CXX_FLAGS
6fe636e [Wes McKinney] Rearrange appveyor build jobs for faster feedback
6ace398 [Wes McKinney] Use CXX_COMMON_FLAGS, all in one place
b786a40 [Wes McKinney] Silence std::tr1 tuple warning everywhere
549e083 [Wes McKinney] Silence std::tr1 namespace warning only when building gtest
13a77e6 [Wes McKinney] Silence tr1 deprecation warning in MSVC 2017
ccf1318 [Wes McKinney] Add /bigobj flag, suppress C4996 deprecation warning for now
Author: Philipp Moritz <pcmoritz@gmail.com>

Closes apache#1440 from pcmoritz/findarrow-libs-fix and squashes the following commits:

bb161b4 [Philipp Moritz] fix static lib path in FindArrow
The current implementation of setInitialCapacity() multiplies the capacity by a factor of 5 for every level of list nesting.

So if the schema is LIST (LIST (LIST (LIST (LIST (LIST (LIST (BIGINT))))))) and we start with an initial capacity of 128, we end up throwing OversizedAllocationException from the BigIntVector: every level multiplied the capacity by 5, so by the time we reached the inner scalar vector that actually stores the data, we were well over the max size limit per vector (1MB).

We saw this problem downstream when we failed to read deeply nested JSON data.

The potential fix is to use the factor of 5 only when we are down to the leaf vector. As the depth increases and we are still working with complex/list, we don't use the factor of 5.
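
The capacity growth described above can be checked with a little arithmetic. This is a minimal sketch, not the Arrow Java implementation: `leaf_capacity` and its parameters are hypothetical names, and it assumes each list level simply multiplies the capacity it hands to its child by 5.

```python
def leaf_capacity(initial, list_depth, factor=5, leaf_only=False):
    """Capacity requested from the innermost (leaf) vector.

    Hypothetical model: with leaf_only=False every list level multiplies
    the capacity by `factor`; with leaf_only=True only the list that
    directly wraps the leaf vector does.
    """
    if list_depth == 0:
        return initial
    multiplications = 1 if leaf_only else list_depth
    return initial * factor ** multiplications


# Seven nested LIST levels around a BIGINT leaf, initial capacity 128.
before = leaf_capacity(128, 7)                  # factor applied at every level
after = leaf_capacity(128, 7, leaf_only=True)   # factor applied once, at the leaf
print(before, after)  # 10000000 640
```

Under this model the leaf is asked for 10,000,000 values before the change and 640 after it, which is why only deeply nested schemas hit the exception.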

cc @jacques-n , @BryanCutler , @icexelloss

Author: siddharth <siddharth@dremio.com>

Closes apache#1439 from siddharthteotia/ARROW-1943 and squashes the following commits:

d0adbad [siddharth] unit tests
e2f21a8 [siddharth] fix imports
d103436 [siddharth] ARROW-1943: handle setInitialCapacity for deeply nested lists
Author: Robert Nishihara <robertnishihara@gmail.com>

Closes apache#1451 from robertnishihara/numthreads and squashes the following commits:

5e2c7ee [Robert Nishihara] Fix tests.
55cb8ac [Robert Nishihara] Revert old change
0903726 [Robert Nishihara] Move memcopy_threads from serialization context to put.
9281de1 [Robert Nishihara] Expose memcopy threads to serialization context.
…er to handle all non-null

Need to properly set the ListVector validity buffer for the case when the field has all non-nulls.  This is done already in `BitVectorHelper.loadValidityBuffer`, so just need to build the buffer with a call to that function.
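
To picture the all-non-null case: every one of the first `value_count` bits in the validity buffer has to be set. The following is a minimal sketch assuming Arrow's least-significant-bit-first bitmap layout; `all_valid_bitmap` is a hypothetical name for illustration and is not the `BitVectorHelper` API.

```python
def all_valid_bitmap(value_count):
    """Build a validity bitmap in which the first `value_count` bits are 1."""
    bitmap = bytearray((value_count + 7) // 8)  # round up to whole bytes
    for i in range(value_count):
        # Bit i lives in byte i // 8, at position i % 8 (LSB first).
        bitmap[i // 8] |= 1 << (i % 8)
    return bitmap


bitmap = all_valid_bitmap(10)
# Ten valid values need two bytes: eight bits in the first, two in the second.
assert bitmap == bytearray([0b11111111, 0b00000011])
```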

Author: Bryan Cutler <cutlerb@gmail.com>

Closes apache#1447 from BryanCutler/java-ListVector-non-null-validity-buffer-ARROW-1948 and squashes the following commits:

0d82345 [Bryan Cutler] used BitVectorHelper to properly set the validity buffer
this is just a small fix in doxygen documentation of the c++ api

Author: Viktor Gal <viktor.gal@maeth.com>

Closes apache#1442 from vigsterkr/doxygen_fix and squashes the following commits:

3557aef [Viktor Gal] fix doxygen example in array.h
this is necessary until gulp publishes 4.0.0 builds to npm

Author: Paul Taylor <paul.e.taylor@me.com>

Closes apache#1453 from trxcllnt/fix-js-gulp-build and squashes the following commits:

0dddbcc [Paul Taylor] Merge branch 'master' into fix-js-gulp-build
dc329fa [Paul Taylor] Merge branch 'master' into fix-js-gulp-build
d9c1b0c [Paul Taylor] set gulp dependency to specific commit
- Create now takes in a pointer to a shared pointer of Buffer and returns a MutableBuffer.
- Object Buffers data and metadata are pointers to shared pointers of Buffer.

Author: Philipp Moritz <pcmoritz@gmail.com>
Author: William Paul <wapaul1@berkeley.edu>

Closes apache#1444 from Wapaul1/plasma_buffer_api and squashes the following commits:

7fe1cee [Philipp Moritz] fix size of MutableBuffer returned by plasma::Create
aeed751 [Philipp Moritz] more linting
b3274e0 [Philipp Moritz] fix
463dbeb [Philipp Moritz] fix plasma python extension
a055fa8 [Philipp Moritz] fix linting
fc62dda [William Paul] Added metadata buffer
4d8cbb8 [William Paul] Create and Get use Buffers now
…data

We recently moved Dremio to the little-endian (LE) decimal format (similar to Arrow). As part of that, we introduced some APIs in the decimal vector that take big-endian data and swap the bytes while writing into the ArrowBuf of the decimal vector.

The advantage of these APIs is that the caller does not have to allocate additional memory and read and write the big-endian source twice: once to swap it into the new memory, and again to write that into the vector.

We can directly swap bytes while writing into the vector – just read once and swap while writing.
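
The swap-while-writing idea can be sketched as follows. This is an illustration under stated assumptions, not the Java API added in this change: `write_big_endian` is a hypothetical helper, a `bytearray` stands in for the ArrowBuf, and each decimal value is assumed to occupy 16 bytes.

```python
def write_big_endian(dest, index, value_be, width=16):
    """Copy `value_be` into slot `index` of `dest`, reversing the byte order.

    `dest` stands in for the vector's data buffer, which stores each
    `width`-byte value in little-endian order.
    """
    if len(value_be) != width:
        raise ValueError("expected exactly %d bytes" % width)
    start = index * width
    # Read each source byte once and store it at the mirrored position,
    # so no intermediate swapped copy is allocated.
    for i in range(width):
        dest[start + i] = value_be[width - 1 - i]


buf = bytearray(32)  # room for two 16-byte values
big_endian = (1234).to_bytes(16, "big", signed=True)
write_big_endian(buf, 1, big_endian)
assert int.from_bytes(buf[16:32], "little", signed=True) == 1234
```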

cc @jacques-n , @BryanCutler , @icexelloss

Author: siddharth <siddharth@dremio.com>

Closes apache#1443 from siddharthteotia/ARROW-1946 and squashes the following commits:

7805b62 [siddharth] unit tests
c89efbf [siddharth] ARROW-1946: Add APIs to decimal vector for writing big endian data
This closes [ARROW-1941](https://issues.apache.org/jira/browse/ARROW-1941).

Author: Licht-T <licht-t@outlook.jp>

Closes apache#1449 from Licht-T/fix-empty-list-roundtrip and squashes the following commits:

165dc6f [Licht-T] TST: Add test for the empty list roundtrip
0ddfd87 [Licht-T] BUG: Fix empty list roundtrip
This adds support for reading ORC files in the C++ library, as well as python bindings for this functionality.

Author: Jim Crist <jiminy.crist@gmail.com>
Author: Uwe L. Korn <uwelk@xhochy.com>

Closes apache#1418 from jcrist/orc-adapter and squashes the following commits:

7e0400e [Jim Crist] lint
d6d32b5 [Uwe L. Korn] Hide symbols introduced by orc static lib
a296640 [Jim Crist] Tweak error message
f45ac3d [Jim Crist] Read reads as a table
57bc63d [Jim Crist] Use `vector<int>` instead of `list<uint64_t>`
1d53927 [Jim Crist] date32 instead of date64
4b7a3a5 [Jim Crist] Add brief docs
e783544 [Jim Crist] More fixups
33f5b10 [Jim Crist] Turn off ARROW_ORC on windows
86a2355 [Jim Crist] Cleanups
2cfdd92 [Jim Crist] Fix build when dependencies aren't already installed
876c3a3 [Jim Crist] Use fPIC on protobuf as well
f4a29f8 [Jim Crist] Ensure -fPIC on orc build
7cf1659 [Jim Crist] Build python orc support on travis
2adf938 [Jim Crist] Add ORC support
5c79104 [Jim Crist] Add cmake support for liborc
These changes were necessary to compile on Windows with "-DARROW_BUILD_BENCHMARKS=ON". I added Shlwapi based on google/benchmark#202.

Author: Adam Seibert <seibs@users.noreply.github.com>

Closes apache#1406 from seibs/ARROW-1909 and squashes the following commits:

98602cd [Adam Seibert] ARROW-1909: [C++] Enables building with benchmarks on windows
Author: Philipp Moritz <pcmoritz@gmail.com>

Closes apache#1421 from pcmoritz/plasma-object-ids and squashes the following commits:

fc77908 [Philipp Moritz] fixes
9f613c0 [Philipp Moritz] fix windows test
f1d7ca0 [Philipp Moritz] fix linting
6be7f4a [Philipp Moritz] Test that object ids are 20 bytes
Adding `reset()` to the ValueVector interface and implementing where it is not done already.  Removing unused abstract class BaseDataValueVector that is not used anymore by the UnionVector.

Expanded reset tests to check that valueCount is 0 and that buffers keep the same capacity and are zeroed out.
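
The reset contract checked by those tests can be modelled in a few lines. This is only a sketch: `TinyVector` is a hypothetical stand-in with a single data buffer, not the Java ValueVector interface.

```python
class TinyVector:
    """Hypothetical stand-in for a fixed-width vector with one data buffer."""

    def __init__(self, capacity):
        self.data = bytearray(capacity)
        self.value_count = 0

    def set(self, index, value):
        self.data[index] = value
        self.value_count = max(self.value_count, index + 1)

    def reset(self):
        # Zero the buffer in place: the allocation, and so the capacity, is kept.
        self.data[:] = bytes(len(self.data))
        self.value_count = 0


vec = TinyVector(8)
vec.set(3, 42)
vec.reset()
assert vec.value_count == 0
assert len(vec.data) == 8 and not any(vec.data)
```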

Author: Bryan Cutler <cutlerb@gmail.com>

Closes apache#1455 from BryanCutler/java-reset-ValueVector-ARROW-1962 and squashes the following commits:

da994e1 [Bryan Cutler] typo
a52e7db [Bryan Cutler] expanded reset documentations
1526a83 [Bryan Cutler] improved vector reset testing
a251d10 [Bryan Cutler] reset should zero data buffer and set value count to 0
bf2a16a [Bryan Cutler] add reset to NullableMapVector to zero validity buffer
7fbde5b [Bryan Cutler] need to zero out vector buffers when reset
b59addf [Bryan Cutler] adding reset to ValueVector interface, removing BaseDataValueVector
… garrow_chunked_array_get_value_type()

Author: Kouhei Sutou <kou@clear-code.com>

Closes apache#1458 from kou/glib-add-chunked-array-get-value-type and squashes the following commits:

4d99a07 [Kouhei Sutou] [GLib] Add garrow_chunked_array_get_value_data_type() and garrow_chunked_array_get_value_type()
Fix conversion of datetimetz row index for non-UTC time zones in to_pandas.

Author: Albert Shieh <adshieh@gmail.com>

Closes apache#1454 from adshieh/master and squashes the following commits:

6f41302 [Albert Shieh] Fix pandas conversion for datetimetz row index.
Author: Robert Nishihara <robertnishihara@gmail.com>
Author: Philipp Moritz <pcmoritz@gmail.com>

Closes apache#1463 from robertnishihara/segfaultfix and squashes the following commits:

ec8a6c5 [Robert Nishihara] Add comment.
6222340 [Philipp Moritz] fix tests, linting, add license
3e969db [Robert Nishihara] Simplify tests.
8aa3fca [Philipp Moritz] add regression test
bfa0851 [Robert Nishihara] Import pyarrow in DeserializeObject.
- Turns off building optional ORC extension by default
- Fixes travis builds to turn on ORC extension for a few branches
- Adds trivial import test to python build
- Adds documentation on how to build optional ORC extension

Author: Jim Crist <jiminy.crist@gmail.com>

Closes apache#1457 from jcrist/orc-off-by-default and squashes the following commits:

fc9898d [Jim Crist] Document how to build ORC integration
950ae38 [Jim Crist] ORC integration is off by default
Change-Id: I82e6b51fa5129473815e19fd6d03bdaaef7a88ff
Contributor

I think this block could benefit from a short comment explaining what the purpose is.

Member Author

This looks like a rebase artifact

Member

Yes, I have added this in another commit (without it, the manylinux1 wheel would fail).

kou and others added 19 commits March 6, 2018 20:34
This change introduces GBytes constructors to GArrowBuffer and
GArrowMutableBuffer. GBytes has reference count feature. It means that
we can share the same memory safely.

We can't share the same memory safely with the current raw guint8
constructor.

Author: Kouhei Sutou <kou@clear-code.com>

Closes apache#1701 from kou/glib-buffer-accept-gbytes and squashes the following commits:

78de627 <Kouhei Sutou>  Improve memory management for GArrowBuffer data
…g-config

pkg-config doesn't show -I... and -L... flags when ... is the system
default path (i.e. /usr). If Arrow C++ was installed in /usr, we
couldn't detect the include path and library path for Arrow C++.

This happens when Arrow C++ is installed from .rpm or .deb packages.
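
The detection problem can be illustrated with a small parser. This is a sketch of the general idea rather than the CMake logic in this change: `include_dirs` is a hypothetical helper, and falling back to `/usr/include` when no `-I` flag is reported is an assumption about what a sensible default looks like.

```python
import shlex


def include_dirs(cflags, default="/usr/include"):
    """Extract -I directories from pkg-config style flags.

    When the flags contain no -I entry (as happens for system default
    paths), fall back to `default` instead of reporting nothing.
    """
    dirs = [tok[2:] for tok in shlex.split(cflags) if tok.startswith("-I")]
    return dirs or [default]


# A non-default prefix is reported explicitly.
assert include_dirs("-I/opt/arrow/include -DFOO") == ["/opt/arrow/include"]
# A package installed under /usr yields no -I flag at all.
assert include_dirs("") == ["/usr/include"]
```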

Author: Kouhei Sutou <kou@clear-code.com>

Closes apache#1721 from kou/cpp-find-arrow and squashes the following commits:

4911c8f <Kouhei Sutou>  Add Debian based system support
0a2d9c9 <Kouhei Sutou>  Support Arrow C++ installed in /usr detection by pkg-config
Author: Antoine Pitrou <antoine@python.org>

Closes apache#1684 from pitrou/ARROW-2238-cmake-clcache and squashes the following commits:

8539a0e <Antoine Pitrou> ARROW-2238:  Detect and use clcache in cmake configuration
Author: Uwe L. Korn <uwelk@xhochy.com>

Closes apache#1719 from xhochy/ARROW-2280 and squashes the following commits:

82b50a7 <Uwe L. Korn> ARROW-2280:  Return the offset for the buffers in pyarrow.Array
Recommend Ninja and clcache.

Author: Antoine Pitrou <antoine@python.org>

Closes apache#1722 from pitrou/ARROW-2239-windows-build-docs and squashes the following commits:

a0e0288 <Antoine Pitrou> ARROW-2239:  Update Windows build docs
…t_cython.py

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1730 from wesm/ARROW-2263 and squashes the following commits:

6b3f827 <Wes McKinney> Prepend local pyarrow/ path to PYTHONPATH in test_cython.py
…ions between pd.DataFrame and pa.Table

Author: Phillip Cloud <cpcloud@gmail.com>

Closes apache#1728 from cpcloud/ARROW-1940 and squashes the following commits:

2e5b7af <Phillip Cloud> ARROW-1940:  Extra metadata gets added after multiple conversions between pd.DataFrame and pa.Table
They are useful to detect numeric data types.

Author: Kouhei Sutou <kou@clear-code.com>

Closes apache#1726 from kou/glib-numeric-data-type and squashes the following commits:

89a5a8a <Kouhei Sutou>  Add Numeric, Integer, FloatingPoint data types
Author: Antoine Pitrou <antoine@python.org>

Closes apache#1714 from pitrou/ARROW-2270-pyforeignbuffer and squashes the following commits:

51f2d85 <Antoine Pitrou> ARROW-2270:  Fix lifetime of ForeignBuffer base object
…arrow.Array for now

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1729 from wesm/ARROW-2150 and squashes the following commits:

f0ddcf5 <Wes McKinney> Raise NotImplementedError when comparing with pyarrow.Array for now
Just a trivial fix.  stderr is captured by py.test, not by the subprocess call.

Author: Antoine Pitrou <antoine@python.org>

Closes apache#1724 from pitrou/ARROW-2284 and squashes the following commits:

46a692c <Antoine Pitrou> ARROW-2284:  Fix error display on test_plasma error
Also disambiguate the Tensor API on this front.

Author: Antoine Pitrou <antoine@python.org>

Closes apache#1717 from pitrou/ARROW-2275-bad-mutable-data and squashes the following commits:

fabd7b9 <Antoine Pitrou> ARROW-2275:  Guard against bad use of Buffer.mutable_data()
…tion scripts

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#1731 from wesm/ARROW-2268 and squashes the following commits:

42768ef <Wes McKinney> Drop usage of md5 checksums for source releases, verification scripts
cc @mitar

Author: Uwe L. Korn <uwelk@xhochy.com>

Closes apache#1718 from xhochy/ARROW-2269 and squashes the following commits:

edc1f3d <Uwe L. Korn> ARROW-2269:  Make boost namespace selectable in wheels
…se existing process

Author: Mitar <mitar.git@tnode.com>

Closes apache#1705 from mitar/ARROW-2250 and squashes the following commits:

4e71d44 <Mitar> ARROW-2250:  Do not create a subprocess for plasma but just use existing process.
Adds not, gt, lt, and neq

Author: Brian Hulette <brian.hulette@ccri.com>

Closes apache#1683 from TheNeuralBit/js-more-predicates and squashes the following commits:

707de82 <Brian Hulette> Use two letter names
56ecea3 <Brian Hulette> export packBools, import compiled code in vector-test
32b26e3 <Brian Hulette> lint
5738327 <Brian Hulette> add externs
84895a0 <Brian Hulette> Add not, lt, gt, neq
Change-Id: I1b2fb90419df18a52e9d302ee0225871b1294903
Change-Id: Iee935158133c653a4c2a663ba7693aa83406b5e3
wesm added 2 commits March 9, 2018 14:56
Change-Id: I1261970faf5d150d43b903d3b741de6b0dabd0ea
Change-Id: Ib2363a27779a3559e3f8f57f01ee02b8c701e66a
Member Author

wesm commented Mar 9, 2018

I am not sure why libboost_regex is still a runtime dependency, even with --as-needed. Plasma doesn't appear to have any symbols with a transitive dependency on code in arrow/util/decimal.cc (where boost::regex is used).

Member Author

wesm commented Mar 9, 2018

@xhochy It seems that CHECK_CXX_COMPILER_FLAG doesn't turn up the right answer on macOS. I'm afraid we'll have to leave this PR in WIP unless someone else can figure this out for 0.9.0

Member

pitrou commented Apr 11, 2018

We should probably close this PR, as this issue has been fixed by removing the boost_regex usage.
