Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-67: C++ metadata flatbuffer serialization and data movement to memory maps #28

Closed
wants to merge 3 commits into from

Conversation

wesm
Copy link
Member

@wesm wesm commented Mar 17, 2016

Several things here:

  • Add Google flatbuffers dependency
  • Flatbuffers IDL draft in collaboration with @jacques-n and @StevenMPhillips
  • Add Schema wrapper in Cython
  • arrow::Schema conversion to/from flatbuffer representation
  • Remove unneeded physical layout types from type.h
  • Refactor ListType to be a nested type with a single child
  • Implement shared memory round-trip for numeric row batches
  • mmap-based shared memory interface and MemorySource abstract API

Quite a bit of judicious code cleaning and consolidation as part of this. For example, List types are now internally equivalent to a nested type with 1 named child field (versus a struct, which can have any number of child fields).

Associated JIRAs: ARROW-48, ARROW-57, ARROW-58

@wesm
Copy link
Member Author

wesm commented Mar 17, 2016

FWIW, I have only good things to say about Flatbuffers (https://github.com/google/flatbuffers) so far. It's only metadata (serializing data itself with flatbuffers would not be appropriate for several reasons), but it's fast and most importantly lightweight (compared with the alternatives, e.g. Thrift).


if (FLATBUFFERS_INCLUDE_DIR AND FLATBUFFERS_LIBRARIES)
set(FLATBUFFERS_FOUND TRUE)
get_filename_component( FLATBUFFERS_LIBS ${FLATBUFFERS_LIBRARIES} DIRECTORY )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use PATH instead of DIRECTORY here to support CMake 2.8

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you, fixed

@wesm wesm changed the title ARROW-67: [WIP] C++ metadata and data movement to memory maps ARROW-67: C++ metadata and data movement to memory maps Mar 19, 2016
@wesm
Copy link
Member Author

wesm commented Mar 19, 2016

I'm calling a checkpoint here (before the patch gets much larger) as I've got simple integer row batches round-tripping successfully to a memory map. Sorry this got so big; lot of scaffolding needed to make this all work.

@StevenMPhillips I wouldn't mind a review of the code in adapter.h/adapter.cc to see if the logic jives with your thinking about the Java side.

I'll make a pass at incorporating feedback from #16. There's a long list of follow ups to incorporate all aspects of the Arrow spec into the IPC code, so I'll create lots of follow up JIRAs to break things into more bite-sized pieces.

@wesm
Copy link
Member Author

wesm commented Mar 19, 2016

Incorporated code review comments from #16.

@@ -7,5 +7,6 @@ SOURCE_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd)
source thirdparty/versions.sh

export GTEST_HOME=$SOURCE_DIR/thirdparty/$GTEST_BASEDIR
export FLATBUFFERS_HOME=$SOURCE_DIR/thirdparty/installed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like this might conflict with #29 as well

namespace arrow {
namespace ipc {

static bool IsPrimitive(const DataType* type) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

general comment on the file, once we have a DCHECK MACRO, there are a lot of places in this file that could use it =)

@emkornfield
Copy link
Contributor

I'm kind of hitting a wall, in terms of reviewing this. I looked at most of the c++ stuff but not the cython stuff (but I think this mostly refactoring?). If you have the time to answer some of the questions I posed with further documentation in the code that would be great.

This is great progress, I'm excited to see how quickly this is coming along. Hopefully I will get some time soon to contribute more then just build tool-chain stuff. If you need help filling in more of the implementation for structs/lists I'd be interested in taking that on.

@wesm
Copy link
Member Author

wesm commented Mar 22, 2016

No problem, very much appreciate the reviews. I'll address your comments and then rebase/merge so we can move forward. The Cython stuff is all refactoring so no need to scrutinize there.

@wesm
Copy link
Member Author

wesm commented Mar 22, 2016

I'm mostly going to be working on Python and Parquet-related stuff this week so I'll keep you in the loop on which JIRAs are on my critical path. Note: I'll be on vacation 4/9-21 so not much code progress during that time but I'll be able to do code reviews (within any given 72 hour window) so you don't get blocked waiting on reviews or patches merged. Hope to get a lot of stuff done before 4/9

}

union MessageHeader {
Schema, DictionaryBatch, RecordBatch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a distinction between RecordBatch and RowBatch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No -- I didn't like the name RecordBatch (Kudu and Impala all use "row batch" to refer to the same concept), but Drill uses "record batch". Will be sure to add comments clarifying

@wesm wesm force-pushed the cpp-ipc-draft branch 2 times, most recently from 6b88094 to 5a70a52 Compare March 22, 2016 21:42
@wesm
Copy link
Member Author

wesm commented Mar 22, 2016

Just rebased and resolved conflicts from #29. Going to address all the feedback now and merge with a green build

@wesm
Copy link
Member Author

wesm commented Mar 22, 2016

Thank you @emkornfield and others for the detailed reviews. I addressed the comments and added a bunch more comments to Message.fbs. Please continue to scrutinize the metadata serialization and let me or @StevenMPhillips know what is unclear. One thing that is not specified at all is transmitting dictionary-encoded data (especially nested data like strings or other complex structures).

bool Equals(const Array& arr) const;
virtual bool Equals(const std::shared_ptr<Array>& arr) const;
bool EqualsExact(const Array& arr) const;
virtual bool Equals(const std::shared_ptr<Array>& arr) const = 0;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clang revealed that this (plus the downstream PrimitiveArrayImpl templates) was causing some bad behavior, so we can review in the future how to have the cleanest code for the proliferation of array containers while avoiding unwanted template instantiation (I admit I don't know enough about this to make a judgment

@wesm
Copy link
Member Author

wesm commented Mar 23, 2016

+1, thank you all

@wesm wesm changed the title ARROW-67: C++ metadata and data movement to memory maps ARROW-67: C++ metadata flatbuffer serialization and data movement to memory maps Mar 23, 2016
@asfgit asfgit closed this in 65db0da Mar 23, 2016
@wesm wesm deleted the cpp-ipc-draft branch March 23, 2016 01:45
List,
Tuple,
Union,
JSONScalar
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is an alias for Union<Int, ...> it should probably not be in this type union.
Does flatbuff have constant literals?

wesm added a commit to wesm/arrow that referenced this pull request Sep 2, 2018
Author: Wes McKinney <wes@cloudera.com>

Closes apache#28 from wesm/PARQUET-439 and squashes the following commits:

7c0712b [Wes McKinney] PARQUET-439: Conform copyright headers to ASF requirements
wesm added a commit to wesm/arrow that referenced this pull request Sep 4, 2018
Author: Wes McKinney <wes@cloudera.com>

Closes apache#28 from wesm/PARQUET-439 and squashes the following commits:

7c0712b [Wes McKinney] PARQUET-439: Conform copyright headers to ASF requirements

Change-Id: I23c1c61346cc4d617b2124e0a6ad208a24aeb4c5
wesm added a commit to wesm/arrow that referenced this pull request Sep 6, 2018
Author: Wes McKinney <wes@cloudera.com>

Closes apache#28 from wesm/PARQUET-439 and squashes the following commits:

7c0712b [Wes McKinney] PARQUET-439: Conform copyright headers to ASF requirements

Change-Id: I23c1c61346cc4d617b2124e0a6ad208a24aeb4c5
wesm added a commit to wesm/arrow that referenced this pull request Sep 7, 2018
Author: Wes McKinney <wes@cloudera.com>

Closes apache#28 from wesm/PARQUET-439 and squashes the following commits:

7c0712b [Wes McKinney] PARQUET-439: Conform copyright headers to ASF requirements

Change-Id: I23c1c61346cc4d617b2124e0a6ad208a24aeb4c5
wesm added a commit to wesm/arrow that referenced this pull request Sep 8, 2018
Author: Wes McKinney <wes@cloudera.com>

Closes apache#28 from wesm/PARQUET-439 and squashes the following commits:

7c0712b [Wes McKinney] PARQUET-439: Conform copyright headers to ASF requirements

Change-Id: I23c1c61346cc4d617b2124e0a6ad208a24aeb4c5
kou pushed a commit that referenced this pull request May 10, 2020
This PR enables tests for `ARROW_COMPUTE`, `ARROW_DATASET`, `ARROW_FILESYSTEM`, `ARROW_HDFS`, `ARROW_ORC`, and `ARROW_IPC` (default on). #7131 enabled a minimal set of tests as a starting point.

I confirmed that these tests pass locally with the current master. In the current TravisCI environment, we cannot see this result due to a lot of error messages in `arrow-utility-test`.

```
$ git log | head -1
commit ed5f534
% ctest
...
      Start  1: arrow-array-test
 1/51 Test  #1: arrow-array-test .....................   Passed    4.62 sec
      Start  2: arrow-buffer-test
 2/51 Test  #2: arrow-buffer-test ....................   Passed    0.14 sec
      Start  3: arrow-extension-type-test
 3/51 Test  #3: arrow-extension-type-test ............   Passed    0.12 sec
      Start  4: arrow-misc-test
 4/51 Test  #4: arrow-misc-test ......................   Passed    0.14 sec
      Start  5: arrow-public-api-test
 5/51 Test  #5: arrow-public-api-test ................   Passed    0.12 sec
      Start  6: arrow-scalar-test
 6/51 Test  #6: arrow-scalar-test ....................   Passed    0.13 sec
      Start  7: arrow-type-test
 7/51 Test  #7: arrow-type-test ......................   Passed    0.14 sec
      Start  8: arrow-table-test
 8/51 Test  #8: arrow-table-test .....................   Passed    0.13 sec
      Start  9: arrow-tensor-test
 9/51 Test  #9: arrow-tensor-test ....................   Passed    0.13 sec
      Start 10: arrow-sparse-tensor-test
10/51 Test #10: arrow-sparse-tensor-test .............   Passed    0.16 sec
      Start 11: arrow-stl-test
11/51 Test #11: arrow-stl-test .......................   Passed    0.12 sec
      Start 12: arrow-concatenate-test
12/51 Test #12: arrow-concatenate-test ...............   Passed    0.53 sec
      Start 13: arrow-diff-test
13/51 Test #13: arrow-diff-test ......................   Passed    1.45 sec
      Start 14: arrow-c-bridge-test
14/51 Test #14: arrow-c-bridge-test ..................   Passed    0.18 sec
      Start 15: arrow-io-buffered-test
15/51 Test #15: arrow-io-buffered-test ...............   Passed    0.20 sec
      Start 16: arrow-io-compressed-test
16/51 Test #16: arrow-io-compressed-test .............   Passed    3.48 sec
      Start 17: arrow-io-file-test
17/51 Test #17: arrow-io-file-test ...................   Passed    0.74 sec
      Start 18: arrow-io-hdfs-test
18/51 Test #18: arrow-io-hdfs-test ...................   Passed    0.12 sec
      Start 19: arrow-io-memory-test
19/51 Test #19: arrow-io-memory-test .................   Passed    2.77 sec
      Start 20: arrow-utility-test
20/51 Test #20: arrow-utility-test ...................***Failed    5.65 sec
      Start 21: arrow-threading-utility-test
21/51 Test #21: arrow-threading-utility-test .........   Passed    1.34 sec
      Start 22: arrow-compute-compute-test
22/51 Test #22: arrow-compute-compute-test ...........   Passed    0.13 sec
      Start 23: arrow-compute-boolean-test
23/51 Test #23: arrow-compute-boolean-test ...........   Passed    0.15 sec
      Start 24: arrow-compute-cast-test
24/51 Test #24: arrow-compute-cast-test ..............   Passed    0.22 sec
      Start 25: arrow-compute-hash-test
25/51 Test #25: arrow-compute-hash-test ..............   Passed    2.61 sec
      Start 26: arrow-compute-isin-test
26/51 Test #26: arrow-compute-isin-test ..............   Passed    0.81 sec
      Start 27: arrow-compute-match-test
27/51 Test #27: arrow-compute-match-test .............   Passed    0.40 sec
      Start 28: arrow-compute-sort-to-indices-test
28/51 Test #28: arrow-compute-sort-to-indices-test ...   Passed    3.33 sec
      Start 29: arrow-compute-nth-to-indices-test
29/51 Test #29: arrow-compute-nth-to-indices-test ....   Passed    1.51 sec
      Start 30: arrow-compute-util-internal-test
30/51 Test #30: arrow-compute-util-internal-test .....   Passed    0.13 sec
      Start 31: arrow-compute-add-test
31/51 Test #31: arrow-compute-add-test ...............   Passed    0.12 sec
      Start 32: arrow-compute-aggregate-test
32/51 Test #32: arrow-compute-aggregate-test .........   Passed   14.70 sec
      Start 33: arrow-compute-compare-test
33/51 Test #33: arrow-compute-compare-test ...........   Passed    7.96 sec
      Start 34: arrow-compute-take-test
34/51 Test #34: arrow-compute-take-test ..............   Passed    4.80 sec
      Start 35: arrow-compute-filter-test
35/51 Test #35: arrow-compute-filter-test ............   Passed    8.23 sec
      Start 36: arrow-dataset-dataset-test
36/51 Test #36: arrow-dataset-dataset-test ...........   Passed    0.25 sec
      Start 37: arrow-dataset-discovery-test
37/51 Test #37: arrow-dataset-discovery-test .........   Passed    0.13 sec
      Start 38: arrow-dataset-file-ipc-test
38/51 Test #38: arrow-dataset-file-ipc-test ..........   Passed    0.21 sec
      Start 39: arrow-dataset-file-test
39/51 Test #39: arrow-dataset-file-test ..............   Passed    0.12 sec
      Start 40: arrow-dataset-filter-test
40/51 Test #40: arrow-dataset-filter-test ............   Passed    0.16 sec
      Start 41: arrow-dataset-partition-test
41/51 Test #41: arrow-dataset-partition-test .........   Passed    0.13 sec
      Start 42: arrow-dataset-scanner-test
42/51 Test #42: arrow-dataset-scanner-test ...........   Passed    0.20 sec
      Start 43: arrow-filesystem-test
43/51 Test #43: arrow-filesystem-test ................   Passed    1.62 sec
      Start 44: arrow-hdfs-test
44/51 Test #44: arrow-hdfs-test ......................   Passed    0.13 sec
      Start 45: arrow-feather-test
45/51 Test #45: arrow-feather-test ...................   Passed    0.91 sec
      Start 46: arrow-ipc-read-write-test
46/51 Test #46: arrow-ipc-read-write-test ............   Passed    5.77 sec
      Start 47: arrow-ipc-json-simple-test
47/51 Test #47: arrow-ipc-json-simple-test ...........   Passed    0.16 sec
      Start 48: arrow-ipc-json-test
48/51 Test #48: arrow-ipc-json-test ..................   Passed    0.27 sec
      Start 49: arrow-json-integration-test
49/51 Test #49: arrow-json-integration-test ..........   Passed    0.13 sec
      Start 50: arrow-json-test
50/51 Test #50: arrow-json-test ......................   Passed    0.26 sec
      Start 51: arrow-orc-adapter-test
51/51 Test #51: arrow-orc-adapter-test ...............   Passed    1.92 sec

98% tests passed, 1 tests failed out of 51

Label Time Summary:
arrow-tests      =  27.38 sec (27 tests)
arrow_compute    =  45.11 sec (14 tests)
arrow_dataset    =   1.21 sec (7 tests)
arrow_ipc        =   6.20 sec (3 tests)
unittest         =  79.91 sec (51 tests)

Total Test time (real) =  79.99 sec

The following tests FAILED:
	 20 - arrow-utility-test (Failed)
Errors while running CTest
```

Closes #7142 from kiszk/ARROW-8754

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants