ARROW-694: [C++] Initial parser interface for reading JSON into RecordBatches #3592

Closed
wants to merge 33 commits into from

Conversation

bkietz
@bkietz bkietz commented Feb 8, 2019

(abandoning #3206)

Adds a `json` subproject with:

  • BlockParser, which parses Buffers of JSON-formatted data into a StructArray with minimal conversion
    • true/false and null fields are stored in BooleanArray and NullArray respectively
    • strings are stored as indices into a single StringArray
    • numbers are not converted; their string representations are stored alongside string values
    • nested fields are stored as ListArray or StructArray of their parsed (unconverted) children
  • Three approaches to handling unexpected fields:
    1. Error on an unexpected field
    2. Ignore unexpected fields
    3. Infer the type of unexpected fields and add them to the schema
  • A convenience interface for parsing a single chunk of JSON data into a RecordBatch with fully converted columns
  • A Chunker to process a stream of unchunked data for use by BlockParser (not currently used)
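
For readers skimming the description, the three unexpected-field behaviours listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: `UnexpectedFieldPolicy` and `AcceptField` are hypothetical names, the schema is reduced to a set of field names, and an exception stands in for whatever error reporting the parser actually uses.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical names for illustration only; not taken from the patch.
enum class UnexpectedFieldPolicy { kError, kIgnore, kInferType };

// Returns true if the value for `name` should be parsed and kept.
// `schema_fields` stands in for the schema and may grow under kInferType.
bool AcceptField(std::set<std::string>* schema_fields, const std::string& name,
                 UnexpectedFieldPolicy policy) {
  assert(schema_fields != nullptr);
  if (schema_fields->count(name) > 0) {
    return true;  // expected field: always kept
  }
  switch (policy) {
    case UnexpectedFieldPolicy::kError:
      // 1. Error on an unexpected field.
      throw std::invalid_argument("unexpected field: " + name);
    case UnexpectedFieldPolicy::kIgnore:
      // 2. Ignore the field; the caller skips its value.
      return false;
    case UnexpectedFieldPolicy::kInferType:
      // 3. Add the field to the schema. A real parser would also record
      //    the type inferred from the value; that part is omitted here.
      schema_fields->insert(name);
      return true;
  }
  return false;  // not reached; keeps compilers quiet
}
```

Only the third policy mutates the schema, which is why the sketch takes it by pointer.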

@wesm
wesm commented Feb 8, 2019

Can you update the title and write a PR description of what is in the patch?

@bkietz bkietz changed the title ARROW-694: [C++] json reader [WIP] ARROW-694: [C++] initial parser interface for reading JSON into RecordBatches [WIP] Feb 8, 2019
@wesm wesm changed the title ARROW-694: [C++] initial parser interface for reading JSON into RecordBatches [WIP] ARROW-694: [C++] Initial parser interface for reading JSON into RecordBatches [WIP] Feb 8, 2019
@wesm wesm changed the title ARROW-694: [C++] Initial parser interface for reading JSON into RecordBatches [WIP] ARROW-694: [C++] Initial parser interface for reading JSON into RecordBatches Feb 8, 2019
@wesm
wesm commented Feb 8, 2019

I removed the WIP from the title. I will review, then let's get the tests passing and merge this in the near future.

@xhochy xhochy left a comment

Left some initial comments.

The code here makes a lot of use of std::move in places where we have used const & elsewhere. We should keep using const references. I don't see that moving really improves performance here; if anything, I feel it is the other way around.
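
As a neutral aside on the trade-off raised in this comment, here is a small self-contained sketch that counts copies for the two parameter styles. It is an illustration under stated assumptions, not code from this patch: `Payload`, `StoresByConstRef` and `StoresByValue` are hypothetical types, and the case shown is the one where the callee stores its argument, which is the only case where pass-by-value plus std::move can save a copy.

```cpp
#include <cassert>
#include <utility>

int g_copies = 0;  // counts copy constructions of Payload

struct Payload {
  Payload() = default;
  Payload(const Payload&) { ++g_copies; }
  Payload(Payload&&) noexcept {}
};

// const& parameter, then copy into the member: always one copy.
struct StoresByConstRef {
  explicit StoresByConstRef(const Payload& p) : payload(p) {}
  Payload payload;
};

// By-value parameter, then move into the member: one copy for an lvalue
// argument, none for a temporary.
struct StoresByValue {
  explicit StoresByValue(Payload p) : payload(std::move(p)) {}
  Payload payload;
};

template <typename Holder>
int CopiesFromTemporary() {
  g_copies = 0;
  Holder h{Payload()};
  (void)h;
  return g_copies;
}

template <typename Holder>
int CopiesFromLvalue() {
  g_copies = 0;
  Payload p;
  Holder h{p};
  (void)h;
  return g_copies;
}

// Checks the copy counts described in the comments above.
void CheckCopyCounts() {
  assert(CopiesFromTemporary<StoresByConstRef>() == 1);
  assert(CopiesFromTemporary<StoresByValue>() == 0);
  assert(CopiesFromLvalue<StoresByConstRef>() == 1);
  assert(CopiesFromLvalue<StoresByValue>() == 1);
}
```

When the callee only reads its argument and does not store it, a const reference involves no copy at all, so moving has nothing to save in that case.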

cpp/src/arrow/json/chunker.cc Show resolved Hide resolved

class ReadOnlyStream {
public:
using Ch = char;
I don't really like this 4->2 letter abbreviation. It worsens readability enormously for me.

This is a convention (a requirement maybe?) of rapidjson streams https://github.com/Tencent/rapidjson/blob/master/include/rapidjson/stream.h#L27
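
For context on why the `Ch` name cannot simply be renamed: RapidJSON's documented Stream concept expects a nested type called `Ch` together with `Peek`, `Take`, `Tell`, `PutBegin`, `Put`, `Flush` and `PutEnd`. The sketch below follows that concept but is deliberately standalone (it does not include RapidJSON) and is not the class from this patch; `ReadOnlyCharStream` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical read-only stream shaped after RapidJSON's Stream concept.
class ReadOnlyCharStream {
 public:
  using Ch = char;  // the concept requires this exact nested type name

  ReadOnlyCharStream(const char* data, std::size_t size)
      : data_(data), size_(size), pos_(0) {
    assert(data != nullptr);
  }

  // Read half of the concept. A null character signals end of input.
  Ch Peek() const { return pos_ < size_ ? data_[pos_] : '\0'; }
  Ch Take() { return pos_ < size_ ? data_[pos_++] : '\0'; }
  std::size_t Tell() const { return pos_; }

  // Write half of the concept. A read-only stream must still declare these,
  // but they are never expected to be called.
  Ch* PutBegin() {
    assert(false);
    return nullptr;
  }
  void Put(Ch) { assert(false); }
  void Flush() { assert(false); }
  std::size_t PutEnd(Ch*) {
    assert(false);
    return 0;
  }

 private:
  const char* data_;
  std::size_t size_;
  std::size_t pos_;
};
```

Everything other than the nested typedef can use plain `char`, which is the refactoring offered later in this thread.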


class StringStream : public ReadOnlyStream {
public:
using Ch = char;
Please remove this abbreviation

Member Author
That typedef is part of RapidJSON's stream interface, but I can refactor the rest of the class to use char instead.

cpp/src/arrow/type.h Show resolved Hide resolved
cpp/src/arrow/json/parser.h Show resolved Hide resolved
cpp/src/arrow/json/parser.cc Show resolved Hide resolved
cpp/src/arrow/json/parser.cc Outdated Show resolved Hide resolved
cpp/src/arrow/json/parser.cc Show resolved Hide resolved
cpp/src/arrow/json/parser.cc Outdated Show resolved Hide resolved
@xhochy
xhochy commented Feb 15, 2019

Looking good so far. Please ping again before you merge.

@wesm
wesm commented Feb 15, 2019

I will give this a look through today and then merge if there are no build issues. We can address further things in follow-up patches.

@wesm wesm left a comment

+1. I'm going to go ahead and merge this with a passing build. I left comments -- a lot of the polishing and improvements can happen in follow-up patches. I think we should soon implement Python bindings and begin to drive feature and performance work based on feature requirements.

On the face of it, the parsing perf of ~77 MB/s seems slower than I would have expected, but I didn't do any perf tests of other libraries to get a baseline. I made a quick flamegraph and it seems about 20% of runtime is spent in SetFieldBuilder. We may need to make some optimizations for the common case where fields appear in the same order in each record.
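
One possible shape for the same-order optimization mentioned here, as a hedged sketch: guess that the next field name is the one following the previous field, and fall back to a hash lookup only when the guess is wrong. `OrderedFieldLookup` is a hypothetical helper written for this illustration; it is not the patch's SetFieldBuilder and makes no claim about how that function is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper for illustration; not the patch's SetFieldBuilder.
class OrderedFieldLookup {
 public:
  explicit OrderedFieldLookup(std::vector<std::string> names)
      : names_(std::move(names)) {
    for (std::size_t i = 0; i < names_.size(); ++i) {
      index_[names_[i]] = i;
    }
    assert(index_.size() == names_.size());  // field names must be unique
  }

  // Call at the start of each JSON object.
  void StartRecord() { next_ = 0; }

  // Returns the index of `name`, or -1 if it is not a known field.
  int Lookup(const std::string& name) {
    // Fast path: the field is the one we expected to see next.
    if (next_ < names_.size() && names_[next_] == name) {
      ++fast_hits_;
      return static_cast<int>(next_++);
    }
    // Slow path: hash lookup, then resume guessing after the found field.
    auto it = index_.find(name);
    if (it == index_.end()) {
      return -1;
    }
    next_ = it->second + 1;
    return static_cast<int>(it->second);
  }

  std::size_t fast_hits() const { return fast_hits_; }

 private:
  std::vector<std::string> names_;
  std::unordered_map<std::string, std::size_t> index_;
  std::size_t next_ = 0;
  std::size_t fast_hits_ = 0;
};
```

When every record lists its fields in schema order, each lookup costs a single string comparison and the hash map is never consulted. Whether this would actually recover the time seen in the flamegraph would need to be measured.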

cpp/src/arrow/array/builder_binary.h Show resolved Hide resolved

cpp/src/arrow/json/chunker.cc Show resolved Hide resolved
/// \brief A reusable block-based chunker for JSON data
///
/// The chunker takes a block of JSON data and finds a suitable place
/// to cut it up without splitting an object.
Since there are no benchmarks yet for the Chunker code, whoever works on this next needs to write some before doing any more work on the API or implementation.
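
To make the doc comment above concrete, the idea of cutting a block without splitting an object can be sketched as a single scan that tracks nesting depth and string state. This is a simplified illustration, not the Chunker in this patch: `FindCompleteObjectsEnd` is a hypothetical function, it treats top-level arrays like objects, and it does not validate the JSON or check that brackets are matched by kind.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the length of the longest prefix of `block` that ends right after
// a complete top-level JSON value ({...} or [...]). Bytes past that offset
// belong to a partial value and should be carried into the next block.
std::size_t FindCompleteObjectsEnd(const std::string& block) {
  std::size_t last_end = 0;
  int depth = 0;
  bool in_string = false;
  bool escaped = false;
  for (std::size_t i = 0; i < block.size(); ++i) {
    const char c = block[i];
    if (in_string) {
      // Braces inside string literals must not affect the depth.
      if (escaped) {
        escaped = false;
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
      continue;
    }
    if (c == '"') {
      in_string = true;
    } else if (c == '{' || c == '[') {
      ++depth;
    } else if (c == '}' || c == ']') {
      if (depth > 0 && --depth == 0) {
        last_end = i + 1;  // a top-level value just closed
      }
    }
  }
  assert(last_end <= block.size());
  return last_end;
}
```

A function of this shape would also be a natural first target for the benchmarks requested here, since it touches every byte of the input.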

cpp/src/arrow/json/options.h Show resolved Hide resolved
cpp/src/arrow/json/parser.cc Show resolved Hide resolved
cpp/src/arrow/json/parser.cc Show resolved Hide resolved
cpp/src/arrow/json/parser.cc Outdated Show resolved Hide resolved
cpp/src/arrow/json/parser.cc Show resolved Hide resolved
cpp/src/arrow/json/parser.cc Show resolved Hide resolved
bkietz and others added 11 commits February 15, 2019 17:34
(it is a stand in for the eventual TableReader impl)

Also: do not rely on pointer identity of tags
(`std::shared_ptr<const KeyValueMetadata>`)
Useful:

```
clang-query-7 -p=compile_commands.json \
  cpp/src/arrow/json/* \
  -c='match ifStmt(hasThen(stmt(unless(compoundStmt()))))'
```
Change-Id: I8e632a786ba6cc26afa3fb14e8769013ffe9a28c
Change-Id: Ibe9148f4210c9a1f794d7c928d7e70c5ddcaae56
Change-Id: I96d6a741ecef9dbfced5ef7ee4c4fcdc697ad64e
@wesm
wesm commented Feb 16, 2019

The chunker test was failing under valgrind, so I disabled it.

@wesm
wesm commented Feb 20, 2019

This looks like a race condition with the metadata_fbs target, not related to this patch

[33/566] Building CXX object src/arrow/flight/CMakeFiles/arrow_flight_objlib.dir/client.cc.o
FAILED: src/arrow/flight/CMakeFiles/arrow_flight_objlib.dir/client.cc.o 
/Users/travis/build/apache/arrow/cpp-toolchain/bin/ccache /Users/travis/build/apache/arrow/cpp-toolchain/bin/clang++  -DARROW_EXTRA_ERROR_CONTEXT -DARROW_JEMALLOC -DARROW_JEMALLOC_INCLUDE_DIR=/Users/travis/build/apache/arrow/cpp-build/jemalloc_ep-prefix/src/jemalloc_ep/dist//include -DARROW_NO_DEPRECATED_API -DARROW_USE_GLOG -DARROW_USE_SIMD -DARROW_WITH_BROTLI -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 -DARROW_WITH_SNAPPY -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -Isrc -I/Users/travis/build/apache/arrow/cpp/src -isystem boost_ep-prefix/src/boost_ep -isystem /Users/travis/build/apache/arrow/cpp-toolchain/include -isystem gbenchmark_ep/src/gbenchmark_ep-install/include -isystem jemalloc_ep-prefix/src -isystem /Users/travis/build/apache/arrow/cpp/thirdparty/hadoop/include -isystem orc_ep-install/include -isystem /Users/travis/build/apache/arrow/cpp-toolchain/include/thrift -Qunused-arguments -ggdb -O0  -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-deprecated -Wno-weak-vtables -Wno-padded -Wno-comma -Wno-unused-macros -Wno-unused-parameter -Wno-unused-template -Wno-undef -Wno-shadow -Wno-switch-enum -Wno-exit-time-destructors -Wno-global-constructors -Wno-weak-template-vtables -Wno-undefined-reinterpret-cast -Wno-implicit-fallthrough -Wno-unreachable-code-return -Wno-float-equal -Wno-missing-prototypes -Wno-documentation-unknown-command -Wno-old-style-cast -Wno-covered-switch-default -Wno-cast-align -Wno-vla-extension -Wno-shift-sign-overflow -Wno-used-but-marked-unused -Wno-missing-variable-declarations -Wno-gnu-zero-variadic-macro-arguments -Wconversion -Wno-sign-conversion -Wno-disabled-macro-expansion -Wno-format-nonliteral -Wno-missing-noreturn -Wno-gnu-folding-constant -Wno-reserved-id-macro -Wno-range-loop-analysis -Wno-double-promotion -Wno-undefined-func-template -Wno-zero-as-null-pointer-constant -Wno-unknown-warning-option -Werror -Wno-unknown-warning-option -msse4.2 -maltivec -stdlib=libc++  -g -fPIC   -std=gnu++11 -MD -MT 
src/arrow/flight/CMakeFiles/arrow_flight_objlib.dir/client.cc.o -MF src/arrow/flight/CMakeFiles/arrow_flight_objlib.dir/client.cc.o.d -o src/arrow/flight/CMakeFiles/arrow_flight_objlib.dir/client.cc.o -c /Users/travis/build/apache/arrow/cpp/src/arrow/flight/client.cc
In file included from /Users/travis/build/apache/arrow/cpp/src/arrow/flight/client.cc:29:
/Users/travis/build/apache/arrow/cpp/src/arrow/ipc/metadata-internal.h:32:10: fatal error: 'arrow/ipc/Schema_generated.h' file not found
#include "arrow/ipc/Schema_generated.h"
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
[36/566] Performing build step for 'boost_ep'
-- boost_ep build command succeeded.  See also /Users/travis/build/apache/arrow/cpp-build/boost_ep-prefix/src/boost_ep-stamp/boost_ep-build-*.log
ninja: build stopped: subcommand failed.

…ndition with Flatbuffers

Change-Id: Ief2c1de54e39731fb17ffe6b5c306152b0c76169
@xhochy
xhochy commented Feb 20, 2019

This looks like a race condition with the metadata_fbs target, not related to this patch

Rebasing should help, I've fixed this race recently.

@wesm wesm left a comment

+1, thanks @bkietz! This puts us on a good path to getting some Arrow-powered JSON into the hands of users.

4 participants