ARROW-694: [C++] Initial parser interface for reading JSON into RecordBatches #3592
Conversation
Can you update the title and write a PR description of what is in the patch?
I removed the WIP from the title. I will review, then let's get the tests passing and merge this in the near future.
Force-pushed from e7804f6 to 6126a92
Force-pushed from 6126a92 to 9f4065a
Left some initial comments.
The code here makes a lot of use of std::move in places where we have used const & instead. We should keep using const references. I don't see that moving really brings a performance benefit here; I rather feel it is the other way around.
class ReadOnlyStream {
 public:
  using Ch = char;
I don't really like this 4->2 letter abbreviation. It worsens readability enormously for me.
This is a convention (a requirement maybe?) of rapidjson streams https://github.com/Tencent/rapidjson/blob/master/include/rapidjson/stream.h#L27
cpp/src/arrow/json/chunker.cc (Outdated)
class StringStream : public ReadOnlyStream {
 public:
  using Ch = char;
Please remove this abbreviation
That typedef is part of RapidJSON's stream interface, but I can refactor the rest of the class to use char instead.
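For context, here is a minimal sketch (not the actual Arrow class; the name is hypothetical) of an input-only stream satisfying RapidJSON's stream concept, which is where the `Ch` typedef comes from. The read side needs `Peek`/`Take`/`Tell`; the write-side methods only need to exist and typically just assert for read-only streams.

```
#include <cassert>
#include <cstddef>

// Hypothetical read-only stream satisfying RapidJSON's stream concept.
class StringViewStream {
 public:
  using Ch = char;  // required by the RapidJSON stream concept

  StringViewStream(const char* data, std::size_t size)
      : data_(data), size_(size), pos_(0) {}

  Ch Peek() const { return pos_ < size_ ? data_[pos_] : '\0'; }
  Ch Take() { return pos_ < size_ ? data_[pos_++] : '\0'; }
  std::size_t Tell() const { return pos_; }

  // Write-side methods are unused for parsing-only streams.
  Ch* PutBegin() { assert(false); return nullptr; }
  void Put(Ch) { assert(false); }
  void Flush() { assert(false); }
  std::size_t PutEnd(Ch*) { assert(false); return 0; }

 private:
  const char* data_;
  std::size_t size_;
  std::size_t pos_;
};
```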
Force-pushed from 9e158df to ca59903
Looking good so far. Please ping again before you merge.
I will give this a look through today and then merge if there are no build issues. We can address further things in follow-up patches.
+1. I'm going to go ahead and merge this once the build passes. I left comments -- a lot of the polishing and improvements can happen in follow-up patches. I think we should soon implement Python bindings and begin to drive feature and performance work based on feature requirements.
On the face of it, the parsing performance of ~77 MB/s seems slower than I would have expected, but I didn't do any perf tests of other libraries to get a baseline. I made a quick flamegraph and it seems about 20% of runtime is spent in SetFieldBuilder. We may need to make some optimizations for the common case where fields appear in the same order in each record.
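As a rough sketch of that optimization idea (not Arrow's actual SetFieldBuilder code; the names here are hypothetical): cache the position each field held in the previous record and try that slot first, falling back to a name lookup only when the order changes.

```
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical field-index helper with a fast path for records whose keys
// arrive in the same order every time.
class FieldIndex {
 public:
  explicit FieldIndex(std::vector<std::string> names) : names_(std::move(names)) {
    for (int i = 0; i < static_cast<int>(names_.size()); ++i) {
      by_name_[names_[i]] = i;
    }
  }

  // `ordinal` is the position of the key within the current record.
  int Lookup(int ordinal, const std::string& key) const {
    // Fast path: the key sits where it sat in previous records.
    if (ordinal < static_cast<int>(names_.size()) && names_[ordinal] == key) {
      return ordinal;
    }
    auto it = by_name_.find(key);
    return it == by_name_.end() ? -1 : it->second;
  }

 private:
  std::vector<std::string> names_;
  std::unordered_map<std::string, int> by_name_;
};
```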
/// \brief A reusable block-based chunker for JSON data
///
/// The chunker takes a block of JSON data and finds a suitable place
/// to cut it up without splitting an object.
Since there are no benchmarks yet for the Chunker code, whoever works on this next needs to write some before doing any more work on the API or implementation
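For illustration, here is a minimal sketch of the chunking idea described in the quoted doc comment, assuming newline-delimited JSON so that a block can be cut at the last newline without splitting an object. This is not the Arrow Chunker API; the function name is hypothetical.

```
#include <cstddef>
#include <string_view>

// Returns the length of the prefix of `block` that ends at the last newline,
// i.e. a prefix containing only whole objects. The remainder is a possibly
// partial object that should be carried over into the next block.
inline std::size_t WholeObjectPrefixLength(std::string_view block) {
  const std::size_t last_newline = block.find_last_of('\n');
  return last_newline == std::string_view::npos ? 0 : last_newline + 1;
}
```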
Instead of a subclass of ArrayBuilder, type inference now uses the new interface AdaptiveArrayBuilder. This class is designed to accommodate a changing element type while building.
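As a loose illustration of that idea (not the actual AdaptiveArrayBuilder interface; all names here are hypothetical): track the most general element type observed so far and promote it as values arrive.

```
// Hypothetical sketch: promote null -> int64 -> double -> string as new
// kinds of values are observed while building.
enum class Kind { kNull, kInt64, kDouble, kString };

class AdaptiveTypeTracker {
 public:
  void ObserveInt() { Promote(Kind::kInt64); }
  void ObserveDouble() { Promote(Kind::kDouble); }
  void ObserveString() { Promote(Kind::kString); }

  Kind current() const { return kind_; }

 private:
  void Promote(Kind observed) {
    if (static_cast<int>(observed) > static_cast<int>(kind_)) kind_ = observed;
  }

  Kind kind_ = Kind::kNull;
};
```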
Also minor refactorings in the parser, renamed to the new un-prefixed convention.
BlockParser now stores both strings and unconverted numbers as a `dictionary<indices::Int32, dictionary::String>` for later parsing.
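A hedged sketch of that storage scheme (not the actual BlockParser code; the class name is hypothetical): raw values are kept as int32 indices into a dictionary of unique strings, so numbers can be converted later once their target type is known.

```
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical dictionary of unique strings addressed by int32 indices.
class StringDictionary {
 public:
  int32_t GetOrInsert(const std::string& value) {
    auto it = index_.find(value);
    if (it != index_.end()) return it->second;
    const int32_t id = static_cast<int32_t>(values_.size());
    values_.push_back(value);
    index_.emplace(value, id);
    return id;
  }

  const std::string& Value(int32_t id) const { return values_[id]; }

 private:
  std::vector<std::string> values_;
  std::unordered_map<std::string, int32_t> index_;
};
```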
(It is a stand-in for the eventual TableReader impl.) Also: do not rely on pointer identity of tags (`std::shared_ptr<const KeyValueMetadata>`).
Useful:
```
clang-query-7 -p=compile_commands.json \
  cpp/src/arrow/json/* \
  -c='match ifStmt(hasThen(stmt(unless(compoundStmt()))))'
```
Change-Id: I8e632a786ba6cc26afa3fb14e8769013ffe9a28c
Change-Id: Ibe9148f4210c9a1f794d7c928d7e70c5ddcaae56
Force-pushed from d2042d2 to 76d0431
Change-Id: I96d6a741ecef9dbfced5ef7ee4c4fcdc697ad64e
The chunker test was failing valgrind, so I disabled it.
This looks like a race condition with the Flatbuffers build.
…ndition with Flatbuffers
Change-Id: Ief2c1de54e39731fb17ffe6b5c306152b0c76169
Rebasing should help; I've fixed this race recently.
+1, thanks @bkietz! This puts us on a good path to getting some Arrow-powered JSON into the hands of users
(abandoning #3206)
Adds a `json` subproject with: