
ARROW-694: [C++] json reader, WIP #3206

Closed

bkietz wants to merge 8 commits into apache:master from bkietz:ARROW-694-json-reader

Conversation

@bkietz
Member

@bkietz bkietz commented Dec 17, 2018

Work in progress

Instead of a subclass of ArrayBuilder, type inference now uses the
new interface AdaptiveArrayBuilder. This class is designed to accommodate
changing element type while building.
@pitrou
Member

pitrou commented Dec 17, 2018

If it's WIP I would recommend adding "WIP" to the PR title and/or applying the "WIP" label.

@bkietz bkietz changed the title ARROW-694: [C++] json reader ARROW-694: [C++] json reader, WIP Dec 17, 2018
@wesm
Member

wesm commented Dec 18, 2018

@pitrou can you help with reviewing this (since it's similar to the CSV scanner)? I intend to leave comments today

@pitrou
Member

pitrou commented Dec 18, 2018

Yes, though it seems to be unfinished?

@wesm
Member

wesm commented Dec 18, 2018

Yes, indeed, so these would be high level comments like "this seems to be the correct approach to solving the problem"


Status BlockParser::Parse(const char* data, uint32_t size, uint32_t* out_size) {
  constexpr unsigned parse_flags =
      rapidjson::kParseInsituFlag | rapidjson::kParseIterativeFlag |
Member

AFAIK, you can't use in-situ parsing with a const char*, as it will modify the input.

Member Author

Oops. There's a lot of copypasta from your CSV reader. I'll correct this, thanks

Status Parse(const char* data, uint32_t size, uint32_t* out_size);

/// \brief Extract parsed data as a RecordBatch
Status Finish(std::shared_ptr<RecordBatch>* parsed) {
Member

Type inference will have to work at the level of all blocks (of which there are several, otherwise you can't parse in parallel). So you can't return a RecordBatch here, since you can't infer the column types from a single block.

Of course, this assumes we want to do type inference at all, since JSON is already (partially) typed. @wesm what is the plan here?

Member

JSON files can have variation from one line to another (e.g. fields missing from one object, but not another), so we do need to have both type inference as well as user-specified field types

Member Author

@wesm I thought the intention was to either provide a schema (in which case fields outside the schema are ignored) or to infer types:

Reading should support two modes

  • Type inference: use observed types to determine schema for table
  • Passed schema: use a passed schema to normalize types. This will insert nulls when a field is not found, or ignore a key not found in the schema

Are we amending that to allow mixing those?

Member Author

sorry, nevermind. I misread your comment

Member

Passing a schema is just one way to indicate field types -- also if you do pass a schema, then fields not contained in the schema should be discarded. I think we also want to allow for passing types for one or more fields (allowing others to be inferred) in the same style as the CSV reader.

To give an example of how this might be useful: if we have dates in a file represented in some particular way (e.g. something we can parse with strptime) then we could indicate the conversion rule for just that field

This doesn't all have to get done in a single PR, but it's worth keeping in mind for structuring the abstractions so that this functionality can be developed incrementally.

Member Author

Could you add this to the design Google doc?

Member

yes, will do

@pitrou
Member

pitrou commented Dec 18, 2018

My overall opinion is that the inference design looks a bit convoluted, and potentially incompatible with the existence of multiple blocks (unless the plan is to cast to the final types at the end, which may be wasteful?).

@wesm
Member

wesm commented Dec 18, 2018

I will have a look later today.

There may be cases like:

Block 0: field A has type struct<B: null>
Block 1: field A has type struct<B: int>
Block 2: field A has type struct<B: int, C: null>
Block 3: field A has type struct<B: null, C: string>

In such a case, we'd want to be able to promote all the blocks to the common type struct<B: int, C: string>, which wouldn't require too much work (a bit of bitmap splicing, perhaps).

The other mode of reading would be with a provided schema A: struct<B: int, C: string>, where each block would be coerced to the type (safely) without the need for inference / unification

@bkietz
Member Author

bkietz commented Dec 18, 2018

In the case of type inference, instead of just finishing to a RecordBatch, the root builder from each block will be kept until all blocks have been parsed. AdaptiveArrayBuilder supports promotion, so as the overall schema is updated and expanded the builders can be promoted. When all blocks are complete and agree on a schema, a Table can be constructed.

@pitrou sorry it's convoluted, though it's a bit more complicated than CSV since I need to support nested types. I think my approach to multiple blocks is pretty much "cast to final types at the end"; could you describe what kind of waste you anticipate?

@pitrou
Member

pitrou commented Dec 18, 2018

In the case of type inference, instead of just finishing to a RecordBatch, the root builder from each block will be kept until all blocks have been parsed.

I see.

I think my approach to multiple blocks is pretty much "cast to final types at the end"; could you describe what kind of waste you anticipate?

Mostly that depends if casting is cheap (as @wesm thinks) or expensive. For example if you have to cast int64 to double it will be expensive.

Int64OrDoubleBuilder(MemoryPool* pool)
    : AdaptiveArrayBuilder(initial_type()), bytes_builder_(pool) {}

Status Append(string_view repr) {
Member

Tricky case: is the string "-0" treated consistently?

Member Author

Hmmm. If that were encountered before a cast to double, then we'd get double(int64_t(-0)), which would be double(+0). If it's encountered afterwards, then we'd get double(-0). Guess not

@wesm
Member

wesm commented Dec 19, 2018

I plan to leave some high level comments on this today now that I've got the CI / build system craziness sorted out for now

@wesm
Member

wesm commented Dec 20, 2018

I wasn't able to get very far in my review today, got distracted by some other things. Plan to look tomorrow morning

@wesm
Member

wesm commented Dec 20, 2018

Working on review, will put up comments by early afternoon (central time)

Member

@wesm wesm left a comment

I left a handful of particular comments, I haven't dug too deeply into the inferring handler, but will keep reading and leaving comments.

As a high level critique: I would guess that having value conversion and error checking at the cell-level vs. vector-level is both slower and also will make the code significantly more complicated and difficult to extend. If I were doing a from-scratch implementation, I would have the SAX handlers do as little work as possible -- accumulate data in the correct kind of primitive builder and check for invalid type mixing, i.e.

  • NULL can transition to any
  • INT64 can transition to DOUBLE

Any other state transition (away from DOUBLE, LIST, STRUCT, BOOL, or VARBINARY/STRING) is invalid during the JSON parse step. This means you have a state machine with 7 states (I think that's right).

After you have accumulated the raw data, then do further conversions. This also allows for easier extensibility.

This is more similar to the way that the CSV parser is structured right now, which also means that some code can possibly be shared.

@pitrou thoughts about this general code structuring question?

// Parsing options

// Optional explicit schema (no type inference, ignores other fields)
std::shared_ptr<Schema> explicit_schema;
Member

We probably want something similar to https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/options.h#L68. As one complexity, that data structure won't support nested fields, so we might need to come up with a path specifier (e.g. you could specify that foo.bar.baz should be a list of timestamps). Does not need to be done in this patch but worth keeping in mind

We'll eventually need an option for whether to allow new fields (whose types would be inferred) when a schema is given. When you supply a schema, this option is turned off (new fields ignored). In theory some users might want there to be an error when an unanticipated field is encountered

Member Author

{"foo.bar.baz": 1} is valid JSON, so it couldn't just be a map<string, type>. With the addition of converters it sounds like we need something just as nested as Schema, but more configurable.

Thinking about the Python API:

opts = ConvertOptions()
opts.fields = {
  # 'a' is int32
  'a': pa.int32(),

  # 'b' is string, non-nullable
  'b': ConvertOptions.required(pa.string()),

  'c': {
    # specify a converter for 'c.d'
    'd': ConvertOptions.binary_from_base64
  }
}
opts.unspecified_fields = 'ignore' # or 'infer', 'error'
pa.read_json(buf, opts)

Member Author

@wesm would that cover all permutations of conversion we want to support?

Member

Yes, I believe so

Contributor

Are we suggesting this over a schema? I think it makes sense to accept an optional explicit schema and optional convert/serialization transformations (string -> u64, b64 -> binary...).

Member

I think we are saying both. Supplying a schema is one way to initialize the conversion options

Member Author

@bkietz bkietz Dec 21, 2018

Since this is more general than a schema, I think it'd be pretty straightforward to convert a schema into a ConversionStrategy. Passing a schema for opts.fields would be equivalent to

opts.fields =  {
  nm: ConvertOptions.required(typ) if not schm.field_by_name(nm).nullable else typ
  for nm, typ in zip(schm.names, schm.types)
}


std::shared_ptr<Array> GetColumn(const RecordBatch& batch, const std::string& name) {
  return batch.column(batch.schema()->GetFieldIndex(name));
}
Member

Let's add RecordBatch::GetColumnByName -- since we have similar methods in Schema, StructType

Member Author

And Table::GetColumnByName()?

Member

yes

os << std::quoted(fields[i]->name()) << ":";
return GenerateValue(fields[i]->type(), e, os);
}));
os << "}";
Member

We could use RapidJSON to generate the example JSON

std::abort();
}
std::shared_ptr<RecordBatch> parsed;
ABORT_NOT_OK(parser.Finish(&parsed));
Member

It seems like everything below this line is going to skew the results

BenchmarkJSONParsing(state, json.str(), num_rows, options);
}

BENCHMARK(BM_ParseJSONQuotedBlock)->Repetitions(3)->Unit(benchmark::kMicrosecond);
Member

Maybe MinTime(1.0) instead of Repetitions (I've found it can produce less noisy results)?


bool String(const char* data, rapidjson::SizeType size, bool) {
  if (Skipping()) return true;
  status_ = VisitBuilder(builder_, AppendStringVisitor{data, size});
Member

Having to do full dynamic dispatch here seems perhaps excessive. It might be better to limit to two cases here: variable length and fixed length strings, and deal with timestamp conversions in "post-processing". Having the ability to post-process strings into some other type is important anyway, since we may need to provide other kinds of conversion rules (e.g using strptime to convert strings in some format to dates)


bool Key(const char* key, rapidjson::SizeType len, bool) {
  MaybeStopSkipping();  // new key at the depth where we started skipping ->
                        // terminate skipping
Member

Putting the comment in the line(s) before the line of code would be more consistent with the commenting style in the codebase

// efficient to use CountLeadingZeros() to find the indices of the few
// null fields
if (null) {
status_ = VisitBuilder(parent->field_builder(field_index), AppendNullVisitor{});
Member

Should we add ArrayBuilder::VirtualAppendNull (or similar?)

Member Author

I'd say for performance let's avoid virtual, but it would definitely be more user friendly to have a "works-for-all-builders" set of append functions. Those could use VisitBuilderInline under the hood, and maybe they should be non-members to avoid obfuscating the general requirement that builders be downcast before use:

/// \brief Downcast builder as necessary and append null
inline Status AppendNull(ArrayBuilder* builder) {
  struct {
    // Visit(...)
  } visitor;
  return VisitBuilderInline(builder, &visitor);
}

(see also ARROW-4067 RFC: standardize ArrayBuilder subclasses)

Member

OK, no problem

if (!converter(data_, size_, value)) {
  return ConversionError();
}
return Status::OK();
Member

It could be better to do conversions in post-processing. @pitrou thoughts?

Member

@pitrou pitrou Dec 20, 2018

You mean first building a string array and then convert it to another type? I think that's going to be slow / slower (interestingly, appending one string at a time to a binary array isn't very fast).

Member

Well, a problem we have right now is that the only non-string conversion supported is timestamps. So in any case we will eventually need to support a conversion-in-post-processing mode

template <typename T, typename... A>
std::unique_ptr<T> make_unique(A&&... args) {
  return std::unique_ptr<T>(new T(std::forward<A>(args)...));
}
Member

There's other places we could use this, we should place in arrow/util somewhere

Member Author

will do

@pitrou
Member

pitrou commented Dec 20, 2018

I don't think it's worthwhile to try and share code with the CSV parser. Also it will make evolving one of the two designs more difficult.

I was a bit surprised by the adaptive builder approach at first, but it might be interesting to go with it and see the results (in terms of maintenance and performance).

@wesm
Member

wesm commented Dec 20, 2018

I was a bit surprised by the adaptive builder approach at first, but it might be interesting to go with it and see the results (in terms of maintenance and performance).

Yes -- I would describe my comments as mostly a "hot take" and perhaps superficial since I have not worked directly on this project myself. Ultimately performance requirements are going to drive the design of the computationally intensive part of this, and in the meantime it's probably best to focus on defining the public API and behavior so that we can refactor or try different approaches without breaking anything.

bool use_threads = true;
// Block size we request from the IO layer; also determines the size of
// chunks when use_threads is true
int32_t block_size = 1 << 20; // 1 MB
Contributor

@fsaintjacques fsaintjacques Dec 21, 2018

I think it would make more sense to partition on number of rows as opposed to bytes. That would be a lot more friendly to concurrent parsing; same for CSV.

Member

Would it though? Depending on the size of lines, a block could be 500KB in some cases or 50MB in others.

Member Author

It would be nice, but this is used at read time (this is the size of blocks loaded from a file) before any parsing has been done, so we don't know where the row boundaries are. Finding those boundaries is what Chunker is for (which I haven't implemented yet, but CSV has).

Contributor

@fsaintjacques fsaintjacques Dec 21, 2018

@wesm while this is possible, in my time working in ad tech this was the exception more than the norm. Byte-addressable chunking suffers from this problem; it only takes one huge line on one of the boundaries.

Also I think it makes sense to dictate the size of chunks, especially if you carefully choose a size that rounds to a number of pages, say 8, 16, or 32k. And in that case, partitioning on the number of bytes requires a re-shuffling to get chunks of equal size.

Member

I think the concern is a bit hypothetical / low priority for the moment. Whether chunking is based on size or number of rows is not a detail that should impact the design of the rest of the code, so we don't need to burden ourselves with it right now.

In CSV files I don't think chunking based on number of rows is always the best option. I've seen many files with 1000 columns or more in the wild. If your goal is eventually to transport the data on the wire (e.g. with gRPC), then you will end up wanting to split datasets into chunks that are around a certain size (8MB or so) anyway

Member

One concern with chunking by number of rows is that chunking becomes more expensive, and may become a bottleneck because it's single-threaded. It's probably possible to optimize it (e.g. using masks and perhaps SIMD ops), but it's not as easy as a fast byte-count chunker.

Note that chunking really has two different goals here:

  • the primary goal is to allow efficient multi-threaded ingestion of CSV or JSON data
  • the secondary goal (which is more of a nice-to-have) is to have a chunked output with chunks of a reasonable size

@wesm
Member

wesm commented Jan 4, 2019

@bkietz what do you think makes most sense on this project re: development workflow? It is OK for the API to be experimental for a little while, and to let things develop over a series of incremental patches to refactor and improve things. Once an initial patch is merged, others can add Python, GLib, and Ruby bindings

@wesm
Member

wesm commented Feb 8, 2019

superseded by #3592

@wesm wesm closed this Feb 8, 2019
wesm pushed a commit that referenced this pull request Feb 20, 2019
…dBatches

( abandoning #3206 )

Adds [`json` sub project](https://github.com/apache/arrow/pull/3592/files#diff-2443c7d7b39b992ea580f0fbd387284a) with:

- BlockParser which parses Buffers of json formatted data into a StructArray with minimal conversion
  * true/false, and null fields are stored in BooleanArray and NullArray respectively
  * strings are stored as indices into a single StringArray
  * numbers are not converted; their string representations are stored alongside string values
  * nested fields are stored as ListArray or StructArray of their parsed (unconverted) children
- Three approaches to handling unexpected fields:
  1. Error on an unexpected field
  2. Ignore unexpected fields
  3. Infer the type of unexpected fields and add them to the schema
- [Convenience interface](https://github.com/apache/arrow/pull/3592/files#diff-d043a0249cc485b08d93767d2075bd83R124) for parsing a single chunk of json data into a RecordBatch with fully converted columns
- Chunker to process a stream of unchunked data for use by BlockParser (not currently used)

Author: Benjamin Kietzman <bengilgit@gmail.com>
Author: Wes McKinney <wesm+git@apache.org>

Closes #3592 from bkietz/ARROW-694-json-reader-WIP and squashes the following commits:

e42e5d7 <Wes McKinney> Add arrow_dependencies to arrow_flight.so dependencies to fix race condition with Flatbuffers
d67aff6 <Benjamin Kietzman> adding more comments to parser.cc
554b595 <Benjamin Kietzman> adding explanatory comments to json/parser.cc
0d0caa9 <Benjamin Kietzman> Add ARROW_PREDICT_* to conditions in parser.cc
d7d0a2e <Benjamin Kietzman> fix doc error in chunker.h
29c2312 <Wes McKinney> Disable arrow-json-chunker-test
76d0431 <Wes McKinney> cmake-format
83150ec <Wes McKinney> Restore BinaryBuilder::UnsafeAppend for const std::string&
05fd9d1 <Benjamin Kietzman> add json project back (merge error)
69541ea <Benjamin Kietzman> correct test util includes
be720c8 <Benjamin Kietzman> Use compound statements in if()
baac467 <Benjamin Kietzman> use const shared_ptr<T>& instead of move
6e86078 <Benjamin Kietzman> Move ParseOne to reader.cc
1bebda8 <Benjamin Kietzman> clean up Chunker's stream usage
0d6d920 <Benjamin Kietzman> disabling chunker test
c1a7f4b <Benjamin Kietzman> check status for generating list elements
292672a <Benjamin Kietzman> add inline tag
f92f850 <Benjamin Kietzman> add Status return to Generate
e334648 <Benjamin Kietzman> remove misplaced const
9679b6f <Benjamin Kietzman> fix format issue, use SFINAE to detect StringConverter default constructibility
aaaf9e7 <Benjamin Kietzman> remove bitfields
10e4f0d <Benjamin Kietzman> add missing virtual destructor
92ffc64 <Benjamin Kietzman> adding ParseOne interface for dead simple parsing
330615b <Benjamin Kietzman> adding first draft of parsing with type inference
69d7c5c <Benjamin Kietzman> Rewrite parser to defer conversion of strings and numbers
b9d5c3d <Benjamin Kietzman> adding Chunker implementation and tests
2677a57 <Benjamin Kietzman> use recommended loop style
e39a4a9 <Benjamin Kietzman> add (failing) test for '-0' consistency
0f3a3bc <Benjamin Kietzman> Added trivial parser benchmark and data generator
6472738 <Benjamin Kietzman> Refactored type inferrence
cb6a313 <Benjamin Kietzman> adding first draft of type inferrence to BlockParser
cc3698a <Benjamin Kietzman> refactoring Schema::GetFieldIndex to return int
17176f9 <Benjamin Kietzman> first sketch of JSON parser
@bkietz bkietz deleted the ARROW-694-json-reader branch February 22, 2019 16:15
QuietCraftsmanship pushed a commit to QuietCraftsmanship/arrow that referenced this pull request Jul 7, 2025
…dBatches
