Conversation
Instead of a subclass of ArrayBuilder, type inference now uses the new interface AdaptiveArrayBuilder. This class is designed to accommodate a changing element type while building.
If it's WIP, I would recommend adding "WIP" to the PR title and/or applying the "WIP" label.
@pitrou can you help with reviewing this (since it's similar to the CSV scanner)? I intend to leave comments today.
Yes, though it seems to be unfinished?
Yes, indeed, so these would be high-level comments like "this seems to be the correct approach to solving the problem".
Status BlockParser::Parse(const char* data, uint32_t size, uint32_t* out_size) {
  constexpr unsigned parse_flags =
      rapidjson::kParseInsituFlag | rapidjson::kParseIterativeFlag |
AFAIK, you can't use in-situ parsing with a const char*, as it will modify the input.
Oops. There's a lot of copypasta from your CSV reader. I'll correct this, thanks
Status Parse(const char* data, uint32_t size, uint32_t* out_size);
/// \brief Extract parsed data as a RecordBatch
Status Finish(std::shared_ptr<RecordBatch>* parsed) {
Type inference will have to work at the level of all blocks (of which there are several; otherwise you can't parse in parallel). So you can't return a RecordBatch here, since you can't infer the column types from a single block.
Of course, this assumes we want to do type inference at all, since JSON is already (partially) typed. @wesm what is the plan here?
JSON files can have variation from one line to another (e.g. fields missing from one object, but not another), so we do need to have both type inference as well as user-specified field types
@wesm I thought the intention was to either provide a schema (in which case fields outside the schema are ignored) or to infer types:
Reading should support two modes:
- Type inference: use observed types to determine schema for table
- Passed schema: use a passed schema to normalize types. This will insert nulls when a field is not found, or ignore a key not found in the schema
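In other words, the passed-schema mode normalizes each record against the schema's field names. A toy sketch, with plain dicts standing in for Arrow records (the `normalize` helper is hypothetical, not an Arrow API):

```python
def normalize(record, schema_fields):
    """Apply a passed schema: missing fields become null (None),
    and keys not found in the schema are ignored."""
    return {name: record.get(name) for name in schema_fields}

# field "b" is missing -> null; "z" is not in the schema -> dropped
print(normalize({"a": 1, "z": 9}, ["a", "b"]))  # {'a': 1, 'b': None}
```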
Are we amending that to allow mixing those?
sorry, nevermind. I misread your comment
Passing a schema is just one way to indicate field types -- also if you do pass a schema, then fields not contained in the schema should be discarded. I think we also want to allow for passing types for one or more fields (allowing others to be inferred) in the same style as the CSV reader.
To give an example of how this might be useful: if we have dates in a file represented in some particular way (e.g. something we can parse with strptime) then we could indicate the conversion rule for just that field
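For illustration, such a per-field rule might look like the following sketch (the `conversions` mapping and `convert_cell` helper are hypothetical, not an Arrow API):

```python
from datetime import datetime

# Hypothetical per-field conversion rules: only "when" gets an explicit
# strptime rule; any field without a rule would be left for inference.
conversions = {
    "when": lambda s: datetime.strptime(s, "%Y-%m-%d"),
}

def convert_cell(field_name, raw_value):
    rule = conversions.get(field_name)
    return rule(raw_value) if rule else raw_value

print(convert_cell("when", "2019-02-01"))  # 2019-02-01 00:00:00
print(convert_cell("n", "7"))              # '7' (unchanged, left to inference)
```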
This doesn't all have to get done in a single PR, but it's worth keeping in mind for structuring the abstractions so that this functionality can be developed incrementally.
Could you add this to the design Google doc?
My overall opinion is that the inference design looks a bit convoluted, and potentially incompatible with the existence of multiple blocks (unless the plan is to cast to the final types at the end, which may be wasteful?).
I will have a look later today. There may be cases like: Block 0: field A has type struct<B: null>. In such cases, we'd want to be able to promote all the blocks to the common type. The other mode of reading would be with a provided schema.
In the case of type inference, instead of just finishing to a RecordBatch, the root builder from each block will be kept until all blocks have been parsed. @pitrou sorry it's convoluted, though it's a bit more complicated than CSV since I need to support nested types. I think my approach to multiple blocks is pretty much "cast to final types at the end"; could you describe what kind of waste you anticipate?
I see.
Mostly that depends on whether casting is cheap (as @wesm thinks) or expensive. For example, if you have to cast int64 to double, it will be expensive.
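Cost aside, a wholesale int64-to-double cast is also lossy above 2**53, which is worth keeping in mind for the cast-at-the-end approach:

```python
# IEEE-754 doubles have a 53-bit significand, so int64 values beyond
# 2**53 cannot all be represented exactly after a cast to double.
exact = 2 ** 53
assert float(exact) == exact        # representable exactly
assert float(exact + 1) == exact    # 2**53 + 1 rounds back down to 2**53
```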
Int64OrDoubleBuilder(MemoryPool* pool)
    : AdaptiveArrayBuilder(initial_type()), bytes_builder_(pool) {}
Status Append(string_view repr) {
Tricky case: is the string "-0" treated consistently?
Hmmm. If that were encountered before a cast to double, then we'd get double(int64_t(-0)), which would be double(+0). If it's encountered afterwards, then we'd get double(-0). Guess not.
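The inconsistency is easy to reproduce (Python's float standing in for a C++ double):

```python
import math

# "-0" parsed as an integer first loses its sign before the cast to
# double; parsed directly as a double it keeps the negative zero.
before_cast = float(int("-0"))  # int64 path: -0 becomes +0
after_cast = float("-0")        # double path: keeps -0.0

print(math.copysign(1.0, before_cast))  # 1.0  (i.e. +0.0)
print(math.copysign(1.0, after_cast))   # -1.0 (i.e. -0.0)
```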
Also minor refactorings in the parser; renamed to the new un-prefixed convention.
I plan to leave some high-level comments on this today now that I've got the CI / build system craziness sorted out for now.
I wasn't able to get very far in my review today; I got distracted by some other things. I plan to look tomorrow morning.
Working on review, will put up comments by early afternoon (central time).
I left a handful of specific comments. I haven't dug too deeply into the inferring handler, but I will keep reading and leaving comments.
As a high-level critique: I would guess that having value conversion and error checking at the cell level vs. the vector level is both slower and will also make the code significantly more complicated and difficult to extend. If I were doing a from-scratch implementation, I would have the SAX handlers do as little work as possible -- accumulate data in the correct kind of primitive builder and check for invalid type mixing, i.e.:
- NULL can transition to any
- INT64 can transition to DOUBLE
Any other state transition (away from DOUBLE, LIST, STRUCT, BOOL, or VARBINARY/STRING) is invalid during the JSON parse step. This means you have a state machine with 7 states (I think that's right).
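A sketch of that state machine (the state names are illustrative, not Arrow's actual enums):

```python
# Seven observable states during the JSON parse step.
NULL, BOOL, INT64, DOUBLE, STRING, LIST, STRUCT = (
    "null", "bool", "int64", "double", "string", "list", "struct",
)

def transition(current, observed):
    """Return the promoted column state, or None if the mix is invalid."""
    if current == observed:
        return current
    if current == NULL:      # NULL can transition to any state
        return observed
    if observed == NULL:     # nulls are accepted without leaving any state
        return current
    if {current, observed} == {INT64, DOUBLE}:
        return DOUBLE        # INT64 promotes to DOUBLE, never the reverse
    return None              # any other transition is invalid during parsing
```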
After you have accumulated the raw data, then do further conversions. This also allows for easier extensibility.
This is more similar to the way that the CSV parser is structured right now, which also means that some code can possibly be shared.
@pitrou thoughts about this general code structuring question?
// Parsing options
// Optional explicit schema (no type inference, ignores other fields)
std::shared_ptr<Schema> explicit_schema;
We probably want something similar to https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/options.h#L68. As one complexity, that data structure won't support nested fields, so we might need to come up with a path specifier (e.g. you could specify that foo.bar.baz should be a list of timestamps). Does not need to be done in this patch but worth keeping in mind
We'll eventually need an option for whether to allow new fields (whose types would be inferred) when a schema is given. When you supply a schema, this option is turned off (new fields are ignored). In theory, some users might want there to be an error when an unanticipated field is encountered.
{"foo.bar.baz": 1} is valid JSON, so it couldn't just be a map<string, type>. With the addition of converters it sounds like we need something just as nested as Schema, but more configurable.
Thinking about the Python API:
opts = ConvertOptions()
opts.fields = {
    # 'a' is int32
    'a': pa.int32(),
    # 'b' is string, non-nullable
    'b': ConvertOptions.required(pa.string()),
    'c': {
        # specify a converter for 'c.d'
        'd': ConvertOptions.binary_from_base64
    }
}
opts.unspecified_fields = 'ignore'  # or 'infer', 'error'
pa.read_json(buf, opts)
@wesm would that cover all permutations of conversion we want to support?
Are we suggesting this over a schema? I think it makes sense to accept an optional explicit schema and optional convert/serialization transformations (string -> u64, b64 -> binary, ...).
I think we are saying both. Supplying a schema is one way to initialize the conversion options
Since this is more general than a schema, I think it'd be pretty straightforward to convert a schema into a ConversionStrategy. Passing a schema for opts.fields would be equivalent to
opts.fields = {
    nm: ConvertOptions.required(typ) if not schm.field_by_name(nm).nullable else typ
    for nm, typ in zip(schm.names, schm.types)
}
std::shared_ptr<Array> GetColumn(const RecordBatch& batch, const std::string& name) {
  return batch.column(batch.schema()->GetFieldIndex(name));
}
Let's add RecordBatch::GetColumnByName -- since we have similar methods in Schema, StructType
And Table::GetColumnByName()?
os << std::quoted(fields[i]->name()) << ":";
return GenerateValue(fields[i]->type(), e, os);
}));
os << "}";
We could use RapidJSON to generate the example JSON
std::abort();
}
std::shared_ptr<RecordBatch> parsed;
ABORT_NOT_OK(parser.Finish(&parsed));
It seems like everything below this line is going to skew the results
BenchmarkJSONParsing(state, json.str(), num_rows, options);
}

BENCHMARK(BM_ParseJSONQuotedBlock)->Repetitions(3)->Unit(benchmark::kMicrosecond);
Maybe MinTime(1.0) instead of Repetitions (I've found it can produce less noisy results)?
bool String(const char* data, rapidjson::SizeType size, bool) {
  if (Skipping()) return true;
  status_ = VisitBuilder(builder_, AppendStringVisitor{data, size});
Having to do full dynamic dispatch here seems perhaps excessive. It might be better to limit to two cases here: variable length and fixed length strings, and deal with timestamp conversions in "post-processing". Having the ability to post-process strings into some other type is important anyway, since we may need to provide other kinds of conversion rules (e.g. using strptime to convert strings in some format to dates).
bool Key(const char* key, rapidjson::SizeType len, bool) {
  MaybeStopSkipping();  // new key at the depth where we started skipping ->
                        // terminate skipping
Putting the comment in the line(s) before the line of code would be more consistent with the commenting style in the codebase
// efficient to use CountLeadingZeros() to find the indices of the few
// null fields
if (null) {
  status_ = VisitBuilder(parent->field_builder(field_index), AppendNullVisitor{});
Should we add ArrayBuilder::VirtualAppendNull (or similar?)
I'd say for performance let's avoid virtual, but it would definitely be more user-friendly to have a "works-for-all-builders" set of append functions. Those could use VisitBuilderInline under the hood, and maybe they should be non-members to avoid obfuscating the general requirement that builders be downcast before use:
/// \brief Downcast builder as necessary and append null
inline Status AppendNull(ArrayBuilder* builder) {
  struct {
    // Visit(...)
  } visitor;
  return VisitBuilderInline(builder, &visitor);
}

(see also ARROW-4067 RFC: standardize ArrayBuilder subclasses)
if (!converter(data_, size_, value)) {
  return ConversionError();
}
return Status::OK();
It could be better to do conversions in post-processing. @pitrou thoughts?
You mean first building a string array and then convert it to another type? I think that's going to be slow / slower (interestingly, appending one string at a time to a binary array isn't very fast).
Well, a problem we have right now is that the only non-string conversion supported is timestamps. So in any case we will eventually need to support a conversion-in-post-processing mode
template <typename T, typename... A>
std::unique_ptr<T> make_unique(A&&... args) {
  return std::unique_ptr<T>(new T(std::forward<A>(args)...));
}
There are other places we could use this; we should put it in arrow/util somewhere.
I don't think it's worthwhile to try to share code with the CSV parser. Also, it would make evolving either of the two designs more difficult. I was a bit surprised by the adaptive builder approach at first, but it might be interesting to go with it and see the results (in terms of maintenance and performance).
Yes -- I would describe my comments as mostly a "hot take" and perhaps superficial since I have not worked directly on this project myself. Ultimately performance requirements are going to drive the design of the computationally intensive part of this, and in the meantime it's probably best to focus on defining the public API and behavior so that we can refactor or try different approaches without breaking anything.
bool use_threads = true;
// Block size we request from the IO layer; also determines the size of
// chunks when use_threads is true
int32_t block_size = 1 << 20;  // 1 MB
I think it would make more sense to partition on number of rows as opposed to bytes. That would be a lot friendlier to concurrent parsing; same for CSV.
Would it though? Depending on the size of lines, a block could be 500KB in some cases or 50MB in others.
It would be nice, but this is used at read time (this is the size of blocks loaded from a file), before any parsing has been done, so we don't know where the row boundaries are. Finding those boundaries is what Chunker is for (which I haven't implemented yet, but CSV has one).
@wesm while this is possible, in my time working in ad tech this was the exception more than the norm. Byte-addressable chunking suffers from this problem; it only takes one huge line at one of the boundaries.
Also, I think it makes sense to dictate a size for chunks, especially if you carefully choose a size that rounds to a number of pages, say 8, 16, or 32k. And in that case, partitioning on the number of bytes requires re-shuffling to get chunks of equal size.
I think the concern is a bit hypothetical / low priority for the moment. Whether chunking is based on size or number of rows is not a detail that should impact the design of the rest of the code, so we don't need to burden ourselves with it right now.
In CSV files I don't think chunking based on number of rows is always the best option. I've seen many files with 1000 columns or more in the wild. If your goal is eventually to transport the data on the wire (e.g. with gRPC), then you will end up wanting to split datasets into chunks that are around a certain size (8MB or so) anyway
One concern with number of rows chunking is that chunking becomes more expensive, and may become a bottleneck because it's single-threaded. It's probably possible to optimize it (e.g. using masks and perhaps SIMD ops), but it's not as easy as having a fast number of bytes chunker.
Note that chunking really has two different goals here:
- the primary goal is to allow efficient multi-threaded ingestion of CSV or JSON data
- the secondary goal (which is more of a nice-to-have) is to have a chunked output with chunks of a reasonable size
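For line-delimited JSON, a byte-based chunker only needs to snap each block boundary forward to the next newline so that no record straddles two chunks. A minimal sketch assuming newline-delimited input (not the actual Chunker):

```python
def chunk_boundaries(data: bytes, block_size: int):
    """Yield (start, end) offsets; every chunk ends on a newline or EOF."""
    start, n = 0, len(data)
    while start < n:
        end = min(start + block_size, n)
        if end < n:
            # snap forward to the next record boundary
            nl = data.find(b"\n", end - 1)
            end = n if nl == -1 else nl + 1
        yield start, end
        start = end

data = b'{"a":1}\n{"bb":22}\n{"c":3}\n'
print(list(chunk_boundaries(data, 10)))  # [(0, 18), (18, 26)]
```

Single-threaded boundary scanning like this is cheap for byte-based chunks; a row-count-based chunker would have to count every newline instead.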
@bkietz what do you think makes most sense on this project re: development workflow? It is OK for the API to be experimental for a little while, and to let things develop over a series of incremental patches to refactor and improve things. Once an initial patch is merged, others can add Python, GLib, and Ruby bindings.
superseded by #3592
…dBatches ( abandoning #3206 )

Adds [`json` sub project](https://github.com/apache/arrow/pull/3592/files#diff-2443c7d7b39b992ea580f0fbd387284a) with:

- BlockParser which parses Buffers of json formatted data into a StructArray with minimal conversion
  * true/false, and null fields are stored in BooleanArray and NullArray respectively
  * strings are stored as indices into a single StringArray
  * numbers are not converted; their string representations are stored alongside string values
  * nested fields are stored as ListArray or StructArray of their parsed (unconverted) children
- Three approaches to handling unexpected fields:
  1. Error on an unexpected field
  2. Ignore unexpected fields
  3. Infer the type of unexpected fields and add them to the schema
- [Convenience interface](https://github.com/apache/arrow/pull/3592/files#diff-d043a0249cc485b08d93767d2075bd83R124) for parsing a single chunk of json data into a RecordBatch with fully converted columns
- Chunker to process a stream of unchunked data for use by BlockParser (not currently used)

Author: Benjamin Kietzman <bengilgit@gmail.com>
Author: Wes McKinney <wesm+git@apache.org>

Closes #3592 from bkietz/ARROW-694-json-reader-WIP and squashes the following commits:

e42e5d7 <Wes McKinney> Add arrow_dependencies to arrow_flight.so dependencies to fix race condition with Flatbuffers
d67aff6 <Benjamin Kietzman> adding more comments to parser.cc
554b595 <Benjamin Kietzman> adding explanatory comments to json/parser.cc
0d0caa9 <Benjamin Kietzman> Add ARROW_PREDICT_* to conditions in parser.cc
d7d0a2e <Benjamin Kietzman> fix doc error in chunker.h
29c2312 <Wes McKinney> Disable arrow-json-chunker-test
76d0431 <Wes McKinney> cmake-format
83150ec <Wes McKinney> Restore BinaryBuilder::UnsafeAppend for const std::string&
05fd9d1 <Benjamin Kietzman> add json project back (merge error)
69541ea <Benjamin Kietzman> correct test util includes
be720c8 <Benjamin Kietzman> Use compound statements in if()
baac467 <Benjamin Kietzman> use const shared_ptr<T>& instead of move
6e86078 <Benjamin Kietzman> Move ParseOne to reader.cc
1bebda8 <Benjamin Kietzman> clean up Chunker's stream usage
0d6d920 <Benjamin Kietzman> disabling chunker test
c1a7f4b <Benjamin Kietzman> check status for generating list elements
292672a <Benjamin Kietzman> add inline tag
f92f850 <Benjamin Kietzman> add Status return to Generate
e334648 <Benjamin Kietzman> remove misplaced const
9679b6f <Benjamin Kietzman> fix format issue, use SFINAE to detect StringConverter default constructibility
aaaf9e7 <Benjamin Kietzman> remove bitfields
10e4f0d <Benjamin Kietzman> add missing virtual destructor
92ffc64 <Benjamin Kietzman> adding ParseOne interface for dead simple parsing
330615b <Benjamin Kietzman> adding first draft of parsing with type inference
69d7c5c <Benjamin Kietzman> Rewrite parser to defer conversion of strings and numbers
b9d5c3d <Benjamin Kietzman> adding Chunker implementation and tests
2677a57 <Benjamin Kietzman> use recommended loop style
e39a4a9 <Benjamin Kietzman> add (failing) test for '-0' consistency
0f3a3bc <Benjamin Kietzman> Added trivial parser benchmark and data generator
6472738 <Benjamin Kietzman> Refactored type inferrence
cb6a313 <Benjamin Kietzman> adding first draft of type inferrence to BlockParser
cc3698a <Benjamin Kietzman> refactoring Schema::GetFieldIndex to return int
17176f9 <Benjamin Kietzman> first sketch of JSON parser
Work in progress