
Add Raw JSON Reader (~2.5x faster) #3479

Merged · 27 commits · Jan 26, 2023

Conversation

@tustvold (Contributor) commented Jan 6, 2023

Which issue does this PR close?

Closes #3441

Rationale for this change

This adds a new JSON reader that reads directly into arrow arrays. This yields non-trivial performance improvements over the current serde_json::Value approach, while also, I think, making the logic for handling nested schemas easier to follow.

What changes are included in this PR?

Are there any user-facing changes?

The github-actions bot added the arrow (Changes to the arrow crate) label Jan 6, 2023
@tustvold (Contributor, Author) commented Jan 7, 2023

My basic plan is to do something similar to https://github.com/simdjson/simdjson/blob/master/doc/tape.md. There is a Rust implementation of simdjson (simd-json), but it is a bit heavyweight for our needs and a fairly substantial dependency, so I'd like to try doing something simpler 😅

Not sure when exactly I'll get around to doing this, perhaps in a week's time
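
For readers unfamiliar with the tape idea, here is a minimal sketch of what such an encoding might look like, loosely following simdjson's tape.md. These types are illustrative only; they are not the types this PR ultimately adds.

// Illustrative sketch only: a simdjson-style "tape" flattens a parsed JSON
// document into a vector of fixed-size elements, with string bytes stored
// out-of-band and referenced by offset. The PR's actual types may differ.
enum TapeElement {
    /// Start of an object; payload is the index of the matching EndObject
    StartObject(u32),
    /// End of an object; payload is the index of the matching StartObject
    EndObject(u32),
    /// Start of a list; payload is the index of the matching EndList
    StartList(u32),
    /// End of a list; payload is the index of the matching StartList
    EndList(u32),
    /// A string; payload indexes into the out-of-band string buffer
    String(u32),
    /// A number, stored as its string representation and parsed on demand
    Number(u32),
    True,
    False,
    Null,
}

struct Tape {
    elements: Vec<TapeElement>,
    /// All string and number bytes, concatenated
    strings: String,
    /// strings[string_offsets[i]..string_offsets[i + 1]] is the i-th string
    string_offsets: Vec<usize>,
}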

@Dandandan (Contributor) commented:

@tustvold I saw https://github.com/PSeitz/serde_json_borrow come by. We might take some inspiration from there, or potentially use that crate?

@tustvold (Contributor, Author) replied:

Thanks for the link. I'm actually partway through implementing a mechanism that decodes directly to arrow, which I think should give us the best possible performance. I just need to find some focus time to get it over the line.

@Dandandan (Contributor) replied:

Cool, happy to learn about the results / do a review!

@tustvold (Contributor, Author) commented:

It needs some more cleanup, and I haven't really spent any time trying to optimize it, but the new approach is showing a more respectable 2.5x performance improvement, albeit for a fair amount of additional code complexity...

large_bench_primitive (basic)
                        time:   [1.5475 ms 1.5488 ms 1.5501 ms]

large_bench_primitive (raw)
                        time:   [660.08 µs 660.37 µs 660.67 µs]

}
}

trait ArrayDecoder {
@tustvold (Contributor, Author) commented on the diff:

This approach is based on what we do for Parquet, and will generalize more naturally to arbitrarily nested data than the current implementation does.

@tustvold tustvold changed the title Add Raw JSON Reader Add Raw JSON Reader (~2.5x faster) Jan 18, 2023
/// Ok(std::iter::from_fn(move || next().transpose()))
/// }
/// ```
pub fn decode(&mut self, buf: &[u8]) -> Result<usize, ArrowError> {
@tustvold (Contributor, Author) commented on the diff:

I'm pretty chuffed with this interface; it should allow streaming decode from object storage without having to first delimit rows, which I think is pretty cool.

@alamb (Contributor) replied:

I think we should explicitly mention this use case in the doc comments somewhere, to help others discover/understand why they might want to use this interface -- probably in the main struct doc comments. I know you say "facilitating integration with arbitrary byte streams," but I am thinking of something very direct, like:

"This interface allows streaming decode. For example, it can decode a stream of bytes directly from object storage without having to first delimit rows"

@tustvold tustvold marked this pull request as ready for review January 24, 2023 20:33
@tustvold (Contributor, Author) commented Jan 24, 2023

I am happy that this is now ready for review. While it is a fair amount of complexity, I think the 2.5x performance improvement justifies it.

Furthermore, for async workloads the benefit will be even more pronounced, as it avoids having to perform a pre-parse to delimit newlines. Data can be streamed directly from object storage and fed into RawDecoder as it arrives, without needing to scan it for newlines or perform any additional data copying.

My plan is to get this integrated into DataFusion, fix the inevitable fallout, and then deprecate the old reader.

@tustvold (Contributor, Author) commented Jan 24, 2023

Writing some integration tests comparing RawReader against the existing reader turned up some divergence. The existing reader will:

  • Interpret non-string payloads as strings, e.g. {"string": false}\n{"string": "foo"} can be parsed into a StringArray
  • Convert scalars into lists, e.g. {"list": 2}\n{"list": [2, 1]} can be parsed into a ListArray

I'm not sure this is behaviour we wish to replicate; I at least found it very surprising.

Thoughts, @nevi-me @alamb?

Edit: In fact the list promotion logic is currently broken on master - #3601

@tustvold tustvold marked this pull request as draft January 24, 2023 21:30
@alamb (Contributor) left a review:

All in all, I love this change. 🏆 Thank you @tustvold.

I went through the code and tests carefully. I have some suggestions, but I don't think any are strictly required to merge this. The most important thing, I think, is some more tests, especially focused on error cases, as I mentioned inline.

Stepping back, I actually think this is a quite important feature for arrow-rs and will serve us well. I imagine we can write up a great post about "how we made JSON decoding 2.5x faster" -- aka "look at this shiny JSON reader we have, you should try it out, and while you are here...." 🚀

Question: Why raw for a name?

Maybe this is moot given the next question, but I didn't understand "raw". Some other possibly better names: "v2", "fast", "direct".

Question: Why keep both json readers?

So I wonder: why keep both the original JSON reader (https://docs.rs/arrow-json/31.0.0/arrow_json/reader/struct.Reader.html) and this one?

Given the compelling performance improvements, it seems like we should simply switch to the raw decoder and remove the existing one. This would:

  1. Improve user performance
  2. Reduce our maintenance burden
  3. Make the crate easier to use (no need to pick which decoder is desired)
  4. Ensure this reader passes all the same tests, etc.

If we are thinking about a migration strategy, perhaps it could be like:

  1. Release the raw reader in arrow next
  2. Switch the default json reader to the raw reader in arrow next+1 (but keep the old reader around for another release)
  3. Remove the old reader in arrow next+2

Suggestion for (even more) tests

It would be awesome to get some sort of larger test corpus for this decoder. I wonder if there is some way to reuse the test suite in simdjson or similar 🤔

/// A tape encoding inspired by [simdjson]
///
/// Uses `u32` for offsets to ensure `TapeElement` is 64-bits. A future
/// iteration may increase this to a custom `u56` type.
@alamb commented on the diff:

I think it would be valuable to inline as much of https://github.com/simdjson/simdjson/blob/master/doc/tape.md as is relevant to this implementation, to document the tape format (maybe copy in tape.md from simdjson, update it as appropriate, and keep a pointer back to the original).

Reasons:

  1. It would reduce people's questions about "what is different" (if this one is only "inspired")
  2. It would allow doc updates to this format along with code updates


{"a": "b", "object": {"nested": "hello", "foo": 23}, "b": {}, "c": {"foo": null }}

{"a": ["", "foo", ["bar", "c"]], "b": {"1": []}, "c": {"2": [1, 2, 3]} }
@alamb commented on the diff:

👍 I double checked this contains lists of objects 👍

/// A [`RecordBatchReader`] that reads newline-delimited JSON data with a known schema
/// directly into the corresponding arrow arrays
///
/// This makes it significantly faster than [`Reader`]
@alamb commented on the diff:

It would help here to comment / explain to readers how to pick which reader to use. See my main PR review comments.

/// Create a [`RawDecoder`] with the provided schema and batch size
pub fn try_new(schema: SchemaRef, batch_size: usize) -> Result<Self, ArrowError> {
let decoder = make_decoder(DataType::Struct(schema.fields.clone()), false)?;
// TODO: This should probably include nested fields
@alamb commented on the diff:

Is this still a TODO? It seems like this is just an optimization to get the initial capacity sizing correct, not a correctness issue (it might help to make that clear).

assert_eq!(c.values(), &[3, 4]);
}

#[test]
@alamb commented on the diff:

Tests that I think would be good, and that I didn't see, are for error conditions (see the hedged sketch after this list):

  1. Send in non-UTF-8 data in the JSON
  2. Send in partial/truncated JSON (both for the first object and for subsequent objects)
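
A hedged sketch of the second kind of test. The RawReaderBuilder name follows this PR's diff, but its exact path and the error surface for truncated input are assumptions:

// Assuming arrow_json::RawReaderBuilder from this PR; names may differ
#[test]
fn test_truncated_object_errors() {
    use std::sync::Arc;
    use arrow_schema::{DataType, Field, Schema};

    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, true)]));
    // The second row is cut off in the middle of a value
    let truncated = "{\"a\": 1}\n{\"a\": ";
    let reader = RawReaderBuilder::new(schema)
        .build(std::io::Cursor::new(truncated.as_bytes()))
        .unwrap();
    // collecting should surface an error for the incomplete trailing object
    let result: Result<Vec<_>, _> = reader.collect();
    assert!(result.is_err());
}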

}

trait ArrayDecoder: Send {
fn decode(&mut self, tape: &Tape<'_>, pos: &[u32]) -> Result<ArrayData, ArrowError>;
@alamb commented on the diff:

I think it would help to document what the expected values in pos are (indexes into the tape of starting elements?)
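
For illustration, a sketch of the kind of doc comment being asked for. The described semantics are inferred from the signature and this thread, not taken from the PR:

trait ArrayDecoder: Send {
    /// Decode the values identified by `pos` from `tape` into [`ArrayData`]
    ///
    /// Each element of `pos` is the tape index of the starting element of one
    /// row's value for this field; the decoder walks the tape from that index
    /// to reconstruct the value (assumed semantics)
    fn decode(&mut self, tape: &Tape<'_>, pos: &[u32]) -> Result<ArrayData, ArrowError>;
}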

}

if self.offsets.len() >= u32::MAX as usize {
return Err(ArrowError::JsonError(format!("Encountered more than {} bytes of string data, consider using a smaller batch size", u32::MAX)));
@alamb commented on the diff:

This would be a good condition to cover with a test, if possible 🤔

]
)
}
}
@alamb commented on the diff:

I think this should also have some tests for error cases:

  1. UTF-8 encoded data
  2. Invalid/corrupt UTF-8 data
  3. Truncated data (like a string that ends in the middle of a quote)

@tustvold (Contributor, Author) commented Jan 25, 2023

> Why raw for a name?

It's hopefully temporary, to allow for a grace period where both readers are supported.

> Why keep both json readers?
> If we are thinking about a migration strategy, perhaps it could be like:

This is exactly what I intend to do 👍 As the current one exposes serde_json::Value, removing it will be a breaking change, so I want to do it slowly, but I agree that maintaining two readers is not a good idea long-term.

@alamb (Contributor) commented Jan 25, 2023

> This is exactly what I intend to do 👍 As the current one exposes serde_json::Value, removing it will be a breaking change, so I want to do it slowly, but I agree that maintaining two readers is not a good idea long-term.

It is probably good to file a ticket with this overall plan to make it clearer -- I can do so if you would like

}
TapeElement::Number(idx) => {
let s = tape.get_string(idx);
let value = lexical_core::parse::<f64>(s.as_bytes())
A contributor asked on the diff:

Is this faster than std? AFAIK the std parse should be about as fast as lexical_core by now.

@tustvold (Contributor, Author) replied:

It was faster when I benchmarked it, yes
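
For reference, the two alternatives being compared; a minimal sketch, not code from the PR:

// The tape stores numbers as strings; both approaches parse them to f64.
// Which is faster depends on the lexical_core and rustc versions in use.
fn parse_lexical(s: &str) -> Result<f64, lexical_core::Error> {
    lexical_core::parse::<f64>(s.as_bytes())
}

fn parse_std(s: &str) -> Result<f64, std::num::ParseFloatError> {
    s.parse::<f64>()
}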

@tustvold tustvold merged commit 0f1a92a into apache:master Jan 26, 2023
@ursabot commented Jan 26, 2023

Benchmark runs are scheduled for baseline = 902a17d and contender = 0f1a92a. 0f1a92a is a master commit associated with this PR. All Conbench runs were skipped: benchmarking of arrow-rs commits is not supported on the available runners (ec2-t3-xlarge-us-east-2, test-mac-arm, ursa-i9-9960x, ursa-thinkcentre-m75q).

Labels: arrow (Changes to the arrow crate)
Linked issue: Improve Performance of JSON Reader
4 participants