
Add Raw JSON Reader (~2.5x faster) #3479

Merged · 27 commits · Jan 26, 2023

Conversation

@tustvold (Contributor) commented Jan 6, 2023

Which issue does this PR close?

Closes #3441

Rationale for this change

This adds a new JSON reader that reads directly into arrow arrays. This yields non-trivial performance improvements over the current serde_json::Value approach, while also, I think, making the logic for handling nested schemas easier to follow.

What changes are included in this PR?

Are there any user-facing changes?

The github-actions bot added the arrow (Changes to the arrow crate) label Jan 6, 2023
@tustvold (Contributor, Author) commented Jan 7, 2023

My basic plan is to do something similar to https://github.com/simdjson/simdjson/blob/master/doc/tape.md. There is a Rust implementation of simdjson (simd-json), but it is a bit heavyweight for our needs and a fairly substantial dependency, so I'd like to try doing something simpler 😅

Not sure when exactly I'll get around to doing this, perhaps in a week's time
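
For readers unfamiliar with the tape idea, here is a minimal sketch of what such an encoding might look like, loosely following simdjson's tape.md. These types are illustrative only; they are not the types this PR ultimately adds.

// Illustrative sketch only: a simdjson-style "tape" flattens a parsed JSON
// document into a vector of fixed-size elements, with string bytes stored
// out-of-band and referenced by offset. The PR's actual types may differ.
enum TapeElement {
    /// Start of an object; payload is the index of the matching EndObject
    StartObject(u32),
    /// End of an object; payload is the index of the matching StartObject
    EndObject(u32),
    /// Start of a list; payload is the index of the matching EndList
    StartList(u32),
    /// End of a list; payload is the index of the matching StartList
    EndList(u32),
    /// A string; payload indexes into the out-of-band string buffer
    String(u32),
    /// A number, stored as its string representation and parsed on demand
    Number(u32),
    True,
    False,
    Null,
}

struct Tape {
    elements: Vec<TapeElement>,
    /// All string and number bytes, concatenated
    strings: String,
    /// strings[string_offsets[i]..string_offsets[i + 1]] is the i-th string
    string_offsets: Vec<usize>,
}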

@Dandandan (Contributor) commented:

@tustvold I saw https://github.com/PSeitz/serde_json_borrow come by. We might take some inspiration from there, or potentially use that crate?

@tustvold (Contributor, Author) replied:

Thanks for the link. I'm actually partway through implementing a mechanism that decodes directly to arrow, which I think should give us the best possible performance. I just need to find some focus time to get it over the line.

@Dandandan (Contributor) replied:

Cool, happy to learn about the results / do a review!

@tustvold (Contributor, Author) commented:

It needs some more cleanup, and I haven't really spent any time trying to optimize it, but the new approach is showing a more respectable 2.5x performance improvement, albeit for a fair amount of additional code complexity...

large_bench_primitive (basic)
                        time:   [1.5475 ms 1.5488 ms 1.5501 ms]

large_bench_primitive (raw)
                        time:   [660.08 µs 660.37 µs 660.67 µs]

}
}

trait ArrayDecoder {
@tustvold (Contributor, Author) commented on the diff:

This approach is based on what we do for Parquet, and will generalize more naturally to arbitrarily nested data than the current implementation does.

@tustvold tustvold changed the title Add Raw JSON Reader Add Raw JSON Reader (~2.5x faster) Jan 18, 2023
/// Ok(std::iter::from_fn(move || next().transpose()))
/// }
/// ```
pub fn decode(&mut self, buf: &[u8]) -> Result<usize, ArrowError> {
@tustvold (Contributor, Author) commented on the diff:

I'm pretty chuffed with this interface; it should allow streaming decode from object storage without having to first delimit rows, which I think is pretty cool.

@alamb (Contributor) replied:

I think we should explicitly mention this use case in the doc comments somewhere, to help others discover/understand why they might want to use this interface -- probably in the main struct doc comments. I know you say "facilitating integration with arbitrary byte streams," but I am thinking of something very direct, like:

"This interface allows streaming decode. For example, it can decode a stream of bytes directly from object storage without having to first delimit rows"

@tustvold tustvold marked this pull request as ready for review January 24, 2023 20:33
@tustvold (Contributor, Author) commented Jan 24, 2023

I am happy that this is now ready for review. While it is a fair amount of complexity, I think the 2.5x performance improvement justifies it.

Furthermore, for async workloads the benefit will be even more pronounced, as it avoids having to perform a pre-parse to delimit newlines. Data can be streamed directly from object storage and fed into RawDecoder as it arrives, without needing to scan it for newlines or perform any additional data copying.

My plan is to get this integrated into DataFusion, fix the inevitable fallout, and then deprecate the old reader.

@tustvold (Contributor, Author) commented Jan 24, 2023

Writing some integration tests comparing RawReader against the existing reader turned up some divergence. The existing reader will:

  • Interpret non-string payloads as strings, e.g. {"string": false}\n{"string": "foo"} can be parsed into a StringArray
  • Convert scalars into lists, e.g. {"list": 2}\n{"list": [2, 1]} can be parsed into a ListArray

I'm not sure this is behaviour we wish to replicate; I at least found it very surprising.

Thoughts, @nevi-me @alamb?

Edit: In fact the list promotion logic is currently broken on master - #3601

@tustvold tustvold marked this pull request as draft January 24, 2023 21:30
@alamb (Contributor) left a review:

All in all, I love this change. 🏆 Thank you @tustvold.

I went through the code and tests carefully. I have some suggestions, but I don't think any are strictly required to merge this. The most important thing, I think, is some more tests, especially focused on error cases, as I mentioned inline.

Stepping back, I actually think this is a quite important feature for arrow-rs and will serve us well. I imagine we can write up a great post about "how we made JSON decoding 2.5x faster" -- aka "look at this shiny JSON reader we have, you should try it out, and while you are here...." 🚀

Question: Why raw for a name?

Maybe this is moot given the next question, but I didn't understand "raw". Some other possibly better names: "v2", "fast", "direct".

Question: Why keep both json readers?

So I wonder: why keep both the original JSON reader (https://docs.rs/arrow-json/31.0.0/arrow_json/reader/struct.Reader.html) and this one?

Given the compelling performance improvements, it seems like we should simply switch to the raw decoder and remove the existing one. This would:

  1. Improve user performance
  2. Reduce our maintenance burden
  3. Make the crate easier to use (no need to pick which decoder is desired)
  4. Ensure this reader passes all the same tests, etc.

If we are thinking about a migration strategy, perhaps it could be like:

  1. Release the raw reader in arrow next
  2. Switch the default json reader to the raw reader in arrow next+1 (but keep the old reader around for another release)
  3. Remove the old reader in arrow next+2

Suggestion for (even more) tests

It would be awesome to get some sort of larger test corpus for this decoder. I wonder if there is some way to reuse the test suite in simdjson or similar 🤔

/// A tape encoding inspired by [simdjson]
///
/// Uses `u32` for offsets to ensure `TapeElement` is 64-bits. A future
/// iteration may increase this to a custom `u56` type.
@alamb commented on the diff:

I think it would be valuable to inline as much of https://github.com/simdjson/simdjson/blob/master/doc/tape.md as is relevant to this implementation, to document the tape format (maybe copy in tape.md from simdjson, update it as appropriate, and keep a pointer back to the original).

Reasons:

  1. It would reduce people's questions about "what is different" (if this one is only "inspired")
  2. It would allow doc updates to this format along with code updates


{"a": "b", "object": {"nested": "hello", "foo": 23}, "b": {}, "c": {"foo": null }}

{"a": ["", "foo", ["bar", "c"]], "b": {"1": []}, "c": {"2": [1, 2, 3]} }
@alamb commented on the diff:

👍 I double checked this contains lists of objects 👍

/// A [`RecordBatchReader`] that reads newline-delimited JSON data with a known schema
/// directly into the corresponding arrow arrays
///
/// This makes it significantly faster than [`Reader`]
@alamb commented on the diff:

It would help here to comment / explain to readers how to pick which reader to use. See my main PR review comments.

/// Create a [`RawDecoder`] with the provided schema and batch size
pub fn try_new(schema: SchemaRef, batch_size: usize) -> Result<Self, ArrowError> {
let decoder = make_decoder(DataType::Struct(schema.fields.clone()), false)?;
// TODO: This should probably include nested fields
@alamb commented on the diff:

Is this still a TODO? It seems like this is just an optimization to get the initial capacity sizing correct, not a correctness issue (it might help to make that clear).

assert_eq!(c.values(), &[3, 4]);
}

#[test]
@alamb commented on the diff:

Tests that I think would be good, and that I didn't see, are for error conditions (see the hedged sketch after this list):

  1. Send in non-UTF-8 data in the JSON
  2. Send in partial/truncated JSON (both for the first object and for subsequent objects)
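
A hedged sketch of the second kind of test. The RawReaderBuilder name follows this PR's diff, but its exact path and the error surface for truncated input are assumptions:

// Assuming arrow_json::RawReaderBuilder from this PR; names may differ
#[test]
fn test_truncated_object_errors() {
    use std::sync::Arc;
    use arrow_schema::{DataType, Field, Schema};

    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, true)]));
    // The second row is cut off in the middle of a value
    let truncated = "{\"a\": 1}\n{\"a\": ";
    let reader = RawReaderBuilder::new(schema)
        .build(std::io::Cursor::new(truncated.as_bytes()))
        .unwrap();
    // collecting should surface an error for the incomplete trailing object
    let result: Result<Vec<_>, _> = reader.collect();
    assert!(result.is_err());
}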

}

trait ArrayDecoder: Send {
fn decode(&mut self, tape: &Tape<'_>, pos: &[u32]) -> Result<ArrayData, ArrowError>;
@alamb commented on the diff:

I think it would help to document what the expected values in pos are (indexes into the tape of starting elements?)
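
For illustration, a sketch of the kind of doc comment being asked for. The described semantics are inferred from the signature and this thread, not taken from the PR:

trait ArrayDecoder: Send {
    /// Decode the values identified by `pos` from `tape` into [`ArrayData`]
    ///
    /// Each element of `pos` is the tape index of the starting element of one
    /// row's value for this field; the decoder walks the tape from that index
    /// to reconstruct the value (assumed semantics)
    fn decode(&mut self, tape: &Tape<'_>, pos: &[u32]) -> Result<ArrayData, ArrowError>;
}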

}

if self.offsets.len() >= u32::MAX as usize {
return Err(ArrowError::JsonError(format!("Encountered more than {} bytes of string data, consider using a smaller batch size", u32::MAX)));
@alamb commented on the diff:

This would be a good condition to cover with a test, if possible 🤔

]
)
}
}
@alamb commented on the diff:

I think this should also have some tests for error cases:

  1. UTF-8 encoded data
  2. Invalid/corrupt UTF-8 data
  3. Truncated data (like a string that ends in the middle of a quote)

@tustvold (Contributor, Author) commented Jan 25, 2023

> Why raw for a name?

It's hopefully temporary, to allow for a grace period where both readers are supported.

> Why keep both json readers?
> If we are thinking about a migration strategy, perhaps it could be like:

This is exactly what I intend to do 👍 As the current one exposes serde_json::Value, removing it will be a breaking change, so I want to do it slowly, but I agree that maintaining two readers is not a good idea long-term.

@alamb (Contributor) commented Jan 25, 2023

> This is exactly what I intend to do 👍 As the current one exposes serde_json::Value, removing it will be a breaking change, so I want to do it slowly, but I agree that maintaining two readers is not a good idea long-term.

It is probably good to file a ticket with this overall plan to make it clearer -- I can do so if you would like

}
TapeElement::Number(idx) => {
let s = tape.get_string(idx);
let value = lexical_core::parse::<f64>(s.as_bytes())
A contributor asked on the diff:

Is this faster than std? AFAIK the std parse should be about as fast as lexical_core by now.

@tustvold (Contributor, Author) replied:

It was faster when I benchmarked it, yes
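
For reference, the two alternatives being compared; a minimal sketch, not code from the PR:

// The tape stores numbers as strings; both approaches parse them to f64.
// Which is faster depends on the lexical_core and rustc versions in use.
fn parse_lexical(s: &str) -> Result<f64, lexical_core::Error> {
    lexical_core::parse::<f64>(s.as_bytes())
}

fn parse_std(s: &str) -> Result<f64, std::num::ParseFloatError> {
    s.parse::<f64>()
}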

@tustvold tustvold merged commit 0f1a92a into apache:master Jan 26, 2023
@ursabot commented Jan 26, 2023

Benchmark runs are scheduled for baseline = 902a17d and contender = 0f1a92a. 0f1a92a is a master commit associated with this PR. All Conbench runs were skipped: benchmarking of arrow-rs commits is not supported on the available runners (ec2-t3-xlarge-us-east-2, test-mac-arm, ursa-i9-9960x, ursa-thinkcentre-m75q).

Labels: arrow (Changes to the arrow crate)
Linked issue: Improve Performance of JSON Reader
4 participants