Faster Serde Integration (~80% faster) #4861

tustvold · 2023-09-26T13:36:16Z

Which issue does this PR close?

Closes #.

Rationale for this change

Encoding numerics directly in the tape drastically improves the performance of the serde integration.

small_i32               time:   [5.3992 µs 5.4006 µs 5.4020 µs]
                        change: [-70.553% -70.532% -70.511%] (p = 0.00 < 0.05)
                        Performance has improved.

large_i32               time:   [5.2606 µs 5.2618 µs 5.2631 µs]
                        change: [-76.768% -76.747% -76.727%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

small_i64               time:   [5.2937 µs 5.2960 µs 5.2986 µs]
                        change: [-73.032% -73.002% -72.974%] (p = 0.00 < 0.05)
                        Performance has improved.

medium_i64              time:   [5.3314 µs 5.3372 µs 5.3417 µs]
                        change: [-77.574% -77.553% -77.533%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild

large_i64               time:   [5.6473 µs 5.6503 µs 5.6532 µs]
                        change: [-81.152% -81.103% -81.056%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 22 outliers among 100 measurements (22.00%)
  22 (22.00%) low severe

small_f32               time:   [3.6082 µs 3.6101 µs 3.6121 µs]
                        change: [-93.595% -93.591% -93.588%] (p = 0.00 < 0.05)
                        Performance has improved.

large_f32               time:   [3.5233 µs 3.5245 µs 3.5256 µs]
                        change: [-94.058% -94.055% -94.053%] (p = 0.00 < 0.05)
                        Performance has improved.

It additionally opens the door to eager parsing in the future, which may yield performance improvements for regular JSON decoding.

I have confirmed that this does not regress the performance of the JSON decoder.
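As a rough illustration of what "encoding numerics directly in the tape" could look like, the sketch below stores an integer in fixed-width tape slots instead of formatting it to a string and parsing it back. The `NumElem` enum, its variant names, and the `push_i64` / `read_i64` helpers are hypothetical and exist only for this example; the only detail taken from the diff is that a tape element can hold "the high bits of a i64". This is not the PR's actual implementation.

```rust
/// Hypothetical, simplified tape element used only in this sketch.
/// Each element carries a 32-bit payload.
#[derive(Debug, Clone, Copy, PartialEq)]
enum NumElem {
    /// High 32 bits of an i64; must be followed by `I32` holding the low bits.
    I64High(i32),
    /// A standalone 32-bit integer, or the low 32 bits of a preceding `I64High`.
    I32(i32),
}

/// Append `v` to the tape, using one slot if it fits in an i32, else two.
fn push_i64(tape: &mut Vec<NumElem>, v: i64) {
    match i32::try_from(v) {
        Ok(small) => tape.push(NumElem::I32(small)),
        Err(_) => {
            // Arithmetic shift keeps the sign in the high half.
            tape.push(NumElem::I64High((v >> 32) as i32));
            // Truncating cast keeps the low 32 bits.
            tape.push(NumElem::I32(v as i32));
        }
    }
}

/// Read the integer starting at `idx`, returning it with the next index.
fn read_i64(tape: &[NumElem], idx: usize) -> Option<(i64, usize)> {
    match *tape.get(idx)? {
        NumElem::I32(v) => Some((v as i64, idx + 1)),
        NumElem::I64High(high) => match *tape.get(idx + 1)? {
            NumElem::I32(low) => {
                // Zero-extend the low half so its sign bit does not leak upward.
                let v = ((high as i64) << 32) | (low as u32 as i64);
                Some((v, idx + 2))
            }
            NumElem::I64High(_) => None,
        },
    }
}
```

Under these assumptions, no string formatting or parsing happens on either the write or the read path, which is the kind of saving the benchmarks above measure.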

What changes are included in this PR?

Are there any user-facing changes?

tustvold · 2023-09-26T13:37:04Z

arrow-json/src/reader/serializer.rs

+        match i64::try_from(v) {
+            Ok(v) => self.serialize_i64(v),
+            Err(_) => {
+                let mut buffer = [0_u8; u64::FORMATTED_SIZE];


The additional complexity needed to support round-tripping values where u64::MAX >= v > i64::MAX did not seem worth it, so we just fall back to serializing to a string.

Would you imagine doing it via i128 or something?

Possibly, or as a u64 variant. Given JSON only reliably roundtrips f64, I don't think this is a very common use-case worth optimising for
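The fallback discussed in this thread can be sketched with the standard library alone. The `Encoded` enum and `encode_u64` function below are hypothetical names for illustration; the actual diff formats into a fixed-size byte buffer (`u64::FORMATTED_SIZE`) rather than allocating a `String`, so treat this as a simplified model of the control flow only.

```rust
/// Hypothetical output representation used only in this sketch.
#[derive(Debug, PartialEq)]
enum Encoded {
    /// The value fit in an i64 and is stored numerically.
    Int(i64),
    /// The value exceeded i64::MAX and is stored as decimal text.
    Text(String),
}

/// Encode a u64, preferring the numeric path and falling back to a string.
fn encode_u64(v: u64) -> Encoded {
    match i64::try_from(v) {
        Ok(v) => Encoded::Int(v),
        // Only values above i64::MAX reach this branch.
        Err(_) => Encoded::Text(v.to_string()),
    }
}
```

The trade-off is that the rare out-of-range values pay the string cost, while every value that fits in an i64 takes the fast numeric path.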

alamb

Looks good to me -- thank you @tustvold

alamb · 2023-09-26T14:42:12Z

arrow-json/src/reader/serializer.rs

-/// Formatting to a string only to parse it back again is rather wasteful,
-/// it may be possible to tweak the tape representation to avoid this
-///
-/// Need to use macro as const generic expressions are unstable


alamb · 2023-09-26T14:42:55Z

arrow-json/src/reader/tape.rs

@@ -54,6 +55,25 @@ pub enum TapeElement {
    ///
    /// Contains the offset into the [`Tape`] string data
    Number(u32),
+
+    /// The high bits of a i64


alamb · 2023-09-26T14:43:55Z

arrow-json/src/reader/tape.rs

            TapeElement::StartList(end_idx) => Ok(end_idx + 1),
            TapeElement::StartObject(end_idx) => Ok(end_idx + 1),
-            _ => Err(self.error(cur_idx, expected)),
+            TapeElement::EndObject(_) | TapeElement::EndList(_) => {


+1 for removing the catch all
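A small sketch of why dropping the catch-all arm helps: with every variant listed explicitly, adding a new variant to the enum later becomes a compile error at this `match` instead of silently falling into the error branch. The `TapeItem` enum and `next_index` function are simplified, hypothetical stand-ins and do not reproduce the crate's real types.

```rust
/// Hypothetical, reduced version of a tape element for this sketch.
#[derive(Debug, Clone, Copy)]
enum TapeItem {
    StartList(u32),
    EndList(u32),
    StartObject(u32),
    EndObject(u32),
    Null,
    Number(u32),
}

/// Return the index of the element following the value at `cur_idx`.
fn next_index(cur_idx: u32, item: TapeItem) -> Result<u32, String> {
    match item {
        // Scalars occupy a single slot in this simplified model.
        TapeItem::Null | TapeItem::Number(_) => Ok(cur_idx + 1),
        // Containers record where they end, so skip past that position.
        TapeItem::StartList(end_idx) | TapeItem::StartObject(end_idx) => Ok(end_idx + 1),
        // Listed explicitly rather than with `_`, so a new variant
        // forces this function to be revisited.
        TapeItem::EndList(_) | TapeItem::EndObject(_) => {
            Err(format!("unexpected {item:?} at index {cur_idx}"))
        }
    }
}
```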

* Store decoded numerics in JSON tape
* Add arrow-json serde benchmarks
* Fix timestamp serialize
* Clippy

tustvold added 3 commits September 26, 2023 14:14

Store decoded numerics in JSON tape

47cfd97

Add arrow-json serde benchmarks

1fd22c5

Fix timestamp serialize

535a719

github-actions bot added the "arrow" label (Changes to the arrow crate) Sep 26, 2023

tustvold commented Sep 26, 2023

Clippy

48f2bfd

alamb approved these changes Sep 26, 2023

tustvold merged commit fbd9008 into apache:master Sep 26, 2023
23 checks passed

tustvold mentioned this pull request Nov 5, 2023

Regression when serializing large json numbers #5038

Closed

ryanaston pushed a commit to segmentio/arrow-rs that referenced this pull request Nov 6, 2023

Faster Serde Integration (~80% faster) (apache#4861)

19d52ba

* Store decoded numerics in JSON tape
* Add arrow-json serde benchmarks
* Fix timestamp serialize
* Clippy

jonmmease mentioned this pull request Nov 18, 2023

coerce_primitive not honored when decoding from serde object #5095

Closed

tustvold mentioned this pull request Feb 1, 2024

Add Avro Support #4886

Open
