Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: encode FixedSizeBinary in JSON as hex string #5622

Merged
merged 8 commits into from
Apr 12, 2024

Conversation

hiltontj
Copy link
Contributor

Which issue does this PR close?

Closes #5620.

Rationale for this change

See #5620.

What changes are included in this PR?

Adds encoding support to the JSON writer for the FixedSizeBinary data type.

A simple test was added as well. I can add more extensive testing if needed.

Are there any user-facing changes?

Potentially some documentation to state that FixedSizeBinary is encoded as hex strings.

Adds encoding support to the JSON writer for the FixedSizeBinary DataType

A test was added as well
@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 10, 2024
@@ -56,6 +56,7 @@ tokio = { version = "1.27", default-features = false, features = ["io-util"] }
bytes = "1.4"
criterion = { version = "0.5", default-features = false }
rand = { version = "0.8", default-features = false, features = ["std", "std_rng"] }
hex = "0.4.3"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it is not required. The reason I added it was so that in the test I added, it was apparent that the bytes in the column resulted in the correct HEX string. I will see if I can write the test in a way that gets by without it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the test so that it no longer relies on the hex crate, and removed the dep. I think it would be nice to have a test that fully round-trips data, first by decoding from HEX to the FixedSizeBinary, then encoding back. But, I am not sure if we want to support this behaviour on the JSON reader, and this issue is only about the write side.

@hiltontj
Copy link
Contributor Author

I realize now, that the test I wrote is complete bogus (not valid JSON). So I am reworking that.

The fixed size binary values were not being encoded with surrounding
double quotes. This fixes that, and updates the added test to actually
parse the written JSON as JSON, using serde_json, and make assertions
against that.
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hiltontj ! This is looking close -- I am not sure it handles nulls properly yet, but I think that should be straightforward to add

arrow-json/src/writer.rs Show resolved Hide resolved

impl Encoder for FixedSizeBinaryEncoder {
fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
let v = self.0.value(idx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to check is_valid before serializing (I think you'll see the need when testing for null)

Have the FixedSizeBinaryEncoder for the JSON writer handle explicit null
values, based on the Writer's EncoderOptions.
@hiltontj
Copy link
Contributor Author

@alamb - I think I addressed your comments. With 35371d3, explicit nulls were actually being handled correctly in the test, but I believe this was because of the wrapping list/map encoders. So, I extended the explicit null support down to the FixedSizeBinaryEncoder in 9768a43 directly, to be sure.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me -- thank you @hiltontj for adding the new tests

arrow-json/src/writer.rs Show resolved Hide resolved
@alamb
Copy link
Contributor

alamb commented Apr 10, 2024

I think this is good to go -- any concerns @tustvold ?

Changed the FixedSizeBinaryEncoder for the JSON writer to use a borrow
of the FixedSizeBinaryArray being encoded, to follow other Encoder
implementations, and to remove the use of clone.
@hiltontj
Copy link
Contributor Author

@alamb - I changed the FixedSizeBinaryEncoder to use a borrow of the FixedSizeBinaryArray, instead of cloning it.

BooleanEncoder and StringEncoder were changed to use borrows of their
respective Array types, to avoid cloning.
@hiltontj
Copy link
Contributor Author

I also noticed that the StringEncoder and BooleanEncoder were needlessly cloning, so I updated them to use borrows, and removed the .clone()s. Hope you don't mind.

@hiltontj hiltontj requested a review from alamb April 12, 2024 00:51

impl<'a> Encoder for FixedSizeBinaryEncoder<'a> {
fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
if self.array.is_valid(idx) {
Copy link
Contributor

@tustvold tustvold Apr 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC nullability is handled by the parent and so this shouldn't be necessary for a non nested type

Copy link
Contributor

@alamb alamb Apr 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked by making the following change locally and the tests all still pass. Thus I agree the valid check is not needed

diff --git a/arrow-json/src/writer/encoder.rs b/arrow-json/src/writer/encoder.rs
index 27fa5747288..470d2fadd06 100644
--- a/arrow-json/src/writer/encoder.rs
+++ b/arrow-json/src/writer/encoder.rs
@@ -101,7 +101,7 @@ fn make_encoder_impl<'a>(

         DataType::FixedSizeBinary(_) => {
             let array = array.as_any().downcast_ref::<FixedSizeBinaryArray>().unwrap();
-            (Box::new(FixedSizeBinaryEncoder::new(array, options)) as _, array.nulls().cloned())
+            (Box::new(FixedSizeBinaryEncoder::new(array)) as _, array.nulls().cloned())
         }

         DataType::Struct(fields) => {
@@ -451,14 +451,12 @@ impl<'a> Encoder for MapEncoder<'a> {

 struct FixedSizeBinaryEncoder<'a> {
     array: &'a FixedSizeBinaryArray,
-    explicit_nulls: bool,
 }

 impl<'a> FixedSizeBinaryEncoder<'a> {
-    fn new(array: &'a FixedSizeBinaryArray, options: &EncoderOptions) -> Self {
+    fn new(array: &'a FixedSizeBinaryArray) -> Self {
         Self {
             array,
-            explicit_nulls: options.explicit_nulls,
         }
     }
 }
@@ -473,8 +471,6 @@ impl<'a> Encoder for FixedSizeBinaryEncoder<'a> {
                 write!(out, "{byte:02x}").unwrap();
             }
             out.push(b'"');
-        } else if self.explicit_nulls {
-            out.extend_from_slice(b"null");
         }
     }
 }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, as per #5622 (comment), when I first added the test, it was passing without adding the explicit_nulls option to the FixedSizeBinaryEncoder, but I added it anyways. I am totally fine with taking it out.

@tustvold
Copy link
Contributor

were needlessly cloning

A couple of reference counts are irrelevant compared to encoding overheads, but I suppose it can't hurt to avoid them

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hiltontj -- I think this PR could be merged as is.

@tustvold has a good suggestion for simplificiation https://github.com/apache/arrow-rs/pull/5622/files#r1562191540 but I also think we could do that as a follow on PR if you prefer

Just let me know what you would prefer


impl<'a> Encoder for FixedSizeBinaryEncoder<'a> {
fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
if self.array.is_valid(idx) {
Copy link
Contributor

@alamb alamb Apr 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked by making the following change locally and the tests all still pass. Thus I agree the valid check is not needed

diff --git a/arrow-json/src/writer/encoder.rs b/arrow-json/src/writer/encoder.rs
index 27fa5747288..470d2fadd06 100644
--- a/arrow-json/src/writer/encoder.rs
+++ b/arrow-json/src/writer/encoder.rs
@@ -101,7 +101,7 @@ fn make_encoder_impl<'a>(

         DataType::FixedSizeBinary(_) => {
             let array = array.as_any().downcast_ref::<FixedSizeBinaryArray>().unwrap();
-            (Box::new(FixedSizeBinaryEncoder::new(array, options)) as _, array.nulls().cloned())
+            (Box::new(FixedSizeBinaryEncoder::new(array)) as _, array.nulls().cloned())
         }

         DataType::Struct(fields) => {
@@ -451,14 +451,12 @@ impl<'a> Encoder for MapEncoder<'a> {

 struct FixedSizeBinaryEncoder<'a> {
     array: &'a FixedSizeBinaryArray,
-    explicit_nulls: bool,
 }

 impl<'a> FixedSizeBinaryEncoder<'a> {
-    fn new(array: &'a FixedSizeBinaryArray, options: &EncoderOptions) -> Self {
+    fn new(array: &'a FixedSizeBinaryArray) -> Self {
         Self {
             array,
-            explicit_nulls: options.explicit_nulls,
         }
     }
 }
@@ -473,8 +471,6 @@ impl<'a> Encoder for FixedSizeBinaryEncoder<'a> {
                 write!(out, "{byte:02x}").unwrap();
             }
             out.push(b'"');
-        } else if self.explicit_nulls {
-            out.extend_from_slice(b"null");
         }
     }
 }

@hiltontj
Copy link
Contributor Author

@alamb - I will update as suggested on this PR and re-request review, no problem!

The FixedSizeBinaryEncoder does not need to handle nulls, as that will
be handled by a parent encoder, i.e., list/map.
@hiltontj hiltontj requested a review from alamb April 12, 2024 12:46
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hiltontj for sticking with this. It looks very nice at the end 🙏

@alamb alamb merged commit 6450527 into apache:master Apr 12, 2024
22 checks passed
@hiltontj hiltontj deleted the json-fixed-size-binary branch April 12, 2024 20:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Serialize FixedSizeBinary as HEX with JSON Writer
3 participants