-
Notifications
You must be signed in to change notification settings - Fork 786
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: encode FixedSizeBinary in JSON as hex string #5622
Conversation
Adds encoding support to the JSON writer for the FixedSizeBinary DataType A test was added as well
arrow-json/Cargo.toml
Outdated
@@ -56,6 +56,7 @@ tokio = { version = "1.27", default-features = false, features = ["io-util"] } | |||
bytes = "1.4" | |||
criterion = { version = "0.5", default-features = false } | |||
rand = { version = "0.8", default-features = false, features = ["std", "std_rng"] } | |||
hex = "0.4.3" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it is not required. The reason I added it was so that in the test I added, it was apparent that the bytes in the column resulted in the correct HEX string. I will see if I can write the test in a way that gets by without it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the test so that it no longer relies on the hex
crate, and removed the dep. I think it would be nice to have a test that fully round-trips data, first by decoding from HEX to the FixedSizeBinary
, then encoding back. But, I am not sure if we want to support this behaviour on the JSON reader, and this issue is only about the write side.
I realize now, that the test I wrote is complete bogus (not valid JSON). So I am reworking that. |
The fixed size binary values were not being encoded with surrounding double quotes. This fixes that, and updates the added test to actually parse the written JSON as JSON, using serde_json, and make assertions against that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @hiltontj ! This is looking close -- I am not sure it handles nulls properly yet, but I think that should be straightforward to add
arrow-json/src/writer/encoder.rs
Outdated
|
||
impl Encoder for FixedSizeBinaryEncoder { | ||
fn encode(&mut self, idx: usize, out: &mut Vec<u8>) { | ||
let v = self.0.value(idx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to check is_valid
before serializing (I think you'll see the need when testing for null)
Have the FixedSizeBinaryEncoder for the JSON writer handle explicit null values, based on the Writer's EncoderOptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me -- thank you @hiltontj for adding the new tests
I think this is good to go -- any concerns @tustvold ? |
Changed the FixedSizeBinaryEncoder for the JSON writer to use a borrow of the FixedSizeBinaryArray being encoded, to follow other Encoder implementations, and to remove the use of clone.
@alamb - I changed the |
BooleanEncoder and StringEncoder were changed to use borrows of their respective Array types, to avoid cloning.
I also noticed that the |
arrow-json/src/writer/encoder.rs
Outdated
|
||
impl<'a> Encoder for FixedSizeBinaryEncoder<'a> { | ||
fn encode(&mut self, idx: usize, out: &mut Vec<u8>) { | ||
if self.array.is_valid(idx) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC nullability is handled by the parent and so this shouldn't be necessary for a non nested type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double checked by making the following change locally and the tests all still pass. Thus I agree the valid check is not needed
diff --git a/arrow-json/src/writer/encoder.rs b/arrow-json/src/writer/encoder.rs
index 27fa5747288..470d2fadd06 100644
--- a/arrow-json/src/writer/encoder.rs
+++ b/arrow-json/src/writer/encoder.rs
@@ -101,7 +101,7 @@ fn make_encoder_impl<'a>(
DataType::FixedSizeBinary(_) => {
let array = array.as_any().downcast_ref::<FixedSizeBinaryArray>().unwrap();
- (Box::new(FixedSizeBinaryEncoder::new(array, options)) as _, array.nulls().cloned())
+ (Box::new(FixedSizeBinaryEncoder::new(array)) as _, array.nulls().cloned())
}
DataType::Struct(fields) => {
@@ -451,14 +451,12 @@ impl<'a> Encoder for MapEncoder<'a> {
struct FixedSizeBinaryEncoder<'a> {
array: &'a FixedSizeBinaryArray,
- explicit_nulls: bool,
}
impl<'a> FixedSizeBinaryEncoder<'a> {
- fn new(array: &'a FixedSizeBinaryArray, options: &EncoderOptions) -> Self {
+ fn new(array: &'a FixedSizeBinaryArray) -> Self {
Self {
array,
- explicit_nulls: options.explicit_nulls,
}
}
}
@@ -473,8 +471,6 @@ impl<'a> Encoder for FixedSizeBinaryEncoder<'a> {
write!(out, "{byte:02x}").unwrap();
}
out.push(b'"');
- } else if self.explicit_nulls {
- out.extend_from_slice(b"null");
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, as per #5622 (comment), when I first added the test, it was passing without adding the explicit_nulls
option to the FixedSizeBinaryEncoder
, but I added it anyways. I am totally fine with taking it out.
A couple of reference counts are irrelevant compared to encoding overheads, but I suppose it can't hurt to avoid them |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @hiltontj -- I think this PR could be merged as is.
@tustvold has a good suggestion for simplificiation https://github.com/apache/arrow-rs/pull/5622/files#r1562191540 but I also think we could do that as a follow on PR if you prefer
Just let me know what you would prefer
arrow-json/src/writer/encoder.rs
Outdated
|
||
impl<'a> Encoder for FixedSizeBinaryEncoder<'a> { | ||
fn encode(&mut self, idx: usize, out: &mut Vec<u8>) { | ||
if self.array.is_valid(idx) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double checked by making the following change locally and the tests all still pass. Thus I agree the valid check is not needed
diff --git a/arrow-json/src/writer/encoder.rs b/arrow-json/src/writer/encoder.rs
index 27fa5747288..470d2fadd06 100644
--- a/arrow-json/src/writer/encoder.rs
+++ b/arrow-json/src/writer/encoder.rs
@@ -101,7 +101,7 @@ fn make_encoder_impl<'a>(
DataType::FixedSizeBinary(_) => {
let array = array.as_any().downcast_ref::<FixedSizeBinaryArray>().unwrap();
- (Box::new(FixedSizeBinaryEncoder::new(array, options)) as _, array.nulls().cloned())
+ (Box::new(FixedSizeBinaryEncoder::new(array)) as _, array.nulls().cloned())
}
DataType::Struct(fields) => {
@@ -451,14 +451,12 @@ impl<'a> Encoder for MapEncoder<'a> {
struct FixedSizeBinaryEncoder<'a> {
array: &'a FixedSizeBinaryArray,
- explicit_nulls: bool,
}
impl<'a> FixedSizeBinaryEncoder<'a> {
- fn new(array: &'a FixedSizeBinaryArray, options: &EncoderOptions) -> Self {
+ fn new(array: &'a FixedSizeBinaryArray) -> Self {
Self {
array,
- explicit_nulls: options.explicit_nulls,
}
}
}
@@ -473,8 +471,6 @@ impl<'a> Encoder for FixedSizeBinaryEncoder<'a> {
write!(out, "{byte:02x}").unwrap();
}
out.push(b'"');
- } else if self.explicit_nulls {
- out.extend_from_slice(b"null");
}
}
}
@alamb - I will update as suggested on this PR and re-request review, no problem! |
The FixedSizeBinaryEncoder does not need to handle nulls, as that will be handled by a parent encoder, i.e., list/map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @hiltontj for sticking with this. It looks very nice at the end 🙏
Which issue does this PR close?
Closes #5620.
Rationale for this change
See #5620.
What changes are included in this PR?
Adds encoding support to the JSON writer for the
FixedSizeBinary
data type.A simple test was added as well. I can add more extensive testing if needed.
Are there any user-facing changes?
Potentially some documentation to state that
FixedSizeBinary
is encoded as hex strings.