Safely ignore Parquet fields with unimplemented Thrift types#9974
Safely ignore Parquet fields with unimplemented Thrift types#9974etseidl wants to merge 3 commits into
Conversation
|
run benchmark metadata |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing forward_proof_skip (63d82b7) to 1ffd202 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
| type Error = ThriftProtocolError; | ||
| fn try_from(value: ElementType) -> std::result::Result<Self, Self::Error> { | ||
| impl From<ElementType> for FieldType { | ||
| fn from(value: ElementType) -> Self { |
There was a problem hiding this comment.
This conversion can now be infallible
| FieldType::Double => self.skip_bytes(8).map(|_| ()), | ||
| FieldType::Binary => self.skip_binary().map(|_| ()), | ||
| FieldType::Struct => { | ||
| let mut last_field_id = 0i16; |
There was a problem hiding this comment.
small drive-by improvement
There was a problem hiding this comment.
I found this confusing at first, it works because skipping does not use the field id.
| Ok(()) | ||
| } | ||
| // no list or map types in parquet format | ||
| FieldType::Map => { |
There was a problem hiding this comment.
A map begins with a varint size. If size is 0 it's empty. Otherwise, the next byte encodes the key and value types with 4 bits each, and then encodes size key/value pairs.
There was a problem hiding this comment.
maybe it would help to add a link to the relevant part of the thrift encoding spec: https://github.com/apache/thrift/blob/master/doc/specs/thrift-compact-protocol.md#map
| Ok(()) | ||
| } | ||
| // no list or map types in parquet format | ||
| FieldType::Map => { |
There was a problem hiding this comment.
maybe it would help to add a link to the relevant part of the thrift encoding spec: https://github.com/apache/thrift/blob/master/doc/specs/thrift-compact-protocol.md#map
| Ok(()) | ||
| } | ||
| FieldType::List => { | ||
| // lists and sets are encoded the same |
There was a problem hiding this comment.
A link to the spec might help: https://github.com/apache/thrift/blob/master/doc/specs/thrift-compact-protocol.md#list-and-set
|
Looks good to me! |
Which issue does this PR close?
Rationale for this change
The thrift decoder should be able to skip fields with unimplemented thrift types
set,map, anduuid.What changes are included in this PR?
Flesh out the thrift enums and add code to the skip function to handle the above types.
Are these changes tested?
Yes, test is added
Are there any user-facing changes?
No, changes are to internal APIs