ARROW-11365: [Rust] [Parquet] Logical type printer and parser #9705
Field::new("int32", DataType::Int32, false),
Field::new("int64", DataType::Int64, false),
Field::new("double", DataType::Float64, true),
Field::new("float", DataType::Float32, true),
Field::new("string", DataType::Utf8, true),
Field::new("string_2", DataType::Utf8, true),
Is it cool / needed that UTF8 and STRING both map to Arrow DataType::Utf8? It seems like UTF8 is not actually a valid "logical type" in parquet -- it should be STRING:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#string-types
This is where things get interesting. UTF8 maps to the converted type, and STRING to the logical type. So, the pecking order is to check for logical type, then fall back to converted type.
It gets tricky when the logical and converted types have the same string value, but that is fine because the converted type is always written out. A good example is DECIMAL(12,2): it is the same in either the logical or the converted type.
I was confused by this, as I initially tried parsing logical and converted types separately, without mixing them in one file. After @sunchao's review on the other PRs, it started to make sense that they can coexist.
So, both map to the same Arrow type.
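The lookup order described above (logical type first, converted type as the fallback) can be sketched roughly as follows. The enums and the `byte_array_to_arrow` helper are simplified, hypothetical stand-ins written for this illustration; they are not the parquet crate's actual types or conversion code.

```rust
// Simplified stand-ins for illustration only; the real LogicalType and
// ConvertedType enums carry many more variants.
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalType {
    String,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ConvertedType {
    None,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
pub enum ArrowType {
    Utf8,
    Binary,
}

/// Resolve the Arrow type of a BYTE_ARRAY column (hypothetical helper).
pub fn byte_array_to_arrow(
    logical: Option<&LogicalType>,
    converted: ConvertedType,
) -> ArrowType {
    match logical {
        // 1. A logical type annotation wins when it is present.
        Some(LogicalType::String) => ArrowType::Utf8,
        // 2. Otherwise fall back to the legacy converted type.
        None => match converted {
            ConvertedType::Utf8 => ArrowType::Utf8,
            ConvertedType::None => ArrowType::Binary,
        },
    }
}
```

With this shape, a column annotated only with the converted type UTF8 and a column annotated with the logical type STRING both resolve to the same Arrow string type.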
];

assert_eq!(arrow_fields, converted_arrow_fields);
}

#[test]
#[ignore = "To be addressed as part of ARROW-11365"]
🎉
rust/parquet/src/schema/parser.rs
Outdated
self.tokenizer.next(),
"Invalid boolean found",
"Failure to parse timezone info for TIME type",
)?; // TODO: this might not cater for the case of no scale correctly
I don't understand the comment -- maybe worth a JIRA to track for later
I think we always require both precision and scale to be present?
The spec allows DECIMAL(precision)
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
The TODO was because I got confused by the token parsing at the time. I've looked at it again, and it looks fine.
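To make the optional-scale case concrete, here is a minimal sketch of accepting both `DECIMAL(precision)` and `DECIMAL(precision,scale)`. It works on a whole annotation string rather than on the PR's tokenizer, `parse_decimal` and `parse_num` are hypothetical names, and treating an omitted scale as 0 is an assumption based on my reading of the spec linked above.

```rust
/// Parse one numeric argument, reporting which text failed to parse.
fn parse_num(s: &str) -> Result<i32, String> {
    s.parse::<i32>()
        .map_err(|e| format!("invalid number '{}': {}", s, e))
}

/// Parse `DECIMAL(precision)` or `DECIMAL(precision,scale)` into
/// `(precision, scale)`. Hypothetical helper, not the PR's tokenizer code.
pub fn parse_decimal(annotation: &str) -> Result<(i32, i32), String> {
    // Strip the `DECIMAL(` prefix and the closing `)`.
    let inner = annotation
        .trim()
        .strip_prefix("DECIMAL(")
        .and_then(|rest| rest.strip_suffix(')'))
        .ok_or_else(|| format!("not a DECIMAL annotation: {}", annotation))?;

    let parts: Vec<&str> = inner.split(',').map(str::trim).collect();
    match parts.as_slice() {
        // Scale omitted: assumed here to default to 0.
        [precision] => Ok((parse_num(precision)?, 0)),
        [precision, scale] => Ok((parse_num(precision)?, parse_num(scale)?)),
        _ => Err(format!("expected 1 or 2 arguments, got {}", parts.len())),
    }
}
```

Calling `parse_decimal("DECIMAL(12)")` and `parse_decimal("DECIMAL(12,2)")` both succeed under this sketch, while a non-numeric argument is rejected with an error.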
tokenizer: &mut iter,
}
.parse_message_type();
assert!(result.is_ok());
I recommend also asserting the actual output schema (e.g. that Time(micros) actually parsed to time(micros) and not some other type).
Also covered this on other tests
tokenizer: &mut iter,
}
.parse_message_type();
assert!(result.is_ok());
I recommend testing the actual schema too in addition to asserting that there were no errors in parsing
I've checked that the error messages are what's expected, but for this is_ok() case, I think it'd be duplicative as we test the schema in test_parse_message_type_compare_* further below in the code.
}
LogicalType::INTEGER(_) => {
if let Some("(") = self.tokenizer.next() {
let bit_width = parse_i32(
We might want to check that the bit_width is one of 8, 16, 32 and 64. All others should be invalid.
I've checked them against the physical type. Thanks
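One way such a check against the physical type could look is sketched below. The `PhysicalType` enum and `check_integer_bit_width` function are hypothetical, and the rule that INT32 accepts widths 8, 16 and 32 while INT64 accepts only 64 is my assumption about what "checked against the physical type" means here.

```rust
// Minimal stand-in for the physical types relevant to INTEGER annotations.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PhysicalType {
    Int32,
    Int64,
}

/// Validate an INTEGER bit width against the column's physical type and
/// return it as an `i8` on success (hypothetical helper).
pub fn check_integer_bit_width(
    physical: PhysicalType,
    bit_width: i32,
) -> Result<i8, String> {
    match (physical, bit_width) {
        // Assumed rule: INT32 can hold 8-, 16- and 32-bit integers.
        (PhysicalType::Int32, 8 | 16 | 32) => Ok(bit_width as i8),
        // Assumed rule: INT64 only holds 64-bit integers.
        (PhysicalType::Int64, 64) => Ok(64),
        _ => Err(format!(
            "INTEGER({}) is not valid for physical type {:?}",
            bit_width, physical
        )),
    }
}
```

Any other width, such as 12 or 128, falls through to the error arm, which also covers a valid width paired with the wrong physical type.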
match logical {
Ok(logical) => Ok((
Some(logical.clone()),
ConvertedType::from(Some(logical)),
Probably not very related: it's a bit strange that ConvertedType::from takes an option rather than just the logical type. The latter may be more intuitive.
This has worked well for me so far, because ConvertedType has a NONE variant, so when using ConvertedType::from(None) I get ConvertedType::NONE. Also, not all LogicalTypes map to a ConvertedType, so I can return NONE in those cases too.
If I were to take ConvertedType::from(LogicalType), I'd have to repeat if let Some(logical_type) = logical_type in a few places.
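The design being defended here can be illustrated with a small sketch. The enums below are cut-down, hypothetical versions written for this example (the crate's real variants are named and shaped differently); the point is only that a single `From<Option<LogicalType>>` impl absorbs both the "no logical type" case and the "no converted equivalent" case.

```rust
// Cut-down stand-ins for illustration; not the crate's real definitions.
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalType {
    String,
    Uuid,
    Integer { bit_width: i8, is_signed: bool },
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ConvertedType {
    None,
    Utf8,
    Int8,
    Int16,
    Int32,
    Int64,
    Uint8,
    Uint16,
    Uint32,
    Uint64,
}

impl From<Option<LogicalType>> for ConvertedType {
    fn from(logical: Option<LogicalType>) -> Self {
        match logical {
            Some(LogicalType::String) => ConvertedType::Utf8,
            Some(LogicalType::Integer { bit_width, is_signed }) => {
                match (bit_width, is_signed) {
                    (8, true) => ConvertedType::Int8,
                    (16, true) => ConvertedType::Int16,
                    (32, true) => ConvertedType::Int32,
                    (64, true) => ConvertedType::Int64,
                    (8, false) => ConvertedType::Uint8,
                    (16, false) => ConvertedType::Uint16,
                    (32, false) => ConvertedType::Uint32,
                    (64, false) => ConvertedType::Uint64,
                    // Unsupported widths have no converted equivalent.
                    _ => ConvertedType::None,
                }
            }
            // A logical type with no converted-type counterpart.
            Some(LogicalType::Uuid) => ConvertedType::None,
            // No logical type at all.
            None => ConvertedType::None,
        }
    }
}
```

Callers that hold an `Option<LogicalType>` can then convert it directly, without unwrapping the option at each call site.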
Looks good to me -- I think it is ready to merge.
LGTM. Thanks @nevi-me !
This implements the parser and printer for logical types, allowing us to read and generate the schema in the form `REQUIRED INT32 field_name (INTEGER(16,false))`. Closes apache#9705 from nevi-me/ARROW-11365 Authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
This implements the parser and printer for logical types, allowing us to read and generate the schema in the form `REQUIRED INT32 field_name (INTEGER(16,false))`.
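As a rough illustration of the printed form, the sketch below renders a field line with an optional logical type annotation. `LogicalType` here is a reduced, hypothetical enum and `print_field` is an invented helper; the PR's actual printer works on the crate's schema types and is not reproduced here.

```rust
use std::fmt;

// Reduced stand-in for the logical types printed in this sketch.
pub enum LogicalType {
    String,
    Integer { bit_width: i8, is_signed: bool },
    Decimal { precision: i32, scale: i32 },
}

impl fmt::Display for LogicalType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LogicalType::String => write!(f, "STRING"),
            LogicalType::Integer { bit_width, is_signed } => {
                write!(f, "INTEGER({},{})", bit_width, is_signed)
            }
            LogicalType::Decimal { precision, scale } => {
                write!(f, "DECIMAL({},{})", precision, scale)
            }
        }
    }
}

/// Render `<repetition> <physical> <name> (<logical>)`, omitting the
/// parenthesised annotation when there is no logical type.
pub fn print_field(
    repetition: &str,
    physical: &str,
    name: &str,
    logical: Option<&LogicalType>,
) -> String {
    match logical {
        Some(lt) => format!("{} {} {} ({})", repetition, physical, name, lt),
        None => format!("{} {} {}", repetition, physical, name),
    }
}
```

For a required INT32 field named `field_name` annotated as an unsigned 16-bit integer, this produces the same text as the example in the description above.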