Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-11365: [Rust] [Parquet] Logical type printer and parser #9705

Closed
wants to merge 6 commits into from

Conversation

nevi-me
Copy link
Contributor

@nevi-me nevi-me commented Mar 15, 2021

This implements the parser and printer for logical types, allowing us to read and generate the schema in the form REQUIRED INT32 field_name (INTEGER(16,false)).

@nevi-me nevi-me requested a review from sunchao March 15, 2021 01:31
@github-actions
Copy link

@nevi-me
Copy link
Contributor Author

nevi-me commented Mar 15, 2021

@sunchao I've created this on top of #9612, PTAL when you can.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not an expert in this code, but I went through it carefully and it looks good to me. Thanks @nevi-me

FYI @sunchao -- let us know if you want to review this one too

Field::new("int32", DataType::Int32, false),
Field::new("int64", DataType::Int64, false),
Field::new("double", DataType::Float64, true),
Field::new("float", DataType::Float32, true),
Field::new("string", DataType::Utf8, true),
Field::new("string_2", DataType::Utf8, true),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it cool / needed that UTF8 and STRING both map to Arrow DataType::Utf8? It seems like UTF8 is not actually a valid "logical type" in parquet -- it should be STRING

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#string-types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where things get interesting. UTF8 maps to the converted type, and STRING to the logical type. So, the pecking order is to check for logical type, then fall back to converted type.

It gets tricky when the logical and converted types are the same string value, but that is fine as the converted type is always written out. Q good example is DECIMAL(12,2), it is the same in either logical or converted type.

I was confused with this, as I initially tried parsing logical and converted types separately, without mixing them in one file. After @sunchao's review on the other PRs, it started to make sense that they can coexist.

So, both map to the same Arrow type

];

assert_eq!(arrow_fields, converted_arrow_fields);
}

#[test]
#[ignore = "To be addressed as part of ARROW-11365"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

self.tokenizer.next(),
"Invalid boolean found",
"Failure to parse timezone info for TIME type",
)?; // TODO: this might not cater for the case of no scale correctly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the comment -- maybe worth a JIRA to track for later

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we always require both precision and scale to be present?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec allows DECIMAL(precision) https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

The TODO was because I got confused by the token parsing at the time. I've looked at it again, and it looks fine.

rust/parquet/src/schema/parser.rs Outdated Show resolved Hide resolved
tokenizer: &mut iter,
}
.parse_message_type();
assert!(result.is_ok());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend also assert the actual output schema too (e.g. that Time(micros) actually parsed to time(micros) and not some other type)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also covered this on other tests

tokenizer: &mut iter,
}
.parse_message_type();
assert!(result.is_ok());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend testing the actual schema too in addition to asserting that there were no errors in parsing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked that the error messages are what's expected, but for this is_ok() case, I think it'd be duplicative as we test the schema in test_parse_message_type_compare_* further below in the code

rust/parquet/src/arrow/schema.rs Outdated Show resolved Hide resolved
self.tokenizer.next(),
"Invalid boolean found",
"Failure to parse timezone info for TIME type",
)?; // TODO: this might not cater for the case of no scale correctly
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we always require both precision and scale to be present?

rust/parquet/src/schema/parser.rs Outdated Show resolved Hide resolved
}
LogicalType::INTEGER(_) => {
if let Some("(") = self.tokenizer.next() {
let bit_width = parse_i32(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to check that the bit_width is one of 8, 16, 32 and 64. All others should be invalid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked them against the physical type. Thanks

match logical {
Ok(logical) => Ok((
Some(logical.clone()),
ConvertedType::from(Some(logical)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not very related: it's a bit strange that ConvertedType::from takes an option rather than just the logical type. The latter may be more intuitive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has worked well for me so far, because ConvertedType has a NONE enum, so when using ConvertedType::from(None) I get ConvertedType::NONE. Also, not all LogicalTypes map to ConvertedType, so I can also return NONE in their instance.

If I were to take ConvertedType::from(LogicalType), I'd have to repeat if let Some(logicla_type) = logicalType in a few places.

Copy link
Contributor Author

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb @sunchao I've addressed the comments

match logical {
Ok(logical) => Ok((
Some(logical.clone()),
ConvertedType::from(Some(logical)),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has worked well for me so far, because ConvertedType has a NONE enum, so when using ConvertedType::from(None) I get ConvertedType::NONE. Also, not all LogicalTypes map to ConvertedType, so I can also return NONE in their instance.

If I were to take ConvertedType::from(LogicalType), I'd have to repeat if let Some(logicla_type) = logicalType in a few places.

self.tokenizer.next(),
"Invalid boolean found",
"Failure to parse timezone info for TIME type",
)?; // TODO: this might not cater for the case of no scale correctly
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec allows DECIMAL(precision) https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

The TODO was because I got confused by the token parsing at the time. I've looked at it again, and it looks fine.

}
LogicalType::INTEGER(_) => {
if let Some("(") = self.tokenizer.next() {
let bit_width = parse_i32(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked them against the physical type. Thanks

tokenizer: &mut iter,
}
.parse_message_type();
assert!(result.is_ok());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked that the error messages are what's expected, but for this is_ok() case, I think it'd be duplicative as we test the schema in test_parse_message_type_compare_* further below in the code

tokenizer: &mut iter,
}
.parse_message_type();
assert!(result.is_ok());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also covered this on other tests

@alamb
Copy link
Contributor

alamb commented Mar 24, 2021

Looks good to me -- I think it is ready to merge.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @nevi-me !

@nevi-me nevi-me closed this in 2c5e264 Mar 26, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
This implements the parser and printer for logical types, allowing us to read and generate the schema in the form `REQUIRED INT32 field_name (INTEGER(16,false))`.

Closes apache#9705 from nevi-me/ARROW-11365

Authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 10, 2021
This implements the parser and printer for logical types, allowing us to read and generate the schema in the form `REQUIRED INT32 field_name (INTEGER(16,false))`.

Closes apache#9705 from nevi-me/ARROW-11365

Authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
This implements the parser and printer for logical types, allowing us to read and generate the schema in the form `REQUIRED INT32 field_name (INTEGER(16,false))`.

Closes apache#9705 from nevi-me/ARROW-11365

Authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants