ARROW-11365: [Rust] [Parquet] Logical type printer and parser #9705

nevi-me · 2021-03-15T01:31:31Z

This implements the parser and printer for logical types, allowing us to read and generate the schema in the form REQUIRED INT32 field_name (INTEGER(16,false)).

github-actions · 2021-03-15T01:32:05Z

https://issues.apache.org/jira/browse/ARROW-11365

nevi-me · 2021-03-15T01:32:38Z

@sunchao I've created this on top of #9612, PTAL when you can.

alamb

I am not an expert in this code, but I went through it carefully and it looks good to me. Thanks @nevi-me

FYI @sunchao -- let us know if you want to review this one too

alamb · 2021-03-22T19:31:36Z

rust/parquet/src/arrow/schema.rs

            Field::new("int32", DataType::Int32, false),
            Field::new("int64", DataType::Int64, false),
            Field::new("double", DataType::Float64, true),
            Field::new("float", DataType::Float32, true),
            Field::new("string", DataType::Utf8, true),
+            Field::new("string_2", DataType::Utf8, true),


Is it intentional / needed that UTF8 and STRING both map to Arrow DataType::Utf8? It seems like UTF8 is not actually a valid "logical type" in Parquet -- it should be STRING

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#string-types

This is where things get interesting. UTF8 maps to the converted type, and STRING to the logical type. So, the pecking order is to check for logical type, then fall back to converted type.

It gets tricky when the logical and converted types are the same string value, but that is fine as the converted type is always written out. A good example is DECIMAL(12,2), which is the same in either logical or converted type.

I was confused with this, as I initially tried parsing logical and converted types separately, without mixing them in one file. After @sunchao's review on the other PRs, it started to make sense that they can coexist.

So, both map to the same Arrow type
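To illustrate the pecking order described above, here is a minimal sketch. The enums are simplified, hypothetical stand-ins rather than the parquet crate's real types, and the `Binary` fallback for an unannotated column is an assumption made for the example only.

```rust
// Hypothetical, trimmed-down stand-ins for the crate's types (illustration only).
#[derive(Debug, Clone, PartialEq)]
enum LogicalType {
    String,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum ConvertedType {
    None,
    Utf8,
}

#[derive(Debug, PartialEq)]
enum ArrowType {
    Utf8,
    Binary,
}

/// Resolve the Arrow type for a BYTE_ARRAY column: check the logical type
/// first, then fall back to the converted type.
fn byte_array_to_arrow(logical: Option<LogicalType>, converted: ConvertedType) -> ArrowType {
    match (logical, converted) {
        // A logical type, when present, wins regardless of the converted type.
        (Some(LogicalType::String), _) => ArrowType::Utf8,
        // No logical type: fall back to the converted type.
        (None, ConvertedType::Utf8) => ArrowType::Utf8,
        // Neither annotation present (assumed fallback for this sketch).
        (None, ConvertedType::None) => ArrowType::Binary,
    }
}
```

With this shape, a column annotated `(STRING)` and a column annotated `(UTF8)` both resolve to `ArrowType::Utf8`, which is why the test adds `string_2` alongside `string`.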

alamb · 2021-03-22T19:32:03Z

rust/parquet/src/arrow/schema.rs

        ];

        assert_eq!(arrow_fields, converted_arrow_fields);
    }

    #[test]
-    #[ignore = "To be addressed as part of ARROW-11365"]


alamb · 2021-03-22T19:37:34Z

rust/parquet/src/schema/parser.rs

+                                    self.tokenizer.next(),
+                                    "Invalid boolean found",
+                                    "Failure to parse timezone info for TIME type",
+                                )?; // TODO: this might not cater for the case of no scale correctly


I don't understand the comment -- maybe worth a JIRA to track for later

I think we always require both precision and scale to be present?

The spec allows DECIMAL(precision) https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

The TODO was because I got confused by the token parsing at the time. I've looked at it again, and it looks fine.
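For reference, a rough sketch of how the optional scale could be handled over a token stream. This is not the code in this PR: `parse_decimal_args` and `parse_i32_token` are hypothetical helpers, and the default scale of 0 follows the linked spec.

```rust
/// Parse the arguments that follow a DECIMAL keyword.
/// Returns (precision, scale); scale defaults to 0 when it is omitted.
fn parse_decimal_args<'a, I>(tokens: &mut I) -> Result<(i32, i32), String>
where
    I: Iterator<Item = &'a str>,
{
    if tokens.next() != Some("(") {
        return Err("expected '(' after DECIMAL".to_string());
    }
    let precision = parse_i32_token(tokens.next(), "precision")?;
    match tokens.next() {
        // DECIMAL(precision): no scale given.
        Some(")") => Ok((precision, 0)),
        // DECIMAL(precision, scale)
        Some(",") => {
            let scale = parse_i32_token(tokens.next(), "scale")?;
            match tokens.next() {
                Some(")") => Ok((precision, scale)),
                other => Err(format!("expected ')', found {:?}", other)),
            }
        }
        other => Err(format!("expected ',' or ')', found {:?}", other)),
    }
}

/// Hypothetical helper: turn one token into an i32 with a readable error.
fn parse_i32_token(token: Option<&str>, what: &str) -> Result<i32, String> {
    token
        .ok_or_else(|| format!("missing {}", what))?
        .parse::<i32>()
        .map_err(|e| format!("invalid {}: {}", what, e))
}
```

Feeding it the tokens `(`, `12`, `,`, `2`, `)` yields precision 12 and scale 2, while `(`, `9`, `)` yields precision 9 and scale 0.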

alamb · 2021-03-22T19:40:02Z

rust/parquet/src/schema/parser.rs

+            tokenizer: &mut iter,
+        }
+        .parse_message_type();
+        assert!(result.is_ok());


I recommend also asserting the actual output schema (e.g. that Time(micros) actually parsed to time(micros) and not some other type)

Also covered this on other tests

alamb · 2021-03-22T19:40:31Z

rust/parquet/src/schema/parser.rs

+            tokenizer: &mut iter,
+        }
+        .parse_message_type();
+        assert!(result.is_ok());


I recommend testing the actual schema too in addition to asserting that there were no errors in parsing

I've checked that the error messages are what's expected, but for this is_ok() case, I think it'd be duplicative as we test the schema in test_parse_message_type_compare_* further below in the code

sunchao · 2021-03-23T06:24:14Z

rust/parquet/src/schema/parser.rs

+                                    self.tokenizer.next(),
+                                    "Invalid boolean found",
+                                    "Failure to parse timezone info for TIME type",
+                                )?; // TODO: this might not cater for the case of no scale correctly


I think we always require both precision and scale to be present?

sunchao · 2021-03-23T06:32:07Z

rust/parquet/src/schema/parser.rs

+                    }
+                    LogicalType::INTEGER(_) => {
+                        if let Some("(") = self.tokenizer.next() {
+                            let bit_width = parse_i32(


We might want to check that the bit_width is one of 8, 16, 32 and 64. All others should be invalid.

I've checked them against the physical type. Thanks
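A minimal sketch of such a check, assuming the spec's pairing of INTEGER widths with physical types (8, 16 and 32 on INT32; 64 on INT64). `PhysicalType` and `check_integer_bit_width` are hypothetical names for illustration, not the code added in this PR.

```rust
// Hypothetical subset of physical types relevant to INTEGER annotations.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PhysicalType {
    Int32,
    Int64,
}

/// Validate an INTEGER(bit_width, ..) annotation against the physical type.
fn check_integer_bit_width(physical: PhysicalType, bit_width: i8) -> Result<(), String> {
    match (physical, bit_width) {
        // 8, 16 and 32 bit integers are stored as INT32.
        (PhysicalType::Int32, 8) | (PhysicalType::Int32, 16) | (PhysicalType::Int32, 32) => Ok(()),
        // 64 bit integers are stored as INT64.
        (PhysicalType::Int64, 64) => Ok(()),
        // A valid width paired with the wrong physical type.
        (_, 8) | (_, 16) | (_, 32) | (_, 64) => Err(format!(
            "INTEGER({}) cannot annotate {:?}",
            bit_width, physical
        )),
        // Anything else is not a valid width at all.
        _ => Err(format!(
            "invalid bit width {}; expected 8, 16, 32 or 64",
            bit_width
        )),
    }
}
```

Checking against the physical type covers both concerns at once: it rejects widths outside 8, 16, 32 and 64, and also a valid width on the wrong column type, such as `INT64 ... (INTEGER(16,false))`.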

sunchao · 2021-03-23T06:42:45Z

rust/parquet/src/schema/parser.rs

+                    match logical {
+                        Ok(logical) => Ok((
+                            Some(logical.clone()),
+                            ConvertedType::from(Some(logical)),


Probably not very related: it's a bit strange that ConvertedType::from takes an option rather than just the logical type. The latter may be more intuitive.

This has worked well for me so far, because ConvertedType has a NONE variant, so ConvertedType::from(None) gives me ConvertedType::NONE. Also, not every LogicalType maps to a ConvertedType, so I can return NONE in those cases too.

If I were to take ConvertedType::from(LogicalType), I'd have to repeat if let Some(logical_type) = logicalType in a few places.
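The trade-off can be sketched as follows. The enums here are a hypothetical two-variant subset, not the crate's full definitions; UUID is used as an example of a logical type with no converted-type equivalent.

```rust
// Hypothetical subsets of the crate's enums, named to match the discussion.
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, PartialEq)]
enum LogicalType {
    STRING,
    UUID,
}

#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum ConvertedType {
    NONE,
    UTF8,
}

impl From<Option<LogicalType>> for ConvertedType {
    fn from(value: Option<LogicalType>) -> Self {
        match value {
            Some(LogicalType::STRING) => ConvertedType::UTF8,
            // A logical type without a converted-type counterpart.
            Some(LogicalType::UUID) => ConvertedType::NONE,
            // No logical type at all.
            None => ConvertedType::NONE,
        }
    }
}
```

Because the `None` arm lives inside the conversion, a caller holding an `Option<LogicalType>` can pass it straight through without unwrapping it first.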

nevi-me

@alamb @sunchao I've addressed the comments

nevi-me · 2021-03-23T20:35:21Z

rust/parquet/src/schema/parser.rs

+                    match logical {
+                        Ok(logical) => Ok((
+                            Some(logical.clone()),
+                            ConvertedType::from(Some(logical)),


This has worked well for me so far, because ConvertedType has a NONE variant, so ConvertedType::from(None) gives me ConvertedType::NONE. Also, not every LogicalType maps to a ConvertedType, so I can return NONE in those cases too.

If I were to take ConvertedType::from(LogicalType), I'd have to repeat if let Some(logical_type) = logicalType in a few places.

nevi-me · 2021-03-24T06:04:16Z

rust/parquet/src/schema/parser.rs

+                                    self.tokenizer.next(),
+                                    "Invalid boolean found",
+                                    "Failure to parse timezone info for TIME type",
+                                )?; // TODO: this might not cater for the case of no scale correctly


The spec allows DECIMAL(precision) https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

The TODO was because I got confused by the token parsing at the time. I've looked at it again, and it looks fine.

nevi-me · 2021-03-24T06:05:03Z

rust/parquet/src/schema/parser.rs

+                    }
+                    LogicalType::INTEGER(_) => {
+                        if let Some("(") = self.tokenizer.next() {
+                            let bit_width = parse_i32(


I've checked them against the physical type. Thanks

nevi-me · 2021-03-24T06:20:58Z

rust/parquet/src/schema/parser.rs

+            tokenizer: &mut iter,
+        }
+        .parse_message_type();
+        assert!(result.is_ok());


I've checked that the error messages are what's expected, but for this is_ok() case, I think it'd be duplicative as we test the schema in test_parse_message_type_compare_* further below in the code

nevi-me · 2021-03-24T11:41:06Z

rust/parquet/src/schema/parser.rs

+            tokenizer: &mut iter,
+        }
+        .parse_message_type();
+        assert!(result.is_ok());


Also covered this on other tests

alamb · 2021-03-24T20:49:26Z

Looks good to me -- I think it is ready to merge.

sunchao

LGTM. Thanks @nevi-me !

This implements the parser and printer for logical types, allowing us to read and generate the schema in the form `REQUIRED INT32 field_name (INTEGER(16,false))`. Closes apache#9705 from nevi-me/ARROW-11365 Authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>

nevi-me requested a review from sunchao March 15, 2021 01:31

github-actions bot added Component: Rust Component: Parquet labels Mar 15, 2021

nevi-me mentioned this pull request Mar 18, 2021

ARROW-11824: [Rust] [Parquet] Use logical types in Arrow schema conversion #9612

Closed

nevi-me force-pushed the ARROW-11365 branch from 157a06d to 95cd542 Compare March 19, 2021 09:08

alamb approved these changes Mar 22, 2021

sunchao reviewed Mar 23, 2021

nevi-me added 6 commits March 24, 2021 14:03

use logical types to map to Arrow types

f42337e

(cherry picked from commit 780e966)

address review comments

5b8578f

schema printer for logical types

b4eb0de

implement logical type parser

117b26a

test logical types, remove TODO

ffab87f

address review comments

6aaf4fa

nevi-me force-pushed the ARROW-11365 branch from 95cd542 to 6aaf4fa Compare March 24, 2021 12:04

nevi-me commented Mar 24, 2021

sunchao approved these changes Mar 26, 2021

nevi-me closed this in 2c5e264 Mar 26, 2021

asfimport mentioned this pull request Mar 26, 2021

[Rust] [Parquet] Implement parsers for v2 of the text schema #27259

Closed
