ARROW-11824: [Rust] [Parquet] Use logical types in Arrow schema conversion #9612

nevi-me · 2021-03-02T09:14:43Z

Populate LogicalType when converting from Arrow schema to Parquet schema.

This is on top of #9592

github-actions · 2021-03-02T09:15:10Z

https://issues.apache.org/jira/browse/ARROW-11824

alamb · 2021-03-05T21:35:18Z

The clippy error seems unrelated to this PR:

error: unnecessary parentheses around `for` iterator expression
   --> datafusion/src/physical_plan/merge.rs:124:31
    |
124 |                 for part_i in (0..input_partitions) {
    |                               ^^^^^^^^^^^^^^^^^^^^^ help: remove these parentheses
    |
    = note: `-D unused-parens` implied by `-D warnings`

error: aborting due to previous error

I also saw it on @Dandandan's PR, #9639.
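For reference, the lint only asks for the parentheses around the range to be dropped. A minimal sketch of the corrected loop shape follows; the function name and loop body are hypothetical stand-ins, not the code in `merge.rs`.

```rust
/// Hypothetical stand-in for the loop in merge.rs: sums the partition indices.
fn sum_partition_indices(input_partitions: usize) -> usize {
    let mut total = 0;
    // Writing `for part_i in (0..input_partitions)` triggers `unused_parens`,
    // which `-D warnings` promotes to a hard error. The bare range is enough.
    for part_i in 0..input_partitions {
        total += part_i;
    }
    total
}
```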

alamb · 2021-03-05T21:37:01Z

Sorry I did not mean to close this PR

ARROW-11881: [Rust][DataFusion] Fix clippy lint

A linter error has appeared on master somehow:

```
error: unnecessary parentheses around `for` iterator expression
   --> datafusion/src/physical_plan/merge.rs:124:31
    |
124 |                 for part_i in (0..input_partitions) {
    |                               ^^^^^^^^^^^^^^^^^^^^^ help: remove these parentheses
    |
    = note: `-D unused-parens` implied by `-D warnings`
```

Seen on at least #9612 and #9639:

https://github.com/apache/arrow/pull/9612/checks?check_run_id=2042047472
https://github.com/apache/arrow/pull/9639/checks?check_run_id=2042649120

Closes #9642 from alamb/fix_clippy

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>

nevi-me · 2021-03-11T01:35:48Z

@sunchao may you please have a look at this when you get a chance, thanks :)

sunchao · 2021-03-13T16:39:50Z

@nevi-me sorry missed this one - will take a look today.

sunchao · 2021-03-14T06:58:00Z

rust/parquet/src/arrow/schema.rs

-                    TimeUnit::Nanosecond => ConvertedType::TIMESTAMP_MICROS,
-                })
+                .with_logical_type(Some(LogicalType::TIMESTAMP(TimestampType {
+                    is_adjusted_to_u_t_c: matches!(zone, Some(z) if z.as_str() == "UTC"),


Hmm, this means we'll lose the timezone info, right? Setting is_adjusted_to_u_t_c to false means the timestamp is treated as local time.

Yeah, I think that my logic is faulty. Reading https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc again, I now see that is_adjusted_to_u_t_c = true applies when the timestamp values are actually normalized to UTC.

So, I think it's safer to use false always, as we don't adjust any timezones?

I think we can perhaps follow Arrow C++ and always set it to true whenever the timezone info is set, and normalize the timestamp value to UTC when converting to Parquet. Please see a previous discussion here and the related C++ code here.

I've read the thread, and my interpretation is this:

What I was initially doing:

no timezone: false
"UTC": true
other timezone: false

What's done in C++:

no timezone: false
"UTC": true
other timezone: true

and normalize the timestamp value to UTC when converting to Parquet

Arrow timestamps are always in UTC, such that any non-UTC timezone is for display purposes only (e.g. if we want to print formatted timestamps).
So, we shouldn't need to normalise timezones as they'll always be adjusted to UTC.

My initial approach was to set is_adjusted_to_u_t_c = true whenever there's a timezone, but I second-guessed myself while working on this code. I had looked at the C++ implementation, but somehow interpreted the true value to only be set if timezone = UTC.

@sunchao are you fine with setting is_adjusted_to_u_t_c = true whenever there's a timezone?

I see. In that case we don't need to do the normalization part, right? Yes, +1 on setting is_adjusted_to_u_t_c = true whenever there's a timezone. We should also handle an empty timezone string the same way as when no timezone is set.

Yes, we don't need the normalisation. I've modified the code to check that the timezone string is not empty.
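The rule agreed in this thread can be sketched as a small predicate. This is a minimal illustration, not the code merged in the PR: the helper name `is_adjusted_to_utc` is hypothetical, and it takes `Option<&str>` where the Arrow type carries an optional timezone string.

```rust
/// Hypothetical helper: value for `is_adjusted_to_u_t_c` given an Arrow timezone.
///
/// - no timezone       -> false
/// - empty string      -> false (treated the same as "not set")
/// - any other string  -> true  (Arrow timestamp values are already UTC-based)
fn is_adjusted_to_utc(timezone: Option<&str>) -> bool {
    matches!(timezone, Some(tz) if !tz.is_empty())
}
```

With an `Option<String>` in hand, the same check could be called as `is_adjusted_to_utc(zone.as_deref())`.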

sunchao · 2021-03-14T07:17:32Z

rust/parquet/src/arrow/schema.rs

-                "Unable to convert parquet INT32 logical type {}",
-                other
+        match (
+            self.schema.get_basic_info().logical_type(),


nit: perhaps we can first "merge" the logical and converted types into a single logical type and then do the conversion, to avoid some of the code duplication. When the logical type is not present, we can always convert the converted type into a logical type, at the cost of losing some information.

We can do this as a follow-up though.

Yeah, I can do it as a follow-up when I've completed the overall 2.6.0 type support
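One possible shape for that follow-up is sketched below, using simplified stand-in enums rather than the parquet crate's types. All names here are hypothetical. The `is_adjusted_to_utc: true` fallback for legacy timestamps follows the backward-compatibility rules in the parquet-format LogicalTypes document.

```rust
// Simplified stand-ins for illustration only; not the parquet crate's types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit { Millis, Micros }

#[derive(Debug, Clone, Copy, PartialEq)]
enum LogicalType {
    String,
    Timestamp { unit: TimeUnit, is_adjusted_to_utc: bool },
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum ConvertedType { Absent, Utf8, TimestampMillis, TimestampMicros }

/// Prefer the logical type; otherwise derive one from the converted type.
fn effective_logical_type(
    logical: Option<LogicalType>,
    converted: ConvertedType,
) -> Option<LogicalType> {
    logical.or(match converted {
        ConvertedType::Utf8 => Some(LogicalType::String),
        // Legacy timestamp annotations are read as UTC-normalized instants.
        ConvertedType::TimestampMillis => Some(LogicalType::Timestamp {
            unit: TimeUnit::Millis,
            is_adjusted_to_utc: true,
        }),
        ConvertedType::TimestampMicros => Some(LogicalType::Timestamp {
            unit: TimeUnit::Micros,
            is_adjusted_to_utc: true,
        }),
        ConvertedType::Absent => None,
    })
}
```

The Arrow conversion would then only need to match on the single merged value instead of on the (logical, converted) pair.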

sunchao · 2021-03-14T07:35:45Z

rust/parquet/src/schema/types.rs

            id: self.id,
        };
+        // Populate the converted type if only the logical type is populated


we might need more tests for the case when logical type is set.

I added a test for the group type (modified an existing one), but for primitive types, I need the schema printer + parser. So, I'll increase the test coverage as part of #9705
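As a rough illustration of what "populate the converted type if only the logical type is populated" means, here is a sketch with simplified stand-in enums (hypothetical names, not the parquet crate's API). Nanosecond timestamps have no converted-type equivalent, so they map to the absent value.

```rust
// Simplified stand-ins for illustration only; not the parquet crate's types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit { Millis, Micros, Nanos }

#[derive(Debug, Clone, Copy, PartialEq)]
enum LogicalType {
    String,
    Date,
    Timestamp { unit: TimeUnit, is_adjusted_to_utc: bool },
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum ConvertedType { Absent, Utf8, Date, TimestampMillis, TimestampMicros }

/// Derive the legacy converted type from an optional logical type.
fn converted_type_for(logical: Option<LogicalType>) -> ConvertedType {
    match logical {
        Some(LogicalType::String) => ConvertedType::Utf8,
        Some(LogicalType::Date) => ConvertedType::Date,
        Some(LogicalType::Timestamp { unit, .. }) => match unit {
            TimeUnit::Millis => ConvertedType::TimestampMillis,
            TimeUnit::Micros => ConvertedType::TimestampMicros,
            // No legacy annotation exists for nanosecond precision.
            TimeUnit::Nanos => ConvertedType::Absent,
        },
        None => ConvertedType::Absent,
    }
}
```

A builder that accepts only a logical type could call a function like this so that older readers, which understand only converted types, still see a usable annotation.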

nevi-me · 2021-03-18T17:42:42Z

Hey @sunchao I've updated this PR. I can't add some tests for logical type conversion as I need to complete #9705 for them. I'll extend the arrow schema conversion test coverage in that PR instead.

PTAL when you can 😄

sunchao

LGTM. Thanks @nevi-me . I think the clippy check failure is unrelated.

…rsion

Populate LogicalType when converting from Arrow schema to Parquet schema.

This is on top of apache#9592

Closes apache#9612 from nevi-me/ARROW-11824

Authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>

nevi-me requested a review from sunchao March 2, 2021 09:14

github-actions bot added Component: Rust Component: Parquet labels Mar 2, 2021

nevi-me force-pushed the ARROW-11824 branch from 98bf62a to 2b3f342 Compare March 5, 2021 18:16

alamb closed this Mar 5, 2021

alamb reopened this Mar 5, 2021

alamb mentioned this pull request Mar 5, 2021

ARROW-11881: [Rust][DataFusion] Fix clippy lint #9642

Closed

nevi-me force-pushed the ARROW-11824 branch 2 times, most recently from 5afcc44 to aecd501 Compare March 11, 2021 01:35

sunchao reviewed Mar 14, 2021

nevi-me force-pushed the ARROW-11824 branch from aecd501 to 203f8da Compare March 15, 2021 01:20

nevi-me mentioned this pull request Mar 15, 2021

ARROW-11365: [Rust] [Parquet] Logical type printer and parser #9705

Closed

nevi-me force-pushed the ARROW-11824 branch from 203f8da to 14f646e Compare March 15, 2021 21:04

nevi-me added 7 commits March 18, 2021 18:41

validate logical vs physical type, populate converted type

7316083

This makes it convenient for users to only specify the logical type, with the converted type being populated based on a 1:1 mapping. (cherry picked from commit 7b4dda4)

use logical types to map to Arrow types

30da7a6

(cherry picked from commit 780e966)

rename logical > converted

84a636f

(cherry picked from commit 115b946)

clippy fixes

921aeae

(cherry picked from commit 5afcc44)

address review comments

a7c9bed

revert timeszone change

b7e76c6

add more types to schema conversion

a347932

nevi-me force-pushed the ARROW-11824 branch from 14f646e to a347932 Compare March 18, 2021 17:41

sunchao approved these changes Mar 19, 2021

fix clippy

6ba5d04

nevi-me closed this in ef64d00 Mar 19, 2021

asfimport mentioned this pull request Mar 19, 2021

[Rust] [Parquet] Use logical types in Arrow writer #18540

Closed
