Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-11824: [Rust] [Parquet] Use logical types in Arrow schema conversion #9612

Closed
wants to merge 8 commits into from

Conversation

nevi-me
Copy link
Contributor

@nevi-me nevi-me commented Mar 2, 2021

Populate LogicalType when converting from Arrow schema to Parquet schema.

This is on top of #9592

@nevi-me nevi-me requested a review from sunchao March 2, 2021 09:14
@github-actions
Copy link

github-actions bot commented Mar 2, 2021

@alamb
Copy link
Contributor

alamb commented Mar 5, 2021

The clippy error seems unrelated to this PR:

error: unnecessary parentheses around `for` iterator expression
   --> datafusion/src/physical_plan/merge.rs:124:31
    |
124 |                 for part_i in (0..input_partitions) {
    |                               ^^^^^^^^^^^^^^^^^^^^^ help: remove these parentheses
    |
    = note: `-D unused-parens` implied by `-D warnings`

error: aborting due to previous error

I also saw it on @Dandandan 's PR. #9639

@alamb alamb closed this Mar 5, 2021
@alamb alamb reopened this Mar 5, 2021
@alamb
Copy link
Contributor

alamb commented Mar 5, 2021

Sorry I did not mean to close this PR

nevi-me pushed a commit that referenced this pull request Mar 6, 2021
ARROW-11881: [Rust][DataFusion] Fix clippy lint

A linter error has appeared on master somehow:

```
error: unnecessary parentheses around `for` iterator expression
   --> datafusion/src/physical_plan/merge.rs:124:31
    |
124 |                 for part_i in (0..input_partitions) {
    |                               ^^^^^^^^^^^^^^^^^^^^^ help: remove these parentheses
    |
    = note: `-D unused-parens` implied by `-D warnings`
```

Seen on at least #9612 and #9639:

https://github.com/apache/arrow/pull/9612/checks?check_run_id=2042047472

https://github.com/apache/arrow/pull/9639/checks?check_run_id=2042649120

Closes #9642 from alamb/fix_clippy

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
@nevi-me nevi-me force-pushed the ARROW-11824 branch 2 times, most recently from 5afcc44 to aecd501 Compare March 11, 2021 01:35
@nevi-me
Copy link
Contributor Author

nevi-me commented Mar 11, 2021

@sunchao may you please have a look at this when you get a chance, thanks :)

@sunchao
Copy link
Member

sunchao commented Mar 13, 2021

@nevi-me sorry missed this one - will take a look today.

TimeUnit::Nanosecond => ConvertedType::TIMESTAMP_MICROS,
})
.with_logical_type(Some(LogicalType::TIMESTAMP(TimestampType {
is_adjusted_to_u_t_c: matches!(zone, Some(z) if z.as_str() == "UTC"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this means we'll lose the timezone info right? as is_adjusted_to_u_t_c means using local timezone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that my logic is faulty. Reading https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc again, I now see that it says that is_adjusted_to_u_t_c = true is if we actually adjust the timezone.

So, I think it's safer to use false always, as we don't adjust any timezones?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can perhaps follow Arrow C++ and always set it to true whenever the timezone info is set, and normalize the timestamp value to UTC when converting to Parquet. Please see a previous discussion here and the related C++ code here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read the thread, and my interpretation is this:

  • What I was initially doing:
no timezone:    false
"UTC":          true
other timezone: false
  • What's done in C++
no timezone:    false
"UTC":          true
other timezone: true

and normalize the timestamp value to UTC when converting to Parquet

Arrow timestamps are always in UTC, such that any non-UTC timezone is for display purposes only (e.g. if we want to print formatted timestamps).
So, we shouldn't need to normalise timezones as they'll always be adjusted to UTC.

My initial approach was to set is_adjusted_to_u_t_c = true whenever there's a timezone, but I second-guessed myself while working on this code. I had looked at the C++ implementation, but somehow interpreted the true value to only be set if timezone = UTC.

@sunchao are you fine with setting is_adjusted_to_u_t_c = true whenever there's a timezone?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. In that case we don't need to do the normalization part right? Yes +1 on setting is_adjusted_to_u_t_c = true whenever there's a timezone. We should also handle the case when the timezone string is empty the same way as it is not set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we don't need the normalisation. I've modified the code, to check if a timezone string is not empty

"Unable to convert parquet INT32 logical type {}",
other
match (
self.schema.get_basic_info().logical_type(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: perhaps we can first "merge" the logical and converted type into a logical type and then do the conversion, to avoid some of the code duplications. In the case when logical type is not present, we can always convert the converted type into a logical type while losing some information.

We can do this as a follow-up though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I can do it as a follow-up when I've completed the overall 2.6.0 type support

rust/parquet/src/arrow/schema.rs Outdated Show resolved Hide resolved
rust/parquet/src/schema/types.rs Outdated Show resolved Hide resolved
id: self.id,
};
// Populate the converted type if only the logical type is populated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might need more tests for the case when logical type is set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test for the group type (modified an existing one), but for primitive types, I need the schema printer + parser. So, I'll increase the test coverage as part of #9705

This makes it convenient for users to only specify the logical type,
with the converted type being populated based on a 1:1 mapping.

(cherry picked from commit 7b4dda4)
(cherry picked from commit 115b946)
(cherry picked from commit 5afcc44)
@nevi-me
Copy link
Contributor Author

nevi-me commented Mar 18, 2021

Hey @sunchao I've updated this PR. I can't add some tests for logical type conversion as I need to complete #9705 for them. I'll extend the arrow schema conversion test coverage in that PR instead.

PTAL when you can 😄

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @nevi-me . I think the clippy check failure is unrelated.

@nevi-me nevi-me closed this in ef64d00 Mar 19, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
ARROW-11881: [Rust][DataFusion] Fix clippy lint

A linter error has appeared on master somehow:

```
error: unnecessary parentheses around `for` iterator expression
   --> datafusion/src/physical_plan/merge.rs:124:31
    |
124 |                 for part_i in (0..input_partitions) {
    |                               ^^^^^^^^^^^^^^^^^^^^^ help: remove these parentheses
    |
    = note: `-D unused-parens` implied by `-D warnings`
```

Seen on at least apache#9612 and apache#9639:

https://github.com/apache/arrow/pull/9612/checks?check_run_id=2042047472

https://github.com/apache/arrow/pull/9639/checks?check_run_id=2042649120

Closes apache#9642 from alamb/fix_clippy

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…rsion

Populate LogicalType when converting from Arrow schema to Parquet schema.

This is on top of apache#9592

Closes apache#9612 from nevi-me/ARROW-11824

Authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
ARROW-11881: [Rust][DataFusion] Fix clippy lint

A linter error has appeared on master somehow:

```
error: unnecessary parentheses around `for` iterator expression
   --> datafusion/src/physical_plan/merge.rs:124:31
    |
124 |                 for part_i in (0..input_partitions) {
    |                               ^^^^^^^^^^^^^^^^^^^^^ help: remove these parentheses
    |
    = note: `-D unused-parens` implied by `-D warnings`
```

Seen on at least apache#9612 and apache#9639:

https://github.com/apache/arrow/pull/9612/checks?check_run_id=2042047472

https://github.com/apache/arrow/pull/9639/checks?check_run_id=2042649120

Closes apache#9642 from alamb/fix_clippy

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
…rsion

Populate LogicalType when converting from Arrow schema to Parquet schema.

This is on top of apache#9592

Closes apache#9612 from nevi-me/ARROW-11824

Authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants