ARROW-11824: [Rust] [Parquet] Use logical types in Arrow schema conversion #9612
The clippy error seems unrelated to this PR:
I also saw it on @Dandandan's PR, #9639.
Sorry, I did not mean to close this PR.
ARROW-11881: [Rust][DataFusion] Fix clippy lint

A linter error has appeared on master somehow:

```
error: unnecessary parentheses around `for` iterator expression
   --> datafusion/src/physical_plan/merge.rs:124:31
    |
124 |                 for part_i in (0..input_partitions) {
    |                               ^^^^^^^^^^^^^^^^^^^^^ help: remove these parentheses
    |
    = note: `-D unused-parens` implied by `-D warnings`
```

Seen on at least #9612 and #9639:
https://github.com/apache/arrow/pull/9612/checks?check_run_id=2042047472
https://github.com/apache/arrow/pull/9639/checks?check_run_id=2042649120

Closes #9642 from alamb/fix_clippy

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
5afcc44 to aecd501
@sunchao, could you please have a look at this when you get a chance? Thanks :)
@nevi-me sorry, I missed this one. Will take a look today.
rust/parquet/src/arrow/schema.rs
        TimeUnit::Nanosecond => ConvertedType::TIMESTAMP_MICROS,
    })
    .with_logical_type(Some(LogicalType::TIMESTAMP(TimestampType {
        is_adjusted_to_u_t_c: matches!(zone, Some(z) if z.as_str() == "UTC"),
Hmm, this means we'll lose the timezone info, right? As `is_adjusted_to_u_t_c` means using local timezone.
Yeah, I think that my logic is faulty. Reading https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc again, I now see that it says `is_adjusted_to_u_t_c = true` applies if we actually adjust the timezone.
So, I think it's safer to always use `false`, as we don't adjust any timezones?
I've read the thread, and my interpretation is this:
- What I was initially doing:
  - no timezone: false
  - "UTC": true
  - other timezone: false
- What's done in C++:
  - no timezone: false
  - "UTC": true
  - other timezone: true
  and normalize the timestamp value to UTC when converting to Parquet

Arrow timestamps are always in UTC, such that any non-UTC timezone is for display purposes only (e.g. if we want to print formatted timestamps).
So, we shouldn't need to normalise timezones, as they'll always be adjusted to UTC.

My initial approach was to set `is_adjusted_to_u_t_c = true` whenever there's a timezone, but I second-guessed myself while working on this code. I had looked at the C++ implementation, but somehow interpreted the `true` value to only be set if `timezone = UTC`.

@sunchao are you fine with setting `is_adjusted_to_u_t_c = true` whenever there's a timezone?
I see. In that case we don't need to do the normalization part, right? Yes, +1 on setting `is_adjusted_to_u_t_c = true` whenever there's a timezone. We should also handle an empty timezone string the same way as a timezone that is not set.
Yes, we don't need the normalisation. I've modified the code to check that the timezone string is not empty.
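For reference, the rule agreed in this thread could be sketched as below. This is only an illustration, not the code in the PR: the helper name `is_adjusted_to_utc` and its `Option<&str>` parameter are assumptions made for the example, whereas the PR itself matches on the Arrow timezone inside the schema conversion.

```rust
/// Hypothetical helper (not part of the PR): returns the value that would be
/// written to `is_adjusted_to_u_t_c` for a given Arrow timezone.
fn is_adjusted_to_utc(zone: Option<&str>) -> bool {
    // Any non-empty timezone yields `true`. A missing timezone and an empty
    // timezone string are treated the same way and yield `false`.
    matches!(zone, Some(z) if !z.is_empty())
}
```

Under this sketch, `None` and `Some("")` both map to `false`, while `Some("UTC")` and any other non-empty timezone map to `true`. No value normalisation is involved.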
    "Unable to convert parquet INT32 logical type {}",
    other
match (
    self.schema.get_basic_info().logical_type(),
nit: perhaps we can first "merge" the logical and converted type into a logical type and then do the conversion, to avoid some of the code duplication. When the logical type is not present, we can always convert the converted type into a logical type, at the cost of losing some information.
We can do this as a follow-up though.
Yeah, I can do it as a follow-up when I've completed the overall 2.6.0 type support
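A rough sketch of what the suggested "merge" step might look like follows. Every type and function name here (`ConvertedType`, `LogicalType`, `TimeUnit`, `merged_logical_type`) is a simplified stand-in defined for the example, not the parquet crate's actual API. The mapping of the legacy timestamp converted types to `is_adjusted_to_utc: true` follows the compatibility notes in the Parquet `LogicalTypes.md` document.

```rust
// Simplified stand-ins for the Parquet enums; the real ones have more variants.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit { Millis, Micros, Nanos }

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum ConvertedType { None, Utf8, Date, TimestampMillis, TimestampMicros }

#[derive(Debug, Clone, Copy, PartialEq)]
enum LogicalType {
    String,
    Date,
    Timestamp { is_adjusted_to_utc: bool, unit: TimeUnit },
}

/// Prefer the logical type when present; otherwise derive one from the
/// converted type, so that later conversion code matches on one enum only.
fn merged_logical_type(
    logical: Option<LogicalType>,
    converted: ConvertedType,
) -> Option<LogicalType> {
    logical.or_else(|| match converted {
        ConvertedType::None => None,
        ConvertedType::Utf8 => Some(LogicalType::String),
        ConvertedType::Date => Some(LogicalType::Date),
        // Legacy timestamp annotations are treated as UTC-adjusted.
        ConvertedType::TimestampMillis => Some(LogicalType::Timestamp {
            is_adjusted_to_utc: true,
            unit: TimeUnit::Millis,
        }),
        ConvertedType::TimestampMicros => Some(LogicalType::Timestamp {
            is_adjusted_to_utc: true,
            unit: TimeUnit::Micros,
        }),
    })
}
```

With a single merged value, the Parquet-to-Arrow conversion would need one `match` instead of separate arms for the logical and converted types.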
    id: self.id,
};
// Populate the converted type if only the logical type is populated
We might need more tests for the case when the logical type is set.
I added a test for the group type (modified an existing one), but for primitive types, I need the schema printer + parser. So, I'll increase the test coverage as part of #9705
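To illustrate the kind of case such tests would cover, here is a self-contained sketch of populating the converted type when only the logical type is given. The `Logical`, `Converted`, `Unit`, `BasicInfo` and `build` names are hypothetical simplifications, not the crate's builder API. Leaving nanosecond timestamps without a converted type is an assumption based on the legacy enum having no nanosecond variant.

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Unit { Millis, Micros, Nanos }

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Logical { String, Timestamp { unit: Unit } }

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Converted { None, Utf8, TimestampMillis, TimestampMicros }

#[derive(Debug, PartialEq)]
struct BasicInfo { logical: Option<Logical>, converted: Converted }

/// Maps a logical type to its closest legacy converted type.
fn converted_from_logical(logical: &Logical) -> Converted {
    match logical {
        Logical::String => Converted::Utf8,
        Logical::Timestamp { unit: Unit::Millis } => Converted::TimestampMillis,
        Logical::Timestamp { unit: Unit::Micros } => Converted::TimestampMicros,
        // Assumption: no legacy equivalent exists for nanoseconds.
        Logical::Timestamp { unit: Unit::Nanos } => Converted::None,
    }
}

/// Builds the type info, filling in the converted type only when the caller
/// supplied a logical type and left the converted type unset.
fn build(logical: Option<Logical>, converted: Converted) -> BasicInfo {
    let converted = match (&logical, converted) {
        (Some(l), Converted::None) => converted_from_logical(l),
        (_, c) => c,
    };
    BasicInfo { logical, converted }
}
```

A test for a primitive type could then check, for example, that `build(Some(Logical::String), Converted::None)` ends up with `Converted::Utf8`, and that an explicitly supplied converted type is left untouched.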
LGTM. Thanks @nevi-me. I think the clippy check failure is unrelated.
…rsion

Populate LogicalType when converting from Arrow schema to Parquet schema.
This is on top of apache#9592

Closes apache#9612 from nevi-me/ARROW-11824

Authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
Populate LogicalType when converting from Arrow schema to Parquet schema.
This is on top of #9592