ARROW-11803: [Rust] [Parquet] Support v2 LogicalType #9592
rust/parquet/src/schema/types.rs
@@ -972,18 +1011,22 @@ fn from_thrift_helper(
 }

 /// Method to convert to Thrift.
-pub fn to_thrift(schema: &Type) -> Result<Vec<SchemaElement>> {
+pub fn to_thrift(schema: &Type, writer_version: i32) -> Result<Vec<SchemaElement>> {
This change enables us to write the logical type if v2 of the format is used.
Can you elaborate a bit on why `writer_version` is needed? I thought it would just be a straightforward conversion from the `SchemaElement` to the thrift counterpart.
I'll add a comment if you agree with my logic below.
I understand the format to mean that `LogicalType` is a version 2 only detail, such that someone writing version 1 of the format would not expect a `LogicalType` to be populated. So, I'm checking whether one intends to write v2, and only populating `LogicalType` in that instance.
This would become relevant when parsing the schema and displaying it (future PR). A v2 file's text schema has `INTEGER(32,true)` instead of `INT_32`, so to ensure that this is the case, we shouldn't write the `LogicalType` for v1 files, as a loose reader could end up misinterpreting the schema.
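To make the version check described above concrete, here is a minimal sketch. The types and the `to_schema_element` helper are hypothetical, simplified stand-ins (assumptions for illustration), not the crate's actual `to_thrift` code or the Thrift-generated structs.

```rust
// Hypothetical, simplified stand-ins; not the parquet crate's real types.
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalType {
    Integer { bit_width: i8, is_signed: bool },
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ConvertedType {
    Int32,
}

#[derive(Debug, PartialEq)]
pub struct SchemaElement {
    pub converted_type: Option<ConvertedType>,
    pub logical_type: Option<LogicalType>,
}

/// Builds a schema element, keeping `logical_type` only when the writer
/// targets v2 of the format. `converted_type` is always written.
pub fn to_schema_element(
    converted: Option<ConvertedType>,
    logical: Option<LogicalType>,
    writer_version: i32,
) -> SchemaElement {
    SchemaElement {
        converted_type: converted,
        // The version gate under discussion: drop the logical type for v1.
        logical_type: if writer_version >= 2 { logical } else { None },
    }
}
```

With this gate, a v1 writer emits only `converted_type`, while a v2 writer emits both fields.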
Hmm, to be precise, `LogicalType` was introduced in 2.4.0, so it feels a bit strange that we choose to write it when the version is, say, 2.3, but not when the version is 1. It is also an optional field, which preserves backward compatibility. Therefore, I guess it should be fine to write it regardless of whether `writer_version` is 1 or 2? A reader > 2.4.0 will try to parse the optional `LogicalType` first, while one < 2.4.0 will just look at the `ConvertedType` field.
I don't fully understand the implication for parsing the schema. When looking at `INTEGER(32,true)`, shouldn't we just populate the `LogicalType` together with the `ConvertedType` field, and only populate `ConvertedType` when seeing `INT_32`?
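The reader behaviour described in this comment could be sketched roughly as follows. All names here (`Annotation`, `newer_reader`, `older_reader`, and the simplified structs) are hypothetical and only illustrate the fallback order; they are not taken from the parquet crate.

```rust
// Hypothetical, simplified stand-ins for the Thrift-generated types.
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalType {
    Integer { bit_width: i8, is_signed: bool },
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ConvertedType {
    Int32,
}

#[derive(Debug, Clone, PartialEq)]
pub struct SchemaElement {
    pub converted_type: Option<ConvertedType>,
    pub logical_type: Option<LogicalType>,
}

/// What a reader ends up using to interpret a column.
#[derive(Debug, PartialEq)]
pub enum Annotation {
    Logical(LogicalType),
    Converted(ConvertedType),
    Unannotated,
}

/// A reader that knows about `LogicalType` prefers it and falls back to
/// `ConvertedType` when the optional field is absent.
pub fn newer_reader(el: &SchemaElement) -> Annotation {
    match (&el.logical_type, el.converted_type) {
        (Some(l), _) => Annotation::Logical(l.clone()),
        (None, Some(c)) => Annotation::Converted(c),
        (None, None) => Annotation::Unannotated,
    }
}

/// An older reader never looks at `logical_type`; it only sees `ConvertedType`.
pub fn older_reader(el: &SchemaElement) -> Annotation {
    match el.converted_type {
        Some(c) => Annotation::Converted(c),
        None => Annotation::Unannotated,
    }
}
```

Because both readers get a usable answer when both fields are populated, writing the optional field does not depend on the writer version in this sketch.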
I see what you mean @sunchao. I've removed the version check, and always write the logical type. I suppose I wasn't thinking about this well from a compatibility perspective. We'll always want to write in compliance with whatever `parquet-format` version we're using.
There's still something that's unclear to me about how we'll deal with the text schema format, but I can raise the questions when I work on its relevant PR.
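As a rough illustration of always writing the logical type while keeping older readers working, the sketch below derives a `ConvertedType` from a `LogicalType` where an equivalent exists. The enums, `converted_type_for`, and `annotate` are hypothetical simplifications; the integer mapping follows my reading of the format's logical type documentation, and UUID is shown as a type with no `ConvertedType` equivalent.

```rust
// Hypothetical, simplified stand-ins; not the parquet crate's real types.
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalType {
    Integer { bit_width: i8, is_signed: bool },
    Uuid,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ConvertedType {
    Int8,
    Int16,
    Int32,
    Int64,
    UInt8,
    UInt16,
    UInt32,
    UInt64,
}

#[derive(Debug, PartialEq)]
pub struct SchemaElement {
    pub converted_type: Option<ConvertedType>,
    pub logical_type: Option<LogicalType>,
}

/// Returns the legacy annotation equivalent to `logical`, if there is one.
pub fn converted_type_for(logical: &LogicalType) -> Option<ConvertedType> {
    match logical {
        LogicalType::Integer { bit_width, is_signed } => match (*bit_width, *is_signed) {
            (8, true) => Some(ConvertedType::Int8),
            (16, true) => Some(ConvertedType::Int16),
            (32, true) => Some(ConvertedType::Int32),
            (64, true) => Some(ConvertedType::Int64),
            (8, false) => Some(ConvertedType::UInt8),
            (16, false) => Some(ConvertedType::UInt16),
            (32, false) => Some(ConvertedType::UInt32),
            (64, false) => Some(ConvertedType::UInt64),
            // Any other bit width has no legacy counterpart in this sketch.
            _ => None,
        },
        // UUID has no legacy counterpart, so only the logical type is written.
        LogicalType::Uuid => None,
    }
}

/// Always writes the logical type, plus the legacy annotation when available.
pub fn annotate(logical: LogicalType) -> SchemaElement {
    let converted_type = converted_type_for(&logical);
    SchemaElement { converted_type, logical_type: Some(logical) }
}
```

Here `INTEGER(32,true)` would carry both annotations, so a reader that only understands `INT_32` still interprets the column correctly.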
Codecov Report
@@ Coverage Diff @@
## master #9592 +/- ##
==========================================
- Coverage 82.51% 82.42% -0.10%
==========================================
Files 245 245
Lines 57329 57646 +317
==========================================
+ Hits 47306 47515 +209
- Misses 10023 10131 +108
LGTM. I am not very familiar with the codebase; are there integration tests to verify that the code with the new changes can read files written by the previous version, and vice versa?
Thanks @nevi-me! I remember this was brought up back in 2019, and it's great to see it getting done.
Thanks for looking @sadikovi, I'll be able to address this in detail in follow-up PRs when I start using the populated `logical_type`. I have created ARROW-11824 for this.
FWIW, I think that we should have a bunch of golden parquet files and the corresponding I.e. exactly the same as we already do for the IPC, where we have |
bbe9e0c to e49e3dd
e49e3dd to 132949a
LGTM
The failing CI check, https://github.com/apache/arrow/pull/9592/checks?check_run_id=2010574916, has the same pattern as was fixed in #9593. I pulled this branch locally, merged it with apache/master, and re-ran all the tests. One seems to be failing for me locally:
I think I did the merge / updated submodules correctly:
fefb4b9 to 48379aa
@alamb I was writing to the same file from 2 tests, so it looks like it was a timing issue. I've now fixed this.
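One common way to avoid this kind of collision is to give every test its own output path. The sketch below is only an illustration under that assumption; `unique_temp_path` is a hypothetical helper, not necessarily how the tests in this PR were fixed.

```rust
use std::path::PathBuf;
use std::sync::atomic::{AtomicUsize, Ordering};

// Process-wide counter so tests running in parallel threads get distinct names.
static COUNTER: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical helper: returns a path in the system temp directory that is
/// unique per call within this process (process id plus a counter).
pub fn unique_temp_path(prefix: &str) -> PathBuf {
    let n = COUNTER.fetch_add(1, Ordering::SeqCst);
    let mut path = std::env::temp_dir();
    path.push(format!("{}_{}_{}.parquet", prefix, std::process::id(), n));
    path
}
```

Each test would call `unique_temp_path` once and write only to the returned path, so two tests never share a file even when they run concurrently.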
@alamb tests are getting stuck after the below, seems to be happening in master too, so likely not related to this PR.
On my machine, this is the last set of tests that run
There are 3 tests missing from the stalled CI. I can't establish what is special about the missing tests that would start blocking CI on master and this PR.
@nevi-me -- I see the failure on master too: https://github.com/apache/arrow/runs/2045186826 but it doesn't seem to happen for me locally (though I haven't run |
No, I meant locally -- I am sorry, I thought you meant you had it reproducing locally.
bd5f465 to 8a4a984
FYI I merged #9653 / ARROW-11896 for the Rust CI checks, which may affect this PR. If you see "Rust / AMD64 Debian 10 Rust stable test workspace" failing with a linker error or no logs, rebasing against master will hopefully fix the problem.
V2 of the format has a LogicalType that is different from what we were using as a LogicalType. Renaming ours allows us to implement the v2 one.
This adds LogicalType to the internal types and builders. It also populates the thrift type with logical types if v2 of the writer is used. Added a TODO for tests that should be added.
Also addresses some deviations from the spec on sorting intervals.
It might be premature to do this now. Can be done as part of ARROW-11365 if necessary.
8a4a984 to 360855f
🎉🎉🎉🎉🎉 Super cool! Thanks a lot @nevi-me!
…rsion
Populate LogicalType when converting from Arrow schema to Parquet schema. This is on top of #9592
Closes #9612 from nevi-me/ARROW-11824
Authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
…alType
# Rationale
While updating arrow deps in influxdata/influxdb_iox#1003, I got (very) confused for a while with the parquet upgrade as `LogicalType` was renamed to `ConvertedType` but then a new type called `LogicalType` was added in #9592.
# Changes
Add some comments to try and help future users save some time during upgrade
Closes #9731 from alamb/better-docs-on-rename
Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
This implements the LogicalType from v2 of the format, by:
- renaming `parquet::basic::LogicalType` to `parquet::basic::ConvertedType` to reflect the change in the spec
- implementing `parquet::basic::LogicalType` which maps to `parquet_format::LogicalType`
- writing the logical type in `parquet_format::SchemaElement` if v2 of the writer is used
- making minor changes to align with the spec on column ordering

This lays the groundwork for us to be able to:
- support UUID and nanosecond precision timestamps (Arrow and non-Arrow)
- support the new text schema format (`INT_32` and friends are deprecated)

Closes apache#9592 from nevi-me/parquet-v2-support
Authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>