
ARROW-11803: [Rust] [Parquet] Support v2 LogicalType #9592

Closed · wants to merge 10 commits into master from nevi-me:parquet-v2-support

Conversation

@nevi-me (Contributor) commented Feb 27, 2021

This implements the LogicalType from v2 of the format, by:

  • renaming parquet::basic::LogicalType to parquet::basic::ConvertedType to reflect the change in the spec
  • implementing parquet::basic::LogicalType which maps to parquet_format::LogicalType (a minimal sketch follows below)
  • writing the logical type in parquet_format::SchemaElement if v2 of the writer is used
  • making minor changes to align with the spec on column ordering

This lays the groundwork for us to be able to:

  • support UUID and nanosecond precision timestamps (Arrow and non-Arrow)
  • support the new text schema format (INT_32 and friends are deprecated)
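
To make the rename and the new mapping concrete, here is a minimal, self-contained sketch; these are not the actual crate definitions, the variant lists are heavily abbreviated, and the field names are assumptions:

```rust
// Sketch only: abbreviated stand-ins for the real parquet enums.
// The actual crate exposes many more variants and richer payloads.

/// The pre-2.4.0 annotation, formerly exposed as `LogicalType` in this crate.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ConvertedType {
    Utf8,
    Int32,
    TimestampMillis,
}

/// The format v2 annotation, which carries parameters (bit width, signedness,
/// timestamp unit, ...) instead of a flat list of variants.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LogicalType {
    String,
    Integer { bit_width: i8, is_signed: bool },
    Timestamp { is_adjusted_to_utc: bool, unit_is_nanos: bool },
}

/// Every logical type still maps back to a converted type (or none) so that
/// pre-2.4.0 readers keep working.
fn to_converted(lt: &LogicalType) -> Option<ConvertedType> {
    match lt {
        LogicalType::String => Some(ConvertedType::Utf8),
        LogicalType::Integer { bit_width: 32, is_signed: true } => Some(ConvertedType::Int32),
        LogicalType::Timestamp { unit_is_nanos: false, .. } => Some(ConvertedType::TimestampMillis),
        _ => None,
    }
}

fn main() {
    let lt = LogicalType::Integer { bit_width: 32, is_signed: true };
    assert_eq!(to_converted(&lt), Some(ConvertedType::Int32));
    println!("{:?} -> {:?}", lt, to_converted(&lt));
}
```

The point of the sketch is the design difference: the v2 LogicalType carries parameters, which is what later enables UUID and nanosecond timestamps, while each logical type still maps back to a converted type for older readers.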

@nevi-me nevi-me requested a review from sunchao February 27, 2021 08:54
@nevi-me (Contributor Author) left a comment

@sunchao @sadikovi, could you please have a look at this one?

I'd like to merge this into a dedicated branch, in case future work becomes disruptive. I've previously tried doing this in one big PR, but it became too much, so I'm going to try it incrementally.

rust/parquet/src/basic.rs (outdated review thread, resolved)
@@ -972,18 +1011,22 @@ fn from_thrift_helper(
 }

 /// Method to convert to Thrift.
-pub fn to_thrift(schema: &Type) -> Result<Vec<SchemaElement>> {
+pub fn to_thrift(schema: &Type, writer_version: i32) -> Result<Vec<SchemaElement>> {
@nevi-me (Contributor Author):

This change enables us to write the logical type if v2 of the format is used.
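
For illustration, a minimal stand-in sketch of the gate being described here, using string placeholders instead of the real enum and thrift types; this is not the crate's actual to_thrift code:

```rust
// Sketch only: the schema element fields are reduced to strings so the
// example stays self-contained; the real crate uses enum types here.
#[derive(Debug, Default)]
struct SchemaElement {
    name: String,
    converted_type: Option<&'static str>,
    logical_type: Option<&'static str>,
}

fn to_thrift_element(name: &str, writer_version: i32) -> SchemaElement {
    let mut element = SchemaElement { name: name.to_string(), ..Default::default() };
    // The pre-2.4.0 annotation is always written so that older readers keep working.
    element.converted_type = Some("INT_32");
    // The v2 annotation is only populated when targeting writer version 2,
    // which is what threading `writer_version` through `to_thrift` enables.
    if writer_version >= 2 {
        element.logical_type = Some("INTEGER(32,true)");
    }
    element
}

fn main() {
    assert!(to_thrift_element("id", 1).logical_type.is_none());
    assert!(to_thrift_element("id", 2).logical_type.is_some());
    println!("{:?}", to_thrift_element("id", 2));
}
```

Later in this thread the version check is dropped and the logical type is written unconditionally.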

@sunchao (Member):

Can you elaborate a bit why writer_version is needed? I thought it would just be a straightforward conversion from the SchemaElement to the thrift counterpart.

@nevi-me (Contributor Author):

I'll add a comment if you agree with my logic below.

I understand the format to mean that LogicalType is a version-2-only detail, such that someone writing version 1 of the format would not expect a LogicalType to be populated. So I'm checking whether the caller intends to write v2, and only populating LogicalType in that case.

This would become relevant when parsing the schema and displaying it (future PR). A v2 file's text schema has INTEGER(32,true) instead of INT_32, so to preserve that distinction we shouldn't write the LogicalType for v1 files, as a loose reader could end up misinterpreting the schema.

@sunchao (Member) commented Mar 1, 2021:

Hmm, to be precise, LogicalType was introduced in 2.4.0, so it feels a bit strange that we choose to write it when the version is, say, 2.3, but not when the version is 1. It is also an optional field, which preserves backward compatibility. Therefore, I guess it should be fine to write it whether writer_version is 1 or 2? A reader > 2.4.0 will try to parse the optional LogicalType first, while one < 2.4.0 will just look at the ConvertedType field.

I don't fully understand the implication for parsing the schema. When looking at INTEGER(32,true), shouldn't we just populate the LogicalType together with the ConvertedType field, and only populate ConvertedType when seeing INT_32?
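
To illustrate the fallback order described here, a small sketch with string stand-ins for the two annotation fields; this is not the crate's reader code:

```rust
// Sketch of the read-side fallback: prefer the optional LogicalType when it is
// present, otherwise fall back to the older ConvertedType annotation.
fn effective_annotation(
    logical_type: Option<&str>,
    converted_type: Option<&str>,
) -> Option<String> {
    // A reader built against parquet-format >= 2.4.0 prefers the optional
    // LogicalType; an older reader never sees the field and uses ConvertedType.
    logical_type.or(converted_type).map(|s| s.to_string())
}

fn main() {
    // File written by a new writer: both fields populated.
    assert_eq!(
        effective_annotation(Some("INTEGER(32,true)"), Some("INT_32")),
        Some("INTEGER(32,true)".to_string())
    );
    // File written by an old writer: only the converted type is present.
    assert_eq!(
        effective_annotation(None, Some("INT_32")),
        Some("INT_32".to_string())
    );
}
```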

@nevi-me (Contributor Author):

I see what you mean @sunchao. I've removed the version check and always write the logical type. I suppose I wasn't thinking about this properly from a compatibility perspective. We'll always want to write in compliance with whatever parquet-format version we're using.

There's still something that's unclear to me about how we'll deal with the text schema format, but I can raise those questions when I work on the relevant PR.

rust/parquet/src/schema/types.rs (outdated review thread, resolved)
@codecov-io commented Feb 27, 2021

Codecov Report

Merging #9592 (209622a) into master (dec5ab9) will decrease coverage by 0.09%.
The diff coverage is 79.89%.

@@            Coverage Diff             @@
##           master    #9592      +/-   ##
==========================================
- Coverage   82.51%   82.42%   -0.10%     
==========================================
  Files         245      245              
  Lines       57329    57646     +317     
==========================================
+ Hits        47306    47515     +209     
- Misses      10023    10131     +108     
Impacted Files Coverage Δ
rust/arrow/src/array/array_list.rs 92.88% <ø> (-0.07%) ⬇️
rust/arrow/src/datatypes/native.rs 76.59% <22.22%> (-12.88%) ⬇️
rust/parquet/src/schema/visitor.rs 68.42% <66.66%> (ø)
rust/parquet/src/arrow/array_reader.rs 77.53% <75.00%> (-0.08%) ⬇️
rust/parquet/src/record/reader.rs 91.09% <75.00%> (ø)
rust/parquet/src/basic.rs 87.22% <77.28%> (-10.04%) ⬇️
rust/parquet/src/schema/types.rs 89.79% <78.48%> (+0.25%) ⬆️
rust/parquet/src/arrow/schema.rs 92.36% <87.23%> (ø)
rust/parquet/src/schema/parser.rs 90.20% <88.88%> (ø)
rust/parquet/src/schema/printer.rs 62.08% <95.00%> (ø)
... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@sadikovi (Contributor) left a comment

LGTM. I am not very familiar with the codebase; are there integration tests to verify that code with the new changes can read files written by the previous version, and vice versa?

@sunchao changed the title from "ARROW-11803: [Rust] Parquet] Support v2 LogicalType" to "ARROW-11803: [Rust] [Parquet] Support v2 LogicalType" on Feb 28, 2021
@sunchao (Member) left a comment

Thanks @nevi-me! I remember this was brought up back in 2019, and it's great to see it getting done.

rust/parquet/src/basic.rs (3 outdated review threads, resolved)

rust/parquet/src/schema/types.rs (outdated review thread, resolved)
@nevi-me (Contributor Author) commented Feb 28, 2021

> LGTM. I am not very familiar with the codebase; are there integration tests to verify that code with the new changes can read files written by the previous version, and vice versa?

Thanks for looking, @sadikovi. I'll be able to address this in detail in follow-up PRs when I start using the populated logical_type; I have created ARROW-11824 for this.
For now, I'll write a few files with v2 and test reading them with Spark and pyarrow.

@jorgecarleitao (Member) commented:

FWIW, I think that we should have a set of golden parquet files and the corresponding .json files, e.g. generated by C++, so that we can change the parquet implementation without introducing regressions.

I.e. exactly the same as we already do for IPC, where we have .arrow and .json files with a JSON representation of the expected in-memory data.
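
A hedged sketch of what such a golden-file check could look like in Rust; the directory layout, the file names, and the parquet-to-JSON step are all hypothetical placeholders here, not existing test infrastructure:

```rust
use std::fs;
use std::path::Path;

// Hypothetical helper: load the expected JSON representation for a golden file.
fn read_golden_json(dir: &Path, name: &str) -> String {
    fs::read_to_string(dir.join(format!("{}.json", name))).expect("golden json missing")
}

#[test]
fn golden_files_have_not_regressed() {
    // Hypothetical location and file names for the golden corpus.
    let dir = Path::new("testing/data/parquet-golden");
    for name in ["int32_logical", "timestamp_nanos"] {
        let expected = read_golden_json(dir, name);
        // A real test would read `<name>.parquet`, convert the decoded data to
        // its JSON representation, and compare it against the golden JSON.
        let actual = expected.clone(); // placeholder for the parquet -> JSON step
        assert_eq!(actual, expected, "regression detected for golden file {}", name);
    }
}
```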

@nevi-me (Contributor Author) commented Mar 1, 2021

@sunchao I've updated the PR. @sadikovi I added a test that checks that we round-trip the logical type to the file.

@sunchao (Member) left a comment

LGTM

@alamb (Contributor) commented Mar 3, 2021

The failing CI check, https://github.com/apache/arrow/pull/9592/checks?check_run_id=2010574916, has the same pattern as the one fixed in #9593.

I pulled this branch locally, merged it with apache/master, and re-ran all the tests. One seems to be failing for me locally:

test util::io::tests::test_io_large_read ... ok

failures:

---- file::writer::tests::test_file_writer_with_metadata stdout ----
thread 'file::writer::tests::test_file_writer_with_metadata' panicked at 'called `Result::unwrap()` on an `Err` value: General("Could not parse metadata: end of file")', parquet/src/file/writer.rs:712:54


failures:
    file::writer::tests::test_file_writer_with_metadata

I think I did the merge / updated submodules correctly:

(arrow_dev) alamb@MacBook-Pro:~/Software/arrow2$ git status
On branch parquet-v2-support
Your branch is ahead of 'nevi-me/parquet-v2-support' by 9 commits.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean

@nevi-me force-pushed the parquet-v2-support branch 2 times, most recently from fefb4b9 to 48379aa on March 6, 2021 06:38
@nevi-me (Contributor Author) commented Mar 6, 2021

> One seems to be failing for me locally:

@alamb I was writing to the same file from 2 tests, so it looks like it was a timing issue. I've now fixed this.
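
For reference, the usual way to avoid two tests racing on one output path is to give each test its own temporary file, for example with the tempfile crate (whether this test suite already depends on tempfile is an assumption here); a minimal sketch:

```rust
use std::io::Write;

#[test]
fn writer_test_uses_its_own_file() {
    // Each test gets a unique temporary file, so parallel tests cannot collide
    // on a shared output path.
    let mut file = tempfile::NamedTempFile::new().expect("create temp file");
    file.write_all(b"PAR1").expect("write magic bytes");
    // ... a real test would write a full parquet file here and read it back
    // via `file.path()` or by reopening the handle ...
    assert!(file.path().exists());
}
```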

@nevi-me (Contributor Author) commented Mar 6, 2021

@alamb tests are getting stuck after the output below; this seems to be happening on master too, so it's likely not related to this PR.

running 17 tests
test src/arrow/array_reader.rs - arrow::array_reader::StructArrayReader::next_batch (line 958) ... ignored
test src/arrow/array_reader.rs - arrow::array_reader::StructArrayReader::next_batch (line 964) ... ignored
test src/arrow/array_reader.rs - arrow::array_reader::StructArrayReader::next_batch (line 969) ... ignored
test src/arrow/mod.rs - arrow (line 26) ... ok
test src/column/mod.rs - column (line 38) ... ok
test src/compression.rs - compression (line 25) ... ok
test src/file/mod.rs - file (line 29) ... ok
test src/file/mod.rs - file (line 64) ... ok
test src/file/mod.rs - file (line 81) ... ok
test src/file/statistics.rs - file::statistics (line 23) ... ok
test src/file/properties.rs - file::properties (line 22) ... ok
test src/record/api.rs - record::api::Row::get_column_iter (line 62) ... ok
test src/schema/mod.rs - schema (line 22) ... ok
test src/schema/parser.rs - schema::parser (line 24) ... ok

On my machine, this is the last set of tests that runs:

test src/schema/types.rs - schema::types::ColumnPath::string (line 540) ... ok
test src/file/statistics.rs - file::statistics (line 23) ... ok
test src/schema/parser.rs - schema::parser (line 24) ... ok
test src/schema/mod.rs - schema (line 22) ... ok
test src/schema/printer.rs - schema::printer (line 23) ... ok

There are 3 tests missing from the stalled CI run. I can't establish what is special about the missing tests that would start blocking CI on master and this PR.

@alamb (Contributor) commented Mar 6, 2021

@nevi-me I see the failure on master too: https://github.com/apache/arrow/runs/2045186826, but it doesn't seem to happen for me locally (though I haven't run cargo update recently). I will file a ticket.

@alamb (Contributor) commented Mar 6, 2021

@nevi-me I wonder if you can try backing out #9605 locally and see if that fixes the hang you are seeing?

@nevi-me (Contributor Author) commented Mar 6, 2021

> @nevi-me I wonder if you can try backing out #9605 locally and see if that fixes the hang you are seeing?

You mean on the PR? Everything works locally, even after cargo update. I'll try with #9605 reverted later this evening. I'm stepping out for now.

@alamb (Contributor) commented Mar 6, 2021

> You mean on the PR? Everything works locally,

No, I meant locally. I am sorry, I thought you meant you had it reproducing locally for you.

@alamb (Contributor) commented Mar 7, 2021

FYI, I merged #9653 / ARROW-11896 for the Rust CI checks, which may affect this PR. If you see "Rust / AMD64 Debian 10 Rust stable test workspace" failing with a linker error or no logs, rebasing against master will hopefully fix the problem.

V2 of the format has a LogicalType that is different from what we were previously exposing as LogicalType.
Renaming ours allows us to implement the v2 one.
This adds LogicalType to the internal types and builders.
It also populates the thrift type with logical types if v2 of the writer is used.

Added a TODO for tests that should be added.
Also addresses some deviations from the spec on sorting intervals.
It might be premature to do this now.
It can be done as part of ARROW-11365 if necessary.
@jorgecarleitao (Member) commented:

🎉🎉🎉🎉🎉 Super cool! Thanks a lot @nevi-me !

nevi-me added a commit that referenced this pull request Mar 19, 2021
…rsion

Populate LogicalType when converting from Arrow schema to Parquet schema.

This is on top of #9592

Closes #9612 from nevi-me/ARROW-11824

Authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
alamb added a commit that referenced this pull request Mar 19, 2021
…alType

# Rationale
While updating arrow deps in influxdata/influxdb_iox#1003, I got (very) confused for a while with the parquet upgrade as `LogicalType` was renamed to `ConvertedType` but then a new type called `LogicalType` was added in #9592.

# Changes
Add some comments to try and help future users save some time during upgrade

Closes #9731 from alamb/better-docs-on-rename

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
alamb added a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021 (commit message identical to the one above).
GeorgeAp pushed commits referencing this pull request to sirensolutions/arrow on Jun 7, 2021, and michalursa pushed commits referencing this pull request to michalursa/arrow on Jun 13, 2021; the commit messages are identical to the ones above.