
ARROW-11803: [Rust] [Parquet] Support v2 LogicalType #9592

Closed · wants to merge 10 commits into master from nevi-me:parquet-v2-support

Conversation

@nevi-me (Contributor) commented Feb 27, 2021

This implements the LogicalType from v2 of the format, by:

  • renaming parquet::basic::LogicalType to parquet::basic::ConvertedType to reflect the change in the spec
  • implementing parquet::basic::LogicalType which maps to parquet_format::LogicalType (a minimal sketch follows below)
  • writing the logical type in parquet_format::SchemaElement if v2 of the writer is used
  • making minor changes to align with the spec on column ordering

This lays the groundwork for us to be able to:

  • support UUID and nanosecond precision timestamps (Arrow and non-Arrow)
  • support the new text schema format (INT_32 and friends are deprecated)
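
To make the rename and the new mapping concrete, here is a minimal, self-contained sketch; these are not the actual crate definitions, the variant lists are heavily abbreviated, and the field names are assumptions:

```rust
// Sketch only: abbreviated stand-ins for the real parquet enums.
// The actual crate exposes many more variants and richer payloads.

/// The pre-2.4.0 annotation, formerly exposed as `LogicalType` in this crate.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ConvertedType {
    Utf8,
    Int32,
    TimestampMillis,
}

/// The format v2 annotation, which carries parameters (bit width, signedness,
/// timestamp unit, ...) instead of a flat list of variants.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LogicalType {
    String,
    Integer { bit_width: i8, is_signed: bool },
    Timestamp { is_adjusted_to_utc: bool, unit_is_nanos: bool },
}

/// Every logical type still maps back to a converted type (or none) so that
/// pre-2.4.0 readers keep working.
fn to_converted(lt: &LogicalType) -> Option<ConvertedType> {
    match lt {
        LogicalType::String => Some(ConvertedType::Utf8),
        LogicalType::Integer { bit_width: 32, is_signed: true } => Some(ConvertedType::Int32),
        LogicalType::Timestamp { unit_is_nanos: false, .. } => Some(ConvertedType::TimestampMillis),
        _ => None,
    }
}

fn main() {
    let lt = LogicalType::Integer { bit_width: 32, is_signed: true };
    assert_eq!(to_converted(&lt), Some(ConvertedType::Int32));
    println!("{:?} -> {:?}", lt, to_converted(&lt));
}
```

The point of the sketch is the design difference: the v2 LogicalType carries parameters, which is what later enables UUID and nanosecond timestamps, while each logical type still maps back to a converted type for older readers.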

@nevi-me nevi-me requested a review from sunchao February 27, 2021 08:54
@nevi-me (Contributor Author) left a comment

@sunchao @sadikovi, could you please have a look at this one?

I'd like to merge this into a dedicated branch, in case future work becomes disruptive. I've previously tried doing this in one big PR, but it became too much, so I'm going to try it incrementally.

rust/parquet/src/basic.rs (outdated review thread, resolved)
@@ -972,18 +1011,22 @@ fn from_thrift_helper(
 }

 /// Method to convert to Thrift.
-pub fn to_thrift(schema: &Type) -> Result<Vec<SchemaElement>> {
+pub fn to_thrift(schema: &Type, writer_version: i32) -> Result<Vec<SchemaElement>> {
@nevi-me (Contributor Author):

This change enables us to write the logical type if v2 of the format is used.
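
For illustration, a minimal stand-in sketch of the gate being described here, using string placeholders instead of the real enum and thrift types; this is not the crate's actual to_thrift code:

```rust
// Sketch only: the schema element fields are reduced to strings so the
// example stays self-contained; the real crate uses enum types here.
#[derive(Debug, Default)]
struct SchemaElement {
    name: String,
    converted_type: Option<&'static str>,
    logical_type: Option<&'static str>,
}

fn to_thrift_element(name: &str, writer_version: i32) -> SchemaElement {
    let mut element = SchemaElement { name: name.to_string(), ..Default::default() };
    // The pre-2.4.0 annotation is always written so that older readers keep working.
    element.converted_type = Some("INT_32");
    // The v2 annotation is only populated when targeting writer version 2,
    // which is what threading `writer_version` through `to_thrift` enables.
    if writer_version >= 2 {
        element.logical_type = Some("INTEGER(32,true)");
    }
    element
}

fn main() {
    assert!(to_thrift_element("id", 1).logical_type.is_none());
    assert!(to_thrift_element("id", 2).logical_type.is_some());
    println!("{:?}", to_thrift_element("id", 2));
}
```

Later in this thread the version check is dropped and the logical type is written unconditionally.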

@sunchao (Member):

Can you elaborate a bit why writer_version is needed? I thought it would just be a straightforward conversion from the SchemaElement to the thrift counterpart.

@nevi-me (Contributor Author):

I'll add a comment if you agree with my logic below.

I understand the format to mean that LogicalType is a version-2-only detail, such that someone writing version 1 of the format would not expect a LogicalType to be populated. So I'm checking whether the caller intends to write v2, and only populating LogicalType in that case.

This would become relevant when parsing the schema and displaying it (future PR). A v2 file's text schema has INTEGER(32,true) instead of INT_32, so to preserve that distinction we shouldn't write the LogicalType for v1 files, as a loose reader could end up misinterpreting the schema.

@sunchao (Member) commented Mar 1, 2021:

Hmm, to be precise, LogicalType was introduced in 2.4.0, so it feels a bit strange that we choose to write it when the version is, say, 2.3, but not when the version is 1. It is also an optional field, which preserves backward compatibility. Therefore, I guess it should be fine to write it whether writer_version is 1 or 2? A reader > 2.4.0 will try to parse the optional LogicalType first, while one < 2.4.0 will just look at the ConvertedType field.

I don't fully understand the implication for parsing the schema. When looking at INTEGER(32,true), shouldn't we just populate the LogicalType together with the ConvertedType field, and only populate ConvertedType when seeing INT_32?
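
To illustrate the fallback order described here, a small sketch with string stand-ins for the two annotation fields; this is not the crate's reader code:

```rust
// Sketch of the read-side fallback: prefer the optional LogicalType when it is
// present, otherwise fall back to the older ConvertedType annotation.
fn effective_annotation(
    logical_type: Option<&str>,
    converted_type: Option<&str>,
) -> Option<String> {
    // A reader built against parquet-format >= 2.4.0 prefers the optional
    // LogicalType; an older reader never sees the field and uses ConvertedType.
    logical_type.or(converted_type).map(|s| s.to_string())
}

fn main() {
    // File written by a new writer: both fields populated.
    assert_eq!(
        effective_annotation(Some("INTEGER(32,true)"), Some("INT_32")),
        Some("INTEGER(32,true)".to_string())
    );
    // File written by an old writer: only the converted type is present.
    assert_eq!(
        effective_annotation(None, Some("INT_32")),
        Some("INT_32".to_string())
    );
}
```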

@nevi-me (Contributor Author):

I see what you mean @sunchao. I've removed the version check and always write the logical type. I suppose I wasn't thinking about this properly from a compatibility perspective. We'll always want to write in compliance with whatever parquet-format version we're using.

There's still something that's unclear to me about how we'll deal with the text schema format, but I can raise those questions when I work on the relevant PR.

rust/parquet/src/schema/types.rs (outdated review thread, resolved)
@codecov-io commented Feb 27, 2021

Codecov Report

Merging #9592 (209622a) into master (dec5ab9) will decrease coverage by 0.09%.
The diff coverage is 79.89%.

@@            Coverage Diff             @@
##           master    #9592      +/-   ##
==========================================
- Coverage   82.51%   82.42%   -0.10%     
==========================================
  Files         245      245              
  Lines       57329    57646     +317     
==========================================
+ Hits        47306    47515     +209     
- Misses      10023    10131     +108     
Impacted Files Coverage Δ
rust/arrow/src/array/array_list.rs 92.88% <ø> (-0.07%) ⬇️
rust/arrow/src/datatypes/native.rs 76.59% <22.22%> (-12.88%) ⬇️
rust/parquet/src/schema/visitor.rs 68.42% <66.66%> (ø)
rust/parquet/src/arrow/array_reader.rs 77.53% <75.00%> (-0.08%) ⬇️
rust/parquet/src/record/reader.rs 91.09% <75.00%> (ø)
rust/parquet/src/basic.rs 87.22% <77.28%> (-10.04%) ⬇️
rust/parquet/src/schema/types.rs 89.79% <78.48%> (+0.25%) ⬆️
rust/parquet/src/arrow/schema.rs 92.36% <87.23%> (ø)
rust/parquet/src/schema/parser.rs 90.20% <88.88%> (ø)
rust/parquet/src/schema/printer.rs 62.08% <95.00%> (ø)
... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@sadikovi (Contributor) left a comment

LGTM. I am not very familiar with the codebase; are there integration tests to verify that code with the new changes can read files written by the previous version, and vice versa?

@sunchao changed the title from "ARROW-11803: [Rust] Parquet] Support v2 LogicalType" to "ARROW-11803: [Rust] [Parquet] Support v2 LogicalType" on Feb 28, 2021
@sunchao (Member) left a comment

Thanks @nevi-me! I remember this was brought up back in 2019, and it's great to see it getting done.

rust/parquet/src/basic.rs (3 outdated review threads, resolved)

rust/parquet/src/schema/types.rs (outdated review thread, resolved)
@nevi-me (Contributor Author) commented Feb 28, 2021

> LGTM. I am not very familiar with the codebase; are there integration tests to verify that code with the new changes can read files written by the previous version, and vice versa?

Thanks for looking, @sadikovi. I'll be able to address this in detail in follow-up PRs when I start using the populated logical_type; I have created ARROW-11824 for this.
For now, I'll write a few files with v2 and test reading them with Spark and pyarrow.

@jorgecarleitao (Member) commented:

FWIW, I think that we should have a set of golden parquet files and the corresponding .json files, e.g. generated by C++, so that we can change the parquet implementation without introducing regressions.

I.e. exactly the same as we already do for IPC, where we have .arrow and .json files with a JSON representation of the expected in-memory data.
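
A hedged sketch of what such a golden-file check could look like in Rust; the directory layout, the file names, and the parquet-to-JSON step are all hypothetical placeholders here, not existing test infrastructure:

```rust
use std::fs;
use std::path::Path;

// Hypothetical helper: load the expected JSON representation for a golden file.
fn read_golden_json(dir: &Path, name: &str) -> String {
    fs::read_to_string(dir.join(format!("{}.json", name))).expect("golden json missing")
}

#[test]
fn golden_files_have_not_regressed() {
    // Hypothetical location and file names for the golden corpus.
    let dir = Path::new("testing/data/parquet-golden");
    for name in ["int32_logical", "timestamp_nanos"] {
        let expected = read_golden_json(dir, name);
        // A real test would read `<name>.parquet`, convert the decoded data to
        // its JSON representation, and compare it against the golden JSON.
        let actual = expected.clone(); // placeholder for the parquet -> JSON step
        assert_eq!(actual, expected, "regression detected for golden file {}", name);
    }
}
```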

@nevi-me (Contributor Author) commented Mar 1, 2021

@sunchao I've updated the PR. @sadikovi I added a test that checks that we round-trip the logical type to the file.

@sunchao (Member) left a comment

LGTM

@alamb (Contributor) commented Mar 3, 2021

The failing CI check, https://github.com/apache/arrow/pull/9592/checks?check_run_id=2010574916, has the same pattern as the one fixed in #9593.

I pulled this branch locally, merged it with apache/master, and re-ran all the tests. One seems to be failing for me locally:

test util::io::tests::test_io_large_read ... ok

failures:

---- file::writer::tests::test_file_writer_with_metadata stdout ----
thread 'file::writer::tests::test_file_writer_with_metadata' panicked at 'called `Result::unwrap()` on an `Err` value: General("Could not parse metadata: end of file")', parquet/src/file/writer.rs:712:54


failures:
    file::writer::tests::test_file_writer_with_metadata

I think I did the merge / updated submodules correctly:

(arrow_dev) alamb@MacBook-Pro:~/Software/arrow2$ git status
On branch parquet-v2-support
Your branch is ahead of 'nevi-me/parquet-v2-support' by 9 commits.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean

@nevi-me force-pushed the parquet-v2-support branch 2 times, most recently from fefb4b9 to 48379aa on March 6, 2021 06:38
@nevi-me (Contributor Author) commented Mar 6, 2021

> One seems to be failing for me locally:

@alamb I was writing to the same file from 2 tests, so it looks like it was a timing issue. I've now fixed this.
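
For reference, the usual way to avoid two tests racing on one output path is to give each test its own temporary file, for example with the tempfile crate (whether this test suite already depends on tempfile is an assumption here); a minimal sketch:

```rust
use std::io::Write;

#[test]
fn writer_test_uses_its_own_file() {
    // Each test gets a unique temporary file, so parallel tests cannot collide
    // on a shared output path.
    let mut file = tempfile::NamedTempFile::new().expect("create temp file");
    file.write_all(b"PAR1").expect("write magic bytes");
    // ... a real test would write a full parquet file here and read it back
    // via `file.path()` or by reopening the handle ...
    assert!(file.path().exists());
}
```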

@nevi-me (Contributor Author) commented Mar 6, 2021

@alamb tests are getting stuck after the output below; this seems to be happening on master too, so it's likely not related to this PR.

running 17 tests
test src/arrow/array_reader.rs - arrow::array_reader::StructArrayReader::next_batch (line 958) ... ignored
test src/arrow/array_reader.rs - arrow::array_reader::StructArrayReader::next_batch (line 964) ... ignored
test src/arrow/array_reader.rs - arrow::array_reader::StructArrayReader::next_batch (line 969) ... ignored
test src/arrow/mod.rs - arrow (line 26) ... ok
test src/column/mod.rs - column (line 38) ... ok
test src/compression.rs - compression (line 25) ... ok
test src/file/mod.rs - file (line 29) ... ok
test src/file/mod.rs - file (line 64) ... ok
test src/file/mod.rs - file (line 81) ... ok
test src/file/statistics.rs - file::statistics (line 23) ... ok
test src/file/properties.rs - file::properties (line 22) ... ok
test src/record/api.rs - record::api::Row::get_column_iter (line 62) ... ok
test src/schema/mod.rs - schema (line 22) ... ok
test src/schema/parser.rs - schema::parser (line 24) ... ok

On my machine, this is the last set of tests that runs:

test src/schema/types.rs - schema::types::ColumnPath::string (line 540) ... ok
test src/file/statistics.rs - file::statistics (line 23) ... ok
test src/schema/parser.rs - schema::parser (line 24) ... ok
test src/schema/mod.rs - schema (line 22) ... ok
test src/schema/printer.rs - schema::printer (line 23) ... ok

There are 3 tests missing from the stalled CI run. I can't establish what is special about the missing tests that would start blocking CI on master and this PR.

@alamb (Contributor) commented Mar 6, 2021

@nevi-me I see the failure on master too: https://github.com/apache/arrow/runs/2045186826, but it doesn't seem to happen for me locally (though I haven't run cargo update recently). I will file a ticket.

@alamb (Contributor) commented Mar 6, 2021

@nevi-me I wonder if you can try backing out #9605 locally and see if that fixes the hang you are seeing?

@nevi-me (Contributor Author) commented Mar 6, 2021

> @nevi-me I wonder if you can try backing out #9605 locally and see if that fixes the hang you are seeing?

You mean on the PR? Everything works locally, even after cargo update. I'll try with #9605 reverted later this evening. I'm stepping out for now.

@alamb (Contributor) commented Mar 6, 2021

> You mean on the PR? Everything works locally,

No, I meant locally. I am sorry, I thought you meant you had it reproducing locally for you.

@alamb (Contributor) commented Mar 7, 2021

FYI, I merged #9653 / ARROW-11896 for the Rust CI checks, which may affect this PR. If you see "Rust / AMD64 Debian 10 Rust stable test workspace" failing with a linker error or no logs, rebasing against master will hopefully fix the problem.

V2 of the format has a LogicalType that is different from what we were previously exposing as LogicalType.
Renaming ours allows us to implement the v2 one.
This adds LogicalType to the internal types and builders.
It also populates the thrift type with logical types if v2 of the writer is used.

Added a TODO for tests that should be added.
Also addresses some deviations from the spec on sorting intervals.
It might be premature to do this now.
It can be done as part of ARROW-11365 if necessary.
@jorgecarleitao (Member) commented:

🎉🎉🎉🎉🎉 Super cool! Thanks a lot @nevi-me !

nevi-me added a commit that referenced this pull request Mar 19, 2021
…rsion

Populate LogicalType when converting from Arrow schema to Parquet schema.

This is on top of #9592

Closes #9612 from nevi-me/ARROW-11824

Authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
alamb added a commit that referenced this pull request Mar 19, 2021
…alType

# Rationale
While updating arrow deps in influxdata/influxdb_iox#1003, I got (very) confused for a while with the parquet upgrade as `LogicalType` was renamed to `ConvertedType` but then a new type called `LogicalType` was added in #9592.

# Changes
Add some comments to try and help future users save some time during upgrade

Closes #9731 from alamb/better-docs-on-rename

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
alamb added a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021 (commit message identical to the one above).
GeorgeAp pushed commits referencing this pull request to sirensolutions/arrow on Jun 7, 2021, and michalursa pushed commits referencing this pull request to michalursa/arrow on Jun 13, 2021; the commit messages are identical to the ones above.