Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-4908: [Rust] [DataFusion] Add support for date/time parquet types encoded as INT32/INT64 #3940

Closed
wants to merge 3 commits into from

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Mar 16, 2019

This PR adds support for the date/time parquet types that can be encoded in INT32/INT64 physical types.

@andygrove
Copy link
Member Author

@nevi-me @liurenjie1024 Could you review please

@codecov-io
Copy link

codecov-io commented Mar 16, 2019

Codecov Report

Merging #3940 into master will increase coverage by 1.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3940      +/-   ##
==========================================
+ Coverage   87.75%   88.77%   +1.01%     
==========================================
  Files         727      594     -133     
  Lines       89499    81603    -7896     
  Branches     1252        0    -1252     
==========================================
- Hits        78541    72443    -6098     
+ Misses      10840     9160    -1680     
+ Partials      118        0     -118
Impacted Files Coverage Δ
cpp/src/arrow/testing/random.cc 97.59% <0%> (-2.41%) ⬇️
cpp/src/arrow/util/bpacking.h 99.81% <0%> (-0.01%) ⬇️
cpp/src/parquet/encoding-test.cc 100% <0%> (ø) ⬆️
cpp/src/arrow/array-test.cc 100% <0%> (ø) ⬆️
cpp/src/arrow/testing/random.h 100% <0%> (ø) ⬆️
cpp/src/arrow/ipc/test-common.h 98.25% <0%> (ø) ⬆️
go/arrow/math/uint64_amd64.go
go/arrow/memory/memory_avx2_amd64.go
js/src/enum.ts
go/arrow/array/builder.go
... and 139 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 04a7b68...4e99e42. Read the comment docs.

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove, I've requested changes to align with Arrow and Parquet's spec(s)

@@ -195,6 +195,13 @@ make_type!(
0i64
);
make_type!(Date32Type, i32, DataType::Date32(DateUnit::Day), 32, 0i32);
make_type!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec only provides for Date32 to have a day dateunit (http://arrow.apache.org/docs/format/Metadata.html#date). Parquet dates I believe are only represented in 32-bit number of days (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. According to arrow spec, date with milliseconds should be 64 bit?
  2. According to parquet, there is no date type with milliseconds units?

So I think this may be unnecessary.

ColumnReader::Int32ColumnReader(ref mut r) => {
ArrowReader::<Int32Type>::read(
ColumnReader::Int32ColumnReader(ref mut r) => match dt {
DataType::Date32(DateUnit::Millisecond) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this condition's ever true, then it might be that we've incorrectly implemented the spec in the parquet crate. @sunchao am I misinterpreting the logical types spec?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nevi-me what do you mean? int32 when used as date, represents the number of days since epoch. See https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#datetime-types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant that the match arm for date 32 with milli should never evaluate to true

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agreed. This should never happen.

)?,
},
ColumnReader::Int64ColumnReader(ref mut r) => match dt {
DataType::Time64(TimeUnit::Microsecond) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's aso Time64(Nanoseconds)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @nevi-me , Time64(Nanoseconds) is missing.

is_nullable,
)?
}
DataType::Timestamp(TimeUnit::Millisecond) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add TimestampNano. This will cover the successor of the deprecated int96 type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know where I can get a parquet file with all the valid types so I can write comprehensive unit tests for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the schema for "alltypes"

"id: Int32\n\
        bool_col: Boolean\n\
        tinyint_col: Int32\n\
        smallint_col: Int32\n\
        int_col: Int32\n\
        bigint_col: Int64\n\
        float_col: Float32\n\
        double_col: Float64\n\
        date_string_col: Utf8\n\
        string_col: Utf8\n\
        timestamp_col: Timestamp(Nanosecond)"

The only date/time is the INT96 one

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a good question for the mailing list

@andygrove
Copy link
Member Author

I've updated the parquet data source to handle all date/time types, but the nanosecond readers can never be hit because the parquet crate does not support nanosecond resolution for time or timestamp (see https://github.com/apache/arrow/blob/master/rust/parquet/src/basic.rs#L91-L105 and https://github.com/apache/arrow/blob/master/rust/parquet/src/reader/schema.rs#L215-L221).

@nevi-me
Copy link
Contributor

nevi-me commented Mar 17, 2019

Thanks @andygrove , that makes sense. We don't yet support v2.6.0 of the spec, which nanosecond came with (apache/parquet-format@b879065#diff-a557947b7d64dfe5f41d51e1acff9f52).

The parquet-format crate is at v2.6.0 (sunchao/parquet-format-rs#3) but parquet is still at v2.5.0. I think there might be some work other than timestamps that would need to be done to support the format, so it might not be trivial to bump up support.

I hadn't thought of that when I expected nano precision, my apologies for that.

@andygrove
Copy link
Member Author

@nevi-me Ah, OK, thanks. I didn't know that either. Well, at least this part is ready for whenever the parquet crate gets updated.

I would have liked to add a parquet file with all types for comprehensive unit testing but I don't think I'm going to be able to do that in time for the 0.13.0 release since I'm traveling next week. I will create a JIRA for that.

@liurenjie1024
Copy link
Contributor

LGTM

@andygrove
Copy link
Member Author

CI failure is not related to Rust

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants