ARROW-4908: [Rust] [DataFusion] Add support for date/time parquet types encoded as INT32/INT64 #3940
@nevi-me @liurenjie1024 Could you please review? |
Codecov Report
@@ Coverage Diff @@
## master #3940 +/- ##
==========================================
+ Coverage 87.75% 88.77% +1.01%
==========================================
Files 727 594 -133
Lines 89499 81603 -7896
Branches 1252 0 -1252
==========================================
- Hits 78541 72443 -6098
+ Misses 10840 9160 -1680
+ Partials 118 0 -118
|
Thanks @andygrove, I've requested changes to align with the Arrow and Parquet specs.
rust/arrow/src/datatypes.rs
Outdated
@@ -195,6 +195,13 @@ make_type!(
0i64
);
make_type!(Date32Type, i32, DataType::Date32(DateUnit::Day), 32, 0i32);
make_type!(
The spec only provides for Date32 to have a day DateUnit (http://arrow.apache.org/docs/format/Metadata.html#date). I believe Parquet dates are only represented as a 32-bit number of days (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date).
- According to the Arrow spec, a date with millisecond units should be 64-bit?
- According to the Parquet spec, there is no date type with millisecond units?
So I think this may be unnecessary.
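To make the two points above concrete, here is a minimal sketch (not code from this PR) of how a day count, as used by Arrow's Date32 and Parquet's DATE, relates to a millisecond count, as used by Arrow's 64-bit date. The function names `date32_to_date64` and `date64_to_date32` are hypothetical helpers, not part of the arrow or parquet crates.

```rust
/// Number of milliseconds in one day.
const MILLIS_PER_DAY: i64 = 86_400_000;

/// Widen a day count since the Unix epoch (Date32 / Parquet DATE)
/// to milliseconds since the Unix epoch (64-bit date).
/// Hypothetical helper for illustration only.
fn date32_to_date64(days: i32) -> i64 {
    // i32::MAX * 86_400_000 still fits comfortably in an i64.
    i64::from(days) * MILLIS_PER_DAY
}

/// Narrow milliseconds since the epoch back to whole days.
/// `div_euclid` floors, so instants before 1970 land on the correct day.
/// Returns `None` if the day count does not fit in 32 bits.
fn date64_to_date32(millis: i64) -> Option<i32> {
    let days = millis.div_euclid(MILLIS_PER_DAY);
    if days >= i64::from(i32::MIN) && days <= i64::from(i32::MAX) {
        Some(days as i32)
    } else {
        None
    }
}
```

The narrowing direction is where a 32-bit type runs out of room, which is one way to see why a millisecond unit is paired with a 64-bit date.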
ColumnReader::Int32ColumnReader(ref mut r) => {
ArrowReader::<Int32Type>::read(
ColumnReader::Int32ColumnReader(ref mut r) => match dt {
DataType::Date32(DateUnit::Millisecond) => {
If this condition's ever true, then it might be that we've incorrectly implemented the spec in the parquet crate. @sunchao am I misinterpreting the logical types spec?
@nevi-me what do you mean? int32 when used as date, represents the number of days since epoch. See https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#datetime-types.
Sorry, I meant that the match arm for Date32 with a millisecond unit should never match.
Yes, agreed. This should never happen.
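As a rough illustration of why that arm is unreachable, the sketch below maps Parquet physical/logical type pairs to Arrow-style types according to the Parquet logical types spec. All enums and the `to_arrow` function are hypothetical stand-ins defined here for the example; they are not the real types from the arrow or parquet crates, and nanosecond variants are left out.

```rust
// Hypothetical, simplified stand-ins for the crates' real types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PhysicalType { Int32, Int64 }

#[derive(Debug, Clone, Copy, PartialEq)]
enum LogicalType { Plain, Date, TimeMillis, TimeMicros, TimestampMillis, TimestampMicros }

#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit { Millisecond, Microsecond }

#[derive(Debug, Clone, Copy, PartialEq)]
enum ArrowType {
    Int32,
    Int64,
    Date32Day,
    Time32(TimeUnit),
    Time64(TimeUnit),
    Timestamp(TimeUnit),
}

/// Map a Parquet (physical, logical) pair to an Arrow-style type.
/// Pairs the Parquet spec does not define yield `None`.
fn to_arrow(physical: PhysicalType, logical: LogicalType) -> Option<ArrowType> {
    use LogicalType::*;
    match (physical, logical) {
        (PhysicalType::Int32, Plain) => Some(ArrowType::Int32),
        // DATE is always a day count stored in INT32.
        (PhysicalType::Int32, Date) => Some(ArrowType::Date32Day),
        (PhysicalType::Int32, TimeMillis) => Some(ArrowType::Time32(TimeUnit::Millisecond)),
        (PhysicalType::Int64, Plain) => Some(ArrowType::Int64),
        (PhysicalType::Int64, TimeMicros) => Some(ArrowType::Time64(TimeUnit::Microsecond)),
        (PhysicalType::Int64, TimestampMillis) => Some(ArrowType::Timestamp(TimeUnit::Millisecond)),
        (PhysicalType::Int64, TimestampMicros) => Some(ArrowType::Timestamp(TimeUnit::Microsecond)),
        // Everything else, e.g. DATE on INT64, is not a valid combination.
        _ => None,
    }
}
```

No input pair produces a 32-bit date with a millisecond unit, so a reader that matches on such a type would never be selected.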
)?,
},
ColumnReader::Int64ColumnReader(ref mut r) => match dt {
DataType::Time64(TimeUnit::Microsecond) => {
There's also Time64(Nanoseconds).
Agree with @nevi-me, Time64(Nanoseconds) is missing.
is_nullable,
)?
}
DataType::Timestamp(TimeUnit::Millisecond) => {
Please add TimestampNano. This will cover the successor of the deprecated int96 type.
Do you know where I can get a parquet file with all the valid types so I can write comprehensive unit tests for this?
this is the schema for "alltypes"
"id: Int32\n\
bool_col: Boolean\n\
tinyint_col: Int32\n\
smallint_col: Int32\n\
int_col: Int32\n\
bigint_col: Int64\n\
float_col: Float32\n\
double_col: Float64\n\
date_string_col: Utf8\n\
string_col: Utf8\n\
timestamp_col: Timestamp(Nanosecond)"
The only date/time is the INT96 one
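For context on why the INT96 column shows up as Timestamp(Nanosecond), here is a minimal sketch of the usual decoding. It assumes the common Impala-style layout: three little-endian 32-bit words, where the first two hold nanoseconds within the day and the third holds the Julian day number. The function name `int96_to_timestamp_nanos` is hypothetical and is not the parquet crate's API.

```rust
/// Julian day number of 1970-01-01.
const JULIAN_DAY_OF_UNIX_EPOCH: i64 = 2_440_588;
/// Nanoseconds in one day.
const NANOS_PER_DAY: i64 = 86_400_000_000_000;

/// Decode an INT96 timestamp into nanoseconds since the Unix epoch.
/// Assumed layout: raw[0] and raw[1] are the low and high words of the
/// nanoseconds within the day, raw[2] is the Julian day number.
/// Returns `None` if the result does not fit in an i64.
fn int96_to_timestamp_nanos(raw: [u32; 3]) -> Option<i64> {
    let nanos_of_day = ((u64::from(raw[1]) << 32) | u64::from(raw[0])) as i64;
    let julian_day = i64::from(raw[2]);
    (julian_day - JULIAN_DAY_OF_UNIX_EPOCH)
        .checked_mul(NANOS_PER_DAY)?
        .checked_add(nanos_of_day)
}
```

The checked arithmetic matters because a 64-bit nanosecond count covers only a few centuries around the epoch, while the Julian day field can describe dates far outside that range.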
This seems like a good question for the mailing list.
I've updated the parquet data source to handle all date/time types, but the nanosecond readers can never be hit because the parquet crate does not support nanosecond resolution for time or timestamp (see https://github.com/apache/arrow/blob/master/rust/parquet/src/basic.rs#L91-L105 and https://github.com/apache/arrow/blob/master/rust/parquet/src/reader/schema.rs#L215-L221). |
Thanks @andygrove, that makes sense. We don't yet support v2.6.0 of the spec, which introduced nanosecond precision (apache/parquet-format@b879065#diff-a557947b7d64dfe5f41d51e1acff9f52). I hadn't thought of that when I expected nano precision; my apologies for that. |
@nevi-me Ah, OK, thanks. I didn't know that either. Well, at least this part is ready for whenever the parquet crate gets updated. I would have liked to add a parquet file with all types for comprehensive unit testing, but I don't think I'll be able to do that in time for the 0.13.0 release since I'm traveling next week. I will create a JIRA for that. |
LGTM |
CI failure is not related to Rust |
This PR adds support for the date/time parquet types that can be encoded in INT32/INT64 physical types.