New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-1120: Support for writing timestamp(ns) to Int96 #865
Conversation
Change-Id: I8b32864774b4df345bc028b4cc0f866fb8a0b9d1
python/pyarrow/tests/test_parquet.py
Outdated
data7_us = np.array([start, start + 1, start + 2], dtype='int64') / 1000 | ||
a7_us = pa.Array.from_pandas(data7_us, type=t7_us) | ||
|
||
table = pa.Table.from_arrays([a1, a2, a3, a4, a5, a6, a7], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't too fond of this pattern when I looked originally -- names are opaque, frequent rewriting certain lines, and difficult to reorder items (for instance to put all timestamp columns near each other). Not sure it warrants addressing on this PR, but I wonder if there's a clearer way to structure these tests.
|
||
_check_roundtrip(table, expected=expected, version='2.0') | ||
|
||
# date64 as date32 | ||
# time32[s] to time32[ms] | ||
# 'timestamp[ns]' is saved as INT96 timestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this doesn't explicitly check the data type on the column, there's room for a regression. Probably fine if int96 is the only option, but might be worth figuring out how to test, if the prop builder is going to be used for other things in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We at least check here that we support nanosecond precision. This is something that is not supported by any other type in Parquet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Change-Id: I11b4a1136a1775ecacbe39d52e6039216cd7f80d
@@ -675,7 +675,8 @@ def read_pandas(source, columns=None, nthreads=1, metadata=None): | |||
|
|||
|
|||
def write_table(table, where, row_group_size=None, version='1.0', | |||
use_dictionary=True, compression='snappy', **kwargs): | |||
use_dictionary=True, compression='snappy', | |||
use_deprecated_int96_timestamps=False, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to create some kind of WriteOptions
object soon since we're likely to accumulate plenty more options (like casting timestamps to milliseconds)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
cc @c-nichols