[Python] Cast all timestamp resolutions to INT96 use_deprecated_int96_timestamps=True #18007
Comments
Wes McKinney / @wesm: |
Wes McKinney / @wesm: |
Wes McKinney / @wesm: |
Krisztian Szucs / @kszucs: Currently only NANO timestamps are supported for INT96 writing. Should we support all of the units?
Wes McKinney / @wesm: It is a bit of a rough edge to have to go through and convert all your timestamps to nanoseconds before writing to Parquet. @xhochy do you have thoughts about this?
Wes McKinney / @wesm: https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h#L206 It would be unsafe to cast to nanoseconds by multiplication, since it may overflow. The test cases should therefore include values outside the representable range of an int64_t nanosecond timestamp.
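To illustrate the overflow concern above, here is a minimal pure-Python sketch. It is not the parquet-cpp implementation. It assumes the commonly used INT96 timestamp layout (8 little-endian bytes of nanoseconds within the day, followed by a 4-byte Julian day number), and the function names are hypothetical. The point is that splitting a microsecond value into days and time of day first keeps every intermediate value small, whereas multiplying the whole value by 1000 can leave the int64 range.

```python
import struct
from datetime import datetime, timedelta

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
JULIAN_DAY_OF_UNIX_EPOCH = 2440588      # Julian day number of 1970-01-01
MICROS_PER_DAY = 86400 * 1_000_000


def micros_to_nanos_checked(value_us):
    """Naive cast: multiply by 1000 and fail if the result leaves int64."""
    nanos = value_us * 1000
    if not INT64_MIN <= nanos <= INT64_MAX:
        raise OverflowError("nanosecond value does not fit in int64")
    return nanos


def micros_to_int96(value_us):
    """Hypothetical helper: encode microseconds since the epoch as 12 bytes.

    Only the time-of-day part is scaled to nanoseconds, so it never
    exceeds 86400 * 10**9 and cannot overflow an int64.
    """
    days, micros_of_day = divmod(value_us, MICROS_PER_DAY)
    nanos_of_day = micros_of_day * 1000
    return struct.pack("<qi", nanos_of_day, days + JULIAN_DAY_OF_UNIX_EPOCH)


def int96_to_micros(raw):
    """Inverse of micros_to_int96, used here only to check the round trip."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return days * MICROS_PER_DAY + nanos_of_day // 1000


# A timestamp in the year 2300 is outside the int64 nanosecond range.
far_future_us = (datetime(2300, 1, 1) - datetime(1970, 1, 1)) // timedelta(microseconds=1)
encoded = micros_to_int96(far_future_us)
```

With these definitions, `micros_to_nanos_checked(far_future_us)` raises `OverflowError`, while `micros_to_int96(far_future_us)` yields 12 bytes that decode back to the same microsecond value.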
Francois Saint-Jacques / @fsaintjacques: |
Francois Saint-Jacques / @fsaintjacques: file: file:/home/fsaintjacques/src/arrow/python/test_file.parquet
creator: parquet-cpp version 1.5.1-SNAPSHOT
file schema: schema
------------------------------------------------------------------------------
last_updated: OPTIONAL INT96 R:0 D:1
row group 1: RC:1 TS:58 OFFSET:4
--------------------------------------------------------------------------------
last_updated: INT96 SNAPPY DO:4 FPO:32 SZ:58/54/0.93 VC:1 ENC:PLAIN_DICTIONARY,PLAIN,R |
Wes McKinney / @wesm: |
When writing to a Parquet file, if use_deprecated_int96_timestamps is True, timestamps are only written as 96-bit integers if the timestamp has nanosecond resolution. This is a problem because Amazon Redshift timestamps only have microsecond resolution but must be stored in 96-bit format in Parquet files.
I'd expect the use_deprecated_int96_timestamps flag to cause all timestamps to be written as 96 bits, regardless of resolution. If this is a deliberate design decision, it'd be immensely helpful if it were explicitly documented as part of the argument.
To reproduce:
1. Create a table with a timestamp having microsecond or millisecond resolution, and save it to a Parquet file. Be sure to set use_deprecated_int96_timestamps to True.
2. Inspect the file. I used parquet-tools:
Environment:
OS: Mac OS X 10.13.2
Python: 3.6.4
PyArrow: 0.8.0
Reporter: Diego Argueta / @dargueta
Assignee: Francois Saint-Jacques / @fsaintjacques
Related issues:
PRs and other links:
Note: This issue was originally created as ARROW-2026. Please see the migration documentation for further details.