
[Python] Cast all timestamp resolutions to INT96 when use_deprecated_int96_timestamps=True #18007

Closed
asfimport opened this issue Jan 24, 2018 · 10 comments


asfimport commented Jan 24, 2018

When writing a Parquet file with use_deprecated_int96_timestamps set to True, timestamps are only written as 96-bit integers if they have nanosecond resolution. This is a problem because Amazon Redshift timestamps only have microsecond resolution, yet Redshift requires them to be stored in Parquet files in the 96-bit format.

I'd expect the use_deprecated_int96_timestamps flag to cause all timestamps to be written as 96 bits, regardless of resolution. If this is a deliberate design decision, it'd be immensely helpful if it were explicitly documented as part of the argument.

To reproduce:

1. Create a table with a timestamp having microsecond or millisecond resolution, and save it to a Parquet file. Be sure to set use_deprecated_int96_timestamps to True.

import datetime
import pyarrow
from pyarrow import parquet

schema = pyarrow.schema([
    pyarrow.field('last_updated', pyarrow.timestamp('us')),
])

data = [
    pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('us')),
]

table = pyarrow.Table.from_arrays(data, ['last_updated'])

with open('test_file.parquet', 'wb') as fdesc:
    parquet.write_table(table, fdesc,
                        use_deprecated_int96_timestamps=True)

2. Inspect the file. I used parquet-tools:

dak@tux ~ $ parquet-tools meta test_file.parquet
file:         file:/Users/dak/test_file.parquet
creator:      parquet-cpp version 1.3.2-SNAPSHOT

file schema:  schema
--------------------------------------------------------------------------------
last_updated: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1

row group 1:  RC:1 TS:76 OFFSET:4
--------------------------------------------------------------------------------
last_updated:  INT64 SNAPPY DO:4 FPO:28 SZ:76/72/0.95 VC:1 ENC:PLAIN,PLAIN_DICTIONARY,RLE

Environment:
OS: Mac OS X 10.13.2
Python: 3.6.4
PyArrow: 0.8.0
Reporter: Diego Argueta / @dargueta
Assignee: Francois Saint-Jacques / @fsaintjacques

Note: This issue was originally created as ARROW-2026. Please see the migration documentation for further details.


Wes McKinney / @wesm:
Moving to 0.10.0


Wes McKinney / @wesm:
Patches welcome


Wes McKinney / @wesm:
@kszucs can you have a look at this?


Krisztian Szucs / @kszucs:
What's the expectation here?

Currently only nanosecond timestamps are supported for INT96 writing. Should we support all of the units, or just document the limitation somewhere?


Wes McKinney / @wesm:
I agree that the correct fix here is probably to coerce all timestamps to nanoseconds with INT96 storage. If you are passing use_deprecated_int96_timestamps=True, you are probably using Apache Impala (which, as far as I understand, is what Redshift Spectrum uses) or another system that expects INT96 nanosecond timestamps.

It is a bit of a rough edge to have to go through and convert all your timestamps to nanoseconds before writing to Parquet.

@xhochy do you have thoughts about this?


Uwe Korn / @xhochy:
We should coerce all timestamps to nanoseconds when this option is set. People will only use this option when they target a system that supports nothing but INT96 timestamps; systems that also support INT64 timestamps in other resolutions would not need it.


Wes McKinney / @wesm:
+1. @kszucs do you want to give this a shot? You'd need to add converter functions to convert to INT96:

https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h#L206

It would be unsafe to cast to nanoseconds by multiplication, since it may overflow. The test cases should therefore include values outside the representable range of an int64_t nanosecond timestamp.
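The overflow Wes warns about is easy to see with plain integer arithmetic: int64 nanoseconds cover only roughly the years 1677–2262, whereas INT96 stores a Julian day plus nanoseconds within the day and has no such limit. A sketch (the helper name here is made up for illustration):

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def fits_in_int64_ns(micros):
    """Hypothetical check: does a microsecond timestamp survive a
    multiply-by-1000 cast to int64 nanoseconds without overflowing?"""
    return INT64_MIN <= micros * 1000 <= INT64_MAX

# A recent timestamp (year 2020, in microseconds since the epoch) fits...
print(fits_in_int64_ns(1_600_000_000_000_000))
# ...but the largest microsecond value an int64 can hold does not.
print(fits_in_int64_ns(INT64_MAX))
```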


Francois Saint-Jacques / @fsaintjacques:
I'll take it; I have a reasonable fix.


Francois Saint-Jacques / @fsaintjacques:

file:         file:/home/fsaintjacques/src/arrow/python/test_file.parquet
creator:      parquet-cpp version 1.5.1-SNAPSHOT

file schema:  schema
--------------------------------------------------------------------------------
last_updated: OPTIONAL INT96 R:0 D:1

row group 1:  RC:1 TS:58 OFFSET:4
--------------------------------------------------------------------------------
last_updated: INT96 SNAPPY DO:4 FPO:32 SZ:58/54/0.93 VC:1 ENC:PLAIN_DICTIONARY,PLAIN,R


Wes McKinney / @wesm:
Issue resolved by pull request #3173
