[Python] Write support for int96 #16721

Closed
asfimport opened this issue Jun 15, 2017 · 7 comments

Comments

asfimport commented Jun 15, 2017

Hi there,

I am trying to use pyarrow to convert CSV files to Parquet for use with Redshift Spectrum. I've got everything sorted... almost :)

Unfortunately, the only format they accept for timestamp columns is int96. I understand that int96 timestamps are unofficial/deprecated, but it's what I am stuck with for this integration, at least for the moment. I contacted Amazon support and 64-bit timestamp support has been added to the list of feature requests, but it's unclear when it will be prioritized/added/released.

In the meantime, I am thinking of adding write support for int96 columns to arrow. Would that be a welcome addition?

Thanks,
Colin

Reporter: colin nichols
Assignee: Uwe Korn / @xhochy

Note: This issue was originally created as ARROW-1120. Please see the migration documentation for further details.

Uwe Korn / @xhochy:
Feel free to make a PR to add Int96 write support. Most of the work is probably in parquet-cpp, where we already have read support for Int96, so testing the new code should be quite easy. You will also have to add a flag in pyarrow & parquet-cpp to select the target timestamp type for the Parquet files.

cc [~rdblue]: Two points to highlight: I thought Spectrum/Athena were Presto-based, is there no int64-timestamp support there? Also, adding Int96 write support to parquet-cpp will help us get some parquet-cpp/pyarrow usage in this case, but it will also work against the deprecation of Int96.


Wes McKinney / @wesm:
It's possible that Redshift's Parquet support was forked from Impala's scanner, which might explain the situation. We'd be happy to accept patches for this in parquet-cpp, and it would require threading a couple of parameters through the Python bindings in pyarrow. Let us know if you need help!


Ryan Blue:
I don't think Presto has support for the int64 timestamps. I'd recommend using an int64 with a timestamp in milliseconds for now. That's what we use because we can control behavior and know what we are getting that way, instead of int96 where you can get different values depending on versions and processing engine. I think it is a better idea to spend time on support for the types that are well defined and don't have the historical baggage.
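
For readers comparing the two encodings discussed here, the following is a minimal standard-library sketch, not code from parquet-cpp or pyarrow. It assumes the commonly used Impala/Hive int96 convention (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian) and contrasts it with a plain int64 count of milliseconds since the Unix epoch. The helper names `to_int64_millis`, `to_int96_bytes`, and `from_int96_bytes` are hypothetical and exist only for illustration.

```python
import struct
from datetime import datetime, timedelta, timezone

# Julian day number of 1970-01-01, used by the int96 convention assumed here.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_int64_millis(ts):
    """Milliseconds since the Unix epoch as a single integer."""
    delta = ts - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1000 + delta.microseconds // 1000


def to_int96_bytes(ts):
    """12-byte value: nanoseconds within the day, then the Julian day (little-endian)."""
    delta = ts - EPOCH
    julian_day = JULIAN_DAY_OF_UNIX_EPOCH + delta.days
    nanos_of_day = (delta.seconds * 1_000_000 + delta.microseconds) * 1000
    return struct.pack("<qi", nanos_of_day, julian_day)


def from_int96_bytes(raw):
    """Inverse of to_int96_bytes, truncated to microsecond precision."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return EPOCH + timedelta(days=days, microseconds=nanos_of_day // 1000)


sample = datetime(2017, 6, 15, 12, 30, 0, tzinfo=timezone.utc)
encoded = to_int96_bytes(sample)
decoded = from_int96_bytes(encoded)
millis = to_int64_millis(sample)
```

The int64 form is one number with one stated unit, whereas the int96 form carries no time zone or unit metadata of its own, which is one way different engines can end up interpreting the same bytes differently.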


Wes McKinney / @wesm:
Support for writing deprecated int96 timestamps landed in parquet-cpp today in apache/parquet-cpp@e998dfb (thanks Colin and Uwe!), so we should be able to thread this option through to the Python API without too much work. I will take a look, hopefully soon.


Uwe Korn / @xhochy:
I'll take care of this today.


Uwe Korn / @xhochy:
PR: #865


Wes McKinney / @wesm:
Issue resolved by pull request 865
#865
