Allow read_parquet to use pandas metadata #339
Makes `use_pandas_metadata` a keyword argument for `read_parquet`
igorborgest
left a comment
Thanks @alexifm!
|
No problem. Do you need me to resolve the conflicts? |
|
@igorborgest @alexifm Thanks for that PR, that is exactly what we needed. Any idea when that will be released? I don't see the changes in the |
|
Hey @Digma, I too noticed that the changes got wiped out somehow in the merge. I have been meaning to come back to it given the restructuring but haven't had the time. I think I will probably have an interest again soon as I expect something I'm working on will depend on it. |
|
@alexifm Sure, let us know if there is something we can do to help. Did you plan to make any changes other than the one in this PR? |
|
Nope, just this extra keyword for the pandas metadata.
|
My bad, @alexifm is right, the change vanished accidentally. We should also add tests to cover this specific functionality and prevent it from breaking in the future. It can be done through our moto tests, avoiding S3 charges. Btw, I think it is a good opportunity to better understand your use cases:
|
No worries. An issue sounds good, as this should be done right rather than in the quick-and-dirty manner used to get it into the library previously.
Sounds good. Will try to mimic some of the other tests.
Not totally sure. I know the index was one thing. Are the types brought through without this flag? I'm basically interested in a true roundtrip read/write.
Both.
I don't have any expectations for Athena. We have accepted that output from Athena is not guaranteed to fit the original Pandas schema so if we need a table (or just part of it) as is, we use |
In our case, we had some issues with dataframes being different after writing and re-reading the same file. My understanding is that there are a few types supported by pandas that are not natively supported by parquet, so pandas uses the metadata to store some extra information (https://pandas.pydata.org/pandas-docs/version/1.0.5/development/developer.html).
Both
No |
Makes `use_pandas_metadata` a keyword argument for `read_parquet`.
No Issue #
Description of changes:
It would help to be able to tell the wrangler to use the pandas metadata when reading from parquet. This would help preserve indexes and better maintain roundtrips to and from S3. Pandas' own functionality uses this keyword arg. https://github.com/pandas-dev/pandas/blob/master/pandas/io/parquet.py#L140
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.