Conversation

@alexifm (Contributor) commented Jul 30, 2020

Makes use_pandas_metadata a keyword argument for read_parquet

No Issue #

Description of changes:
It would help to be able to tell the wrangler to use the pandas metadata when reading from parquet. This would help preserve indexes and better maintain roundtrips to and from S3. Pandas' own functionality uses this keyword arg. https://github.com/pandas-dev/pandas/blob/master/pandas/io/parquet.py#L140

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@igorborgest added this to the 1.8.0 milestone Jul 30, 2020
@igorborgest (Contributor) left a comment

Thanks @alexifm!

@igorborgest changed the base branch from master to dev August 6, 2020 22:19
@alexifm (Contributor, Author) commented Aug 6, 2020

No problem. Do you need me to resolve the conflicts?

@igorborgest self-assigned this Aug 6, 2020
@igorborgest merged commit 5ec119b into aws:dev Aug 6, 2020
@igorborgest removed this from the 1.8.0 milestone Aug 6, 2020
@Digma commented Aug 20, 2020

@igorborgest @alexifm Thanks for that PR, that is exactly what we needed. Any idea when it will be released? I don't see the changes in either the dev or master branch.

@alexifm (Contributor, Author) commented Aug 20, 2020

Hey @Digma, I noticed too that the changes somehow got wiped out in the merge. I have been meaning to come back to it given the restructuring but haven't had the time. I will probably have an interest again soon, as I expect something I'm working on will depend on it.

@Digma commented Aug 20, 2020

@alexifm Sure, let us know if there is something we can do to help. Did you plan to make any other changes beyond the one in this PR?

@alexifm (Contributor, Author) commented Aug 20, 2020 via email

@igorborgest (Contributor) commented

@Digma @alexifm

My bad, @alexifm is right, it vanished accidentally.
I will create an issue to track it and ensure the release.

We should also add tests covering this specific functionality to prevent it from breaking in the future. It can be done through our moto tests, avoiding S3 charges.

Btw, I think this is a good opportunity to better understand your use cases:

  • Are you worrying only about the index or is there something else?
  • Are you using it with datasets or only for single files?
  • Any expectation with Athena or other service?

@alexifm (Contributor, Author) commented Aug 20, 2020

My bad, @alexifm is right, it vanished accidentally.
I will create an issue to track it and ensure the release.

No worries. An issue sounds good, as this should be done right rather than in the quick-and-dirty manner it previously went into the library.

We should also add tests covering this specific functionality to prevent it from breaking in the future. It can be done through our moto tests, avoiding S3 charges.

Sounds good. I will try to mimic some of the other tests.

Are you worrying only about the index or is there something else?

Not totally sure. I know the index was one thing. Are the dtypes brought through without this flag? I'm basically interested in a true read/write roundtrip.

Are you using it with datasets or only for single files?

Both.

Any expectation with Athena or other service?

I don't have any expectations for Athena. We have accepted that output from Athena is not guaranteed to fit the original Pandas schema so if we need a table (or just part of it) as is, we use wr.s3.read_parquet.

@Digma commented Aug 21, 2020

Are you worrying only about the index or is there something else?

In our case, we had issues with dataframes being different after writing and re-reading the same file. My understanding is that a few types supported by pandas are not natively supported by Parquet, so pandas uses the metadata to store the extra information (https://pandas.pydata.org/pandas-docs/version/1.0.5/development/developer.html)

  • One of the most important in our case is timezone information. Even assuming you are using UTC, the dataframe reloaded after writing will have a plain datetime type, whereas the original dataframe had a datetimetz type, which prevents directly comparing the values unless you add the timezone back (or remove it)

  • Support for RangeIndex would also be useful, even though we don't use it today

Are you using it with datasets or only for single files?

Both

Any expectation with Athena or other service?

No
