Conversation

@alexifm (Contributor) commented Jul 30, 2020

Makes use_pandas_metadata a keyword argument for read_parquet

No Issue #

Description of changes:
It would help to be able to tell the wrangler to use the pandas metadata when reading from parquet. This would help preserve indexes and better maintain roundtrips to and from S3. Pandas' own functionality uses this keyword arg. https://github.com/pandas-dev/pandas/blob/master/pandas/io/parquet.py#L140

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@igorborgest added this to the 1.8.0 milestone Jul 30, 2020
@igorborgest (Contributor) left a comment

Thanks @alexifm!

@igorborgest changed the base branch from master to dev August 6, 2020 22:19
@alexifm (Contributor, Author) commented Aug 6, 2020

No problem. Do you need me to resolve the conflicts?

@igorborgest self-assigned this Aug 6, 2020
@igorborgest merged commit 5ec119b into aws:dev Aug 6, 2020
@igorborgest removed this from the 1.8.0 milestone Aug 6, 2020
@Digma commented Aug 20, 2020

@igorborgest @alexifm Thanks for that PR, that is exactly what we needed. Any idea when it will be released? I don't see the changes in either the dev or master branch.

@alexifm (Contributor, Author) commented Aug 20, 2020

Hey @Digma, I noticed too that the changes somehow got wiped out in the merge. I have been meaning to come back to it given the restructuring but haven't had the time. I will probably have an interest again soon, as I expect something I'm working on will depend on it.

@Digma commented Aug 20, 2020

@alexifm Sure, let us know if there is something we can do to help. Did you plan to make any other changes beyond the one in this PR?

@alexifm (Contributor, Author) commented Aug 20, 2020 via email

@igorborgest (Contributor) commented

@Digma @alexifm

My bad, @alexifm is right, it vanished accidentally.
I will create an issue to track it and ensure the release.

We should also add tests covering this specific functionality to prevent it from breaking in the future. It can be done through our moto tests, avoiding S3 charges.

Btw, I think this is a good opportunity to better understand your use cases:

  • Are you worrying only about the index or is there something else?
  • Are you using it with datasets or only for single files?
  • Any expectation with Athena or other service?

@alexifm (Contributor, Author) commented Aug 20, 2020

My bad, @alexifm is right, it vanished accidentally.
I will create an issue to track it and ensure the release.

No worries. An issue sounds good, as this should be done right rather than in the quick-and-dirty manner it previously went into the library.

We should also add tests covering this specific functionality to prevent it from breaking in the future. It can be done through our moto tests, avoiding S3 charges.

Sounds good. I will try to mimic some of the other tests.

Are you worrying only about the index or is there something else?

Not totally sure. I know the index was one thing. Are the dtypes brought through without this flag? I'm basically interested in a true read/write roundtrip.

Are you using it with datasets or only for single files?

Both.

Any expectation with Athena or other service?

I don't have any expectations for Athena. We have accepted that output from Athena is not guaranteed to fit the original Pandas schema so if we need a table (or just part of it) as is, we use wr.s3.read_parquet.

@Digma commented Aug 21, 2020

Are you worrying only about the index or is there something else?

In our case, we had issues with dataframes being different after writing and re-reading the same file. My understanding is that a few types supported by pandas are not natively supported by Parquet, so pandas uses the metadata to store the extra information (https://pandas.pydata.org/pandas-docs/version/1.0.5/development/developer.html)

  • One of the most important in our case is timezone information. Even assuming you are using UTC, the dataframe reloaded after writing will have a plain datetime type, whereas the original dataframe had a datetimetz type, which prevents directly comparing the values unless you add the timezone back (or remove it)

  • Support for RangeIndex would also be useful, even though we don't use it today

Are you using it with datasets or only for single files?

Both

Any expectation with Athena or other service?

No
