Skip to content

Support custom_metadata= argument in to_parquet#7359

Merged
jsignell merged 5 commits intodask:mainfrom
rjzamora:custom-metadata
Apr 5, 2021
Merged

Support custom_metadata= argument in to_parquet#7359
jsignell merged 5 commits intodask:mainfrom
rjzamora:custom-metadata

Conversation

@rjzamora
Copy link
Copy Markdown
Member

@rjzamora rjzamora commented Mar 10, 2021

Adds a custom_metadata= argument to DataFrame.to_parquet. If a dictionary is passed in by the user, those key/values will be included in all footer metadata, and in the global _metadata file, along with the usual b"pandas" metadata.

Note that this PR only adds custom_metadata= support for the "pyarrow" engine. @martindurant - Do you know if something similar can be accomplished in the "fastparquet" engine? I have not investigated that end just yet. [EDIT: Both pyarrow and fastparquet engines are now supported.]

@martindurant
Copy link
Copy Markdown
Member

fastparquet itself does not support this, but it would be trivial to add.

@martindurant
Copy link
Copy Markdown
Member

You may want to check for the pandas key, since that could render a dataset unreadable.

if custom_metadata:
_md = t.schema.metadata
_md.update(custom_metadata)
t = t.replace_schema_metadata(metadata=_md)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche - I saw that this method is labeled as "Experimental," but it has been around for years now (as far as I can tell). Is there any reason not to use it here?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fine to use (we should update the docstring)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rjzamora
Copy link
Copy Markdown
Member Author

You may want to check for the pandas key, since that could render a dataset unreadable.

Can you clarify what you mean. Are you saying that we shouldn't allow the user to modify the "pandas" metadata (i.e. add their own b"pandas" key)?

@martindurant
Copy link
Copy Markdown
Member

In FastParquetEngine.initialize_write we create an fmd object, which is then used for the rest of the writes. We could insert metadata directly into this:

fmd.key_value_metadata.extend([
fastparquet.parquet_thrift.KeyValue(key=key, value=value)
for key, value in custom_metadata.items()
])

@martindurant
Copy link
Copy Markdown
Member

Can you clarify what you mean. Are you saying that we shouldn't allow the user to modify the "pandas" metadata (i.e. add their own b"pandas" key)?

Actually, the order matters - either we hand the user the ability to overwrite the pandas tag, or we would overwrite whatever they provide. In either case, I think a warning would be warranted.

@rjzamora
Copy link
Copy Markdown
Member Author

In FastParquetEngine.initialize_write we create an fmd object, which is then used for the rest of the writes. We could insert metadata directly into this

Nice! Good idea - I'll experiment with that :)

Actually, the order matters

Just want to clarify... In the current solution, we are "updating" the key-value metadata with the user-specified metadata. Therefore, we would be replacing an existing b"pandas" key if the user decided to pass in custom pandas metadata. You are saying this is fine, but we want to add a warning, since changing the pandas metadata in the wrong way can render the data completely unreadable - correct?

@martindurant
Copy link
Copy Markdown
Member

Correct. We could choose to error (you mustn't mess it up!) or ignore. Is there a genuine reason to want to update that tag? In our case, b"pandas" should always be present.

@rjzamora
Copy link
Copy Markdown
Member Author

Is there a genuine reason to want to update that tag?

It's hard to imagine a user wanting to open that can of worms :) I'm sure there exists a "power-user use case" where dask.dataframe is just not able produce the ideal pandas metadata for a specific workflow (likely related to index preservation, etc), and a possible solution is to override the key/value metadata with an option like this. With that said, I am totally fine raising an error here until we actually observe a motivating use case in the wild.

@rjzamora rjzamora changed the title Support custom_metadata= argument in to_parquet for pyarrow engine Support custom_metadata= argument in to_parquet Mar 10, 2021
@rjzamora
Copy link
Copy Markdown
Member Author

Thanks for the quick feedback here @martindurant! This PR now includes support for fastparquet, and an error is raised if the custom metadata includes a b'pandas' key.

@jsignell
Copy link
Copy Markdown
Member

Good to merge then?

@martindurant
Copy link
Copy Markdown
Member

The one failure is indeed for parquet, so should be investigated

@rjzamora
Copy link
Copy Markdown
Member Author

The one failure is indeed for parquet, so should be investigated

I couldn't reproduce locally, and it doesn't seem that this PR was the cause (although I am not completely sure).

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jsignell jsignell merged commit 6f774d6 into dask:main Apr 5, 2021
@jsignell
Copy link
Copy Markdown
Member

jsignell commented Apr 5, 2021

Sorry this sat for so long @rjzamora

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Define Column Metadata with to_parquet

4 participants