Support custom_metadata= argument in to_parquet #7359
You may want to check for the pandas key, since that could render a dataset unreadable.

    if custom_metadata:
        _md = t.schema.metadata
        _md.update(custom_metadata)
        t = t.replace_schema_metadata(metadata=_md)
@jorisvandenbossche - I saw that this method is labeled as "Experimental," but it has been around for years now (as far as I can tell). Is there any reason not to use it here?
I think that's fine to use (we should update the docstring)
Can you clarify what you mean? Are you saying that we shouldn't allow the user to modify the "pandas" metadata (i.e. add their own b"pandas" key)?
In |
Actually, the order matters - either we hand the user the ability to overwrite the pandas tag, or we would overwrite whatever they provide. In either case, I think a warning would be warranted.
Nice! Good idea - I'll experiment with that :)
Just want to clarify... In the current solution, we are "updating" the key-value metadata with the user-specified metadata. Therefore, we would be replacing an existing |
Correct. We could choose to error (you mustn't mess it up!) or ignore. Is there a genuine reason to want to update that tag? In our case, b"pandas" should always be present.
It's hard to imagine a user wanting to open that can of worms :) I'm sure there exists a "power-user use case" where |
Thanks for the quick feedback here @martindurant! This PR now includes support for fastparquet, and an error is raised if the custom metadata includes a |
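The "error rather than overwrite" option discussed in this thread can be sketched in plain Python as below. This is only an illustration under stated assumptions: the function name merge_custom_metadata, the choice of ValueError, and the decision to reject both the bytes and str spellings of the key are hypothetical and may differ from what the PR actually implements.

```python
def merge_custom_metadata(existing, custom_metadata):
    """Merge user metadata into existing footer metadata.

    Hypothetical sketch: refuses to touch the reserved "pandas" key instead of
    silently overwriting it or silently dropping the user's value.
    """
    # Assumption: treat both spellings of the key as reserved.
    reserved = {b"pandas", "pandas"}
    clash = reserved.intersection(custom_metadata)
    if clash:
        raise ValueError(
            f"custom_metadata must not contain the reserved key(s): {sorted(map(repr, clash))}"
        )
    # Copy first so the caller's dictionary is left unchanged.
    merged = dict(existing)
    merged.update(custom_metadata)
    return merged


footer = {b"pandas": b"{}"}
merged = merge_custom_metadata(footer, {b"owner": b"team-x"})

try:
    merge_custom_metadata(footer, {b"pandas": b"oops"})
    rejected = False
except ValueError:
    rejected = True
```

Raising early keeps the order-of-update question from mattering: neither side can overwrite the other's b"pandas" entry.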
Good to merge then?
The one failure is indeed for parquet, so it should be investigated.
I couldn't reproduce locally, and it doesn't seem that this PR was the cause (although I am not completely sure).
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sorry this sat for so long @rjzamora
Adds a custom_metadata= argument to DataFrame.to_parquet. If a dictionary is passed in by the user, those key/values will be included in all footer metadata, and in the global _metadata file, along with the usual b"pandas" metadata.

Note that this PR only adds custom_metadata= support for the "pyarrow" engine. @martindurant - Do you know if something similar can be accomplished in the "fastparquet" engine? I have not investigated that end just yet. [EDIT: Both pyarrow and fastparquet engines are now supported.]

black dask / flake8 dask