Add dataframe to/from json #3494

Merged
mrocklin merged 9 commits into dask:master from martindurant:dataframe_json
May 15, 2018

Conversation

@martindurant (Member)

Posting early to see if there are any comments on the approach. There are other formats that are relatively easy for pandas; I'm unsure whether we want to support them all here. There is an argument for matching as much of the pandas API as we can.

Fixes #3491

cc @j-bennet

@martindurant (Member Author)

(tests to follow; a simple example works when tested manually)

@martindurant martindurant changed the title start dataframe-json Add dataframe to/from json May 12, 2018
@martindurant (Member Author)

Delayed keys should have appropriate names

@martindurant (Member Author)

Py2 failure:

dask/tests/test_distributed.py::test_to_hdf_distributed 
[gw1] FAILED dask/tests/test_distributed.py::test_to_hdf_distributed 

@mrocklin (Member) left a comment

In general this looks nice to me.

The deviation from the Pandas convention seems slightly concerning, but I think that what is in here is probably the right choice.

I like the disclaimer in the docstring. I might recommend also using the phrase "line-delimited JSON" in the header of the to_json method. I suspect that this will be clearer to some users (at least me :)) than using the pandas keyword names in the description, which may not be as well known.

return pd.read_json(s, orient='records', lines=True, **kwargs)


def _file_to_partition(f, orient, lines, kwargs):
Member

If we're using delayed then this function name will be used in the task graph and in diagnostics. We might either want to suggest a name to delayed, or else rename this function. It would be nice if we could make it so that a name like "write_json" appeared in the diagnostics.

Member Author

I was thinking to provide key names to the delayed calls, although I should probably hash the inputs in order to do that right (with dask.base.tokenize?)
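Providing hashed key names to delayed calls could look something like the following minimal sketch. This is not the code from the PR: `write_json_partition` is a hypothetical per-partition writer, used only to show how `dask.base.tokenize` and the `dask_key_name` keyword combine to produce a deterministic, descriptive task name.

```python
import pandas as pd
from dask import delayed
from dask.base import tokenize

def write_json_partition(df, path):
    # Hypothetical per-partition writer, for illustration only.
    df.to_json(path, orient='records', lines=True)
    return path

df = pd.DataFrame({'a': [1, 2]})
path = 'part-0.json'

# tokenize hashes the inputs, so the key is deterministic and
# unique per (data, path) pair; the prefix shows up in diagnostics.
token = tokenize(df, path)
task = delayed(write_json_partition, pure=False)(
    df, path, dask_key_name='write-json-' + token)
```

Nothing is computed here; `task.key` is simply `'write-json-' + token`, which is what would appear in the task graph and dashboard.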

@j-bennet (Contributor)

Looks good. Funny enough, I started working on the same thing here:

master...j-bennet:j-bennet/support-json

Feel free to grab my unit test if you think it can be of use.

@martindurant (Member Author)

@j-bennet , see you in court! :)

@j-bennet (Contributor) commented May 13, 2018

Well, you clearly know better what you're doing here, so I'll leave the implementation to you. I only linked my branch with the idea that you might reuse the unit test.

@martindurant (Member Author)

Yes, certainly, and thank you for having a go at this. Fwiw, it is gratifying to see that you could solve this so quickly and came up with code very similar to mine, except that your docstrings are nicely more descriptive.

@TomAugspurger (Member)

I'm also OK with deviating from the pandas default here. Does anyone have thoughts on maintaining a list of "deviations from pandas" in the docs, so that we have a single place to find all the intentional differences?

@martindurant (Member Author)

This will currently break if any of the partitions are empty, which is something we've come across in CSV too. The solution would be to compute the meta up front in the function, which I suppose from_delayed does anyway, and pass it into read_json_chunk for the case that there is no data.
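The empty-partition fallback described above can be sketched as follows. This is an illustration, not the PR's code: `read_json_chunk` is the name mentioned in the comment, but its body and the `meta` keyword here are assumptions. The idea is to compute an empty prototype frame (`meta`) up front and return it whenever a chunk contains no data.

```python
import io
import pandas as pd

def read_json_chunk(chunk, meta=None):
    # Hypothetical reader: fall back to the precomputed empty
    # frame (meta) when a partition contains no data.
    s = chunk.decode('utf-8') if isinstance(chunk, bytes) else chunk
    if not s.strip():
        if meta is None:
            raise ValueError("empty partition and no meta provided")
        return meta
    return pd.read_json(io.StringIO(s), orient='records', lines=True)

# meta computed up front, e.g. from the first non-empty block;
# .iloc[:0] keeps the columns and dtypes but drops all rows.
meta = pd.read_json(io.StringIO('{"a": 1}\n'),
                    orient='records', lines=True).iloc[:0]
df_empty = read_json_chunk('', meta=meta)
```

With this shape, an empty chunk still yields a frame with the right columns and dtypes, which is what from_delayed needs to stay consistent.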

@martindurant (Member Author)

All failures here are pyarrow, not related to this PR.

@mrocklin (Member)

It looks like some of the failures are due to relative/absolute imports in python 2, not PyArrow. You might want to ensure that there is a __future__ import at the top.

>           pandas_metadata = json.loads(schema.metadata[b'pandas'].decode('utf8'))
E           AttributeError: 'module' object has no attribute 'loads'

When naming a module the same as a commonly-needed one...
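The failure above is a Python 2 quirk: bare `import json` was resolved relative to the current package first, so a sibling module named `json` shadowed the standard library one. The fix mrocklin alludes to is illustrated below (the payload string is made up; `absolute_import` is a no-op on Python 3 but changes the behaviour on Python 2):

```python
# On Python 2, a sibling json.py would shadow the stdlib module,
# making json.loads an AttributeError. Opting in to Python 3
# import semantics avoids the shadowing:
from __future__ import absolute_import

import json

# With absolute imports, "json" is always the stdlib module and
# json.loads works as expected.
data = json.loads('{"pandas": {"columns": []}}')
```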
objects, which can be computed at a later time.
"""
kwargs['orient'] = orient
kwargs['lines'] = lines and orient == 'records'
Member

Are there options that the user might select that aren't valid that we don't check for? I'm thinking about the options to lines and orient here. What happens if they choose something that is hard to do in parallel?

Member Author

Anything except orient='records' and lines=True does not parallelise: you get one partition per file. An attempt to set blocksize for those cases is an error. The reason for the and here is that lines=True doesn't make sense for any other orient anyway, and would be an error in the pandas method.

Member

Do we give users a sensible error immediately, or does this error only happen at compute time?

If these are the only two options then we might consider just removing them as keywords.

Member Author

Actually, on writing you get one file per partition anyhow. Passing lines through with any other orient would error at compute time, so we correct it here. I prefer leaving them here, since they are the non-standard ones and partner with the read function, which is where block-loading can happen.

Member

OK, if we leave them here then can we provide informative errors at graph construction time? We've gotten complaints from downstream users whenever we don't fail early. Apparently it's often quite hard to track down exactly which line in a complex analysis caused the error. Perhaps something like the following:

if orient not in {'records'}:
    raise NotImplementedError("The only valid value for the orient= keyword is 'records'")
if not lines:
    raise NotImplementedError("The lines= keyword must be set to True")

Member Author

The only potentially surprising behaviour would be that to_json(orient='records') (or no parameters at all) produces line-delimited output. The documentation on it I would say is pretty clear.
Another case like to_json(orient='split') works as expected, and to_json(orient='split', lines=True) also works where for pandas it would be an error - so no unexpected behaviour here.

Member

I don't have much personal experience with this method, though, so my knowledge of what is best here is sparse.

Member

> to_json(orient='split', lines=True) also works where for pandas it would be an error - so no unexpected behaviour here.

I think the surprise here is that it doesn't produce a line-delimited output as was explicitly requested by the user. I think that in this situation the correct thing to do is to raise an exception.

Member Author

How about lines=None as the default: for orient='records' it resolves to True, for everything else to False, and explicitly selecting True with any other orient raises an error before passing on to pandas.
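That proposal amounts to a small validation step at graph-construction time. A minimal sketch, with a hypothetical helper name:

```python
def resolve_lines(orient, lines):
    # Hypothetical validation sketch: default lines from orient,
    # and reject lines=True for any orient other than 'records',
    # so the user gets the error early rather than at compute time.
    if lines is None:
        lines = orient == 'records'
    if lines and orient != 'records':
        raise ValueError("lines=True is only valid with orient='records'")
    return lines
```

So resolve_lines('records', None) gives True, resolve_lines('split', None) gives False, and resolve_lines('split', True) raises immediately instead of failing inside pandas later.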

Member

Yes, that would satisfy my constraints.

kwargs['lines'] = lines and orient == 'records'
outfiles = open_files(
url_path, 'wt', encoding=kwargs.pop('encoding', 'utf-8'),
errors=kwargs.pop('errors', 'strict'),
Member

Perhaps we should make these keywords and defaults explicit?
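Making the keywords explicit means they appear in the signature (and therefore in help() and the docs) instead of being popped out of **kwargs. A hypothetical signature sketch, with the body omitted:

```python
import inspect

def to_json(df, url_path, orient='records', lines=None,
            encoding='utf-8', errors='strict', **kwargs):
    """Hypothetical signature sketch; body omitted."""

# The defaults are now discoverable from the signature itself:
sig = inspect.signature(to_json)
```

This is purely an API-surface change: behaviour is identical, but users and tooling can see the encoding/errors defaults without reading the source.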

@mrocklin mrocklin merged commit 48c4a58 into dask:master May 15, 2018
@mrocklin (Member)

This is in. Thanks @martindurant !

@j-bennet (Contributor)

👍

@martindurant martindurant deleted the dataframe_json branch June 20, 2018 13:27