Add dataframe to/from json #3494

Merged
mrocklin merged 9 commits into dask:master from martindurant:dataframe_json
May 15, 2018

Conversation

@martindurant (Member)

Posting early to see if there are any comments on the approach. There are other formats that are relatively easy for pandas; I'm unsure whether we want to support them all here. There is an argument for matching as much of the pandas API as we can.

Fixes #3491

cc @j-bennet

@martindurant (Member Author)

(tests to follow; a simple example works when tested manually)

@martindurant martindurant changed the title start dataframe-json Add dataframe to/from json May 12, 2018
@martindurant (Member Author)

Delayed keys should have appropriate names

@martindurant (Member Author)

Py2 failure:

dask/tests/test_distributed.py::test_to_hdf_distributed 
[gw1] FAILED dask/tests/test_distributed.py::test_to_hdf_distributed 

@mrocklin (Member) left a comment

In general this looks nice to me.

The deviation from the Pandas convention seems slightly concerning, but I think that what is in here is probably the right choice.

I like the disclaimer in the docstring. I might recommend also using the phrase "line-delimited JSON" in the header of the to_json method. I suspect that this will be clearer to some users (at least me :)) than using the pandas keyword names in the description, which may not be as well known.

return pd.read_json(s, orient='records', lines=True, **kwargs)


def _file_to_partition(f, orient, lines, kwargs):
Member

If we're using delayed then this function name will be used in the task graph and in diagnostics. We might either want to suggest a name to delayed, or else rename this function. It would be nice if we could make it so that a name like "write_json" appeared in the diagnostics.

Member Author

I was thinking to provide key names to the delayed calls, although I should probably hash the inputs in order to do that right (with dask.base.tokenize?)
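Providing hashed key names to delayed calls could look something like the following minimal sketch. This is not the code from the PR: `write_json_partition` is a hypothetical per-partition writer, used only to show how `dask.base.tokenize` and the `dask_key_name` keyword combine to produce a deterministic, descriptive task name.

```python
import pandas as pd
from dask import delayed
from dask.base import tokenize

def write_json_partition(df, path):
    # Hypothetical per-partition writer, for illustration only.
    df.to_json(path, orient='records', lines=True)
    return path

df = pd.DataFrame({'a': [1, 2]})
path = 'part-0.json'

# tokenize hashes the inputs, so the key is deterministic and
# unique per (data, path) pair; the prefix shows up in diagnostics.
token = tokenize(df, path)
task = delayed(write_json_partition, pure=False)(
    df, path, dask_key_name='write-json-' + token)
```

Nothing is computed here; `task.key` is simply `'write-json-' + token`, which is what would appear in the task graph and dashboard.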

@j-bennet (Contributor)

Looks good. Funny enough, I started working on the same thing here:

master...j-bennet:j-bennet/support-json

Feel free to grab my unit test if you think it can be of use.

@martindurant (Member Author)

@j-bennet , see you in court! :)

@j-bennet (Contributor) commented May 13, 2018

Well, you clearly know better what you're doing here, so I'll leave the implementation to you. I only linked my branch with the idea that you might reuse the unit test.

@martindurant (Member Author)

Yes, certainly, and thank you for having a go at this. Fwiw, it is gratifying to see that you could solve this so quickly and came up with code very similar to mine, except that your docstrings are nicely more descriptive.

@TomAugspurger (Member)

I'm also OK with deviating from the pandas default here. Does anyone have thoughts on maintaining a list of "deviations from pandas" in the docs, so that we have a single place to find all the intentional differences?

@martindurant (Member Author)

This will currently break if any of the partitions are empty, which is something we've come across in CSV too. The solution would be to compute the meta up front in the function, which I suppose from_delayed does anyway, and pass it into read_json_chunk for the case that there is no data.
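The empty-partition fallback described above can be sketched as follows. This is an illustration, not the PR's code: `read_json_chunk` is the name mentioned in the comment, but its body and the `meta` keyword here are assumptions. The idea is to compute an empty prototype frame (`meta`) up front and return it whenever a chunk contains no data.

```python
import io
import pandas as pd

def read_json_chunk(chunk, meta=None):
    # Hypothetical reader: fall back to the precomputed empty
    # frame (meta) when a partition contains no data.
    s = chunk.decode('utf-8') if isinstance(chunk, bytes) else chunk
    if not s.strip():
        if meta is None:
            raise ValueError("empty partition and no meta provided")
        return meta
    return pd.read_json(io.StringIO(s), orient='records', lines=True)

# meta computed up front, e.g. from the first non-empty block;
# .iloc[:0] keeps the columns and dtypes but drops all rows.
meta = pd.read_json(io.StringIO('{"a": 1}\n'),
                    orient='records', lines=True).iloc[:0]
df_empty = read_json_chunk('', meta=meta)
```

With this shape, an empty chunk still yields a frame with the right columns and dtypes, which is what from_delayed needs to stay consistent.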

@martindurant (Member Author)

All failures here are pyarrow, not related to this PR.

@mrocklin (Member)

It looks like some of the failures are due to relative/absolute imports in python 2, not PyArrow. You might want to ensure that there is a __future__ import at the top.

>           pandas_metadata = json.loads(schema.metadata[b'pandas'].decode('utf8'))
E           AttributeError: 'module' object has no attribute 'loads'

When naming a module the same as a commonly-needed one...
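The failure above is a Python 2 quirk: bare `import json` was resolved relative to the current package first, so a sibling module named `json` shadowed the standard library one. The fix mrocklin alludes to is illustrated below (the payload string is made up; `absolute_import` is a no-op on Python 3 but changes the behaviour on Python 2):

```python
# On Python 2, a sibling json.py would shadow the stdlib module,
# making json.loads an AttributeError. Opting in to Python 3
# import semantics avoids the shadowing:
from __future__ import absolute_import

import json

# With absolute imports, "json" is always the stdlib module and
# json.loads works as expected.
data = json.loads('{"pandas": {"columns": []}}')
```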
objects, which can be computed at a later time.
"""
kwargs['orient'] = orient
kwargs['lines'] = lines and orient == 'records'
Member

Are there options that the user might select that aren't valid that we don't check for? I'm thinking about the options to lines and orient here. What happens if they choose something that is hard to do in parallel?

Member Author

Anything except orient='records' and lines=True does not parallelise: you get one partition per file. An attempt to set blocksize for those cases is an error. The reason for the and here is that lines=True doesn't make sense for any other orient anyway, and would be an error in the pandas method.

Member

Do we give users a sensible error immediately, or does this error only happen at compute time?

If these are the only two options then we might consider just removing them as keywords.

Member Author

Actually, on writing you get one file per partition anyhow. Passing lines through with any other orient would error at compute time, so we correct it here. I prefer leaving them here, since they are the non-standard ones and partner with the read function, which is where block-loading can happen.

Member

OK, if we leave them here then can we provide informative errors at graph construction time? We've gotten complaints from downstream users whenever we don't fail early. Apparently it's often quite hard to track down exactly which line in a complex analysis caused the error. Perhaps something like the following:

if orient not in {'records'}:
    raise NotImplementedError("The only valid value for the orient= keyword is 'records'")
if not lines:
    raise NotImplementedError("The lines= keyword must be set to True")

Member Author

The only potentially surprising behaviour would be that to_json(orient='records') (or no parameters at all) produces line-delimited output. The documentation on it I would say is pretty clear.
Another case like to_json(orient='split') works as expected, and to_json(orient='split', lines=True) also works where for pandas it would be an error - so no unexpected behaviour here.

Member

I don't have much personal experience with this method, though, so my knowledge of what is best here is sparse.

Member

> to_json(orient='split', lines=True) also works where for pandas it would be an error - so no unexpected behaviour here.

I think the surprise here is that it doesn't produce a line-delimited output as was explicitly requested by the user. I think that in this situation the correct thing to do is to raise an exception.

Member Author

How about lines=None as the default: for orient='records' it resolves to True, for everything else to False, and explicitly selecting True with any other orient raises an error before passing on to pandas.
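That proposal amounts to a small validation step at graph-construction time. A minimal sketch, with a hypothetical helper name:

```python
def resolve_lines(orient, lines):
    # Hypothetical validation sketch: default lines from orient,
    # and reject lines=True for any orient other than 'records',
    # so the user gets the error early rather than at compute time.
    if lines is None:
        lines = orient == 'records'
    if lines and orient != 'records':
        raise ValueError("lines=True is only valid with orient='records'")
    return lines
```

So resolve_lines('records', None) gives True, resolve_lines('split', None) gives False, and resolve_lines('split', True) raises immediately instead of failing inside pandas later.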

Member

Yes, that would satisfy my constraints.

kwargs['lines'] = lines and orient == 'records'
outfiles = open_files(
url_path, 'wt', encoding=kwargs.pop('encoding', 'utf-8'),
errors=kwargs.pop('errors', 'strict'),
Member

Perhaps we should make these keywords and defaults explicit?
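Making the keywords explicit means they appear in the signature (and therefore in help() and the docs) instead of being popped out of **kwargs. A hypothetical signature sketch, with the body omitted:

```python
import inspect

def to_json(df, url_path, orient='records', lines=None,
            encoding='utf-8', errors='strict', **kwargs):
    """Hypothetical signature sketch; body omitted."""

# The defaults are now discoverable from the signature itself:
sig = inspect.signature(to_json)
```

This is purely an API-surface change: behaviour is identical, but users and tooling can see the encoding/errors defaults without reading the source.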

@mrocklin mrocklin merged commit 48c4a58 into dask:master May 15, 2018
@mrocklin (Member)

This is in. Thanks @martindurant !

@j-bennet (Contributor)

👍

@martindurant martindurant deleted the dataframe_json branch June 20, 2018 13:27