Conversation
(tests to follow, working for simple example manually tested)

Delayed keys should have appropriate names

Py2 failure:
mrocklin left a comment

In general this looks nice to me.
The deviation from the Pandas convention seems slightly concerning, but I think that what is in here is probably the right choice.
I like the disclaimer in the docstring. I might also recommend using the phrase "line-delimited JSON" in the header of the to_json method. I suspect that this will be clearer to some users (at least me :)) than the pandas keyword names used in the description, which may not be as well known.
dask/dataframe/io/json.py (outdated)

    return pd.read_json(s, orient='records', lines=True, **kwargs)

    def _file_to_partition(f, orient, lines, kwargs):

If we're using delayed then this function name will be used in the task graph and in diagnostics. We might either want to suggest a name to delayed, or else rename this function. It would be nice if we could make it so that a name like "write_json" appeared in the diagnostics.

I was thinking of providing key names to the delayed calls, although I should probably hash the inputs in order to do that right (with dask.base.tokenize?)
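The idea above can be sketched. In real dask code one would call dask.base.tokenize and pass the result via the dask_key_name= keyword of the delayed call; the stand-in below uses only the standard library to show the shape of the resulting keys (all names here are illustrative, not the PR's code):

```python
# Stdlib-only stand-in for the idea above: build a deterministic,
# human-readable task key from the inputs, so diagnostics show something
# like "write-json-<hash>" instead of an internal function name.
import hashlib

def tokenize(*args):
    # simplified: hash the repr of the inputs (dask's tokenize is more robust)
    return hashlib.md5(repr(args).encode()).hexdigest()

def task_key(prefix, *args):
    return '%s-%s' % (prefix, tokenize(*args))

key = task_key('write-json', 'out/part.0.json', ('records', True))
```

Equal inputs always produce the same key, which is what makes the graph deterministic and cacheable.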

Looks good. Funny enough, I started working on the same thing here: master...j-bennet:j-bennet/support-json Feel free to grab my unit test if you think it can be of use.

@j-bennet, see you in court! :)

Well, you clearly know better what you're doing here, so I'll leave the implementation to you. I only linked my branch with the idea that you might reuse the unit test.

Yes, certainly, and thank you for having a go at this. FWIW, it is gratifying to see that you solved this so quickly and came up with code very similar to mine, except that your docstrings are nicely more descriptive.

I'm also OK with deviating from the pandas default here. Does anyone have thoughts on maintaining a list of "deviations from pandas" in the docs, so that we have a single place to find all the intentional differences?

This will currently break if any of the partitions are empty, which is something we've come across in CSV too. The solution would be to compute the meta up front in the function, which I suppose from_delayed does anyway, and pass it into read_json_chunk for the case that there is no data.
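The fix described above could look something like the following sketch (the function and parameter names are assumptions, not the PR's actual code): an empty meta frame with the right columns and dtypes is computed up front, and the per-block reader returns it whenever a block holds no data.

```python
# Sketch only (names assumed): a per-block reader that falls back to a
# precomputed empty `meta` DataFrame when the block is empty, so empty
# partitions don't break downstream dtype/column inference.
import io
import pandas as pd

def read_json_chunk(chunk, meta, **kwargs):
    if not chunk.strip():
        return meta  # empty block: return the empty frame with correct dtypes
    return pd.read_json(io.StringIO(chunk), orient='records', lines=True,
                        **kwargs)
```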

All failures here are pyarrow, not related to this PR.

It looks like some of the failures are due to relative/absolute imports in python 2, not PyArrow. You might want to ensure that there is a
When naming a module the same as a commonly-needed one...
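The comment above is cut off, but presumably it refers to the usual guard for this situation: on Python 2, a module named json.py shadows the standard library json for its own (implicitly relative) imports, and forcing absolute imports avoids that. A sketch of what would go at the top of such a module:

```python
# Presumed fix (the suggestion above is truncated): force absolute imports so
# that `import json` inside a module itself named json.py resolves to the
# standard library on Python 2, not to the module itself.
# (On Python 3 this is already the default and the line is a no-op.)
from __future__ import absolute_import

import json  # stdlib json, not dask/dataframe/io/json.py

data = json.loads('{"ok": true}')
```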
    objects, which can be computed at a later time.
    """
    kwargs['orient'] = orient
    kwargs['lines'] = lines and orient == 'records'

Are there options that the user might select that aren't valid that we don't check for? I'm thinking about the options to lines and orient here. What happens if they choose something that is hard to do in parallel?

Anything except orient='records' and lines=True does not parallelise - you get one partition per file. An attempt to set blocksize for those cases is an error. The reason for the `and` here is that lines=True doesn't make sense for any other orient anyway, and would be an error in the pandas method.
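The point about parallelisability can be illustrated with a stdlib-only sketch: with orient='records' and lines=True, every output line is a complete JSON document, so a file can be split at newline boundaries and each block parsed independently; any other orient produces a single JSON object spanning the whole file, which cannot be split.

```python
# Why line-delimited JSON parallelises: each line is a self-contained
# document, so blocks of a file can be parsed without seeing the rest of it.
import json

text = '{"a": 1}\n{"a": 2}\n{"a": 3}\n'

# pretend these are two independently-read blocks of the same file
lines = text.splitlines()
block1, block2 = lines[:2], lines[2:]
records = [json.loads(line) for line in block1 + block2]
```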

Do we give users a sensible error immediately, or does this error only happen at compute time?
If these are the only two options then we might consider just removing them as keywords.

Actually, on writing you get one file per partition anyhow. Passing lines through with any other orient would error at compute time, so we correct it here. I prefer leaving them here, since they are the non-standard ones and partner with the read function, which is where block-loading can happen.

OK, if we leave them here then can we provide informative errors at graph construction time? We've gotten complaints from downstream users whenever we don't fail early. Apparently it's often quite hard to track down exactly which line in a complex analysis caused the error. Perhaps something like the following:
    if orient not in {'records'}:
        raise NotImplementedError("The only valid value for the orient= keyword is 'records'")
    if not lines:
        raise NotImplementedError("The lines= keyword must be set to True")

The only potentially surprising behaviour would be that to_json(orient='records') (or no parameters at all) produces line-delimited output. I would say the documentation on this is pretty clear.
Another case like to_json(orient='split') works as expected, and to_json(orient='split', lines=True) also works where for pandas it would be an error - so no unexpected behaviour here.

I don't personally have much experience with this method though, so my personal knowledge of what is best here is sparse.

> to_json(orient='split', lines=True) also works where for pandas it would be an error - so no unexpected behaviour here.
I think the surprise here is that it doesn't produce a line-delimited output as was explicitly requested by the user. I think that in this situation the correct thing to do is to raise an exception.

How about lines=None, where the default resolves to True for orient='records' and False for everything else, with an error raised if lines=True is selected for any other orient, before passing on to pandas?

Yes, that would satisfy my constraints.
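The compromise agreed above can be sketched as a small resolver (the function name is an assumption, not the PR's actual code): lines=None picks the sensible default per orient, and an explicit lines=True with any other orient fails at graph construction time rather than at compute time.

```python
# Sketch of the lines=None compromise described above (name assumed):
# resolve the default per orient, and fail early on invalid combinations.
def resolve_lines(orient, lines=None):
    if lines is None:
        # default: line-delimited only for orient='records'
        return orient == 'records'
    if lines and orient != 'records':
        raise ValueError("lines=True is only valid with orient='records'")
    return lines
```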
dask/dataframe/io/json.py (outdated)

    kwargs['lines'] = lines and orient == 'records'
    outfiles = open_files(
        url_path, 'wt', encoding=kwargs.pop('encoding', 'utf-8'),
        errors=kwargs.pop('errors', 'strict'),

Perhaps we should make these keywords and defaults explicit?

This is in. Thanks @martindurant!

👍
Posting early to see if there are any comments on the approach. There are other formats that are relatively easy for pandas; I'm unsure whether we want to support them all here. There is an argument for matching as much of the pandas API as we can.
Fixes #3491
cc @j-bennet