Add include_path_column arg to read_json #8603
jrbourbeau merged 10 commits into dask:main from bryanwweber:add-include-path-column-to-json
Can one of the admins verify this patch?
add to allowlist
This is now causing some test failures elsewhere in read_json, so those'll need to be debugged. But the happy path with path columns is working!
jrbourbeau
left a comment
Thanks @bryanwweber! Looking forward to seeing this merged
On Windows, some of the tests added here are failing with
E AssertionError: Attributes of DataFrame.iloc[:, 2] (column name="path") are different
E
E Attribute "dtype" are different
E [left]: CategoricalDtype(categories=['C:/Users/RUNNER~1/AppData/Local/Temp/tmpupda7hgh.json'], ordered=False)
E [right]: CategoricalDtype(categories=['C:\Users\RUNNER~1\AppData\Local\Temp\tmpupda7hgh.json'], ordered=False)
The other failures are unrelated to the changes in this PR (xref #8580, #7406)
Ugh, Windows paths. I'll look at how
@pytest.mark.parametrize("blocksize", [5, 15, 33, 200, 90000])
def test_read_json_multiple_files_with_path_column(blocksize, tmpdir):
It looks like this needs the same path formatting you applied to test_read_json_with_path_column
@martindurant do you have any insight into why pandas.to_json and dask.dataframe.read_json are using different path separators on Windows (i.e. pandas writes files with \ and Dask reads them back in with /)?
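For reference, one way a test could compare the two spellings without caring about the separator is to normalize both sides first. This is only a sketch using the standard library's `pathlib.PureWindowsPath`; the helper name `normalize_path` is hypothetical and is not necessarily the path formatting applied in this PR.

```python
from pathlib import PureWindowsPath


def normalize_path(path: str) -> str:
    """Interpret `path` as a Windows path and return it with forward slashes.

    `normalize_path` is a hypothetical helper for illustration only.
    """
    return PureWindowsPath(path).as_posix()


# The two spellings from the failing assertion above.
written = r"C:\Users\RUNNER~1\AppData\Local\Temp\tmpupda7hgh.json"
read_back = "C:/Users/RUNNER~1/AppData/Local/Temp/tmpupda7hgh.json"

# After normalization both sides use "/" and compare equal.
same = normalize_path(written) == normalize_path(read_back)
```

Because `PureWindowsPath` does not touch the filesystem, this behaves the same on Linux, macOS, and Windows runners.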
jrbourbeau
left a comment
Thanks @bryanwweber! This is in
As suggested in #8539, this adds the include_path_column argument to dd.read_json(). The paths can be munged using the path_converter argument, which is slightly different from the dd.read_csv() case, because pd.read_json() does not support the converters argument.
pre-commit run --all-files