Add include_path_column arg to read_json #8603

Merged
jrbourbeau merged 10 commits into dask:main from bryanwweber:add-include-path-column-to-json
Jan 31, 2022

@bryanwweber
Contributor

@bryanwweber bryanwweber commented Jan 21, 2022

As suggested in #8539, this adds the include_path_column argument to dd.read_json(). The paths can be munged using the path_converter argument, which is slightly different from the dd.read_csv() case, because pd.read_json() does not support the converters argument.

@GPUtester
Collaborator

Can one of the admins verify this patch?

@jrbourbeau
Member

add to allowlist

Bryan Weber added 5 commits January 24, 2022 16:04
This is now causing some test failures elsewhere in read_json, so
those'll need to be debugged. But the happy path with path columns is
working!
@bryanwweber bryanwweber marked this pull request as ready for review January 24, 2022 21:10
Member

@jrbourbeau jrbourbeau left a comment

Thanks @bryanwweber! Looking forward to seeing this merged

On Windows, some of the tests added here are failing with

E               AssertionError: Attributes of DataFrame.iloc[:, 2] (column name="path") are different
E               
E               Attribute "dtype" are different
E               [left]:  CategoricalDtype(categories=['C:/Users/RUNNER~1/AppData/Local/Temp/tmpupda7hgh.json'], ordered=False)
E               [right]: CategoricalDtype(categories=['C:\Users\RUNNER~1\AppData\Local\Temp\tmpupda7hgh.json'], ordered=False)

The other failures are unrelated to the changes in this PR (xref #8580, #7406)

@jrbourbeau jrbourbeau changed the title Add include_path_column arg to read_json Add include_path_column arg to read_json Jan 25, 2022
@bryanwweber
Contributor Author

On Windows, some of the tests added here are failing with

E               AssertionError: Attributes of DataFrame.iloc[:, 2] (column name="path") are different
E               
E               Attribute "dtype" are different
E               [left]:  CategoricalDtype(categories=['C:/Users/RUNNER~1/AppData/Local/Temp/tmpupda7hgh.json'], ordered=False)
E               [right]: CategoricalDtype(categories=['C:\Users\RUNNER~1\AppData\Local\Temp\tmpupda7hgh.json'], ordered=False)

Ugh, Windows paths. I'll look at how read_csv() deals with this issue; I don't recall seeing normpath anywhere.

Member

@jrbourbeau jrbourbeau left a comment

Thanks @bryanwweber!



@pytest.mark.parametrize("blocksize", [5, 15, 33, 200, 90000])
def test_read_json_multiple_files_with_path_column(blocksize, tmpdir):
Member

It looks like this needs the same path formatting you applied to test_read_json_with_path_column.

@martindurant do you have any insight into why pandas.to_json and dask.dataframe.read_json are using different path separators on windows (i.e. pandas outputs files with \ and Dask reads them in with a /)?
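
For illustration only, here is one way a test could put both spellings of a Windows path into the same form before comparing them. This is a sketch under stated assumptions, not the formatting actually applied in this PR. `to_forward_slashes` is a hypothetical helper, and the two path strings are copied from the CI failure quoted above.

```python
from pathlib import PureWindowsPath


def to_forward_slashes(path: str) -> str:
    """Hypothetical helper: render a Windows-style path with "/" separators."""
    # PureWindowsPath parses Windows paths on any OS; as_posix() emits "/".
    return PureWindowsPath(path).as_posix()


# The path as written with backslashes, and as read back with forward slashes.
written = r"C:\Users\RUNNER~1\AppData\Local\Temp\tmpupda7hgh.json"
read_back = "C:/Users/RUNNER~1/AppData/Local/Temp/tmpupda7hgh.json"

# The raw strings differ, which is what made the categorical dtypes differ.
assert written != read_back

# After normalization both spellings compare equal.
assert to_forward_slashes(written) == to_forward_slashes(read_back)
```

Using `PureWindowsPath` rather than `os.path` keeps the sketch runnable on non-Windows machines, since it does not depend on the host platform's separator.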

Member

@jrbourbeau jrbourbeau left a comment

Thanks @bryanwweber! This is in

@jrbourbeau jrbourbeau merged commit 1f79f5e into dask:main Jan 31, 2022
Development

Successfully merging this pull request may close these issues.

Add include_path_column to read_json
