Add `include_path_column` arg to `read_json` by bryanwweber · Pull Request #8603 · dask/dask

bryanwweber · 2022-01-21T17:19:36Z

As suggested in #8539, this adds the include_path_column argument to dd.read_json(). The paths can be munged using the path_converter argument, which is slightly different from the dd.read_csv() case, because pd.read_json() does not support the converters argument.

Closes #8539 (Add include_path_column to read_json)
Tests added / passed
Passes pre-commit run --all-files

GPUtester · 2022-01-21T17:19:38Z

Can one of the admins verify this patch?

jrbourbeau · 2022-01-21T17:30:06Z

add to allowlist

jrbourbeau

Thanks @bryanwweber! Looking forward to seeing this merged

On Windows, some of the tests added here are failing with

E               AssertionError: Attributes of DataFrame.iloc[:, 2] (column name="path") are different
E               
E               Attribute "dtype" are different
E               [left]:  CategoricalDtype(categories=['C:/Users/RUNNER~1/AppData/Local/Temp/tmpupda7hgh.json'], ordered=False)
E               [right]: CategoricalDtype(categories=['C:\Users\RUNNER~1\AppData\Local\Temp\tmpupda7hgh.json'], ordered=False)

The other failures are unrelated to the changes in this PR (xref #8580, #7406)

dask/dataframe/io/json.py

dask/dataframe/io/tests/test_json.py

bryanwweber · 2022-01-25T16:47:59Z

On Windows, some of the tests added here are failing with

E               AssertionError: Attributes of DataFrame.iloc[:, 2] (column name="path") are different
E               
E               Attribute "dtype" are different
E               [left]:  CategoricalDtype(categories=['C:/Users/RUNNER~1/AppData/Local/Temp/tmpupda7hgh.json'], ordered=False)
E               [right]: CategoricalDtype(categories=['C:\Users\RUNNER~1\AppData\Local\Temp\tmpupda7hgh.json'], ordered=False)

Ugh, Windows paths. I'll look at how read_csv() deals with this issue; I don't recall seeing normpath anywhere.
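
A small standard-library sketch of the mismatch in the traceback above, and two ways the separators could be made to agree before comparing. This is only an illustration of the problem; it is not necessarily the fix that was applied in this PR.

```python
import ntpath
from pathlib import PureWindowsPath

# The two spellings of the same file from the failing assertion.
backslash_path = r"C:\Users\RUNNER~1\AppData\Local\Temp\tmpupda7hgh.json"
forward_path = "C:/Users/RUNNER~1/AppData/Local/Temp/tmpupda7hgh.json"

# Compared as plain strings they differ, so categoricals built from them differ too.
strings_equal = backslash_path == forward_path

# Option 1: convert the backslash form to forward slashes.
as_posix = PureWindowsPath(backslash_path).as_posix()

# Option 2: ntpath.normpath converts forward slashes to backslashes.
as_native = ntpath.normpath(forward_path)
```

Both `PureWindowsPath` and `ntpath` apply Windows path rules on any platform, so the sketch behaves the same on Linux and macOS.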

jrbourbeau

Thanks @bryanwweber!

jrbourbeau · 2022-01-27T00:48:50Z

dask/dataframe/io/tests/test_json.py

+
+
+@pytest.mark.parametrize("blocksize", [5, 15, 33, 200, 90000])
+def test_read_json_multiple_files_with_path_column(blocksize, tmpdir):


It looks like this needs the same path formatting you applied to test_read_json_with_path_column

@martindurant do you have any insight into why pandas.to_json and dask.dataframe.read_json are using different path separators on Windows (i.e. pandas outputs files with \ and Dask reads them in with a /)?

jrbourbeau

Thanks @bryanwweber! This is in

github-actions bot added dataframe io labels Jan 21, 2022

Bryan Weber added 5 commits January 24, 2022 16:04

Add include_path_column arg to read_json

ed9db3b

This is now causing some test failures elsewhere in read_json, so those'll need to be debugged. But the happy path with path columns is working!

Add CategoricalDtype and fix existing tests

3cf6ade

Refactor path_converter

fb4ecde

Address code coverage

9290cc0

Add docstrings

8cd721a

bryanwweber marked this pull request as ready for review January 24, 2022 21:10

jrbourbeau reviewed Jan 25, 2022

jrbourbeau changed the title from "Add include_path_column arg to read_json" to "Add `include_path_column` arg to `read_json`" Jan 25, 2022

crusaderky assigned bryanwweber Jan 25, 2022

Bryan Weber added 4 commits January 25, 2022 14:09

Address some review comments

3f63cbd

Fix paths on Windows and other misc fixups

b53e90b

Defer compute to assert_eq

8347e85

Reformat Windows paths again

de13031

bryanwweber requested a review from jrbourbeau January 26, 2022 21:51

jrbourbeau reviewed Jan 27, 2022

Fix one more failing test

bf52ed2

jrbourbeau approved these changes Jan 31, 2022

jrbourbeau merged commit 1f79f5e into dask:main Jan 31, 2022

jrbourbeau merged 10 commits into dask:main from bryanwweber:add-include-path-column-to-json
