GH-34436: [R] Bindings for JSON Dataset #35055

thisisnic · 2023-04-11T20:47:53Z

Closes: [R] Bindings for JSON Dataset #34436

github-actions · 2023-04-11T20:48:19Z

Closes: [R] Bindings for JSON Dataset #34436

westonpace

The C++ / bindings changes look good to me. I didn't look too much at the R part.

r/R/dataset-format.R

cpp/src/arrow/dataset/api.h

r/tests/testthat/test-dataset-json.R

r/R/dataset-format.R

paleolimbot

This is so close!

I know that we don't necessarily have a way to extract options that we set on a Dataset; however, I think we should add a test with an open_dataset() that passes some though (i.e., use_threads and/or newlines_in_values and/or schema), and if possible, test that they actually did something.

For use_threads = FALSE: I believe that option_use_threads() returns FALSE on Windows, so this is sort of implicitly tested; however, I think at least one test should explicitly attempt to pass it.

For newlines_in_values = TRUE: You could do something like write {"key": "\nvalue"}\n to a file and attempt collecting to make sure the newline in the value comes through.

For schema: You could do something like write {"key": "value"}\n to a file and attempt collecting with schema(key = int32()) and make sure it errors?
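A minimal sketch of the schema check suggested above, not a finished test. It assumes that `open_dataset()` accepts `format = "json"` (the feature this PR adds) and that a mismatched `schema` surfaces as an error at open or scan time; neither is confirmed here. The variable names are illustrative.

```r
# Write one line of newline-delimited JSON whose value is a string.
tf <- tempfile(fileext = ".json")
writeLines('{"key": "value"}', tf)

# NA means "not checked" (arrow is not installed).
failed <- NA

if (requireNamespace("arrow", quietly = TRUE)) {
  # ASSUMPTION: the schema passed to open_dataset() is applied when the
  # JSON file is scanned, so a string value cannot be read as int32.
  result <- tryCatch(
    {
      ds <- arrow::open_dataset(
        tf,
        format = "json",
        schema = arrow::schema(key = arrow::int32())
      )
      # Materialise the dataset so that scan-time errors are raised.
      as.data.frame(arrow::Scanner$create(ds)$ToTable())
    },
    error = function(e) e
  )
  # TRUE if opening or scanning raised an error. Note that an arrow build
  # without JSON or dataset support would also end up here.
  failed <- inherits(result, "error")
}

unlink(tf)
```

In a real test the `tryCatch()` would be replaced by `expect_error()` with a pattern matching the actual message, which would also distinguish a parse failure from a build that lacks JSON support.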

r/R/dataset-format.R

westonpace

Just a few minor questions but this seems fine to me otherwise.

westonpace · 2023-06-13T16:05:45Z

r/R/dataset-format.R

+#'  * `use_threads`: Whether to use the global CPU thread pool. Default `TRUE`. If `FALSE`, JSON input must end with an
+#'  empty line.


It seems odd these two things are related. Am I missing something?

thisisnic · 2023-06-14

I must have copied and pasted this to the wrong place, good catch!

r/R/dataset-format.R

thisisnic · 2023-06-15T20:16:20Z

> This is so close!
>
> I know that we don't necessarily have a way to extract options that we set on a Dataset; however, I think we should add a test with an open_dataset() that passes some though (i.e., use_threads and/or newlines_in_values and/or schema), and if possible, test that they actually did something.

Actually, I'm going to push back on this. We don't test these options fully when reading CSV datasets, and given they relate to internal mechanisms of how things run not what is returned, it's going to be difficult to test here, and should be tested in the C++ anyway. Happy to write tests to check that these options are passed through successfully though.

paleolimbot · 2023-06-16T13:13:40Z

I don't think that our previous lack of test coverage should mean that we continue to add features that are not tested. I am also happy to defer to another review on this, though.

How about:

library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(testthat, warn.conflicts = FALSE)

test_that("JsonParseOptions are passed through to the dataset", {
  tf <- tempfile()
  on.exit(unlink(tf))
  write('{"key": "value"}', tf)
  
  expect_error(
    # (but instead use open_dataset())
    read_json_arrow(tf, schema = schema(key = int32())),
    "parse error"
  )
})
#> Test passed 😀

(Ideally we'd be able to do something like dataset$parse_options$schema instead, which I think you've started to implement with the CSV options, but this seems slightly easier)

paleolimbot · 2023-06-16T13:21:17Z

> and should be tested in the C++ anyway.

That is a good point...we're definitely not here to test C++; however, we do need to check that the wires are plugged in where it's reasonable to do and I think here it is reasonable to do.

thisisnic · 2023-06-16T14:24:50Z

Feel free to push to the branch @paleolimbot

paleolimbot

Let's track separately! (See #36138)

conbench-apache-arrow · 2023-06-19T18:15:09Z

Conbench analyzed the 6 benchmark runs on commit 5ecdd945.

There were 4 benchmark results indicating a performance regression:

Commit Run on arm64-t4g-linux-compute at 2023-06-18 22:56:25Z
- params=threads:4/task_cost:100000/real_time, source=cpp-micro, suite=arrow-thread-pool-benchmark
Commit Run on ursa-thinkcentre-m75q at 2023-06-19 01:28:10Z
- params=<Subtract, UInt64Type>/size:524288/inverse_null_proportion:0, source=cpp-micro, suite=arrow-compute-scalar-arithmetic-benchmark
and 2 more (see the report linked below)

The full Conbench report has more details.

github-actions bot added Component: C++, Component: R, awaiting committer review labels Apr 11, 2023

thisisnic force-pushed the GH-3346_json_datasets branch from 8c5d52f to e4aedc5 Compare April 27, 2023 12:21

thisisnic force-pushed the GH-3346_json_datasets branch from 11835d1 to 683495f Compare May 10, 2023 11:11

thisisnic marked this pull request as ready for review May 10, 2023 11:26

thisisnic requested review from paleolimbot and westonpace as code owners May 10, 2023 11:26

westonpace approved these changes May 10, 2023

r/R/dataset-format.R (resolved)

github-actions bot added awaiting merge and removed awaiting committer review labels May 16, 2023

pitrou reviewed May 16, 2023

cpp/src/arrow/dataset/api.h (outdated, resolved)

r/tests/testthat/test-dataset-json.R (resolved)

thisisnic force-pushed the GH-3346_json_datasets branch from 8198628 to a64afac Compare May 22, 2023 07:48

github-actions bot added awaiting changes, awaiting change review and removed awaiting merge, awaiting changes labels May 22, 2023

thisisnic commented May 22, 2023

r/R/dataset-format.R (resolved)

github-actions bot added awaiting changes, awaiting change review and removed awaiting change review, awaiting changes labels May 22, 2023

thisisnic added 5 commits June 8, 2023 16:47

Expose JSON dataset classes

4c4f291

Start to wire up the machinery for JSON datasets

4e466f9

Add FragmentScanOptions creation

e226b83

Add docs from CSV classes

392e056

Add JsonFileDFormat

f5ee9a8

thisisnic added 3 commits June 8, 2023 16:47

Move test setup into function that uses it

f2199d2

Document parse and read options

ef61b60

Appease linter

eda50a0

thisisnic force-pushed the GH-3346_json_datasets branch from cca56bb to eda50a0 Compare June 8, 2023 15:48

github-actions bot added awaiting change review and removed awaiting changes labels Jun 8, 2023

Test creation of FragmentScanOption object

7640f5c

github-actions bot added awaiting changes and removed awaiting change review labels Jun 8, 2023

paleolimbot reviewed Jun 8, 2023

r/R/dataset-format.R (resolved)

Add error handling for and a test of invalid option selection

37b1e9e

github-actions bot added awaiting change review, awaiting changes and removed awaiting changes, awaiting change review labels Jun 9, 2023

westonpace approved these changes Jun 13, 2023

github-actions bot added awaiting merge, awaiting changes and removed awaiting changes, awaiting merge labels Jun 13, 2023

paleolimbot mentioned this pull request Jun 17, 2023

[R] Improve test coverage of dataset option plumbing #36138

Open

paleolimbot approved these changes Jun 17, 2023

github-actions bot added awaiting merge and removed awaiting changes labels Jun 17, 2023

thisisnic merged commit 5ecdd94 into apache:main Jun 18, 2023
37 of 40 checks passed
