GH-34436: [R] Bindings for JSON Dataset #35055

Merged (24 commits) on Jun 18, 2023

@thisisnic thisisnic commented Apr 11, 2023

@westonpace westonpace left a comment

The C++ / bindings changes look good to me. I didn't look too much at the R part.

@paleolimbot paleolimbot left a comment

This is so close!

I know that we don't necessarily have a way to extract options that we set on a Dataset; however, I think we should add a test with an open_dataset() call that passes some of them (e.g., use_threads, newlines_in_values, and/or schema) and, if possible, checks that they actually did something.

For use_threads = FALSE: I believe that option_use_threads() returns FALSE on Windows, so this is sort of implicitly tested; however, I think at least one test should explicitly attempt to pass it.

For newlines_in_values = TRUE: You could do something like write {"key": "\nvalue"}\n to a file and attempt collecting to make sure the newline in the value comes through.

For schema: You could do something like write {"key": "value"}\n to a file and attempt collecting with schema(key = int32()) and make sure it errors?

@westonpace westonpace left a comment

Just a few minor questions but this seems fine to me otherwise.

Comment on lines +129 to +130
#' * `use_threads`: Whether to use the global CPU thread pool. Default `TRUE`. If `FALSE`, JSON input must end with an
#' empty line.

It seems odd these two things are related. Am I missing something?

Member Author

I must have copied and pasted this to the wrong place, good catch!

@thisisnic
Member Author

> This is so close!
>
> I know that we don't necessarily have a way to extract options that we set on a Dataset; however, I think we should add a test with an open_dataset() call that passes some of them (e.g., use_threads, newlines_in_values, and/or schema) and, if possible, checks that they actually did something.

Actually, I'm going to push back on this. We don't fully test these options when reading CSV datasets, and given that they relate to the internal mechanics of how things run rather than to what is returned, they're going to be difficult to test here and should be tested in the C++ anyway. I'm happy to write tests to check that these options are passed through successfully, though.

@paleolimbot
Member

I don't think that our previous lack of test coverage should mean that we continue to add features that are not tested. I am also happy to defer to another reviewer on this, though.

How about:

library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(testthat, warn.conflicts = FALSE)

test_that("JsonParseOptions are passed through to the dataset", {
  tf <- tempfile()
  on.exit(unlink(tf))
  write('{"key": "value"}', tf)
  
  expect_error(
    # (but instead use open_dataset())
    read_json_arrow(tf, schema = schema(key = int32())),
    "parse error"
  )
})
#> Test passed 😀

(Ideally we'd be able to do something like dataset$parse_options$schema instead, which I think you've started to implement with the CSV options, but this seems slightly easier)

@paleolimbot
Member

> and should be tested in the C++ anyway.

That is a good point... we're definitely not here to test C++; however, we do need to check that the wires are plugged in where it's reasonable to do so, and I think it is reasonable here.

@thisisnic
Member Author

> I don't think that our previous lack of test coverage should mean that we continue to add features that are not tested. I am also happy to defer to another reviewer on this, though.
>
> How about:
>
> library(arrow, warn.conflicts = FALSE)
> #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
> library(testthat, warn.conflicts = FALSE)
>
> test_that("JsonParseOptions are passed through to the dataset", {
>   tf <- tempfile()
>   on.exit(unlink(tf))
>   write('{"key": "value"}', tf)
>
>   expect_error(
>     # (but instead use open_dataset())
>     read_json_arrow(tf, schema = schema(key = int32())),
>     "parse error"
>   )
> })
> #> Test passed 😀
>
> (Ideally we'd be able to do something like dataset$parse_options$schema instead, which I think you've started to implement with the CSV options, but this seems slightly easier)

Feel free to push to the branch @paleolimbot

@paleolimbot paleolimbot left a comment

Let's track separately! (See #36138)

@thisisnic thisisnic merged commit 5ecdd94 into apache:main Jun 18, 2023
37 of 40 checks passed
@conbench-apache-arrow

Conbench analyzed the 6 benchmark runs on commit 5ecdd945.

There were 4 benchmark results indicating a performance regression:

The full Conbench report has more details.
