New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-34436: [R] Bindings for JSON Dataset #35055
Conversation
thisisnic
commented
Apr 11, 2023
•
edited by github-actions
bot
edited by github-actions
bot
- Closes: [R] Bindings for JSON Dataset #34436
8c5d52f
to
e4aedc5
Compare
11835d1
to
683495f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The C++ / bindings changes look good to me. I didn't look too much at the R part.
8198628
to
a64afac
Compare
cca56bb
to
eda50a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so close!
I know that we don't necessarily have a way to extract options that we set on a Dataset; however, I think we should add a test with an open_dataset()
that passes some though (i.e., use_threads
and/or newlines_in_values
and/or schema
), and if possible, test that they actually did something.
For use_threads = FALSE
: I believe that option_use_threads()
returns FALSE on Windows, so this is sort of implicitly tested; however, I think at least one test should explicitly attempt to pass it.
For newlines_in_values = TRUE
: You could do something like write {"key": "\nvalue"}\n
to a file and attempt collecting to make sure the newline in the value comes through.
For schema
: You could do something like write {"key": "value"}\n
to a file and attempt collecting with schema(key = int32())
and make sure it errors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few minor questions but this seems fine to me otherwise.
#' * `use_threads`: Whether to use the global CPU thread pool. Default `TRUE`. If `FALSE`, JSON input must end with an | ||
#' empty line. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems odd these two things are related. Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must have copied and pasted this to the wrong place, good catch!
Actually, I'm going to push back on this. We don't test these options fully when reading CSV datasets, and given they relate to internal mechanisms of how things run not what is returned, it's going to be difficult to test here, and should be tested in the C++ anyway. Happy to write tests to check that these options are passed through successfully though. |
I don't think that our previous lack of of test coverage should mean that we continue to add features that are not tested. I am also happy to defer to another review on this, though. How about: library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(testthat, warn.conflicts = FALSE)
test_that("JsonParseOptions are passed through to the dataset", {
tf <- tempfile()
on.exit(unlink(tf))
write('{"key": "value"}', tf)
expect_error(
# (but instead use open_dataset())
read_json_arrow(tf, schema = schema(key = int32())),
"parse error"
)
})
#> Test passed 😀 (Ideally we'd be able to do something like |
That is a good point...we're definitely not here to test C++; however, we do need to check that the wires are plugged in where it's reasonable to do and I think here it is reasonable to do. |
Feel free to push to the branch @paleolimbot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's track separately! (See #36138)
Conbench analyzed the 6 benchmark runs on commit There were 4 benchmark results indicating a performance regression:
The full Conbench report has more details. |