GH-35542 : [R] Implement schema extraction function #35543
I am wondering if the function name
r/R/schema.R (outdated)
schema.RecordBatchReader <- function(x) x$schema

#' @export
schema.Dataset <- function(x) x$schema
Can you also add this for arrow_dplyr_query? It might involve calling implicit_schema, or x[["schema"]] %||% implicit_schema(x), or something similar.
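The fallback suggested above can be sketched in base R as follows. This is only an illustration under stated assumptions: `toy_query`, `implicit_schema_stub()`, and `schema_of_query()` are hypothetical stand-ins, not arrow's actual classes or internals, and `%||%` is defined locally so the snippet does not depend on rlang or on R 4.4.0.

```r
# Null-coalescing helper: return the left side unless it is NULL.
`%||%` <- function(lhs, rhs) if (is.null(lhs)) rhs else lhs

# Hypothetical stand-in for arrow's internal implicit_schema(); here it
# only describes what it would derive from the query's selected columns.
implicit_schema_stub <- function(x) {
  paste("derived from", length(x$selected_columns), "columns")
}

# Prefer a schema already stored on the query; otherwise compute one.
schema_of_query <- function(x) {
  x[["schema"]] %||% implicit_schema_stub(x)
}

# Toy query objects (assumption: plain lists with a made-up class).
q_with <- structure(
  list(schema = "stored schema", selected_columns = c("a", "b")),
  class = "toy_query"
)
q_without <- structure(
  list(selected_columns = c("a", "b")),
  class = "toy_query"
)

schema_of_query(q_with)     # uses the stored value
schema_of_query(q_without)  # falls back to the computed value
```

Whether a real arrow_dplyr_query ever carries a non-NULL `$schema` is exactly the question raised in the reply that follows, so the first branch may or may not be reachable in practice.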
Will do. Just to check, when do we end up with arrow_dplyr_query objects where x[["schema"]] isn't NULL?
I seem to recall that when we collapse a query and call implicit_schema, it assigns $schema in the nested adq, but maybe I remember wrong.
It's a tiny bit of a hack, but in nanoarrow I did:
nanoarrow:::infer_nanoarrow_schema.arrow_dplyr_query
#> function (x, ...)
#> {
#> infer_nanoarrow_schema.RecordBatchReader(arrow::as_record_batch_reader(x))
#> }
#> <bytecode: 0x1076e60a8>
#> <environment: namespace:nanoarrow>
Created on 2023-05-12 with reprex v2.0.2
(which works because as_record_batch_reader() on a query doesn't start evaluating but does record the calculated final schema)
I think that's a good point...particularly since it's an S3 method.
For what it's worth, I'm personally not a fan of renaming to
Yes, but then The
Yeah I totally get your concern and this is why I didn't do this before, despite many times typing
I'm not sure that feels natural in these cases:
While I agree that it's not the best that I also considered whether there are other attributes we might want to implement extraction functions like this for, but I think schemas are the only case and so there's no general problem to solve. Given that multiple people (Neal, myself, and also Francois on the original ticket) report having tried to use
Though it looks like it's not possible to get this working sensibly as an S3 generic anyway, may need to redesign.
First, I left a note about how you might be able to get the S3 to work, although there's no perfect way to go about it (notably, schema(!!!stuff) is going to be problematic because the first argument will always be evaluated by R's dispatch rather than by rlang).
Second, how about:
- The S3 generic is infer_schema(). We already have infer_type() (changed from type() for similar reasons)...as_schema() definitely feels wrong for all the reasons that have been mentioned.
- We use infer_schema() internally, but
- If somebody calls schema(one_unnamed_arg), dispatch to infer_schema() for interactive use. I too have typed schema(something) and done mild cursing because I couldn't remember what combination of $ and as_whatever() I needed.
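A minimal base R sketch of that proposal might look like the following. It is not the arrow implementation: `toy_dataset` and `create_schema()` are hypothetical stand-ins for the real classes and for Schema$create(), and the check for "one unnamed argument" is deliberately simplistic.

```r
# S3 generic that extracts a schema from an existing object.
infer_schema <- function(x, ...) UseMethod("infer_schema")

# Method for a hypothetical toy class standing in for Dataset,
# RecordBatchReader, and similar objects that carry a $schema.
infer_schema.toy_dataset <- function(x, ...) x$schema

# Stand-in for Schema$create(): collects named fields into a list.
create_schema <- function(...) {
  structure(list(fields = list(...)), class = "toy_schema")
}

# schema() stays an ordinary function, so its arguments are not
# evaluated by S3 dispatch before the function body sees them.
schema <- function(...) {
  dots <- list(...)
  # A single unnamed argument is treated as "extract the schema from this".
  if (length(dots) == 1 && is.null(names(dots))) {
    return(infer_schema(dots[[1]]))
  }
  # Anything else is treated as field definitions for a new schema.
  create_schema(...)
}

ds <- structure(
  list(schema = create_schema(x = "int32")),
  class = "toy_dataset"
)

extracted <- schema(ds)                          # extraction path
built <- schema(a = "int32", b = "string")       # construction path
```

Keeping the generic under a separate name means internal code can call infer_schema() directly, while schema() remains a convenience for interactive use.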
Thanks @paleolimbot, sounds like a good middle ground there, will give that a go.
ebde099 to ee09f8f
@@ -192,5 +192,6 @@ implicit_schema <- function(.data) {
       aggregate_types(.data, hash, old_schm)
     )
   }
-  schema(!!!new_fields)
+
+  schema(new_fields)
If you feel strongly about the ability to pass a bare list instead of !!!bare_list, I think this is a better fit for a separate PR: it is not related to the ability to use schema() on a Dataset/ArrowTabular/arrow_dplyr_query and is a departure from behaviour elsewhere in the package (e.g., none of record_batch(), arrow_table(), or struct() support this).
I don't understand your comment; this is functionality we already support rather than functionality I added?
I'm sorry, I had no idea it was an existing feature. Could it be in the documentation for ... in schema()?
Actually, yeah, I should totally add it to those docs (and perhaps an example too) - if it's not obvious to you as one of the main devs here, it won't be obvious to others either! Thanks for catching this!
r/R/schema.R (outdated)
    return(infer_schema(dots[[1]]))
  }

  Schema$create(...)
As above, if you feel strongly about this I think it is a better fit for another PR: our usage everywhere else in the package is to capture the dots once using list2() and use !!! if they need to be passed on to another function.
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Looks great! Thank you!
Conbench analyzed the 6 benchmark runs on commit There was 1 benchmark result indicating a performance regression:
The full Conbench report has more details.
Created on 2023-05-11 with reprex v2.0.2