GH-35542 : [R] Implement schema extraction function #35543

thisisnic · 2023-05-11T09:51:10Z

library(arrow)
mtcarrow <- arrow_table(mtcars)
schema(mtcarrow)
#> Schema
#> mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> 
#> See $metadata for additional Schema metadata

^{Created on 2023-05-11 with reprex v2.0.2}

Closes: [R] Implement schema extraction function #35542

github-actions · 2023-05-11T09:51:32Z

Closes: [R] Implement schema extraction function #35542

eitsupi · 2023-05-12T11:33:53Z

I am wondering whether the function name schema is too generic. How about changing it to something like arrow_schema?

nealrichardson · 2023-05-12T12:53:16Z

r/R/schema.R

+schema.RecordBatchReader <- function(x) x$schema
+
+#' @export
+schema.Dataset <- function(x) x$schema


Can you also add this for arrow_dplyr_query? Might involve calling implicit_schema, or x[["schema"]] %||% implicit_schema(x) or something.

thisisnic · 2023-05-12

Will do. Just to check, when do we end up with arrow_dplyr_query objects where x[["schema"]] isn't NULL?

nealrichardson · 2023-05-12

I seem to recall that when we collapse a query and call implicit_schema, it assigns $schema in the nested adq, but maybe I remember wrong.

paleolimbot · 2023-05-13

It's a tiny bit of a hack, but in nanoarrow I did:

nanoarrow:::infer_nanoarrow_schema.arrow_dplyr_query
#> function (x, ...)
#> {
#>     infer_nanoarrow_schema.RecordBatchReader(arrow::as_record_batch_reader(x))
#> }
#> <bytecode: 0x1076e60a8>
#> <environment: namespace:nanoarrow>

^{Created on 2023-05-12 with reprex v2.0.2}

(which works because as_record_batch_reader() on a query doesn't start evaluating but does record the calculated final schema)
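For readers following along, here is a minimal sketch of the approach described above, assuming the arrow and dplyr packages are installed and that the Arrow build includes the query engine. The query itself (`select()` plus `mutate()` on mtcars) and the names `query` and `reader` are illustrative assumptions, not taken from the PR.

```r
library(arrow)
library(dplyr)

# Build a lazy query; nothing is computed at this point.
query <- arrow_table(mtcars) |>
  select(mpg, cyl) |>
  mutate(ratio = mpg / cyl)

# Converting the query to a RecordBatchReader exposes the schema of the
# final result through the reader's $schema field.
reader <- as_record_batch_reader(query)
reader$schema
```

The schema printed here should list the two selected columns plus the derived `ratio` column.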

paleolimbot · 2023-05-12T13:43:13Z

> I am wondering whether the function name schema is too generic. How about changing it to something like arrow_schema?

I think that's a good point...particularly since it's an S3 method. infer_schema() or extract_schema() might also be good choices...we already have as_schema() for converting a schema-like object to a Schema. (FWIW I use infer_nanoarrow_schema() to do this kind of thing in nanoarrow).

nealrichardson · 2023-05-12T14:05:46Z

> > I am wondering whether the function name schema is too generic. How about changing it to something like arrow_schema?
>
> I think that's a good point...particularly since it's an S3 method. infer_schema() or extract_schema() might also be good choices...we already have as_schema() for converting a schema-like object to a Schema. (FWIW I use infer_nanoarrow_schema() to do this kind of thing in nanoarrow).

schema() exists as a function now (and has since the beginning of the package). This PR seems to be adding additional cases for using it to extract the $schema attribute from Arrow objects, and it's doing it via S3 methods rather than ifs. If we're worried about the name "schema" colliding with other packages, we would have seen it already. Has anyone?

For what it's worth, I'm personally not a fan of renaming to verb_schema(): it's pulling an attribute of the object, so it feels natural that the accessor have the same name as the attribute, as we have e.g. length.ArrowDatum <- function(x) x$length(). As a (👴) developer, I don't want to have to remember what verb_ goes in front, I'd rather just $schema.

paleolimbot · 2023-05-12T14:40:39Z

Yes, but then schema() both constructs a schema AND pulls the schema attribute...one function doing more than one thing is also confusing although there's certainly precedent for it in pre-tidyverse R.

The as_schema() method is perhaps more appropriate (and also perhaps easier to remember than "infer" or "extract")?

nealrichardson · 2023-05-12T14:50:37Z

> Yes, but then schema() both constructs a schema AND pulls the schema attribute...one function doing more than one thing is also confusing although there's certainly precedent for it in pre-tidyverse R.

Yeah I totally get your concern and this is why I didn't do this before, despite many times typing schema(ds) and then remembering that instead I had to type ds$schema to get the schema. I'm just not convinced (anymore) that it's less confusing than the status quo, where getter() exists for some attributes but not others.

> The as_schema() method is perhaps more appropriate (and also perhaps easier to remember than "infer" or "extract")?

I'm not sure that feels natural in these cases: as_schema(ds) feels like I'm converting the Dataset to something else, where what I really want is to access the Schema that the Dataset contains.

thisisnic · 2023-05-12T21:11:53Z

While I agree that it's not the best that schema() would do both construction and extraction here, the aim here was to come up with something intuitive for users that avoids having to use the $ for extraction, and I don't think that any of the alternatives quite fit that; I agree that the as_* functions sound more conversion than extraction. I just spent ~30 minutes browsing through docs for other packages (mainly tidyverse/tidymodels/r-lib etc) to see how they handle this kind of thing, but there aren't any good analogous examples.

I also considered whether there are other attributes we might want to implement extraction functions like this for, but I think schemas are the only case and so there's no general problem to solve. Given that multiple people (Neal, myself, and also Francois on the original ticket) report having tried to use schema() to extract a schema despite knowing that schema() is (also) the function for creating schemas, it might be the best intuitive option we have right now?

thisisnic · 2023-05-12T21:17:01Z

Though it looks like it's not possible to get this working sensibly as an S3 generic anyway, may need to redesign.

paleolimbot

First, I left a note about how you might be able to get the S3 to work, although there's no perfect way to go about it (notably, schema(!!!stuff) is going to be problematic because the first argument will always be evaluated by S3 dispatch, not by rlang).

Second, how about:

- The S3 generic is infer_schema(). We already have infer_type() (changed from type() for similar reasons)...as_schema() definitely feels wrong for all the reasons that have been mentioned.
- We use infer_schema() internally, but
- If somebody calls schema(one_unnamed_arg), dispatch to infer_schema() for interactive use. I too have typed schema(something) and done mild cursing because I couldn't remember what combination of $ and as_whatever() I needed.
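The proposal above can be sketched in base R without arrow at all. Everything in this block is a hypothetical stand-in (`infer_schema_sketch()`, `schema_sketch()`, and the `toy_table` and `toy_schema` classes are made up for illustration); it only shows the shape of "one unnamed argument means extract, anything else means construct", not how arrow implements it.

```r
# Toy S3 generic playing the role of infer_schema().
infer_schema_sketch <- function(x, ...) UseMethod("infer_schema_sketch")

# A toy "table-like" object simply carries its schema as a list element.
infer_schema_sketch.toy_table <- function(x, ...) x$schema

# Constructor-or-accessor: a single unnamed argument of a class that has an
# extraction method is treated as "extract"; anything else is treated as
# "construct from name = type pairs".
schema_sketch <- function(...) {
  dots <- list(...)
  is_single_unnamed <- length(dots) == 1 && is.null(names(dots))
  if (is_single_unnamed && inherits(dots[[1]], "toy_table")) {
    return(infer_schema_sketch(dots[[1]]))
  }
  structure(dots, class = "toy_schema")
}

sch <- schema_sketch(a = "int32", b = "string")            # constructs
tbl <- structure(list(schema = sch), class = "toy_table")
identical(schema_sketch(tbl), sch)                         # extracts
```

Keeping the generic separate from the user-facing function means the dots are captured once in the wrapper and dispatch only ever sees an already-evaluated object.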

r/R/schema.R

paleolimbot · 2023-05-13T00:59:09Z

r/R/schema.R

+schema.RecordBatchReader <- function(x) x$schema
+
+#' @export
+schema.Dataset <- function(x) x$schema


It's a tiny bit of a hack, but in nanoarrow I did:

nanoarrow:::infer_nanoarrow_schema.arrow_dplyr_query
#> function (x, ...)
#> {
#>     infer_nanoarrow_schema.RecordBatchReader(arrow::as_record_batch_reader(x))
#> }
#> <bytecode: 0x1076e60a8>
#> <environment: namespace:nanoarrow>

^{Created on 2023-05-12 with reprex v2.0.2}

(which works because as_record_batch_reader() on a query doesn't start evaluating but does record the calculated final schema)

thisisnic · 2023-05-13T13:19:36Z

Thanks @paleolimbot, sounds like a good middle ground there, will give that a go.

paleolimbot · 2023-06-21T15:49:02Z

r/R/dplyr-collect.R

@@ -192,5 +192,6 @@ implicit_schema <- function(.data) {
      aggregate_types(.data, hash, old_schm)
    )
  }
-  schema(!!!new_fields)
+
+  schema(new_fields)


If you feel strongly about the ability to pass a bare list instead of !!!bare_list I think this is a better fit for a separate PR: it is not related to the ability to use schema() on a Dataset/ArrowTabular/arrow_dplyr_query and is a departure from behaviour elsewhere in the package (e.g., none of record_batch(), arrow_table(), or struct() support this).

thisisnic · 2023-06-21

I don't understand your comment; this is functionality we already support rather than functionality I added?

paleolimbot · 2023-06-22

I'm sorry, I had no idea it was an existing feature. Could it be in the documentation for ... in schema()?

thisisnic · 2023-06-22

Actually, yeah, I should totally add it to those docs (and perhaps an example too) - if it's not obvious to you as one of the main devs here, it won't be obvious to others either! Thanks for catching this!
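As a small illustration of the behaviour discussed in this thread, the sketch below builds the same schema two ways, assuming the arrow package is installed. The field names `a` and `b` and their types are arbitrary examples, not taken from the PR.

```r
library(arrow)

# Fields supplied as a single bare named list...
fields <- list(a = int32(), b = string())
from_list <- schema(fields)

# ...and as individual name = type arguments.
from_dots <- schema(a = int32(), b = string())

# Compare the two schemas field by field.
from_list$Equals(from_dots)
```

Splicing the list with `schema(!!!fields)` is the other spelling mentioned in the review.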

paleolimbot · 2023-06-21T15:53:25Z

r/R/schema.R

+    return(infer_schema(dots[[1]]))
+  }
+
+  Schema$create(...)


As above, if you feel strongly about this I think it is a better fit for another PR: our usage everywhere else in the package is to capture the dots once using list2() and use !!! if they need to be passed on to another function.


paleolimbot

Looks great! Thank you!

conbench-apache-arrow · 2023-06-26T13:15:12Z

Conbench analyzed the 6 benchmark runs on commit 29a339f5.

There was 1 benchmark result indicating a performance regression:

Commit Run on ursa-thinkcentre-m75q at 2023-06-25 16:55:21Z
- params=1048576/1, source=cpp-micro, suite=arrow-acero-aggregate-benchmark

The full Conbench report has more details.

github-actions bot added Component: R awaiting committer review Awaiting committer review labels May 11, 2023

thisisnic mentioned this pull request May 11, 2023

[R] Split out the docs for our R6 objects from the user-facing APIs #35544

Open

thisisnic marked this pull request as ready for review May 11, 2023 10:07

thisisnic requested a review from paleolimbot as a code owner May 11, 2023 10:07

nealrichardson reviewed May 12, 2023

github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels May 12, 2023

github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 12, 2023

paleolimbot reviewed May 13, 2023

github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels May 13, 2023

thisisnic added 7 commits June 19, 2023 12:54

Implement schema extraction methods and split out docs

0c3bdf9

Add schema docs to pkgdown

583c9e7

Add test for schema extraction function

3294383

Remove !!!

2ed1951

Remove other unnecessary uses of !!!

ad86d69

Implement scheme.arrow_dplyr_query

c3da30a

Use infer_schema as the S3

ee09f8f

thisisnic force-pushed the extract_schema branch from ebde099 to ee09f8f Compare June 19, 2023 13:23

github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jun 21, 2023

thisisnic added 2 commits June 21, 2023 08:45

Add more tests

e8df44e

Fix typo in func name

5b1bc2b

github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 21, 2023

thisisnic requested a review from paleolimbot June 21, 2023 15:24

paleolimbot reviewed Jun 21, 2023

github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jun 21, 2023

paleolimbot reviewed Jun 21, 2023

Update r/R/schema.R

76706e6

Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>

github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 21, 2023

Skip if datasets not available

7f75aef

github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jun 21, 2023

Merge branch 'extract_schema' of github.com:thisisnic/arrow into extract_schema

7b069fa

github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Jun 22, 2023

Document the fact that schemas can accept lists

add41f8

github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 22, 2023

paleolimbot approved these changes Jun 22, 2023

github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Jun 22, 2023

thisisnic merged commit 29a339f into apache:main Jun 23, 2023
11 checks passed
