Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-35542 : [R] Implement schema extraction function #35543

Merged
merged 19 commits into from Jun 23, 2023

Conversation

thisisnic
Copy link
Member

@thisisnic thisisnic commented May 11, 2023

library(arrow)
mtcarrow <- arrow_table(mtcars)
schema(mtcarrow)
#> Schema
#> mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> 
#> See $metadata for additional Schema metadata

Created on 2023-05-11 with reprex v2.0.2

@github-actions
Copy link

@eitsupi
Copy link
Contributor

eitsupi commented May 12, 2023

I am wondering if the function name schema is too generic and how about to be changed to something like arrow_schmea?

r/R/schema.R Outdated
schema.RecordBatchReader <- function(x) x$schema

#' @export
schema.Dataset <- function(x) x$schema
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add this for arrow_dplyr_query? Might involve calling implicit_schema, or x[["schema"]] %||% implicit_schema(x) or something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. Just to check, when do we end up with arrow_dplyr_query objects where x[["schema"]] isn't NULL?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I seem to recall that when we collapse a query and call implicit_schema, it assigns $schema in the nested adq, but maybe I remember wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a tiny bit of a hack, but in nanoarrow I did:

nanoarrow:::infer_nanoarrow_schema.arrow_dplyr_query
#> function (x, ...) 
#> {
#>     infer_nanoarrow_schema.RecordBatchReader(arrow::as_record_batch_reader(x))
#> }
#> <bytecode: 0x1076e60a8>
#> <environment: namespace:nanoarrow>

Created on 2023-05-12 with reprex v2.0.2

(which works because as_record_batch_reader() on a query doesn't start evaluating but does record the calculated final schema)

@paleolimbot
Copy link
Member

paleolimbot commented May 12, 2023

I am wondering if the function name schema is too generic and how about to be changed to something like arrow_schmea?

I think that's a good point...particularly since it's an S3 method. infer_schema() or extract_schema() might also be good choices...we already have as_schema() for converting a schema-like object to a Schema. (FWIW I use infer_nanoarrow_schema() to do this kind of thing in nanoarrow).

@nealrichardson
Copy link
Contributor

I am wondering if the function name schema is too generic and how about to be changed to something like arrow_schmea?

I think that's a good point...particularly since it's an S3 method. infer_schema() or extract_schema() might also be good choices...we already have as_schema() for converting a schema-like object to a Schema. (FWIW I use infer_nanoarrow_schema() to do this kind of thing in nanoarrow).

schema() exists as a function now (and has since the beginning of the package), this PR seems to be adding additional cases for using it to extract the $schema attribute from Arrow objects, and it's doing it via S3 methods rather than ifs. If we're worried about the name "schema" colliding with other packages, we would have already seen it already--has anyone?

For what it's worth, I'm personally not a fan of renaming to verb_schema(): it's pulling an attribute of the object, so it feels natural that the accessor have the same name as the attribute, as we have e.g. length.ArrowDatum <- function(x) x$length(). As a (👴) developer, I don't want to have to remember what verb_ goes in front, I'd rather just $schema.

@paleolimbot
Copy link
Member

Yes, but then schema() both constructs a schema AND pulls the schema attribute...one function doing more than one thing is also confusing although there's certainly precedent for it in pre-tidyverse R.

The as_schema() method is perhaps more appropriate (and also perhaps easier to remember than "infer" or "extract")?

@nealrichardson
Copy link
Contributor

Yes, but then schema() both constructs a schema AND pulls the schema attribute...one function doing more than one thing is also confusing although there's certainly precedent for it in pre-tidyverse R.

Yeah I totally get your concern and this is why I didn't do this before, despite many times typing schema(ds) and then remembering that instead I had to type ds$schema to get the schema. I'm just not convinced (anymore) that it's less confusing than the status quo, where getter() exists for some attributes but not others.

The as_schema() method is perhaps more appropriate (and also perhaps easier to remember than "infer" or "extract")?

I'm not sure that feels natural in these cases: as_schema(ds) feels like I'm converting the Dataset to something else, where what I really want is to access the Schema that the Dataset contains.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels May 12, 2023
@thisisnic
Copy link
Member Author

While I agree that it's not the best that schema() would do both construction and extraction here, the aim here was to come up with something intuitive for users that avoids having to use the $ for extraction, and I don't think that any of the alternatives quite fit that; I agree that the as_* functions sound more conversion than extraction. I just spent ~30 minutes browsing through docs for other packages (mainly tidyverse/tidymodels/r-lib etc) to see how they handle this kind of thing, but there aren't any good analogous examples.

I also considered whether there are other attributes we might want to implement extraction functions like this for, but I think schemas are the only case and so here's no general problem to solve. Given that multiple people (Neal, myself, and also Francois on the original ticket) report having tried to use schema() to extract a schema despite knowing that schema() is (also) the function for creating schemas, it might be the best intuitive option we have right now?

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 12, 2023
@thisisnic
Copy link
Member Author

Though it looks like it's not possible to get this working sensibly as an S3 generic anyway, may need to redesign.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, I left a note about how you might be able to get the S3 to work although there's no perfect way to go about it (notably, schema(!!!stuff) is going to be problematic because the first argument will always be evaluated by not rlang).

Second, how about:

  • The S3 generic is infer_schema(). We already have infer_type() (changed from type() for similar reasons)...as_schema() definitely feels wrong for all the reasons that have been mentioned.
  • We use infer_schema() internally, but
  • If somebody calls schema(one_unnamed_arg), dispatch to infer_schema() for interactive use. I too have typed schema(something) and done mild cursing because I couldn't remember what combination of $ and as_whatever() I needed.

r/R/schema.R Outdated Show resolved Hide resolved
r/R/schema.R Outdated
schema.RecordBatchReader <- function(x) x$schema

#' @export
schema.Dataset <- function(x) x$schema
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a tiny bit of a hack, but in nanoarrow I did:

nanoarrow:::infer_nanoarrow_schema.arrow_dplyr_query
#> function (x, ...) 
#> {
#>     infer_nanoarrow_schema.RecordBatchReader(arrow::as_record_batch_reader(x))
#> }
#> <bytecode: 0x1076e60a8>
#> <environment: namespace:nanoarrow>

Created on 2023-05-12 with reprex v2.0.2

(which works because as_record_batch_reader() on a query doesn't start evaluating but does record the calculated final schema)

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels May 13, 2023
@thisisnic
Copy link
Member Author

Thanks @paleolimbot, sounds like a good middle ground there, will give that a go.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jun 21, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 21, 2023
@@ -192,5 +192,6 @@ implicit_schema <- function(.data) {
aggregate_types(.data, hash, old_schm)
)
}
schema(!!!new_fields)

schema(new_fields)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you feel strongly about the ability to pass a bare list instead of !!!bare_list I think this is a better fit for a separate PR: it is not related to the ability to use schema() on a Dataset/ArrowTabular/arrow_dplyr_query and is a departure from behaviour elsewhere in the package (e.g., none of record_batch(), arrow_table(), or struct() support this).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand your comment; this is functionality we already support rather than functionality I added?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, I had no idea it was an existing feature. Could it be in the documentation for ... in schema()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, yeah, I should totally add it to those docs (and perhaps an example too) - if it's not obvious to you as one of the main devs here, it won't be obvious to others either! Thanks for catching this!

r/R/schema.R Outdated
return(infer_schema(dots[[1]]))
}

Schema$create(...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, if you feel strongly about this I think it is a better fit for another PR: our usage everywhere else in the package is to capture the dots once using list2() and use !!! if they need to be passed on to another function.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jun 21, 2023
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 21, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jun 21, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Jun 22, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 22, 2023
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thank you!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Jun 22, 2023
@thisisnic thisisnic merged commit 29a339f into apache:main Jun 23, 2023
11 checks passed
@conbench-apache-arrow
Copy link

Conbench analyzed the 6 benchmark runs on commit 29a339f5.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[R] Implement schema extraction function
4 participants