
feat(r): Implement extension type registration/conversion #288

Merged
merged 22 commits into apache:main from r-extension-types on Sep 1, 2023

Conversation

@paleolimbot (Member) commented Aug 24, 2023:

This PR implements extension type registration, enabling nanoarrow or other R packages to register conversions to and from R vectors for extension types. It also implements the "vctrs" extension type (also implemented in the Arrow R package), which allows any R vector whose storage type is supported by nanoarrow to be round-tripped through an Arrow array. It is also a good test of the extension mechanism, since it has diverse storage and R vector type requirements.

This required some changes to the conversion system: when calling as_nanoarrow_array(x, type = some_extension_type()), the extension needs the first chance at handling the conversion, and similarly, when calling convert_array(some_extension_array), the extension needs a chance to perform the conversion. The conversion process in both directions has become sufficiently complicated that it may need some refactoring; however, I think the intended logic and test coverage are there in this PR.

library(nanoarrow)

# Works from Arrow
vctr <- wk::wkt("POINT (0 1)", crs = "OGC:CRS84")

arrow_ext_array <- arrow::vctrs_extension_array(vctr)
(array <- as_nanoarrow_array(arrow_ext_array))
#> <nanoarrow_array arrow.r.vctrs{string}[1]>
#>  $ length    : int 1
#>  $ null_count: int 0
#>  $ offset    : int 0
#>  $ buffers   :List of 3
#>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   ..$ :<nanoarrow_buffer data_offset<int32>[2][8 b]> `0 11`
#>   ..$ :<nanoarrow_buffer data<string>[11 b]> `POINT (0 1)`
#>  $ dictionary: NULL
#>  $ children  : list()
infer_nanoarrow_ptype(array)
#> <wk_wkt[0] with CRS=OGC:CRS84>
convert_array(array)
#> <wk_wkt[1] with CRS=OGC:CRS84>
#> [1] POINT (0 1)

# Can also create vctrs extension arrays
array <- as_nanoarrow_array(
  vctr,
  schema = na_vctrs(vctr)
)
array
#> <nanoarrow_array arrow.r.vctrs{string}[1]>
#>  $ length    : int 1
#>  $ null_count: int 0
#>  $ offset    : int 0
#>  $ buffers   :List of 3
#>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   ..$ :<nanoarrow_buffer data_offset<int32>[2][8 b]> `0 11`
#>   ..$ :<nanoarrow_buffer data<string>[11 b]> `POINT (0 1)`
#>  $ dictionary: NULL
#>  $ children  : list()
infer_nanoarrow_ptype(array)
#> <wk_wkt[0] with CRS=OGC:CRS84>
convert_array(array)
#> <wk_wkt[1] with CRS=OGC:CRS84>
#> [1] POINT (0 1)

Created on 2023-08-29 with reprex v2.0.2
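
For illustration, a minimal sketch of what registering a custom extension type might look like with this mechanism. The extension name, spec class, and the exact signature of the registration helper below are assumptions, modeled loosely on the tests in this PR:

library(nanoarrow)

# Hypothetical spec object and registration call (names and signature assumed)
spec <- structure(list(), class = "some_spec_class")
register_nanoarrow_extension("some_ext", spec)

# S3 methods defined for the spec class then get the first chance at
# inferring the target R vector type and performing the conversion
infer_nanoarrow_ptype_extension.some_spec_class <- function(spec, x, ...) {
  character()
}

# Remove the registration when it is no longer needed (e.g., after a test)
unregister_nanoarrow_extension("some_ext")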

@codecov-commenter commented:

Codecov Report

Merging #288 (17eaf46) into main (4ace1a2) will decrease coverage by 0.19%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##             main     #288      +/-   ##
==========================================
- Coverage   86.97%   86.79%   -0.19%     
==========================================
  Files          66       64       -2     
  Lines       10322     9981     -341     
==========================================
- Hits         8978     8663     -315     
+ Misses       1344     1318      -26     
Files Changed            Coverage Δ
r/R/type-extension.R     0.00% <0.00%> (ø)

... and 3 files with indirect coverage changes


@paleolimbot marked this pull request as ready for review August 29, 2023 19:54
@paleolimbot (Member Author) commented:

@krlmlr I hate to ask for your time again; however, this PR would definitely benefit from your 👀 if you have the bandwidth to take a look!

@krlmlr (Contributor) left a comment:

Thanks, very exciting!

I only skimmed over it. I wonder if it's worth splitting out some of the refactorings and merging them individually.

Comment on lines +90 to +91
schema <- .Call(nanoarrow_c_infer_schema_array, array)
parsed <- .Call(nanoarrow_c_schema_parse, schema)
@krlmlr (Contributor):

Since this seems to throw away schema, perhaps offer a C entry point that combines both functions?

@paleolimbot (Member Author):

This is a good observation that exposes a larger design flaw in the current system: at the point that convert_array() is called, we pretty much need the parsed version of the schema (i.e., ArrowSchemaView) and the parsed version of the array (i.e., ArrowArrayView). I wrote the Python bindings to handle this from the ground up, but here there are a number of hacks like this one (and more than a few in C). I'm going to try to fix all of those in a more systematic way (but don't quite have the time to do it yet).
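
To illustrate the kind of consolidation being discussed, here is a hypothetical R-level helper that wraps the two existing entry points in one place (the helper name is illustrative; the actual suggestion is a combined C entry point):

# Illustrative only: perform both steps together so callers do not need to
# hold on to the intermediate schema object
infer_parsed_schema <- function(array) {
  schema <- .Call(nanoarrow_c_infer_schema_array, array)
  list(
    schema = schema,
    parsed = .Call(nanoarrow_c_schema_parse, schema)
  )
}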

Comment on lines 105 to 109
if (is.data.frame(values)) {
  return(values[indices + 1L, , drop = FALSE])
} else {
  return(values[indices + 1L])
}
@krlmlr (Contributor):

This is vec_slice():

Suggested change: replace

if (is.data.frame(values)) {
  return(values[indices + 1L, , drop = FALSE])
} else {
  return(values[indices + 1L])
}

with

return(vec_slice(values, indices + 1L))

@paleolimbot (Member Author):

This also exposes a larger design question: currently, nanoarrow does not require vctrs for most data types (it does require it for converting lists-of-things into R vectors, and I think that is a good thing). I will extract a utility function so that it is more obvious what this does and easier to switch if we take on a vctrs dependency.
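
A minimal sketch of what such a utility could look like without taking on a vctrs dependency (the function name is an assumption):

# Hypothetical helper: slice a data frame or vector by 1-based indices,
# mirroring vctrs::vec_slice() for the cases nanoarrow needs
slice_values <- function(values, indices) {
  if (is.data.frame(values)) {
    values[indices, , drop = FALSE]
  } else {
    values[indices]
  }
}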

  spec <- resolve_nanoarrow_extension(parsed$extension_name)
  return(convert_array_extension(spec, array, to, ...))
}

# Handle default dictionary conversion since it's the same for all types
dictionary <- array$dictionary

if (!is.null(dictionary)) {
  values <- .Call(nanoarrow_c_convert_array, dictionary, to)
  array$dictionary <- NULL
  indices <- .Call(nanoarrow_c_convert_array, array, integer())
@krlmlr (Contributor):

Perhaps enable this function to add an offset directly, to avoid the + 1L below?

@paleolimbot (Member Author):

That is definitely a good optimization...ideally there would be an offset + scale for all numerics (scale is used for timestamp and difftime conversion). I would like to move the conversion process toward something like get_converter(array, to) returning an object that could be constructed like int_vector_converter(offset = 1). I'm currently trying (for better or possibly worse) to provide full type coverage (and associated test coverage), at the expense of speed for the less frequently used types.
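
A rough sketch of that converter-factory idea, with all names and signatures hypothetical:

# Hypothetical converter factory: the returned converter applies an offset
# (e.g., +1 for dictionary indices) and an optional scale (e.g., for
# timestamp/difftime conversion) in one pass
int_vector_converter <- function(offset = 0L, scale = NULL) {
  function(array) {
    result <- .Call(nanoarrow_c_convert_array, array, integer())
    if (!is.null(scale)) {
      result <- result * scale
    }
    result + offset
  }
}

# get_converter(array, to) could then dispatch on the parsed schema and
# return, e.g., int_vector_converter(offset = 1) for dictionary indices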

Two additional review threads on r/R/extension-vctrs.R were resolved (one marked outdated).
    }
    break;
  default:
    Rf_error("Unhandled SEXP type in copy_vec_into()");
@krlmlr (Contributor):

CPLXSXP (complex vectors)?

posixlt = as.POSIXlt("2000-01-01 12:23", tz = "UTC"),
date = as.Date("2000-01-01"),
difftime = as.difftime(123, units = "secs"),
data_frame_simple = tibble::tibble(x = 1:5),
@krlmlr (Contributor):

Do you support packed columns (structs) here too?

@paleolimbot (Member Author):

Yes (I think what you're referring to is tested on the next line?)

data_frame_nested = tibble::tibble(x = 1:5, y = tibble::tibble(z = letters[1:5]))
)

for (nm in names(vectors)) {
@krlmlr (Contributor):

Experience from DBItest has taught me that you'll be much better off writing multiple test_that(...) blocks, one per element. Why not create a small script that generates this code for you?

Canonical reference: https://mtlynch.io/good-developers-bad-tests/ .

@paleolimbot (Member Author):

It's a good point...I wish testthat had better support for parameterized tests or something like GTest's SCOPED_TRACE(). I will see if there is any way to restructure this to improve reporting.
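
One possible middle ground, sketched with testthat and the round trip used elsewhere in these tests (the test description and expectations are illustrative): a loop that emits one test_that() block per element so failures report the element name.

for (nm in names(vectors)) {
  test_that(paste0("vctrs extension type round trip works for ", nm), {
    vctr <- vectors[[nm]]
    array <- as_nanoarrow_array(vctr, schema = na_vctrs(vctr))
    expect_identical(convert_array(array), vctr)
  })
}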

stream <- basic_array_stream(list(array, array))
expect_identical(convert_array_stream(stream), vctrs::vec_rep(vctr, 2))

if (requireNamespace("arrow", quietly = TRUE)) {
@krlmlr (Contributor):

This will also allow you to use skip_if_not_installed("arrow") here.
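
For reference, the testthat idiom being suggested (sketch only; the test description is illustrative):

test_that("conversion from arrow extension arrays works", {
  skip_if_not_installed("arrow")
  # arrow-dependent expectations go here
})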

)
on.exit(unregister_nanoarrow_extension("some_ext"))

infer_nanoarrow_ptype_extension.some_spec_class <- function(spec, x, ...) {
@krlmlr (Contributor):

There's rlang::local_bindings() .
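
A minimal sketch of that suggestion: bind the S3 method only for the duration of the calling scope so no manual cleanup is needed (the method body here is illustrative).

rlang::local_bindings(
  infer_nanoarrow_ptype_extension.some_spec_class = function(spec, x, ...) {
    character()
  }
)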

@paleolimbot (Member Author) left a comment:

Thank you!

@paleolimbot merged commit 65bb552 into apache:main Sep 1, 2023
13 checks passed
@paleolimbot deleted the r-extension-types branch September 1, 2023 15:40