Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] handling of various blob representations #343

Closed
nbenn opened this issue Dec 22, 2023 · 5 comments · Fixed by #344
Closed

[R] handling of various blob representations #343

nbenn opened this issue Dec 22, 2023 · 5 comments · Fixed by #344

Comments

@nbenn
Copy link

nbenn commented Dec 22, 2023

There are several ways to handle "blob" data in a data.frame:

str(tbl1 <- data.frame(id = 1L, a = blob::as_blob(as.raw(0:10))))
#> 'data.frame':	1 obs. of  2 variables:
#>  $ id: int 1
#>  $ a : blob [1:1]
#>   ..$ : raw  00 01 02 03 ...
#>   ..@ ptype: raw
str(tbl2 <- data.frame(id = 1L, a = I(list(as.raw(0:10)))))
#> 'data.frame':	1 obs. of  2 variables:
#>  $ id: int 1
#>  $ a :List of 1
#>   ..$ : raw  00 01 02 03 ...
#>   ..- attr(*, "class")= chr "AsIs"
str(tbl3 <- tibble::tibble(id = 1L, a = list(as.raw(0:10))))
#> tibble [1 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ id: int 1
#>  $ a :List of 1
#>   ..$ : raw [1:11] 00 01 02 03 ...

of which only the first currently is accepted by nanoarrow. In DBI this is handled by

  1. stripping the AsIs class: https://github.com/r-dbi/DBI/blob/main/R/dbiDataType_AsIs.R
  2. handling lists that contain only raw vectors (and erroring otherwise): https://github.com/r-dbi/DBI/blob/main/R/dbiDataType_list.R

For now I'm doing this in adbi by implementing the generic for the two types i need (https://github.com/r-dbi/adbi/blob/main/R/nanoarrow.R).

Not sure how you feel about this. If you think this could also live in nanoarrow, I'm happy to submit a PR.

One problem with this is that it might pose a conflict with arbitrary nested types such as

str(tibble::tibble(id = 1L, a = list(list(a = 1, b = 2))))
#> tibble [1 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ id: int 1
#>  $ a :List of 1
#>   ..$ :List of 2
#>   .. ..$ a: num 1
#>   .. ..$ b: num 2

But then again, this could (and I believe currently is) handled via vctrs_list_of. Maybe bare lists are only allowed to contain raw vectors and this is some sort of "legacy" interface?

@krlmlr maybe DBI should change here to fix the root cause? Then again with that there's always backwards compatibility to worry about there.

@krlmlr
Copy link
Contributor

krlmlr commented Dec 22, 2023

Thanks. I don't understand "root cause". In what way do you propose that DBI should change?

I don't mind changing DBItest here and testing only a vctrs_list_of for list input.

@nbenn
Copy link
Author

nbenn commented Dec 22, 2023

@krlmlr Yeah, sorry, that's what I meant: Currently, being able to handle AsIs is enforced, for example in roundtrip_raw (this might also be the only example). If that test were to change, we could propose vctrs_list_of as the (more generic) alternative to blob::as_blob().

@paleolimbot
Copy link
Member

DBI aside, I do think that it's pretty reasonable to expect tibble::tibble(x = list(raw()) to convert to struct<binary> in nanoarrow. I forget the exact details of what exactly I need to change to make that work in nanoarrow but I don't think it's overly complicated.

tibble::tibble(x = list(raw())) |> nanoarrow::as_nanoarrow_array()
#> Error in infer_nanoarrow_schema.default(X[[i]], ...): Can't infer Arrow type for object of class list

For comparison, Arrow handles this fine:

tibble::tibble(x = list(raw())) |> arrow::as_arrow_array()
#> StructArray
#> <struct<x: binary>>
#> -- is_valid: all not null
#> -- child 0 type: binary
#>   [
#>     
#>   ]

Created on 2023-12-22 with reprex v2.0.2

@nbenn
Copy link
Author

nbenn commented Dec 22, 2023

@paleolimbot the following does the trick for me:

infer_nanoarrow_schema.AsIs <- function(x, ...) {
  # unfortunately NextMethod() goes directly to `default`
  class(x) <- class(x)[-1]
  infer_nanoarrow_schema(x)
}

infer_nanoarrow_schema.list <- function(x, ...) {

  is_raw <- vapply(x, is.raw, logical(1))

  if (!all(is_raw)) {
    stop("Only lists of raw vectors are currently supported", call. = FALSE)
  }

  if (length(x) > 0 && sum(lengths(x)) > .Machine$integer.max) {
    nanoarrow::na_large_binary()
  } else {
    nanoarrow::na_binary()
  }
}

where the list implementation would be simple in nanoarrow, as it's a copy of the blob implementation.

@paleolimbot
Copy link
Member

I'd ideally push that into C but will at the very least add the R version to the next nanoarrow release (early Jan).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants