feat(r): Add arrow package integration #85

Merged
merged 9 commits into from
Dec 13, 2023

Conversation

paleolimbot
Contributor

@paleolimbot paleolimbot commented Dec 6, 2023

This adds an arrow::ExtensionType implementation and a number of S3 methods to enable hopefully seamless interactions between the arrow and sf packages. There are almost certainly other methods that need to be added (maybe some for wk_wkt and friends in the wk package), but this should be a start.

library(arrow, warn.conflicts = FALSE)
library(geoarrow)
library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE

nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
(tbl <- as_arrow_table(nc))
#> Table
#> 100 rows x 15 columns
#> $AREA <double>
#> $PERIMETER <double>
#> $CNTY_ <double>
#> $CNTY_ID <double>
#> $NAME <string>
#> $FIPS <string>
#> $FIPSNO <double>
#> $CRESS_ID <int32>
#> $BIR74 <double>
#> $SID74 <double>
#> $NWBIR74 <double>
#> $BIR79 <double>
#> $SID79 <double>
#> $NWBIR79 <double>
#> $geometry: geoarrow.multipolygon <CRS <{>
#> $  "$schema" <"https://pro...>
#> 
#> See $metadata for additional Schema metadata
tbl$geometry$type
#> GeometryExtensionType
#> geoarrow.multipolygon <CRS: {
#>   "$schema": "https://pro...
sf::st_as_sf(tbl)
#> Simple feature collection with 100 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS:  NAD27
#> # A tibble: 100 × 15
#>     AREA PERIMETER CNTY_ CNTY_ID NAME  FIPS  FIPSNO CRESS_ID BIR74 SID74 NWBIR74
#>    <dbl>     <dbl> <dbl>   <dbl> <chr> <chr>  <dbl>    <int> <dbl> <dbl>   <dbl>
#>  1 0.114      1.44  1825    1825 Ashe  37009  37009        5  1091     1      10
#>  2 0.061      1.23  1827    1827 Alle… 37005  37005        3   487     0      10
#>  3 0.143      1.63  1828    1828 Surry 37171  37171       86  3188     5     208
#>  4 0.07       2.97  1831    1831 Curr… 37053  37053       27   508     1     123
#>  5 0.153      2.21  1832    1832 Nort… 37131  37131       66  1421     9    1066
#>  6 0.097      1.67  1833    1833 Hert… 37091  37091       46  1452     7     954
#>  7 0.062      1.55  1834    1834 Camd… 37029  37029       15   286     0     115
#>  8 0.091      1.28  1835    1835 Gates 37073  37073       37   420     0     254
#>  9 0.118      1.42  1836    1836 Warr… 37185  37185       93   968     4     748
#> 10 0.124      1.43  1837    1837 Stok… 37169  37169       85  1612     1     160
#> # ℹ 90 more rows
#> # ℹ 4 more variables: BIR79 <dbl>, SID79 <dbl>, NWBIR79 <dbl>,
#> #   geometry <MULTIPOLYGON [°]>

Created on 2023-12-06 with reprex v2.0.2

@paleolimbot paleolimbot marked this pull request as ready for review December 6, 2023 20:45
@paleolimbot
Contributor Author

@anthonynorth if you have the bandwidth I'd be grateful for any of the thoughts you have time to write here!


as_arrow_array.geoarrow_vctr <- function(x, ..., type = NULL) {
  chunked <- as_chunked_array.geoarrow_vctr(x, ..., type = type)
  if (chunked$num_chunks == 1) {
Contributor

Does arrow::as_arrow_array() copy if we have a single chunk, and is that what we're avoiding here?

Contributor Author

I am pretty sure that it does (which should be fixed in arrow, but I haven't gotten there yet!). I only found out recently that concat_arrays() is the only way to force a copy of an Arrow thing.
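The single-chunk branch discussed here can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: `chunked_to_array()` is a hypothetical helper name, and plain integer chunks stand in for geoarrow chunks.

```r
library(arrow, warn.conflicts = FALSE)

# Hypothetical helper (not from the PR): collapse a ChunkedArray to one Array
chunked_to_array <- function(chunked) {
  if (chunked$num_chunks == 1) {
    # A single chunk can be handed back as-is, with no concatenation step
    chunked$chunk(0)
  } else {
    # Several chunks are concatenated into one newly allocated Array
    do.call(arrow::concat_arrays, chunked$chunks)
  }
}

one <- chunked_to_array(arrow::chunked_array(1:3))
many <- chunked_to_array(arrow::chunked_array(1:3, 4:6))
```

Note that `$chunk()` uses zero-based indices, so `chunk(0)` is the first chunk.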

  }
}

as_chunked_array.geoarrow_vctr <- function(x, ..., type = NULL) {
Contributor

Should this be using indices?

Contributor Author

😬 definitely! I fixed this and added a test.

Comment on lines 12 to 21
  if (is.null(type)) {
    type <- arrow::as_data_type(attr(x, "schema", exact = TRUE))
    chunks <- attr(x, "chunks", exact = TRUE)
  } else {
    stream <- as_nanoarrow_array_stream(x, schema = as_nanoarrow_schema(type))
    chunks <- nanoarrow::collect_array_stream(stream, validate = FALSE)
    type <- arrow::as_data_type(type)
  }

  schema <- as_nanoarrow_schema(type)
Contributor

@anthonynorth anthonynorth Dec 11, 2023

We're using this conversion pattern in as_geoarrow_vctr() also. Should we extract it into a utility, or just convert x with as_geoarrow_vctr()? E.g.

Suggested change
- if (is.null(type)) {
-   type <- arrow::as_data_type(attr(x, "schema", exact = TRUE))
-   chunks <- attr(x, "chunks", exact = TRUE)
- } else {
-   stream <- as_nanoarrow_array_stream(x, schema = as_nanoarrow_schema(type))
-   chunks <- nanoarrow::collect_array_stream(stream, validate = FALSE)
-   type <- arrow::as_data_type(type)
- }
- schema <- as_nanoarrow_schema(type)
+ if (!is.null(type)) {
+   array <- arrow::as_chunked_array(as_geoarrow_vctr(x, schema = as_nanoarrow_schema(type)))
+   return(array)
+ }
+ schema <- attr(x, "schema", exact = TRUE)
+ type <- arrow::as_data_type(schema)
+ chunks <- attr(x, "chunks", exact = TRUE)

Contributor Author

The fact that slices are involved changed this a bit! I think there is more room for optimization (in particular, Arrow C++ doesn't import/export chunked arrays as array streams, which would help a lot!).

Comment on lines 22 to 31
  arrays <- vector("list", length(chunks))
  for (i in seq_along(arrays)) {
    tmp_schema <- nanoarrow::nanoarrow_allocate_schema()
    nanoarrow::nanoarrow_pointer_export(schema, tmp_schema)
    tmp_array <- nanoarrow::nanoarrow_allocate_array()
    nanoarrow::nanoarrow_pointer_export(chunks[[i]], tmp_array)
    arrays[[i]] <- arrow::Array$import_from_c(tmp_array, tmp_schema)
  }

  arrow::ChunkedArray$create(!!!arrays, type = type)
Contributor

I assume the following isn't equivalent? My lack of knowledge of arrow and nanoarrow internals is probably obvious with this question.

arrow::chunked_array(!!!chunks, type = type)

Contributor Author

We added chunked_array() pretty recently but it's definitely the suggested constructor!
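As a small illustration of the two constructors mentioned in this thread, the sketch below builds the same ChunkedArray both ways. It assumes the chunks are already arrow Arrays (plain integers here, not geoarrow data), which is not the case for the nanoarrow chunks in the PR.

```r
library(arrow, warn.conflicts = FALSE)

# Stand-in chunks: ordinary arrow Arrays of integers
chunks <- list(arrow::Array$create(1:3), arrow::Array$create(4:6))

# The R6-style constructor used in the diff
via_create <- arrow::ChunkedArray$create(!!!chunks, type = arrow::int32())

# The function-style constructor suggested in the review
via_function <- arrow::chunked_array(!!!chunks, type = arrow::int32())
```

Both calls splice the list with `!!!`, so each list element becomes one chunk.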

Contributor

I don't understand what these lines are for. Can we just create the array from chunks?

arrays <- vector("list", length(chunks))
for (i in seq_along(arrays)) {
  tmp_schema <- nanoarrow::nanoarrow_allocate_schema()
  nanoarrow::nanoarrow_pointer_export(schema, tmp_schema)
  tmp_array <- nanoarrow::nanoarrow_allocate_array()
  nanoarrow::nanoarrow_pointer_export(chunks[[i]], tmp_array)
  arrays[[i]] <- arrow::Array$import_from_c(tmp_array, tmp_schema)
}

}

st_as_sf.ArrowTabular <- function(x, ..., promote_multi = FALSE) {
  df <- tibble::as_tibble(x)
Contributor

Do we need a tibble?

Suggested change
- df <- tibble::as_tibble(x)
+ df <- as.data.frame(x)

Contributor Author

Good catch! I forgot that Arrow doesn't "depend" on tibble (although it returns them quite a lot of the time)
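A minimal sketch of the suggested conversion, using a small made-up table in place of the geometry table from the PR. Whether the result also carries the tibble class may depend on the installed arrow version, so the sketch only relies on it being a data frame.

```r
library(arrow, warn.conflicts = FALSE)

# Made-up two-column table standing in for an ArrowTabular input
tbl <- arrow::arrow_table(x = 1:3, y = c("a", "b", "c"))

# Base generic: no explicit call into the tibble package is needed
df <- as.data.frame(tbl)
is.data.frame(df)
```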

@anthonynorth
Contributor

Looks good. A couple of very minor things that might be improvements (see comments)

@paleolimbot paleolimbot merged commit a9293f3 into geoarrow:main Dec 13, 2023
6 checks passed
@paleolimbot paleolimbot deleted the r-arrow branch December 13, 2023 01:30