[R] How to filter array columns? #31991
Comments
Will Jones / @wjones127: Is the following example helpful?
library(arrow)
library(dplyr)
# Keep rows of `tab` where any element of `tab$x` is in `valid`
valid <- Array$create(c(2))
tab <- arrow_table(
x = Array$create(list(c(1, 2), c(3, 2), c(1, 3))),
y = Array$create(c("a", "b", "c"))
)
tab_exploded <- arrow_table(
i = call_function("list_parent_indices", tab$x),
x_flat = call_function("list_flatten", tab$x)
)
to_keep <- tab_exploded %>%
group_by(i) %>%
summarise(keep = any(x_flat %in% valid)) %>%
compute() %>%
.$keep
res <- tab[to_keep,]
as_tibble(res)
#> # A tibble: 2 × 2
#> x y
#> <list<double>> <chr>
#> 1 [2] a
#> 2 [2] b
res$x
#> ChunkedArray
#> [
#> [
#> [
#> 1,
#> 2
#> ],
#> [
#> 3,
#> 2
#> ]
#> ]
#> ]
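To make the explode-and-regroup idea in the example above easier to follow, here is a minimal base-R sketch of the same logic on an ordinary in-memory list. It does not use arrow at all; `rep()` and `unlist()` only stand in, as rough analogies, for what `list_parent_indices` and `list_flatten` return.

```r
# Base-R sketch of the explode-and-regroup idea (no arrow required).
x <- list(c(1, 2), c(3, 2), c(1, 3))
valid <- c(2)

# Zero-based parent row index for every flattened element,
# analogous to the output of list_parent_indices.
i <- rep(seq_along(x) - 1L, times = lengths(x))

# All list elements in one flat vector, analogous to list_flatten.
x_flat <- unlist(x)

# One logical per parent row: does any element of that row appear in `valid`?
keep <- as.vector(tapply(x_flat %in% valid, i, any))
keep
#> [1]  TRUE  TRUE FALSE
```

Note that a row whose list is empty or NULL contributes no flattened elements, so it gets no entry in `i` and `keep` ends up shorter than the number of rows. This sketch therefore only lines up with the table when every row has at least one element.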
Will Jones / @wjones127: x_flat = call_function("list_flatten", tab$x)
Vladimir: As I understand,
tab <- arrow_table(
x = Array$create(list(c(1, 2), 1, NULL)),
y = Array$create(c("a", "b", "c"))
)
tab_exploded <- arrow_table(
i = call_function("list_parent_indices", tab$x, options = list(skip_nulls = "false")),
x_flat = call_function("list_flatten", tab$x)
)
Will Jones / @wjones127: It does make this set filtering more awkward though; this probably deserves its own function. A different approach might be to construct the indices to exclude:
library(arrow)
library(dplyr)
# Keep rows of `tab` where `tab$x` contains no value in `exclude`
exclude <- Array$create(c(1))
tab <- arrow_table(
x = Array$create(list(c(1, 2), c(3, 2), c(), c(1, 3))),
y = Array$create(c("a", "b", "c", "d"))
)
tab_exploded <- arrow_table(
i = call_function("list_parent_indices", tab$x),
x_flat = call_function("list_flatten", tab$x)
#x_flat = tab$x$chunk(0)$values()
)
to_drop <- tab_exploded %>%
group_by(i) %>%
summarise(to_drop = any((x_flat %in% exclude))) %>%
filter(to_drop) %>%
compute() %>%
.$i
# `i` is zero-based, so add 1 before comparing with R's one-based row numbers
selection <- !(seq_len(nrow(tab)) %in% as.vector(to_drop + 1))
res <- tab[selection,]
as_tibble(res)
#> # A tibble: 2 × 2
#> x y
#> <list<double>> <chr>
#> 1 [2] b
#> 2 c
res$x
#> ChunkedArray
#> [
#> [
#> [
#> 3,
#> 2
#> ],
#> null
#> ]
#> ]
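Since the comment above suggests this filtering deserves its own function, the exclusion approach could be wrapped roughly as follows. This is only a sketch: `drop_rows_with_any` is a hypothetical name, not an arrow function, and it pulls the flattened values into R memory, which is an assumption that will not hold for very large tables.

```r
library(arrow)

# Hypothetical helper (not part of arrow): drop rows whose list column `col`
# contains at least one value from `exclude`.
drop_rows_with_any <- function(tab, col, exclude) {
  lists <- tab[[col]]
  # Zero-based parent row of each flattened element; null lists yield none.
  i <- as.vector(call_function("list_parent_indices", lists))
  # Flattened element values, pulled into R memory (assumes they fit).
  x_flat <- as.vector(call_function("list_flatten", lists))
  # Convert the zero-based parents of matching elements to one-based row numbers.
  bad_rows <- unique(i[x_flat %in% exclude]) + 1
  tab[!(seq_len(nrow(tab)) %in% bad_rows), ]
}

tab <- arrow_table(
  x = Array$create(list(c(1, 2), c(3, 2), NULL, c(1, 3))),
  y = Array$create(c("a", "b", "c", "d"))
)
res <- drop_rows_with_any(tab, "x", exclude = c(1))
as.vector(res$y)
```

Because rows are selected by exclusion, a row with a null list is kept: it produces no flattened elements and so can never match a value in `exclude`.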
Vladimir: However, there is an issue applying it to a real dataset. It looks like the
I've also tried the other way to filter the records - using
So probably currently, there is no simple way to perform this filtering in one pass, and we need to split the data into chunks.
It would be pretty cool if one day
PS. The dataset we are working on is openly accessible - it's GBIF occurrence records. It has almost 2 billion records, and some of the records could have ~10 flags (column
In the Parquet data we have, there is a column with an array data type (list<array_element>) that flags records with various issues. Multiple values can be stored in the column for each record, for example [A, B, C].
I'm trying to perform a data filtering step and exclude some flagged records.
Filtering is trivial for regular columns that contain just a single value. E.g.,
Given the array column, is it possible to exclude records with at least one of the flags from flags_to_exclude using the arrow R package?
I really appreciate any advice you can provide!
Reporter: Will Jones / @wjones127
Assignee: Will Jones / @wjones127
Note: This issue was originally created as ARROW-16641. Please see the migration documentation for further details.