[R] Filter with regular expressions #26296

Closed
asfimport opened this issue Oct 14, 2020 · 3 comments

Comments

asfimport commented Oct 14, 2020

Hi,

Some expressions, such as substr(), grepl(), or str_detect(), are not supported when filtering a dataset opened with open_dataset(). Specifically, the code below:

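library(dplyr)
library(arrow)
data <- data.frame(a = c("a", "a2", "a3"))
# Make sure the target directory exists before writing the file
dir.create("Test_filter", showWarnings = FALSE)
write_parquet(data, "Test_filter/data.parquet")
ds <- open_dataset("Test_filter/")
# substr() inside filter() on a Dataset triggers the error below
data_flt <- ds %>%
  filter(substr(a, 1, 1) == "a")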

gives this error:

Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == "a"
 Call collect() first to pull data into R.

These expressions would be very helpful, if not necessary, for filtering a very large dataset before collecting it. Is there anything that can be done to implement this feature?

Thank you.

Reporter: Pal

Related issues:

Note: This issue was originally created as ARROW-10305. Please see the migration documentation for further details.

Joris Van den Bossche / @jorisvandenbossche:
[~palgal] Thanks for opening the issue. Such a substring matching filter is indeed not yet implemented.

A first step to enable this would be to have a "compute kernel" for substrings (from looking at the overview at https://github.com/apache/arrow/blob/master/docs/source/cpp/compute.rst, I don't think we currently have functionality to create such substrings).
A related compute kernel is match_substring, with which you could check that (using your example) "a" is present in the string. But that doesn't guarantee anything about the position of the substring in the string (although with a regular expression pattern, you could achieve this in some ways).
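To illustrate the point about position, here is a minimal base R sketch (plain R vectors, not Arrow code). The extra value "ba" is an assumption added for illustration and is not part of the original example. It shows that a plain substring check accepts a match anywhere in the string, while a regular expression anchored with ^ gives the same answer as substr(a, 1, 1) == "a".

```r
# Example values from the issue, plus a hypothetical "ba" to show the difference
a <- c("a", "a2", "a3", "ba")

# Plain substring presence: matches "a" anywhere in the string
contains_a <- grepl("a", a, fixed = TRUE)

# Anchored regular expression: matches only when the string starts with "a"
starts_with_a <- grepl("^a", a)

# The original filter condition, for comparison
prefix_via_substr <- substr(a, 1, 1) == "a"

print(contains_a)        # TRUE TRUE TRUE TRUE
print(starts_with_a)     # TRUE TRUE TRUE FALSE
print(identical(starts_with_a, prefix_via_substr))  # TRUE
```

So a regex-capable matching kernel could stand in for this particular substr() filter, even without a dedicated substring kernel.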

Then, a second step would be to be able to "express" such a compute kernel in an Expression that can be used to filter the dataset (although this might not be needed for the dplyr syntax? It could maybe also be done with an actual compute filter kernel? cc @nealrichardson?).

Neal Richardson / @nealrichardson:
In terms of compute kernels, I think at least some of the pattern matching and extraction is happening in ARROW-10195.

Another missing piece, which @bkietz was writing up a JIRA for, is being able to create dataset expressions that call any arbitrary compute function.

Neal Richardson / @nealrichardson:
Marking this as resolved; many string operations will be included in the 4.0 release and any remaining ones have their own JIRAs.
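
For readers landing here later, a hedged sketch of how the original example might be written once those string operations are available. It assumes an arrow release (4.0 or newer) in which grepl() is translated inside filter() on a Dataset; the anchored pattern "^a" stands in for substr(a, 1, 1) == "a", and the temporary directory and the extra "b1" row are assumptions added for illustration.

```r
library(dplyr)
library(arrow)

# Same toy data as in the report, plus a hypothetical non-matching row "b1"
data <- data.frame(a = c("a", "a2", "a3", "b1"), stringsAsFactors = FALSE)

# Write into a temporary directory (assumed location, not from the report)
ds_dir <- file.path(tempdir(), "Test_filter")
dir.create(ds_dir, showWarnings = FALSE)
write_parquet(data, file.path(ds_dir, "data.parquet"))

ds <- open_dataset(ds_dir)

# Assumes grepl() is supported in Dataset filters in the installed arrow version;
# the filter is evaluated by Arrow and only matching rows are pulled into R
data_flt <- ds %>%
  filter(grepl("^a", a)) %>%
  collect()

print(data_flt)
```

If the installed version still rejects the expression, the fallback named in the error message applies: call collect() first and filter in R, at the cost of loading the full dataset into memory.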
