Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] stringr modifier functions cannot be called with namespace prefix #36720

Closed
giocomai opened this issue Jul 17, 2023 · 3 comments · Fixed by #36758
Closed

[R] stringr modifier functions cannot be called with namespace prefix #36720

giocomai opened this issue Jul 17, 2023 · 3 comments · Fixed by #36758

Comments

@giocomai
Copy link

Describe the bug, including details regarding any error messages, version, and platform.

When using stringr's str_detect() and str_count(), stringr's own documentation recommends to use stringr::regex() and stringr::fixed() "for finer control of the matching behaviour."

This can be used, for example, to set "ignore_case" to TRUE, which is not available as an argument to str_detect() directly.

The resulting functions have the following structure:

stringr::str_detect(
  string = "eXample",
  pattern = stringr::regex("x", ignore_case = TRUE)
)
#> [1] TRUE

Unfortunately, arguments passed via stringr::regex() and stringr::fixed() are silently ignored by arrow, which leads to unexpected and quite possibly wrong results.

If one prints the arrow call, it is possible to see that indeed even if ignore_case is set to TRUE, the call is passed with ignore_case as FALSE.

bool (match_substring_regex(text, {pattern="x", ignore_case=false}))

I suppose arrow should either get this right, or throw an error.

The following reprex (run with arrow version 12.0.1) shows:

  • how the ignore.case argument works nicely when passed via the base function grepl
  • how it is simply ignored when passed to stringr::str_detect(), stringr::str_count() (and possibly other stringr functions) through stringr::regex() and stringr::str_detect()
  • how it works nicely if the ignore_case is passed directly in the pattern with (?i)
  • how arrow throws an error when using stringi::stri_detect_regex() (rather than stringr) with case_insensitive = TRUE (which is still preferrable to ignoring the argument silently).

There are obviously many workarounds, but this has led to errors when I applied functions that were not originally written and tested with arrow in mind.

library("arrow")
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

apple_df <- tibble::tibble(
  text = c(
    "apple",
    "APPLE"
  )
)

arrow::write_dataset(dataset = apple_df, path = "apple.parquet")

apple_parquet <- arrow::open_dataset(sources = "apple.parquet")



## with grepl, it works

apple_parquet |>
  dplyr::mutate(
    a_check = grepl(
      x = text,
      pattern = "a",
      ignore.case = TRUE
    )
  )
#> FileSystemDataset (query)
#> text: string
#> a_check: bool (if_else(is_null(match_substring_regex(text, {pattern="a", ignore_case=true}), {nan_is_null=true}), false, match_substring_regex(text, {pattern="a", ignore_case=true})))
#> 
#> See $.data for the source Arrow object

apple_parquet |>
  dplyr::mutate(
    a_check = grepl(x = text, pattern = "a", ignore.case = TRUE)
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 2
#>   text  a_check
#>   <chr> <lgl>  
#> 1 apple TRUE   
#> 2 APPLE TRUE


## with stringr::str_detect it does not work

apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = "a"
    )
  )
#> FileSystemDataset (query)
#> text: string
#> a_check: bool (match_substring_regex(text, {pattern="a", ignore_case=false}))
#> 
#> See $.data for the source Arrow object


apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = stringr::regex(
        pattern = "a",
        ignore_case = TRUE
      )
    )
  )
#> FileSystemDataset (query)
#> text: string
#> a_check: bool (match_substring_regex(text, {pattern="a", ignore_case=false}))
#> 
#> See $.data for the source Arrow object


apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = stringr::regex(
        pattern = "a",
        ignore_case = TRUE
      )
    ),
    p_count = stringr::str_count(
      string = text,
      pattern = stringr::regex(
        pattern = "p",
        ignore_case = TRUE
      )
    )
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 3
#>   text  a_check p_count
#>   <chr> <lgl>     <int>
#> 1 apple TRUE          2
#> 2 APPLE FALSE         0

## Same result with stringr::fixed


apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = stringr::fixed(
        pattern = "a",
        ignore_case = TRUE
      )
    ),
    p_count = stringr::str_count(
      string = text,
      pattern = stringr::fixed(
        pattern = "p",
        ignore_case = TRUE
      )
    )
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 3
#>   text  a_check p_count
#>   <chr> <lgl>     <int>
#> 1 apple TRUE          2
#> 2 APPLE FALSE         0

## it works nicely just including the case insensitive in the regex

apple_parquet |>
  dplyr::mutate(
    a_check = stringr::str_detect(
      string = text,
      pattern = "(?i)a"
    ),
    p_count = stringr::str_count(
      string = text,
      pattern = "(?i)p"
    )
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 3
#>   text  a_check p_count
#>   <chr> <lgl>     <int>
#> 1 apple TRUE          2
#> 2 APPLE TRUE          2



## With stringi

apple_df |>
  dplyr::mutate(
    a_check = stringi::stri_detect_regex(
      str = text,
      pattern = "a",
      case_insensitive = TRUE
    )
  ) |>
  dplyr::collect()
#> # A tibble: 2 × 2
#>   text  a_check
#>   <chr> <lgl>  
#> 1 apple TRUE   
#> 2 APPLE TRUE



apple_parquet |>
  dplyr::mutate(
    a_check = stringi::stri_detect_regex(
      str = text,
      pattern = "a",
      case_insensitive = TRUE
    )
  ) |>
  dplyr::collect()
#> Error: Expression stringi::stri_detect_regex(str = text, pattern = "a", case_insensitive = TRUE) not supported in Arrow
#> Call collect() first to pull data into R.

Created on 2023-07-17 with reprex v2.0.2

Component(s)

R

@thisisnic
Copy link
Member

Thanks for reporting this @giocomai!

In this first example, I think it's how we implemented stringr::regex() - we need to update the code to work both with and without the stringr prefix:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

df <- tibble::tibble(x =  "eXample")

# dplyr
df %>% 
  mutate(y = stringr::str_detect(x, pattern = stringr::regex("x", ignore_case = TRUE)))
#> # A tibble: 1 × 2
#>   x       y    
#>   <chr>   <lgl>
#> 1 eXample TRUE

# doesn't work
arrow_table(df) %>%
  mutate(y = stringr::str_detect(x, pattern = stringr::regex("x", ignore_case = TRUE))) %>%
  collect()
#> # A tibble: 1 × 2
#>   x       y    
#>   <chr>   <lgl>
#> 1 eXample FALSE

# works
arrow_table(df) %>%
  mutate(y = stringr::str_detect(x, pattern = regex("x", ignore_case = TRUE))) %>%
  collect()
#> # A tibble: 1 × 2
#>   x       y    
#>   <chr>   <lgl>
#> 1 eXample TRUE

Created on 2023-07-18 with reprex v2.0.2

@thisisnic
Copy link
Member

We have implemented fixed() similarly, so I'd expect similar results - works without prefix but not with - we should fix this.

@thisisnic
Copy link
Member

We don't have bindings for stringi::str_detect_regex() in arrow (what we do support can be found here), which is the source of the error message for that function call.

@thisisnic thisisnic changed the title [R] Inconsistent results with stringr::str_detect and str_count, when case_insensitive is set to TRUE via stringr::regex or stringr::fixed [R] stringr modifier functions cannot be called with namespace prefix Jul 18, 2023
thisisnic added a commit that referenced this issue Jul 20, 2023
…ace prefix (#36758)

### Rationale for this change

Bug in implementation of string modified functions caused them to be swallowed when prefixed with stringr namespace

### What changes are included in this PR?

Strips out the `stringr::` prefix when expressions contain `stringr::fixed`, `stringr::coll`, `string::boundary` or `stringr::regex`

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes
* Closes: #36720

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
@thisisnic thisisnic modified the milestones: 13.0.0, 14.0.0 Jul 20, 2023
chelseajonesr pushed a commit to chelseajonesr/arrow that referenced this issue Jul 20, 2023
…namespace prefix (apache#36758)

### Rationale for this change

Bug in implementation of string modified functions caused them to be swallowed when prefixed with stringr namespace

### What changes are included in this PR?

Strips out the `stringr::` prefix when expressions contain `stringr::fixed`, `stringr::coll`, `string::boundary` or `stringr::regex`

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes
* Closes: apache#36720

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this issue Aug 20, 2023
…namespace prefix (apache#36758)

### Rationale for this change

Bug in implementation of string modified functions caused them to be swallowed when prefixed with stringr namespace

### What changes are included in this PR?

Strips out the `stringr::` prefix when expressions contain `stringr::fixed`, `stringr::coll`, `string::boundary` or `stringr::regex`

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes
* Closes: apache#36720

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…namespace prefix (apache#36758)

### Rationale for this change

Bug in implementation of string modified functions caused them to be swallowed when prefixed with stringr namespace

### What changes are included in this PR?

Strips out the `stringr::` prefix when expressions contain `stringr::fixed`, `stringr::coll`, `string::boundary` or `stringr::regex`

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes
* Closes: apache#36720

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants