-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-16007: [R] grepl bindings return FALSE for NA inputs #12711
Conversation
Test behaviour of grepl, str_detect, startsWith, str_starts, endsWith, str_ends with NA.
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this really excellent PR + catch! The test additions that you made are fantastic, and a few nice cleanups along the way.
I've only got one substantive comment about str_starts
and str_ends
with the negate
option.
Thank you again!
r/R/dplyr-funcs-string.R
Outdated
if (negate) { | ||
out <- !out | ||
out <- create_string_match_expr( | ||
arrow_fun = "match_substring_regex", | ||
string = string, | ||
pattern = paste0(opts$pattern, "$"), | ||
ignore_case = opts$ignore_case, | ||
negate = negate | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I don't think we have tests that would cover this (though we should!) but what would happen if someone called str_ends("bar_foo", fixed("foo"), negate = TRUE)
? I think we might still need the separate if (negate) {...}
block here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh that's a great catch. I'll add a test and modify accordingly. It's probably worth taking that out of create_string_match_expr()
and just doing it explicitly in the three functions that need it, as before.
expect_equal( | ||
df %>% | ||
Table$create() %>% | ||
mutate( | ||
a = grepl("O", x, ignore.case = TRUE, fixed = TRUE) | ||
) %>% | ||
collect(), | ||
tibble( | ||
x = c("Foo", "bar", NA_character_), | ||
a = c(TRUE, FALSE, FALSE) | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expect_equal( | |
df %>% | |
Table$create() %>% | |
mutate( | |
a = grepl("O", x, ignore.case = TRUE, fixed = TRUE) | |
) %>% | |
collect(), | |
tibble( | |
x = c("Foo", "bar", NA_character_), | |
a = c(TRUE, FALSE, FALSE) | |
) | |
) | |
compare_dplyr_binding( | |
.input %>% | |
mutate( | |
a = grepl("O", x, ignore.case = TRUE, fixed = TRUE) | |
) %>% | |
collect(), | |
df | |
) |
This is more stylistic than anything else, but you should be able to swap out this code
This uses one of our custom expectations, which can be a little hairy at first, but test a few different routes (as a table, as a record batch) + confirm we're getting the same behavior as base R | the tidyverse functions we are binding
arrow/r/tests/testthat/helper-expectation.R
Lines 69 to 137 in acc6c2e
#' Ensure that dplyr methods on Arrow objects return the same as for data frames | |
#' | |
#' This function compares the output of running a dplyr expression on a tibble | |
#' or data.frame object against the output of the same expression run on | |
#' Arrow Table and RecordBatch objects. | |
#' | |
#' | |
#' @param expr A dplyr pipeline which must have `.input` as its start | |
#' @param tbl A tibble or data.frame which will be substituted for `.input` | |
#' @param skip_record_batch The skip message to show (if you should skip the | |
#' RecordBatch test) | |
#' @param skip_table The skip message to show (if you should skip the Table test) | |
#' @param warning The expected warning from the RecordBatch and Table comparison | |
#' paths, passed to `expect_warning()`. Special values: | |
#' * `NA` (the default) for ensuring no warning message | |
#' * `TRUE` is a special case to mean to check for the | |
#' "not supported in Arrow; pulling data into R" message. | |
#' @param ... additional arguments, passed to `expect_equal()` | |
compare_dplyr_binding <- function(expr, | |
tbl, | |
skip_record_batch = NULL, | |
skip_table = NULL, | |
warning = NA, | |
...) { | |
# Quote the contents of `expr` so that we can evaluate it a few different ways | |
expr <- rlang::enquo(expr) | |
# Get the expected output by evaluating expr on the .input data.frame using regular dplyr | |
expected <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(.input = tbl))) | |
if (isTRUE(warning)) { | |
# Special-case the simple warning: | |
warning <- "not supported in Arrow; pulling data into R" | |
} | |
skip_msg <- NULL | |
# Evaluate `expr` on a RecordBatch object and compare with `expected` | |
if (is.null(skip_record_batch)) { | |
expect_warning( | |
via_batch <- rlang::eval_tidy( | |
expr, | |
rlang::new_data_mask(rlang::env(.input = record_batch(tbl))) | |
), | |
warning | |
) | |
expect_equal(via_batch, expected, ...) | |
} else { | |
skip_msg <- c(skip_msg, skip_record_batch) | |
} | |
# Evaluate `expr` on a Table object and compare with `expected` | |
if (is.null(skip_table)) { | |
expect_warning( | |
via_table <- rlang::eval_tidy( | |
expr, | |
rlang::new_data_mask(rlang::env(.input = arrow_table(tbl))) | |
), | |
warning | |
) | |
expect_equal(via_table, expected, ...) | |
} else { | |
skip_msg <- c(skip_msg, skip_table) | |
} | |
if (!is.null(skip_msg)) { | |
skip(paste(skip_msg, collapse = "\n")) | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was impressed with that compare_dplyr_bindings
expectation. Very clever! I left it this way (explicitly comparing to the manually-created tibble) due to the comment here. Admittedly I didn't try to make it work with compare_dplyr_bindings
, just modified the existing test, but can have a go...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aaah, yup you're absolutely right. I missed that comment up there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that base grepl
gives that warning when ignore.case = TRUE
and fixed = TRUE
, should the binding for grepl
(as well as for sub
and gsub
) also emit that warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, yeah maybe. Would you mind opening a Jira for that (we can get some discussion there + implement it separately so as not to extend the scope of this PR too much!)
r/R/dplyr-funcs-string.R
Outdated
out <- call_binding("grepl", pattern = paste0("^", opts$pattern), x = string, fixed = FALSE) | ||
} | ||
|
||
if (negate) { | ||
out <- !out | ||
out <- create_string_match_expr( | ||
arrow_fun = "match_substring_regex", | ||
string = string, | ||
pattern = paste0("^", opts$pattern), | ||
ignore_case = opts$ignore_case, | ||
negate = negate | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing here about negate as str_ends
below (sorry the comments are backwards! I saw it there first 🙈 )
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
…r_ends. Add test for str_starts Addresses apache#12711 (comment)
I think I've addressed all of your comments now @jonkeane - let me know what you need me to do next! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. I'm going to wait for CI to finish and then merge. Thank you so much for the excellent PR!
My pleasure! Thank you for the help and tips... and kudos on the Arrow dev guide. It's really good. |
Benchmark runs are scheduled for baseline = bfa3bca and contender = 5bd4d8e. 5bd4d8e is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
This ensures that the
grepl
binding mimics R's basegrepl
by returningFALSE
forNA
inputs (previously it returnedNA
). As several other bindings called thegrepl
binding and we don't want thegrepl
behaviour withNA
to propagate to those bindings (they all returnNA
withNA
inputs), I had to change how they were constructed as well. I abstracted out the main parts of theregister_binding
for the string matching functions (those that return alogical
) into a helper functioncreate_string_match_expr()
, which is used by the bindings forgrepl
,str_detect
,str_starts
, andstr_ends
.I added several tests for
NA
behaviour - which failed before the changes and now pass (at least locally, will wait for CI to finish here)