Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-16007: [R] grepl bindings return FALSE for NA inputs #12711

Closed
wants to merge 5 commits into from

Conversation

ateucher
Copy link
Contributor

This ensures that the grepl binding mimics R's base grepl by returning FALSE for NA inputs (previously it returned NA). As several other bindings called the grepl binding and we don't want the grepl behaviour with NA to propagate to those bindings (they all return NA with NA inputs), I had to change how they were constructed as well. I abstracted out the main parts of the register_binding for the string matching functions (those that return a logical) into a helper function create_string_match_expr(), which is used by the bindings for grepl, str_detect, str_starts, and str_ends.

I added several tests for NA behaviour - which failed before the changes and now pass (at least locally, will wait for CI to finish here)

Test behaviour of grepl, str_detect, startsWith, str_starts, endsWith, str_ends with NA.
@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has no components in JIRA, make sure you assign one.

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this really excellent PR + catch! The test additions that you made are fantastic, and a few nice cleanups along the way.

I've only got one substantive comment about str_starts and str_endswith the negate option.

Thank you again!

Comment on lines 277 to 294
if (negate) {
out <- !out
out <- create_string_match_expr(
arrow_fun = "match_substring_regex",
string = string,
pattern = paste0(opts$pattern, "$"),
ignore_case = opts$ignore_case,
negate = negate
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't think we have tests that would cover this (though we should!) but what would happen if someone called str_ends("bar_foo", fixed("foo"), negate = TRUE)? I think we might still need the separate if (negate) {...} block here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's a great catch. I'll add a test and modify accordingly. It's probably worth taking that out of create_string_match_expr() and just doing it explicitly in the three functions that need it, as before.

Comment on lines +261 to +272
expect_equal(
df %>%
Table$create() %>%
mutate(
a = grepl("O", x, ignore.case = TRUE, fixed = TRUE)
) %>%
collect(),
tibble(
x = c("Foo", "bar", NA_character_),
a = c(TRUE, FALSE, FALSE)
)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
expect_equal(
df %>%
Table$create() %>%
mutate(
a = grepl("O", x, ignore.case = TRUE, fixed = TRUE)
) %>%
collect(),
tibble(
x = c("Foo", "bar", NA_character_),
a = c(TRUE, FALSE, FALSE)
)
)
compare_dplyr_binding(
.input %>%
mutate(
a = grepl("O", x, ignore.case = TRUE, fixed = TRUE)
) %>%
collect(),
df
)

This is more stylistic than anything else, but you should be able to swap out this code

This uses one of our custom expectations, which can be a little hairy at first, but test a few different routes (as a table, as a record batch) + confirm we're getting the same behavior as base R | the tidyverse functions we are binding

#' Ensure that dplyr methods on Arrow objects return the same as for data frames
#'
#' This function compares the output of running a dplyr expression on a tibble
#' or data.frame object against the output of the same expression run on
#' Arrow Table and RecordBatch objects.
#'
#'
#' @param expr A dplyr pipeline which must have `.input` as its start
#' @param tbl A tibble or data.frame which will be substituted for `.input`
#' @param skip_record_batch The skip message to show (if you should skip the
#' RecordBatch test)
#' @param skip_table The skip message to show (if you should skip the Table test)
#' @param warning The expected warning from the RecordBatch and Table comparison
#' paths, passed to `expect_warning()`. Special values:
#' * `NA` (the default) for ensuring no warning message
#' * `TRUE` is a special case to mean to check for the
#' "not supported in Arrow; pulling data into R" message.
#' @param ... additional arguments, passed to `expect_equal()`
compare_dplyr_binding <- function(expr,
tbl,
skip_record_batch = NULL,
skip_table = NULL,
warning = NA,
...) {
# Quote the contents of `expr` so that we can evaluate it a few different ways
expr <- rlang::enquo(expr)
# Get the expected output by evaluating expr on the .input data.frame using regular dplyr
expected <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(.input = tbl)))
if (isTRUE(warning)) {
# Special-case the simple warning:
warning <- "not supported in Arrow; pulling data into R"
}
skip_msg <- NULL
# Evaluate `expr` on a RecordBatch object and compare with `expected`
if (is.null(skip_record_batch)) {
expect_warning(
via_batch <- rlang::eval_tidy(
expr,
rlang::new_data_mask(rlang::env(.input = record_batch(tbl)))
),
warning
)
expect_equal(via_batch, expected, ...)
} else {
skip_msg <- c(skip_msg, skip_record_batch)
}
# Evaluate `expr` on a Table object and compare with `expected`
if (is.null(skip_table)) {
expect_warning(
via_table <- rlang::eval_tidy(
expr,
rlang::new_data_mask(rlang::env(.input = arrow_table(tbl)))
),
warning
)
expect_equal(via_table, expected, ...)
} else {
skip_msg <- c(skip_msg, skip_table)
}
if (!is.null(skip_msg)) {
skip(paste(skip_msg, collapse = "\n"))
}
}

Copy link
Contributor Author

@ateucher ateucher Mar 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was impressed with that compare_dplyr_bindings expectation. Very clever! I left it this way (explicitly comparing to the manually-created tibble) due to the comment here. Admittedly I didn't try to make it work with compare_dplyr_bindings, just modified the existing test, but can have a go...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aaah, yup you're absolutely right. I missed that comment up there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that base grepl gives that warning when ignore.case = TRUE and fixed = TRUE, should the binding for grepl (as well as for sub and gsub) also emit that warning?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, yeah maybe. Would you mind opening a Jira for that (we can get some discussion there + implement it separately so as not to extend the scope of this PR too much!)

r/tests/testthat/test-dplyr-funcs-string.R Show resolved Hide resolved
r/tests/testthat/test-dplyr-funcs-string.R Outdated Show resolved Hide resolved
Comment on lines 260 to 278
out <- call_binding("grepl", pattern = paste0("^", opts$pattern), x = string, fixed = FALSE)
}

if (negate) {
out <- !out
out <- create_string_match_expr(
arrow_fun = "match_substring_regex",
string = string,
pattern = paste0("^", opts$pattern),
ignore_case = opts$ignore_case,
negate = negate
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here about negate as str_ends below (sorry the comments are backwards! I saw it there first 🙈 )

ateucher and others added 2 commits March 25, 2022 09:16
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
@ateucher
Copy link
Contributor Author

I think I've addressed all of your comments now @jonkeane - let me know what you need me to do next!

Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. I'm going to wait for CI to finish and then merge. Thank you so much for the excellent PR!

@ateucher
Copy link
Contributor Author

My pleasure! Thank you for the help and tips... and kudos on the Arrow dev guide. It's really good.

@jonkeane jonkeane closed this in 5bd4d8e Mar 26, 2022
@ursabot
Copy link

ursabot commented Mar 26, 2022

Benchmark runs are scheduled for baseline = bfa3bca and contender = 5bd4d8e. 5bd4d8e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️25.0% ⚠️ Contender and baseline run contexts do not match] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.75% ⬆️0.21%] test-mac-arm
[Finished ⬇️2.14% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.38% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ateucher ateucher deleted the r-grepl-na-false branch March 26, 2022 00:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants