ARROW-16007: [R] grepl bindings return FALSE for NA inputs #12711

ateucher · 2022-03-24T23:21:13Z

This ensures that the grepl binding mimics R's base grepl by returning FALSE for NA inputs (previously it returned NA). As several other bindings called the grepl binding and we don't want the grepl behaviour with NA to propagate to those bindings (they all return NA with NA inputs), I had to change how they were constructed as well. I abstracted out the main parts of the register_binding for the string matching functions (those that return a logical) into a helper function create_string_match_expr(), which is used by the bindings for grepl, str_detect, str_starts, and str_ends.
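
For reference, a minimal base R sketch of the difference in NA handling described above. It uses only base functions, not the arrow bindings, and the sample vector `x` is an assumption chosen for illustration.

```r
# Sample input with a missing value (illustrative only).
x <- c("Foo", "bar", NA_character_)

# Base grepl() never returns NA: a missing input simply does not match.
grepl_result <- grepl("o", x)

# Base startsWith() and endsWith() propagate the missing value instead.
starts_result <- startsWith(x, "F")
ends_result <- endsWith(x, "r")

print(grepl_result)   # TRUE FALSE FALSE
print(starts_result)  # TRUE FALSE NA
print(ends_result)    # FALSE TRUE NA
```

The point of the PR is that the Arrow binding for grepl should follow the first pattern, while the bindings that previously delegated to it should keep following the second.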

I added several tests for NA behaviour - which failed before the changes and now pass (at least locally, will wait for CI to finish here)

Test behaviour of grepl, str_detect, startsWith, str_starts, endsWith, str_ends with NA.

github-actions · 2022-03-24T23:21:32Z

https://issues.apache.org/jira/browse/ARROW-16007

github-actions · 2022-03-24T23:21:33Z

⚠️ Ticket has no components in JIRA, make sure you assign one.

github-actions · 2022-03-24T23:21:34Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

jonkeane

Thank you for this really excellent PR + catch! The test additions that you made are fantastic, and a few nice cleanups along the way.

I've only got one substantive comment about str_starts and str_ends with the negate option.

Thank you again!

jonkeane · 2022-03-25T14:13:47Z

r/R/dplyr-funcs-string.R

-    if (negate) {
-      out <- !out
+      out <- create_string_match_expr(
+        arrow_fun = "match_substring_regex",
+        string = string,
+        pattern = paste0(opts$pattern, "$"),
+        ignore_case = opts$ignore_case,
+        negate = negate
+      )


Hmm, I don't think we have tests that would cover this (though we should!) but what would happen if someone called str_ends("bar_foo", fixed("foo"), negate = TRUE)? I think we might still need the separate if (negate) {...} block here

Oh that's a great catch. I'll add a test and modify accordingly. It's probably worth taking that out of create_string_match_expr() and just doing it explicitly in the three functions that need it, as before.
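
A rough sketch of the concern in plain R, assuming a hypothetical stand-in function `ends_with_sketch()` (this is not the arrow binding and does not call `create_string_match_expr()`). If negation is handled only inside the regex branch, the fixed branch silently ignores `negate`; applying it once after both branches avoids that.

```r
# Hypothetical stand-in for a str_ends-style function, written in base R.
ends_with_sketch <- function(string, pattern, fixed = FALSE, negate = FALSE) {
  if (fixed) {
    # Literal suffix match; endsWith() already returns NA for NA input.
    out <- endsWith(string, pattern)
  } else {
    # Regex match anchored at the end; restore NA, which grepl() drops.
    out <- grepl(paste0(pattern, "$"), string)
    out[is.na(string)] <- NA
  }
  # Negate after both branches so fixed patterns are covered too.
  if (negate) {
    out <- !out
  }
  out
}

fixed_negated <- ends_with_sketch("bar_foo", "foo", fixed = TRUE, negate = TRUE)
regex_negated <- ends_with_sketch("bar_foo", "fo+", negate = TRUE)
na_negated <- ends_with_sketch(NA_character_, "foo", fixed = TRUE, negate = TRUE)

print(fixed_negated)  # FALSE
print(regex_negated)  # FALSE
print(na_negated)     # NA
```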

jonkeane · 2022-03-25T14:18:33Z

r/tests/testthat/test-dplyr-funcs-string.R

+  expect_equal(
+    df %>%
+      Table$create() %>%
+      mutate(
+        a = grepl("O", x, ignore.case = TRUE, fixed = TRUE)
+      ) %>%
+      collect(),
+    tibble(
+      x = c("Foo", "bar", NA_character_),
+      a = c(TRUE, FALSE, FALSE)
+    )
+  )


Suggested change

-  expect_equal(
-    df %>%
-      Table$create() %>%
-      mutate(
-        a = grepl("O", x, ignore.case = TRUE, fixed = TRUE)
-      ) %>%
-      collect(),
-    tibble(
-      x = c("Foo", "bar", NA_character_),
-      a = c(TRUE, FALSE, FALSE)
-    )
-  )
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        a = grepl("O", x, ignore.case = TRUE, fixed = TRUE)
+      ) %>%
+      collect(),
+    df
+  )

This is more stylistic than anything else, but you should be able to swap out this code

This uses one of our custom expectations, which can be a little hairy at first, but it tests a few different routes (as a Table, as a RecordBatch) and confirms we're getting the same behavior as the base R or tidyverse functions we are binding.

arrow/r/tests/testthat/helper-expectation.R

Lines 69 to 137 in acc6c2e

#' Ensure that dplyr methods on Arrow objects return the same as for data frames
#'
#' This function compares the output of running a dplyr expression on a tibble
#' or data.frame object against the output of the same expression run on
#' Arrow Table and RecordBatch objects.
#'
#'
#' @param expr A dplyr pipeline which must have `.input` as its start
#' @param tbl A tibble or data.frame which will be substituted for `.input`
#' @param skip_record_batch The skip message to show (if you should skip the
#' RecordBatch test)
#' @param skip_table The skip message to show (if you should skip the Table test)
#' @param warning The expected warning from the RecordBatch and Table comparison
#' paths, passed to `expect_warning()`. Special values:
#' * `NA` (the default) for ensuring no warning message
#' * `TRUE` is a special case to mean to check for the
#' "not supported in Arrow; pulling data into R" message.
#' @param ... additional arguments, passed to `expect_equal()`
compare_dplyr_binding <- function(expr,
                                  tbl,
                                  skip_record_batch = NULL,
                                  skip_table = NULL,
                                  warning = NA,
                                  ...) {

  # Quote the contents of `expr` so that we can evaluate it a few different ways
  expr <- rlang::enquo(expr)
  # Get the expected output by evaluating expr on the .input data.frame using regular dplyr
  expected <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(.input = tbl)))

  if (isTRUE(warning)) {
    # Special-case the simple warning:
    warning <- "not supported in Arrow; pulling data into R"
  }

  skip_msg <- NULL

  # Evaluate `expr` on a RecordBatch object and compare with `expected`
  if (is.null(skip_record_batch)) {
    expect_warning(
      via_batch <- rlang::eval_tidy(
        expr,
        rlang::new_data_mask(rlang::env(.input = record_batch(tbl)))
      ),
      warning
    )
    expect_equal(via_batch, expected, ...)
  } else {
    skip_msg <- c(skip_msg, skip_record_batch)
  }

  # Evaluate `expr` on a Table object and compare with `expected`
  if (is.null(skip_table)) {
    expect_warning(
      via_table <- rlang::eval_tidy(
        expr,
        rlang::new_data_mask(rlang::env(.input = arrow_table(tbl)))
      ),
      warning
    )
    expect_equal(via_table, expected, ...)
  } else {
    skip_msg <- c(skip_msg, skip_table)
  }

  if (!is.null(skip_msg)) {
    skip(paste(skip_msg, collapse = "\n"))
  }
}
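
The core idea in the helper above (quote the expression once, then evaluate it against several versions of `.input`) can be sketched without arrow, rlang, or testthat. The function `compare_inputs_sketch()` and its `converters` argument are hypothetical names used only for illustration; base `substitute()` and `eval()` stand in for `rlang::enquo()` and `rlang::eval_tidy()`.

```r
# Hypothetical, simplified analogue of the quote-then-evaluate pattern.
compare_inputs_sketch <- function(expr, tbl, converters) {
  # Capture the unevaluated expression and the caller's environment.
  quoted <- substitute(expr)
  caller <- parent.frame()

  # Expected result: evaluate with the plain data frame bound to `.input`.
  expected <- eval(quoted, list(.input = tbl), caller)

  # Re-evaluate with each converted input and compare with `expected`.
  vapply(converters, function(convert) {
    actual <- eval(quoted, list(.input = convert(tbl)), caller)
    isTRUE(all.equal(actual, expected))
  }, logical(1))
}

df <- data.frame(x = c("Foo", "bar", NA), stringsAsFactors = FALSE)

results <- compare_inputs_sketch(
  grepl("o", .input$x),
  df,
  list(data_frame = identity, plain_list = as.list)
)

print(results)
```

In the real helper the converted inputs are a RecordBatch and a Table, and the comparisons are testthat expectations rather than a returned logical vector.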

Yeah, I was impressed with that compare_dplyr_binding expectation. Very clever! I left it this way (explicitly comparing to the manually-created tibble) due to the comment here. Admittedly I didn't try to make it work with compare_dplyr_binding, just modified the existing test, but can have a go...

Aaah, yup you're absolutely right. I missed that comment up there

Given that base grepl gives that warning when ignore.case = TRUE and fixed = TRUE, should the binding for grepl (as well as for sub and gsub) also emit that warning?

Hmmm, yeah maybe. Would you mind opening a Jira for that (we can get some discussion there + implement it separately so as not to extend the scope of this PR too much!)
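
For context, a small base R sketch of the behaviour being discussed: base grepl warns when both `ignore.case = TRUE` and `fixed = TRUE` are given, and then matches case-sensitively. The input vector mirrors the test data above; the variable names are assumptions for illustration.

```r
# Capture the warning text instead of letting it print.
warning_message <- NULL

result <- withCallingHandlers(
  grepl("O", c("Foo", "bar", NA), ignore.case = TRUE, fixed = TRUE),
  warning = function(w) {
    warning_message <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)

print(warning_message)
print(result)  # FALSE FALSE FALSE
```

Base R therefore returns FALSE for "Foo" here, whereas the test in this PR expects TRUE from the Arrow binding, which is why the expected tibble is written out by hand instead of being compared against base R.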

r/tests/testthat/test-dplyr-funcs-string.R

jonkeane · 2022-03-25T14:21:34Z

r/R/dplyr-funcs-string.R

-      out <- call_binding("grepl", pattern = paste0("^", opts$pattern), x = string, fixed = FALSE)
-    }
-
-    if (negate) {
-      out <- !out
+      out <- create_string_match_expr(
+        arrow_fun = "match_substring_regex",
+        string = string,
+        pattern = paste0("^", opts$pattern),
+        ignore_case = opts$ignore_case,
+        negate = negate
+      )


Same thing here about negate as str_ends below (sorry the comments are backwards! I saw it there first 🙈 )

ateucher · 2022-03-25T21:08:46Z

I think I've addressed all of your comments now @jonkeane - let me know what you need me to do next!

jonkeane

This looks good. I'm going to wait for CI to finish and then merge. Thank you so much for the excellent PR!

ateucher · 2022-03-25T23:24:07Z

My pleasure! Thank you for the help and tips... and kudos on the Arrow dev guide. It's really good.

ursabot · 2022-03-26T00:31:25Z

Benchmark runs are scheduled for baseline = bfa3bca and contender = 5bd4d8e. 5bd4d8e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️25.0% ⚠️ Contender and baseline run contexts do not match] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.75% ⬆️0.21%] test-mac-arm
[Finished ⬇️2.14% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.38% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ateucher added 3 commits March 24, 2022 15:06

Tests for ARROW-16007

e99787b

Test behaviour of grepl, str_detect, startsWith, str_starts, endsWith, str_ends with NA.

Use mutate not transmute in test

ef075dc

grepl binding returns FALSE with NA inputs, all others return NA. ARR…

ceaac57

…OW-16007

github-actions bot added the Component: R label Mar 24, 2022

jonkeane requested changes Mar 25, 2022

ateucher and others added 2 commits March 25, 2022 09:16

Add test for str_ends with fixed & negate = TRUE

e476fda

Co-authored-by: Jonathan Keane <jkeane@gmail.com>

ensure negate is respected with fixed() in str_detect, str_starts, st…

4a95081

…r_ends. Add test for str_starts Addresses apache#12711 (comment)

jonkeane approved these changes Mar 25, 2022

jonkeane closed this in 5bd4d8e Mar 26, 2022

ateucher deleted the r-grepl-na-false branch March 26, 2022 00:33


