Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] binding for grepl has different behaviour with NA compared to R base grepl #31430

Closed
asfimport opened this issue Mar 22, 2022 · 11 comments
Closed

Comments

@asfimport
Copy link

The arrow binding to grepl behaves slightly differently than the base R {}grepl{}, in that it returns NA for NA inputs, whereas base grepl returns {}FALSE with NA inputs. arrow's implementation is consistent with stringr::str_detect(){}, and both str_detect() and grepl() are bound to match_substring_regex and match_substring in arrow.

I don't know if this is something you would want to change so that the grepl behaviour aligns with base {}grepl{}, or simply document this difference?

Reprex:
 

library(arrow, warn.conflicts = FALSE, quietly = TRUE)
library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(stringr, quietly = TRUE)

alpha_df <- data.frame(alpha = c("alpha", "bet", NA_character_))
alpha_dataset <- InMemoryDataset$create(alpha_df)

mutate(alpha_df, 
       grepl_is_a = grepl("a", alpha), 
       stringr_is_a = str_detect(alpha, "a"))
#>   alpha grepl_is_a stringr_is_a
#> 1 alpha       TRUE         TRUE
#> 2   bet      FALSE        FALSE
#> 3  <NA>      FALSE           NA

mutate(alpha_dataset, 
       grepl_is_a = grepl("a", alpha), 
       stringr_is_a = str_detect(alpha, "a")) |> 
  collect()
#>   alpha grepl_is_a stringr_is_a
#> 1 alpha       TRUE         TRUE
#> 2   bet      FALSE        FALSE
#> 3  <NA>         NA           NA

# base R grepl returns FALSE for NA
grepl("a", alpha_df$alpha) # bound to arrow_match_substring_regex
#> [1]  TRUE FALSE FALSE

grepl("a", alpha_df$alpha, fixed = TRUE) # bound to arrow_match_substring
#> [1]  TRUE FALSE FALSE

# stringr::str_dectect returns NA for NA
str_detect(alpha_df$alpha, "a")
#> [1]  TRUE FALSE    NA

alpha_array <- Array$create(alpha_df$alpha)

# arrow functions return null for null (NA)
call_function("match_substring_regex", alpha_array, options = list(pattern = "a"))
#> Array
#> <bool>
#> [
#>   true,
#>   false,
#>   null
#> ]

call_function("match_substring", alpha_array, options = list(pattern = "a"))
#> Array
#> <bool>
#> [
#>   true,
#>   false,
#>   null
#> ]

 

 

Reporter: Andy Teucher / @ateucher
Assignee: Andy Teucher / @ateucher

PRs and other links:

Note: This issue was originally created as ARROW-16007. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Andy Teucher / @ateucher:
I meant to add - whether you'd like to change the behaviour or document the differences, I'd be keen to take it on!

@asfimport
Copy link
Author

Jonathan Keane / @jonkeane:
Good catch, as far as I see there was not a principled reason to diverge like this.

We would of course welcome a PR to add tests for this + update the Arrow grepl behavior to match base R's. Let us know if you would like any pointers

@asfimport
Copy link
Author

Andy Teucher / @ateucher:
Great! I anticipate that I will need some pointers, but will do some investigation first.

@asfimport
Copy link
Author

Andy Teucher / @ateucher:
@jonkeane I'd like a bit of advice if you don't mind. My initial thoughts on this were to do it in the C++ code in the arrow library, and add a null_as_false boolean option to MatchSubstringOptions, and set null_as_false to true when RegexSubstringMatcher is called by grepl, and false otherwise (to match stringr behaviour with starts_with, str_detect etc). I'm muddling my way through and making decent progress on that path, but then it occurred to me that if this is just a special case in R, maybe it's better to do it on the R side and just change NA to FALSE in the return value of the binding of grepl?

@asfimport
Copy link
Author

Andy Teucher / @ateucher:
I have pushed up my work so far trying to implement null_as_false in the C++ code here.

I am struggling with a couple of things:

  1. how to detect NULL values in a string_view (differentiated from an empty string). Right now I am using string_view::empty() but I don't think that's right.

  2. The code logic I've written is working in that the argument null_as_false is going to the right place (tested with a bunch of std::cout peppered around), but the return false; here is being shortcut somewhere that I can't find, as it is still returning NULL.

    I'm actually struggling to figure out where the R vector gets passed into (and out of) the C++ innards, as I'm guessing that's where those NULLs are captured and returned as NULLs, and probably where the casting to FALSE should happen.

    This is my first foray into C++ and this is a big complex codebase, so I know it's entirely possible I'm totally on the wrong track :)

@asfimport
Copy link
Author

Jonathan Keane / @jonkeane:
You've definitely identified the two paths for this. I agree with your hesitance that empty string and nulls shouldn't be conflated.

The string conversion from R is a bit complicated, but

arrow/r/src/r_to_arrow.cpp

Lines 777 to 824 in ddb663b

template <typename T>
class RPrimitiveConverter<T, enable_if_string_like<T>>
: public PrimitiveConverter<T, RConverter> {
public:
using OffsetType = typename T::offset_type;
Status Extend(SEXP x, int64_t size, int64_t offset = 0) override {
RVectorType rtype = GetVectorType(x);
if (rtype != STRING) {
return Status::Invalid("Expecting a character vector");
}
return UnsafeAppendUtf8Strings(arrow::r::utf8_strings(x), size, offset);
}
void DelayedExtend(SEXP values, int64_t size, RTasks& tasks) override {
auto task = [this, values, size]() { return this->Extend(values, size); };
// TODO: refine this., e.g. extract setup from Extend()
tasks.Append(false, std::move(task));
}
private:
Status UnsafeAppendUtf8Strings(const cpp11::strings& s, int64_t size, int64_t offset) {
RETURN_NOT_OK(this->primitive_builder_->Reserve(s.size()));
const SEXP* p_strings = reinterpret_cast<const SEXP*>(DATAPTR_RO(s));
// we know all the R strings are utf8 already, so we can get
// a definite size and then use UnsafeAppend*()
int64_t total_length = 0;
for (R_xlen_t i = offset; i < size; i++, ++p_strings) {
SEXP si = *p_strings;
total_length += si == NA_STRING ? 0 : LENGTH(si);
}
RETURN_NOT_OK(this->primitive_builder_->ReserveData(total_length));
// append
p_strings = reinterpret_cast<const SEXP*>(DATAPTR_RO(s));
for (R_xlen_t i = offset; i < size; i++, ++p_strings) {
SEXP si = *p_strings;
if (si == NA_STRING) {
this->primitive_builder_->UnsafeAppendNull();
} else {
this->primitive_builder_->UnsafeAppend(CHAR(si), LENGTH(si));
}
}
return Status::OK();
}
};
(and the rest of that file) is a good starting point.

All of that being said, I would probably go the second route you mention (and sorry for not responding with this earlier!):

but then it occurred to me that if this is just a special case in R, maybe it's better to do it on the R side and just change NA to FALSE in the return value of the binding of grepl?

You could put a call to if_else + is.na bindings inside of the grepl binding and get the behavior in R. We do have some support via options for different null handling behaviors for other functions, but I suspect R is a bit of an outlier here (I tried to construct a reprex in Python to see if I could see what it does, but every re.match() with anything missing-like is a type error!).

@asfimport
Copy link
Author

Andy Teucher / @ateucher:
Ok, great thanks. No need to apologize about the timeliness of your reply - even if not the right path it's been fun trying to figure it out.

I do think the R grepl behaviour is definitely an outlier (and IMO probably not the best).

Just to clarify, using if_else + is.na inside the grepl binding will utilize those bindings and thus happen in arrow at collect time?

@asfimport
Copy link
Author

Jonathan Keane / @jonkeane:
Yeah the (possibly slightly misnamed) call_binding builds an Expression that will later be evaluated at collect time. A similar setup is used at

sec <- call_binding(
"if_else",
call_binding("is.na", sec),
0,
sec
)
in ISOdatetime to run sec=NA into 0 here (another R oddity!)

@asfimport
Copy link
Author

Andy Teucher / @ateucher:
Perfect. I've got it working, just need to rejig things a bit since many of the other bindings call the grepl binding, and we don't want that behaviour to propagate into str_detect etc.

@asfimport
Copy link
Author

Andy Teucher / @ateucher:
I've created a PR here - happy to receive feedback on the approach!

@asfimport
Copy link
Author

Jonathan Keane / @jonkeane:
Issue resolved by pull request 12711
#12711

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant