Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-9856: [R] Add bindings for string compute functions #9423

Closed
wants to merge 4 commits into from

Conversation

nealrichardson
Copy link
Member

@nealrichardson nealrichardson commented Feb 5, 2021

Includes dependency toolchain updates (utf8proc, re2) and bindings for

  • "nchar", "str_length" -> "binary_length"
  • "tolower", "str_to_lower" -> "utf8_lower"
  • "toupper", "str_to_upper" -> "utf8_upper"
  • "str_trim" is handled a little differently since it's one function in stringr but three functions in arrow (utf8_ltrim_whitespace, utf8_rtrim_whitespace, utf8_trim_whitespace)

@github-actions
Copy link

github-actions bot commented Feb 5, 2021

@nealrichardson
Copy link
Member Author

@github-actions crossbow submit homebrew-r-autobrew

@github-actions
Copy link

github-actions bot commented Feb 5, 2021

Revision: 66d81b8

Submitted crossbow builds: ursacomputing/crossbow @ actions-79

Task Status
homebrew-r-autobrew Github Actions

@nealrichardson nealrichardson marked this pull request as ready for review February 5, 2021 21:54
@nealrichardson
Copy link
Member Author

Aside from possibly adding more tests here (arguably redundant since the functions themselves are tested in C++, but perhaps worth more assurance that they behave as an R user would expect), this is done. Other TODOs have been made into followup JIRAs.

Comment on lines 266 to 271
expect_dplyr_equal(
input %>%
filter(dbl > 2, toupper(chr) %in% c("D", "F")) %>%
collect(),
tbl
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test would pass without the changes in this PR. Do you want to try to test toupper() and tolower() in a way that would have failed before but succeeds now?

Copy link
Member Author

@nealrichardson nealrichardson Feb 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's tricky with the R string functions like this that start with

    if (!is.character(x)) 
        x <- as.character(x)

because that will work by coercing the Array data silently.

Maybe we can test that by defining as.character <- function(...) stop("🙅") in the test?

Copy link
Member Author

@nealrichardson nealrichardson Feb 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this. You can't just define as.character in the test because you need to replace the one that toupper calls. You can't use testthat::with_mock to mock base functions anymore, and my other hack of using trace to insert a crash seems to crash too many things (presumably legitimate places where as.character gets called elsewhere in the dplyr code). mockery might allow this more targeted but I haven't had great success setting it up before; there are other ways we could instrument our code to make sure the Arrow function is called, but I'm not sure it's worth it. It definitely won't work if you called toupper on a Dataset expression, so we get some related test coverage there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did work out a test for this, PTAL

@nealrichardson
Copy link
Member Author

I'm going to merge this now. We're going to be iterating on this code for a while so I don't feel compelled to make everything perfect and thorough here.

@nealrichardson nealrichardson deleted the r-strings branch February 16, 2021 21:30
collect(),
tbl
),
NA)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a clever solution

GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Includes dependency toolchain updates (utf8proc, re2) and bindings for

* "nchar", "str_length" -> "binary_length"
* "tolower", "str_to_lower"  -> "utf8_lower"
* "toupper", "str_to_upper" -> "utf8_upper"
* "str_trim" is handled a little differently since it's one function in `stringr` but three functions in arrow (utf8_ltrim_whitespace, utf8_rtrim_whitespace, utf8_trim_whitespace)

Closes apache#9423 from nealrichardson/r-strings

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
Includes dependency toolchain updates (utf8proc, re2) and bindings for

* "nchar", "str_length" -> "binary_length"
* "tolower", "str_to_lower"  -> "utf8_lower"
* "toupper", "str_to_upper" -> "utf8_upper"
* "str_trim" is handled a little differently since it's one function in `stringr` but three functions in arrow (utf8_ltrim_whitespace, utf8_rtrim_whitespace, utf8_trim_whitespace)

Closes apache#9423 from nealrichardson/r-strings

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants