Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-12992: [R] bindings for substr(), substring(), str_sub() #10624

Closed
wants to merge 40 commits into from

Conversation

pachadotdev
Copy link
Contributor

No description provided.

@github-actions
Copy link

@pachadotdev pachadotdev changed the title ARROW-12992: [R] bindings for substr(), substring(), str_sub() - JUST STR SUB SKETCH ARROW-12992: [R] bindings for substr(), substring(), str_sub() Jun 30, 2021
Copy link
Member

@thisisnic thisisnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is non-trivial to implement bindings for stringr::str_sub so I have opened this ticket to see if there's a C++ route to doing this before trying something with R: https://issues.apache.org/jira/browse/ARROW-13259

r/R/dplyr-functions.R Show resolved Hide resolved
r/R/expression.R Show resolved Hide resolved
r/R/dplyr-functions.R Outdated Show resolved Hide resolved
r/R/dplyr-functions.R Outdated Show resolved Hide resolved
r/R/dplyr-functions.R Outdated Show resolved Hide resolved
Copy link
Contributor

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs tests. I'd like this included in 5.0 if possible.

Pachá and others added 5 commits July 12, 2021 11:55
Co-authored-by: Nic <thisisnic@gmail.com>
Co-authored-by: Nic <thisisnic@gmail.com>
@pachadotdev
Copy link
Contributor Author

This needs tests. I'd like this included in 5.0 if possible.

thx, I re-added some of the tests

if (func_name == "utf8_slice_codeunits") {
using Options = arrow::compute::SliceOptions;
int64_t start = 1;
int64_t stop = -1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this still may need updating to reflect the default value of stop supplied by C++

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the final comment here for more info but happy to explain if that doesn't make sense: https://issues.apache.org/jira/browse/ARROW-13259

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes !!!!
at the moment, I have added the additional "+1" to the R functions
if the backend changes, Iĺl adapt the functions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ready !! i added new tests and I commented the expected results

Copy link
Member

@thisisnic thisisnic Jul 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't explain this properly, my fault! What I mean is that here stop has been set to -1 but the default C++ value isn't -1, so the default value here probably shouldn't be -1 either.

Check out this line of code here that shows the default value of stop in the C++:

explicit SliceOptions(int64_t start, int64_t stop = std::numeric_limits<int64_t>::max(),
int64_t step = 1);

The default value of stop is std::numeric_limits<int64_t>::max() which is the biggest int64. So if you don't supply a value for stop and just use the default, this allows you to slice to the end of the string.

In some of the other functions in this file, this line has been used to set the default values to the ones from the C++ code:
std::make_shared<Options>(Options::Defaults());

Maybe instead of manually setting the value of stop to -1, if you use the line above, it might make it easier to fix some of the failing tests as now you'll be able to slice to the end of a string. If this doesn't make sense, let me know!

@thisisnic
Copy link
Member

This PR now also contains some unrelated styling changes as I ran styler on the files I changed before pushing my changes.

@thisisnic
Copy link
Member

@nealrichardson - have made some updates; please could you re-review this when you have a chance? Tomorrow I'm going to add in the tests for warnings/errors raised when incorrect parameter values are supplied but is there anything else that needs updating in this PR?

Comment on lines 297 to 302
if (start <= 0) {
start <- 1
}

if (stop < start) {
stop <- 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain these? It's not obvious why this is correct, and the base R code and docs don't discuss these cases.

)
}

nse_funcs$substring <- function(text, first, last = 1000000L) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could just implement this one by calling nse_funcs$substr. The validation messages won't be exactly right because the argument names are different, but per the docs this function is only kept for compatibility with S, so I don't think it's a big deal.

msg = "`end` must be length 1 - other lengths are not supported in Arrow"
)

if (start == 0) start <- 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use braces even for short if statements


if (end < start) end <- 0

if (start > 0) start <- start - 1L
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this version subtract 1 up here but the others don't? Why only if start > 0? Is start <= 0 valid?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The subtracting of 1 is done in this code block in this function, because it's conditional on start being greater than 0.

The other versions don't allow using negative values to count from the end backwards, so while in the others start <= 0 isn't valid, here it is.

We only subtract 1 from start when start is > 0 because:

  • we normally need to subtract 1 from start because C++ is 0-based and R is 1-based so they're counting from different points
  • we don't need to subtract 1 from end as R counts inclusively (i.e. returned string includes the item at position end) whereas C++ counts exclusively (i.e. returned string includes up to the item at position end which effectively cancels out the difference due to indexing
  • str_sub treats a start value of 0 or 1 as the same thing, which is why here, the subtraction is not done when start == 0 (and so resulting in them both passing a start value of 0 being passed to utf8_slice_codeunits)
  • a start value < 0 is valid as both str_sub and utf8_slice_codeunits can count backwards from the end with -1 being the final character in the string, -2 being the second to last character, etc.

I'll add some of this to the code in the form of a comment

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this also makes setting start to 1 if it's 0 totally redundant so I'll remove that too

Expression$create(
"utf8_slice_codeunits",
string,
options = list(start = start - 1L, stop = stop)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we only subtract 1 from start but not stop?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need to subtract 1 from end as R counts inclusively (i.e. returned string includes the item at position end) whereas C++ counts exclusively (i.e. returned string includes up to the item at position end which effectively cancels out the difference due to indexing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This deserves an explanatory comment

Comment on lines 472 to 473
pattern =
opts$pattern,
opts$pattern,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is just linting but IDK why opts$pattern is on its own line instead of next to = above it.

)
})

test_that("substring", {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you made substring just call substr then you could delete all but one of these tests.

@thisisnic
Copy link
Member

@nealrichardson - have made updates, when you get the chance, please could you re-review this?

r/R/dplyr-functions.R Outdated Show resolved Hide resolved
Expression$create(
"utf8_slice_codeunits",
string,
options = list(start = start - 1L, stop = stop)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This deserves an explanatory comment

)
}

nse_funcs$substring <- nse_funcs$substr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't exactly work because the arguments are named differently

end <- 0
}


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change



if (start > 0) {
start <- start - 1L
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These deserve explanatory comments, particularly noting the differences in behavior among base::substr, stringr::str_sub, and arrow's utf8_slice_codeunits. Basically a concise version of what you answered me in the PR review. (It's a good rule of thumb that if I ask you why something is a certain way because it wasn't obvious to me, the answer should probably be a code comment and not (just) a PR comment--if it's not obvious to the reader now, it definitely won't be obvious to us in 6 months or whenever we have to revise this code and wonder why things are munged a certain way.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good to know, will update with this

thisisnic and others added 5 commits July 15, 2021 12:26
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
@thisisnic
Copy link
Member

@nealrichardson - ready for re-review, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants