New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-12992: [R] bindings for substr(), substring(), str_sub() #10624
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is non-trivial to implement bindings for stringr::str_sub
so I have opened this ticket to see if there's a C++ route to doing this before trying something with R: https://issues.apache.org/jira/browse/ARROW-13259
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs tests. I'd like this included in 5.0 if possible.
Co-authored-by: Nic <thisisnic@gmail.com>
Co-authored-by: Nic <thisisnic@gmail.com>
thx, I re-added some of the tests |
r/src/compute.cpp
Outdated
if (func_name == "utf8_slice_codeunits") { | ||
using Options = arrow::compute::SliceOptions; | ||
int64_t start = 1; | ||
int64_t stop = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this still may need updating to reflect the default value of stop supplied by C++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the final comment here for more info but happy to explain if that doesn't make sense: https://issues.apache.org/jira/browse/ARROW-13259
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes !!!!
at the moment, I have added the additional "+1" to the R functions
if the backend changes, Iĺl adapt the functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ready !! i added new tests and I commented the expected results
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I didn't explain this properly, my fault! What I mean is that here stop has been set to -1
but the default C++ value isn't -1
, so the default value here probably shouldn't be -1
either.
Check out this line of code here that shows the default value of stop
in the C++:
arrow/cpp/src/arrow/compute/api_scalar.h
Lines 205 to 206 in 7114c4b
explicit SliceOptions(int64_t start, int64_t stop = std::numeric_limits<int64_t>::max(), | |
int64_t step = 1); |
The default value of stop
is std::numeric_limits<int64_t>::max()
which is the biggest int64. So if you don't supply a value for stop
and just use the default, this allows you to slice to the end of the string.
In some of the other functions in this file, this line has been used to set the default values to the ones from the C++ code:
std::make_shared<Options>(Options::Defaults());
Maybe instead of manually setting the value of stop to -1
, if you use the line above, it might make it easier to fix some of the failing tests as now you'll be able to slice to the end of a string. If this doesn't make sense, let me know!
…o ARROW_12992_str_sub
This PR now also contains some unrelated styling changes as I ran styler on the files I changed before pushing my changes. |
@nealrichardson - have made some updates; please could you re-review this when you have a chance? Tomorrow I'm going to add in the tests for warnings/errors raised when incorrect parameter values are supplied but is there anything else that needs updating in this PR? |
r/R/dplyr-functions.R
Outdated
if (start <= 0) { | ||
start <- 1 | ||
} | ||
|
||
if (stop < start) { | ||
stop <- 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain these? It's not obvious why this is correct, and the base R code and docs don't discuss these cases.
r/R/dplyr-functions.R
Outdated
) | ||
} | ||
|
||
nse_funcs$substring <- function(text, first, last = 1000000L) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could just implement this one by calling nse_funcs$substr. The validation messages won't be exactly right because the argument names are different, but per the docs this function is only kept for compatibility with S, so I don't think it's a big deal.
r/R/dplyr-functions.R
Outdated
msg = "`end` must be length 1 - other lengths are not supported in Arrow" | ||
) | ||
|
||
if (start == 0) start <- 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use braces even for short if statements
r/R/dplyr-functions.R
Outdated
|
||
if (end < start) end <- 0 | ||
|
||
if (start > 0) start <- start - 1L |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this version subtract 1 up here but the others don't? Why only if start > 0? Is start <= 0 valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The subtracting of 1 is done in this code block in this function, because it's conditional on start
being greater than 0.
The other versions don't allow using negative values to count from the end backwards, so while in the others start <= 0 isn't valid, here it is.
We only subtract 1 from start when start is > 0 because:
- we normally need to subtract 1 from
start
because C++ is 0-based and R is 1-based so they're counting from different points - we don't need to subtract 1 from
end
as R counts inclusively (i.e. returned string includes the item at positionend
) whereas C++ counts exclusively (i.e. returned string includes up to the item at positionend
which effectively cancels out the difference due to indexing str_sub
treats astart
value of 0 or 1 as the same thing, which is why here, the subtraction is not done whenstart
== 0 (and so resulting in them both passing astart
value of 0 being passed toutf8_slice_codeunits
)- a
start
value < 0 is valid as bothstr_sub
andutf8_slice_codeunits
can count backwards from the end with -1 being the final character in the string, -2 being the second to last character, etc.
I'll add some of this to the code in the form of a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this also makes setting start
to 1 if it's 0 totally redundant so I'll remove that too
Expression$create( | ||
"utf8_slice_codeunits", | ||
string, | ||
options = list(start = start - 1L, stop = stop) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we only subtract 1 from start but not stop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need to subtract 1 from end
as R counts inclusively (i.e. returned string includes the item at position end
) whereas C++ counts exclusively (i.e. returned string includes up to the item at position end
which effectively cancels out the difference due to indexing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This deserves an explanatory comment
r/R/dplyr-functions.R
Outdated
pattern = | ||
opts$pattern, | ||
opts$pattern, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is just linting but IDK why opts$pattern is on its own line instead of next to = above it.
) | ||
}) | ||
|
||
test_that("substring", { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you made substring just call substr then you could delete all but one of these tests.
…ng and substr, fix brace usage, fix weird linting
@nealrichardson - have made updates, when you get the chance, please could you re-review this? |
Expression$create( | ||
"utf8_slice_codeunits", | ||
string, | ||
options = list(start = start - 1L, stop = stop) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This deserves an explanatory comment
r/R/dplyr-functions.R
Outdated
) | ||
} | ||
|
||
nse_funcs$substring <- nse_funcs$substr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't exactly work because the arguments are named differently
r/R/dplyr-functions.R
Outdated
end <- 0 | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
|
||
if (start > 0) { | ||
start <- start - 1L |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These deserve explanatory comments, particularly noting the differences in behavior among base::substr, stringr::str_sub, and arrow's utf8_slice_codeunits. Basically a concise version of what you answered me in the PR review. (It's a good rule of thumb that if I ask you why something is a certain way because it wasn't obvious to me, the answer should probably be a code comment and not (just) a PR comment--if it's not obvious to the reader now, it definitely won't be obvious to us in 6 months or whenever we have to revise this code and wonder why things are munged a certain way.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, good to know, will update with this
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
…s, add explanatory comment
…o ARROW_12992_str_sub
@nealrichardson - ready for re-review, thanks! |
No description provided.