ARROW-12992: [R] bindings for substr(), substring(), str_sub() #10624

pachadotdev · 2021-06-29T22:17:39Z

No description provided.

github-actions · 2021-06-29T22:18:03Z

https://issues.apache.org/jira/browse/ARROW-12992

thisisnic

It is non-trivial to implement bindings for stringr::str_sub so I have opened this ticket to see if there's a C++ route to doing this before trying something with R: https://issues.apache.org/jira/browse/ARROW-13259

r/R/dplyr-functions.R

r/R/expression.R

r/tests/testthat/test-dplyr-string-functions.R

r/R/dplyr-functions.R

nealrichardson

This needs tests. I'd like this included in 5.0 if possible.

Co-authored-by: Nic <thisisnic@gmail.com>

pachadotdev · 2021-07-12T16:32:46Z

> This needs tests. I'd like this included in 5.0 if possible.

Thanks, I re-added some of the tests.

thisisnic · 2021-07-12T16:35:31Z

r/src/compute.cpp

+  if (func_name == "utf8_slice_codeunits") {
+    using Options = arrow::compute::SliceOptions;
+    int64_t start = 1;
+    int64_t stop = -1;


I think this still may need updating to reflect the default value of stop supplied by C++

See the final comment here for more info but happy to explain if that doesn't make sense: https://issues.apache.org/jira/browse/ARROW-13259

Yes! At the moment, I have added the additional "+1" to the R functions. If the backend changes, I'll adapt the functions.

Ready! I added new tests and commented the expected results.

Sorry, I didn't explain this properly, my fault! What I mean is that here stop has been set to -1 but the default C++ value isn't -1, so the default value here probably shouldn't be -1 either.

Check out this line of code here that shows the default value of stop in the C++:

arrow/cpp/src/arrow/compute/api_scalar.h

Lines 205 to 206 in 7114c4b

explicit SliceOptions(int64_t start, int64_t stop = std::numeric_limits<int64_t>::max(),
                      int64_t step = 1);

The default value of stop is std::numeric_limits<int64_t>::max(), which is the largest value an int64_t can hold. So if you don't supply a value for stop and just use the default, the slice runs to the end of the string.

In some of the other functions in this file, this line has been used to set the default values to the ones from the C++ code:
std::make_shared<Options>(Options::Defaults());

Maybe instead of manually setting the value of stop to -1, if you use the line above, it might make it easier to fix some of the failing tests as now you'll be able to slice to the end of a string. If this doesn't make sense, let me know!
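To make the difference concrete, here is a minimal standalone C++ sketch. It does not use Arrow: `SliceOptionsSketch` and `SliceAscii` are hypothetical stand-ins, and the slicing rule (half-open range, negative indices counted from the end, ASCII only) is an assumption modeled on Python-style slicing rather than the actual kernel code. Only the default for `stop` is taken from the snippet quoted above.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical stand-in for arrow::compute::SliceOptions (step omitted).
// The default for `stop` mirrors the constructor quoted in this thread.
struct SliceOptionsSketch {
  int64_t start;
  int64_t stop;
  explicit SliceOptionsSketch(int64_t start,
                              int64_t stop = std::numeric_limits<int64_t>::max())
      : start(start), stop(stop) {}
};

// Toy ASCII-only slice over the half-open range [start, stop).
// Assumption: negative indices count from the end, and out-of-range
// indices are clamped to the string bounds.
inline std::string SliceAscii(const std::string& s, const SliceOptionsSketch& opts) {
  const int64_t n = static_cast<int64_t>(s.size());
  auto resolve = [n](int64_t i) -> int64_t {
    if (i < 0) i += n;  // -1 refers to the last character
    return std::max<int64_t>(0, std::min<int64_t>(i, n));
  };
  const int64_t begin = resolve(opts.start);
  const int64_t end = resolve(opts.stop);
  if (end <= begin) return "";
  return s.substr(static_cast<size_t>(begin), static_cast<size_t>(end - begin));
}
```

Under these assumptions, `SliceAscii("hello", SliceOptionsSketch(0))` keeps the whole string because the default `stop` is clamped to the string length, whereas `SliceAscii("hello", SliceOptionsSketch(0, -1))` drops the final character, since an exclusive `stop` of -1 ends just before it.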

thisisnic · 2021-07-14T18:25:12Z

This PR now also contains some unrelated styling changes as I ran styler on the files I changed before pushing my changes.

thisisnic · 2021-07-14T18:50:53Z

@nealrichardson - have made some updates; please could you re-review this when you have a chance? Tomorrow I'm going to add in the tests for warnings/errors raised when incorrect parameter values are supplied but is there anything else that needs updating in this PR?

nealrichardson · 2021-07-14T20:00:03Z

r/R/dplyr-functions.R

+  if (start <= 0) {
+    start <- 1
+  }
+
+  if (stop < start) {
+    stop <- 0


Can you explain these? It's not obvious why this is correct, and the base R code and docs don't discuss these cases.

nealrichardson · 2021-07-14T20:01:45Z

r/R/dplyr-functions.R

+  )
+}
+
+nse_funcs$substring <- function(text, first, last = 1000000L) {


You could just implement this one by calling nse_funcs$substr. The validation messages won't be exactly right because the argument names are different, but per the docs this function is only kept for compatibility with S, so I don't think it's a big deal.

nealrichardson · 2021-07-14T20:04:33Z

r/R/dplyr-functions.R

+    msg = "`end` must be length 1 - other lengths are not supported in Arrow"
+  )
+
+  if (start == 0) start <- 1


Please use braces even for short if statements

nealrichardson · 2021-07-14T20:05:29Z

r/R/dplyr-functions.R

+
+  if (end < start) end <- 0
+
+  if (start > 0) start <- start - 1L


Why does this version subtract 1 up here but the others don't? Why only if start > 0? Is start <= 0 valid?

The subtracting of 1 is done in this code block in this function, because it's conditional on start being greater than 0.

The other versions don't allow using negative values to count from the end backwards, so while in the others start <= 0 isn't valid, here it is.

We only subtract 1 from start when start is > 0 because:

- we normally need to subtract 1 from start because C++ is 0-based and R is 1-based, so they count from different points
- we don't need to subtract 1 from end because R counts inclusively (i.e. the returned string includes the item at position end) whereas C++ counts exclusively (i.e. the returned string runs up to, but does not include, the item at position end), which effectively cancels out the difference due to indexing
- str_sub treats a start value of 0 or 1 as the same thing, which is why the subtraction is not done when start == 0 (so both cases pass a start value of 0 to utf8_slice_codeunits)
- a start value < 0 is valid, as both str_sub and utf8_slice_codeunits can count backwards from the end, with -1 being the final character in the string, -2 being the second to last character, etc.

I'll add some of this to the code in the form of a comment

Hmm, this also makes setting start to 1 if it's 0 totally redundant so I'll remove that too
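The rules described above can be sketched as a small standalone C++ function. This is an illustration only, not the R binding itself: `SliceBounds` and `FromStrSubArgs` are hypothetical names, and the sketch covers only a non-negative `end`, since the explanation above does not spell out how a negative `end` is translated.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical pair of bounds for a 0-based, half-open slice [start, stop).
struct SliceBounds {
  int64_t start;
  int64_t stop;
};

// Sketch: translate str_sub-style arguments (1-based, inclusive `end`)
// into 0-based, exclusive-stop bounds, following the rules in this thread.
inline SliceBounds FromStrSubArgs(int64_t start, int64_t end) {
  assert(end >= 0 && "negative `end` is outside the scope of this sketch");
  // An end before the start yields an empty string, as in the diff.
  if (end < start) {
    end = 0;
  }
  // Shift only positive starts: 0 and 1 both become 0, and negative
  // values are passed through because they count from the end.
  if (start > 0) {
    start -= 1;
  }
  // `end` is left alone: inclusive 1-based and exclusive 0-based coincide.
  return {start, end};
}
```

For example, `str_sub("hello", 2, 4)` returns "ell", and the sketch maps the arguments (2, 4) to the bounds [1, 4), which select the same three characters in a 0-based, half-open scheme.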

nealrichardson · 2021-07-14T20:06:48Z

r/R/dplyr-functions.R

+  Expression$create(
+    "utf8_slice_codeunits",
+    string,
+    options = list(start = start - 1L, stop = stop)


Why do we only subtract 1 from start but not stop?

We don't need to subtract 1 from end because R counts inclusively (i.e. the returned string includes the item at position end) whereas C++ counts exclusively (i.e. the returned string runs up to, but does not include, the item at position end), which effectively cancels out the difference due to indexing.

This deserves an explanatory comment
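As a rough illustration of what such a comment would need to explain, the sketch below combines the substr adjustments visible in the diff hunks in this thread (`start <= 0` becomes 1, `stop < start` becomes 0, then `start - 1`) with a toy slice. It is standalone C++ rather than the R binding; `SliceBounds`, `FromSubstrArgs`, and `ApplyAscii` are hypothetical names, and the slice is ASCII-only.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Hypothetical pair of bounds for a 0-based, half-open slice [start, stop).
struct SliceBounds {
  int64_t start;
  int64_t stop;
};

// Sketch of the substr-style mapping (1-based, inclusive stop, no negatives).
inline SliceBounds FromSubstrArgs(int64_t start, int64_t stop) {
  if (start <= 0) {
    start = 1;  // non-positive starts are treated as the first character
  }
  if (stop < start) {
    stop = 0;  // stop before start gives an empty result
  }
  // Only start shifts: the 1-based inclusive stop equals the 0-based exclusive stop.
  return {start - 1, stop};
}

// Toy ASCII-only application of the bounds; assumes bounds.start >= 0,
// which FromSubstrArgs guarantees.
inline std::string ApplyAscii(const std::string& s, SliceBounds bounds) {
  const int64_t n = static_cast<int64_t>(s.size());
  const int64_t stop = std::min(bounds.stop, n);
  if (stop <= bounds.start) return "";
  return s.substr(static_cast<size_t>(bounds.start),
                  static_cast<size_t>(stop - bounds.start));
}
```

With these helpers, `ApplyAscii("hello", FromSubstrArgs(2, 4))` yields "ell", the same result base R gives for `substr("hello", 2, 4)`.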

nealrichardson · 2021-07-14T20:07:38Z

r/R/dplyr-functions.R

      pattern =
-      opts$pattern,
+        opts$pattern,


I know this is just linting but IDK why opts$pattern is on its own line instead of next to = above it.

nealrichardson · 2021-07-14T20:08:58Z

r/tests/testthat/test-dplyr-string-functions.R

+  )
+})
+
+test_that("substring", {


If you made substring just call substr then you could delete all but one of these tests.

thisisnic · 2021-07-15T09:04:05Z

@nealrichardson - have made updates, when you get the chance, please could you re-review this?

r/R/dplyr-functions.R

nealrichardson · 2021-07-15T12:15:05Z

r/R/dplyr-functions.R

+  Expression$create(
+    "utf8_slice_codeunits",
+    string,
+    options = list(start = start - 1L, stop = stop)


This deserves an explanatory comment

nealrichardson · 2021-07-15T12:15:29Z

r/R/dplyr-functions.R

+  )
+}
+
+nse_funcs$substring <- nse_funcs$substr


This won't exactly work because the arguments are named differently

nealrichardson · 2021-07-15T12:16:31Z

r/R/dplyr-functions.R

+    end <- 0
+  }
+
+


Suggested change

nealrichardson · 2021-07-15T12:19:44Z

r/R/dplyr-functions.R

+
+
+  if (start > 0) {
+    start <- start - 1L


These deserve explanatory comments, particularly noting the differences in behavior among base::substr, stringr::str_sub, and arrow's utf8_slice_codeunits. Basically a concise version of what you answered me in the PR review. (It's a good rule of thumb that if I ask you why something is a certain way because it wasn't obvious to me, the answer should probably be a code comment and not (just) a PR comment--if it's not obvious to the reader now, it definitely won't be obvious to us in 6 months or whenever we have to revise this code and wonder why things are munged a certain way.)

Thanks, good to know, will update with this

r/tests/testthat/test-dplyr-string-functions.R

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>

thisisnic · 2021-07-15T13:28:08Z

@nealrichardson - ready for re-review, thanks!

sketch

1d4f6de

github-actions bot added the Component: R label Jun 29, 2021

pachadotdev added 5 commits June 29, 2021 18:35

match

f4814a2

almost ok

0d26eb2

substr/str_sub ok

a5aa76e

str_trim indent

309f22d

str_trim whitesapce

bf79c39

pachadotdev changed the title ~~ARROW-12992: [R] bindings for substr(), substring(), str_sub() - JUST STR SUB SKETCH~~ ARROW-12992: [R] bindings for substr(), substring(), str_sub() Jun 30, 2021

pachadotdev added 2 commits June 30, 2021 13:09

substring

bd72399

Merge remote-tracking branch 'apache/master' into arrow12992_str_sub

7b5ca48

thisisnic reviewed Jul 5, 2021

r/R/dplyr-functions.R

r/R/expression.R

thisisnic reviewed Jul 5, 2021

r/tests/testthat/test-dplyr-string-functions.R Outdated

pachadotdev and others added 6 commits July 5, 2021 17:15

copying from python

8ea6e17

Merge branch 'apache:master' into arrow12992_str_sub

df0ea19

Nic comments

461bfcf

Merge branch 'arrow12992_str_sub' of github.com:pachadotdev/arrow int…

36da8c2

…o arrow12992_str_sub

commit before merge

be55c4e

remove reordering start/end

d1eb2fe

thisisnic reviewed Jul 12, 2021

r/R/dplyr-functions.R Outdated

thisisnic reviewed Jul 12, 2021

r/R/dplyr-functions.R Outdated

thisisnic reviewed Jul 12, 2021

r/R/dplyr-functions.R Outdated

nealrichardson requested changes Jul 12, 2021

Pachá and others added 5 commits July 12, 2021 11:55

Update r/R/dplyr-functions.R

fc3967c

Co-authored-by: Nic <thisisnic@gmail.com>

Update r/R/dplyr-functions.R

439dc66

Co-authored-by: Nic <thisisnic@gmail.com>

dplyr funs spacing

60f36d1

tests

7bd9e78

re-added tests

bd2e688

thisisnic reviewed Jul 12, 2021

pachadotdev and others added 12 commits July 12, 2021 16:11

Merge remote-tracking branch 'apache/master' into arrow12992_str_sub

3d76023

indices <=0 in tests

0053b37

Implement step and make stop implementation match the C++

4611e32

Update tests and function

fc33fc7

remove comments from tests

e866cf8

set stop val

543bb8c

Run styler

2f20db6

Run styler

c67f11b

Save space

989a4f2

Merge branch 'master' into arrow12992_str_sub

748becb

Lint the C++

d1612ba

Merge branch 'arrow12992_str_sub' of github.com:pachadotdev/arrow int…

21d0306

…o ARROW_12992_str_sub

nealrichardson requested changes Jul 14, 2021

thisisnic added 4 commits July 15, 2021 09:44

Add comments to explain non-obvious stuff, use 1 function for substri…

25a9208

…ng and substr, fix brace usage, fix weird linting

Remove redundant tests

7351f3d

more comments to explain

6564200

Add tests for errors

d4c9614

nealrichardson reviewed Jul 15, 2021

thisisnic and others added 5 commits July 15, 2021 12:26

Update r/tests/testthat/test-dplyr-string-functions.R

313e986

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>

Update r/R/dplyr-functions.R

57ac6e2

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>

Fix substr argument name, make substring work with original parameter…

e0d30b5

…s, add explanatory comment

Add more comments

5ee76f5

Merge branch 'arrow12992_str_sub' of github.com:pachadotdev/arrow int…

4f08bdf

…o ARROW_12992_str_sub

nealrichardson approved these changes Jul 15, 2021

nealrichardson closed this in d55383d Jul 15, 2021

asfimport mentioned this pull request Jul 15, 2021

[R] bindings for substr(), substring(), str_sub() #28709

Closed
