stri_split, stri_extract_all: add `simplify` arg #106

Closed
gagolews opened this Issue Oct 23, 2014 · 5 comments

gagolews commented Oct 23, 2014

Add a new arg to stri_split_* and stri_extract_all_*: simplify (defaults to FALSE for backward compatibility).

If simplify=TRUE, then call stri_list2matrix(RESULT_OBJECT, byrow=TRUE)

Currently we have:

> stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=3, tokens_only=TRUE, omit_empty=TRUE)
[[1]]
[1] "ab" "c" 

[[2]]
[1] "d"  "ef" "g" 

[[3]]
[1] "h"

[[4]]
character(0)

with simplify=TRUE we'd get:

> stri_list2matrix(stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=3, tokens_only=TRUE, omit_empty=TRUE), byrow=TRUE)
     [,1] [,2] [,3]
[1,] "ab" "c"  NA  
[2,] "d"  "ef" "g" 
[3,] "h"  NA   NA  
[4,] NA   NA   NA  

which is far more appropriate than doing:

> do.call(rbind, stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=3, tokens_only=TRUE, omit_empty=TRUE))
     [,1] [,2] [,3]
[1,] "ab" "c"  "ab"
[2,] "d"  "ef" "g" 
[3,] "h"  "h"  "h" 
Warning message:
In (function (..., deparse.level = 1)  :
  number of columns of result is not a multiple of vector length (arg 1)

especially if there are non-equal numbers of tokens in the result after a split op.
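
The difference between the two approaches can be reproduced without stringi. What follows is a minimal base R sketch, not the stringi implementation: `pad_to_matrix` is a hypothetical helper that only imitates the NA padding shown above, and the token list is typed in by hand from the split result.

```r
# Tokens copied by hand from the split result above (no stringi needed).
tokens <- list(c("ab", "c"), c("d", "ef", "g"), "h", character(0))

# rbind() recycles the shorter vectors and drops the zero-length one.
recycled <- suppressWarnings(do.call(rbind, tokens))

# Hypothetical helper: pad every element with `fill` up to the longest
# length, then stack the elements as rows.
pad_to_matrix <- function(x, fill = NA_character_) {
  n <- max(lengths(x), 0L)
  rows <- lapply(x, function(v) c(v, rep(fill, n - length(v))))
  matrix(as.character(unlist(rows)), nrow = length(x), ncol = n, byrow = TRUE)
}

padded <- pad_to_matrix(tokens)

dim(recycled)  # 3 rows: the empty input string gets no row of its own
dim(padded)    # 4 rows: one per input string
```

With padding, row i of the matrix always corresponds to input string i, which the rbind version cannot guarantee once an input yields no tokens.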

@gagolews gagolews self-assigned this Oct 23, 2014

@gagolews gagolews added this to the stringi-0.3 milestone Oct 23, 2014

@gagolews gagolews changed the title from stri_split, stri_extract: add `simplify` arg to stri_split, stri_extract_all: add `simplify` arg Oct 23, 2014

gagolews Oct 23, 2014

Owner

A benchmark for stri_split, see this SO post

> library(data.table)
> library(stringr)
> library(stringi)  # provides stri_split_fixed
> 
> df1 <- data.frame(id = 1:1000, letter1 = rep(letters[sample(1:25, 1000, replace = TRUE)], 40))
> df1$combCol1 <- paste(df1$id, ',',df1$letter1, sep = '')
> df1$combCol2 <- paste(df1$combCol1, ',', df1$combCol1, sep = '')
> 
> st1 <- str_split_fixed(df1$combCol2, ',', 2)
> 
> identical(do.call(rbind, regmatches(df1$combCol2, regexpr(",", df1$combCol2), invert = TRUE)), stri_split_fixed(df1$combCol2, ",", 2, simplify=TRUE))
[1] TRUE
> identical(do.call(rbind, regmatches(df1$combCol2, regexpr(",", df1$combCol2), invert = TRUE)), do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2)))
[1] TRUE
>
> library("microbenchmark")
> microbenchmark(
+ #   str_split_fixed(df1$combCol2, ',', 2),
+ #   str_split_fixed(df1$combCol2, fixed(','), 2),
+    do.call(rbind, regmatches(df1$combCol2, regexpr(",", df1$combCol2), 
+                             invert = TRUE)),
+    do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2)),
+    stri_split_fixed(df1$combCol2, ",", 2, simplify=TRUE),
+    times = 10
+ )
Unit: milliseconds
                                                                                     expr       min        lq      mean    median       uq       max neval
 do.call(rbind, regmatches(df1$combCol2, regexpr(",", df1$combCol2),      invert = TRUE)) 611.12734 631.18264 654.67182 648.09246 660.6387 727.21374    10
                                   do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2))  52.32473  57.10677  62.85562  59.56059  66.4952  79.91838    10
                                  stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)  18.10234  18.53057  20.04524  19.30985  21.2226  24.62089    10
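
To put the table in relative terms, here is a small sketch that divides the reported medians by the fastest one. The numbers are copied from the output above. The vector names are made up for illustration, and ten runs on one machine are only a rough indication.

```r
# Median timings in milliseconds, copied from the microbenchmark output above.
median_ms <- c(
  regmatches_rbind    = 648.09246,
  stri_split_rbind    = 59.56059,
  stri_split_simplify = 19.30985
)

# How many times slower each variant is than simplify = TRUE.
slowdown <- median_ms / median_ms[["stri_split_simplify"]]
round(slowdown, 1)
```

On these numbers, simplify=TRUE is roughly 3 times faster than calling rbind on the list result and over 30 times faster than the regmatches approach.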

gagolews Oct 23, 2014

Owner

All right, still TO DO: stri_extract_words, stri_split_boundaries. Good night!

mrdwab Oct 24, 2014

This is awesome!

But... I'm now going to have to rewrite my "splitstackshape" package once "stringi-0.3" comes out :-)

I'm not sure that I understand the omit_empty argument in stri_split* and how it interacts with simplify. Row 2 is what confuses me most, but I'm also curious about row 3.

Consider the following:

vec <- c("A, B, ", "", ",C, D", "E, F,G")
stri_split_fixed(vec, ",", simplify = TRUE, omit_empty = TRUE)
#      [,1] [,2] [,3]
# [1,] "A"  " B" " " 
# [2,] NA   NA   NA  
# [3,] "C"  " D" NA  
# [4,] "E"  " F" "G" 

stri_split_fixed(vec, ",", simplify = TRUE, omit_empty = FALSE)
#      [,1] [,2] [,3]
# [1,] "A"  " B" " " 
# [2,] ""   NA   NA  
# [3,] ""   "C"  " D"
# [4,] "E"  " F" "G" 

What would be the best way to make sure that all empties have NA (or my preferred fill character)? Calling stri_list2matrix directly?

gagolews Oct 24, 2014

Owner

@mrdwab, OK, I filed a feature request in #107. That'll be very easy.

To get

#      [,1] [,2] [,3]
# [1,] "A"  "B"  NA  
# [2,] NA   NA   NA  
# [3,] NA   "C"  "D" 
# [4,] "E"  "F"  "G" 

we'll have to write:

stri_split_regex(vec, "\\p{White_space}*,\\p{White_space}*", simplify = TRUE, omit_empty = NA)
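
Until omit_empty = NA is available, a similar result can be approximated by post-processing the simplified matrix. This is a base R sketch under assumptions, not the stringi API: the matrix is typed in from the omit_empty = FALSE output quoted earlier in the thread, and the "-" fill character is an arbitrary choice for illustration.

```r
# Matrix copied from the omit_empty = FALSE output above, so the sketch
# runs without stringi installed.
m <- matrix(c("A", " B", " ",
              "",  NA,   NA,
              "",  "C",  " D",
              "E", " F", "G"),
            nrow = 4, byrow = TRUE)

m[] <- trimws(m)           # strip surrounding whitespace, keep the dimensions
m[which(m == "")] <- NA    # treat empty tokens as missing

# Optional: swap NA for a preferred fill character ("-" is arbitrary).
filled <- m
filled[is.na(filled)] <- "-"
```

After these steps `m` matches the target matrix shown above, and `filled` carries the custom fill in every empty or missing cell.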

gagolews Oct 30, 2014

Owner

Hmm.. we don't need that for stri_split_words, there's no na_omit / n_max argument. Closing == DONE

@gagolews gagolews closed this Oct 30, 2014
