New version of stri_replace_all_fixed needed (vectorize_all=FALSE) #102

Closed
potockan opened this Issue Oct 18, 2014 · 17 comments

Comments

Projects
None yet
4 participants
@potockan

To proceed with another project I need a new version of stri_replace_all_fixed. Is it possible you could make it?

It should take 3 arguments the same as in the original function. But it should replace the i-th found pattern with the i-th element of the replacement pattern.

Example:
stri_replace_all_fixed("AAAA abcdef BBBB",c("AAAA", "BBBB"), c("1111", "2222"))

In result I would like to have one string looking like this: "1111 abcdef 2222" (right now I get a vector of two strings: "1111 abcdef BBBB" "AAAA abcdef 2222").

@gagolews gagolews self-assigned this Oct 18, 2014

@gagolews gagolews added this to the stringi-0.3 milestone Oct 18, 2014

@gagolews

This comment has been minimized.

Show comment
Hide comment
@gagolews

gagolews Oct 18, 2014

Owner

Thanks for the suggestion. We'll get that done ASAP - that's a very useful function.
[INTERNAL: this is somehow related to issue #46]

Do you have any suggestions concerning the way to deal with overlapping patterns?

Case 1:

stri_NEW_REPLACE("aba", c("ab", "ba"), c("x", "y"))

The expected result should be (a) xa or (b) ay (c) error (d) a with a warning, (e) b with a warning?

Case 2:

stri_NEW_REPLACE("aba", c("ab", "ca"), c("c", "d"))

The expected result should be (a) ca or (b) d ?

Owner

gagolews commented Oct 18, 2014

Thanks for the suggestion. We'll get that done ASAP - that's a very useful function.
[INTERNAL: this is somehow related to issue #46]

Do you have any suggestions concerning the way to deal with overlapping patterns?

Case 1:

stri_NEW_REPLACE("aba", c("ab", "ba"), c("x", "y"))

The expected result should be (a) xa or (b) ay (c) error (d) a with a warning, (e) b with a warning?

Case 2:

stri_NEW_REPLACE("aba", c("ab", "ca"), c("c", "d"))

The expected result should be (a) ca or (b) d ?

@drammock

This comment has been minimized.

Show comment
Hide comment
@drammock

drammock Oct 19, 2014

Why is a new function needed here? what is wrong with doing a nested call:

stri_replace_all_fixed(stri_replace_all_fixed("AAAA abcdef BBBB", "AAAA", "1111"), "BBBB", "2222")

Why is a new function needed here? what is wrong with doing a nested call:

stri_replace_all_fixed(stri_replace_all_fixed("AAAA abcdef BBBB", "AAAA", "1111"), "BBBB", "2222")
@gagolews

This comment has been minimized.

Show comment
Hide comment
@gagolews

gagolews Oct 19, 2014

Owner

If the number of patterns is large, multiple calls to stri_replace_all_fixed is inconvenient (and slow). In PHP we have the str_replace function, which I find quite useful.

BTW1:

$ php -r 'echo(str_replace(array("ab", "ba"), array("x", "y"), "aba\n"));'
xa
$ php -r 'echo(str_replace(array("ab", "ca"), array("c", "d"), "aba\n"));'
d

BTW2: If replace has fewer values than search, then an empty string is used for the rest of replacement values. If search is an array and replace is a string, then this replacement string is used for every value of search. The converse would not make sense, though.

BTW3: preg_replace -- a similarly vectorized version using regexes

What do you think about that?

Owner

gagolews commented Oct 19, 2014

If the number of patterns is large, multiple calls to stri_replace_all_fixed is inconvenient (and slow). In PHP we have the str_replace function, which I find quite useful.

BTW1:

$ php -r 'echo(str_replace(array("ab", "ba"), array("x", "y"), "aba\n"));'
xa
$ php -r 'echo(str_replace(array("ab", "ca"), array("c", "d"), "aba\n"));'
d

BTW2: If replace has fewer values than search, then an empty string is used for the rest of replacement values. If search is an array and replace is a string, then this replacement string is used for every value of search. The converse would not make sense, though.

BTW3: preg_replace -- a similarly vectorized version using regexes

What do you think about that?

@potockan

This comment has been minimized.

Show comment
Hide comment
@potockan

potockan Oct 19, 2014

In the project I only have separate cases so I guess it won't be an issue. I think that the most natural thing to do in case of overlapping patterns is to throw an error.

Maybe a good idea would be if you could choose the way the function behaves with error as the default...?

In the project I only have separate cases so I guess it won't be an issue. I think that the most natural thing to do in case of overlapping patterns is to throw an error.

Maybe a good idea would be if you could choose the way the function behaves with error as the default...?

@gagolews

This comment has been minimized.

Show comment
Hide comment
@gagolews

gagolews Oct 19, 2014

Owner

I don't want to complicate the function too much; either an error, or str_replace, i.e. for (i in seq_along(pattern)) str <- stri_replace(str, pattern[i], replacement[i])-like behavior... both of them make sense; the first one, additionally, could be much faster, at least for the fixed case

Owner

gagolews commented Oct 19, 2014

I don't want to complicate the function too much; either an error, or str_replace, i.e. for (i in seq_along(pattern)) str <- stri_replace(str, pattern[i], replacement[i])-like behavior... both of them make sense; the first one, additionally, could be much faster, at least for the fixed case

@potockan

This comment has been minimized.

Show comment
Hide comment
@potockan

potockan Oct 19, 2014

So let's do it with an error option, as I'm going to use it a lot ( -> the speed of a funciton plays an important role).

So let's do it with an error option, as I'm going to use it a lot ( -> the speed of a funciton plays an important role).

@bartektartanus

This comment has been minimized.

Show comment
Hide comment
@bartektartanus

bartektartanus Oct 19, 2014

Contributor

I think that this version:

for (i in seq_along(pattern)) str <- stri_replace(str, pattern[i], replacement[i])

is the most natural :)

Contributor

bartektartanus commented Oct 19, 2014

I think that this version:

for (i in seq_along(pattern)) str <- stri_replace(str, pattern[i], replacement[i])

is the most natural :)

@gagolews

This comment has been minimized.

Show comment
Hide comment
@gagolews

gagolews Oct 22, 2014

Owner

BTW, the same issue is already open: #47, closing the old one (more details here)

Owner

gagolews commented Oct 22, 2014

BTW, the same issue is already open: #47, closing the old one (more details here)

@gagolews

This comment has been minimized.

Show comment
Hide comment
@gagolews

gagolews Oct 23, 2014

Owner

Hmmm.... Maybe this feature will be available in stri_replace_all via the vectorize=FALSE arg?....
Do we need that for stri_replace_first and stri_replace_last?

Owner

gagolews commented Oct 23, 2014

Hmmm.... Maybe this feature will be available in stri_replace_all via the vectorize=FALSE arg?....
Do we need that for stri_replace_first and stri_replace_last?

@potockan

This comment has been minimized.

Show comment
Hide comment
@potockan

potockan Oct 24, 2014

I don't think there would be a need for stri_replace_first and stri_replace_last (if you seach for the first/last pattern then you just want one to replace it with), so we can go with vectorize=FALSE argument thing.

I don't think there would be a need for stri_replace_first and stri_replace_last (if you seach for the first/last pattern then you just want one to replace it with), so we can go with vectorize=FALSE argument thing.

@gagolews

This comment has been minimized.

Show comment
Hide comment
@gagolews

gagolews Oct 31, 2014

Owner

http://stackoverflow.com/questions/26676045/ - we have some reference solutions for benchmark tests :)

Owner

gagolews commented Oct 31, 2014

http://stackoverflow.com/questions/26676045/ - we have some reference solutions for benchmark tests :)

@gagolews gagolews changed the title from New version of stri_replace_all_fixed needed to New version of stri_replace_all_fixed needed (vectorize_all=FALSE) Nov 1, 2014

@gagolews

This comment has been minimized.

Show comment
Hide comment
@gagolews

gagolews Nov 1, 2014

Owner

So far so good:

> string <- "The quick brown fox jumped over the lazy dog."
> patterns     <- c("quick", "brown", "fox")
> replacements <- c("slow",  "black", "bear")
> stri_replace_all_fixed(string, patterns, replacements, vectorize_all = FALSE)
[1] "The slow black bear jumped over the lazy dog."

Benchmarks:

string <- readLines("http://www.gutenberg.org/files/31536/31536-0.txt", encoding="UTF-8")
patterns <- c("jest", "Tadeusz", "lub", "razem", "nas", "przy", "oczy",
   "sam", "tylko", "bez", "ich", "Telimena", "Wojski", "jeszcze",
   "stringi", "gratyfikacja", "szczebrzeszyn", "Wólka", "Gryzelda",
   "Opiniusz", "Wernyhora", "Krajobraz", "Piedestał")
replacements <- paste0(patterns, rev(patterns))


xxx_replace_xxx2 <- compiler::cmpfun(function(string, patterns, replacements) {
   for (i in seq_along(patterns))
      string <- gsub(patterns[i], replacements[i], string, perl=TRUE)
   string
})

xxx_replace_xxx1 <- compiler::cmpfun(function(string, patterns, replacements) {
   for (i in seq_along(patterns))
      string <- gsub(patterns[i], replacements[i], string, fixed=TRUE)
   string
})

microbenchmark::microbenchmark(
   xxx_replace_xxx1(string, patterns, replacements),
   xxx_replace_xxx2(string, patterns, replacements),
   stri_replace_all_fixed(string, patterns, replacements, vectorize_all=FALSE),
   times=10
)

give:

Unit: milliseconds
                                                                          expr       min        lq      mean    median        uq       max neval
                              xxx_replace_xxx1(string, patterns, replacements) 287.49586 288.89391 290.62425 290.25614 292.08440 294.80388    10
                              xxx_replace_xxx2(string, patterns, replacements)  88.21668  88.43505  97.44566  90.09334  92.13888 163.92009    10
 stri_replace_all_fixed(string, patterns, replacements, vectorize_all = FALSE)  32.69499  32.85521  33.59748  33.26405  33.94564  36.14842    10

I expected a more drastic speedup, but it's fine ;)

and:

microbenchmark::microbenchmark(
   xxx_replace_xxx1(string, "się", "mię"),
   xxx_replace_xxx2(string, "się", "mię"),
   stri_replace_all_fixed(string, "się", "mię", vectorize_all=FALSE),
   times=10
)

give:

Unit: milliseconds
                                                                expr       min        lq      mean    median        uq       max neval
                              xxx_replace_xxx1(string, "się", "mię") 13.243450 13.299965 13.500108 13.321585 13.421076 14.336925    10
                              xxx_replace_xxx2(string, "się", "mię")  3.847096  3.869993  4.505977  4.236147  4.863096  6.465639    10
 stri_replace_all_fixed(string, "się", "mię", vectorize_all = FALSE)  3.631004  3.688168  3.777327  3.709231  3.814820  4.285190    10
Owner

gagolews commented Nov 1, 2014

So far so good:

> string <- "The quick brown fox jumped over the lazy dog."
> patterns     <- c("quick", "brown", "fox")
> replacements <- c("slow",  "black", "bear")
> stri_replace_all_fixed(string, patterns, replacements, vectorize_all = FALSE)
[1] "The slow black bear jumped over the lazy dog."

Benchmarks:

string <- readLines("http://www.gutenberg.org/files/31536/31536-0.txt", encoding="UTF-8")
patterns <- c("jest", "Tadeusz", "lub", "razem", "nas", "przy", "oczy",
   "sam", "tylko", "bez", "ich", "Telimena", "Wojski", "jeszcze",
   "stringi", "gratyfikacja", "szczebrzeszyn", "Wólka", "Gryzelda",
   "Opiniusz", "Wernyhora", "Krajobraz", "Piedestał")
replacements <- paste0(patterns, rev(patterns))


xxx_replace_xxx2 <- compiler::cmpfun(function(string, patterns, replacements) {
   for (i in seq_along(patterns))
      string <- gsub(patterns[i], replacements[i], string, perl=TRUE)
   string
})

xxx_replace_xxx1 <- compiler::cmpfun(function(string, patterns, replacements) {
   for (i in seq_along(patterns))
      string <- gsub(patterns[i], replacements[i], string, fixed=TRUE)
   string
})

microbenchmark::microbenchmark(
   xxx_replace_xxx1(string, patterns, replacements),
   xxx_replace_xxx2(string, patterns, replacements),
   stri_replace_all_fixed(string, patterns, replacements, vectorize_all=FALSE),
   times=10
)

give:

Unit: milliseconds
                                                                          expr       min        lq      mean    median        uq       max neval
                              xxx_replace_xxx1(string, patterns, replacements) 287.49586 288.89391 290.62425 290.25614 292.08440 294.80388    10
                              xxx_replace_xxx2(string, patterns, replacements)  88.21668  88.43505  97.44566  90.09334  92.13888 163.92009    10
 stri_replace_all_fixed(string, patterns, replacements, vectorize_all = FALSE)  32.69499  32.85521  33.59748  33.26405  33.94564  36.14842    10

I expected a more drastic speedup, but it's fine ;)

and:

microbenchmark::microbenchmark(
   xxx_replace_xxx1(string, "się", "mię"),
   xxx_replace_xxx2(string, "się", "mię"),
   stri_replace_all_fixed(string, "się", "mię", vectorize_all=FALSE),
   times=10
)

give:

Unit: milliseconds
                                                                expr       min        lq      mean    median        uq       max neval
                              xxx_replace_xxx1(string, "się", "mię") 13.243450 13.299965 13.500108 13.321585 13.421076 14.336925    10
                              xxx_replace_xxx2(string, "się", "mię")  3.847096  3.869993  4.505977  4.236147  4.863096  6.465639    10
 stri_replace_all_fixed(string, "się", "mię", vectorize_all = FALSE)  3.631004  3.688168  3.777327  3.709231  3.814820  4.285190    10
@gagolews

This comment has been minimized.

Show comment
Hide comment
@gagolews

gagolews Nov 1, 2014

Owner

AAaaaaaargh!!! Holy moly!

xxx_replace_xxx3 <- compiler::cmpfun(function(string, patterns, replacements) {
   for (i in seq_along(patterns))
      string <- stri_replace_all_fixed(patterns[i], replacements[i], string)
   string
})
microbenchmark::microbenchmark(
   xxx_replace_xxx2(string, patterns, replacements),
   xxx_replace_xxx3(string, patterns, replacements),
   stri_replace_all_fixed(string, patterns, replacements, vectorize_all=FALSE),
   times=100
)
Unit: milliseconds
                                                                          expr      min       lq     mean   median       uq      max neval
                              xxx_replace_xxx2(string, patterns, replacements) 85.33796 85.55921 86.38208 85.78053 87.71393 88.69729   100
                              xxx_replace_xxx3(string, patterns, replacements) 10.18930 10.26224 10.85178 10.36093 10.75792 12.78144   100
 stri_replace_all_fixed(string, patterns, replacements, vectorize_all = FALSE) 30.25190 30.46228 30.59021 30.54362 30.62306 32.72769   100
Owner

gagolews commented Nov 1, 2014

AAaaaaaargh!!! Holy moly!

xxx_replace_xxx3 <- compiler::cmpfun(function(string, patterns, replacements) {
   for (i in seq_along(patterns))
      string <- stri_replace_all_fixed(patterns[i], replacements[i], string)
   string
})
microbenchmark::microbenchmark(
   xxx_replace_xxx2(string, patterns, replacements),
   xxx_replace_xxx3(string, patterns, replacements),
   stri_replace_all_fixed(string, patterns, replacements, vectorize_all=FALSE),
   times=100
)
Unit: milliseconds
                                                                          expr      min       lq     mean   median       uq      max neval
                              xxx_replace_xxx2(string, patterns, replacements) 85.33796 85.55921 86.38208 85.78053 87.71393 88.69729   100
                              xxx_replace_xxx3(string, patterns, replacements) 10.18930 10.26224 10.85178 10.36093 10.75792 12.78144   100
 stri_replace_all_fixed(string, patterns, replacements, vectorize_all = FALSE) 30.25190 30.46228 30.59021 30.54362 30.62306 32.72769   100
@bartektartanus

This comment has been minimized.

Show comment
Hide comment
@bartektartanus

bartektartanus Nov 1, 2014

Contributor

Simple solution is the best solution :)

Contributor

bartektartanus commented Nov 1, 2014

Simple solution is the best solution :)

@gagolews

This comment has been minimized.

Show comment
Hide comment
@gagolews

gagolews Nov 1, 2014

Owner

Aaaaaaaaaargh!!!!

Owner

gagolews commented Nov 1, 2014

Aaaaaaaaaargh!!!!

@bartektartanus

This comment has been minimized.

Show comment
Hide comment
@bartektartanus

bartektartanus Nov 1, 2014

Contributor

But is everything right?

xxx_replace_xxx3 <- compiler::cmpfun(function(string, patterns, replacements) {
   for (i in seq_along(patterns))
      string <- stri_replace_all_fixed(patterns[i], replacements[i], string)
   string
})

I think you've got an error inside for loop.
It should be like this, doesn't it?
stri_replace_all_fixed(string, patterns[i], replacements[i])

Contributor

bartektartanus commented Nov 1, 2014

But is everything right?

xxx_replace_xxx3 <- compiler::cmpfun(function(string, patterns, replacements) {
   for (i in seq_along(patterns))
      string <- stri_replace_all_fixed(patterns[i], replacements[i], string)
   string
})

I think you've got an error inside for loop.
It should be like this, doesn't it?
stri_replace_all_fixed(string, patterns[i], replacements[i])

@gagolews

This comment has been minimized.

Show comment
Hide comment
@gagolews

gagolews Nov 1, 2014

Owner

Shish!!! Thanks!!

Unit: milliseconds
                                                                          expr      min       lq     mean   median       uq      max neval
                              xxx_replace_xxx3(string, patterns, replacements) 35.03884 35.34323 36.10783 35.74151 37.09618 38.87210   100
 stri_replace_all_fixed(string, patterns, replacements, vectorize_all = FALSE) 32.57453 32.72813 33.06105 32.84771 33.14467 36.17355   100

Anyway, I will implement that using consecutive calls to replace_all. Searching for pattern occurrences takes the most time. That'll be easier :)

Owner

gagolews commented Nov 1, 2014

Shish!!! Thanks!!

Unit: milliseconds
                                                                          expr      min       lq     mean   median       uq      max neval
                              xxx_replace_xxx3(string, patterns, replacements) 35.03884 35.34323 36.10783 35.74151 37.09618 38.87210   100
 stri_replace_all_fixed(string, patterns, replacements, vectorize_all = FALSE) 32.57453 32.72813 33.06105 32.84771 33.14467 36.17355   100

Anyway, I will implement that using consecutive calls to replace_all. Searching for pattern occurrences takes the most time. That'll be easier :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment