
stri_locate_all_fixed(), stri_locate_first_fixed(), stri_locate_last_fixed() #12

Closed
gagolews opened this issue Jan 17, 2013 · 9 comments

@gagolews
Owner

Find all/first/last positions of an occurrence of substr in str (vectorized over str and substr).

@bartektartanus
Contributor

To be determined: which result do we expect,
stri_locate_last_fixed("111","11") = c(1,2) or stri_locate_last_fixed("111","11") = c(2,3)?

In other words, do first and last return the first and last rows of the matrix returned by all?
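
The two readings can be sketched in base R, without stringi. This is only an illustration under stated assumptions: `locate_all_naive()` and its `overlap` argument are hypothetical names, not stringi's API or its implementation.

```r
# Hypothetical helper (not part of stringi): naive left-to-right search
# for a fixed pattern, with an option to report overlapping matches.
locate_all_naive <- function(str, pattern, overlap = FALSE) {
  n <- nchar(str)
  k <- nchar(pattern)
  starts <- integer(0)
  i <- 1L
  while (k > 0L && i + k - 1L <= n) {
    if (substr(str, i, i + k - 1L) == pattern) {
      starts <- c(starts, i)
      # Step past the whole match unless overlapping matches are wanted.
      step <- if (overlap) 1L else k
      i <- i + step
    } else {
      i <- i + 1L
    }
  }
  cbind(start = starts, end = starts + k - 1L)
}

locate_all_naive("111", "11")                  # one row: 1 2
locate_all_naive("111", "11", overlap = TRUE)  # two rows: 1 2 and 2 3
```

Without overlap, the last row of "all" is c(1,2), whereas a search that scans from the end of the string would find c(2,3) first. In this sketch the two answers agree only when overlapping matches are reported.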

@gagolews
Owner Author

Very good question.
I think that an option should be added to look for overlapping patterns.

Maybe we should use the ICU API for all stri_____fixed - http://www.icu-project.org/apiref/icu4c/usearch_8h.html#details

It offers many useful options, including locale-dependent ones.

The algorithm you use has a worst-case time complexity of O(nk) (n = length of str, k = length of the pattern). The modified Boyer-Moore algorithm (http://icu-project.org/docs/papers/efficient_text_searching_in_java.html) has the same worst-case complexity, but may still be faster in practice.

What do you think? We should probably rely on ICU wherever possible.
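
To make the O(nk) worst case concrete, the sketch below counts character comparisons in a plain left-to-right search. It assumes the algorithm under discussion behaves like this textbook naive search, which the thread does not confirm; `count_comparisons()` is a hypothetical helper written in base R.

```r
# Hypothetical helper: count character comparisons made by a naive search
# that tries every start position and stops at the first mismatch.
count_comparisons <- function(str, pattern) {
  s <- strsplit(str, "")[[1]]
  p <- strsplit(pattern, "")[[1]]
  n <- length(s)
  k <- length(p)
  comparisons <- 0L
  if (k == 0L || k > n) return(comparisons)
  for (i in seq_len(n - k + 1L)) {
    for (j in seq_len(k)) {
      comparisons <- comparisons + 1L
      if (s[i + j - 1L] != p[j]) break
    }
  }
  comparisons
}

text <- strrep("1", 20)
# The mismatch comes only at the last pattern character: (20 - 5 + 1) * 5 comparisons.
count_comparisons(text, paste0(strrep("1", 4), "2"))  # 80
# The mismatch comes at the first pattern character: one comparison per position.
count_comparisons(text, paste0("2", strrep("1", 4)))  # 16
```

A long run of a repeated character followed by a different one is the kind of input that pushes the naive search toward its worst case.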

@bartektartanus
Contributor

> one <- stri_flatten(c(stri_dup(1,10000),2))
> pat <- stri_flatten(c(stri_dup(1,1000),2))
> print(microbenchmark(stri_locate_all_fixed(one,pat),str_locate_all(one,fixed(pat))))
Unit: milliseconds
                             expr       min        lq    median        uq       max
1 stri_locate_all_fixed(one, pat) 18.151333 18.277081 18.347167 18.616544 19.761872
2 str_locate_all(one, fixed(pat))  1.545308  1.583163  1.651747  1.730458  1.876635

Indeed, sometimes it can be much slower.

@gagolews
Owner Author

Something's wrong:

> stri_locate_first_fixed(c('AaaaaaaA', 'aaa', 'AAA'), 'a')
[[1]]
     start end
[1,]    NA  NA

[[2]]
     start end
[1,]    NA  NA

[[3]]
     start end
[1,]    NA  NA

Moreover, I think that stri_locate_first_fixed() and stri_locate_last_fixed() should return ONE matrix, not a list of matrices...

@bartektartanus
Contributor

Done in commit 524f4e4.

@gagolews
Owner Author

TO DO: use collator_opts

@bartektartanus
Contributor

All functions DONE :) Closing.

@gagolews
Owner Author

Nope, sorry, stri__locate_*_fixed_byte should use StriContainerUTF8, just like *_detect_byte does... 🎱

@bartektartanus
Contributor

DONE
