stri_opts_brkiter - tune BreakIterator's behavior #108

Closed
gagolews opened this Issue Oct 29, 2014 · 3 comments

Owner

gagolews commented Oct 29, 2014

Allow for tuning BreakIterator's behavior

  • type
  • locale
  • skip_... - break types to ignore

The current implementation of *_boundaries is too restrictive and hence not very useful.

@gagolews gagolews self-assigned this Oct 29, 2014

@gagolews gagolews added this to the stringi-0.3 milestone Oct 29, 2014

Owner

gagolews commented Oct 29, 2014

All right, now we may play like this:

> test <- "The\u00a0above-mentioned    features are very useful. Warm thanks to their developers. 123 456 789"
> stri_split_boundaries(test, stri_opts_brkiter(type="line"))
[[1]]
 [1] "The above-"    "mentioned    " "features "     "are "          "very "         "useful. "      "Warm "         "thanks "      
 [9] "to "           "their "        "developers. "  "123 "          "456 "          "789"          

> stri_split_boundaries(test, stri_opts_brkiter(type="word"))
[[1]]
 [1] "The"        " "          "above"      "-"          "mentioned"  " "          " "          " "          " "          "features"  
[11] " "          "are"        " "          "very"       " "          "useful"     "."          " "          "Warm"       " "         
[21] "thanks"     " "          "to"         " "          "their"      " "          "developers" "."          " "          "123"       
[31] " "          "456"        " "          "789"       

> stri_split_boundaries(test, stri_opts_brkiter(type="word", skip_word_none=TRUE))
[[1]]
 [1] "The"        "above"      "mentioned"  "features"   "are"        "very"       "useful"     "Warm"       "thanks"     "to"        
[11] "their"      "developers" "123"        "456"        "789"       

> stri_split_boundaries(test, stri_opts_brkiter(type="word", skip_word_none=TRUE, skip_word_letter=TRUE))
[[1]]
[1] "123" "456" "789"

> stri_split_boundaries(test, stri_opts_brkiter(type="word", skip_word_none=TRUE, skip_word_number=TRUE))
[[1]]
 [1] "The"        "above"      "mentioned"  "features"   "are"        "very"       "useful"     "Warm"       "thanks"     "to"        
[11] "their"      "developers"

> stri_split_boundaries(test, stri_opts_brkiter(type="sentence"))
[[1]]
[1] "The above-mentioned    features are very useful. " "Warm thanks to their developers. "                
[3] "123 456 789"                                      

> stri_split_boundaries(test, stri_opts_brkiter(type="sentence", skip_sentence_sep=TRUE))
[[1]]
[1] "The above-mentioned    features are very useful. " "Warm thanks to their developers. "                

> stri_split_boundaries(test, stri_opts_brkiter(type="character"))
[[1]]
 [1] "T" "h" "e" " " "a" "b" "o" "v" "e" "-" "m" "e" "n" "t" "i" "o" "n" "e" "d" " " " " " " " " "f" "e" "a" "t" "u" "r" "e" "s" " " "a"
[34] "r" "e" " " "v" "e" "r" "y" " " "u" "s" "e" "f" "u" "l" "." " " "W" "a" "r" "m" " " "t" "h" "a" "n" "k" "s" " " "t" "o" " " "t" "h"
[67] "e" "i" "r" " " "d" "e" "v" "e" "l" "o" "p" "e" "r" "s" "." " " "1" "2" "3" " " "4" "5" "6" " " "7" "8" "9"
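
A note for readers trying the calls above on a newer stringi: as far as I know, the second positional argument of stri_split_boundaries in later releases is `n`, so the option list has to be passed by name as `opts_brkiter`. A minimal sketch under that assumption, reusing the `test` string from the transcript:

```r
library(stringi)

test <- "The\u00a0above-mentioned    features are very useful. Warm thanks to their developers. 123 456 789"

# Assumption: a stringi release where the options are supplied via the
# named argument `opts_brkiter` rather than as the second positional one.
words <- stri_split_boundaries(
  test,
  opts_brkiter = stri_opts_brkiter(type = "word", skip_word_none = TRUE)
)

# Whitespace and punctuation segments should be dropped by skip_word_none.
print(words[[1]])
```

Passing the list by name keeps the call unambiguous regardless of where the argument sits in the signature.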
Owner

gagolews commented Oct 29, 2014

The C++ versions of stri_extract_words and stri_locate_words are no longer needed, so I am removing them.
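
To illustrate why dedicated word routines become redundant, here is a rough sketch of word extraction and location expressed through the generic boundary functions. It assumes a stringi release that provides stri_extract_all_boundaries and stri_locate_all_boundaries (these are not mentioned in this thread), and the input string is made up for the example:

```r
library(stringi)

x <- "Warm thanks to their developers. 123 456"

# One option list drives both calls: word boundaries, with segments that
# are not words (spaces, punctuation) skipped.
opts <- stri_opts_brkiter(type = "word", skip_word_none = TRUE)

# Extract the word-like segments.
found <- stri_extract_all_boundaries(x, opts_brkiter = opts)

# Locate the same segments; each list element is a matrix of start/end
# character positions, one row per segment.
where <- stri_locate_all_boundaries(x, opts_brkiter = opts)

print(found[[1]])
print(where[[1]])
```

Both results come from the same break iterator configuration, so the extracted segments and the located positions line up row by row.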

gagolews added a commit that referenced this issue Oct 29, 2014

gagolews added a commit that referenced this issue Oct 30, 2014

Owner

gagolews commented Oct 30, 2014

OK, stri_trans_toupper now also uses opts_brkiter. Closing.
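
A hedged illustration of a case-mapping call driven by break iterator options. In the stringi releases I am aware of, the function that takes `opts_brkiter` is stri_trans_totitle; whether that matches the function named in the comment above is an assumption on my part. The input string is invented for the example:

```r
library(stringi)

s <- "warm thanks to their developers. these features are useful."

# Assumption: stri_trans_totitle accepts `opts_brkiter`. With
# type = "sentence", title casing is applied per sentence instead of
# per word.
res <- stri_trans_totitle(
  s,
  opts_brkiter = stri_opts_brkiter(type = "sentence")
)

print(res)
```

Switching `type` back to "word" should give the per-word title casing instead.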

@gagolews gagolews closed this Oct 30, 2014
