fm.index

fm.index is an R package providing a fast data structure (FM Index) for finding occurrences of string snippets in large libraries of strings (corpus).

Partial string matching can be ~50-fold faster than simple string scans for many real world string collections. A given corpus is converted into a compact in-memory FM index representation that can be efficiently queried for partial string matches.

fm.index wraps the C++ library SDSL v3 and uses a Compressed Suffix Array based on a Wavelet Tree of the Burrow-Wheeler Transform of the given string library.

Installation

Install the most recent fm.index CRAN release

install.packages("fm.index")

You can install the development version from Github with:

install.packages("remotes")
remotes::install_github("clemenshug/fm.index")

Example

In this example we search a library of state names for all states that contain the substring “new”.

library(fm.index)

data("state")
print(state.name)
#>  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
#>  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
#>  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
#> [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
#> [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
#> [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
#> [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
#> [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
#> [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
#> [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
#> [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
#> [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
#> [49] "Wisconsin"      "Wyoming"

index <- fm_index_create(state.name, case_sensitive = FALSE)
hits <- fm_index_locate("new", index)
print(hits)
#>   pattern_index corpus_index position
#> 1             1           29        1
#> 2             1           30        1
#> 3             1           31        1
#> 4             1           32        1

The result is a data frame with three columns. pattern_index is the index of the query pattern, corpus_index is the index of the matching string in the string corpus, and position is the starting position of the match in the indexed string. All indices are 1-based.

In order to extract the matching states we can use the corpus_index to subset the vector of original state names.

print(state.name[hits$corpus_index])
#> [1] "New Hampshire" "New Jersey"    "New Mexico"    "New York"

Performance

The speedup achieved by using fm.index over simple string scans, for example using grepl() or stringi::stri_locate(), depends on the kind of strings in the library.

Random strings

For this example, we can generate a million random strings of length 50 and search them for all occurrences of the letters “ab”.

library(stringi)
library(microbenchmark)

set.seed(42)
lib_random <- stri_rand_strings(1000000, 50)
idx_random <- fm_index_create(lib_random)

head(fm_index_locate("ab", idx_random))
#>   pattern_index corpus_index position
#> 1             1       792988       36
#> 2             1       251840       23
#> 3             1       875759        1
#> 4             1       341798       47
#> 5             1        56031       50
#> 6             1       376499       40

random_benchmark <- microbenchmark(
  fm.index = fm_index_locate("ab", idx_random),
  stringi = stri_locate_all_coll(lib_random, "ab"),
  times = 3
)
print(random_benchmark)
#> Unit: milliseconds
#>      expr      min        lq      mean    median        uq       max neval cld
#>  fm.index  927.259  932.7593  934.6613  938.2596  938.3625  938.4654     3  a 
#>   stringi 4141.160 4230.1927 4260.2905 4319.2250 4319.8556 4320.4863     3   b

Random strings are hard to compress, so searching using an FM Index only yields a modest ~5-fold speedup.

Real-world text

Real-world text is usually repetitive and much easier to compress than random strings. In this example, we search each line of a book for all instances of the word “help”.

book_path <- tempfile()
download.file("http://aleph.gutenberg.org/1/2/3/7/12370/12370.zip", book_path)
book <- scan(unz(book_path, "12370.txt"), sep = "\n", what = "character")

Five random lines from the book.

book[sample(length(book), 5)]
#> [1] "bring her a true and particular account of that strange circumstance,"  
#> [2] "it. For a long while this was my way, that whatever living beings"      
#> [3] "reason his appearance is such; he is burning with the fire of love; how"
#> [4] "prosperous. What strange fancy has possessed the royal mind! If to this"
#> [5] "water of immortality, and that in consequence of having drunk thereof,"

idx_book <- fm_index_create(book)
help_locations <- fm_index_locate("water", idx_book)

nrow(help_locations)
#> [1] 73
head(help_locations)
#>   pattern_index corpus_index position
#> 1             1         5262       43
#> 2             1         5533       16
#> 3             1         4097       23
#> 4             1         6969       18
#> 5             1         4130       13
#> 6             1         4141       46

The number of rows in the resulting data frame tells us that the word “water” occurs 73 times.

book_benchmark <- microbenchmark(
  fm.index = fm_index_locate("water", idx_book),
  stringi = stri_locate_all_coll(book, "water"),
  times = 10
)
print(book_benchmark)
#> Unit: microseconds
#>      expr       min        lq       mean     median       uq      max neval cld
#>  fm.index   513.489   535.863   635.2254   583.5145   762.65   876.80    10  a 
#>   stringi 31313.135 31425.868 31925.5229 31852.8845 32288.81 32780.87    10   b

The text from this book is encoded very efficiently by the FM Index, resulting in a ~55-fold speedup.

Funding

This work was supported by NIH grants U54-HL127365, U24-DK116204, and U54-HL127624.

License

This package is provided under the MIT license. The bundled SDSL library is licensed under the BSD license.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
.github		.github
R		R
man		man
src		src
tests		tests
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
LICENSE.md		LICENSE.md
NAMESPACE		NAMESPACE
README.Rmd		README.Rmd
README.md		README.md
_pkgdown.yml		_pkgdown.yml
cran-comments.md		cran-comments.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Licenses found

Repository files navigation

fm.index

Installation

Example

Performance

Random strings

Real-world text

Funding

License

About

Licenses found

Releases 1

Languages

License

Licenses found

clemenshug/fm.index

Folders and files

Latest commit

History

Repository files navigation

fm.index

Installation

Example

Performance

Random strings

Real-world text

Funding

License

About

Resources

License

Licenses found

Stars

Watchers

Forks

Releases 1

Languages