Fast multi-pattern string matching in R with the Aho-Corasick algorithm.
ahocorasick, powered by the Rust
aho-corasick crate, allows you to
search for many substrings in one or more strings with linear complexity
in R. It builds a reusable automaton once and then uses it for
detection, counting, locating, and extraction and replacement.
Aho-Corasick is
a multiple-pattern string matching algorithm designed to search for many
keywords in a text in a single pass. After preprocessing the pattern set
into an automaton, it finds all matches in
O(text_length + number_of_matches), instead of the naive
O(text_length × number_of_patterns) approach. This makes it especially
suitable for high-throughput tasks such as dictionary matching, keyword
extraction, rule-based text filtering, and large-scale log or document
analysis.
For more details on the Aho-Corasick algorithm, please see Aho and Corasick (1975).
Install the released version from CRAN:
install.packages("ahocorasick")Install the released version from R-universe/R-multiverse:
install.packages("ahocorasick", repos = "https://yousa-mirage.r-universe.dev")
# or
install.packages("ahocorasick", repos = "https://community.r-multiverse.org")Install the development version from source. You need Cargo and rustc
on PATH.
# install.packages("pak")
pak::pak("Yousa-Mirage/r-ahocorasick")Start with a character vector of patterns you want to search for, then
compile it into an <ac_automaton> object by calling ac_build() that
can be reused across many documents.
library(ahocorasick)
patterns <- c("hello", "world", "fish")
doc <- c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
ac <- ac_build(patterns)You can set ascii_case_insensitive = TRUE to make the search
case-insensitive for ASCII characters (a-z and A-Z) only.
ac_extract() returns one data frame per document, with both the
matched text and the pattern value that produced each match.
patterns <- c("hello", "world", "fish")
doc <- c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
ac <- ac_build(patterns)
hits <- ac_extract(ac, doc)
hits[[1]]
#> matches patterns
#> 1 hello hello
#> 2 world world
#> 3 hello helloUse ac_extract_df() when you want one data frame for all documents.
ac_extract_df(ac, doc)
#> doc_id matches patterns
#> 1 1 hello hello
#> 2 1 world world
#> 3 1 hello hello
#> 4 3 fish fishSet overlapping = TRUE to return overlapping matches. This is only
supported when the automaton was built with the default
match_kind = "standard".
ac_overlap <- ac_build(c("aba", "bab"))
ac_extract_df(ac_overlap, "ababa", overlapping = TRUE)
#> doc_id matches patterns
#> 1 1 aba aba
#> 2 1 bab bab
#> 3 1 aba abaac_locate() returns one data frame per document. Offsets are 1-based
character positions, inclusive on both sides, so they can be used with
substr().
patterns <- c("hello", "world", "fish")
doc <- c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
ac <- ac_build(patterns)
hits <- ac_locate(ac, doc)
hits[[1]]
#> pattern_id start end
#> 1 1 18 22
#> 2 2 24 28
#> 3 1 31 35Use ac_locate_df() when you want one data frame for all documents.
ac_locate_df(ac, doc)
#> doc_id pattern_id start end
#> 1 1 1 18 22
#> 2 1 2 24 28
#> 3 1 1 31 35
#> 4 3 3 1 4Use ac_locate_bytes() when byte offsets are more useful than R
character positions. Byte offsets are 0-based and byte_end is
end-exclusive.
Use ac_detect() when you only need to know whether a document contains
at least one match, and ac_count() when you need the number of
matches.
patterns <- c("hello", "world", "fish")
doc <- c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
ac <- ac_build(patterns)
ac_detect(ac, doc)
#> [1] TRUE FALSE TRUE
ac_count(ac, doc)
#> [1] 3 0 1These functions are usually the cheapest way to answer yes/no or frequency questions without materializing the matched strings or offsets.
Use ac_replace() to replace every non-overlapping match with the
replacement for the matched pattern.
patterns <- c("fox", "brown", "quick")
doc <- "The quick brown fox."
ac <- ac_build(patterns)
ac_replace(ac, doc, c("sloth", "grey", "slow"))
#> [1] "The slow grey sloth."Use one replacement string to replace all patterns with the same value.
ac_replace(ac, doc, "")
#> [1] "The ."Replacement uses the automaton’s match semantics. If you want Polars
replace_many(..., leftmost = TRUE)-style priority, build the automaton
with match_kind = "leftmost_first".
ac_leftmost <- ac_build(
c("append", "appendage", "app"),
match_kind = "leftmost_first"
)
ac_replace(ac_leftmost, "append the app to the appendage", c("x", "y", "z"))
#> [1] "x the z to the xage"The ac_*_file() family applies the same automaton to one or more
files. Pass a vector of file paths and each function returns results
aligned with those files.
ac_files <- ac_build(c("hello", "world"))
paths <- c(tempfile(fileext = ".txt"), tempfile(fileext = ".txt"))
writeLines("hello world", paths[[1]])
writeLines("fish and chips", paths[[2]])
ac_detect_file(ac_files, paths)
#> [1] TRUE FALSE
ac_count_file(ac_files, paths)
#> [1] 2 0
ac_extract_file(ac_files, paths)
#> [[1]]
#> matches patterns
#> 1 hello hello
#> 2 world world
#>
#> [[2]]
#> [1] matches patterns
#> <0 rows> (or 0-length row.names)ac_detect_file(), ac_count_file(), ac_extract_file(), and
ac_replace_file() support stream = TRUE when the automaton was built
with the default match_kind = "standard". This is useful for large
files when you want to avoid reading the whole file into memory at once.
ac_locate_file() does not support stream = TRUE. It returns R
character offsets instead of raw byte offsets, and converting streamed
byte positions back into character positions would require another pass
over the file. Keeping file location search non-streaming makes the
behavior simpler and easier to reason about.
These search functions work naturally inside dplyr::mutate(). Scalar
outputs like ac_detect(), ac_count(), and ac_replace() become
ordinary columns, while ac_extract() and ac_locate() can be stored
as list-columns.
patterns <- c("hello", "world", "fish")
ac <- ac_build(patterns)
docs <- tibble::tibble(
doc = c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
)
docs |>
dplyr::mutate(
detected = ac_detect(ac, doc),
n_matches = ac_count(ac, doc),
replaced = ac_replace(ac, doc, "[match]"),
extracted = ac_extract(ac, doc)
) |>
tidyr::unnest(extracted)
#> # A tibble: 4 × 6
#> doc detected n_matches replaced matches patterns
#> <chr> <lgl> <int> <chr> <chr> <chr>
#> 1 this is my first hello world. he… TRUE 3 this is… hello hello
#> 2 this is my first hello world. he… TRUE 3 this is… world world
#> 3 this is my first hello world. he… TRUE 3 this is… hello hello
#> 4 fish and chips TRUE 1 [match]… fish fishSearch functions let you choose how NA documents behave.
doc_na <- c("hello", NA_character_, "world")
ac_na <- ac_build(c("hello", "world"))
ac_detect(ac_na, doc_na, na = "false")
#> [1] TRUE FALSE TRUE
ac_count(ac_na, doc_na, na = "zero")
#> [1] 1 0 1
ac_replace(ac_na, doc_na, "[match]", na = "empty")
#> [1] "[match]" "" "[match]"For list-column workflows, ac_locate(..., na = "empty") and
ac_extract(..., na = "empty") treat missing documents as no matches.
ac_build() exposes three match kinds from the Rust library.
The default, match_kind = "standard", returns the match that finishes
first. It is also the only mode that supports overlapping search.
patterns <- c("content", "disco", "disc", "discontent", "winter")
haystack <- "This is the winter of my discontent"
ac <- ac_build(patterns)
ac_extract_df(ac, haystack)
#> doc_id matches patterns
#> 1 1 winter winter
#> 2 1 disc disc
ac_extract_df(ac, haystack, overlapping = TRUE)
#> doc_id matches patterns
#> 1 1 winter winter
#> 2 1 disc disc
#> 3 1 disco disco
#> 4 1 discontent discontent
#> 5 1 content contentmatch_kind = "leftmost_first" returns the leftmost non-overlapping
match. If several patterns start at the same position, the earlier
pattern wins.
ac <- ac_build(
c("disco", "disc"),
match_kind = "leftmost_first"
)
ac_extract_df(ac, "discontent")
#> doc_id matches patterns
#> 1 1 disco discomatch_kind = "leftmost_longest" returns the leftmost non-overlapping
match. If several patterns start at the same position, the longest
pattern wins.
ac <- ac_build(
c("disco", "disc", "discontent"),
match_kind = "leftmost_longest"
)
ac_extract_df(ac, "discontent")
#> doc_id matches patterns
#> 1 1 discontent discontentimplementation controls the underlying automaton implementation:
"auto"lets the Rust crate choose a heuristic default."noncontiguous_nfa"is fast to build and uses moderate memory."contiguous_nfa"uses memory efficiently and is usually faster to search than a noncontiguous NFA."dfa"can be fastest to search, but may be slow to build and can use much more memory.
ac <- ac_build(
c("disco", "disc"),
implementation = "dfa"
)
ac_info(ac)$implementation
#> [1] "dfa"See the benchmark
article
for results on English, Chinese, and mixed-text workloads. In the
current benchmark, ahocorasick is fastest for detect and count
across the tested cases. The ac_extract_df() is also the fastest
for extract while preserving a tidy long data-frame result.
As with any benchmark, real-world results will differ based on your particular situation. If performance is important to your application, measure the alternatives yourself!
Packages that developed to handle multiple string matching in R are less than in Python. These are what I found:
AhoCorasickTrie: Implemented in C++. It only supports 128 ASCII characters and currently does not support UTF-8/Unicode (such as CJK and emojis).Biostrings: It is a special tool for bioinformatics, and its object system isXString/DNAString/XStringSet. It is not suitable as a drop-in multi-mode retrieval tool for general R string vectors.r-polars: Polars’ string expressions also use the Aho-Corasick algorithm to match (based on the same crate as this package). If your data is already in a DataFrame, I recommend you use Polars for matching strings directly.
It’s great to see that Rust is providing more and more modern, safe, high-performance open source tools for R (and also for Python) :)
Inspired by the Python package
ahocorasick_rs. Thanks
for the excellent Rust aho-corasick
crate.
MIT © 2026 Yousa Mirage