ahocorasick

Fast multi-pattern string matching in R with the Aho-Corasick algorithm.

ahocorasick, powered by the Rust aho-corasick crate, allows you to search for many substrings in one or more strings with linear complexity in R. It builds a reusable automaton once and then uses it for detection, counting, locating, and extraction and replacement.

Aho-Corasick is a multiple-pattern string matching algorithm designed to search for many keywords in a text in a single pass. After preprocessing the pattern set into an automaton, it finds all matches in O(text_length + number_of_matches), instead of the naive O(text_length × number_of_patterns) approach. This makes it especially suitable for high-throughput tasks such as dictionary matching, keyword extraction, rule-based text filtering, and large-scale log or document analysis.

For more details on the Aho-Corasick algorithm, please see Aho and Corasick (1975).

Installation

Install the released version from CRAN:

install.packages("ahocorasick")

Install the released version from R-universe/R-multiverse:

install.packages("ahocorasick", repos = "https://yousa-mirage.r-universe.dev")
# or
install.packages("ahocorasick", repos = "https://community.r-multiverse.org")

Install the development version from source. You need Cargo and rustc on PATH.

# install.packages("pak")
pak::pak("Yousa-Mirage/r-ahocorasick")

Quick Start

Start with a character vector of patterns you want to search for, then compile it into an <ac_automaton> object by calling ac_build() that can be reused across many documents.

library(ahocorasick)

patterns <- c("hello", "world", "fish")
doc <- c(
  "this is my first hello world. hello!",
  "nothing to see",
  "fish and chips"
)

ac <- ac_build(patterns)

You can set ascii_case_insensitive = TRUE to make the search case-insensitive for ASCII characters (a-z and A-Z) only.

Extracting Matches

ac_extract() returns one data frame per document, with both the matched text and the pattern value that produced each match.

patterns <- c("hello", "world", "fish")
doc <- c(
  "this is my first hello world. hello!",
  "nothing to see",
  "fish and chips"
)

ac <- ac_build(patterns)

hits <- ac_extract(ac, doc)
hits[[1]]
#>   matches patterns
#> 1   hello    hello
#> 2   world    world
#> 3   hello    hello

Use ac_extract_df() when you want one data frame for all documents.

ac_extract_df(ac, doc)
#>   doc_id matches patterns
#> 1      1   hello    hello
#> 2      1   world    world
#> 3      1   hello    hello
#> 4      3    fish     fish

Set overlapping = TRUE to return overlapping matches. This is only supported when the automaton was built with the default match_kind = "standard".

ac_overlap <- ac_build(c("aba", "bab"))
ac_extract_df(ac_overlap, "ababa", overlapping = TRUE)
#>   doc_id matches patterns
#> 1      1     aba      aba
#> 2      1     bab      bab
#> 3      1     aba      aba

Locating Matches

ac_locate() returns one data frame per document. Offsets are 1-based character positions, inclusive on both sides, so they can be used with substr().

patterns <- c("hello", "world", "fish")
doc <- c(
  "this is my first hello world. hello!",
  "nothing to see",
  "fish and chips"
)

ac <- ac_build(patterns)

hits <- ac_locate(ac, doc)
hits[[1]]
#>   pattern_id start end
#> 1          1    18  22
#> 2          2    24  28
#> 3          1    31  35

Use ac_locate_df() when you want one data frame for all documents.

ac_locate_df(ac, doc)
#>   doc_id pattern_id start end
#> 1      1          1    18  22
#> 2      1          2    24  28
#> 3      1          1    31  35
#> 4      3          3     1   4

Use ac_locate_bytes() when byte offsets are more useful than R character positions. Byte offsets are 0-based and byte_end is end-exclusive.

Detecting And Counting

Use ac_detect() when you only need to know whether a document contains at least one match, and ac_count() when you need the number of matches.

patterns <- c("hello", "world", "fish")
doc <- c(
  "this is my first hello world. hello!",
  "nothing to see",
  "fish and chips"
)

ac <- ac_build(patterns)

ac_detect(ac, doc)
#> [1]  TRUE FALSE  TRUE
ac_count(ac, doc)
#> [1] 3 0 1

These functions are usually the cheapest way to answer yes/no or frequency questions without materializing the matched strings or offsets.

Replacing Matches

Use ac_replace() to replace every non-overlapping match with the replacement for the matched pattern.

patterns <- c("fox", "brown", "quick")
doc <- "The quick brown fox."
ac <- ac_build(patterns)

ac_replace(ac, doc, c("sloth", "grey", "slow"))
#> [1] "The slow grey sloth."

Use one replacement string to replace all patterns with the same value.

ac_replace(ac, doc, "")
#> [1] "The   ."

Replacement uses the automaton’s match semantics. If you want Polars replace_many(..., leftmost = TRUE)-style priority, build the automaton with match_kind = "leftmost_first".

ac_leftmost <- ac_build(
  c("append", "appendage", "app"),
  match_kind = "leftmost_first"
)
ac_replace(ac_leftmost, "append the app to the appendage", c("x", "y", "z"))
#> [1] "x the z to the xage"

Searching Files

The ac_*_file() family applies the same automaton to one or more files. Pass a vector of file paths and each function returns results aligned with those files.

ac_files <- ac_build(c("hello", "world"))
paths <- c(tempfile(fileext = ".txt"), tempfile(fileext = ".txt"))
writeLines("hello world", paths[[1]])
writeLines("fish and chips", paths[[2]])

ac_detect_file(ac_files, paths)
#> [1]  TRUE FALSE
ac_count_file(ac_files, paths)
#> [1] 2 0
ac_extract_file(ac_files, paths)
#> [[1]]
#>   matches patterns
#> 1   hello    hello
#> 2   world    world
#> 
#> [[2]]
#> [1] matches  patterns
#> <0 rows> (or 0-length row.names)

ac_detect_file(), ac_count_file(), ac_extract_file(), and ac_replace_file() support stream = TRUE when the automaton was built with the default match_kind = "standard". This is useful for large files when you want to avoid reading the whole file into memory at once.

ac_locate_file() does not support stream = TRUE. It returns R character offsets instead of raw byte offsets, and converting streamed byte positions back into character positions would require another pass over the file. Keeping file location search non-streaming makes the behavior simpler and easier to reason about.

Using With Tidyverse

These search functions work naturally inside dplyr::mutate(). Scalar outputs like ac_detect(), ac_count(), and ac_replace() become ordinary columns, while ac_extract() and ac_locate() can be stored as list-columns.

patterns <- c("hello", "world", "fish")
ac <- ac_build(patterns)

docs <- tibble::tibble(
  doc = c(
    "this is my first hello world. hello!",
    "nothing to see",
    "fish and chips"
  )
)

docs |> 
  dplyr::mutate(
    detected = ac_detect(ac, doc),
    n_matches = ac_count(ac, doc),
    replaced = ac_replace(ac, doc, "[match]"),
    extracted = ac_extract(ac, doc)
  ) |> 
  tidyr::unnest(extracted)
#> # A tibble: 4 × 6
#>   doc                               detected n_matches replaced matches patterns
#>   <chr>                             <lgl>        <int> <chr>    <chr>   <chr>   
#> 1 this is my first hello world. he… TRUE             3 this is… hello   hello   
#> 2 this is my first hello world. he… TRUE             3 this is… world   world   
#> 3 this is my first hello world. he… TRUE             3 this is… hello   hello   
#> 4 fish and chips                    TRUE             1 [match]… fish    fish

Missing Values

Search functions let you choose how NA documents behave.

doc_na <- c("hello", NA_character_, "world")
ac_na <- ac_build(c("hello", "world"))

ac_detect(ac_na, doc_na, na = "false")
#> [1]  TRUE FALSE  TRUE
ac_count(ac_na, doc_na, na = "zero")
#> [1] 1 0 1
ac_replace(ac_na, doc_na, "[match]", na = "empty")
#> [1] "[match]" ""        "[match]"

For list-column workflows, ac_locate(..., na = "empty") and ac_extract(..., na = "empty") treat missing documents as no matches.

Matching Semantics

ac_build() exposes three match kinds from the Rust library.

`standard` (the default)

The default, match_kind = "standard", returns the match that finishes first. It is also the only mode that supports overlapping search.

patterns <- c("content", "disco", "disc", "discontent", "winter")
haystack <- "This is the winter of my discontent"

ac <- ac_build(patterns)
ac_extract_df(ac, haystack)
#>   doc_id matches patterns
#> 1      1  winter   winter
#> 2      1    disc     disc
ac_extract_df(ac, haystack, overlapping = TRUE)
#>   doc_id    matches   patterns
#> 1      1     winter     winter
#> 2      1       disc       disc
#> 3      1      disco      disco
#> 4      1 discontent discontent
#> 5      1    content    content

`leftmost_first`

match_kind = "leftmost_first" returns the leftmost non-overlapping match. If several patterns start at the same position, the earlier pattern wins.

ac <- ac_build(
  c("disco", "disc"),
  match_kind = "leftmost_first"
)
ac_extract_df(ac, "discontent")
#>   doc_id matches patterns
#> 1      1   disco    disco

`leftmost_longest`

match_kind = "leftmost_longest" returns the leftmost non-overlapping match. If several patterns start at the same position, the longest pattern wins.

ac <- ac_build(
  c("disco", "disc", "discontent"),
  match_kind = "leftmost_longest"
)
ac_extract_df(ac, "discontent")
#>   doc_id    matches   patterns
#> 1      1 discontent discontent

Performance Options

implementation controls the underlying automaton implementation:

"auto" lets the Rust crate choose a heuristic default.
"noncontiguous_nfa" is fast to build and uses moderate memory.
"contiguous_nfa" uses memory efficiently and is usually faster to search than a noncontiguous NFA.
"dfa" can be fastest to search, but may be slow to build and can use much more memory.

ac <- ac_build(
  c("disco", "disc"),
  implementation = "dfa"
)

ac_info(ac)$implementation
#> [1] "dfa"

Benchmark

See the benchmark article for results on English, Chinese, and mixed-text workloads. In the current benchmark, ahocorasick is fastest for detect and count across the tested cases. The ac_extract_df() is also the fastest for extract while preserving a tidy long data-frame result.

As with any benchmark, real-world results will differ based on your particular situation. If performance is important to your application, measure the alternatives yourself!

Related Work

Packages that developed to handle multiple string matching in R are less than in Python. These are what I found:

AhoCorasickTrie: Implemented in C++. It only supports 128 ASCII characters and currently does not support UTF-8/Unicode (such as CJK and emojis).
Biostrings: It is a special tool for bioinformatics, and its object system is XString / DNAString / XStringSet. It is not suitable as a drop-in multi-mode retrieval tool for general R string vectors.
r-polars: Polars’ string expressions also use the Aho-Corasick algorithm to match (based on the same crate as this package). If your data is already in a DataFrame, I recommend you use Polars for matching strings directly.

It’s great to see that Rust is providing more and more modern, safe, high-performance open source tools for R (and also for Python) :)

Acknowledgments

Inspired by the Python package ahocorasick_rs. Thanks for the excellent Rust aho-corasick crate.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
R		R
build		build
inst		inst
man		man
src		src
tests		tests
tools		tools
vignettes		vignettes
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
MD5		MD5
NAMESPACE		NAMESPACE
NEWS.md		NEWS.md
README.md		README.md
cleanup		cleanup
cleanup.win		cleanup.win
configure		configure
configure.win		configure.win

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ahocorasick

Installation

Quick Start

Extracting Matches

Locating Matches

Detecting And Counting

Replacing Matches

Searching Files

Using With Tidyverse

Missing Values

Matching Semantics

`standard` (the default)

`leftmost_first`

`leftmost_longest`

Performance Options

Benchmark

Related Work

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ahocorasick

Installation

Quick Start

Extracting Matches

Locating Matches

Detecting And Counting

Replacing Matches

Searching Files

Using With Tidyverse

Missing Values

Matching Semantics

standard (the default)

leftmost_first

leftmost_longest

Performance Options

Benchmark

Related Work

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`standard` (the default)

`leftmost_first`

`leftmost_longest`

Packages