trie-match is a command-line tool that finds intersections of string tries,
supporting a simple pattern syntax.
Think of it as a specialized grep, with multiple patterns and optimized for
shared prefixes.
For these two input files
$ cat patterns.txt
foo.**.bar pattern 1
foo.*.baz.* pattern 2
$ cat queries.txt
foo.bar query 1
foo.baz query 2
foo.hello.baz.world query 3we find two matches:
$ trie-match -p patterns.txt -q queries.txt
foo.hello.baz.world foo.*.baz.* query 3 pattern 2
foo.bar foo.**.bar query 1 pattern 1(Note that the input and output use tabs as column separators.)
The search avoids repeating prefix matches, so foo is only matched once. This
makes the search more efficient than pair-wise matching for large inputs.
Precompiled binaries are provided for x86_64-linux and aarch64-darwin in the GitHub releases.
Preferred method: if you have mise set up, the ubi backend can discover and install these binaries:
mise use --global ubi:cebamps/trie-matchIt is of course also possible to download and run these binaries by hand. Note however that they are not signed, which can be annoying with macOS when downloaded from a web browser.
flake.nix provides a package and devshell, so nix run, nix shell and
nix develop are all supported.
nix shell github:cebamps/trie-matchA Cabal file is provided.
cabal installEnsure that GHC and Cabal are properly set up before using this method. More details can be found at Haskell.org.
This should work under GHC 9.6, 9.8, and 9.10 at least.
A Docker file is provided. It supports an optional GHC_VERSION build arg.
docker build -t trie-match .$ trie-match --help
Usage: trie-match (-p|--pattern FILENAME) (-q|--query FILENAME) [-x|--prefix]
[-w|--wide-globs]
Find intersections of string tries efficiently, based on shared prefixes.
Available options:
-p,--pattern FILENAME pattern file
-q,--query FILENAME query file
-x,--prefix interpret patterns as prefixes
-w,--wide-globs globs at the boundaries of segment patterns may
consume neighbouring segments
-h,--help Show this help textCompletion scripts are available for Bash, Zsh and fish. See
optparse-applicative documentation and wiki for
details and inspiration. In short, for Zsh, place the output of
trie-match --zsh-completion-script $(which trie-match) in a file named
_trie-match found in your $FPATH.
The input files list their patterns and queries list in the form of input lines.
An input line consists of a pattern (or query) expression, followed by an optional tab character and arbitrary string annotation.
A pattern expression is a string representation of a pattern, with segments separated by dots. Segments may not contain dots, tabs, or line breaks. The segment language consists of:
- Star (
**) - Matches zero or more query segments. - Plus (
*) - Matches one or more query segments. - Glob (any other valid segment) - Matches one query segment, where
*acts as a wildcard for zero or more characters.
A query expression has the same syntax and semantics as a pattern expression.
A pattern and query match (or intersect) when there exists at least one input that both of them accept. In the simpler case where the query only consists of literal segments (i.e., of glob type and containing no wildcard), the pattern and query match when the pattern accepts the query as an input.
trie-match outputs one line per query-pattern intersection, optionally
followed by the annotation for the query and pattern annotations, all separated
by tabs. That is, the format of each output line is:
OutputLine ::= Query "\t" Pattern ("\t" QueryAnn ("\t" PatternAnn)?)?That format is suitable to be processed by cut from coreutils or by
Miller invoked as mlr -T.
Extra options are supported (see Usage) to alter the meaning of patterns.
-
--prefixeffectively makes each pattern match queries by prefix, by ensuring it ends in.**or.*, implicitly appending.**if needed. For example, the patternfoowill match the queryfoo.baronly with--prefix. -
--wide-globsmakes asterisks (*) in glob segments capture zero or more neighbouring segments. For example,foo*.quxwill matchfoobar.baz.qux, butb*zwill not matchbar.baz.
While the patterns are implicitly modified in the process, the output still shows the original patterns from the input file.
Tests are executed via Cabal:
cabal testThe code remains scrappy. Small units are tested but the program as a whole is not.
While annotations in the input are allowed to contain tabs, this can make their delimitation ambiguous in the output.
The pattern syntax does not support escaping or custom delimiters. This would be useful to implement.
Performance may become an issue in the presence of chains of segments like
*.*.* due to combinatorial explosion, similar to regex. Note however that
simplifications are applied where patterns are strictly equivalent, for example
**.* to *.
When a file contain duplicate entries, only the last annotation will be kept.
Pattern modifiers --prefix and --wide-globs may cause further duplication of
patterns, which will not only clobber the first instances' annotations but also
the patterns themselves. For example, if patterns foo and foo.** are present
in that order and --prefix is used, foo will never be reported as a match
because it internally becomes foo.** and is clobbered by the latter.
Nevertheless, no query match is lost.
