Skip to content

cebamps/trie-match

Repository files navigation

trie-match

trie-match is a command-line tool that finds intersections of string tries, supporting a simple pattern syntax.

Think of it as a specialized grep, with multiple patterns and optimized for shared prefixes.

Example

For these two input files

$ cat patterns.txt
foo.**.bar	pattern 1
foo.*.baz.*	pattern 2

$ cat queries.txt
foo.bar	query 1
foo.baz	query 2
foo.hello.baz.world	query 3

we find two matches:

$ trie-match -p patterns.txt -q queries.txt
foo.hello.baz.world	foo.*.baz.*	query 3	pattern 2
foo.bar	foo.**.bar	query 1	pattern 1

(Note that the input and output use tabs as column separators.)

The search avoids repeating prefix matches, so foo is only matched once. This makes the search more efficient than pair-wise matching for large inputs.

Installation

Precompiled binaries are provided for x86_64-linux and aarch64-darwin in the GitHub releases.

Preferred method: if you have mise set up, the ubi backend can discover and install these binaries:

mise use --global ubi:cebamps/trie-match

It is of course also possible to download and run these binaries by hand. Note however that they are not signed, which can be annoying with macOS when downloaded from a web browser.

From source, using Nix

flake.nix provides a package and devshell, so nix run, nix shell and nix develop are all supported.

nix shell github:cebamps/trie-match

From source, using Cabal

A Cabal file is provided.

cabal install

Ensure that GHC and Cabal are properly set up before using this method. More details can be found at Haskell.org.

This should work under GHC 9.6, 9.8, and 9.10 at least.

From source, using Docker

A Docker file is provided. It supports an optional GHC_VERSION build arg.

docker build -t trie-match .

Usage

$ trie-match --help
Usage: trie-match (-p|--pattern FILENAME) (-q|--query FILENAME) [-x|--prefix]
                  [-w|--wide-globs]

  Find intersections of string tries efficiently, based on shared prefixes.

Available options:
  -p,--pattern FILENAME    pattern file
  -q,--query FILENAME      query file
  -x,--prefix              interpret patterns as prefixes
  -w,--wide-globs          globs at the boundaries of segment patterns may
                           consume neighbouring segments
  -h,--help                Show this help text

Completion scripts are available for Bash, Zsh and fish. See optparse-applicative documentation and wiki for details and inspiration. In short, for Zsh, place the output of trie-match --zsh-completion-script $(which trie-match) in a file named _trie-match found in your $FPATH.

Matching Details

The input files list their patterns and queries list in the form of input lines.

An input line consists of a pattern (or query) expression, followed by an optional tab character and arbitrary string annotation.

A pattern expression is a string representation of a pattern, with segments separated by dots. Segments may not contain dots, tabs, or line breaks. The segment language consists of:

  • Star (**) - Matches zero or more query segments.
  • Plus (*) - Matches one or more query segments.
  • Glob (any other valid segment) - Matches one query segment, where * acts as a wildcard for zero or more characters.

A query expression has the same syntax and semantics as a pattern expression.

A pattern and query match (or intersect) when there exists at least one input that both of them accept. In the simpler case where the query only consists of literal segments (i.e., of glob type and containing no wildcard), the pattern and query match when the pattern accepts the query as an input.

trie-match outputs one line per query-pattern intersection, optionally followed by the annotation for the query and pattern annotations, all separated by tabs. That is, the format of each output line is:

OutputLine ::= Query "\t" Pattern ("\t" QueryAnn ("\t" PatternAnn)?)?

That format is suitable to be processed by cut from coreutils or by Miller invoked as mlr -T.

Pattern modifiers

Extra options are supported (see Usage) to alter the meaning of patterns.

  • --prefix effectively makes each pattern match queries by prefix, by ensuring it ends in .** or .*, implicitly appending .** if needed. For example, the pattern foo will match the query foo.bar only with --prefix.

  • --wide-globs makes asterisks (*) in glob segments capture zero or more neighbouring segments. For example, foo*.qux will match foobar.baz.qux, but b*z will not match bar.baz.

While the patterns are implicitly modified in the process, the output still shows the original patterns from the input file.

Running Tests

Tests are executed via Cabal:

cabal test

Caveats

The code remains scrappy. Small units are tested but the program as a whole is not.

While annotations in the input are allowed to contain tabs, this can make their delimitation ambiguous in the output.

The pattern syntax does not support escaping or custom delimiters. This would be useful to implement.

Performance may become an issue in the presence of chains of segments like *.*.* due to combinatorial explosion, similar to regex. Note however that simplifications are applied where patterns are strictly equivalent, for example **.* to *.

When a file contain duplicate entries, only the last annotation will be kept.

Pattern modifiers --prefix and --wide-globs may cause further duplication of patterns, which will not only clobber the first instances' annotations but also the patterns themselves. For example, if patterns foo and foo.** are present in that order and --prefix is used, foo will never be reported as a match because it internally becomes foo.** and is clobbered by the latter. Nevertheless, no query match is lost.

About

trie-match: find intersection of string tries

Resources

License

Stars

Watchers

Forks

Languages