Skip to content
/ regrams Public

Convert regular expressions to trigram queries in the spirit of Google's codesearch.

License

Notifications You must be signed in to change notification settings

aaw/regrams

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

regrams

regrams is a go package that implements an engine for text search by regular expression queries. Given a collection of documents that are indexed by trigrams and a regular expression, regrams can generate a boolean query for the trigram index that retrieves good candidates that might match the regular expression.

For example, regrams.MakeQuery("a(bc)+d") returns (abc) (bcb|bcd), which means that any document that matches the regular expression a(bc)+d must contain the trigram abc as well as one of bcb or bcd.

regrams is inspired by Google's codesearch repo and Russ Cox's corresponding blog post explaining the implementation. Instead of using the structure of the parsed regular expression to generate queries, however, regrams creates an NFA from the query with nodes weighted by the size of the reachable trigram sets and repeatedly extracts disjunctions from the NFA by identifying and removing minimum-weight graph cuts. I wrote a post on regular expression search via graph cuts that gives a lot more detail about the technique.

The easiest way to try out this package is to run the command-line wrapper included in this repo:

$ go run cmd/regrams/main.go abcd
(abc) (bcd)
$ go run cmd/regrams/main.go 'Time: [0-9]+ ms'
( ms) (: 0|: 1|: 2|: 3|: 4|: 5|: 6|: 7|: 8|: 9) (Tim) (e: ) (ime) (me:)
$ go run cmd/regrams/main.go '(abc*)+de'
(aba|abc|abd) (bab|bca|bcc|bcd|bde)
$ go run cmd/regrams/main.go '(?i)abc'
(ABC|ABc|AbC|Abc|aBC|aBc|abC|abc)

Running go run cmd/regrams/main.go --help will give you more options.

Alternatively, you can try out examples in a browser at the Go Playground.

To use regrams as a package, import github.com/aaw/regrams, then call regrams.MakeQuery in your code to generate a trigram query from a string representing a regular expression. MakeQuery returns a slice-of-slices of strings, which should be interpreted as a boolean query in conjunctive normal form: the slice [["abc"], ["xyz","xyw"], ["xxx", "yyy", "zzz"]] represents the query abc AND (xyz OR xyw) AND (xxx OR yyy OR zzz).

About

Convert regular expressions to trigram queries in the spirit of Google's codesearch.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages