An GADDAG implementation in Rust built with the fst crate.
A GADDAG is a data structure invented for scrabble AIs to generate all possible words that contain a subsequence.
You could imagine using a regular prefix trie to quickly answer:
- "what can I build by appending letters to PLAIN" (PLAINS, PLAINLY, PLAINTIFF, ...)
or the same data in a reverse order to answer
- "what can I build by pre-pending to PLAIN" (EXPLAIN)
But what if you wanted to build EXPLAINING from PLAIN?
The GADDAG is a regular DAG that stores several variations of each word in a clever way to be able to answer that question.
This one uses the fantastic fst
crate as its underlying engine. I recommend reading the blog post about it, but essentially it
optimises storing Sets or Maps of strings where the strings have a lot of overlap.
It has the ability to optimize shared postfixes as well as prefixes, so unlike a prefix trie, it can also share the last 3 nodes of eg EXPLAINING and BOATING. It's also fast, and has no dependencies for us.
You build one by calling from_words
with anything that implements IntoIterator<string>
. For example, if you had a dictionary as a text file, you could
build it this way:
let file = File::open("dictionary.txt")?;
let reader = BufReader::new(file);
let gaddag = Gaddag::from_words(reader.lines().map(|l| l.unwrap()));
You can also load it from bytes with:
static INPUT: &[u8] = include_bytes!(concat!(env!("OUT_DIR"), "/set.fst"));
let gaddag = Gaddag::from_bytes(INPUT).unwrap();
Saving it to bytes is similar:
let mut file = File::create("gaddag.fst")?;
file.write_all(gaddag.as_bytes());
Then you can query it with:
println!("dict contains AA : {} ", gaddag.contains("AA"));
println!("dict words with .*TRING: {:#?} ", gaddag.ends_with("TRING"));
println!("dict words with BA.*: {:#?} ", gaddag.starts_with("BA"));
println!("dict words with .*PLING.*: {:#?} ", gaddag.substring("PLING"));
You may want to navigate it node by node when building your possible moves in a Scrabble game.
In that case you'll use the lower level CompiledAddr
type from fst
to refer to the indexes of nodes in the
gaddag.
You can get the node address for a prefix with:
let node_addr = gaddag.node_for_prefix("GIN")
And then you can query for if you can continue with the tiles in your bag one by one with
if let Some(next_node_addr) = gaddag.can_next(node_addr, "GINT"); //checks if there is a word in the dictionary with TING in it
fst
is optimized for storing bytes, which means that it's fine for utf8 strings as long as you remember to encode and decode your inputs and outputs.
This code has so far been only tested in English. While it should theoretically support multi-byte characters there may be uncovered bugs in there.