lexpy logo


Supported Python versions: 3.6, 3.7, 3.8

  • A lexicon is a data structure that stores a set of words. The difference between a dictionary and a lexicon is that a lexicon has no values associated with the words.

  • A lexicon is similar to a list or set of words, but its internal representation is different and optimized for faster searches of words, prefixes, and wildcard patterns.

  • More precisely, the search time is O(W), where W is the length of the word.

  • Two important lexicon data structures are:

    • Trie.
    • Directed Acyclic Word Graph (DAWG).

Both the Trie and the DAWG are Finite State Automata (FSA).
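
To see why lookup is O(W), consider a minimal, simplified trie sketch (illustrative only, not lexpy's internal implementation): each character of the query word follows at most one edge, so the number of steps is bounded by the word's length.

# Illustrative sketch of a trie; not the lexpy implementation.
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the next node
        self.is_word = False  # marks the end of a complete word

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def contains(root, word):
    node = root
    for ch in word:           # one edge per character => O(W) steps
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word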

Summary of changes in the current version

  • Removed support for Python 2. Going forward only Python 3 will be supported.

  • Support for word counts (see the quick example after this list).

    • add(word, count=4000)
    • add_all([word]*1000)
    • search(pattern, with_count=True)
  • Changes in DAWG minimization to support the word count feature.

    • Only nodes with edges are considered for minimization
    • Each node's hash also takes into account the count flag of the node to determine whether two nodes are logically equivalent.
  • Minor performance tweaks

  • Improved test coverage
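
A quick sketch of the word-count feature using the methods listed above (the printed counts are illustrative):

from lexpy.trie import Trie

trie = Trie()
trie.add('apple', count=3)           # add a word with an explicit count
trie.add_all(['apple', 'apply'])     # adding an existing word increments its count
print(trie.search('app*', with_count=True))
# Expected output along the lines of: [('apple', 4), ('apply', 1)]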

Install

pip install lexpy

Interface

| Interface description | Trie method | DAWG method |
|---|---|---|
| Add a single word | add('apple', count=2) | add('apple', count=2) |
| Add multiple words | add_all(['advantage', 'courage']) | add_all(['advantage', 'courage']) |
| Check if a word exists | in operator | in operator |
| Search using a wildcard expression | search('a?b*', with_count=True) | search('a?b*', with_count=True) |
| Search for prefix matches | search_with_prefix('bar', with_count=True) | search_with_prefix('bar') |
| Search for similar words within a given edit distance (edit distance here means Levenshtein distance) | search_within_distance('apble', dist=1, with_count=True) | search_within_distance('apble', dist=1, with_count=True) |
| Get the number of nodes in the automaton | len(trie) | len(dawg) |

Examples

Trie

Build from an input list, set, or tuple of words.

from lexpy.trie import Trie

trie = Trie()

input_words = ['ampyx', 'abuzz', 'athie', 'athie', 'athie', 'amato', 'amato', 'aneto', 'aneto', 'aruba', 
               'arrow', 'agony', 'altai', 'alisa', 'acorn', 'abhor', 'aurum', 'albay', 'arbil', 'albin', 
               'almug', 'artha', 'algin', 'auric', 'sore', 'quilt', 'psychotic', 'eyes', 'cap', 'suit', 
               'tank', 'common', 'lonely', 'likeable' 'language', 'shock', 'look', 'pet', 'dime', 'small' 
               'dusty', 'accept', 'nasty', 'thrill', 'foot', 'steel', 'steel', 'steel', 'steel', 'abuzz']

trie.add_all(input_words) # You can pass any sequence type or a file-like object here

print(trie.get_word_count())

>>> 48

Build from a file or file path.

from lexpy.trie import Trie

# Either
trie = Trie()
trie.add_all('/path/to/file.txt')

# Or
with open('/path/to/file.txt', 'r') as infile:
     trie.add_all(infile)

Check if a word exists using the in operator

print('ampyx' in trie)

>>> True

Prefix search

print(trie.search_with_prefix('ab'))

>>> ['abhor', 'abuzz']
print(trie.search_with_prefix('ab', with_count=True))

>>> [('abuzz', 2), ('abhor', 1)]

Wildcard search using ? and *

  • ? = 0 or 1 occurrence of any character

  • * = 0 or more occurrences of any character

print(trie.search('a*o*'))

>>> ['amato', 'abhor', 'aneto', 'arrow', 'agony', 'acorn']

print(trie.search('a*o*', with_count=True))

>>> [('amato', 2), ('abhor', 1), ('aneto', 2), ('arrow', 1), ('agony', 1), ('acorn', 1)]

print(trie.search('su?t'))

>>> ['suit']

print(trie.search('su?t', with_count=True))

>>> [('suit', 1)]

Search for similar words using the notion of Levenshtein distance

print(trie.search_within_distance('arie', dist=2))

>>> ['athie', 'arbil', 'auric']

print(trie.search_within_distance('arie', dist=2, with_count=True))

>>> [('athie', 3), ('arbil', 1), ('auric', 1)]

Increment word count

  • You can either add a new word or increment the counter for an existing word.
trie.add('athie', count=1000)

print(trie.search_within_distance('arie', dist=2, with_count=True))

>>> [('athie', 1003), ('arbil', 1), ('auric', 1)]

Directed Acyclic Word Graph (DAWG)

  • DAWG supports the same set of operations as a Trie. The difference is that the number of nodes in a DAWG is always less than or equal to the number of nodes in a Trie.

  • They are both Deterministic Finite State Automata; however, the DAWG is a minimized version of the Trie DFA.

  • In a Trie, prefix redundancy is removed. In a DAWG, both prefix and suffix redundancies are removed.

  • In the current implementation of DAWG, the insertion order of the words should be alphabetical.

  • The implementation idea of DAWG is borrowed from http://stevehanov.ca/blog/?id=115

from lexpy.trie import Trie
from lexpy.dawg import DAWG

trie = Trie()
trie.add_all(['advantageous', 'courageous'])

dawg = DAWG()
dawg.add_all(['advantageous', 'courageous'])

len(trie) # Number of Nodes in Trie
23

dawg.reduce() # Perform DFA minimization. Call this after each chunk of words has been added to the DAWG.

len(dawg) # Number of nodes in DAWG
21

DAWG

The APIs are exactly the same as the Trie APIs.

Build a DAWG

from lexpy.dawg import DAWG
dawg = DAWG()

input_words = ['ampyx', 'abuzz', 'athie', 'athie', 'athie', 'amato', 'amato', 'aneto', 'aneto', 'aruba', 
               'arrow', 'agony', 'altai', 'alisa', 'acorn', 'abhor', 'aurum', 'albay', 'arbil', 'albin', 
               'almug', 'artha', 'algin', 'auric', 'sore', 'quilt', 'psychotic', 'eyes', 'cap', 'suit', 
               'tank', 'common', 'lonely', 'likeable' 'language', 'shock', 'look', 'pet', 'dime', 'small' 
               'dusty', 'accept', 'nasty', 'thrill', 'foot', 'steel', 'steel', 'steel', 'steel', 'abuzz']


dawg.add_all(input_words)
dawg.reduce()

dawg.get_word_count()

>>> 48

Check if a word exists using the in operator

print('ampyx' in dawg)

>>> True

Prefix search

print(dawg.search_with_prefix('ab'))

>>> ['abhor', 'abuzz']
print(dawg.search_with_prefix('ab', with_count=True))

>>> [('abuzz', 2), ('abhor', 1)]

Wildcard search using ? and *

  • ? = 0 or 1 occurrence of any character

  • * = 0 or more occurrences of any character

print(dawg.search('a*o*'))

>>> ['amato', 'abhor', 'aneto', 'arrow', 'agony', 'acorn']

print(dawg.search('a*o*', with_count=True))

>>> [('amato', 2), ('abhor', 1), ('aneto', 2), ('arrow', 1), ('agony', 1), ('acorn', 1)]

print(dawg.search('su?t'))

>>> ['suit']

print(dawg.search('su?t', with_count=True))

>>> [('suit', 1)]

Search for similar words using the notion of Levenshtein distance

print(dawg.search_within_distance('arie', dist=2))

>>> ['athie', 'arbil', 'auric']

print(dawg.search_within_distance('arie', dist=2, with_count=True))

>>> [('athie', 3), ('arbil', 1), ('auric', 1)]

Alphabetical order insertion

If you insert a word out of alphabetical order, a ValueError will be raised.

dawg.add('athie', count=1000)

ValueError: Words should be inserted in Alphabetical order. <Previous word - thrill>, <Current word - athie>

Increment the word count

  • You can either add an alphabetically greater word with a specific count or increment the count of the previously added word.
dawg.add_all(['thrill']*20000) # or dawg.add('thrill', count=20000)

print(dawg.search('thrill', with_count=True))

>>> [('thrill', 20001)]

Trie vs DAWG

Number of nodes comparison

Build time comparison
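
The two comparison charts above are images in the original README. Below is a rough sketch of how a similar comparison could be reproduced locally; the small word list is a stand-in for a larger corpus, and the words are sorted because the DAWG requires alphabetical insertion order.

import time

from lexpy.trie import Trie
from lexpy.dawg import DAWG

words = sorted(['ampyx', 'abuzz', 'athie', 'amato', 'aneto', 'arrow'])  # stand-in word list

start = time.perf_counter()
trie = Trie()
trie.add_all(words)
trie_build_time = time.perf_counter() - start

start = time.perf_counter()
dawg = DAWG()
dawg.add_all(words)
dawg.reduce()                        # minimize after the chunk of words is added
dawg_build_time = time.perf_counter() - start

print(len(trie), len(dawg))          # the DAWG never has more nodes than the Trie
print(trie_build_time, dawg_build_time)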

Fun Facts:

  1. The 45-letter word pneumonoultramicroscopicsilicovolcanoconiosis is the longest English word that appears in a major dictionary. So for all English words, the search time is bounded by O(45).
  2. The longest technical word (not in a dictionary) is the chemical name of the protein titin. It has 189,819 letters, and it is disputed whether it is a word.