An index data structure for approximate string search.
Python Makefile
Switch branches/tags
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
test
LICENSE
Makefile
README.md
fastss.py
setup.py

README.md

TinyFastSS

FastSS is an efficient indexing data structure for string similarity search, invented by researchers at Zurich University in 2007.

TinyFastSS is a simple Python implementation of FastSS, written in less than 300 LoC.

Features

  • Create a FastSS index on disk.
  • Perform very fast fuzzy searches using the index file.
  • Python 2/3 compatible (tested with Python 2.7 / 3.4)
  • No external modules required (only dependent on built-in modules)

How to install

You can install via pip:

$ pip install TinyFastSS

Basic usage

(1) Create an index file.

import fastss

with fastss.open('fastss.dat') as index:
    for word in open('dictonary.txt'):
        index.add(word.strip())

(2) Perform a fuzzy string search.

import fastss

with fastss.open('fastss.dat') as index:
    # return a dict like: {0: ['test'], 1: ['text', 'west'], 2: ['taft']}
    print(index.query('test'))

Command-line usage

You can also use fastss.py from the command line. Here is a bare minimum example:

$ cat dictionary.txt | head -n 3
aardvark
abacus
aerial
$ python -m fastss -c index.dat dictionary.txt
$ python -m fastss -q index.dat adaptive
{"0": ["adaptive"], "1": ["adoptive"], "2": ["additive"]}

Invoke "python -m fastss -h" for more details.

Performance

A simple speed testing was done to grasp the overall performance of TinyFastSS.

The machine used in this test had Intel Core i3-4010U (1.70GHz) processor with 4GB memory.

The dictionary used in this test was one derived from SCOWL v2015-08-24 (english-50), which contained 98,986 English words (909 KB in disk size).

1. Index creation performance

  • Roughly it took 3 minutes to complete the index creation.
  • The size of the resulting index file was 161 MB.
$ time python -m fastss -c fastss.dat dictonary.txt
3m0.71s real     2m44.35s user     0m16.43s system
$ stat --format=%s fastss.dat
168214528

2. Query performance

  • With randomly chosen words, it took around 5ms to perform a single search on average.
  • The actual time varied between 1.16ms (with "nirvana") and 11.7ms (with "burn").
$ python -m timeit -s 'import fastss; index=fastss.open("fastss.dat")' 'index.query("sterner")'
100 loops, best of 3: 7.67 msec per loop
$ python -m timeit -s 'import fastss; index=fastss.open("fastss.dat")' 'index.query("spotlighted")'
100 loops, best of 3: 2.43 msec per loop
$ python -m timeit -s 'import fastss; index=fastss.open("fastss.dat")' 'index.query("burn")'
100 loops, best of 3: 11.7 msec per loop
$ python -m timeit -s 'import fastss; index=fastss.open("fastss.dat")' 'index.query("nirvana")'
1000 loops, best of 3: 1.16 msec per loop
$ python -m timeit -s 'import fastss; index=fastss.open("fastss.dat")' 'index.query("conveyor")'
100 loops, best of 3: 1.99 msec per loop

Implementation notes

TinyFastSS uses built-in module dbm (anydbm) to store the index data.

A "index file" is basically a (disk-based) hash table which maps a key (read the original paper to know what a 'key' means) to a list of words.

As dbm can only store bytes, both keys and words are encoded in UTF-8, and each word-list is serialized into a byte string delimited by null bytes (b'\x00'). Here is an example of a (key, value) pair:

(b'almond', b'almond\x00almonds\x00almondy')

License

MIT License (see LICENSE for details)