Benchmark various data structures / algorithms for string searching.
Clone or download
dhobsd remove murmurhash; it's currently unused
It would be interesting to try out hash tables at some point, but it
introduces some complications. Namely, there are huge numbers of approaches
to hash tables (chaining, linear / quadratic probing, subtypes such as
robin hood hashing, perfect hashes when input set is known up front) and
many tunables to consider:

 * How many buckets to allocate initially?
 * Will the table grow?
 * What is the growth factor?
 * What hash algorithm should we use?

These questions indicate a large number of potential configurations. I would
like to investigate this, but getting results that can be plotted is a higher
priority right now. (Not that this repo is a priority.)
Latest commit 69855f5 Jul 1, 2015
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
.gitignore Initial commit Jul 1, 2015
LICENSE Add code! Jul 1, 2015
Makefile Add code! Jul 1, 2015
README.md Add some docs Jul 1, 2015
bench-critbit.c Add code! Jul 1, 2015
bench-radix.c Add code! Jul 1, 2015
bench-rb-clever.c Add code! Jul 1, 2015
bench-rb-naive.c Add code! Jul 1, 2015
bench-rb-simplest.c Add code! Jul 1, 2015
cache.h Add code! Jul 1, 2015
critbit.c Add code! Jul 1, 2015
critbit.h Add code! Jul 1, 2015
mkwords.pl Add code! Jul 1, 2015
radix.c Add code! Jul 1, 2015
radix.h Add code! Jul 1, 2015
tree.h Add code! Jul 1, 2015

README.md

sbst

Benchmarking various datastructures and algorithms for string lookups. For more context, see also:

Running Benchmarks

You'll need a dictionary and possibly to fiddle around with the mkwords.pl script.

  • Run make
  • Run ./mkwords.pl [nwords] [nfiles] [depth] [prefix]. If you intend to run more than 1 thread, nfiles must be more than 1. depth controls how deep to make the string. If you specify a prefix, generated strings will all contain that prefix.
  • Run ./bench-[foo] [nthreads] [nwords] to benchmark a particular algorithm.

Notes

The reported times contain:

  1. The time it took to parse input, allocate data structures, insert, and find the element just inserted.
  2. The time it took to find items in insertion order.

Times are measured in CPU cycles.

I'm sure there are ways to improve these tests. I'd be interested in contributions for other data structures or other implementations of the same data structures used here.

I'd also be interested in ports to other platforms and results.