-
Notifications
You must be signed in to change notification settings - Fork 7
About
From http://ww2.cs.mu.oz.au/~jz/fulltext/acmtois02.pdf:
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. Compared to a splay tree, the fastest variant of a burst trie can accumulate the vocabulary of a 10 Gb collection of web data in less than 40%of the time, while using no more memory. Compared to a ternary search tree, the burst trie is around 25% faster and uses only 50% of the memory. Analysing the superior performance of the burst trie in comparison to other structures, we show that the ability of a burst trie to store frequently-accessed records in locations that are close to the root of containers far outperforms the adaptivity of a splay tree.
This structure is a collection of small data structures, which we call containers, that are accessed via a conventional trie, which in this context we call an access trie. Searching involves using the first few characters of a query string to identify a particular container, then using the remainder of the query string to find a record in the container. A container can be any data structure that is reasonably efficient for small sets of strings, such as a list or a binary search tree. The general principle for maintaining a burst trie is that, initially, it consists of a single container. When a container is deemed to be inefficient, it is burst, that is, replaced by a trie node and a set of child containers which between them partition the original container’s strings. Thus there are two major design decisions to be explored for burst tries: what data structure to use for a container; and how to determine when a container should be burst.
We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.