Skip to content

Indices

dan-eyles edited this page Feb 3, 2013 · 17 revisions

If you've worked with MongoDB or any other database management system before you're familiar with the concept of indexes. Under the hood indices are used to optimize the search time for various queries, you can think of them as look ups that allow the database to resolve a set of results without having to scan every single record.

SculeJS includes two types of indices: hash indices and b+ tree indices. Each of these index types has unique properties that make them good choices for different situations, but we'll get into that a bit later.

Ensuring you create well designed and optimised indices on your SculeJS collections will mean that you get great performance when it comes to executing queries, in many cases the performance boost can be in orders of magnitude.

Table of contents:

Hash indices

Hash indices are based around a specialised hash table data structure, as such they should generally provide constant time look ups on one or several keys. This type of index is great if you want to perform general equality queries on one or more keys - for example:

var collection = scule.factoryCollection('scule+dummy://test');
collection.ensureHashIndex('a,b');
collection.find({a:3, b:10}, {}, function(results) {
    results.forEach(function(o) {
       // do something here
    });
});

In the case above the ensureHashIndex function call creates a hash index using the a and b fields of the objects in the collection to generate compound keys. For example, say you added an object to the collection with the following fields:

{
   a:3,
   b:10,
   foo:'bar',
   nested: {
      bar:'foo'
   }
}

The hash index on the collection will create an entry for the object above on the key '3,10' - in fact, every object sharing the same values for the a and b fields will be added to a bucket in the hash table corresponding to the generated key.

Through this example you can begin to see how this might optimize search times on collections; rather than scanning every object in the collection and performing an equality check on a and b the system simply calculates a hash key and does a constant time look up on the table using it. The contents of the bucket for the hash table are collated and returned directly as the result set for the query.

B+ Tree indices

B+ tree indices are different to hash indices - they use an underlying b+ plus tree data structure to store and index objects inside collections. This particular data structure provides logarithmic time look ups, but also allows nifty things like support for range queries on arbitrary keys. B+ tree indices also allow direct look ups in the same way that hash indices do, usually with similar performance characteristics... I'm pretty sure this is because most JavaScript engines use b-trees to support associative arrays.

For most indexing strategies b+ trees provide great performance with the added benefit of flexibility, however they do create more over head in terms of updates and deletes. Each time you add an object to a collection, delete an object from a collection or perform an update against the collection, indices also need to be modified.

For hash table indices this is a fairly straight forward operation, however b+ trees require a little more work to ensure correct balancing of nodes within the tree (usually these operations require only a few milliseconds to complete). The b+ tree indices included in SculeJS are self balancing, so you really don't have to worry about the implementation details of maintaining them - overall the data structure will attempt to ensure minimal depth for optimised search time. You can actually tune this behaviour by changing the order and threshold values when creating an index:

var collection = scule.factoryColletion('scule+dummy://test');
collection.ensureBTreeIndex('a,b', {order:1000, threshold:500});
collection.find({a:3, b:10}, {}, function(results) {
    results.forEach(function(o) {
       // do something here
    });
});

You can also perform range queries on single fields using expressions like the following:

var collection = scule.factoryColletion('scule+dummy://test');
collection.ensureBTreeIndex('a', {order:1000, threshold:500});
collection.find({a:{$gt:3, $lte:100}}, {}, function(results) {
    results.forEach(function(o) {
       // do something here
    });
});

B+ tree indices also have the intrinsic property of clustering on their selected keys in ascending order.

Using multiple indices

SculeJS will attempt to optimise query execution plans to use as many indices as possible when performing queries. Let's examine the following collection and it's associated indices:

var collection = scule.factoryColletion('scule+dummy://test');
collection.ensureBTreeIndex('a', {order:1000, threshold:500});
collection.ensureBTreeIndex('c', {order:1000, threshold:500});
collection.ensureHashIndex('a,b');

You can see that there are three indices on the collection:

  • a b+ tree index on the field a
  • a b+ tree index on field c
  • and a compound hash index on the fields a and b.

Now let's consider the following query expression:

collection.find({c:{$gt:3, $lte:100}, a:3, b:10}, {$limit:10}, function(results) {
    results.forEach(function(o) {
       // do something here
    });
});
// SQL equivalent: SELECT * FROM test WHERE (c > 3 AND c <= 100) AND a = 3 AND b = 10 LIMIT 10

When evaluating the query and building the execution plan, SculeJS will examine the available indices and determine which ones to use during execution. In this case SculeJS will select the following indices:

  • the b+ tree index on field c
  • the compound hash index on fields a and b

The above query is entirely covered by multiple indices, so SculeJS will resolve the various query clauses against their associated indices and perform an intersection operation to calculate the final result set. Instead of scanning n objects the system performs two look ups (the b+ tree look up consisting of O(log(n)) comparisons and the hash look up consisting of O(1) comparisons - O(log(n) + 1). Performing the query this way is much faster than performing O(n*e) comparisons where n is the number of objects in the collection and e is the number of logical comparisons required by the query.

Let's examine the theory behind this - if you have 100,000 objects in your collection and perform this query without any indices the SculeJS virtual machine will have to perform 400,000 (100,000 * 4) comparisons.

Using the intersection of the two indices will yield LogF(n) (should be near 2) comparisons where F is the branching factor of the tree plus the constant time of the hash index lookup - roughly 3 comparisons. This will also include the time taken to intersect the two lists, which should be O(n) where n represents the total number of objects in all segments of the result set. If the total number of objects returned through the two index lookups is 20,000 then the total number of comparisons executed is 20,003.

Obviously performing 20,003 comparisons is much faster than performing 400,000...

In cases where queries are only partially covered by indices SculeJS will still use any index or combination of indices to reduce the number of comparisons required to perform queries. Generally this will mean much better performance.

Caveats

Indices are great, but it's not all roses. When you're defining indices on your collections it's wise to be judicious - don't define them on every field or combination of fields. Think about the data you're working with and (most importantly) think about the ways you're planning on using it. Once you've figured out how your queries will be structured it's much easier to get a handle on how you can index collections in order to speed things up.

SculeJS indices are sparse (using the MongoDB definition of the term) - that is, they only contain entries for objects that have the indexed field(s). Any object that is missing the field(s) specified in the index definition is omitted from the index.

At the time of writing SculeJS does not support indices on OR-ed queries. For example, consider the following collection:

var collection = scule.factoryColletion('scule+dummy://test');
collection.ensureBTreeIndex('a', {order:1000, threshold:500});
collection.ensureBTreeIndex('c', {order:1000, threshold:500});
collection.ensureHashIndex('a,b');

Now let's considering the following query expression:

collection.find({c:{$gt:3, $lte:100}, a:3, b:10, $or:[{a:4},{b:12}]}, {$limit:10}, function(results) {
    results.forEach(function(o) {
       // do something here
    });
});
// SQL equivalent: SELECT * FROM test WHERE (c > 3 AND c <= 100) AND a = 3 AND b = 10 AND (a = 4 OR b = 12) LIMIT 10

In order to cover the above query with all available indices SculeJS would need to perform a sub-query for the OR-ed portion and then merge the results of the two queries to produce the result set. In a multi-threaded environment it would be possible to perform the two queries in parallel and merge the results, however JavaScript uses a single thread with a single event loop so the two queries would be performed sequentially.

This should still yield superior performance to scanning the entire collection, however at the cost of increased complexity in terms of the query execution plan. I've looked at a couple of different ways of doing this, but right now I'd rather focus on adding other features to SculeJS. If the use of indices for queries with OR-ed expressions ends up being something users need then I'll put some time and energy into building it.