B Tree index

Stephen Oliver edited this page Mar 30, 2017 · 2 revisions

Forkable scalable B-Tree indexes

Since infinity0's Summer of Code 2009 work, Library includes code to read and write a new type of index based on b-trees. This can be on-disk or on-Freenet.

It is a scalable structure: It does not rely on gigantic manifests, everything refers to everything else via CHKs, although in future we might include the top-level splitfile metadata instead (this would cost us some fan-out but improve reliability).

The on-Freenet form is forkable, i.e. it is possible to make a few changes and then only reinsert the leaf blocks that have been changed, and the middle and top blocks above them, without reinserting the entire index. This is an important property for fork-and-merge DSCMs: You can fork somebody else's tree (whatever that tree happens to be) without having to reinsert the whole thing.

Another important function is that Library has the ability to merge a whole pile of stuff, sorted, at once, in an efficient manner (rewriting only those blocks that have been changed, visiting each one only once, and thus minimizing seeks).

Important applications

Search indexes

Freesite indexes are already rather large. The main scaling problem at the moment is that XMLSpider takes around a week to write the index from its database. This is because of all the seeking involved.

We are in the process of fixing this with the new index format and the new Spider plugin, which buffers a bunch of pages and then selectively updates what needs to be updated on Freenet. The new indexes should also be much faster to search from the client side.

In the future, we will also have distributed search based on the Web of Trust, which hopefully will include human-maintained indexes of files as well as freesites.

Databases

Anywhere you need a scalable on-Freenet database (in the sense of a keyword lookup i.e. a flat file scalable map), you can use one of these.

In future people might write full blown relational databases on Freenet, which would also be forkable i.e. anyone can clone your database and add their own stuff.

Basic transaction support could be implemented in much the same way BTRFS does things. BTRFS uses COW trees a lot, and essentially it forks and updates several related trees and then updates the superblock to point to the new trees all at once, thus allowing it to e.g. journal data with very little performance cost, similar to "wandering logs".

We could do something similar for databases. This would give us consistency and atomicity, and some measure of durability (any reliable system would likely rely on the participants being able to reinsert, but also the planned metadata-in-parent-node changes would help).

Isolation is a separate question but is closely related to the general problem of merging changes from other forks while maintaining consistency constraints; in other words, multi-user support. Which, in fact, is closely related to what a lot of modern huge database systems do: They merge data from numerous sources, without maintaining transactional consistency.

A query does not necessarily return a consistent result across the whole dataset from an instant in time; its inputs each have their own version (see this ACM article).

This is remarkably close to what we'd have on Freenet: We can't guarantee that a database we are querying hasn't been updated since we last saw it, but we can return a consistent version, and then we can mate it with other databases with their own versions. Old versions persist, at least for a while, and data may even be immutable (append only).

One unique difficulty with Freenet is there are no multi-user databases, or at least it's a very bad idea - fork-and-merge is the order of the day. But this is only an extreme case of merging data from multiple databases that may or may not be up to date.

Fork-and-merge DSCMs

Freenet already supports Git and Mercurial reasonably well, since they can both publish to a standard web server.

However for huge repositories, the ability to fork a repository without reinserting the whole thing could become very important (this can be solved for Mercurial using the infocalypse extension).

Wiki over Freenet

Wikipedia-scale wikis over Freenet would be a very interesting application.

There are various ways to implement this, but most of them require efficient keyword lookups and the ability to make changes and republish without reinserting the whole thing and without just pushing a diff either (since a diff would have to be relative to somebody else's tree).

Of course, in the meantime, smaller wiki's using similar principles are enormously helpful, see Jfniki.

Planned optimizations

Some major optimizations have been proposed. These are on the bug tracker:

  • Publishing a bloom filter of the terms indexed by each index, to avoid unnecessary lookups (especially in distributed search).

  • Making it a b+tree rather than a b-tree so that all the data is in the leaves. This is probably a good idea for various reasons. It would also be useful to look at the paper on COW B-Trees cited in the wikipedia article on BTRFS.

  • Preloading the higher level nodes in the various applications (e.g. search).

  • Storing the splitfile metadata for the child nodes in the parent rather than redirecting via a CHK. This would improve robustness against the loss of a single key significantly, although it would reduce fan-out (it should still be hundreds though).

  • Radically changing the data format for non-leaf nodes so that we can fetch only the block that we need (and do a few other fetches for persistence's sake), and if it fails, fetch the whole redundant splitfile.

The combination of these would make the common case for a simple lookup extraordinarily fast (by Freenet standards!) - although for searching, the bottom layer (the actual list of hits for a term) may still be very large, and there are other tactics to optimize this (e.g. by not fetching the whole list when we don't need to).

Distributed searching would have to do many such lookups so would still be somewhat slow - on the other hand, with good merging tools, it's likely that even for files most of the content will be in a relatively small number of big indexes, and we can optimize out many of the fetches for smaller indexes.

Clone this wiki locally
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.