first public release

commit 73b41ae1e28a8a74cfe4cb1b4dd5e3d671fb3975 0 parents
@antirez authored
4 .gitignore
@@ -0,0 +1,4 @@
+.*.swp
+btree_example
+*.db
+*.dSYM
308 BTREE.txt
@@ -0,0 +1,308 @@
+FORMAT OF INTEGERS
+==================
+
+In the following document all the integers are intended to be in network byte
+order. All the sizes are stored as 32 bit unsigned integers, while all the
+pointers are stored as 64 bit unsigned integers.
+
+So the maximum data size, and in general the maximum number of items, is
+2^32-1 while the maximum size of the database file is 2^64-1 bytes.
+
+LAYOUT
+======
+
++------------------------------------------+
+| HEADER |
++------------------------------------------+
+| 8 bytes freelist |
++------------------------------------------+
+| 16 bytes freelist |
++------------------------------------------+
+|... more power of two freelists ... |
++------------------------------------------+
+| 2GB bytes freelist |
++------------------------------------------+
+| ROOT node offset |
++------------------------------------------+
+/ /
+/ all the other nodes and data /
++------------------------------------------+
+
+Freelists are available for sizes of 8, 16, 32, 64, 128, 256, 512, 1024,
+2048, 4096, 8192, 16384, 32768, ..., 2GB, for a total of 29 free lists.
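The mapping from a power-of-two size to its freelist slot can be sketched as follows. This is a hypothetical helper for illustration only: it indexes from the 8 byte list described above, while the code in this commit starts from the 16 byte minimum allocation, so its offsets differ.

```c
#include <assert.h>

/* 29 freelists cover sizes 2^3 (8 bytes) through 2^31 (2GB). */
#define FREELIST_COUNT 29

/* Map the exponent of a power-of-two size to its freelist slot. */
static int freelist_index(int exponent) {
    assert(exponent >= 3 && exponent <= 31);
    return exponent - 3; /* 8 bytes -> slot 0, 16 bytes -> slot 1, ... */
}
```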
+
+ALIGNMENT REQUIREMENTS
+======================
+
+Every pointer on disk is aligned to an 8 byte boundary, so that if the disk
+block size is a multiple of 8 (512 and 4096 are, for example) there are
+stronger consistency guarantees when writing a single pointer to disk for
+atomic updates.
+
+In order to ensure this property we simply do the following:
+new blocks of data are always allocated in multiples of 8, and
+nodes, freelists, and all the structures we write inside data blocks
+always contain 8 byte aligned pointers.
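The rounding rule implied above can be sketched like this (the helper name is illustrative; the real allocator rounds to powers of two, which are themselves multiples of 8):

```c
#include <assert.h>
#include <stdint.h>

/* Round a size up to the next multiple of 8, so that pointers stored
 * inside the block stay 8 byte aligned. */
static uint32_t round_up_8(uint32_t size) {
    return (size + 7) & ~(uint32_t)7;
}
```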
+
+HEADER
+======
+
++--------+--------+--------+--------+
+| magic |version | free |freeoff |
++--------+--------+--------+--------+
+
+The magic is the 64 bit string "REDBTREE"
+
+The version field is the version of the btree, as an ascii string: "00000000".
+
+The freeoff field is a pointer to the first byte of the file that is not
+used, so it can be used for allocations. When there is the need to allocate
+more space than available, the file size is enlarged using truncate(2).
+
+The free field is the amount of free space starting at freeoff (64 bit).
+
+The file is always enlarged by at least BTREE_PREALLOC_SIZE that is a power
+of two.
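The 32 byte header described above can be summarized as a map of byte offsets. The constant names below are illustrative, chosen to mirror the BTREE_HDR_* names used later in btree.c; the positions are assumed from the layout (8 bytes each for magic, version, free, freeoff):

```c
#include <assert.h>

/* Byte offsets of the header fields, per the layout above. */
enum {
    HDR_MAGIC_POS   = 0,   /* "REDBTREE", 8 bytes */
    HDR_VERSION_POS = 8,   /* "00000000", 8 bytes */
    HDR_FREE_POS    = 16,  /* free bytes available at freeoff, u64 */
    HDR_FREEOFF_POS = 24,  /* first unused byte of the file, u64 */
    HDR_SIZE        = 32
};
```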
+
+FREELIST BLOCK
+==============
+
++--------+--------+--------+--------+
+| prev | next |numitems| item1 |
++-----------------------------------+
+/ /
+/ more items /
+/ /
++--------+--------------------------+
+| item N |
++--------+
+
+Every free list block contains BTREE_FREELIST_BLOCK_ITEMS items.
+
+'next' is the offset of the next free list block for the same size. If the
+block is the last one, next is set to the value of zero.
+
+'prev' is the offset of the previous free list block for the same size. If the
+block is the first one, prev is set to the value of zero.
+
+'numitems' is the number of used items in this freelist block.
+Since this is a count it can't go over 32 bits, but it is written as 64 bits
+on disk so that the free list block is just a sequence of N 64 bit numbers.
+
+Every item is just a pointer to some place of the file.
+
+Implementations should try to keep freelist information in memory, for
+instance, for every size:
+
+- Offsets for all the freelist blocks
+- Number of elements of the last block (all the other blocks are full)
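The per-size bookkeeping suggested above can be sketched as a small struct, similar in spirit to the btree_freelist struct used later in btree.c. The items-per-block constant is an assumed value for the example:

```c
#include <assert.h>
#include <stdint.h>

#define FREELIST_BLOCK_ITEMS 62  /* assumed capacity per block */

/* In-memory view of one size's freelist: the offsets of all its
 * blocks plus the fill level of the last one (every earlier block
 * is full by construction). */
struct freelist_mem {
    uint64_t *blocks;     /* offsets of all freelist blocks, in order */
    uint32_t numblocks;
    uint32_t last_items;  /* used items in blocks[numblocks-1] */
};

/* Total free items tracked for this size. */
static uint64_t freelist_total_items(const struct freelist_mem *fl) {
    if (fl->numblocks == 0) return 0;
    return (uint64_t)(fl->numblocks - 1) * FREELIST_BLOCK_ITEMS
           + fl->last_items;
}
```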
+
+ALLOCATION
+==========
+
+The btree allocates disk space in pieces of power of two sizes. For instance
+when a new node must be created, BTREE_ALLOC is called with the size of the
+node. If the amount to allocate is not already a power of two, the allocator
+will return a block whose size is the nearest power of two greater than the
+specified size.
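A sketch of this rounding, modeled on the btree_alloc_realsize() function included later in this commit: it reserves 8 extra bytes for the on disk size header and never returns less than 16 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest power of two able to hold the request plus its 8 byte
 * on-disk size header, with a 16 byte minimum. */
static uint32_t alloc_realsize(uint32_t size) {
    uint32_t realsize = 16;      /* minimum allocation */
    while (realsize < size + 8)  /* leave room for the size header */
        realsize *= 2;
    return realsize;
}
```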
+
+Data is allocated by looking at the free list for that size. If there is an
+available block to reuse, it is removed from the free list and reused
+(just updating the number of items in the free list block, or alternatively
+removing the block and the link from the previous block if this was the
+latest item in the current free list block).
+
+If there is no space to reuse for the allocation, we check if there is
+room at the end of the file that is ready to be used. This is done simply
+by checking the 'free' and 'freeoff' fields in the header. If there is not
+enough space, a truncate(2) operation is performed against the file to
+allocate more space at the end.
+
+Every time BTREE_ALLOC performs an allocation, the returned block is
+prefixed with a byte reporting the size of the allocated block. Since
+allocations are always powers of two, this is just the exponent, so that
+2 raised to it gives the allocation size.
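The one-byte prefix scheme described above can be sketched as an exponent encode/decode pair. Note this illustrates the scheme as documented here; the code in this commit actually stores the full size in an 8 byte header instead.

```c
#include <assert.h>
#include <stdint.h>

/* Encode a power-of-two block size as its exponent (fits in a byte). */
static uint8_t size_to_exp(uint32_t realsize) {
    uint8_t exp = 0;
    while (realsize > 1) { realsize /= 2; exp++; }
    return exp;
}

/* Decode the exponent back to the block size: 2^exp. */
static uint32_t exp_to_size(uint8_t exp) {
    return (uint32_t)1 << exp;
}
```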
+
+So actually if we call BTREE_ALLOC(4), this is what happens:
+
+ * one byte is added, so BTREE_ALLOC will really try to allocate 5 bytes.
+ * the nearest power of two greater than 5 is 8.
+ * the freelist for 8 bytes is checked. If there is an item it is reused.
+ (note that we don't need to write the size prefix when we reuse an
+ item, there should already be the right number in the header).
+ * if there is no free space, more free space is created at the end, of
+ at least BTREE_PREALLOC_SIZE, or more if the allocation needed more.
+ * The one byte header is written as the first byte, and the pointer to the
+ next byte is returned.
+
+RELEASING AN ALLOCATION
+=======================
+
+BTREE_FREE does the contrary, releasing an allocation so that the space
+will be available for further allocations.
+
+What happens is simply that the pointer to the released allocation is
+put into the corresponding free list.
+
+As you can see the size of the btree file will never get smaller, as even
+when memory is released we keep it pre-allocated in free lists.
+A tool will be provided for offline compaction of databases that for some
+reason need to be restored to the minimum size, for instance for backups
+or WAN transfers.
+
+BTREE NODE
+==========
+
+The B-tree node is composed of:
+
+ N keys
+ N values pointers
+ N+1 pointers to child nodes
+
+Every node can have from BTREE_MIN_KEYS to BTREE_MAX_KEYS keys.
+
+All the keys have the same size of 16 bytes, that is, the first 16 bytes of
+the SHA1 sum of the real key if big keys support is enabled.
+
+Keys may also be 128 bit numbers when the btree is used as an index, or
+fixed length 16 byte keys, possibly zero padded at the end if the key
+is not binary safe but may be shorter.
+
++--------+--------+--------+--------+
+| start |numkeys | isleaf | notused|
++--------+--------+--------+--------+
+| key1 |
++--------+--------+--------+--------+
+| key2 |
++--------+--------+--------+--------+
+| ... all the other keys .. |
++--------+--------+--------+--------+
+| value pointer 1 | value pointer 2 |
++--------+--------+--------+--------+
+| ... N value pointers in total ... |
++-----------------------------------+
+| child pointer 1 | child pointer 2 |
++--------+--------+--------+--------+
+| .. N+1 child pointers in total .. |
++--------+--------------------------+
+| end |
++--------+
+
+start is just a random 32 bit number. The same random number is also written
+in the end field. This is used in order to detect corruptions (half written
+nodes).
+
+numkeys is the number of keys in this node, and is a count (32 bit integer).
+
+isleaf is set to 0 if the node is not a leaf, otherwise to 1. (32 bit integer)
+
+notused is an unused 32 bit field. It was added in order to make sure all
+the pointers are 8 bytes aligned; it may be used in future versions of the
+btree.
+
+key1 ... keyN are the fixed length keys, 16 bytes. If support for big keys
+is enabled, this is the SHA1 of the key, truncated at 16 bytes.
+
+All the pointers are simply 64 bit unsigned offsets.
+
+All nodes are allocated with space for the maximum number of keys, so for
+instance if BTREE_MAX_KEYS is 255, every node will be:
+
+4 + 4 + 4 + 4 + 16*255 + 8*255 + 8*256 + 4 bytes = 8188 bytes.
+
+It is important to select BTREE_MAX_KEYS so that it is just a little smaller
+than a power of two. In this case 8188 is just a bit smaller than 8192.
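The arithmetic above can be written as a function of the maximum key count, following the node layout: four 4 byte header fields, N 16 byte keys, N value pointers, N+1 child pointers, and the 4 byte end mark.

```c
#include <assert.h>
#include <stdint.h>

/* On-disk node size for a given maximum number of keys. */
static uint32_t node_size(uint32_t maxkeys) {
    return 4*4               /* start, numkeys, isleaf, notused */
         + 16*maxkeys        /* keys */
         + 8*maxkeys         /* value pointers */
         + 8*(maxkeys+1)     /* child pointers */
         + 4;                /* end mark */
}
```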
+
+REDIS LEVEL OPERATIONS
+======================
+
+DISKSTORE LOCK and DISKSTORE UNLOCK commands may be implemented in order to
+tell Redis to hold new changes in memory and not write new things to
+diskstore, exactly like what happens when a BGSAVE is in progress with
+diskstore enabled. This is useful in order to allow the system administrator
+to copy the btree file while Redis is running.
+
+FREE LIST HANDLING
+==================
+
+Let's assume that:
+
+1) The free list block of our btree is 512 bytes.
+2) We receive a btree_free() request for another block of 512 bytes.
+3) The latest free list block for size 512 bytes is completely full:
+
+[UUUUU ... UUUUU] (U means "used")
+
+Since we received a btree_free() we need to put a new item inside the free
+list block, but since it is full we need to allocate a new 512 bytes block.
+Allocating this block would result in the current block to have a free item!
+So after we link the new block we'll obtain:
+
+[UUUUU ... UUUU ] -> [ ]
+
+That's not correct as the previous block now has a free item.
+
+So what we do instead is to use the block that we should put into the free list
+as a new block for the free list. So the final result is:
+
+[UUUUU ... UUUUU] -> [ ]
+
+That is what we want.
+
+On the contrary, if we want to allocate a block 512 bytes in size, and there
+is an empty block at the tail of our free list, we just use that block itself.
+
+RANDOM IDEAS / TODO
+===================
+
+1) Take a global incremental 64 bit counter and set it always both at the start and at the end of a node, so that it is possible to detect half written blocks.
+2) Perhaps it is better to return allocations always zeroed, even when we reuse an entry in the freelist.
+3) Check at startup that the file length is equal to freeoff+free in header.
+
+USEFUL RESOURCES ON BTREES
+==========================
+
+- Introduction to Algorithms (Cormen, ...), chapter on B-TREEs.
+- http://dr-josiah.blogspot.com/2010/08/databases-on-ssds-initial-ideas-on.html
+- http://www.eecs.berkeley.edu/~brewer/cs262/LFS.pdf
+
+NODE SPLITTING
+==============
+
+x: c h
+ / | \
+ / | \
+ a | z zz zzz
+ |
+y: d e ee
+
+
+i = 1
+-----
+
+x: c e h
+ / | | \
+ / | | \
+ a | | z zz zzz
+ | |
+y: d ee
+
+
+k[i+1] = k[i], for numkeys-i = (2 - 1) = 1 times
+c[i+2] = c[i+1], for numkeys-i = (2 - 1) = 1 times
+k[i] = e
+c[i] = left
+c[i+1] = right
+
+i = 2
+-----
+
+x: c h zz
+ / | \ \
+ / | \ \
+ a | z zzz
+ |
+y: d e ee
+
+k[3] = k[2], for numkeys-i = (2-2) = 0 times
+c[4] = c[3], for numkeys-i = (2-2) = 0 times
+k[2] = zz
+c[i] = left
+c[i+1] = right
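The shifts traced above can be collected into one sketch: inserting the median key of the split child into parent x at position i moves numkeys-i keys (and the child pointers after them) one slot to the right, then drops in the median and the two halves. Names and array sizes are illustrative, and the 16 byte keys are abridged to integers.

```c
#include <assert.h>
#include <stdint.h>

#define MAXK 255

struct node_sketch {
    uint32_t numkeys;
    uint64_t keys[MAXK];        /* 16 byte keys abridged to u64 */
    uint64_t children[MAXK+1];
};

/* Insert 'median' at slot i of x, with 'left' and 'right' the two
 * halves of the split child, shifting the tail right by one. */
static void insert_split_key(struct node_sketch *x, uint32_t i,
                             uint64_t median, uint64_t left, uint64_t right)
{
    for (uint32_t j = x->numkeys; j > i; j--) {
        x->keys[j] = x->keys[j-1];          /* k[j] = k[j-1] */
        x->children[j+1] = x->children[j];  /* c[j+1] = c[j] */
    }
    x->keys[i] = median;
    x->children[i] = left;
    x->children[i+1] = right;
    x->numkeys++;
}
```

With x holding keys (c, h) and i = 1 this reproduces the first trace: one key and one child pointer shift right, then e lands in the middle.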
7 Makefile
@@ -0,0 +1,7 @@
+all: btree_example
+
+btree_example: btree.c btree_example.c
+ $(CC) -o btree_example btree.c btree_example.c -Wall -W -g -rdynamic -ggdb -O2
+
+clean:
+ rm -rf btree_example
44 README
@@ -0,0 +1,44 @@
+There are a number of btree (and variant) implementations around, but for
+many reasons, like complexity, tight coupling with external code bases, or
+licensing issues, finding an easy to embed implementation of an on disk
+btree is not an easy task.
+
+Since an on disk btree is a tool useful in a number of software projects this
+library is an attempt to bring to the table an open implementation of a btree.
+The term "open" here means: simple to use, BSD licensed, well documented, and
+simple to modify and understand.
+
+CURRENT STATUS
+==============
+
+Currently this is a work in progress, so far we have a subset of basic btree
+operations implemented on top of an on disk allocator (something like a
+file-based malloc).
+
+Supported operations are adding new keys, splitting of nodes.
+Deletion is not supported, nor update of old values.
+In other words the project is NOT usable so far, more work is needed.
+
+Currently everything is written to disk on every write for the sake of
+simplicity of the first implementation, but work is in progress to cache
+the allocator metadata in memory, so that performance can be improved.
+
+The goal is to eventually support all the following features:
+
+- A compromise between fast implementation and ability to incrementally reclaim
+ memory from disk automatically. This is why we have the on disk allocator.
+- Range queries using 128 bit precision integers, to use the btree as index.
+- Good tools for recovering and checking the btree.
+- Good documentation.
+
+Perhaps in the future:
+
+- Optional append only mode with compaction, for higher corruption resistance.
+
+In the first stage of the project the goal is to be good enough for the Redis
+project (in order to use this library for the diskstore feature of Redis).
+However, while trying to reach this goal, every care will be taken to retain
+a great level of generality, so that this lib will continue to live as a
+stand alone library. Redis will just happen to use a copy of it.
+
7 TODO
@@ -0,0 +1,7 @@
+There are tons of things to do as the lib is currently a work in progress
+and not ready to be used. So this is just a list of random things that must
+be done sooner or later.
+
+- In memory freelist. So there is no disk access for allocation/free; the btree will simply leak some memory on disk when the application does not properly close it.
+- crc32 in btree values. In the current allocation header we use a 64 bit length field, which is more than needed since our max allocation is 2GB. We needed the 8 byte header in order to preserve alignment, but we can use four of these bytes for crc32 purposes. This way the btree-check utility can validate values in a data agnostic way.
+- The btree-check utility should be able to rewrite the freelists. It can simply create an in-memory bitmap representing every 8 byte block of the btree. Then walk the whole btree, flipping every used 8 byte block to 1. At the end we can do a one-pass scan on the bitmap to populate all the free lists.
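The bitmap idea in the last item can be sketched with one bit per 8 byte block of the file: walking the btree sets the bit for every reachable block, and a final pass over the clear bits rebuilds the free lists. Names and the file size are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define FILE_BLOCKS 1024  /* file size / 8, assumed for the example */

static uint8_t bitmap[FILE_BLOCKS/8];

/* Mark the 8 byte block containing 'offset' as used. */
static void mark_used(uint64_t offset) {
    uint64_t block = offset / 8;
    bitmap[block/8] |= (uint8_t)(1 << (block % 8));
}

/* Check whether the 8 byte block containing 'offset' is marked. */
static int is_used(uint64_t offset) {
    uint64_t block = offset / 8;
    return (bitmap[block/8] >> (block % 8)) & 1;
}
```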
924 btree.c
@@ -0,0 +1,924 @@
+#include "btree.h"
+
+#include <assert.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <string.h>
+#include <sys/time.h>
+
+int btree_create(struct btree *bt);
+int btree_read_metadata(struct btree *bt);
+struct btree_node *btree_create_node(void);
+void btree_free_node(struct btree_node *n);
+int btree_write_node(struct btree *bt, struct btree_node *n, uint64_t offset);
+int btree_freelist_index_by_exp(int exponent);
+int btree_split_child(struct btree *bt, uint64_t pointedby, uint64_t parentoff,
+ int i, uint64_t childoff, uint64_t *newparent);
+
+/* ------------------------ UNIX standard VFS Layer ------------------------- */
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/stat.h>
+
+void *bvfs_unistd_open(char* path, int flags) {
+ int fd;
+ void *handle;
+
+ fd = open(path,((flags & BTREE_CREAT) ? O_CREAT : 0)|O_RDWR,0644);
+ if (fd == -1) return NULL;
+ handle = malloc(sizeof(fd));
+ *(int*)handle = fd;
+ return handle;
+}
+
+void bvfs_unistd_close(void *handle) {
+ int *fd = handle;
+
+ close(*fd);
+ free(handle);
+}
+
+ssize_t bvfs_unistd_pread(void *handle, void *buf, uint32_t nbytes,
+ uint64_t offset)
+{
+ int *fd = handle;
+
+ return pread(*fd,buf,nbytes,offset);
+}
+
+ssize_t bvfs_unistd_pwrite(void *handle, const void *buf, uint32_t nbytes,
+ uint64_t offset)
+{
+ int *fd = handle;
+
+ return pwrite(*fd,buf,nbytes,offset);
+}
+
+int bvfs_unistd_resize(void *handle, uint64_t length) {
+ int *fd = handle;
+
+ return ftruncate(*fd,length);
+}
+
+int bvfs_unistd_getsize(void *handle, uint64_t *size) {
+ int *fd = handle;
+ struct stat sb;
+
+ if (fstat(*fd,&sb) == -1) return -1;
+ *size = (uint64_t) sb.st_size;
+ return 0;
+}
+
+void bvfs_unistd_sync(void *handle) {
+ int *fd = handle;
+
+ fsync(*fd);
+}
+
+struct btree_vfs bvfs_unistd = {
+ bvfs_unistd_open,
+ bvfs_unistd_close,
+ bvfs_unistd_pread,
+ bvfs_unistd_pwrite,
+ bvfs_unistd_resize,
+ bvfs_unistd_getsize,
+ bvfs_unistd_sync
+};
+
+/* ------------------------- From/To Big endian ----------------------------- */
+
+void btree_u32_to_big(unsigned char *buf, uint32_t val) {
+ buf[0] = (val >> 24) & 0xff;
+ buf[1] = (val >> 16) & 0xff;
+ buf[2] = (val >> 8) & 0xff;
+ buf[3] = val & 0xff;
+}
+
+void btree_u64_to_big(unsigned char *buf, uint64_t val) {
+ buf[0] = (val >> 56) & 0xff;
+ buf[1] = (val >> 48) & 0xff;
+ buf[2] = (val >> 40) & 0xff;
+ buf[3] = (val >> 32) & 0xff;
+ buf[4] = (val >> 24) & 0xff;
+ buf[5] = (val >> 16) & 0xff;
+ buf[6] = (val >> 8) & 0xff;
+ buf[7] = val & 0xff;
+}
+
+uint32_t btree_u32_from_big(unsigned char *buf) {
+ uint32_t val = 0;
+
+ val |= buf[0] << 24;
+ val |= buf[1] << 16;
+ val |= buf[2] << 8;
+ val |= buf[3];
+ return val;
+}
+
+uint64_t btree_u64_from_big(unsigned char *buf) {
+ uint64_t val = 0;
+
+ val |= (uint64_t) buf[0] << 56;
+ val |= (uint64_t) buf[1] << 48;
+ val |= (uint64_t) buf[2] << 40;
+ val |= (uint64_t) buf[3] << 32;
+ val |= (uint64_t) buf[4] << 24;
+ val |= (uint64_t) buf[5] << 16;
+ val |= (uint64_t) buf[6] << 8;
+ val |= buf[7];
+ return val;
+}
+
+/* -------------------------- Utility functions ----------------------------- */
+
+/* We read and write too often to write bt->vfs->...(bt->vfs_handle...) all
+ * the time, so we use these two helper functions. */
+ssize_t btree_pwrite(struct btree *bt, const void *buf, uint32_t nbytes,
+ uint64_t offset)
+{
+ return bt->vfs->pwrite(bt->vfs_handle,buf,nbytes,offset);
+}
+
+ssize_t btree_pread(struct btree *bt, void *buf, uint32_t nbytes,
+ uint64_t offset)
+{
+ return bt->vfs->pread(bt->vfs_handle,buf,nbytes,offset);
+}
+
+/* We want to be able to write and read 32 and 64 bit integers easily and in
+ * a platform / endianness agnostic way. */
+ssize_t btree_pwrite_u32(struct btree *bt, uint32_t val, uint64_t offset) {
+ unsigned char buf[4];
+
+ btree_u32_to_big(buf,val);
+ return btree_pwrite(bt,buf,sizeof(buf),offset);
+}
+
+int btree_pwrite_u64(struct btree *bt, uint64_t val, uint64_t offset) {
+ unsigned char buf[8];
+
+ btree_u64_to_big(buf,val);
+ return btree_pwrite(bt,buf,sizeof(buf),offset);
+}
+
+int btree_pread_u32(struct btree *bt, uint32_t *val, uint64_t offset) {
+ unsigned char buf[4];
+
+ if (btree_pread(bt,buf,sizeof(buf),offset) == -1) return -1;
+ *val = btree_u32_from_big(buf);
+ return 0;
+}
+
+int btree_pread_u64(struct btree *bt, uint64_t *val, uint64_t offset) {
+ unsigned char buf[8];
+
+ if (btree_pread(bt,buf,sizeof(buf),offset) == -1) return -1;
+ *val = btree_u64_from_big(buf);
+ return 0;
+}
+
+void btree_sync(struct btree *bt) {
+ if (bt->flags & BTREE_FLAG_USE_WRITE_BARRIER)
+ bt->vfs->sync(bt->vfs_handle);
+}
+
+/* ---------------------------- BTREE operations ---------------------------- */
+
+void btree_set_flags(struct btree *bt, int flags) {
+ bt->flags |= flags;
+}
+
+void btree_clear_flags(struct btree *bt, int flags) {
+ bt->flags &= ~flags;
+}
+
+/* Open a btree. On error NULL is returned, and errno is set accordingly.
+ * Flags modify the behavior of the call:
+ *
+ * BTREE_CREAT: create the btree if it does not exist. */
+struct btree *btree_open(struct btree_vfs *vfs, char *path, int flags) {
+ struct btree *bt = NULL;
+ struct timeval tv;
+ int j, mkroot = 0;
+
+ /* Initialize a new btree structure */
+ if ((bt = malloc(sizeof(*bt))) == NULL) {
+ errno = ENOMEM;
+ return NULL;
+ }
+ bt->vfs = vfs ? vfs : &bvfs_unistd;
+ bt->vfs_handle = NULL;
+ bt->flags = BTREE_FLAG_USE_WRITE_BARRIER;
+ for (j = 0; j < BTREE_FREELIST_COUNT; j++) {
+ bt->freelist[j].numblocks = 0;
+ bt->freelist[j].blocks = NULL;
+ bt->freelist[j].last_items = 0;
+ }
+
+ /* Try opening the specified btree */
+ bt->vfs_handle = bt->vfs->open(path,0);
+ if (bt->vfs_handle == NULL) {
+ if (!(flags & BTREE_CREAT)) goto err;
+ /* Create the btree */
+ if ((bt->vfs_handle = bt->vfs->open(path,flags)) == NULL) goto err;
+ if (btree_create(bt) == -1) goto err;
+ mkroot = 1; /* Create the root node before returning */
+ }
+
+ /* There are things about our btree that we always take in memory,
+ * like all the free list block pointers and so forth.
+ * Once we open the btree, we need to load this data into memory. */
+ if (btree_read_metadata(bt) == -1) goto err;
+ gettimeofday(&tv,NULL);
+ bt->mark = (uint32_t) random() ^ tv.tv_sec ^ tv.tv_usec;
+
+ /* Write the root node if needed (only when DB is created) */
+ if (mkroot) {
+ struct btree_node *root;
+ uint64_t rootptr;
+
+ /* Allocate space for the root */
+ if ((rootptr = btree_alloc(bt,BTREE_NODE_SIZE)) == 0) goto err;
+
+ /* Create a fresh root node and write it on disk */
+ if ((root = btree_create_node()) == NULL) goto err;
+ root->isleaf = 1; /* Our first node is a leaf */
+ if (btree_write_node(bt,root,rootptr) == -1) {
+ btree_free_node(root);
+ goto err;
+ }
+ btree_free_node(root);
+ btree_sync(bt);
+
+ /* Write the root node pointer. */
+ if (btree_pwrite_u64(bt,rootptr,BTREE_HDR_ROOTPTR_POS) == -1) goto err;
+ bt->rootptr = rootptr;
+ btree_sync(bt);
+ }
+ return bt;
+
+err:
+ btree_close(bt);
+ return NULL;
+}
+
+/* Close a btree, even one that was unsuccessfully opened, so that
+ * btree_open() can use this function for cleanup on error. */
+void btree_close(struct btree *bt) {
+ int j;
+
+ if (!bt) return;
+ if (bt->vfs_handle) bt->vfs->close(bt->vfs_handle);
+ for (j = 0; j < BTREE_FREELIST_COUNT; j++)
+ free(bt->freelist[j].blocks);
+ free(bt);
+}
+
+#include <stdio.h>
+
+/* Create a new btree, populating the header, free lists.
+ * Note that this function is not exported, as callers should create a new
+ * btree using open with the BTREE_CREAT flag. */
+int btree_create(struct btree *bt) {
+ int size, j;
+ uint64_t filesize, freeoff;
+
+ /* Make room for all the objects we have in the header */
+ if (bt->vfs->getsize(bt->vfs_handle,&filesize) == -1) return -1;
+ assert(filesize == 0);
+
+ /* header: magic, version, free, freeoff */
+ size = 8*4;
+ /* Then we have our root free lists */
+ size += BTREE_FREELIST_COUNT * BTREE_FREELIST_BLOCK_SIZE;
+ /* And finally our root node pointer and actual node */
+ size += 8; /* root pointer */
+ size += BTREE_NODE_SIZE; /* root node */
+ if (bt->vfs->resize(bt->vfs_handle,size) == -1) return -1;
+
+ /* Now we have enough space to actually build the btree header,
+ * free lists, and root node. */
+
+ /* Magic and version */
+ if (btree_pwrite(bt,"REDBTREE00000000",16,0) == -1) return -1;
+
+ /* Free and Freeoff */
+ if (btree_pwrite_u64(bt,0,BTREE_HDR_FREE_POS) == -1) return -1;
+ freeoff = 32+BTREE_FREELIST_BLOCK_SIZE*BTREE_FREELIST_COUNT+8+BTREE_NODE_SIZE;
+ if (btree_pwrite_u64(bt,freeoff,BTREE_HDR_FREEOFF_POS) == -1) return -1;
+
+ /* Free lists */
+ for (j = 0; j < BTREE_FREELIST_COUNT; j++) {
+ uint64_t off = 32+BTREE_FREELIST_BLOCK_SIZE*j;
+
+ /* next and prev pointers are set to zero, as this is the first
+ * and sole block for this size. */
+ if (btree_pwrite_u64(bt,0,off) == -1) return -1;
+ if (btree_pwrite_u64(bt,0,off+8) == -1) return -1;
+ /* Set count to zero, as we have no entry inside this block.
+ * The count is stored as a 64 bit number on disk. */
+ if (btree_pwrite_u64(bt,0,off+16) == -1) return -1;
+ }
+ return 0;
+}
+
+int btree_read_metadata(struct btree *bt) {
+ int j;
+
+ /* TODO: Check signature and version. */
+ /* Read free space and offset information */
+ if (btree_pread_u64(bt,&bt->free,BTREE_HDR_FREE_POS) == -1) return -1;
+ if (btree_pread_u64(bt,&bt->freeoff,BTREE_HDR_FREEOFF_POS) == -1) return -1;
+ /* TODO: check that they makes sense considered the file size. */
+ /* Read root node pointer */
+ if (btree_pread_u64(bt,&bt->rootptr,BTREE_HDR_ROOTPTR_POS) == -1) return -1;
+ printf("Root node is at %llu\n", (unsigned long long) bt->rootptr);
+ /* Read free lists information */
+ for (j = 0; j < BTREE_FREELIST_COUNT; j++) {
+ uint64_t ptr = 32+BTREE_FREELIST_BLOCK_SIZE*j;
+ uint64_t nextptr, numitems;
+
+ // printf("Load metadata for freelist %d\n", j);
+ do {
+ struct btree_freelist *fl = &bt->freelist[j];
+
+ if (btree_pread_u64(bt,&nextptr,ptr+sizeof(uint64_t)) == -1)
+ return -1;
+ if (btree_pread_u64(bt,&numitems,ptr+sizeof(uint64_t)*2) == -1)
+ return -1;
+ // printf(" block %lld: %lld items (next: %lld)\n", ptr, numitems,
+ // nextptr);
+ fl->blocks = realloc(fl->blocks,sizeof(uint64_t)*(fl->numblocks+1));
+ if (fl->blocks == NULL) return -1;
+ fl->blocks[fl->numblocks] = ptr;
+ fl->numblocks++;
+ fl->last_items = numitems;
+ ptr = nextptr;
+ } while(ptr);
+ }
+ return 0;
+}
+
+/* Create a new node in memory */
+struct btree_node *btree_create_node(void) {
+ struct btree_node *n = calloc(1,sizeof(*n));
+
+ return n;
+}
+
+void btree_free_node(struct btree_node *n) {
+ free(n);
+}
+
+/* Write a node on disk at the specified offset. Returns 0 on success.
+ * On error -1 is returne and errno set accordingly. */
+int btree_write_node(struct btree *bt, struct btree_node *n, uint64_t offset) {
+ unsigned char buf[BTREE_NODE_SIZE];
+ unsigned char *p = buf;
+ int j;
+
+ bt->mark++;
+ btree_u32_to_big(p,bt->mark); p += 4; /* start mark */
+ btree_u32_to_big(p,n->numkeys); p += 4; /* number of keys */
+ btree_u32_to_big(p,n->isleaf); p += 4; /* is a leaf? */
+ p += 4; /* unused field, needed for alignment */
+ memcpy(p,n->keys,sizeof(n->keys)); p += sizeof(n->keys); /* keys */
+ /* values */
+ for (j = 0; j < BTREE_MAX_KEYS; j++) {
+ btree_u64_to_big(p,n->values[j]);
+ p += 8;
+ }
+ /* children */
+ for (j = 0; j <= BTREE_MAX_KEYS; j++) {
+ btree_u64_to_big(p,n->children[j]);
+ p += 8;
+ }
+ btree_u32_to_big(p,bt->mark); p += 4; /* end mark */
+ return btree_pwrite(bt,buf,sizeof(buf),offset);
+}
+
+/* Read a node from the specified offset.
+ * On success the in memory representation of the node is returned as a
+ * btree_node structure (to be freed with btree_free_node). On error
+ * NULL is returned and errno set accordingly.
+ *
+ * If data on disk is corrupted errno is set to EFAULT. */
+struct btree_node *btree_read_node(struct btree *bt, uint64_t offset) {
+ unsigned char buf[BTREE_NODE_SIZE], *p;
+ struct btree_node *n;
+ int j;
+
+ if (btree_pread(bt,buf,sizeof(buf),offset) == -1) return NULL;
+ /* Verify start/end marks */
+ if (memcmp(buf,buf+BTREE_NODE_SIZE-4,4)) {
+ errno = EFAULT;
+ return NULL;
+ }
+ if ((n = btree_create_node()) == NULL) return NULL;
+
+ p = buf+4;
+ n->numkeys = btree_u32_from_big(p); p += 4; /* number of keys */
+ n->isleaf = btree_u32_from_big(p); p += 4; /* is a leaf? */
+ p += 4; /* unused field, needed for alignment */
+ memcpy(n->keys,p,sizeof(n->keys)); p += sizeof(n->keys); /* keys */
+ /* values */
+ for (j = 0; j < BTREE_MAX_KEYS; j++) {
+ n->values[j] = btree_u64_from_big(p);
+ p += 8;
+ }
+ /* children */
+ for (j = 0; j <= BTREE_MAX_KEYS; j++) {
+ n->children[j] = btree_u64_from_big(p);
+ p += 8;
+ }
+ return n;
+}
+
+/* ------------------------- disk space allocator --------------------------- */
+
+/* Compute logarithm in base two of 'n', with 'n' being a power of two.
+ * Probably you can just check the latest 1 bit set, but here it's not
+ * a matter of speed as we are dealing with the disk every time we call
+ * this function. */
+int btree_log_two(uint32_t n) {
+ int log = -1;
+
+ while(n) {
+ log++;
+ n /= 2;
+ }
+ return log;
+}
+
+int btree_alloc_freelist(struct btree *bt, uint32_t realsize, uint64_t *ptr) {
+ int exp = btree_log_two(realsize);
+ int fli = btree_freelist_index_by_exp(exp);
+ struct btree_freelist *fl = &bt->freelist[fli];
+ uint64_t block, lastblock = 0, p;
+
+ if (fl->last_items == 0 && fl->numblocks == 1) {
+ *ptr = 0;
+ return 0;
+ }
+
+ /* Last block is empty? Remove it */
+ if (fl->last_items == 0) {
+ uint64_t prevblock, *oldptr;
+
+ assert(fl->numblocks > 1);
+ /* Set prevblock next pointer to NULL */
+ prevblock = fl->blocks[fl->numblocks-2];
+ if (btree_pwrite_u64(bt,0,prevblock+sizeof(uint64_t)) == -1) return -1;
+ btree_sync(bt);
+ /* Fix our memory representation of the freelist */
+ lastblock = fl->blocks[fl->numblocks-1];
+ fl->numblocks--;
+ /* The previous item must be full, so we set the new number
+ * of items to the max. */
+ fl->last_items = BTREE_FREELIST_BLOCK_ITEMS;
+ /* Realloc the block as we have one element less. */
+ oldptr = fl->blocks;
+ fl->blocks = realloc(fl->blocks,sizeof(uint64_t)*fl->numblocks);
+ if (fl->blocks == NULL) {
+ /* Out of memory. The realloc failed, but note that while this
+ * is a leak as the block remains larger than needed we still
+ * have a valid in memory representation. */
+ fl->blocks = oldptr;
+ return -1;
+ }
+ }
+
+ /* There was a block to remove, but this block is the same size
+ * as the allocation required? Just return it. */
+ if (lastblock && exp == BTREE_FREELIST_SIZE_EXP) {
+ *ptr = lastblock;
+ return 0;
+ } else if (lastblock) {
+ btree_free(bt,lastblock);
+ }
+
+ /* Get an element from the current block, and return it to the
+ * caller. */
+ block = fl->blocks[fl->numblocks-1];
+ if (btree_pread_u64(bt,&p,block+((2+fl->last_items)*sizeof(uint64_t))) == -1) return -1;
+ fl->last_items--;
+ if (btree_pwrite_u64(bt,fl->last_items,block+(2*sizeof(uint64_t))) == -1) return -1;
+ btree_sync(bt);
+ *ptr = p+sizeof(uint64_t);
+ return 0;
+}
+
+/* Return the next power of two that is able to hold size+1 bytes.
+ * The byte we add is used to save the exponent of two as the first byte
+ * so that for btree_free() can check the block size. */
+uint32_t btree_alloc_realsize(uint32_t size) {
+ uint32_t realsize;
+
+ realsize = 16; /* We never allocate anything smaller than 16 bytes */
+ while (realsize < (size+sizeof(uint64_t))) realsize *= 2;
+ return realsize;
+}
+
+/* Allocate some piece of data on disk. Returns the offset to the newly
+ * allocated space. If the allocation can't be performed, 0 is returned. */
+uint64_t btree_alloc(struct btree *bt, uint32_t size) {
+ uint64_t ptr;
+ uint32_t realsize;
+
+ printf("ALLOCATING %lu\n", (unsigned long) size);
+
+ /* Don't allow allocations bigger than 2GB */
+ if (size > (1u<<31)) {
+ errno = EINVAL;
+ return 0;
+ }
+ realsize = btree_alloc_realsize(size);
+
+ /* Search for free space in the free lists */
+ if (btree_alloc_freelist(bt,realsize,&ptr) == -1) return 0;
+ if (ptr) {
+ uint64_t oldsize;
+ /* Got an element from the free list. Fix the size header if needed. */
+ if (btree_pread_u64(bt,&oldsize,ptr-sizeof(uint64_t)) == -1) return 0;
+ if (oldsize != size) {
+ if (btree_pwrite_u64(bt,size,ptr-sizeof(uint64_t)) == -1)
+ return 0;
+ btree_sync(bt);
+ }
+ return ptr;
+ }
+
+ /* We have to perform a real allocation.
+ * If we don't have room at the end of the file, create some space. */
+ if (bt->free < realsize) {
+ uint64_t currsize = bt->freeoff + bt->free;
+ if (bt->vfs->resize(bt->vfs_handle,currsize+BTREE_PREALLOC_SIZE) == -1)
+ return 0;
+ bt->free += BTREE_PREALLOC_SIZE;
+ }
+
+ /* Allocate it moving the header pointers and free space count */
+ ptr = bt->freeoff;
+ bt->free -= realsize;
+ bt->freeoff += realsize;
+
+ if (btree_pwrite_u64(bt,bt->free,BTREE_HDR_FREE_POS) == -1) return 0;
+ if (btree_pwrite_u64(bt,bt->freeoff,BTREE_HDR_FREEOFF_POS) == -1) return 0;
+
+ /* Write the size header in the new allocated space */
+ if (btree_pwrite_u64(bt,size,ptr) == -1) return 0;
+
+ /* A final fsync() as a write barrier */
+ btree_sync(bt);
+ return ptr+sizeof(uint64_t);
+}
+
+/* Given an on disk pointer returns the length of the original allocation
+ * (not the size of the chunk itself as a power of two, but the original
+ * argument passed to btree_alloc function).
+ *
+ * On success 0 is returned and the size parameter populated, otherwise
+ * -1 is returned and errno set accordingly. */
+int btree_alloc_size(struct btree *bt, uint32_t *size, uint64_t ptr) {
+ uint64_t s;
+
+ if (btree_pread_u64(bt,&s,ptr-8) == -1) return -1;
+ *size = (uint32_t) s;
+ return 0;
+}
+
+/* Return the free list slot index given the power of two exponent representing
+ * the size of the free list allocations. */
+int btree_freelist_index_by_exp(int exponent) {
+    assert(exponent >= 4 && exponent < 32);
+ return exponent-4;
+}
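btree_log_two() is also outside this hunk. Assuming it returns the base-2 exponent of an already rounded power-of-two size, the slot math above works out so that the smallest class, 16 = 2^4, lands in slot 0 and 2GB = 2^31 lands in slot 27, the last of the 28 lists. A sketch under that assumption:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed behavior of btree_log_two(): exponent of an exact power of two
 * (the allocator only ever calls it on rounded-up sizes). */
int log_two_sketch(uint32_t v) {
    int e = 0;
    while (v > 1) { v >>= 1; e++; }
    return e;
}
```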
+
+/* Release allocated memory, putting the pointer in the right free list.
+ * On success 0 is returned. On error -1. */
+int btree_free(struct btree *bt, uint64_t ptr) {
+ uint64_t size;
+ uint32_t realsize;
+ int fli, exp;
+ struct btree_freelist *fl;
+
+ if (btree_pread_u64(bt,&size,ptr-sizeof(uint64_t)) == -1) return -1;
+ realsize = btree_alloc_realsize(size);
+ exp = btree_log_two(realsize);
+    printf("Free %llu bytes (realsize: %llu)\n",
+        (unsigned long long) size, (unsigned long long) realsize);
+
+ fli = btree_freelist_index_by_exp(exp);
+ fl = &bt->freelist[fli];
+
+    /* We need special handling when freeing an allocation that is the same
+     * size as the freelist block itself, and the latest free list block for
+     * that size is full. Without this special handling we would need to
+     * allocate a new block of the same size to make space, but doing so
+     * would remove an element from the latest block, so after linking the
+     * new block the previous block would no longer be full.
+     *
+     * Check BTREE.txt in this source distribution for more information. */
+ if (fl->last_items == BTREE_FREELIST_BLOCK_ITEMS &&
+ exp == BTREE_FREELIST_SIZE_EXP)
+ {
+        /* Just use the freed allocation as the next free block.
+         * Assign realloc() to a temporary so the old array is not
+         * leaked on failure. */
+        uint64_t *newblocks = realloc(fl->blocks,sizeof(uint64_t)*(fl->numblocks+1));
+        if (newblocks == NULL) return -1;
+        fl->blocks = newblocks;
+        fl->blocks[fl->numblocks] = ptr;
+ fl->numblocks++;
+ fl->last_items = 0;
+ /* Init block setting items count, next pointer, prev pointer. */
+ btree_pwrite_u64(bt,0,ptr+sizeof(uint64_t)); /* next */
+ btree_pwrite_u64(bt,fl->blocks[fl->numblocks-2],ptr); /* prev */
+ btree_pwrite_u64(bt,0,ptr+sizeof(uint64_t)*2); /* numitems */
+ btree_sync(bt); /* Make sure it's ok before linking it to prev block */
+ /* Link this new block to the free list blocks updating next pointer
+ * of the previous block. */
+ btree_pwrite_u64(bt,ptr,fl->blocks[fl->numblocks-2]+sizeof(uint64_t));
+ btree_sync(bt);
+ } else {
+ /* Allocate a new block if needed */
+ if (fl->last_items == BTREE_FREELIST_BLOCK_ITEMS) {
+            uint64_t newblock, *newblocks;
+
+            newblock = btree_alloc(bt,BTREE_FREELIST_BLOCK_SIZE);
+            if (newblock == 0) return -1;
+
+            /* Assign realloc() to a temporary so the old array is not
+             * leaked on failure. */
+            newblocks = realloc(fl->blocks,sizeof(uint64_t)*(fl->numblocks+1));
+            if (newblocks == NULL) return -1;
+            fl->blocks = newblocks;
+            fl->blocks[fl->numblocks] = newblock;
+ fl->numblocks++;
+ fl->last_items = 0;
+ /* Init block setting items count, next pointer, prev pointer. */
+ btree_pwrite_u64(bt,0,newblock+sizeof(uint64_t)); /* next */
+ btree_pwrite_u64(bt,fl->blocks[fl->numblocks-2],newblock);/* prev */
+ btree_pwrite_u64(bt,0,newblock+sizeof(uint64_t)*2); /* numitems */
+ btree_sync(bt); /* Make sure it's ok before linking it. */
+ /* Link this new block to the free list blocks updating next pointer
+ * of the previous block. */
+ btree_pwrite_u64(bt,newblock,fl->blocks[fl->numblocks-2]+sizeof(uint64_t));
+ btree_sync(bt);
+ }
+ /* Add the item */
+ fl->last_block[fl->last_items] = ptr-sizeof(uint64_t);
+ fl->last_items++;
+ /* Write the pointer in the block first */
+        printf("Write freelist item about ptr %llu at %llu\n",
+            (unsigned long long) ptr,
+            (unsigned long long) (fl->blocks[fl->numblocks-1]+(sizeof(uint64_t)*3)
+            +(sizeof(uint64_t)*(fl->last_items-1))));
+ btree_pwrite_u64(bt,ptr-sizeof(uint64_t),fl->blocks[fl->numblocks-1]+(sizeof(uint64_t)*3)+(sizeof(uint64_t)*(fl->last_items-1)));
+ btree_sync(bt);
+ /* Then write the items count. */
+        printf("Write the new count for block %llu: %llu at %llu\n",
+            (unsigned long long) fl->blocks[fl->numblocks-1],
+            (unsigned long long) fl->last_items,
+            (unsigned long long) (fl->blocks[fl->numblocks-1]+sizeof(uint64_t)*2));
+ btree_pwrite_u64(bt,fl->last_items,fl->blocks[fl->numblocks-1]+sizeof(uint64_t)*2);
+ btree_sync(bt);
+ }
+ return 0;
+}
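The offset arithmetic above follows the on-disk block layout described in btree.h: three u64 fields (prev, next, numitems) followed by the item pointers. A small helper makes the item offsets explicit:

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of the i-th item pointer inside a free list block laid
 * out as [prev][next][numitems][item0][item1]..., all 8-byte fields. */
uint64_t freelist_item_off(uint64_t block, uint32_t i) {
    return block + 8*3 + 8*(uint64_t)i;
}
```

With BTREE_FREELIST_BLOCK_ITEMS = 252, the last item (index 251) ends exactly at byte 2040, the block size.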
+
+/* --------------------------- btree operations ---------------------------- */
+
+int btree_node_is_full(struct btree_node *n) {
+ return n->numkeys == BTREE_MAX_KEYS;
+}
+
+/* Add a key at the specified position 'i' inside an in-memory node.
+ * All the other keys starting from the old key at position 'i' are
+ * shifted one position to the right.
+ *
+ * Note: this function does not change the position of the children as it
+ * is intended to be used only on leaves. */
+void btree_node_insert_key_at(struct btree_node *n, int i, unsigned char *key, uint64_t valoff) {
+ void *p;
+
+ p = n->keys + (i*BTREE_HASHED_KEY_LEN);
+ memmove(p+BTREE_HASHED_KEY_LEN,p,(n->numkeys-i)*BTREE_HASHED_KEY_LEN);
+ memmove(n->values+i+1,n->values+i,(n->numkeys-i)*8);
+ memcpy(p,key,BTREE_HASHED_KEY_LEN);
+ n->values[i] = valoff;
+ n->numkeys++;
+}
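The shift-then-write pattern used here can be illustrated on a plain uint64_t array (this demo is not the node layout itself, just the same memmove idiom):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Insert v at position i among the first n elements of a[], shifting
 * the tail one slot to the right -- the memmove idiom used for the
 * node's keys and values arrays. */
void insert_at(uint64_t *a, int n, int i, uint64_t v) {
    memmove(a+i+1, a+i, (n-i)*sizeof(uint64_t));
    a[i] = v;
}

/* Returns 1 if inserting 30 at position 2 of {10,20,40,50} yields the
 * sorted array {10,20,30,40,50}, 0 otherwise. */
int insert_at_demo(void) {
    uint64_t a[5] = {10,20,40,50,0};
    uint64_t want[5] = {10,20,30,40,50};
    insert_at(a, 4, 2, 30);
    return memcmp(a, want, sizeof(a)) == 0;
}
```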
+
+/* Insert a key (and associated value) into a non-full node.
+ * If the node is a leaf the key can be inserted into the current node,
+ * otherwise we need to walk the tree, possibly splitting full nodes as we
+ * descend.
+ *
+ * The nodeptr is the offset of the node we want to insert into.
+ *
+ * Pointedby is the offset on disk inside the parent of the node pointed by
+ * 'nodeptr'. As we always write new full nodes instead of modifying old ones
+ * in order to be more crash proof, we need to update the pointer in the
+ * parent node when everything is ready.
+ *
+ * The function returns 0 on success, and -1 on error.
+ * On error errno is set accordingly, and may also assume the following values:
+ *
+ * EFAULT if the btree seems corrupted.
+ * EEXIST if the key already exists.
+ */
+int btree_add_nonfull(struct btree *bt, uint64_t nodeptr, uint64_t pointedby, unsigned char *key, unsigned char *val, size_t vlen) {
+ struct btree_node *n = NULL;
+ int i;
+
+ if ((n = btree_read_node(bt,nodeptr)) == NULL) return -1;
+ i = n->numkeys-1;
+
+ /* Seek to the right position in the current node */
+ while(i >= 0 && memcmp(key,n->keys+i*BTREE_HASHED_KEY_LEN,BTREE_HASHED_KEY_LEN) < 0) i--;
+
+ if (n->isleaf) {
+ uint64_t newoff; /* New node offset */
+ uint64_t valoff; /* Value offset on disk */
+
+ /* Write the value on disk */
+ if ((valoff = btree_alloc(bt,vlen)) == 0) goto err;
+ if (btree_pwrite(bt,val,vlen,valoff) == -1) goto err;
+ /* Insert the new key in place, and a pointer to the value. */
+ btree_node_insert_key_at(n,i+1,key,valoff);
+ /* Write the modified node to disk */
+ if ((newoff = btree_alloc(bt,BTREE_NODE_SIZE)) == 0) goto err;
+ if (btree_write_node(bt,n,newoff) == -1) goto err;
+ /* Update the pointer pointing to this node with the new node offset. */
+ if (btree_pwrite_u64(bt,newoff,pointedby) == -1) goto err;
+ if (pointedby == BTREE_HDR_ROOTPTR_POS) bt->rootptr = newoff;
+ /* Free the old node on disk */
+ if (btree_free(bt,nodeptr) == -1) goto err;
+ btree_free_node(n);
+ } else {
+ struct btree_node *child;
+ uint64_t newnode;
+
+ i++;
+        if ((child = btree_read_node(bt,n->children[i])) == NULL) goto err;
+ if (btree_node_is_full(child)) {
+ if (btree_split_child(bt,pointedby,nodeptr,i,n->children[i],
+ &newnode) == -1)
+ {
+ btree_free_node(child);
+ goto err;
+ }
+ } else {
+ pointedby = nodeptr+16+BTREE_HASHED_KEY_LEN*BTREE_MAX_KEYS+8*BTREE_MAX_KEYS+8*i;
+ newnode = n->children[i];
+ /* Fixme, here we can set 'n' to 'child' and tail-recurse with
+ * a goto, to avoid re-reading the same node again. */
+ }
+ btree_free_node(n);
+ btree_free_node(child);
+ return btree_add_nonfull(bt,newnode,pointedby,key,val,vlen);
+ }
+ return 0;
+
+err:
+ btree_free_node(n);
+ return -1;
+}
+
+/* Split child, that is the i-th child of parent.
+ * We'll write three new nodes, two to split the original child in two nodes
+ * and one containing the updated parent.
+ * Finally we'll set 'pointedby' to the offset of the new parent. So
+ * pointedby must point to the offset where the parent is referenced on disk,
+ * that is, the root pointer header if it's the root node, or the right offset
+ * inside its parent (that is, the parent of the parent). */
+int btree_split_child(struct btree *bt, uint64_t pointedby, uint64_t parentoff,
+ int i, uint64_t childoff, uint64_t *newparent)
+{
+ struct btree_node *lnode = NULL, *rnode = NULL;
+ struct btree_node *child = NULL, *parent = NULL;
+ int halflen = (BTREE_MAX_KEYS-1)/2;
+    uint64_t loff, roff, poff; /* new left, right, parent node offsets. */
+
+ /* Read parent and child from disk.
+ * Also creates new nodes in memory, lnode and rnode, that will be
+ * the nodes produced splitting the child into two nodes. */
+ if ((parent = btree_read_node(bt,parentoff)) == NULL) goto err;
+ if ((child = btree_read_node(bt,childoff)) == NULL) goto err;
+ if ((lnode = btree_create_node()) == NULL) goto err;
+ if ((rnode = btree_create_node()) == NULL) goto err;
+ /* Two fundamental conditions that must be always true */
+ assert(child->numkeys == BTREE_MAX_KEYS);
+ assert(parent->numkeys != BTREE_MAX_KEYS);
+ /* Split the child into lnode and rnode */
+ memcpy(lnode->keys,child->keys,BTREE_HASHED_KEY_LEN*halflen);
+ memcpy(lnode->values,child->values,8*halflen);
+ memcpy(lnode->children,child->children,8*(halflen+1));
+ lnode->numkeys = halflen;
+ lnode->isleaf = child->isleaf;
+ /* And the rnode */
+ memcpy(rnode->keys,child->keys+BTREE_HASHED_KEY_LEN*(halflen+1),
+ BTREE_HASHED_KEY_LEN*halflen);
+ memcpy(rnode->values,child->values+halflen+1,8*halflen);
+ memcpy(rnode->children,child->children+halflen+1,8*(halflen+1));
+ rnode->numkeys = halflen;
+ rnode->isleaf = child->isleaf;
+ /* Save left and right children on disk */
+ if ((loff = btree_alloc(bt,BTREE_NODE_SIZE)) == 0) goto err;
+ if ((roff = btree_alloc(bt,BTREE_NODE_SIZE)) == 0) goto err;
+ if (btree_write_node(bt,lnode,loff) == -1) goto err;
+ if (btree_write_node(bt,rnode,roff) == -1) goto err;
+
+ /* Now fix the parent node:
+ * let's move the child's median key into the parent.
+ * Shift the current keys, values, and child pointers. */
+ memmove(parent->keys+BTREE_HASHED_KEY_LEN*(i+1),
+ parent->keys+BTREE_HASHED_KEY_LEN*i,
+ (parent->numkeys-i)*BTREE_HASHED_KEY_LEN);
+ memmove(parent->values+i+1,parent->values+i,(parent->numkeys-i)*8);
+ memmove(parent->children+i+2,parent->children+i+1,(parent->numkeys-i)*8);
+ /* Set the key and left and right children */
+ memcpy(parent->keys+BTREE_HASHED_KEY_LEN*i,
+ child->keys+BTREE_HASHED_KEY_LEN*halflen,BTREE_HASHED_KEY_LEN);
+ parent->values[i] = child->values[halflen];
+ parent->children[i] = loff;
+ parent->children[i+1] = roff;
+ parent->numkeys++;
+ /* Write the parent on disk */
+ if ((poff = btree_alloc(bt,BTREE_NODE_SIZE)) == 0) goto err;
+ if (btree_write_node(bt,parent,poff) == -1) goto err;
+ if (newparent) *newparent = poff;
+ /* Now link the new nodes to the old btree */
+ btree_sync(bt); /* Make sure the nodes are flushed */
+ if (btree_pwrite_u64(bt,poff,pointedby) == -1) goto err;
+ if (pointedby == BTREE_HDR_ROOTPTR_POS) bt->rootptr = poff;
+ /* Finally reclaim the space used by the old nodes */
+ btree_free(bt,parentoff);
+ btree_free(bt,childoff);
+
+ btree_free_node(lnode);
+ btree_free_node(rnode);
+ btree_free_node(parent);
+ btree_free_node(child);
+ return 0;
+
+err:
+ btree_free_node(lnode);
+ btree_free_node(rnode);
+ btree_free_node(parent);
+ btree_free_node(child);
+ return -1;
+}
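With BTREE_MAX_KEYS = 7, halflen = (7-1)/2 = 3: keys 0..2 go to the left node, key 3 (the median) is promoted into the parent, and keys 4..6 go to the right node. Note this only covers every key when the maximum is odd (both 7 and the commented-out 255 are), since halflen + 1 + halflen must equal BTREE_MAX_KEYS. A quick check of that arithmetic:

```c
#include <assert.h>

/* Number of keys kept on each side when splitting a full node of
 * maxkeys keys; the remaining (median) key goes up to the parent. */
int split_halflen(int maxkeys) {
    return (maxkeys - 1) / 2;
}
```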
+
+int btree_add(struct btree *bt, unsigned char *key, unsigned char *val, size_t vlen) {
+ struct btree_node *root;
+
+ if ((root = btree_read_node(bt,bt->rootptr)) == NULL) return -1;
+
+ if (btree_node_is_full(root)) {
+ uint64_t rootptr;
+
+ /* Root is full. Split it. */
+ btree_free_node(root);
+ root = NULL;
+ /* Create a fresh node on disk: will be our new root. */
+ if ((root = btree_create_node()) == NULL) return -1;
+ if ((rootptr = btree_alloc(bt,BTREE_NODE_SIZE)) == 0) goto err;
+        if (btree_write_node(bt,root,rootptr) == -1) goto err;
+        btree_free_node(root);
+        root = NULL;
+        /* Split it */
+        if (btree_split_child(bt,BTREE_HDR_ROOTPTR_POS,rootptr,0,bt->rootptr,NULL) == -1) goto err;
+    } else {
+        btree_free_node(root);
+    }
+    return btree_add_nonfull(bt,bt->rootptr,BTREE_HDR_ROOTPTR_POS,key,val,vlen);
+
+err:
+    if (root) btree_free_node(root);
+    return -1;
+}
+
+/* Just a debugging function to check what's inside the whole btree... */
+void btree_walk(struct btree *bt, uint64_t nodeptr) {
+ struct btree_node *n;
+ unsigned int j;
+
+ n = btree_read_node(bt,nodeptr);
+ if (n == NULL) {
+ printf("Error walking the btree: %s\n", strerror(errno));
+ return;
+ }
+ for (j = 0; j < n->numkeys; j++) {
+ char *data;
+ uint32_t datalen;
+
+ if (n->children[j] != 0) {
+ btree_walk(bt,n->children[j]);
+ }
+ if (j == 0)
+            printf("Node at %llu, %d keys\n",
+                (unsigned long long) nodeptr, (int)n->numkeys);
+ printf(" Key %20s: ", n->keys+(j*BTREE_HASHED_KEY_LEN));
+ btree_alloc_size(bt,&datalen,n->values[j]);
+ data = malloc(datalen+1);
+ btree_pread(bt,data,datalen,n->values[j]);
+ data[datalen] = '\0';
+        printf("@%llu %lu bytes: %s\n",
+            (unsigned long long) n->values[j],
+            (unsigned long)datalen, data);
+ free(data);
+ }
+    if (n->children[j] != 0) {
+        btree_walk(bt,n->children[j]);
+    }
+    btree_free_node(n);
+}
111 btree.h
@@ -0,0 +1,111 @@
+#include <stdint.h>
+#include <sys/types.h>
+
+#define BTREE_CREAT 1
+
+#define BTREE_PREALLOC_SIZE (1024*512)
+#define BTREE_FREELIST_BLOCK_ITEMS 252
+#define BTREE_MIN_KEYS 4
+//#define BTREE_MAX_KEYS 255
+#define BTREE_MAX_KEYS 7
+#define BTREE_HASHED_KEY_LEN 16
+
+/* We have free lists for the following sizes:
+ * 16 32 64 128 256 512 1024 2048 4096 8192 16k 32k 64k 128k 256k 512k 1M 2M 4M 8M 16M 32M 64M 128M 256M 512M 1G 2G */
+#define BTREE_FREELIST_COUNT 28
+
+/* A free list block is composed of 2 pointers (prev, next), one count
+ * (numitems), and a pointer for every free list item inside. */
+#define BTREE_FREELIST_BLOCK_SIZE ((8*3)+(8*BTREE_FREELIST_BLOCK_ITEMS))
+#define BTREE_FREELIST_SIZE_EXP 11 /* 2^11 = 2048 */
+
+/* A node is composed of:
+ * one count (startmark),
+ * one count (numkeys),
+ * one count (isleaf),
+ * BTREE_MAX_KEYS keys (16 bytes for each key, as our keys are fixed size),
+ * BTREE_MAX_KEYS pointers to values,
+ * BTREE_MAX_KEYS+1 child pointers,
+ * and a final count(endmark) */
+#define BTREE_NODE_SIZE (4*4+BTREE_MAX_KEYS*BTREE_HASHED_KEY_LEN+((BTREE_MAX_KEYS*2)+1)*8+4)
+
+/* Offsets inside the file of the 'free' and 'freeoff' fields */
+#define BTREE_HDR_FREE_POS 16
+#define BTREE_HDR_FREEOFF_POS 24
+#define BTREE_HDR_ROOTPTR_POS (32+(BTREE_FREELIST_BLOCK_SIZE*BTREE_FREELIST_COUNT))
+
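The constants above can be cross-checked numerically (values for the current BTREE_MAX_KEYS = 7): the freelist block is 24 + 8*252 = 2040 bytes, which the allocator rounds up to 2048 = 2^BTREE_FREELIST_SIZE_EXP; a node is 252 bytes; and the root pointer lives at byte 32 + 2040*28 = 57152. The mirrored definitions below are only a sanity check:

```c
#include <assert.h>

/* Mirrors of the btree.h constants, for a numeric sanity check. */
enum {
    MAX_KEYS = 7,
    HASHED_KEY_LEN = 16,
    FREELIST_BLOCK_ITEMS = 252,
    FREELIST_COUNT = 28,
    FREELIST_BLOCK_SIZE = (8*3) + (8*FREELIST_BLOCK_ITEMS),
    NODE_SIZE = 4*4 + MAX_KEYS*HASHED_KEY_LEN + ((MAX_KEYS*2)+1)*8 + 4,
    HDR_ROOTPTR_POS = 32 + (FREELIST_BLOCK_SIZE*FREELIST_COUNT)
};
```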
+/* ------------------------------ VFS Layer --------------------------------- */
+
+struct btree_vfs {
+ void *(*open) (char *path, int flags);
+ void (*close) (void *vfs_handle);
+ ssize_t (*pread) (void *vfs_handle, void *buf, uint32_t nbytes,
+ uint64_t offset);
+ ssize_t (*pwrite) (void *vfs_handle, const void *buf, uint32_t nbytes,
+ uint64_t offset);
+ int (*resize) (void *vfs_handle, uint64_t length);
+ int (*getsize) (void *vfs_handle, uint64_t *size);
+ void (*sync) (void *vfs_handle);
+};
+
+extern struct btree_vfs bvfs_unistd;
+
+/* ------------------------------ ALLOCATOR --------------------------------- */
+
+struct btree_freelist {
+ uint32_t numblocks; /* number of freelist blocks */
+ uint64_t *blocks; /* blocks offsets. last is block[numblocks-1] */
+ uint32_t last_items; /* number of items in the last block */
+ uint64_t last_block[BTREE_FREELIST_BLOCK_ITEMS]; /* last block cached */
+};
+
+/* -------------------------------- BTREE ----------------------------------- */
+
+#define BTREE_FLAG_NOFLAG 0
+#define BTREE_FLAG_USE_WRITE_BARRIER 1
+
+/* This is our btree object, returned to the client when the btree is
+ * opened, and used as first argument for all the btree API. */
+struct btree {
+ struct btree_vfs *vfs; /* Our VFS API */
+ void *vfs_handle; /* The open VFS resource */
+    /* Our free lists, from 16 bytes to 2 gigabytes, so freelist[0] is for
+     * size 16, and freelist[BTREE_FREELIST_COUNT-1] is for 2GB. */
+ struct btree_freelist freelist[BTREE_FREELIST_COUNT];
+ /* We pre-allocate free space at the end of the file, as a room for
+ * the allocator. Amount and location of free space is handled
+ * by the following fields: */
+ uint64_t free; /* Amount of free space starting at freeoff */
+ uint64_t freeoff; /* Offset where free space starts */
+ uint64_t rootptr; /* Root node pointer */
+ uint32_t mark; /* This incremental number is used for
+ nodes start/end mark to detect corruptions. */
+ int flags; /* BTREE_FLAG_* */
+};
+
+/* In-memory representation of a btree node. We manipulate this in-memory
+ * representation in order to avoid dealing with too many disk operations
+ * and complexities. Once a node has been modified it can be written back
+ * to disk using btree_write_node. */
+struct btree_node {
+ uint32_t numkeys;
+ uint32_t isleaf;
+ char keys[BTREE_HASHED_KEY_LEN*BTREE_MAX_KEYS];
+ uint64_t values[BTREE_MAX_KEYS];
+ uint64_t children[BTREE_MAX_KEYS+1];
+};
+
+/* ---------------------------- EXPORTED API ------------------------------- */
+
+/* Btree */
+struct btree *btree_open(struct btree_vfs *vfs, char *path, int flags);
+void btree_close(struct btree *bt);
+void btree_set_flags(struct btree *bt, int flags);
+void btree_clear_flags(struct btree *bt, int flags);
+int btree_add(struct btree *bt, unsigned char *key, unsigned char *val, size_t vlen);
+void btree_walk(struct btree *bt, uint64_t nodeptr);
+
+/* On disk allocator */
+uint64_t btree_alloc(struct btree *bt, uint32_t size);
+int btree_free(struct btree *bt, uint64_t ptr);
+int btree_alloc_size(struct btree *bt, uint32_t *size, uint64_t ptr);
73 btree_example.c
@@ -0,0 +1,73 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "btree.h"
+
+#define OP_ALLOC 0
+#define OP_FREE 1
+#define OP_ALLOCFREE 2
+#define OP_ADD 3
+#define OP_WALK 4
+
+int main(int argc, char **argv) {
+ struct btree *bt;
+ uint64_t ptr;
+ int j, count, op, arg;
+
+ if (argc != 4) {
+        fprintf(stderr,"Usage: btree_example <op> <size|ptr|key> <count|val>\n");
+ exit(1);
+ }
+ count = atoi(argv[3]);
+ arg = atoi(argv[2]);
+ if (!strcasecmp(argv[1],"alloc")) {
+ op = OP_ALLOC;
+ } else if (!strcasecmp(argv[1],"free")) {
+ op = OP_FREE;
+ } else if (!strcasecmp(argv[1],"allocfree")) {
+ op = OP_ALLOCFREE;
+ } else if (!strcasecmp(argv[1],"add")) {
+ op = OP_ADD;
+ } else if (!strcasecmp(argv[1],"walk")) {
+ op = OP_WALK;
+ } else {
+ printf("not supported op %s\n", argv[1]);
+ exit(1);
+ }
+
+    bt = btree_open(NULL, "./btree.db", BTREE_CREAT);
+    if (bt == NULL) {
+        perror("btree_open");
+        exit(1);
+    }
+    btree_clear_flags(bt,BTREE_FLAG_USE_WRITE_BARRIER);
+
+ for (j = 0; j < count; j++) {
+ if (op == OP_ALLOC) {
+ ptr = btree_alloc(bt,arg);
+ printf("PTR: %llu\n", ptr);
+ } else if (op == OP_FREE) {
+ btree_free(bt,arg);
+ } else if (op == OP_ALLOCFREE) {
+ ptr = btree_alloc(bt,arg);
+ printf("PTR: %llu\n", ptr);
+ btree_free(bt,ptr);
+ }
+ }
+ if (op == OP_ADD) {
+ int retval;
+ char key[16];
+ memset(key,0,16);
+ strcpy(key,argv[2]);
+
+ retval = btree_add(bt,(unsigned char*)key,
+ (unsigned char*)argv[3],strlen(argv[3]));
+ printf("retval %d\n", retval);
+ } else if (op == OP_WALK) {
+ btree_walk(bt,bt->rootptr);
+ }
+
+ btree_close(bt);
+ return 0;
+}