This repository has been archived by the owner on Mar 9, 2019. It is now read-only.

Configurable page size by db or bucket #114

Closed
pkieltyka opened this issue Apr 2, 2014 · 10 comments

Comments

@pkieltyka

I'm not sure if this is a good idea; it's more of a question than an issue. I'm planning to use boltdb as a storage engine for a local database of images that vary in size from 30KB to 500KB (mostly). The database will be between 5 and 30 GB.

I spent some time reading the boltdb code today, which is super clean, but I'm still learning it. The general data structure in memory is:

bucket
  * node (key []byte, value which is a bunch of inodes)
      * inodes (key []byte, value []byte where value is based on page size)

Is that right? So a page represents a chunk of a value. The page size of each chunk is 64KB (based on os.Getpagesize() on my system in the DB init() func). Is that the same structure as on disk (when committed)?
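
For reference, this is the check I'm basing that 64KB number on (just the standard library, nothing Bolt-specific):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Bolt picks up the OS page size when it initializes a new database
	// file; this just prints what the runtime reports on this machine.
	fmt.Println(os.Getpagesize())
}
```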

I need to do some testing by just playing with the code and seeing if it would work, but could the pageSize be made configurable when opening the db, or when creating a bucket (so each bucket could have its own page size)?

The reason is that I'm hoping for performance gains when you know the average size of the data being stored: there would be fewer computations to chunk and rebuild the data, and you'd trade extra disk space (for unused segments) for performance.

Btw, if the data structure above is correct, why does each inode need to repeat the "key" from the node? It feels like, for longer keys, that would be unnecessary extra overhead.

@benbjohnson
Member

@pkieltyka I need to write out an in-depth "Internals" section on the README. It's a little confusing how the types work, but that's mostly because there are types that represent on-disk data and there are types that represent in-memory (not yet written to disk) data. The tricky part is that sometimes they overlap.

So a bucket is essentially a pointer to its root, which is either a leaf or a branch page. If you're working with read-only transactions, everything is a page and the cursor traverses page elements; everything lives in the mmap. However, if you put/delete data, the pages that change get materialized into nodes. A node has inodes, which are analogous to page elements but are the uncommitted, in-memory version. When a node is first materialized from a page, every inode is simply a pointer into the mmap. Once key/values are inserted, though, those inode.key and inode.value values point to in-memory byte arrays. On commit, all of it gets serialized back into pages and written through the file descriptor.
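
Very roughly, the two layers look like this (simplified, illustrative types only; not Bolt's actual definitions):

```go
// On-disk layer: a fixed-size page header followed by packed elements.
// (Illustrative only; Bolt's real page struct differs in detail.)
type page struct {
	id       uint64 // page id within the file
	flags    uint16 // branch, leaf, meta, freelist, ...
	count    uint16 // number of elements on the page
	overflow uint32 // how many extra contiguous pages this one spills into
	// element headers plus keys/values follow in the mmap'd bytes
}

// In-memory layer: the mutable, not-yet-committed version of a page.
type node struct {
	isLeaf bool
	inodes []inode // analogous to page elements
}

// An inode either points into the mmap (unmodified data) or at freshly
// allocated byte slices (data written during this transaction).
type inode struct {
	key   []byte
	value []byte
}
```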

I added an issue for the "Internals" docs (#115).

As for the page size, it should be easy to open it up for configuration. It has to be at the DB level, though. However, since you're already at 64KB pages, I'm not sure it'll help you. Values always get written contiguously and use overflow pages, so your 500KB values aren't getting split up.

I have to run right now but I can get into the details more later.

@benbjohnson
Member

@pkieltyka By the way, what kind of machine are you running os.Getpagesize() on? I get 4096 on my local Mac and my DigitalOcean box but I get 64kb on play.golang.org.

As I was saying yesterday about page size, Bolt always puts at least 2 keys per page -- regardless of the size of their data. If you have two keys on a page that have data totaling 1MB then Bolt will find 16 contiguous pages (1MB / 64KB = 16) and write the keys and data to those pages. On the first of those pages it'll set page.overflow to 15 to let Bolt know to reclaim that page plus the next fifteen pages when it gets reclaimed in the future.
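
As a rough sketch of that math (plain arithmetic, ignoring per-page headers and the keys themselves):

```go
// overflowPages returns how many physical pages a value of the given size
// occupies, and how many of those are recorded in page.overflow on the
// first page. Approximate: headers and key overhead are ignored.
func overflowPages(valueSize, pageSize int) (pages, overflow int) {
	pages = (valueSize + pageSize - 1) / pageSize // round up
	return pages, pages - 1
}

// overflowPages(1<<20, 64*1024) -> 16 pages, overflow = 15
// overflowPages(1<<20, 4*1024)  -> 256 pages, overflow = 255
```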

So since Bolt will automatically use extra pages when necessary, I don't think increasing the page size will necessarily help you. The only benefit you really get from a larger page size is less overhead for tracking pages and a shallower b+tree. Although, at 64KB pages, the overhead is already trivial.

@pkieltyka
Author

Hey @benbjohnson, actually I just tested it on play.golang.org and assumed it was just a 64-bit Linux machine. My Mac actually returns a page size of 4096 and so does a c3.large EC2 instance (64-bit). I wonder what play.golang.org is running.

Thanks for the breakdown. Just to get a higher-level picture: is a page the smallest representation that makes up the btree? It can take the form of a branch, leaf, meta, bucket, or freelist. It's the unit that maps the structure of the entire bolt storage format? I guess I see why the pageSize isn't 64KB by default, as there would be a lot of wasted space to represent a basic bucketPage root, metaPage or freelistPage. But the data values would be split into fewer chunks. What about putting the data values (objects) in a Superpage or something then? (If I'm following correctly.)

Also, does the entire btree structure (not the data values) need to fit into memory? Or does mmap relax that?

Btw, are the 2 keys per page you mentioned the same key that is used to identify the unique value in a bucket when doing a bucket.Put()? If so, why are 2 keys necessary per page? Wouldn't there be a more efficient way to link the pages together than using the long, random-sized key byte array? So for a 1MB data value, it would be 1MB / 4096 = 256 pages at least. Would representing that require just 2 keys with the overflow set to 256+ (the actual number of pages), or would it be a key per page?

And as an aside, the MaxKeySize is 32768? I'm not sure how that would fit anymore; I figured the MaxKeySize would be < pageSize.

Btw, I'll try hardcoding db.pageSize to another value in the bolt code and run some benchmarks. I'll be traveling this afternoon to visit a buddy in Barcelona but I'll spend more time reading the bolt code on the plane. Thanks again for being so available to answer my newb questions.

@benbjohnson
Member

@pkieltyka Sure, I'm happy to answer any questions. Part of the goal of Bolt is to be cleanly written and educational. Too many low-level databases are cryptic and difficult to understand.

So, to answer your questions:

Is a page the smallest representation that makes up the btree?

A "page" can really mean two things. First, it's a basic unit of storage -- the whole database is divided into these fixed sized blocks. Second, the term also represents logical pages which can span over multiple of these fixed-sized blocks using the page's "overflow" property. In hindsight, I should have used separate terms for physical "blocks" and logical "pages". I added an issue to rename the physical pages to "blocks" (#116).

It can take the form of a branch, leaf, meta, bucket, or freelist.

Yes, logical pages can take those forms. Physical pages (i.e. blocks) can also just hold overflow from any one of those types, so they don't necessarily have a type themselves (i.e. their page header would just contain meaningless data).

I guess I see why the pageSize isn't 64KB by default.

Your smallest database for 64KB pages would be 256KB (2 meta pages, a buckets page, and a freelist page). There would be a lot of empty space in there but once you start adding data to your leafs and branches then it would fill up pretty well.

Every commit has to write a meta page, a buckets page, and a freelist page so you incur an extra 180KB of write IO for every commit by having those large meta/buckets/freelist pages (192KB for 64KB blocks vs 12KB for 4KB blocks).
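
As a quick sanity check on those numbers (just arithmetic, matching the figures above):

```go
package main

import "fmt"

func main() {
	// A meta page, a buckets page, and a freelist page are rewritten on
	// every commit, as described above.
	const metaPagesPerCommit = 3
	for _, pageSize := range []int{4 * 1024, 64 * 1024} {
		fmt.Printf("%2d KB blocks -> %3d KB of metadata per commit\n",
			pageSize/1024, metaPagesPerCommit*pageSize/1024)
	}
	// Prints 12 KB per commit for 4 KB blocks, 192 KB for 64 KB blocks.
}
```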

But the data values would be split into fewer chunks. What about putting the data values (objects) in a Superpage or something then?

The individual values are not split since they will span across overflow pages. Moving them to a separate large block section doesn't make a lot of sense since you'll have to rewrite that large section when the values change. The large overflow leafs should actually work pretty well.

Also, does the entire btree structure (not the data values) need to fit into memory? Or does mmap relax that?

There should be no memory restrictions. The OS will page data in and out of memory automatically since everything goes through the mmap. The only things that need to be in memory are the DB, Tx, Bucket, and Cursor (and the cursor's internal stack), but those shouldn't be larger than a couple hundred bytes.
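
For context, here's roughly how those types show up in normal use (a minimal sketch; the exact calls may differ from whatever bolt version you're on):

```go
package main

import (
	"fmt"
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	// DB: the mmap'd file plus a little bookkeeping.
	db, err := bolt.Open("images.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Tx + Bucket: a read-write transaction materializes nodes for the
	// pages it touches.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("images"))
		if err != nil {
			return err
		}
		return b.Put([]byte("cat.jpg"), []byte("...image bytes..."))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Cursor: a read-only transaction just walks page elements in the mmap.
	err = db.View(func(tx *bolt.Tx) error {
		c := tx.Bucket([]byte("images")).Cursor()
		for k, v := c.First(); k != nil; k, v = c.Next() {
			fmt.Printf("%s -> %d bytes\n", k, len(v))
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```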

Btw, are the 2 keys per page you mentioned the same key that is used to identify the unique value in a bucket when doing a bucket.Put()? If so, why are 2 keys necessary per page?

The two-keys-per-page rule is a carryover from LMDB. It's required on the branch pages since the purpose of a branch page is to split the tree into 2 separate subpages. I can't remember off the top of my head if leaf pages also have the same restriction.

The key in the page is the same as the key in bucket.Put(). The pages have 64-bit integer identifiers but the key lets the cursor traverse the b+tree to find the data. The branch pages store the lowest key for each subpage and the page id. The leaf pages store the key and value pairs. That lets you do a binary search at each level to eventually find the leaf element.
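
Conceptually, picking the child at each branch level is just a binary search over the sorted element keys; a simplified sketch (not the actual cursor code):

```go
package main

import (
	"bytes"
	"sort"
)

// branchElem pairs the lowest key of a subtree with that subtree's page id.
type branchElem struct {
	key    []byte
	pageID uint64
}

// childFor returns the page id of the child to descend into: the last
// element whose lowest key is <= the search key (elems are sorted by key
// and assumed non-empty).
func childFor(elems []branchElem, key []byte) uint64 {
	i := sort.Search(len(elems), func(i int) bool {
		return bytes.Compare(elems[i].key, key) > 0
	})
	if i > 0 {
		i--
	}
	return elems[i].pageID
}
```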

I'd definitely advise against large key sizes. You can do it if you have to but some of those keys get duplicated in the branch pages. If you use 64-bit integers mapped onto a []byte then you can fit ~170 elements in a single branch page (after you account for branchPageElement overhead). That means that with a single branch level you can store 170 leaf pages but at two branch levels you can store 28,900 leafs:

branch -> 170 leafs                        170 leafs
branch -> 170 branches -> 170 leafs      28900 leafs (170 * 170)
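
Or as a tiny helper, assuming the ~170-element fan-out above:

```go
// leafsAt returns how many leaf pages are reachable with the given number
// of branch levels, for a branch fan-out of `fanout`.
func leafsAt(fanout, branchLevels int) int {
	n := 1
	for i := 0; i < branchLevels; i++ {
		n *= fanout
	}
	return n
}

// leafsAt(170, 1) -> 170
// leafsAt(170, 2) -> 28900
```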

As an aside, the MaxKeySize is 32768? I'm not sure how that would fit anymore; I figured the MaxKeySize would be < pageSize.

The MaxKeySize was actually just a guess when I first started the project. Since branches and leafs can both have overflow pages there's really not much of a limit to key size. The pageElement types use uint32 for the position, key, and value sizes so theoretically there's probably a key size limit of around 2GB once you account for maxKeysPerPage. I need to actually figure out the real max sizes. I added an issue for this (#117).

I'll try hardcoding db.pageSize to another value in the bolt code and run some benchmarks.

That'd be awesome. There's an issue with the current benchmarks because of a bulk load bug (#94) so it can be really slow. It may make sense to benchmark after that's fixed.

I'm going to be working on performance and optimization a lot in the next two weeks as we're looking to do some production-level testing against Bolt at Shopify soon.

@pkieltyka
Author

Thanks! I will read this some more alongside the code. I'm planning to hack on an LRU mechanism for Bolt so it can keep the db at a capped size.

And, sweet I didn't know you were at Shopify, cool! In the Ottawa office?

@benbjohnson
Member

@pkieltyka An LRU is a good idea in front of Bolt. I've used a version of @bradfitz' LRU from groupcache before and it worked well. It was a good starting point for me.
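
For what it's worth, a minimal sketch of what that can look like in front of Bolt reads (the "images" bucket name is just a placeholder, and note that lru.Cache isn't goroutine-safe on its own):

```go
package main

import (
	"sync"

	"github.com/boltdb/bolt"
	"github.com/golang/groupcache/lru"
)

// imageCache fronts Bolt reads with an in-memory LRU. lru.Cache is not
// goroutine-safe, so it's wrapped with a mutex here.
type imageCache struct {
	mu    sync.Mutex
	cache *lru.Cache
	db    *bolt.DB
}

func (c *imageCache) get(key []byte) ([]byte, error) {
	c.mu.Lock()
	if v, ok := c.cache.Get(string(key)); ok {
		c.mu.Unlock()
		return v.([]byte), nil
	}
	c.mu.Unlock()

	var out []byte
	err := c.db.View(func(tx *bolt.Tx) error {
		if b := tx.Bucket([]byte("images")); b != nil { // placeholder bucket name
			if v := b.Get(key); v != nil {
				// Copy out of the mmap: the returned slice is only valid
				// for the life of the transaction.
				out = append([]byte(nil), v...)
			}
		}
		return nil
	})
	if err == nil && out != nil {
		c.mu.Lock()
		c.cache.Add(string(key), out)
		c.mu.Unlock()
	}
	return out, err
}
```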

Shopify sponsors my work on SkyDB, which they use for all their customer-facing analytics. SkyDB currently uses LMDB but some of the limitations are becoming a real problem. I work remotely out of Denver but I fly out to Ottawa occasionally. I'll next be out there the week of April 14th if you're around. It's only a 4-hour drive from Toronto. :)

@pkieltyka
Author

Yeah, I was planning to build the LRU in front as well; thanks for the tip on groupcache/lru.

Sky looks cool, I will definitely take a closer look! Hmm, what kinds of limitations? I can imagine hardware fault tolerance could be an issue since it's based on an mmapped file, and if something isn't committed/written then it's lost? As well, I don't know enough about mmapped files, but they feel like an untamable beast that can just chew away at memory. I could be wrong here, but in my playing around with MongoDB, which relies on mmapped files, it gets dramatically slower when its working set + indexes can't fit in memory.

April 14th is the first day I make it back from my trip which I'm leaving on in a few hours lol... but that'd be cool!

@pkieltyka
Author

Btw, there's no reason you can't fork off with boltdb and change the design from LMDB as you see fit. Perhaps LMDB made design decisions for its kind of data, and it's not like you need bolt to read the contents of another LMDB-made file? It could just be LMDB-inspired. :)

@benbjohnson
Member

@pkieltyka There are a handful of limitations that range from annoying to downright problematic:

  1. Max key size - This is our biggest issue. We use the DUPSORT feature in LMDB so that we can store an event series for a given object. The problem with DUPSORT is that your values are actually stored as keys, so your individual values are restricted to 511 bytes. Most events are small, but sometimes we go over that limit, so we've had to convert fields into R-style factors so they fit into this 511-byte space. However, that causes a ton of extra lookups when we need to defactorize values. 60% of our insert time is spent managing these factors.
  2. DB/Reader/mmap limits - This is mostly annoying. You have to specify the upper bound on how many DBs and readers you can have, as well as the upper bound on the mmap size. You can make your mmap size large, but each one of Sky's tables is a separate mdb_env, so if you have a bunch of tables then you hit a limit on Linux's virtual memory per process. A lot of this stems from the fact that multiple processes can share a database, so they have to synchronize their reader tables through a separate file. Bolt only works in a single process so it can control a lot of this better. For example, when the mmap needs to grow it simply waits for all read transactions to finish and does a munmap()/mmap() transparently behind the scenes.
  3. Readers not closing appropriately - We hit an issue every once in a great while where a reader transaction is not closed properly, and it causes the database to grow wildly until it runs out of space. We've audited the Sky code several times and can't find anywhere that transactions aren't closed. This brings up a bigger issue, though: we have Sky, gomdb, and LMDB to search through to find this one rare bug. LMDB is 8KLOC of C, so it's not exactly easy to track down issues. Bolt is up to 2KLOC now and there's no intermediate layer (gomdb) to interface through.
  4. The intermediate gomdb layer does a memcopy of every key and value going between Go and LMDB. This could probably be fixed but it would make the interface really ugly. There's no translation layer with Bolt, so all the reads are zero copy.

As far as mmaps go, I'm not sure what the issue is with MongoDB. We've had good luck with the performance so far in LMDB. It's been faster than LevelDB.

Bolt started off as a port of LMDB simply because Howard Chu is a smart guy and knows his low-level stuff, so I wanted to learn from that. Bolt has definitely forked off onto its own path, though. It doesn't aim for compatibility with LMDB and it takes a different approach in several areas. There are still some pieces left over from LMDB that need to be pulled out, but they come out as I come across them.

Have fun on your trip!

@benbjohnson
Member

I haven't seen a strong need for this so I'm going to close the issue.
