Configurable page size by db or bucket #114
@pkieltyka I need to write out an in-depth "Internals" section in the README. It's a little confusing how the types work, but that's mostly because some types represent on-disk data and some represent in-memory (not yet written to disk) data. The tricky part is that sometimes they overlap. So I added an issue for the "Internals" docs (#115). As for the page size, it should be easy to open that up. It has to be at the DB level though. However, since you're already at 64KB pages, I'm not sure it'll help you. Values are always written contiguously and use overflow pages, so your 500KB values aren't getting split up. I have to run right now but I can get into the details more later.
@pkieltyka By the way, what kind of machine are you running that reports a 64KB page size? As I was saying yesterday about page size, Bolt always puts at least 2 keys per page, regardless of the size of their data. If you have two keys on a page whose data totals 1MB, then Bolt will find 16 contiguous pages (16 × 64KB = 1MB) and write them out as one logical page. So since Bolt will automatically use extra pages when necessary, I don't think increasing the page size will necessarily help you. The only benefit you really get from a larger page size is less overhead for tracking pages and a shallower B+tree. Although, at 64KB pages, the overhead is already trivial.
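The overflow arithmetic above can be sketched in a few lines; this is an illustration of the math only, not Bolt's actual allocation code:

```go
package main

import "fmt"

// pagesNeeded returns how many contiguous fixed-size blocks a logical
// page needs to hold dataSize bytes, given a block size of pageSize.
// (Header overhead is ignored for simplicity.)
func pagesNeeded(dataSize, pageSize int) int {
	return (dataSize + pageSize - 1) / pageSize // ceiling division
}

func main() {
	// Two keys totaling 1MB of data on 64KB blocks:
	fmt.Println(pagesNeeded(1<<20, 64<<10)) // 16 contiguous blocks
	// The same data on 4KB blocks:
	fmt.Println(pagesNeeded(1<<20, 4<<10)) // 256 contiguous blocks
}
```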
hey @benbjohnson actually I just tested it on play.golang.org and assumed it was just a Linux 64-bit machine. My Mac actually returns a page size of 4096 and so does the c3.large EC2 instance (64-bit). I wonder what play.golang is running too. Thanks for the breakdown. Just to get a higher-level picture.. is a page always one of those types (meta, freelist, branch, or leaf)? Also, does the entire B+tree structure (not the data values) need to fit into memory? Or does mmap relax that? .. and as an aside, the MaxKeySize is 32768? I'm not sure how that would fit anymore.. I figured MaxKeySize would then have to be < pageSize. Btw, I'll try hardcoding db.pageSize to another value in the bolt code and run some benchmarks. I'll be traveling this afternoon to visit a buddy in Barcelona, but I'll spend more time reading the bolt code on the plane. Thanks again for being so available to answer my newb questions.
@pkieltyka Sure, I'm happy to answer any questions. Part of the goal of Bolt is to be cleanly written and educational. Too many low-level databases are cryptic and difficult to understand. So, to answer your questions:
A "page" can really mean two things. First, it's a basic unit of storage -- the whole database is divided into these fixed-size blocks. Second, the term also represents logical pages, which can span multiple of these fixed-size blocks using the page's "overflow" property. In hindsight, I should have used separate terms for physical "blocks" and logical "pages". I added an issue to rename the physical pages to "blocks" (#116).
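As a rough illustration of the block/page distinction, a block header might carry an overflow count like this (field names loosely modeled on Bolt's `page` struct, but simplified here):

```go
package main

import "fmt"

// page is a simplified sketch of a block header. A logical page that
// spans multiple fixed-size blocks records the extra block count in
// its overflow field; the trailing blocks carry no meaningful header.
type page struct {
	id       uint64 // block position in the file (offset = id * blockSize)
	flags    uint16 // branch, leaf, meta, or freelist
	count    uint16 // number of elements on the page
	overflow uint32 // number of additional contiguous blocks
}

// sizeInBlocks reports how many fixed-size blocks the logical page occupies.
func (p *page) sizeInBlocks() uint32 { return 1 + p.overflow }

func main() {
	// A leaf page whose data spills into 15 extra 64KB blocks (1MB total):
	p := page{id: 7, flags: 0x02, count: 2, overflow: 15}
	fmt.Println(p.sizeInBlocks()) // 16
}
```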
Yes, logical pages can take those forms. Physical pages (i.e. blocks) can also hold the overflow from any one of those types, so they don't necessarily have a type themselves (i.e. their page header would just contain meaningless data).
Your smallest database at 64KB pages would be 256KB (2 meta pages, a buckets page, and a freelist page). There would be a lot of empty space in there, but once you start adding data to your leaves and branches it would fill up pretty well. Every commit has to write a meta page, a buckets page, and a freelist page, so you incur an extra 180KB of write IO on every commit by having those large meta/buckets/freelist pages (192KB for 64KB blocks vs 12KB for 4KB blocks).
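The sizing math above works out like this (a standalone sketch, not Bolt code):

```go
package main

import "fmt"

// commitOverhead returns the fixed write IO per commit: one meta page,
// one buckets page, and one freelist page, each a full block.
func commitOverhead(pageSize int) int { return 3 * pageSize }

// minDBSize returns the size of an empty database: two meta pages,
// a buckets page, and a freelist page.
func minDBSize(pageSize int) int { return 4 * pageSize }

func main() {
	// Smallest database at 64KB pages:
	fmt.Println(minDBSize(64 << 10)) // 262144 bytes = 256KB
	// Extra per-commit write IO of 64KB blocks over 4KB blocks:
	fmt.Println(commitOverhead(64<<10) - commitOverhead(4<<10)) // 184320 bytes = 180KB
}
```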
The individual values are not split, since they span across overflow pages. Moving them to a separate large-block section doesn't make a lot of sense, since you'd have to rewrite that large section whenever the values change. The large overflow leaves should actually work pretty well.
There should be no memory restrictions. The OS will page data in and out of memory automatically since the database uses mmap. The only things that need to be in memory are the DB, Tx, Bucket, and Cursor (and the cursor's internal stack), but those shouldn't be larger than a couple hundred bytes.
The two-keys-per-page rule is a carryover from LMDB. It's required on branch pages, since the purpose of a branch page is to split the tree across at least 2 separate child pages. I can't remember off the top of my head if leaf pages have the same restriction. The key stored in the page is the same as the key stored in the in-memory inode. I'd definitely advise against large key sizes. You can do it if you have to, but some of those keys get duplicated in the branch pages. If you use 64-bit integers mapped onto an 8-byte key, that duplication is negligible.
That'd be awesome. There's an issue with the current benchmarks because of a bulk load bug (#94), so they can be really slow. It may make sense to benchmark after that's fixed. I'm going to be working on performance and optimization a lot in the next two weeks, as we're looking to do some production-level testing against Bolt at Shopify soon.
Thanks! I will read this some more alongside the code. I'm planning to hack on an LRU mechanism for Bolt so it can keep the db at a capped size. And sweet, I didn't know you were at Shopify, cool! In the Ottawa office?
@pkieltyka An LRU in front of Bolt is a good idea. I've used a version of @bradfitz's LRU from groupcache before and it worked well. It was a good starting point for me. Shopify sponsors my work on SkyDB, which they use for all their customer-facing analytics. SkyDB currently uses LMDB, but some of its limitations are becoming a real problem. I work remotely out of Denver but I fly out to Ottawa occasionally. I'll next be out there the week of April 14th if you're around. Only a mere 4-hour drive from Toronto. :)
Yea, I was planning to build the LRU in front as well, thx for the tip on groupcache/lru. Sky looks cool, I will definitely take a closer look too! Hrmm.. what kinds of limitations? I can imagine hardware fault tolerance could be an issue: since it's based on a mmapped file, if something isn't committed/written, then it's lost? As well, I don't know enough about mmapped files, but it feels like an untamable beast which can just chop away at memory.. I could be wrong here, but in my playing around with MongoDB, which relies on mmapped files, it gets exponentially slower when its working set + indexes can't fit in memory. April 14th is the first day I make it back from my trip, which I'm leaving on in a few hours lol... but that'd be cool!
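For reference, a minimal LRU along the lines of groupcache/lru can be built from the standard library alone. The eviction hook here is hypothetical and just marks where a Bolt delete would go:

```go
package main

import (
	"container/list"
	"fmt"
)

// lru is a minimal least-recently-used key index. It could sit in
// front of Bolt to decide which keys to drop when the database grows
// past a size cap; onEvict is where a Bolt delete would go.
type lru struct {
	max     int
	ll      *list.List               // front = most recently used
	items   map[string]*list.Element // key -> list element
	onEvict func(key string)
}

func newLRU(max int, onEvict func(string)) *lru {
	return &lru{max: max, ll: list.New(), items: map[string]*list.Element{}, onEvict: onEvict}
}

// Touch marks key as recently used, evicting the oldest entry if the
// index is over capacity.
func (c *lru) Touch(key string) {
	if el, ok := c.items[key]; ok {
		c.ll.MoveToFront(el)
		return
	}
	c.items[key] = c.ll.PushFront(key)
	if c.ll.Len() > c.max {
		oldest := c.ll.Back()
		c.ll.Remove(oldest)
		k := oldest.Value.(string)
		delete(c.items, k)
		if c.onEvict != nil {
			c.onEvict(k) // e.g. delete the key's image from the Bolt bucket
		}
	}
}

func main() {
	var evicted []string
	c := newLRU(2, func(k string) { evicted = append(evicted, k) })
	c.Touch("a")
	c.Touch("b")
	c.Touch("a") // "a" is now most recent
	c.Touch("c") // over capacity: "b" is evicted
	fmt.Println(evicted) // [b]
}
```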
btw.. no reason you can't fork off with boltdb and change the design from LMDB as you see fit. Perhaps LMDB made design decisions for its kind of data, and it's not like you need bolt to read the contents of another LMDB-made file? ... it could just be lmdb-inspired :)
@pkieltyka There are a handful of limitations that range from annoying to downright problematic.
As far as mmaps go, I'm not sure what the issue is with MongoDB. We've had good luck with the performance of LMDB so far; it's been faster than LevelDB. Bolt started off as a port of LMDB simply because Howard Chu is a smart guy who knows his low-level stuff, and I wanted to learn from that. Bolt has definitely forked off onto its own path though. It doesn't aim for compatibility with LMDB and it takes a different approach in several areas. There are still some pieces left over from LMDB that need to be pulled out, but they'll come out as I come across them. Have fun on your trip!
I haven't seen a strong need for this so I'm going to close the issue. |
I'm not sure if this is a good idea; it's more of a question than an issue. I'm planning to use boltdb as a storage engine for a local database of images that vary in size from 30KB to 500KB (mostly). The database will be between 5 and 30 GB.
I spent some time reading the boltdb code today, which is super clean, but I'm still learning it. The general data structure in memory is:
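The structure listing appears to have been dropped in formatting; as a rough stand-in, the hierarchy being described looks something like this (illustrative field names, not Bolt's exact definitions):

```go
package main

import "fmt"

// inode points to an element inside a page: either a child page (for
// branch nodes) or a key/value pair (for leaf nodes). Note that the
// key is stored here as well as on the node, which is what the later
// question about duplicated keys refers to.
type inode struct {
	pgid  uint64 // child page id (branch elements only)
	key   []byte
	value []byte // leaf elements only
}

// node is an in-memory, deserialized page that has not yet been
// written back to disk.
type node struct {
	isLeaf bool
	key    []byte // first key on the page
	inodes []inode
}

func main() {
	n := node{
		isLeaf: true,
		key:    []byte("img-001"),
		inodes: []inode{{key: []byte("img-001"), value: []byte("...jpeg bytes...")}},
	}
	fmt.Println(n.isLeaf, len(n.inodes)) // true 1
}
```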
Is that right? So a page represents a chunk of a value, and the page size of each chunk is 64KB (based on os.Getpagesize() on my system, in the DB init() func). Is that the same structure as on disk (when committed)?
I need to do some testing by just playing with the code and seeing if it would work, but could the pageSize be configurable when opening the db? Or when creating a bucket (so each bucket could have its own page size)?
The reason is, I'm hoping there would be performance gains if you know the average size of the data to be stored, since there would be fewer computations to chunk / rebuild the data... you forego more disk space (for unused segments) in exchange for performance?
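A per-DB page size option could look something like the following; this API is purely hypothetical and is not something Bolt exposes:

```go
package main

import "fmt"

// Options is a hypothetical open-time configuration struct. Page size
// would have to be fixed per database (not per bucket), since pages
// are the unit of allocation for the whole file.
type Options struct {
	PageSize int // 0 means fall back to the OS page size
}

// resolvePageSize picks the effective page size from the options,
// falling back to the OS-reported page size (e.g. os.Getpagesize()).
func resolvePageSize(opts *Options, osPageSize int) int {
	if opts != nil && opts.PageSize > 0 {
		return opts.PageSize
	}
	return osPageSize
}

func main() {
	fmt.Println(resolvePageSize(&Options{PageSize: 64 << 10}, 4096)) // 65536
	fmt.Println(resolvePageSize(nil, 4096))                          // 4096
}
```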
.. btw, if the data structure above is correct, why does each inode need to repeat the "key" from the node? It feels like, for longer keys, that would be unnecessary extra overhead.