
Compaction #89

Closed
pkieltyka opened this issue Mar 24, 2014 · 26 comments · Fixed by #90

@pkieltyka

Hello,

I was just wondering what Bolt is planning for database compaction. I'm not sure if this is something designed into LMDB, but I was just running some tests: putting a lot of data into a boltdb, removing the keys, then adding some more (some with the same names, some new), and the database just kept growing.

Btw, great work on Bolt! It's very clean, and I'm happy to see it so active.

As a comparison, leveldb does compaction in a background thread that gets triggered at some event (not sure at which point).
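
Roughly the shape of my test, sketched against today's Bolt API (the bucket name, key names, and sizes here are just placeholders, not my exact code):

    package main

    import (
        "fmt"
        "log"
        "os"

        "github.com/boltdb/bolt"
    )

    func main() {
        db, err := bolt.Open("test.db", 0600, nil)
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        bucket := []byte("images")
        value := make([]byte, 50*1024) // ~50KB dummy payload

        // Fill the bucket with a batch of keys.
        if err := db.Update(func(tx *bolt.Tx) error {
            b, err := tx.CreateBucketIfNotExists(bucket)
            if err != nil {
                return err
            }
            for i := 0; i < 1000; i++ {
                if err := b.Put([]byte(fmt.Sprintf("key-%d", i)), value); err != nil {
                    return err
                }
            }
            return nil
        }); err != nil {
            log.Fatal(err)
        }

        // Delete every key again.
        if err := db.Update(func(tx *bolt.Tx) error {
            b := tx.Bucket(bucket)
            for i := 0; i < 1000; i++ {
                if err := b.Delete([]byte(fmt.Sprintf("key-%d", i))); err != nil {
                    return err
                }
            }
            return nil
        }); err != nil {
            log.Fatal(err)
        }

        // The file on disk does not shrink even though the data is gone.
        fi, _ := os.Stat("test.db")
        fmt.Println("size after delete:", fi.Size())
    }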

@benbjohnson
Member

@pkieltyka Bolt uses the same model as LMDB, which is a copy-on-write B+tree. Older versions of the tree are maintained as long as open transactions need them, but they are reclaimed once those transactions close. There's no write-ahead log (WAL) or SSTables like LevelDB uses. Everything is consistent and up to date when it's written. You can add a WAL in front of Bolt to improve write performance or coalesce transactions, but Bolt is mainly geared toward fast read performance.
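
To illustrate with a sketch (not code from Bolt itself): an open read-only transaction pins the version of the tree it started on, and Bolt can't reuse those pages until the transaction closes:

    import "github.com/boltdb/bolt"

    // holdSnapshot shows how an open read-only transaction pins the
    // tree version it started on. Pages freed by later write
    // transactions can't be reused until this tx closes, so the file
    // can keep growing in the meantime.
    func holdSnapshot(db *bolt.DB) error {
        tx, err := db.Begin(false) // false = read-only
        if err != nil {
            return err
        }
        defer tx.Rollback() // closing the tx releases its hold on old pages

        // ... long-running reads against a consistent snapshot go here ...
        return nil
    }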

There's an issue where buckets pages (aka the lists of bucket root nodes) are not getting reclaimed (#82). I suspect that's the issue you're having. I'm going to try to fix it tomorrow.

@pkieltyka
Author

Very cool! Thanks

@benbjohnson
Member

@pkieltyka If you install the bolt CLI then you can run it against your db to check the page information:

$ go get github.com/boltdb/bolt/...
$ bolt pages /path/to/my.db

If you see a bunch of "buckets" lines, then that's issue #82.

@pkieltyka
Author

@benbjohnson
Member

@pkieltyka Yep. That's the same issue. :)

I'll close this issue since it's the same as #82.

@pkieltyka
Author

Sounds good. Thanks, and keep up the awesome work. Btw, have you considered adding a distributed component to Bolt to turn it into a distributed store? Some form of consistent hashing algorithm + goraft?

@benbjohnson
Member

@pkieltyka Actually, can you post the program used to generate the database? Page 181 is concerning. That's a lot of items and a lot of overflow pages.

@benbjohnson
Member

@pkieltyka Thanks! I'm working on a second version of goraft right now so maybe after that I'll try making a distributed k/v store. I've been looking at porting Lucene on top of Bolt to provide full text search and indexing. Not sure what to tackle first. :)

@pkieltyka
Author

Yea, for sure I will. I just have to head out now, but I'll post it in a few hours.

@pkieltyka
Author

And +1 for a distributed store :) .. also, why rewrite goraft though?

@benbjohnson
Member

go-raft is good for small datasets but there are performance issues around allocation and serialization. I've learned a lot about Go perf tuning since I wrote goraft a year ago, and I need something higher performance for SkyDB.

I'm looking to do a simpler design and use read-only mmaps for the log (similar to how Bolt uses a read-only mmap). It's wicked fast. The snapshotting in go-raft is very limited too, so that needs to be fixed as well. There are already a lot of people using go-raft, so upgrading it in a way that breaks the API is not something I want to do.
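
The core idea is just to map the log file read-only and slice entries straight out of memory. A minimal Unix-only sketch (the file name is hypothetical):

    package main

    import (
        "fmt"
        "log"
        "os"
        "syscall"
    )

    func main() {
        // Map the log file read-only, the same way Bolt maps its data file.
        f, err := os.Open("raft.log") // hypothetical log file
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        fi, err := f.Stat()
        if err != nil {
            log.Fatal(err)
        }

        data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
            syscall.PROT_READ, syscall.MAP_SHARED)
        if err != nil {
            log.Fatal(err)
        }
        defer syscall.Munmap(data)

        // Entries can now be sliced straight out of the mapping with
        // no read() syscalls and no buffer copies.
        fmt.Println("mapped", len(data), "bytes")
    }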

I still need to think of a name for the next goraft. raft2?

@benbjohnson
Member

@pkieltyka #90 should fix your growing db issue. Let me know if you still have any problems.

@pkieltyka
Author

@benbjohnson hey, sorry about the delay. I'll give it a test now and paste the page info. Thx.

@pkieltyka
Author

Btw, re: raft, I still have to read the paper to fully grasp all the details. Does it use an HTTP transport, or is that just an option? Also, are you on IRC?

@benbjohnson
Member

Raft doesn't specify a transport. go-raft uses a pluggable transport layer but most people use HTTP as the transport. I'll probably do something similar with raft2.

Here's a (relatively) short visualization of Raft that's helpful if you don't have time for the full paper:

http://thesecretlivesofdata.com/raft/

@pkieltyka
Author

Nice, that is a beautiful visualization.. I'm gonna go through that now.

So, I'm still not seeing the space free up. The pages: https://gist.github.com/pkieltyka/9763425

I've pushed the dummy project here: https://github.com/pkieltyka/seamdb. The project uses a sample of ~200 small images from 10KB to 150KB. When the test suite boots, it repeats the set, reusing the same keys with a number appended, just to increase the size of the db (seamdb_test.go has the testData build code with a 100x repeat).

To run the tests and reproduce the issue:

  1. Change the path to the db in the Open function in bolt.go
  2. go test -run=TestBoltSetInit
  3. go test -run=TestBoltSetGet .. just to validate the data is there..
  4. go test -run=TestBoltSetDel .. I also noticed this is a pretty slow operation

At each point I was just checking the size of the db file, and then finally checking the pages.

@pkieltyka
Author

Btw, I never delete the now-empty bucket. I keep it around since I intend to put more data into it, treating it as a namespace.

@pkieltyka
Author

re: raft -- I looked at the visualization; it's quite logical and very cool. I'm still new to it, but wouldn't it make writes to a cluster extremely slow, since it's strongly consistent across the cluster? .. that could be fine depending on the application.

For something like etcd, yea, that is fast enough. But I don't know how it could be fast enough for influxdb if it's an analytics events db (unless it's far from real-time). Hrmm.. I guess it depends on interpretation too.. the data wouldn't be committed to the cluster until a majority accept it, but the uncommitted data is still accessible from the leader. But then that sucks too cuz your followers are just more like backups now.

And yea, it makes sense that the spec doesn't prescribe a transport. But I'm still not sure why HTTP is a good idea at the low level for communicating data between nodes when you can easily use TCP sockets. An app developer could just use the raft client, open a connection, and have a few functions to check the node state, etc... it would all be abstracted away with the goal of max efficiency. Just some thoughts..

Maybe raft can do it.. but I'm thinking of a completely different distributed data design for an embedded db.. with raft, you'd always have to read/write from the leader, it seems? I suppose you could easily let it read from a follower and maybe proxy writes to the leader.. but still, I'd rather always read/write to the local embedded db, and have the set become eventually consistent. Hrmm.. I wonder if this is how Cassandra is designed.

@benbjohnson
Member

Thanks for the repo link. It'll probably be tomorrow before I can look at it.

You're right that the delete operation is really slow. I've done very limited optimization on Bolt so far. It's mostly been ensuring that everything is consistent and correct and the API works well. I'll get that fixed up though.

Another performance issue is that bulk loading more than 1,000 items at a time is also really slow. The nodes aren't splitting before commit, so there are a lot of large memmove() calls causing the slowness.
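
Until that's fixed, one workaround (my suggestion here, not something built into Bolt) is to break a bulk load into smaller transactions, e.g.:

    import "github.com/boltdb/bolt"

    type item struct{ Key, Value []byte } // hypothetical record type

    // bulkLoad commits in chunks so no single transaction holds more
    // than chunkSize pending items.
    func bulkLoad(db *bolt.DB, bucket []byte, items []item) error {
        const chunkSize = 1000
        for i := 0; i < len(items); i += chunkSize {
            end := i + chunkSize
            if end > len(items) {
                end = len(items)
            }
            chunk := items[i:end]
            if err := db.Update(func(tx *bolt.Tx) error {
                b, err := tx.CreateBucketIfNotExists(bucket)
                if err != nil {
                    return err
                }
                for _, it := range chunk {
                    if err := b.Put(it.Key, it.Value); err != nil {
                        return err
                    }
                }
                return nil
            }); err != nil {
                return err
            }
        }
        return nil
    }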

I added two issues to track those performance problems: #93 & #94

@benbjohnson
Member

Wouldn't it make writes to a cluster extremely slow, since it's strongly consistent across the cluster?

The nice thing about Raft is that it batches up log entries so it can still theoretically achieve high throughput but there's additional latency. Latencies are in the 10s to 100s of milliseconds range.

But then that sucks too cuz your followers are just more like backups now.

It mostly depends on your data. If you can accept stale reads then you can read from the followers. If you need consistent reads then you have to read from the leader. It's definitely a high cost but sometimes you really need it.

But I'm still not sure why HTTP is a good idea at the low level for communicating data between nodes when you can easily use TCP sockets.

If an application already has an HTTP server embedded in it then go-raft can overlay its routes on there and then you only need to open up that single HTTP port. If you're running over TCP then you'd have to open a separate port in your application (and your firewall) and it's just more configuration overhead.
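
For example, the overlay looks something like this (route names and handlers here are made up for illustration; they're not go-raft's actual routes):

    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    // Stub handlers, purely for illustration.
    func appHandler(w http.ResponseWriter, r *http.Request)        { fmt.Fprintln(w, "app") }
    func raftAppendEntries(w http.ResponseWriter, r *http.Request) { /* raft RPC */ }
    func raftRequestVote(w http.ResponseWriter, r *http.Request)   { /* raft RPC */ }

    func main() {
        // One mux, one port: the app's routes and the raft routes coexist.
        mux := http.NewServeMux()
        mux.HandleFunc("/db/", appHandler)
        mux.HandleFunc("/raft/appendEntries", raftAppendEntries)
        mux.HandleFunc("/raft/requestVote", raftRequestVote)
        log.Fatal(http.ListenAndServe(":4001", mux))
    }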

Another benefit to HTTP is that you can implement Raft over TLS. Raft doesn't provide any safety against Byzantine faults (e.g. malformed messages, malicious messages), so TLS can provide encryption and authentication.

I'd rather always read/write to the local embedded db, and have the set become eventually consistent.

Eventual consistency is a whole different approach you can go with but it has its own drawbacks and challenges. Building against a strongly consistent data store is a lot easier than understanding things like vector clocks, last-write-wins, and other eventually consistent concepts. However, eventually consistent systems can scale larger and have higher availability since they don't require a quorum. It just depends on your use case.

@pkieltyka
Author

Cool! This is all so interesting. Btw, have a look at https://github.com/inconshreveable/muxado .. it's written by the same guy who made ngrok. It's a TCP multiplexing transport designed similarly to HTTP/2.... so there's your single port :)

In that case, for boltdb.. I like that it's just an embedded db, and implementing raft2 or other distribution schemes could be kept completely separate anyway but easily overlaid.. (perhaps you were thinking like this anyway, but it just occurred to me).

@kardianos
Contributor

The distributed system I hope to build soon-ish is a distributed data store with explicit versions. When a client interacts with the data store, it can require a minimum version. Explicit versions allow clients to know when data has changed before they update it, and they can also provide a cross between the scalability of an eventually consistent store and the ease of use of a strongly consistent data store.

You would use it as follows:

For consistent business data: the client reads data with version "A" from the data store, modifies the data, and writes it back, requiring that the data store still be at version "A" at the time of the write.

For consistent data reads: the client writes data and gets back version "32". The client then reads data and requires the version to be 32 or newer.

Has this been done before?
-Daniel


@benbjohnson
Member

@pkieltyka Yeah, distributed systems are a whole different beast. So many little nuances. I haven't seen muxado before; that looks cool. They mention something about performance, so I'd be interested to try some benchmarks at some point.

Layerable systems (e.g. bolt + raft + ...) are what I'm going for. Most systems tend to be broken up into individual servers such as MySQL + Elasticsearch + Redis, but I've always been disappointed that these servers aren't also available as libraries. There are a lot of use cases where you can get significantly improved performance by combining some of these things into a single process. It can also be a lot easier to reason about and deploy when it's a single process.


@kardianos The versioning you're doing is basically CAS (compare-and-swap). ZooKeeper and etcd both have distributed CAS in this same way, but they're backed by a consensus protocol (ZAB and Raft, respectively). Versioning alone doesn't help you distribute the data across multiple nodes. If you allow writes to multiple nodes, then they can check their local version before committing but still have to resolve merges between nodes. If you're writing to a single node, then you still have to handle failover and determine when something is safely committed in case of a node failure.
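
The single-node version of that check looks something like this sketch (all names here are hypothetical; none of the hard distributed parts such as replication, failover, and commit are shown):

    import (
        "errors"
        "sync"
    )

    type Entry struct {
        Value   []byte
        Version uint64
    }

    type Store struct {
        mu   sync.Mutex
        data map[string]Entry
    }

    var ErrVersionConflict = errors.New("version conflict")

    // PutIfVersion writes only if the stored version still matches the
    // version the client originally read (compare-and-swap). A missing
    // key has version 0, so a first write passes expected=0.
    func (s *Store) PutIfVersion(key string, value []byte, expected uint64) error {
        s.mu.Lock()
        defer s.mu.Unlock()
        if s.data == nil {
            s.data = make(map[string]Entry)
        }
        if s.data[key].Version != expected {
            return ErrVersionConflict // another writer got there first
        }
        s.data[key] = Entry{Value: value, Version: expected + 1}
        return nil
    }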

If you distribute the version across multiple nodes, then you have Lamport timestamps, which can help determine causality, but you still end up in situations where you have to resolve conflicts. There are also vector clocks, which track Lamport timestamps per node and improve the granularity of the causality, but you still have conflicts. Riak uses vector clocks.
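
For a feel of the mechanics, here's a toy vector clock sketch (not Riak's implementation):

    // A vector clock maps node ID -> that node's Lamport counter.
    type VectorClock map[string]uint64

    // Merge takes the element-wise max of two clocks.
    func Merge(a, b VectorClock) VectorClock {
        out := VectorClock{}
        for n, c := range a {
            out[n] = c
        }
        for n, c := range b {
            if c > out[n] {
                out[n] = c
            }
        }
        return out
    }

    // Descends reports whether clock a happened at-or-after clock b.
    // If neither Descends(a, b) nor Descends(b, a), the two writes are
    // concurrent and the application has to resolve the conflict itself.
    func Descends(a, b VectorClock) bool {
        for n, c := range b {
            if a[n] < c {
                return false
            }
        }
        return true
    }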

There's also something new called Commutative Replicated Data Types (CRDTs), which allow you to merge conflicts better than last-write-wins (LWW); however, they're still early stage and not all data can be modeled as CRDTs.
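
As a tiny example, a grow-only counter (G-Counter) is one of the simplest CRDTs (toy sketch):

    // A G-Counter: each node increments only its own slot, and merging
    // is a commutative element-wise max, so replicas converge no matter
    // what order updates arrive in.
    type GCounter map[string]uint64

    func (g GCounter) Increment(node string) { g[node]++ }

    func (g GCounter) Value() uint64 {
        var sum uint64
        for _, c := range g {
            sum += c
        }
        return sum
    }

    func (g GCounter) Merge(other GCounter) {
        for n, c := range other {
            if c > g[n] {
                g[n] = c
            }
        }
    }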

benbjohnson reopened this Mar 25, 2014
@pkieltyka
Author

@benbjohnson yea, I definitely like the idea of a layered infrastructure design. I feel Go's unconventional package model helps with this too.

Btw, for bolt, it would be nice to have a boltdb/harness or something that tests various data sets and really pounds it through some regressions, and, once you deem it functionally complete, validates the integrity of all its features. The repo I created above could be a start.

@benbjohnson
Member

@pkieltyka I think an external harness is a good idea. I added two issues:

bolt-bench - Database benchmarker
bolt-fsck - Database consistency checker

Those could probably be bolt bench and bolt fsck in the main CLI too. The harness could probably just be a shell script to combine a series of bolt commands. Thanks for the starting test case. :)

@pkieltyka
Author

Nice!

