
use segregated hmap to boost the freelist allocate and release performance #138

Closed
wants to merge 1 commit

Conversation

WIZARD-CXY (Contributor)

In this PR, I use a segregated bitmap to replace the original approach for allocating and releasing free page ids.
It is much faster than the original version; especially when the db size is large or fragmentation in the db is heavy, we can see up to 1000x better performance.

WIZARD-CXY (Contributor, Author)

@xiang90 ptal

WIZARD-CXY changed the title from "use segregated bitmap to boost the freelist allocate and release performance" to "use segregated hmap to boost the freelist allocate and release performance" on Jan 15, 2019
WIZARD-CXY (Contributor, Author)

@benbjohnson

mitake commented Jan 15, 2019

@WIZARD-CXY Interesting! Could you share how you measure the performance improvement?

}

// allocate returns the starting page id of a contiguous list of pages of a given size.
// If a contiguous block cannot be found then 0 is returned.
func (f *freelist) allocate(txid txid, n int) pgid {
	if len(f.ids) == 0 {
		if n == 0 {
Contributor:

do we really need to add this special case handling?

WIZARD-CXY (Contributor, Author):

Yeah, in the common use case we never request a 0-size page; this is only needed by the allocation unit test:

f.allocate(1, 0)
if x := f.free_count(); x != 2 {
	t.Fatalf("exp=2; got=%v", x)
}

WIZARD-CXY (Contributor, Author) commented Jan 16, 2019

@mitake We use bbolt in etcd. With the original approach, when we have a very large db (~50GB) and do puts (~5000 op/s), we find the db spill time is very long (~8s) and the put latency is incredibly large too. We looked into the code, added some debugging info, and found that freelist allocation is the bottleneck. When the freelist is large, internal fragmentation is very common, and the original allocate algorithm tries very hard to find a starting page id. In my approach we instead use a hash map that groups starting pgids by span size, so allocation is super fast (nearly O(1)), and release is boosted to O(1) as well because the original page-id sort operation is removed.
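To make that concrete, here is a minimal sketch of a size-segregated freelist along the lines described above. Everything here is illustrative: the names (`pidSet`, `freemap`, the method signatures) are assumptions rather than the code in this PR, and span splitting and coalescing of adjacent spans are left out.

```go
type pgid uint64

// pidSet holds the starting page ids of free spans that all have the same size.
type pidSet map[pgid]struct{}

// freemap groups free spans by their size in pages, so a request for n pages
// becomes a map lookup instead of a scan over every free page id.
type freemap struct {
	sizes map[uint64]pidSet // span size -> starting pgids of spans of that size
}

// allocate returns the starting pgid of a free span of exactly n pages,
// or 0 if none is available. An exact-size hit costs one map lookup.
func (f *freemap) allocate(n uint64) pgid {
	if ids, ok := f.sizes[n]; ok {
		for start := range ids {
			delete(ids, start)
			if len(ids) == 0 {
				delete(f.sizes, n)
			}
			return start
		}
	}
	// A full implementation would fall back to splitting a larger span here.
	return 0
}

// release returns a span of n pages starting at start to the freelist.
// There is no global sort: the span is simply added to its size bucket.
func (f *freemap) release(start pgid, n uint64) {
	ids, ok := f.sizes[n]
	if !ok {
		ids = make(pidSet)
		f.sizes[n] = ids
	}
	ids[start] = struct{}{}
	// Merging with adjacent free spans is omitted in this sketch.
}
```

Release becomes O(1) because the sorted-slice insertion and the sort over merged ids disappear; the remaining cost is the bookkeeping needed if adjacent spans should still be coalesced.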

WIZARD-CXY (Contributor, Author)

@mitake boltdb/bolt#640 is the same case.

xiang90 (Contributor) commented Jan 16, 2019

@WIZARD-CXY Let us make this an option instead of entirely replacing the array based freelist implementation.

Probably add an option called FreelistType: (array/seglist)
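
One possible shape for such an option, sketched under the assumption that it would live on the DB options struct; the constant values and field name below are illustrative, not a final API:

```go
// FreelistType selects which freelist implementation the DB uses.
type FreelistType string

const (
	// FreelistArrayType keeps the original sorted-array freelist.
	FreelistArrayType FreelistType = "array"
	// FreelistSeglistType opts in to the size-segregated freelist from this PR.
	FreelistSeglistType FreelistType = "seglist"
)

// Options shows only the field relevant to this sketch; callers would pick
// the freelist implementation when opening the DB.
type Options struct {
	FreelistType FreelistType
}
```

A caller would then opt in with something like `bolt.Open(path, 0600, &bolt.Options{FreelistType: FreelistSeglistType})`, keeping the array type as the default for backward compatibility.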

xiang90 (Contributor) commented Jan 16, 2019

/cc @jpbetz

The original freelist is managed as an array. Both the allocate and release operations can be O(N). After a bulk deletion (in the etcd case, a compaction), a key put can take up to 8 seconds when the freelist size reaches O(1,000,000).

This PR uses a seglist approach to solve the problem, like a traditional memory allocator does.
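
For contrast, a condensed sketch of the array-based allocation being replaced (simplified; the panics and fast-path slice handling of the real code are omitted). Finding n contiguous pages means walking the sorted id slice, which is why a put right after a large compaction can stall:

```go
type pgid uint64

// arrayFreelist is the original representation: one sorted slice of all free page ids.
type arrayFreelist struct {
	ids []pgid
}

// allocate scans the sorted ids for n consecutive pages and returns the
// starting id, or 0 if no contiguous run exists. In the worst case it
// visits every free page id, i.e. O(N) per allocation.
func (f *arrayFreelist) allocate(n int) pgid {
	if len(f.ids) == 0 {
		return 0
	}
	var initial, previd pgid
	for i, id := range f.ids {
		// Reset the run if this id is not contiguous with the previous one.
		if previd == 0 || id-previd != 1 {
			initial = id
		}
		// Found a run of n consecutive ids: remove it and return its start.
		if (id-initial)+1 == pgid(n) {
			f.ids = append(f.ids[:i-n+1], f.ids[i+1:]...)
			return initial
		}
		previd = id
	}
	return 0
}
```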

WIZARD-CXY (Contributor, Author) commented Jan 16, 2019

> @WIZARD-CXY Let us make this an option instead of entirely replacing the array based freelist implementation. Probably add an option called FreelistType: (array/seglist)

working on it

WIZARD-CXY (Contributor, Author)

@mitake @xiang90 @hormes PTAL, I added a commit to add an option for using the new approach.

xiang90 (Contributor) commented Jan 22, 2019

@WIZARD-CXY Can you create a new PR? It can be cleaner.

WIZARD-CXY closed this Jan 22, 2019
mitake commented Jan 22, 2019

@WIZARD-CXY Thanks for sharing the details, and sorry for my delayed reply. It's really interesting. Do you mean you already have an etcd cluster with 50GB of data? And are you seeing availability problems caused by the large snapshot?
I'm asking to understand your use case, and out of curiosity: I heard Alibaba has very large clusters. Is this related to that?

WIZARD-CXY (Contributor, Author)

@mitake Yeah, we store a lot of data in etcd since we have large clusters.
