
etcdserver: initial read index implementation #6212

Merged
merged 1 commit into etcd-io:master on Sep 27, 2016

Conversation

@xiang90 (Contributor) commented Aug 18, 2016

The actual readindex implementation in raft still depends on the clock.

But this is pretty much what I would expect in etcdserver.

@gyuho Can you please run a benchmark for this and see if there is any perf improvement?

@xiang90 (Contributor, Author) commented Aug 18, 2016

On my local machine, there is a 500% improvement.

base s-read:

./benchmark --endpoints=127.0.0.1:2379,127.0.0.1:22379,127.0.0.1:32379 --clients=100 --conns=100 range --total=100000 --consistency=s foo
bench with serializable range
 100000 / 100000 Booooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%7s

Summary:
  Total:        7.4628 secs.
  Slowest:      0.1260 secs.
  Fastest:      0.0002 secs.
  Average:      0.0047 secs.
  Stddev:       0.0050 secs.
  Requests/sec: 13399.8265

Response time histogram:
  0.000 [1]     |
  0.013 [95488] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.025 [3483]  |∎
  0.038 [639]   |
  0.050 [247]   |
  0.063 [119]   |
  0.076 [10]    |
  0.088 [3]     |
  0.101 [6]     |
  0.113 [2]     |
  0.126 [2]     |

Latency distribution:
  10% in 0.0009 secs.
  25% in 0.0020 secs.
  50% in 0.0037 secs.
  75% in 0.0057 secs.
  90% in 0.0086 secs.
  95% in 0.0122 secs.
  99% in 0.0258 secs.

before

./benchmark --endpoints=127.0.0.1:2379,127.0.0.1:22379,127.0.0.1:32379 --clients=100 --conns=100 range --total=100000 foo
bench with linearizable range
 100000 / 100000 Boooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%47s

Summary:
  Total:        47.4401 secs.
  Slowest:      0.1776 secs.
  Fastest:      0.0089 secs.
  Average:      0.0472 secs.
  Stddev:       0.0185 secs.
  Requests/sec: 2107.9199

Response time histogram:
  0.009 [1]     |
  0.026 [3574]  |∎∎∎
  0.043 [40043] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.060 [39865] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.076 [11188] |∎∎∎∎∎∎∎∎∎∎∎
  0.093 [2753]  |∎∎
  0.110 [974]   |
  0.127 [637]   |
  0.144 [740]   |
  0.161 [158]   |
  0.178 [67]    |

Latency distribution:
  10% in 0.0292 secs.
  25% in 0.0323 secs.
  50% in 0.0472 secs.
  75% in 0.0549 secs.
  90% in 0.0656 secs.
  95% in 0.0773 secs.
  99% in 0.1266 secs.

after

./benchmark --endpoints=127.0.0.1:2379,127.0.0.1:22379,127.0.0.1:32379 --clients=100 --conns=100 range --total=100000 foo
bench with linearizable range
 100000 / 100000 Booooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%8s

Summary:
  Total:        8.8687 secs.
  Slowest:      0.1732 secs.
  Fastest:      0.0002 secs.
  Average:      0.0064 secs.
  Stddev:       0.0061 secs.
  Requests/sec: 11275.5991

Response time histogram:
  0.000 [1]     |
  0.017 [96911] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.035 [2552]  |∎
  0.052 [402]   |
  0.069 [67]    |
  0.087 [14]    |
  0.104 [17]    |
  0.121 [1]     |
  0.139 [1]     |
  0.156 [0]     |
  0.173 [34]    |

Latency distribution:
  10% in 0.0016 secs.
  25% in 0.0030 secs.
  50% in 0.0053 secs.
  75% in 0.0080 secs.
  90% in 0.0114 secs.
  95% in 0.0145 secs.
  99% in 0.0272 secs.

@gyuho (Contributor) commented Aug 18, 2016

With patch

$ ./benchmark --endpoints=${HOST_1}:2379,${HOST_2}:2379,${HOST_3}:2379 --clients=100 --conns=100 range --total=100000 --consistency=l foo
bench with linearizable range
 100000 / 100000 Boooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%2s

Summary:
  Total:    2.9057 secs.
  Slowest:  0.0223 secs.
  Fastest:  0.0003 secs.
  Average:  0.0024 secs.
  Stddev:   0.0016 secs.
  Requests/sec: 34414.8033

Response time histogram:
  0.000 [1]     |
  0.003 [63837] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.005 [30936] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.007 [2276]  |∎
  0.009 [1982]  |∎
  0.011 [661]   |
  0.014 [183]   |
  0.016 [41]    |
  0.018 [50]    |
  0.020 [20]    |
  0.022 [13]    |

Latency distribution:
  10% in 0.0011 secs.
  25% in 0.0015 secs.
  50% in 0.0021 secs.
  75% in 0.0028 secs.
  90% in 0.0037 secs.
  95% in 0.0049 secs.
  99% in 0.0091 secs.

Without patch

$ ./benchmark --endpoints=${HOST_1}:2379,${HOST_2}:2379,${HOST_3}:2379 --clients=100 --conns=100 range --total=100000 --consistency=l foo
bench with linearizable range
 100000 / 100000 Boooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%6s

Summary:
  Total:    6.6010 secs.
  Slowest:  0.0257 secs.
  Fastest:  0.0018 secs.
  Average:  0.0065 secs.
  Stddev:   0.0024 secs.
  Requests/sec: 15149.2299

Response time histogram:
  0.002 [1]     |
  0.004 [16913] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.007 [37914] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.009 [30898] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.011 [11044] |∎∎∎∎∎∎∎∎∎∎∎
  0.014 [2263]  |∎∎
  0.016 [690]   |
  0.019 [189]   |
  0.021 [72]    |
  0.023 [12]    |
  0.026 [4]     |

Latency distribution:
  10% in 0.0038 secs.
  25% in 0.0046 secs.
  50% in 0.0062 secs.
  75% in 0.0080 secs.
  90% in 0.0095 secs.
  95% in 0.0106 secs.
  99% in 0.0137 secs.

Great. More than 2x faster!

@xiang90 (Contributor, Author) commented Aug 18, 2016

@gyuho Is this the same setup with https://github.com/coreos/etcd/blob/master/Documentation/op-guide/performance.md?

In the perf doc, we use 1000 clients not 100 clients.

@gyuho (Contributor) commented Aug 18, 2016

@xiang90 It's with slower machines (--custom-cpu=4 --custom-memory=8). I will run more tests with the same environment.

@gyuho (Contributor) commented Aug 18, 2016

For reference, here are new test results with 8-CPU, 16GB-memory machines.

Now linearizable reads are almost the same as serializable reads:

benchmark --endpoints=${HOST_1}:2379,${HOST_2}:2379,${HOST_3}:2379 --conns=100 --clients=1000 range foo --total=100000 --consistency=l

bench with linearizable range
 100000 / 100000 Booooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%0s

Summary:
  Total:    0.9486 secs.
  Slowest:  0.0673 secs.
  Fastest:  0.0004 secs.
  Average:  0.0074 secs.
  Stddev:   0.0054 secs.
  Requests/sec: 105416.6610

Response time histogram:
  0.000 [1]     |
  0.007 [56413] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.014 [33022] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.020 [7868]  |∎∎∎∎∎
  0.027 [1865]  |∎
  0.034 [565]   |
  0.041 [181]   |
  0.047 [52]    |
  0.054 [14]    |
  0.061 [11]    |
  0.067 [8]     |

Latency distribution:
  10% in 0.0021 secs.
  25% in 0.0034 secs.
  50% in 0.0062 secs.
  75% in 0.0099 secs.
  90% in 0.0140 secs.
  95% in 0.0172 secs.
  99% in 0.0262 secs.

bench with linearizable range
 100000 / 100000 Booooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%0s

Summary:
  Total:    0.9435 secs.
  Slowest:  0.0547 secs.
  Fastest:  0.0003 secs.
  Average:  0.0070 secs.
  Stddev:   0.0047 secs.
  Requests/sec: 105991.6876

Response time histogram:
  0.000 [1]     |
  0.006 [50145] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.011 [33038] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.017 [12831] |∎∎∎∎∎∎∎∎∎∎
  0.022 [2853]  |∎∎
  0.028 [869]   |
  0.033 [181]   |
  0.038 [55]    |
  0.044 [21]    |
  0.049 [5]     |
  0.055 [1]     |

Latency distribution:
  10% in 0.0023 secs.
  25% in 0.0035 secs.
  50% in 0.0058 secs.
  75% in 0.0096 secs.
  90% in 0.0132 secs.
  95% in 0.0157 secs.
  99% in 0.0226 secs.

benchmark --endpoints=${HOST_1}:2379,${HOST_2}:2379,${HOST_3}:2379 --conns=100 --clients=1000 range foo --total=100000 --consistency=s

bench with serializable range
 100000 / 100000 Booooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%1s

Summary:
  Total:    1.1777 secs.
  Slowest:  0.2462 secs.
  Fastest:  0.0003 secs.
  Average:  0.0071 secs.
  Stddev:   0.0128 secs.
  Requests/sec: 84911.0008

Response time histogram:
  0.000 [1]     |
  0.025 [98661] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.049 [792]   |
  0.074 [96]    |
  0.099 [93]    |
  0.123 [58]    |
  0.148 [21]    |
  0.172 [25]    |
  0.197 [14]    |
  0.222 [158]   |
  0.246 [81]    |

Latency distribution:
  10% in 0.0012 secs.
  25% in 0.0025 secs.
  50% in 0.0047 secs.
  75% in 0.0092 secs.
  90% in 0.0133 secs.
  95% in 0.0164 secs.
  99% in 0.0283 secs.

bench with serializable range
 100000 / 100000 Booooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%0s

Summary:
  Total:    0.8832 secs.
  Slowest:  0.0445 secs.
  Fastest:  0.0003 secs.
  Average:  0.0059 secs.
  Stddev:   0.0045 secs.
  Requests/sec: 113228.7147

Response time histogram:
  0.000 [1]     |
  0.005 [51281] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.009 [28305] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.014 [14520] |∎∎∎∎∎∎∎∎∎∎∎
  0.018 [3956]  |∎∎∎
  0.022 [1092]  |
  0.027 [576]   |
  0.031 [215]   |
  0.036 [47]    |
  0.040 [5]     |
  0.044 [2]     |

Latency distribution:
  10% in 0.0014 secs.
  25% in 0.0025 secs.
  50% in 0.0046 secs.
  75% in 0.0083 secs.
  90% in 0.0117 secs.
  95% in 0.0142 secs.
  99% in 0.0215 secs.

benchmark --endpoints=${HOST_1}:2379 --conns=1 --clients=1 range foo --total=100000 --consistency=l

bench with linearizable range
 100000 / 100000 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%37s

Summary:
  Total:    37.6047 secs.
  Slowest:  0.0065 secs.
  Fastest:  0.0003 secs.
  Average:  0.0004 secs.
  Stddev:   0.0001 secs.
  Requests/sec: 2659.2406

Response time histogram:
  0.000 [1]     |
  0.001 [99320] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.002 [234]   |
  0.002 [312]   |
  0.003 [108]   |
  0.003 [10]    |
  0.004 [4]     |
  0.005 [6]     |
  0.005 [2]     |
  0.006 [1]     |
  0.007 [2]     |

Latency distribution:
  10% in 0.0003 secs.
  25% in 0.0003 secs.
  50% in 0.0003 secs.
  75% in 0.0004 secs.
  90% in 0.0004 secs.
  95% in 0.0005 secs.
  99% in 0.0006 secs.

ok := make(chan struct{})

select {
case s.readwaitc <- ok:

Contributor:

I'm wary of this pattern. Would it be possible to do something like

func (s *EtcdServer) notifyLinearRead() <-chan struct{} {
    s.readMu.RLock() // lock/field names here are illustrative, as in the original sketch
    defer s.readMu.RUnlock()
    return s.notifyNextLinearRead
}

Contributor Author:

Not really. We need one chan per read.

Contributor Author:

Or do you want the server side to return the chan instead of creating the chan here?

Contributor:

Why does it need a channel per read? The logic looks a lot like a barrier...
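
For context, a minimal, self-contained sketch of the barrier-style alternative being discussed here, with one shared channel per batch of reads. All names (notifier, readNotf, readwait) are illustrative, not code from this PR:

package readwait

import (
    "context"
    "sync"
)

// notifier is a one-shot broadcast: close(c) wakes every waiter at once,
// which is why a single channel can serve a whole batch of reads.
type notifier struct {
    c   chan struct{}
    err error
}

func newNotifier() *notifier { return &notifier{c: make(chan struct{})} }

// notify records the outcome of a ReadIndex round and wakes all waiters.
func (nc *notifier) notify(err error) {
    nc.err = err
    close(nc.c)
}

type server struct {
    mu       sync.RWMutex
    readNotf *notifier     // shared by every read waiting on the next round
    readwait chan struct{} // tells the read loop that readers are waiting
}

func (s *server) linearizableReadNotify(ctx context.Context) error {
    s.mu.RLock()
    nc := s.readNotf // join the current batch
    s.mu.RUnlock()

    // ask the read loop to start a round, unless one is already pending
    select {
    case s.readwait <- struct{}{}:
    default:
    }

    select {
    case <-nc.c: // the batch's ReadIndex round completed
        return nc.err
    case <-ctx.Done():
        return ctx.Err()
    }
}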

@xiang90 (Contributor, Author) commented Aug 19, 2016

@heyitsanthony

Clock-independent version:

./benchmark --endpoints=127.0.0.1:2379,127.0.0.1:22379,127.0.0.1:32379 --clients=100 --conns=100 range --total=100000 foo
bench with linearizable range
 100000 / 100000 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%9s

Summary:
  Total:        9.5157 secs.
  Slowest:      1.0616 secs.
  Fastest:      0.0005 secs.
  Average:      0.0079 secs.
  Stddev:       0.0366 secs.
  Requests/sec: 10508.9396

~10% slower than the clock-dependent one, but acceptable.

If the latency between the leader and followers is high, though, the difference can be significant, since the clock-independent version waits for a quorum of heartbeat responses before serving each batch of reads.

@mitake (Contributor) commented Aug 24, 2016

@xiang90 Does this PR implement the optimization described in section 6.4 of the Raft thesis? If so, I can close my PR (#5912), because it is almost the same as this PR (I noticed this based on the discussion on raft-dev).

@siddontang (Contributor) commented:

Hi @xiang90,
Do you plan to use this to speed up reads and replace quorum reads?

@gyuho (Contributor) commented Aug 24, 2016

@mitake Yeah, this is the etcd implementation of Raft thesis §6.4, "Processing read-only queries more efficiently," p. 72.
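
For readers following the thread: §6.4's read-only optimization, in a rough, self-contained Go sketch (illustrative names, not etcd's raft code). The leader answers reads without appending to the log by recording its commit index, confirming leadership with one heartbeat round, and waiting for the apply loop to catch up:

package readindex

import "errors"

// leaderState stands in for the leader's raft state.
type leaderState struct {
    commitIndex   uint64
    appliedIndex  func() uint64 // current applied index
    quorumAck     func() bool   // one heartbeat round acked by a quorum
    appliedNotify chan struct{} // pulsed as entries are applied
}

var errLostLeadership = errors.New("lost leadership")

// ReadIndex returns an index at or after which local reads are linearizable.
func (l *leaderState) ReadIndex() (uint64, error) {
    idx := l.commitIndex // 1. record the current commit index as the read index
    if !l.quorumAck() {  // 2. confirm leadership with one heartbeat round
        return 0, errLostLeadership
    }
    for l.appliedIndex() < idx { // 3. wait until the state machine applies idx
        <-l.appliedNotify
    }
    return idx, nil // 4. serve the read from local state; no log entry needed
}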

@mitake (Contributor) commented Aug 25, 2016

@gyuho I see, thanks!

@xiang90 force-pushed the readindex branch 6 times, most recently from 8d14c11 to b13a23a, on September 13, 2016 08:38
@xiang90 changed the title from "WIP etcdserver: initial read index implementation" to "etcdserver: initial read index implementation" on Sep 13, 2016

@xiang90 (Contributor, Author) commented Sep 13, 2016

@heyitsanthony All fixed, PTAL. I still need to add backward compatibility for this feature.

result, err := s.processInternalRaftRequest(ctx, pb.InternalRaftRequest{Range: r})
if err != nil {
    return nil, err
}
var resp *pb.RangeResponse

Contributor:

why shuffle this around?

Contributor:

nm, I see it falls through from linearizable to serializable once it gets the notify

@heyitsanthony (Contributor) commented:

Approach looks OK, but the synchronization/error handling is a little iffy.

@xiang90 (Contributor, Author) commented Sep 22, 2016

> the rate one appears on par or faster than the buffered way of doing it. Where's the latency going?

I will try with more testing and benchmarks.

@xiang90 (Contributor, Author) commented Sep 23, 2016

@heyitsanthony All fixed. I made some minor changes, and the performance of the unbuffered case improved, somewhat magically. I do not really understand why, but the result seems to match the simple benchmark you wrote. The test assumes that we can almost always batch the concurrent requests together. The unbuffered approach can still be slower than the other two solutions when some requests reach the etcd server at slightly different times and miss the batch (basically, if the requests end up divided into N groups, it consumes N times the resources). But the difference is not huge in most cases, so I guess we should just go with the simplest solution for now.
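
A sketch of the batching described above, reusing the shared-notifier shape from the review thread earlier; requestReadIndex is a placeholder for the raft ReadIndex round plus the wait for the applied index to catch up, not a real function in this PR:

// linearizableReadLoop drains readwait once per round, so every reader that
// arrives while a round is in flight joins that round's notifier: one raft
// round trip serves the whole batch.
func (s *server) linearizableReadLoop(requestReadIndex func() error) {
    for {
        <-s.readwait // block until at least one reader is waiting

        s.mu.Lock()
        nc := s.readNotf
        s.readNotf = newNotifier() // later arrivals form the next batch
        s.mu.Unlock()

        nc.notify(requestReadIndex()) // wake the whole batch with the result
    }
}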

@xiang90 (Contributor, Author) commented Sep 23, 2016

If this approach looks good, I will resolve the conflicts and fix the tests.

}
}

func (nc *notifier) close(err error) {

Contributor:

notify?

@heyitsanthony (Contributor) commented:

@xiang90 approach looks good. Thanks!

@xiang90 force-pushed the readindex branch 8 times, most recently from e1cb8e1 to b06b789, on September 26, 2016 10:46

@xiang90 (Contributor, Author) commented Sep 26, 2016

@heyitsanthony All fixed. PTAL.

@heyitsanthony (Contributor) left a review:

a few final nits

    return nc.err
case <-ctx.Done():
    return ctx.Err()
case <-s.done:

Contributor:

s.stopping

Contributor Author:

This is not a goroutine that the etcd server created; it is a per-request routine, so I assume we should use s.done?

Contributor:

OK, it doesn't really make a difference.

for !timeout && !done {
    select {
    case rs = <-s.r.readStateC:
        if !bytes.Equal(rs.RequestCtx, ctx) {

Contributor:

done = bytes.Equal(rs.RequestCtx, ctx)?
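
The suggestion in context, roughly; the timer and stop cases here are assumptions about the surrounding loop, not part of the excerpt:

for !timeout && !done {
    select {
    case rs = <-s.r.readStateC:
        done = bytes.Equal(rs.RequestCtx, ctx) // a matching ReadState ends the wait
    case <-timer.C:
        timeout = true
    case <-s.stopping:
        return
    }
}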

func TestCtlV3Elect(t *testing.T) { testCtl(t, testElect) }
func TestCtlV3Elect(t *testing.T) {
    for i := 0; ; i++ {
        fmt.Println(i)

Contributor:

stray debugging output?

@@ -86,6 +95,31 @@ type Authenticator interface {
}

func (s *EtcdServer) Range(ctx context.Context, r *pb.RangeRequest) (*pb.RangeResponse, error) {

Contributor:

also support l-read Txn?

Use read index to achieve l-read.

@xiang90 (Contributor, Author) commented Sep 27, 2016

@heyitsanthony Can we improve Txn in another pull request? It is more complicated than Range. We need to add some code to detect read-only Txns; then, if the Txn is serializable, it can access the local KV immediately, or else it needs to wait for the linearizable notify.
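
A sketch of the follow-up shape being described; isTxnReadonly, isTxnSerializable, and applyTxnLocally are assumed helper names, linearizableReadNotify stands for this PR's wait-for-notify step, and the result fields follow the Range path shown in the diff:

func (s *EtcdServer) Txn(ctx context.Context, r *pb.TxnRequest) (*pb.TxnResponse, error) {
    if isTxnReadonly(r) { // no puts or deletes anywhere in the Txn
        if !isTxnSerializable(r) {
            // linearizable read-only Txn: wait for the read-index notify first
            if err := s.linearizableReadNotify(ctx); err != nil {
                return nil, err
            }
        }
        return s.applyTxnLocally(r) // either way, then serve from the local KV
    }
    // Txns containing writes keep going through the raft proposal path
    result, err := s.processInternalRaftRequest(ctx, pb.InternalRaftRequest{Txn: r})
    if err != nil {
        return nil, err
    }
    return result.resp.(*pb.TxnResponse), result.err
}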

@heyitsanthony (Contributor) commented:

@xiang90 OK, we can defer the Txn stuff, but it needs to go in before 3.1.

@heyitsanthony (Contributor) commented:

lgtm

@xiang90 merged commit 150576f into etcd-io:master on Sep 27, 2016
@xiang90 deleted the readindex branch on September 27, 2016 16:51

@ericpai commented Mar 15, 2017

Will this improvement be backported to v2.3.x?

@heyitsanthony (Contributor) commented:

@ericpai no

@mitake mentioned this pull request on Apr 1, 2017