
Question: Should a large get with WithSerializable() option block puts? #7719

Closed
mcginne opened this issue Apr 12, 2017 · 21 comments

@mcginne

mcginne commented Apr 12, 2017

Our workload does some periodic large reads from etcd (v3.1.5 using the v3 API), potentially reading 300,000 of the 600,000 keys in etcd.

With 600,000 keys in etcd, the Get operation takes ~1.3 seconds to complete.
I notice that some puts issued while the get is in progress take ~700 milliseconds to complete (normally a put takes ~1 millisecond).

I had hoped that, since the clientv3.WithSerializable() option is a bit like a dirty read, it might not block the puts, but from my tests it seems that it does: I still see ~700ms delays on some puts. Is this expected?

Is there any way to perform a read that will not block puts?
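
(For reference, a minimal sketch of the read/write mix described above, using the clientv3 Go API; the endpoint, prefix, and key names are placeholders, and in the real workload the put runs concurrently with the large get rather than after it.)

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Large serializable (non-linearizable) range read over a prefix.
	// WithSerializable skips the quorum read step on the server, but, as this
	// issue discusses, it does not by itself keep concurrent puts unaffected.
	start := time.Now()
	resp, err := cli.Get(context.Background(), "/myprefix/",
		clientv3.WithPrefix(), clientv3.WithSerializable())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("got %d keys in %v\n", len(resp.Kvs), time.Since(start))

	// In the workload above, small puts like this one run concurrently with
	// the large get; that is where the ~700ms latency spikes show up.
	if _, err := cli.Put(context.Background(), "/myprefix/somekey", "value"); err != nil {
		log.Fatal(err)
	}
}
```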

@xiang90
Contributor

xiang90 commented Apr 12, 2017

Can you try with current master? We made reads non-blocking in master.

@mcginne
Author

mcginne commented Apr 12, 2017

That's great news, will give it a go, thanks.

@NateRockwell

Why were nonblocking reads only added to the master?
What happens if the master changes while a large read is in progress?

@heyitsanthony
Contributor

Why were nonblocking reads only added to the master?

master tracks the next minor revision release; it's a development branch for major changes. Updates to 3.1.x are bug fixes.

What happens if the master changes while a large read is in progress?

The read will reflect the data from the revision when it was started. The write won't affect it.
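
(As a client-side illustration of this point, here is a sketch with placeholder keys and an assumed already-connected client: re-reading with clientv3.WithRev at the first response's header revision returns the same snapshot even if a put lands in between.)

```go
package example

import (
	"context"
	"fmt"
	"log"

	"github.com/coreos/etcd/clientv3"
)

// snapshotRead illustrates, from the client side, that a range read is served
// at a single store revision: re-reading with WithRev pins the same MVCC
// snapshot even when a write lands in between. The /myprefix keys are
// placeholders and cli is assumed to be an already-connected client.
func snapshotRead(cli *clientv3.Client) {
	ctx := context.Background()

	first, err := cli.Get(ctx, "/myprefix/", clientv3.WithPrefix())
	if err != nil {
		log.Fatal(err)
	}
	rev := first.Header.Revision

	// A write after the first read bumps the store revision...
	if _, err := cli.Put(ctx, "/myprefix/newkey", "v"); err != nil {
		log.Fatal(err)
	}

	// ...but a read pinned at the earlier revision does not observe it.
	pinned, err := cli.Get(ctx, "/myprefix/",
		clientv3.WithPrefix(), clientv3.WithRev(rev))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(first.Kvs) == len(pinned.Kvs)) // true, barring other writers
}
```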

@mcginne
Author

mcginne commented Apr 13, 2017

Hi, I tried with master built today:
2017-04-13 09:25:10.865664 I | etcdmain: etcd Version: 3.2.0+git
2017-04-13 09:25:10.865713 I | etcdmain: Git SHA: 4582a7e

I didn't see any improvement over 3.1.5; it still appears my puts are being blocked when a large read occurs. I tried with and without WithSerializable() and it didn't make much difference. Is there anything I need to change to enable non-blocking reads?
(I didn't rebuild my client - is this necessary?)

@xiang90
Contributor

xiang90 commented Apr 13, 2017

@mcginne Maybe CPU or network I/O has a starvation issue, so no matter what etcd does internally the writes still appear to block? You probably want to pull some system metrics to verify.

@xiang90
Contributor

xiang90 commented Apr 22, 2017

@mcginne kindly ping.

@mcginne
Author

mcginne commented Apr 24, 2017

@xiang90 sorry, I was on vacation. I don't believe I am CPU limited (running at ~50% utilisation), and I don't believe network would be an issue as I am currently running against localhost in my test env.
I ran my benchmark with 300,000 keys to ensure I see large pauses for the puts while the gets are happening, and took some thread dumps. In all of them I can see a Put goroutine that looks like it is blocked on a semaphore:

sync.runtime_Semacquire(0xc4203d9fb8)
	/usr/local/go/src/runtime/sema.go:47 +0x30
sync.(*RWMutex).Lock(0xc4203d9fb0)
	/usr/local/go/src/sync/rwmutex.go:91 +0x98
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.(*batchTxBuffered).Unlock(0xc4203e4030)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/batch_tx.go:221 +0x42
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*storeTxnWrite).End(0xc424752980)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore_txn.go:104 +0x98
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*metricsTxnWrite).End(0xc43bf1cff0)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/metrics_txn.go:67 +0xb3
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*watchableStoreTxnWrite).End(0xc42038b2e0)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/watchable_store_txn.go:44 +0x253
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*applierV3backend).Put(0xc420120040, 0x1389260, 0xc42038b2e0, 0xc424752940, 0xc4372dc5b0, 0x0, 0x0)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/apply.go:190 +0x342
...

The stack of the Range varies between runs, but here is one:

goroutine 1859 [runnable]:
sync.(*Mutex).Unlock(0xc4203d9fd0)
	/usr/local/go/src/sync/mutex.go:102
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.(*readTx).UnsafeRange(0xc4203d9fb0, 0x1376032, 0x3, 0x3, 0xc430582b80, 0x11, 0x12, 0xc430582ba0, 0x11, 0x12, ...)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/read_tx.go:69 +0x278
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*storeTxnRead).rangeKeys(0xc42ec20f90, 0xc42f7fa650, 0x9, 0x10, 0xc42f7fa660, 0x9, 0x10, 0x4b60d, 0x0, 0x0, ...)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore_txn.go:135 +0x208
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*storeTxnRead).Range(0xc42ec20f90, 0xc42f7fa650, 0x9, 0x10, 0xc42f7fa660, 0x9, 0x10, 0x0, 0x0, 0xc42caae500, ...)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore_txn.go:45 +0xaf
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*txnReadWrite).Range(0xc42f7fa670, 0xc42f7fa650, 0x9, 0x10, 0xc42f7fa660, 0x9, 0x10, 0x0, 0x0, 0xc42caae500, ...)
	<autogenerated>:27 +0xc1
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*metricsTxnWrite).Range(0xc42ec20fc0, 0xc42f7fa650, 0x9, 0x10, 0xc42f7fa660, 0x9, 0x10, 0x0, 0x0, 0xc420202100, ...)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/metrics_txn.go:38 +0xb6
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*applierV3backend).Range(0xc420120038, 0x13858a0, 0xc42ec20fc0, 0xc420050d90, 0x0, 0x0, 0x0)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/apply.go:253 +0x1c5
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).Range.func2()
...

Is this all as expected with the non-blocking reads? I can attach the full thread dumps if they would be of use.

@heyitsanthony
Contributor

The first lock can be improved a bit by creating a new read buffer instead of modifying the shared one, but there'll be some extra copy/allocation overhead. The second lock will probably need a boltdb patch; the next best thing would be lock striping with several read txns.
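
(For readers following along, a generic sketch of the lock-striping idea, assuming reads could be spread across several stripes each guarded by its own lock; this illustrates the technique only and is not the actual backend change.)

```go
package example

import (
	"hash/fnv"
	"sync"
)

// stripedLocks is a generic illustration of lock striping (not etcd's backend
// code): readers take a shared lock on one of N stripes chosen by key hash,
// so a long-running reader only contends with writers touching the same
// stripe, while an operation that must exclude every reader takes all stripes.
type stripedLocks struct {
	stripes []sync.RWMutex
}

func newStripedLocks(n int) *stripedLocks {
	return &stripedLocks{stripes: make([]sync.RWMutex, n)}
}

func (s *stripedLocks) stripeFor(key []byte) *sync.RWMutex {
	h := fnv.New32a()
	h.Write(key)
	return &s.stripes[int(h.Sum32())%len(s.stripes)]
}

// RLockKey takes a shared lock on the stripe owning key and returns the
// matching unlock function.
func (s *stripedLocks) RLockKey(key []byte) func() {
	m := s.stripeFor(key)
	m.RLock()
	return m.RUnlock
}

// LockAll takes every stripe exclusively, for the rare operations that must
// see no concurrent readers at all (e.g. swapping the underlying read txn).
func (s *stripedLocks) LockAll() func() {
	for i := range s.stripes {
		s.stripes[i].Lock()
	}
	return func() {
		for i := range s.stripes {
			s.stripes[i].Unlock()
		}
	}
}
```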

@mcginne
Author

mcginne commented Apr 25, 2017

@heyitsanthony thanks for looking. Just to confirm my understanding: am I using the "non-blocking read" path here, but hitting further locks that block the puts whilst the read is ongoing? My interpretation of "non-blocking read" was that other transactions would be able to complete whilst the read was in progress.

@heyitsanthony
Contributor

Yes, it's hitting locks.

@smarterclayton
Contributor

Moved from #8202:

I'm testing a very large 1.5 Kubernetes data set upgraded to 1.6 and etcd v3 mode (etcd 3.2.1). Post v3 migration, I'm performing a Kubernetes storage migration (GET -> PUT of all data) to force migration to protobuf. However, as I was letting the migration run in the background (single threaded, it is taking about 30-40 minutes, since I'm migrating about ~500k Kubernetes objects), I tried to do a range read of the /kubernetes.io keyspace to list all of the keys (etcdctl get /kubernetes.io --keys-only), and the migration halted until the range read completed or timed out. I tried both serializable and linearizable, and both seemed to have the same behavior of "blocking" the migration process from continuing (specifically, the PUT of the object from Kubernetes to etcd timed out after 30s).

I didn't expect a range read to block writes, and have not noticed that behavior elsewhere. It's somewhat reproducible on the data set I have, but was not sure where to start looking to debug.

My worry here is that this will become increasingly annoying in large Kubernetes clusters - while large lists are uncommon, they are a common way for components to resync. The worst case would be for large range reads to block lease acquisition on loaded clusters. The cluster I'm describing above is one of the largest Kube clusters that is likely in the near term, and has some pathological distributions (we have lots of secrets, which are very large in etcd), but even at 10k namespaces has many fewer keys than we expect to eventually have. We aren't blocked by this - but I would expect it to be an issue for a subset of large deployers.

@SuhasAnand

The worst case would be for large range reads to block lease acquisition on loaded clusters.

This is similar to what we are seeing in #8114.

@xiang90
Contributor

xiang90 commented Jul 5, 2017

My worry here is that this will become increasingly annoying in large Kubernetes clusters - while large lists are uncommon, they are a common way for components to resync. The worst case would be for large range reads to block lease acquisition on loaded clusters. The cluster I'm describing above is one of the largest Kube clusters that is likely in the near term, and has some pathological distributions (we have lots of secrets, which are very large in etcd), but even at 10k namespaces has many fewer keys than we expect to eventually have. We aren't blocked by this - but I would expect it to be an issue for a subset of large deployers.

etcd adds pagination support for this exact reason. I hope k8s can adopt this pattern instead of trying to get ALL keys at the same time.

This is not to say that we should not make reads non-blocking; if anyone wants to look into that problem, please do!
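
(For reference, a sketch of the client-side pagination pattern using existing clientv3 options, WithLimit, WithRange, and WithRev; the prefix and page size are placeholders, and this is the snapshot-at-a-revision flavor rather than a linearizable read at head.)

```go
package example

import (
	"context"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/mvcc/mvccpb"
)

// pagedGet walks a prefix in pages of pageSize keys. The first page pins a
// revision and later pages reuse it with WithRev, so the whole walk reads one
// consistent snapshot instead of holding a single huge range open.
// prefix and pageSize are placeholders supplied by the caller.
func pagedGet(ctx context.Context, cli *clientv3.Client, prefix string, pageSize int64) ([]*mvccpb.KeyValue, error) {
	var (
		kvs []*mvccpb.KeyValue
		rev int64
	)
	key := prefix
	end := clientv3.GetPrefixRangeEnd(prefix)
	for {
		opts := []clientv3.OpOption{
			clientv3.WithRange(end),
			clientv3.WithLimit(pageSize),
			clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend),
		}
		if rev != 0 {
			opts = append(opts, clientv3.WithRev(rev))
		}
		resp, err := cli.Get(ctx, key, opts...)
		if err != nil {
			return nil, err
		}
		if rev == 0 {
			rev = resp.Header.Revision // pin the snapshot after the first page
		}
		kvs = append(kvs, resp.Kvs...)
		if !resp.More || len(resp.Kvs) == 0 {
			return kvs, nil
		}
		// Start the next page just after the last key returned.
		key = string(resp.Kvs[len(resp.Kvs)-1].Key) + "\x00"
	}
}
```

Each page holds the server's read path for a much shorter time than one huge range, which is the point of the pagination suggestion above.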

@smarterclayton
Contributor

Yeah, I'm going to look at paging - @lavalamp and I had a quick discussion about it, and we have some very large clusters that want pagination to smooth out allocation curves for other reasons anyway. I'll definitely be experimenting with pagination soon.

@lavalamp

If pagination solves this issue, why doesn't etcd internally implement a large get as a series of gets?

@xiang90
Contributor

xiang90 commented Jul 13, 2017

@lavalamp

If pagination solves this issue,

There are two cases:

  1. A snapshot get at a fixed revision: current pagination solves this case.
  2. A linearizable get at head: current pagination does not solve this case.

Assembling large range results into one gRPC response inside the etcd server would still cause memory blow-up. Moreover, large messages also break gRPC best practice; gRPC suggests a 4MB max message size.

To sum it up, current pagination only solves part of the problem, and there are more things to think about.

why doesn't etcd internally implement a large get as a series of gets?

Everything can be implemented inside etcd in theory, but every line of code is a liability. We need to balance the complexity and keep an eye on the budget left here. We are still uncertain about how complicated this would be. And besides all that, we have limited human resources. As I mentioned, if anyone wants to look into the problem and come up with a proposal, that would be great!

@xiang90
Contributor

xiang90 commented Jan 5, 2018

@jpbetz @gyuho @hexfusion @spzala

I have heard from quite a few people hitting this issue in large deployments.

We could:

  1. completely solve the blocking issue,
  2. or start to log warnings when users issue large ranges (we already warn users when a request takes too long, but that warning is not specific to large ranges),
  3. and/or add a flag to limit the number of keys a range/delete can touch, to avoid the blocking.

@xiang90
Contributor

xiang90 commented Jan 5, 2018

Maybe we should investigate how other databases behave with large responses: MySQL, PostgreSQL, MongoDB, Redis, etc.

@hexfusion
Contributor

Going to do some research here; if anyone is working on this, please ping me, I would like to collaborate.

@stale

stale bot commented Apr 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 6, 2020
@stale stale bot closed this as completed Apr 28, 2020