
Question: Should a large get with WithSerializable() option block puts? #7719

Closed
mcginne opened this issue Apr 12, 2017 · 21 comments

@mcginne

mcginne commented Apr 12, 2017

Our workload does some periodic large reads from etcd (v3.1.5 using the v3 API), potentially reading 300,000 of the 600,000 keys in etcd.

With 600,000 keys in etcd, the Get operation takes ~1.3 seconds to complete.
I notice that some puts issued while the get is in progress take ~700 milliseconds to complete (normally a put takes ~1 millisecond).

I had hoped that, since the clientv3.WithSerializable() option is a bit like a dirty read, it might not block the puts, but from my tests it seems that it does: I still see ~700ms delays on some puts. Is this expected?

Is there any way to perform a read that will not block puts?
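
(For reference, a minimal sketch of the read/write mix described above, using the clientv3 Go API; the endpoint, prefix, and key names are placeholders, and in the real workload the put runs concurrently with the large get rather than after it.)

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Large serializable (non-linearizable) range read over a prefix.
	// WithSerializable skips the quorum read step on the server, but, as this
	// issue discusses, it does not by itself keep concurrent puts unaffected.
	start := time.Now()
	resp, err := cli.Get(context.Background(), "/myprefix/",
		clientv3.WithPrefix(), clientv3.WithSerializable())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("got %d keys in %v\n", len(resp.Kvs), time.Since(start))

	// In the workload above, small puts like this one run concurrently with
	// the large get; that is where the ~700ms latency spikes show up.
	if _, err := cli.Put(context.Background(), "/myprefix/somekey", "value"); err != nil {
		log.Fatal(err)
	}
}
```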

@xiang90
Contributor

xiang90 commented Apr 12, 2017

Can you try with current master? We made reads non-blocking in master.

@mcginne
Author

mcginne commented Apr 12, 2017

That's great news, will give it a go, thanks.

@NateRockwell

Why were nonblocking reads only added to the master?
What happens if the master changes while a large read is in progress?

@heyitsanthony
Contributor

Why were nonblocking reads only added to the master?

master tracks the next minor revision release; it's a development branch for major changes. Updates to 3.1.x are bug fixes.

What happens if the master changes while a large read is in progress?

The read will reflect the data from the revision when it was started. The write won't affect it.
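
(As a client-side illustration of this point, here is a sketch with placeholder keys and an assumed already-connected client: re-reading with clientv3.WithRev at the first response's header revision returns the same snapshot even if a put lands in between.)

```go
package example

import (
	"context"
	"fmt"
	"log"

	"github.com/coreos/etcd/clientv3"
)

// snapshotRead illustrates, from the client side, that a range read is served
// at a single store revision: re-reading with WithRev pins the same MVCC
// snapshot even when a write lands in between. The /myprefix keys are
// placeholders and cli is assumed to be an already-connected client.
func snapshotRead(cli *clientv3.Client) {
	ctx := context.Background()

	first, err := cli.Get(ctx, "/myprefix/", clientv3.WithPrefix())
	if err != nil {
		log.Fatal(err)
	}
	rev := first.Header.Revision

	// A write after the first read bumps the store revision...
	if _, err := cli.Put(ctx, "/myprefix/newkey", "v"); err != nil {
		log.Fatal(err)
	}

	// ...but a read pinned at the earlier revision does not observe it.
	pinned, err := cli.Get(ctx, "/myprefix/",
		clientv3.WithPrefix(), clientv3.WithRev(rev))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(first.Kvs) == len(pinned.Kvs)) // true, barring other writers
}
```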

@mcginne
Author

mcginne commented Apr 13, 2017

Hi, I tried with master built today:
2017-04-13 09:25:10.865664 I | etcdmain: etcd Version: 3.2.0+git
2017-04-13 09:25:10.865713 I | etcdmain: Git SHA: 4582a7e

I didn't see any improvement over 3.1.5; it still appears my puts are being blocked when a large read occurs. I tried with and without WithSerializable() and it didn't make much difference. Is there anything I need to change to enable non-blocking reads?
(I didn't rebuild my client - is this necessary?)

@xiang90
Contributor

xiang90 commented Apr 13, 2017

@mcginne Maybe CPU or network I/O has a starvation issue, so no matter what etcd does internally the writes still appear to block? You probably want to pull some system metrics to verify.

@xiang90
Contributor

xiang90 commented Apr 22, 2017

@mcginne kindly ping.

@mcginne
Author

mcginne commented Apr 24, 2017

@xiang90 sorry, I was on vacation. I don't believe I am CPU limited (running at ~50% utilisation), and I don't believe network would be an issue as I am currently running against localhost in my test env.
I ran my benchmark with 300,000 keys to ensure I see large pauses for the puts while the gets are happening, and took some thread dumps. In all of them I can see a Put goroutine that looks like it is blocked on a semaphore:

sync.runtime_Semacquire(0xc4203d9fb8)
	/usr/local/go/src/runtime/sema.go:47 +0x30
sync.(*RWMutex).Lock(0xc4203d9fb0)
	/usr/local/go/src/sync/rwmutex.go:91 +0x98
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.(*batchTxBuffered).Unlock(0xc4203e4030)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/batch_tx.go:221 +0x42
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*storeTxnWrite).End(0xc424752980)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore_txn.go:104 +0x98
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*metricsTxnWrite).End(0xc43bf1cff0)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/metrics_txn.go:67 +0xb3
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*watchableStoreTxnWrite).End(0xc42038b2e0)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/watchable_store_txn.go:44 +0x253
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*applierV3backend).Put(0xc420120040, 0x1389260, 0xc42038b2e0, 0xc424752940, 0xc4372dc5b0, 0x0, 0x0)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/apply.go:190 +0x342
...

The stack of the Range varies between runs, but here is one:

goroutine 1859 [runnable]:
sync.(*Mutex).Unlock(0xc4203d9fd0)
	/usr/local/go/src/sync/mutex.go:102
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.(*readTx).UnsafeRange(0xc4203d9fb0, 0x1376032, 0x3, 0x3, 0xc430582b80, 0x11, 0x12, 0xc430582ba0, 0x11, 0x12, ...)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/read_tx.go:69 +0x278
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*storeTxnRead).rangeKeys(0xc42ec20f90, 0xc42f7fa650, 0x9, 0x10, 0xc42f7fa660, 0x9, 0x10, 0x4b60d, 0x0, 0x0, ...)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore_txn.go:135 +0x208
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*storeTxnRead).Range(0xc42ec20f90, 0xc42f7fa650, 0x9, 0x10, 0xc42f7fa660, 0x9, 0x10, 0x0, 0x0, 0xc42caae500, ...)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore_txn.go:45 +0xaf
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*txnReadWrite).Range(0xc42f7fa670, 0xc42f7fa650, 0x9, 0x10, 0xc42f7fa660, 0x9, 0x10, 0x0, 0x0, 0xc42caae500, ...)
	<autogenerated>:27 +0xc1
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*metricsTxnWrite).Range(0xc42ec20fc0, 0xc42f7fa650, 0x9, 0x10, 0xc42f7fa660, 0x9, 0x10, 0x0, 0x0, 0xc420202100, ...)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/metrics_txn.go:38 +0xb6
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*applierV3backend).Range(0xc420120038, 0x13858a0, 0xc42ec20fc0, 0xc420050d90, 0x0, 0x0, 0x0)
	/home/vagrant/etcd_master/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/apply.go:253 +0x1c5
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).Range.func2()
...

Is this all as expected with the non-blocking reads? I can attach the full thread dumps if they would be of use.

@heyitsanthony
Contributor

The first lock can be improved a bit by creating a new read buffer instead of modifying the shared one, but there'll be some extra copy/allocation overhead. The second lock will probably need a boltdb patch; the next best thing would be lock striping with several read txns.
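
(For readers following along, a generic sketch of the lock-striping idea, assuming reads could be spread across several stripes each guarded by its own lock; this illustrates the technique only and is not the actual backend change.)

```go
package example

import (
	"hash/fnv"
	"sync"
)

// stripedLocks is a generic illustration of lock striping (not etcd's backend
// code): readers take a shared lock on one of N stripes chosen by key hash,
// so a long-running reader only contends with writers touching the same
// stripe, while an operation that must exclude every reader takes all stripes.
type stripedLocks struct {
	stripes []sync.RWMutex
}

func newStripedLocks(n int) *stripedLocks {
	return &stripedLocks{stripes: make([]sync.RWMutex, n)}
}

func (s *stripedLocks) stripeFor(key []byte) *sync.RWMutex {
	h := fnv.New32a()
	h.Write(key)
	return &s.stripes[int(h.Sum32())%len(s.stripes)]
}

// RLockKey takes a shared lock on the stripe owning key and returns the
// matching unlock function.
func (s *stripedLocks) RLockKey(key []byte) func() {
	m := s.stripeFor(key)
	m.RLock()
	return m.RUnlock
}

// LockAll takes every stripe exclusively, for the rare operations that must
// see no concurrent readers at all (e.g. swapping the underlying read txn).
func (s *stripedLocks) LockAll() func() {
	for i := range s.stripes {
		s.stripes[i].Lock()
	}
	return func() {
		for i := range s.stripes {
			s.stripes[i].Unlock()
		}
	}
}
```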

@mcginne
Author

mcginne commented Apr 25, 2017

@heyitsanthony thanks for looking. Just to confirm my understanding: am I using the "non-blocking read" path here, but hitting further locks that block the puts whilst the read is ongoing? My interpretation of "non-blocking read" was that other transactions would be able to complete whilst the read was in progress.

@heyitsanthony
Contributor

Yes, it's hitting locks.

@smarterclayton
Contributor

Moved from #8202:

I'm testing a very large 1.5 Kubernetes data set upgraded to 1.6 and etcd v3 mode (etcd 3.2.1). Post v3 migration, I'm performing a Kubernetes storage migration (GET -> PUT of all data) to force migration to protobuf. However, as I was letting the migration run in the background (single threaded, it is taking about 30-40 minutes, since I'm migrating about ~500k Kubernetes objects), I tried to do a range read of the /kubernetes.io keyspace to list all of the keys (etcdctl get /kubernetes.io --keys-only), and the migration halted until the range read completed or timed out. I tried both serializable and linearizable, and both seemed to have the same behavior of "blocking" the migration process from continuing (specifically, the PUT of the object from Kubernetes to etcd timed out after 30s).

I didn't expect a range read to block writes, and have not noticed that behavior elsewhere. It's somewhat reproducible on the data set I have, but was not sure where to start looking to debug.

My worry here is that this will become increasingly annoying in large Kubernetes clusters - while large lists are uncommon, they are a common way for components to resync. The worst case would be for large range reads to block lease acquisition on loaded clusters. The cluster I'm describing above is one of the largest Kube clusters that is likely in the near term, and has some pathological distributions (we have lots of secrets, which are very large in etcd), but even at 10k namespaces has many fewer keys than we expect to eventually have. We aren't blocked by this - but I would expect it to be an issue for a subset of large deployers.

@SuhasAnand

The worst case would be for large range reads to block lease acquisition on loaded clusters.

This is similar to what we are seeing in #8114.

@xiang90
Contributor

xiang90 commented Jul 5, 2017

My worry here is that this will become increasingly annoying in large Kubernetes clusters - while large lists are uncommon, they are a common way for components to resync. The worst case would be for large range reads to block lease acquisition on loaded clusters. The cluster I'm describing above is one of the largest Kube clusters that is likely in the near term, and has some pathological distributions (we have lots of secrets, which are very large in etcd), but even at 10k namespaces has many fewer keys than we expect to eventually have. We aren't blocked by this - but I would expect it to be an issue for a subset of large deployers.

etcd adds pagination support for this exact reason. I hope k8s can adopt this pattern instead of trying to get ALL keys at the same time.

This is not to say that we should not make reads non-blocking; if anyone wants to look into that problem, please do!
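
(For reference, a sketch of the client-side pagination pattern using existing clientv3 options, WithLimit, WithRange, and WithRev; the prefix and page size are placeholders, and this is the snapshot-at-a-revision flavor rather than a linearizable read at head.)

```go
package example

import (
	"context"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/mvcc/mvccpb"
)

// pagedGet walks a prefix in pages of pageSize keys. The first page pins a
// revision and later pages reuse it with WithRev, so the whole walk reads one
// consistent snapshot instead of holding a single huge range open.
// prefix and pageSize are placeholders supplied by the caller.
func pagedGet(ctx context.Context, cli *clientv3.Client, prefix string, pageSize int64) ([]*mvccpb.KeyValue, error) {
	var (
		kvs []*mvccpb.KeyValue
		rev int64
	)
	key := prefix
	end := clientv3.GetPrefixRangeEnd(prefix)
	for {
		opts := []clientv3.OpOption{
			clientv3.WithRange(end),
			clientv3.WithLimit(pageSize),
			clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend),
		}
		if rev != 0 {
			opts = append(opts, clientv3.WithRev(rev))
		}
		resp, err := cli.Get(ctx, key, opts...)
		if err != nil {
			return nil, err
		}
		if rev == 0 {
			rev = resp.Header.Revision // pin the snapshot after the first page
		}
		kvs = append(kvs, resp.Kvs...)
		if !resp.More || len(resp.Kvs) == 0 {
			return kvs, nil
		}
		// Start the next page just after the last key returned.
		key = string(resp.Kvs[len(resp.Kvs)-1].Key) + "\x00"
	}
}
```

Each page holds the server's read path for a much shorter time than one huge range, which is the point of the pagination suggestion above.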

@smarterclayton
Contributor

Yeah, I'm going to look at paging - @lavalamp and I had a quick discussion about it, and we have some very large clusters that want pagination to smooth out allocation curves for other reasons anyway. I'll definitely be experimenting with pagination soon.

@lavalamp

If pagination solves this issue, why doesn't etcd internally implement a large get as a series of gets?

@xiang90
Contributor

xiang90 commented Jul 13, 2017

@lavalamp

If pagination solves this issue,

There are two cases:

  1. A snapshot get at a fixed revision: current pagination solves this case.
  2. A linearizable get at head: current pagination does not solve this case.

Assembling large range results into one gRPC response inside the etcd server would still cause memory blow-up. Moreover, large messages also break gRPC best practice; gRPC suggests a 4MB max message size.

To sum it up, current pagination only solves part of the problem, and there are more things to think about.

why doesn't etcd internally implement a large get as a series of gets?

Everything can be implemented inside etcd in theory, but every line of code is a liability. We need to balance the complexity and keep an eye on the budget left here. We are still uncertain about how complicated this would be. And besides all that, we have limited human resources. As I mentioned, if anyone wants to look into the problem and come up with a proposal, that would be great!

@xiang90
Contributor

xiang90 commented Jan 5, 2018

@jpbetz @gyuho @hexfusion @spzala

I have heard from quite a few people hitting this issue in large deployments.

We could:

  1. completely solve the blocking issue,
  2. or start to log warnings when users issue large ranges (we already warn users when a request takes too long, but that warning is not specific to large ranges),
  3. and/or add a flag to limit the number of keys a range/delete can touch, to avoid the blocking.

@xiang90
Contributor

xiang90 commented Jan 5, 2018

Maybe we should investigate how other databases behave with large responses: MySQL, PostgreSQL, MongoDB, Redis, etc.

@hexfusion
Contributor

Going to do some research here; if anyone is working on this, please ping me, I would like to collaborate.

@stale

stale bot commented Apr 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 6, 2020
@stale stale bot closed this as completed Apr 28, 2020