Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mvcc: allow large concurrent reads under light write workload #9296

Merged
merged 1 commit into from Feb 9, 2018

Conversation

@xiang90
Copy link
Contributor

commented Feb 7, 2018

I found that etcd is unable to process large (take more than 100ms) read requests concurrently even under light or no write workload. We already put some effort to solve the current read problem by adding a writeback buffer in front of the database transaction. So I want to fix this problem.

The root cause of the problem is that although we use Read Lock to allow current read go-routines, but we have another long running go-routine that tries to flush data back to disk every 100ms which requires Write Lock.

Obviously, the write lock has a higher priority that read lock, and will always kick in to block any later read locks. So with this behavior, we basically put a barrier for every 100ms to block all reads until the write lock is acquired by the long running go-routine and then released. The write lock can only be acquired after all previous read locks are released, which can be seconds if the reads are large. Thus we can block reads for seconds even when there is no write.

However, if there is no pending write, we do not need to get the write lock at all.

This PR delays the routine to acquire the write lock only when there are pending writes. So under no or light write conditions, large read can be executed concurrently.

/cc @jpbetz @mitake @heyitsanthony

@xiang90

This comment has been minimized.

Copy link
Contributor Author

commented Feb 7, 2018

@mitake

The commit routine is also probably the cause of what you observed #9199.

Basically when you make the range size larger, it blocks the flush longer. Thus the writes batch is larger, and write can be more effective. But it is probably not what we want?

@xiang90 xiang90 force-pushed the xiang90:c_r branch 4 times, most recently from 31af6df to 8e8538b Feb 7, 2018

@xiang90

This comment has been minimized.

Copy link
Contributor Author

commented Feb 8, 2018

I formed some new ideas to solve this issue + the problem #9199 tries to solve... so probably wait for my follow up commit :)

@mitake

This comment has been minimized.

Copy link
Contributor

commented Feb 8, 2018

@xiang90

The commit routine is also probably the cause of what you observed #9199.

I evaluated this branch with the benchmark and the result was similar to the master branch for now. Of course I'll wait your follow up commits.

 100000 / 100000 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%14s
Summary:
  Total:        14.5203 secs.
  Slowest:      2.6956 secs.
  Fastest:      0.0006 secs.
  Average:      0.0140 secs.
  Stddev:       0.0862 secs.
  Requests/sec: 6886.9280

Response time histogram:
  0.0006 [1]    |
  0.2701 [99799]        |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.5396 [100]  |
  0.8091 [0]    |
  1.0786 [0]    |
  1.3481 [0]    |
  1.6176 [0]    |
  1.8871 [0]    |
  2.1566 [0]    |
  2.4261 [0]    |
  2.6956 [100]  |

Latency distribution:
  10% in 0.0060 secs.
  25% in 0.0067 secs.
  50% in 0.0089 secs.
  75% in 0.0116 secs.
  90% in 0.0176 secs.
  95% in 0.0292 secs.
  99% in 0.0411 secs.
  99.9% in 2.6756 secs.

Probably the ideal solution for reducing the blocking of read txns would be letting the readers deal with only snapshots and making them lock free (completely no tx.Lock() in the reader side). But it makes the compaction process complicated because the read txns cannot indicate their existence. It means we need to have a deferred reclamation mechanism. It will introduce big design update of mvcc pkg.

For now, making read txns fine grained in a voluntary manner would be an ad-hoc but reasonable solution, I think.

@xiang90

This comment has been minimized.

Copy link
Contributor Author

commented Feb 8, 2018

I evaluated this branch with the benchmark and the result was similar to the master branch for now. Of course I'll wait your follow up commits.

This PR will only help on large concurrent read request, but not on blocking puts.

Probably the ideal solution for reducing the blocking of read txns would be letting the readers deal with only snapshots and making them lock free (completely no tx.Lock() in the reader side). But it makes the compaction process complicated because the read txns cannot indicate their existence. It means we need to have a deferred reclamation mechanism. It will introduce big design update of mvcc pkg.

Yea. We should dedicate large read request to another boltdb read tx after a commit happens after we receive the request instead of using the singleton read tx in current backend which is able to read the recent write in the writeback buffer.

@xiang90 xiang90 force-pushed the xiang90:c_r branch from 8e8538b to 84909fb Feb 8, 2018

// no need to reset read tx if there is no write.
// call commit to update stats.
if t.batchTx.pending == 0 && !stop {
t.batchTx.commit(stop)

This comment has been minimized.

Copy link
@heyitsanthony

heyitsanthony Feb 8, 2018

Contributor

commit(false)

@@ -231,6 +232,13 @@ func (t *batchTxBuffered) CommitAndStop() {
}

func (t *batchTxBuffered) commit(stop bool) {
// no need to reset read tx if there is no write.
// call commit to update stats.
if t.batchTx.pending == 0 && !stop {

This comment has been minimized.

Copy link
@heyitsanthony

heyitsanthony Feb 8, 2018

Contributor

if this approach isn't working (deadlocking in CI?), it might be easier to move the pending == 0 paths to backend.run() and avoid calling commit() there...

This comment has been minimized.

Copy link
@xiang90

xiang90 Feb 8, 2018

Author Contributor

CI panics since the tx initialization code goes through here. I am still playing with the code to see what is the easier. Probably I should try what you just suggested.

@xiang90 xiang90 force-pushed the xiang90:c_r branch 5 times, most recently from 8ef182b to ff50041 Feb 8, 2018

@codecov-io

This comment has been minimized.

Copy link

commented Feb 8, 2018

Codecov Report

❗️ No coverage uploaded for pull request base (master@f5d02f0). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #9296   +/-   ##
=========================================
  Coverage          ?   75.58%           
=========================================
  Files             ?      365           
  Lines             ?    30695           
  Branches          ?        0           
=========================================
  Hits              ?    23202           
  Misses            ?     5870           
  Partials          ?     1623
Impacted Files Coverage Δ
internal/mvcc/backend/backend.go 80% <100%> (ø)
internal/mvcc/backend/batch_tx.go 78.22% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f5d02f0...3ebee21. Read the comment docs.

@xiang90 xiang90 force-pushed the xiang90:c_r branch from ff50041 to e50aacb Feb 8, 2018

@xiang90

This comment has been minimized.

Copy link
Contributor Author

commented Feb 8, 2018

/cc @heyitsanthony can you take another look? change to what you suggested.

@gyuho gyuho added the Release-Note label Feb 8, 2018

b.batchTx.Commit()
b.batchTx.Lock()
pending := b.batchTx.pending
b.batchTx.Unlock()

This comment has been minimized.

Copy link
@gyuho

gyuho Feb 8, 2018

Member

Q. Doesn't b.batchTx.Unlock already reset pending count?

  1. pending := b.(*batchTxBuffered).pending, where pending is 20k and batchLimit is 10k
  2. Goes to b.(*batchTxBuffered).Unlock()
  3. t.pending >= t.backend.batchLimit
  4. (*batchTxBuffered).commit(false)
  5. (*batchTx).commit(false)
  6. (*batchTx).pending = 0

Shouldn't this be b.batchTx.batchTx.Unlock()?

This comment has been minimized.

Copy link
@xiang90

xiang90 Feb 8, 2018

Author Contributor

oh. yea. it should be the lock of the mutex.

@xiang90 xiang90 force-pushed the xiang90:c_r branch from e50aacb to 3ebee21 Feb 8, 2018

@gyuho

This comment has been minimized.

Copy link
Member

commented Feb 9, 2018

Approach looks good. I will try benchmarking this code path by issuing two overlapping readers contending on backend.readTx. If @mitake can cross-check on this, it would be great.

@hexfusion

This comment has been minimized.

Copy link
Member

commented Feb 9, 2018

@heyitsanthony
Copy link
Contributor

left a comment

lgtm

@gyuho

This comment has been minimized.

Copy link
Member

commented Feb 9, 2018

// if no pending writes, should not block concurrent read transactions
//
// without https://github.com/coreos/etcd/pull/9296
//  1. (*backend).(*batchTxBuffered).Commit()
//  2. (*backend).(*batchTxBuffered).Mutex.Lock
//  3. (*backend).(*readTx).RWMutex.Lock
//    - blocks long running read transaction
//
// with https://github.com/coreos/etcd/pull/9296
//  1. (*backend).(*batchTxBuffered).safePending()
//  2. (*backend).(*batchTxBuffered).Mutex.Lock

/*
go test -v -run TestRead
*/

func TestReadBlockingWriteCommit(t *testing.T)    { testRead(t, true) }
func TestReadNonBlockingWriteCommit(t *testing.T) { testRead(t, false) }
func testRead(t *testing.T, block bool) {
	keyN := 1000000

	// trigger commit manually, by giving large batch
	// interval and limit
	b, tmppath := NewTmpBackend(time.Hour, 2*keyN)
	defer b.Close()
	defer os.Remove(tmppath)

	b.batchTx.UnsafeCreateBucket([]byte("key"))

	for i := 0; i < keyN; i++ {
		v := make([]byte, 3*1024)
		rand.Read(v)
		b.batchTx.Lock()
		b.batchTx.UnsafePut([]byte("key"), []byte(fmt.Sprintf("foo%10d", i)), v)
		b.batchTx.Unlock()
	}
	// commits all pending writes
	b.batchTx.Commit()
	fmt.Println("pending:", b.batchTx.pending)

	// for long read transactions
	rangeStart, rangeEnd := []byte(fmt.Sprintf("foo%10d", 0)), []byte(fmt.Sprintf("foo%10d", keyN))

	readFunc := func(name string) {
		log.Printf("read %q waiting for Lock", name)

		now := time.Now()
		b.ReadTx().Lock()
		log.Printf("read %q acquired Lock", name)
		rn := time.Now()
		aa, bb := b.ReadTx().UnsafeRange([]byte("key"), rangeStart, rangeEnd, 0)
		b.ReadTx().Unlock()

		log.Printf("read %q finished (took %v, range took %v) %d %d", name, time.Since(now), time.Since(rn), len(aa), len(bb))
	}

	total := time.Now()
	donec := make(chan struct{})

	go func() {
		readFunc("A")
		donec <- struct{}{}
	}()

	go func() {
		time.Sleep(time.Second)
		log.Println("commit starts")

		now := time.Now()
		if block {
			b.batchTx.Commit()
		} else {
			if b.batchTx.safePending() != 0 {
				b.batchTx.Commit()
			}
		}

		log.Printf("commit finished (took %v)", time.Since(now))
		donec <- struct{}{}
	}()

	time.Sleep(2 * time.Second)
	readFunc("B")

	<-donec
	<-donec

	fmt.Println("total:", time.Since(total))
}

func (t *batchTx) safePending() int {
	t.Mutex.Lock()
	defer t.Mutex.Unlock()
	return t.pending
}

LGTM

@xiang90 xiang90 merged commit 0a20767 into etcd-io:master Feb 9, 2018

5 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
jenkins-cov Build finished.
Details
jenkins-ppc64le Build finished.
Details
jenkins-proxy-ci Build finished.
Details
semaphoreci The build passed on Semaphore.
Details

@xiang90 xiang90 deleted the xiang90:c_r branch Feb 9, 2018

@gyuho

This comment has been minimized.

Copy link
Member

commented Feb 9, 2018

Instead, compared two compiled etcd process before and after this patch, and compare how periodic commit operation affects incoming read transactions.

I wrote 1M keys, and wait until there's no pending writes. Previously, commit for-loop with batch interval acquires writer locks even when there's no pending write. Thus, new read transactions to acquire reader locks were blocked by commit operation holding writer locks. Had to add some sleeps to have larger window between commit operation writer lock acquire and release, so that read transaction can kick in and be blocked.

Then, spawned two separate read transactions concurrently. Before, each read transaction takes about 2.7s. When launched concurrently, the second reader blocks with commit operation holding writer locks, thus taking >4.47s. After this patch, all read transactions take <3.3s.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
6 participants
You can’t perform that action at this time.