
lease/lessor: recheck if expired lease is revoked #10693

Merged
merged 1 commit into etcd-io:master from nolouch:fix-lease
May 31, 2019

Conversation

nolouch
Contributor

@nolouch nolouch commented Apr 29, 2019

Fix #10686:
This is a simple way to try to fix the issue; I'm not sure it is the best method. I'd like some suggestions on it, and then I'll add tests.

@codecov-io

codecov-io commented Apr 29, 2019

Codecov Report

Merging #10693 into master will decrease coverage by 0.43%.
The diff coverage is 92.59%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #10693      +/-   ##
==========================================
- Coverage   62.47%   62.04%   -0.44%     
==========================================
  Files         392      392              
  Lines       37158    37197      +39     
==========================================
- Hits        23215    23078     -137     
- Misses      12381    12544     +163     
- Partials     1562     1575      +13
Impacted Files Coverage Δ
lease/lessor.go 88.34% <100%> (+0.17%) ⬆️
etcdserver/server.go 68.07% <100%> (-0.12%) ⬇️
lease/lease_queue.go 91.11% <84.61%> (-8.89%) ⬇️
client/keys.go 55.77% <0%> (-35.68%) ⬇️
auth/simple_token.go 65.85% <0%> (-21.14%) ⬇️
clientv3/balancer/grpc1.7-health.go 11.04% <0%> (-12.8%) ⬇️
client/client.go 44.11% <0%> (-9.16%) ⬇️
auth/store.go 46.08% <0%> (-4.7%) ⬇️
pkg/testutil/recorder.go 77.77% <0%> (-3.71%) ⬇️
clientv3/leasing/txn.go 88.09% <0%> (-3.18%) ⬇️
... and 23 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 23a89b0...dc8a31e.

@jingyih
Contributor

jingyih commented Apr 29, 2019

cc @jpbetz

@jpbetz
Contributor

jpbetz commented Apr 29, 2019

The way the logic is split between findExpiredLeases and expireExists seems counterproductive to fixing this, since what we really need to do is keep items in the heap until they're gone from heapMap.

Setting item.time to >= time.Now does fix the issue, but it changes the expiry time, which ideally shouldn't be needed.

Something like this might be more readable and less confusing:

  1. "clean" le.leaseHeap by removing expired items
  2. Find all the expired items and return them in a list, but leave them in the heap
  3. Optional: manage retry backoff logic. This should probably be the responsibility of the expiredLeaseC channel consumer, which would keep track of lease revoke failures as needed.

Getting all the expired items can be done without any Pop or Push calls: just binary search leaseHeap for time.Now(), copy the part of the slice that contains expired items, then filter out the ones that are no longer in heapMap. (see below comment)

For "clean", I don't know if we really need any more than a for leaseHeap[0] == nil { heap.Pop(leaseHeap) }. It would be possible to be more aggressive and keep track of the leases that were filtered out because they are no longer in heapMap, and then delete them from leaseHeap (directly followed by a heap.Fix call to be efficient?), but I'm not sure that really helps unless there is a bug that could prevent some lease from ever being revoked.

lease/lessor.go (outdated review thread, resolved)
@jpbetz
Contributor

jpbetz commented Apr 30, 2019

Sorry, I’m not thinking straight today. Ignore my above comment about binary search of leaseHeap, that won’t work. It might be that adjusting the expire time like suggested in the PR is the most efficient solution. If we do that, heap.Fix() is more efficient than Pop/Push.

@jpbetz
Contributor

jpbetz commented Apr 30, 2019

Another approach would be to peek and pop from the heap until a non-expired lease is found and to put all the expired leases in a slice separate from the heap. This avoids most (all?) of the complications of having to look up all the expired leases from the heap.

@nolouch
Contributor Author

nolouch commented Apr 30, 2019

@jpbetz Actually, lease expiry is checked by l.expired() in findExpiredLease, so calling Fix on the leaseItem's time in the heap does not change the lease's actual expiry time; the leaseHeap just notifies the loop to revoke the lease if it is indeed in leaseMap and expired. So Fixing the leaseItem effectively just retries, and if the lease is not found in leaseMap we do not retry again. This is simpler than doing retry backoff in the channel consumer, because backoff would need to handle the different error cases.

@jpbetz
Contributor

jpbetz commented Apr 30, 2019

> @jpbetz Actually, lease expiry is checked by l.expired() in findExpiredLease, so calling Fix on the leaseItem's time in the heap does not change the lease's actual expiry time; the leaseHeap just notifies the loop to revoke the lease if it is indeed in leaseMap and expired. So Fixing the leaseItem effectively just retries, and if the lease is not found in leaseMap we do not retry again.

My thinking with Fix is: instead of doing heap.Pop(); lease.time = now + 3m; heap.Push(), do lease.time = now + 3m; heap.Fix().

> This is simpler than doing retry backoff in the channel consumer, because backoff would need to handle the different error cases.

The more I think about it, the less certain I am that we need any backoff on retry at this point. We might not even need expiredleaseRetryInterval; setting lease.time = now might be sufficient?

Contributor

@jingyih jingyih left a comment

I am just ramping up on the lease code; I've inserted a couple of thoughts.

lease/lessor.go Outdated
// Candidate expirations are caught up, reinsert this item
// and no need to revoke (nothing is expiry)
return l, false, false
}
// if the lease is actually expired, add to the removal list. If it is not expired, we can ignore it because another entry will have been inserted into the heap

heap.Pop(&le.leaseHeap) // O(log N)

// recheck if revoke is complete after retry interval
item.time = now.Add(expiredleaseRetryInterval).UnixNano()
Contributor

@jingyih jingyih Apr 30, 2019

I think this fixes the original issue. My concerns:

  1. Not sure how much overhead this adds. Pushing back to the heap is O(log n), an additional cost for every expiring lease. Eventually we might want to benchmark the performance.

  2. After the push back, item.time no longer matches between leaseMap and leaseHeap. I don't think it causes any issue now, but it might be confusing to have the same lease ID with different times. Alternatively, can we push the lease back to the heap without modifying its time, and simply delay the push instead of pushing back right away? A delayed push has another potential benefit: we could check whether the lease was revoked successfully at the end of the delay, and avoid pushing back to the heap when it was. Conceptually something like:

select {
case <-l.revokec: // lease is revoked successfully
case <-time.After(expiredleaseRetryInterval): // push lease back to heap (no need to modify lease's time)
}

Just some preliminary thoughts; not sure how hard it would be to implement properly (I am aware that the current solution is very straightforward and simple).

lease/lessor.go (outdated review thread, resolved)
@nolouch
Contributor Author

nolouch commented May 1, 2019

@jingyih @jpbetz  Thanks for your comments, I will do some benchmark.

@nolouch
Contributor Author

nolouch commented May 2, 2019

master:

goos: linux
goarch: amd64
pkg: go.etcd.io/etcd/lease
BenchmarkLessorFindExpired1-6           20000000               198 ns/op             128 B/op          1 allocs/op
BenchmarkLessorFindExpired10-6           5000000               239 ns/op             128 B/op          1 allocs/op
BenchmarkLessorFindExpired100-6          5000000               290 ns/op             128 B/op          1 allocs/op
BenchmarkLessorFindExpired1000-6        10000000               288 ns/op             128 B/op          1 allocs/op
BenchmarkLessorFindExpired10000-6       10000000               143 ns/op             128 B/op          1 allocs/op
BenchmarkLessorFindExpired100000-6      10000000               191 ns/op             128 B/op          1 allocs/op
BenchmarkLessorFindExpired1000000-6     10000000               153 ns/op             128 B/op          1 allocs/op
BenchmarkLessorGrant1-6                  1000000              1966 ns/op             967 B/op         10 allocs/op
BenchmarkLessorGrant10-6                 1000000              2051 ns/op             968 B/op         10 allocs/op
BenchmarkLessorGrant100-6                1000000              2002 ns/op             969 B/op         10 allocs/op
BenchmarkLessorGrant1000-6               1000000              1977 ns/op             967 B/op         10 allocs/op
BenchmarkLessorGrant10000-6              1000000              2092 ns/op             967 B/op         10 allocs/op
BenchmarkLessorGrant100000-6             1000000              1954 ns/op             955 B/op         10 allocs/op
BenchmarkLessorGrant1000000-6            1000000              2082 ns/op             960 B/op         10 allocs/op
BenchmarkLessorRenew1-6                 100000000               14.2 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRenew10-6                100000000               14.2 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRenew100-6               100000000               14.2 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRenew1000-6              100000000               14.2 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRenew10000-6             100000000               14.2 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRenew100000-6            100000000               14.2 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRenew1000000-6           100000000               14.2 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRevoke1-6                10000000               180 ns/op               0 B/op          0 allocs/op
BenchmarkLessorRevoke10-6               10000000               180 ns/op               0 B/op          0 allocs/op
BenchmarkLessorRevoke100-6              10000000               180 ns/op               0 B/op          0 allocs/op
BenchmarkLessorRevoke1000-6             10000000               181 ns/op               0 B/op          0 allocs/op
BenchmarkLessorRevoke10000-6            10000000               180 ns/op               0 B/op          0 allocs/op
BenchmarkLessorRevoke100000-6           10000000               180 ns/op               0 B/op          0 allocs/op
BenchmarkLessorRevoke1000000-6          10000000               183 ns/op               0 B/op          0 allocs/op

the branch:

goos: linux
goarch: amd64
pkg: go.etcd.io/etcd/lease
BenchmarkLessorFindExpired1-6           20000000               205 ns/op             128 B/op          1 allocs/op
BenchmarkLessorFindExpired10-6           5000000               318 ns/op             128 B/op          1 allocs/op
BenchmarkLessorFindExpired100-6          5000000               289 ns/op             128 B/op          1 allocs/op
BenchmarkLessorFindExpired1000-6        10000000               261 ns/op             128 B/op          1 allocs/op
BenchmarkLessorFindExpired10000-6       10000000               170 ns/op             128 B/op          1 allocs/op
BenchmarkLessorFindExpired100000-6      10000000               181 ns/op             128 B/op          1 allocs/op
BenchmarkLessorFindExpired1000000-6     10000000               150 ns/op             128 B/op          1 allocs/op
BenchmarkLessorGrant1-6                  1000000              2036 ns/op             969 B/op         10 allocs/op
BenchmarkLessorGrant10-6                 1000000              2094 ns/op             969 B/op         10 allocs/op
BenchmarkLessorGrant100-6                1000000              2087 ns/op             969 B/op         10 allocs/op
BenchmarkLessorGrant1000-6               1000000              1980 ns/op             969 B/op         10 allocs/op
BenchmarkLessorGrant10000-6              1000000              2055 ns/op             968 B/op         10 allocs/op
BenchmarkLessorGrant100000-6             1000000              1930 ns/op             955 B/op         10 allocs/op
BenchmarkLessorGrant1000000-6            1000000              2092 ns/op             960 B/op         10 allocs/op
BenchmarkLessorRenew1-6                 100000000               14.2 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRenew10-6                100000000               14.2 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRenew100-6               100000000               14.2 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRenew1000-6              100000000               14.2 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRenew10000-6             100000000               14.2 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRenew100000-6            100000000               14.2 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRenew1000000-6           100000000               14.2 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRevoke1-6                10000000               184 ns/op               0 B/op          0 allocs/op
BenchmarkLessorRevoke10-6               10000000               184 ns/op               0 B/op          0 allocs/op
BenchmarkLessorRevoke100-6              10000000               184 ns/op               0 B/op          0 allocs/op
BenchmarkLessorRevoke1000-6             10000000               184 ns/op               0 B/op          0 allocs/op
BenchmarkLessorRevoke10000-6            10000000               184 ns/op               0 B/op          0 allocs/op
BenchmarkLessorRevoke100000-6           10000000               184 ns/op               0 B/op          0 allocs/op
BenchmarkLessorRevoke1000000-6          10000000               186 ns/op               0 B/op          0 allocs/op

@nolouch
Contributor Author

nolouch commented May 2, 2019

If keepalive keeps renewing the lease, the leaseQueue will keep growing, because the items are not popped even with a delayed push.

@jingyih
Contributor

jingyih commented May 3, 2019

Benchmark result looks good. Thanks!

I am afraid my previous comments on lease.time, which led to the introduction of a new field lease.delayCheckTime, have made the code less readable. Sorry for all the trouble. Maybe we should revert to simply adjusting lease.time?

@nolouch
Contributor Author

nolouch commented May 4, 2019

@jingyih
I adjusted the heap so that it keeps just one expired item per corresponding lease.
The new bench:

pkg: go.etcd.io/etcd/lease
BenchmarkLessorFindExpired1-6           20000000               248 ns/op             128 B/op          1 allocs/op
BenchmarkLessorFindExpired10-6           5000000               280 ns/op             128 B/op          1 allocs/op
BenchmarkLessorFindExpired100-6          5000000               281 ns/op             128 B/op          1 allocs/op
BenchmarkLessorFindExpired1000-6         5000000               211 ns/op             128 B/op          1 allocs/op
BenchmarkLessorFindExpired10000-6       10000000               152 ns/op             128 B/op          1 allocs/op
BenchmarkLessorFindExpired100000-6      10000000               191 ns/op             128 B/op          1 allocs/op
BenchmarkLessorFindExpired1000000-6     10000000               167 ns/op             128 B/op          1 allocs/op
BenchmarkLessorGrant1-6                  1000000              2410 ns/op            1053 B/op         10 allocs/op
BenchmarkLessorGrant10-6                 1000000              2432 ns/op            1054 B/op         10 allocs/op
BenchmarkLessorGrant100-6                1000000              2322 ns/op            1054 B/op         10 allocs/op
BenchmarkLessorGrant1000-6               1000000              2301 ns/op            1055 B/op         10 allocs/op
BenchmarkLessorGrant10000-6              1000000              2359 ns/op            1052 B/op         10 allocs/op
BenchmarkLessorGrant100000-6             1000000              2274 ns/op            1034 B/op         10 allocs/op
BenchmarkLessorGrant1000000-6            1000000              2527 ns/op            1044 B/op         10 allocs/op
BenchmarkLessorRenew1-6                 100000000               14.3 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRenew10-6                100000000               14.4 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRenew100-6               100000000               14.3 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRenew1000-6              100000000               14.3 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRenew10000-6             100000000               14.4 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRenew100000-6            100000000               14.3 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRenew1000000-6           100000000               14.4 ns/op             0 B/op          0 allocs/op
BenchmarkLessorRevoke1-6                10000000               185 ns/op               0 B/op          0 allocs/op
BenchmarkLessorRevoke10-6               10000000               183 ns/op               0 B/op          0 allocs/op
BenchmarkLessorRevoke100-6              10000000               184 ns/op               0 B/op          0 allocs/op
BenchmarkLessorRevoke1000-6             10000000               184 ns/op               0 B/op          0 allocs/op
BenchmarkLessorRevoke10000-6            10000000               185 ns/op               0 B/op          0 allocs/op
BenchmarkLessorRevoke100000-6           10000000               185 ns/op               0 B/op          0 allocs/op
BenchmarkLessorRevoke1000000-6          10000000               188 ns/op               0 B/op          0 allocs/op


@j2gg0s

j2gg0s commented May 5, 2019

@nolouch @jingyih
BenchmarkLessorFindExpiredX does not run as we expected.
If the input parameter size is less than b.N * 1000 * 2, the reported ns/op is not reliable:
the extra findExpiredLeases calls return immediately with an empty leaseHeap.

Examples using Go's new -benchtime Nx feature:

b.N = 1

go test -bench BenchmarkLessorFindExpired1$ -benchtime 1x --count 10
goos: darwin
goarch: amd64
pkg: go.etcd.io/etcd/lease
BenchmarkLessorFindExpired1-4                  1          10118283 ns/op
BenchmarkLessorFindExpired1-4                  1           9128857 ns/op
BenchmarkLessorFindExpired1-4                  1           5947811 ns/op
BenchmarkLessorFindExpired1-4                  1          12088392 ns/op
BenchmarkLessorFindExpired1-4                  1          10679553 ns/op
BenchmarkLessorFindExpired1-4                  1           8280897 ns/op
BenchmarkLessorFindExpired1-4                  1          10376723 ns/op
BenchmarkLessorFindExpired1-4                  1           9948416 ns/op
BenchmarkLessorFindExpired1-4                  1           8724984 ns/op
BenchmarkLessorFindExpired1-4                  1           5968716 ns/op
PASS
ok      go.etcd.io/etcd/lease   37.192s

b.N = 2

go test -bench BenchmarkLessorFindExpired1$ -benchtime 2x --count 10
goos: darwin
goarch: amd64
pkg: go.etcd.io/etcd/lease
BenchmarkLessorFindExpired1-4                  2           4509106 ns/op
BenchmarkLessorFindExpired1-4                  2           5410520 ns/op
BenchmarkLessorFindExpired1-4                  2           3652399 ns/op
BenchmarkLessorFindExpired1-4                  2           4326552 ns/op
BenchmarkLessorFindExpired1-4                  2           5517978 ns/op
BenchmarkLessorFindExpired1-4                  2           3981831 ns/op
BenchmarkLessorFindExpired1-4                  2           4525518 ns/op
BenchmarkLessorFindExpired1-4                  2           4362394 ns/op
BenchmarkLessorFindExpired1-4                  2           3887954 ns/op
BenchmarkLessorFindExpired1-4                  2           3812382 ns/op
PASS
ok      go.etcd.io/etcd/lease   37.151s

b.N = 10

go test -bench BenchmarkLessorFindExpired1$ -benchtime 10x --count 10
goos: darwin
goarch: amd64
pkg: go.etcd.io/etcd/lease
BenchmarkLessorFindExpired1-4                 10            920539 ns/op
BenchmarkLessorFindExpired1-4                 10            664028 ns/op
BenchmarkLessorFindExpired1-4                 10            884361 ns/op
BenchmarkLessorFindExpired1-4                 10           1036946 ns/op
BenchmarkLessorFindExpired1-4                 10           1006975 ns/op
BenchmarkLessorFindExpired1-4                 10            752624 ns/op
BenchmarkLessorFindExpired1-4                 10            902464 ns/op
BenchmarkLessorFindExpired1-4                 10           1007714 ns/op
BenchmarkLessorFindExpired1-4                 10            591109 ns/op
BenchmarkLessorFindExpired1-4                 10           1093682 ns/op
PASS
ok      go.etcd.io/etcd/lease   37.061s
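A hedged, self-contained illustration of the measurement pitfall (not the actual etcd benchmark): when setup state is consumed by the first iterations, the remaining iterations measure an empty structure, so ns/op shrinks as b.N grows. Rebuilding the state inside the loop with the timer stopped keeps ns/op meaningful.

```go
package main

import (
	"fmt"
	"testing"
)

// expensiveSetup builds the state one iteration consumes; a stand-in for
// populating the lessor with expiring leases.
func expensiveSetup(n int) []int {
	s := make([]int, n)
	for i := range s {
		s[i] = n - i
	}
	return s
}

// drain is the operation under measurement; a stand-in for findExpiredLeases
// emptying the heap.
func drain(s []int) int {
	sum := 0
	for _, v := range s {
		sum += v
	}
	return sum
}

func BenchmarkDrain(b *testing.B) {
	for i := 0; i < b.N; i++ {
		b.StopTimer()
		s := expensiveSetup(1000) // rebuild state every iteration, untimed
		b.StartTimer()
		drain(s) // only this is measured
	}
}

func main() {
	res := testing.Benchmark(BenchmarkDrain)
	fmt.Println("iterations:", res.N)
}
```

Without the StopTimer/StartTimer pair, only the first iteration would do real work against freshly built state, which is the effect described above.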

lease/lease_queue.go (review thread, resolved)
lease/lessor.go Outdated
heap.Pop(&le.leaseHeap) // O(log N)
// recheck if revoke is complete after retry interval
item.time = now.Add(le.expiredLeaseRetryInterval).UnixNano()
le.leaseHeap.Push(item)

We find expired leases and push them back onto leaseHeap before sending them to expiredC.
Maybe it would be better to do this in revokeExpiredLeases,
and check whether a lease is really expired in findExpiredLeases?

Contributor Author

Hmm... what if the lease has actually expired but has not been deleted from leaseMap?

@yuqitao

yuqitao commented May 5, 2019

@nolouch @jingyih
I found something:

default:

It seems to be the reason for the revoke failure.

// the receiver of expiredC is probably busy handling
// other stuff
// let's try this next time after 500ms

The buffer size of expiredC is 16; if the buffer is full, the default branch runs.
But the leases have already been popped by expireExists.

Just some preliminary thoughts; I have not confirmed it yet, and will do so later.
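This scenario is easy to reproduce in isolation: a select with a default branch silently skips the send once the buffered channel is full (buffer size 2 here stands in for expiredC's 16).

```go
package main

import "fmt"

// trySend attempts a non-blocking send; the default branch runs whenever
// the channel buffer has no room, dropping the value.
func trySend(ch chan int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		// receiver is busy; the send is dropped and must be retried later
		return false
	}
}

func main() {
	ch := make(chan int, 2)
	fmt.Println(trySend(ch, 1)) // true
	fmt.Println(trySend(ch, 2)) // true
	fmt.Println(trySend(ch, 3)) // false: buffer full, default branch runs
}
```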

@j2gg0s

j2gg0s commented May 5, 2019

Maybe we can fix this with another simple solution.

In runLoop, we use an array to save leases that have been sent to expiredC but have not yet been confirmed revoked.
In every loop, we first check it and find the failed cases.

func (le *lessor) runLoop() {
	defer close(le.doneC)

	expiredLeaseRetryInterval := 3 * time.Second
	n := int(expiredLeaseRetryInterval / (500 * time.Millisecond))
	cycleLeaseIDs := make([][]LeaseID, n)
	for i := range cycleLeaseIDs {
		// a fixed-size ring buffer would be a better fit here
		cycleLeaseIDs[i] = make([]LeaseID, 0)
	}

	current := 0
	for {
		cycleLeaseIDs[current] = le.revokeExpiredLeases(cycleLeaseIDs[current])
		le.checkpointScheduledLeases()

		select {
		case <-time.After(500 * time.Millisecond):
		case <-le.stopC:
			return
		}

		current += 1
		if current >= n {
			current -= n
		}
	}
}

// revokeExpiredLeases finds all leases past their expiry and sends them to the
// expired channel to be revoked.
func (le *lessor) revokeExpiredLeases(leaseIDs []LeaseID) []LeaseID {
	var ls []*Lease

	// filter out leases that have already been revoked, by checking itemMap
	// ls = filterRevoked(leaseIDs)

	// rate limit
	revokeLimit := leaseRevokeRate / 2

	le.mu.RLock()
	if le.isPrimary() {
		ls = le.findExpiredLeases(revokeLimit - len(ls))
	}
	le.mu.RUnlock()

	if len(ls) != 0 {
		select {
		case <-le.stopC:
			return nil
		case le.expiredC <- ls:
		default:
			// the receiver of expiredC is probably busy handling
			// other stuff; let's try this next time after 500ms
		}
		leaseIDs = make([]LeaseID, len(ls))
		for i, lease := range ls {
			leaseIDs[i] = lease.ID
		}
		return leaseIDs
	}
	return make([]LeaseID, 0)
}

@nolouch need some advice.

@j2gg0s

j2gg0s commented May 5, 2019

@yuqitao
This branch may cause this problem.
However, judging by the log in the issue, the send to expiredC succeeded.

2018/07/17 21:19:47.943 log.go:82: [warning] etcdserver: [failed to revoke 093a64a83ac42057 ("context deadline exceeded"

@nolouch
Contributor Author

nolouch commented May 6, 2019

@j2gg0s I think your method would also work, but it has the same issue: you have to hold all expired items for the same lease if the lease is not actually expired. Also, if the array grows, you need to iterate over it, which may be expensive too.

@yuqitao

yuqitao commented May 6, 2019

> @yuqitao
> This branch may cause this problem.
> However, judging by the log in the issue, the send to expiredC succeeded.
>
> 2018/07/17 21:19:47.943 log.go:82: [warning] etcdserver: [failed to revoke 093a64a83ac42057 ("context deadline exceeded"

my mistake

@nolouch
Contributor Author

nolouch commented May 9, 2019

I used the failpoint test and it works fine; PTAL again @jpbetz @jingyih. Also, setting lease.time = now might make findExpiredLease always return only the first item.

 Testing without failpoint
failed to delete failpoint failpoint: failpoint is disabled
OK
lease 694d6a9c99918204 granted with TTL(3s), remaining(2s), attached keys([k1])
lease 694d6a9c99918204 granted with TTL(3s), remaining(1s), attached keys([k1])
lease 694d6a9c99918204 granted with TTL(3s), remaining(0s), attached keys([k1])
lease 694d6a9c99918204 granted with TTL(3s), remaining(0s), attached keys([k1])
lease 694d6a9c99918204 already expired
Testing with failpoint
OK
lease 694d6a9c99918208 granted with TTL(3s), remaining(2s), attached keys([k1])
lease 694d6a9c99918208 granted with TTL(3s), remaining(1s), attached keys([k1])
lease 694d6a9c99918208 granted with TTL(3s), remaining(0s), attached keys([k1])
lease 694d6a9c99918208 granted with TTL(3s), remaining(0s), attached keys([k1])
lease 694d6a9c99918208 granted with TTL(3s), remaining(-1s), attached keys([k1])
Delete failpoint
lease 694d6a9c99918208 granted with TTL(3s), remaining(-2s), attached keys([k1])
lease 694d6a9c99918208 granted with TTL(3s), remaining(-3s), attached keys([k1])
lease 694d6a9c99918208 granted with TTL(3s), remaining(-4s), attached keys([k1])
lease 694d6a9c99918208 granted with TTL(3s), remaining(-5s), attached keys([k1])
lease 694d6a9c99918208 granted with TTL(3s), remaining(-6s), attached keys([k1])
lease 694d6a9c99918208 granted with TTL(3s), remaining(-7s), attached keys([k1])
lease 694d6a9c99918208 already expired
lease 694d6a9c99918208 already expired
lease 694d6a9c99918208 already expired

@yuqitao

yuqitao commented May 9, 2019

Is beforeRevokeExipredLease true at first in your test? When it becomes false, the lease is deleted.

I ask this just to resolve my own confusion about this test.

Rechecking the lease can fix revoke failures that occur occasionally. But if the cause of the revoke failure is something that occurs frequently, e.g. an environment problem, there will be too many retried leases.

@nolouch
Contributor Author

nolouch commented May 10, 2019

@yuqitao Thanks.
Delete failpoint makes it become false in the test.
As for the retried leases: we only retry the revoke if the lease is actually still in leaseMap after ExpiredLeasesRetryInterval; otherwise it is unregistered here:

	item := le.leaseExpiredNotifier.Poll()
	l = le.leaseMap[item.id]
	if l == nil {
		// lease has expired or been revoked
		// no need to revoke (nothing is expiry)
		le.leaseExpiredNotifier.Unregister() // O(log N)
		return nil, false, true
	}

It's an in-memory operation, and I now maintain only one item per lease, so I just delay deleting the item for a while. If you are worried about duplicate revokes, I think those also occur only occasionally.
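A rough sketch of the one-item-per-lease bookkeeping described above (names and the map-only structure are illustrative; the real leaseExpiredNotifier also keeps a time-ordered heap): registering an already-known lease updates its existing entry instead of adding a duplicate, so the queue cannot grow unboundedly under repeated renewals.

```go
package main

import "fmt"

// item is an illustrative stand-in for a lease expiry entry.
type item struct {
	id   int64
	time int64
}

// notifier keeps at most one entry per lease ID.
type notifier struct {
	items map[int64]*item
}

func newNotifier() *notifier { return &notifier{items: map[int64]*item{}} }

// RegisterOrUpdate inserts the item, or updates the time of the existing
// entry for the same lease ID in place.
func (n *notifier) RegisterOrUpdate(it *item) {
	if old, ok := n.items[it.id]; ok {
		old.time = it.time // update in place; no duplicate entry
		return
	}
	n.items[it.id] = it
}

// Unregister removes the entry for a lease that is gone from leaseMap.
func (n *notifier) Unregister(id int64) { delete(n.items, id) }

func (n *notifier) Len() int { return len(n.items) }

func main() {
	n := newNotifier()
	n.RegisterOrUpdate(&item{id: 1, time: 100})
	n.RegisterOrUpdate(&item{id: 1, time: 200}) // same lease: updated, not duplicated
	fmt.Println(n.Len()) // 1
}
```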

@xiang90
Contributor

xiang90 commented May 14, 2019

/cc @jpbetz @jingyih can you take another look?

@jingyih jingyih self-assigned this May 16, 2019
@jingyih
Contributor

jingyih commented May 21, 2019

It appears that the lease benchmarks are broken. Let's get #10710 reviewed and merged so we can do benchmarks properly.

@jingyih
Contributor

jingyih commented May 22, 2019

@jpbetz Lease checkpointing has similar behavior. Is it OK if the leases are already popped out of the leaseCheckpointHeap but fail to be checkpointed?

Contributor

@jingyih jingyih left a comment

I will look at the unit tests later. Otherwise the code change looks good. Let's wait for the benchmarking to be fixed so we can make sure there is no substantial performance regression.

etcdserver/server.go (outdated review thread, resolved)
lease/lease_queue.go (outdated review thread, resolved)
lease/lessor.go (outdated review thread, resolved)
@nolouch
Contributor Author

nolouch commented May 27, 2019

PTAL @jingyih

Contributor

@jingyih jingyih left a comment

lgtm after the fixes.

lease/lessor.go (outdated review thread, resolved)
lease/lease_queue_test.go (outdated review thread, resolved)
lease/lease_queue_test.go (outdated review thread, resolved)
Signed-off-by: nolouch <nolouch@gmail.com>
@nolouch
Contributor Author

nolouch commented May 29, 2019

@jingyih @j2gg0s Thanks! I've addressed the comments and rebased the commits.

@jingyih
Contributor

jingyih commented May 31, 2019

lgtm

Please note that the lease benchmark is fixed in another PR and the result comparison is posted at #10710 (comment).
Findings:

  • BenchmarkLessorFindExpired is slower due to an additional O(log n) push (needed to fix the bug). (EDIT: Just realized there is no additional O(log n) anymore; the actual reason is the same as stated in the next item.) I think this is acceptable.
  • BenchmarkLessorRenew is slower, probably due to updating the entry in the heap instead of just pushing a new entry (with the same lease ID) onto the heap. This change is beneficial when the same lease is renewed repeatedly.

@xiang90 xiang90 merged commit d8e2e47 into etcd-io:master May 31, 2019
@nolouch nolouch deleted the fix-lease branch June 3, 2019 02:39

Successfully merging this pull request may close these issues.

The key is not deleted when the bound lease expires
7 participants