New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lease: refactor lease's benchmark. #10710
Conversation
The final result
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for finding the bugs in benchmark and fixing them!!! I just read through the existing benchmark code and agree with you on the bugs you found. For your changes, I read through benchmarkLessorGrant
and added few comments.
lease/lessor_bench_test.go
Outdated
func BenchmarkLessorRevoke10000(b *testing.B) { benchmarkLessorRevoke(10000, b) } | ||
func BenchmarkLessorRevoke100000(b *testing.B) { benchmarkLessorRevoke(100000, b) } | ||
func BenchmarkLessorRevoke1000000(b *testing.B) { benchmarkLessorRevoke(1000000, b) } | ||
// NOTE: run with --benchtime Nx, please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why? Is -benchtime 10s
not working?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ns
working, however the actual use time may exceeding our exceptions because this implementation.
lease git:(refactor-lease-bench-test) ✗ ~/go/src/github.com/golang/go/bin/go test --bench BenchmarkLessorRevoke100000 -benchtime 1s
goos: darwin
goarch: amd64
pkg: go.etcd.io/etcd/v3/lease
BenchmarkLessorRevoke100000-4 8995898 133 ns/op
PASS
ok go.etcd.io/etcd/v3/lease 83.956s
With benchtime Ns
, golang estimate size of benchmark through the last time consuming.
We spend a lot of time to construct fixture and let golang ignore it.
lease/lessor_bench_test.go
Outdated
func BenchmarkLessorFindExpired10000(b *testing.B) { benchmarkLessorFindExpired(10000, b) } | ||
func BenchmarkLessorFindExpired100000(b *testing.B) { benchmarkLessorFindExpired(100000, b) } | ||
|
||
func benchmarkLessorGrant(size int, b *testing.B) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we document what the benchmark test does? I think a high-level description is useful, so that people can understand the benchmark test without reading code details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the current code is wrong, but I don't think my implementation is good enough.
So, I push my code and expect some advices.
After having a clear and certain plan, I am happy to fix these.
lease/lessor_bench_test.go
Outdated
le.Promote(0) | ||
for i := 0; i < size; i++ { | ||
le.Grant(LeaseID(i), int64(100+i)) | ||
func BenchmarkLessorGrant10000(b *testing.B) { benchmarkLessorGrant(10000, b) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why removing other BenchmarkLessorGrantXXX?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, we use benchmark to find:
- RT under massive data.
- With the size grows, RT increases significantly?
- Is there deadlock?
So, I think BenchmarkLessorGrant100
has little meaning.
My implementation make BenchmarkLessorGrant100
use too much time with huge b.N.
~/go/src/github.com/golang/go/bin/go test -bench BenchmarkLessorGrant100$ -benchtime 100000x
goos: darwin
goarch: amd64
pkg: go.etcd.io/etcd/v3/lease
BenchmarkLessorGrant100-4 100000 2296 ns/op
PASS
ok go.etcd.io/etcd/v3/lease 102.638s
lease/lessor_bench_test.go
Outdated
le.mu.Lock() //Stop the findExpiredLeases call in the runloop | ||
defer le.mu.Unlock() | ||
|
||
maxTTL := int64((size + refresh) / 100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I do not understand what is maxTTL and how it is used in this benchmark?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I control the probability of repetition ttl by minimum and maximum value.
Comment is need here.
lease/lessor_bench_test.go
Outdated
} | ||
le, tearDown = setUp() | ||
for j := 0; j < size; j++ { | ||
le.Grant(LeaseID(j+1), ttls[j]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to hold le.mu
in order to prevent leases from being auto-revoked by the lessor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto-revoked means expire? We can limit the minimum ttl to avoid lease expire before test end.
I do not have brilliant ideas on how to benchmark this. An alternative way would be to start from empty heap in each benchmark run, something like:
I think this is a simpler / more easy-to-read way of doing this. But it benchmarks something different than the original. |
I also thought about benchmarking lease grant and revoke together (however only time grant or revoke), so that we can keep the heap size constant. But lease revoke does not remove the corresponding item from heap, so heap size still increases after each benchmark run.
|
I feel like we do not need the benchmark to be super accurate, but the current benchmark is not providing us any relavant informantion. We just need something that is roughly accurate to unblock #10693. |
@jingyih
I also use
|
Codecov Report
@@ Coverage Diff @@
## master #10710 +/- ##
=========================================
Coverage ? 63.19%
=========================================
Files ? 392
Lines ? 37160
Branches ? 0
=========================================
Hits ? 23482
Misses ? 12091
Partials ? 1587
Continue to review full report at Codecov.
|
@j2gg0s I will take a look tomorrow. Thanks for doing benchmark for the other PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Added few comments.
le.Grant(LeaseID(i+size), int64(100+i+size)) | ||
// MinLeaseTTL is negative, so we can grant expired lease in benchmark. | ||
le = newLessor(lg, be, LessorConfig{MinLeaseTTL: -100}) | ||
le.SetRangeDeleter(func() TxnDelete { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making revoke actually deleting leases in leaseMap in the test.
@@ -909,3 +909,9 @@ func (fl *FakeLessor) ExpiredLeasesC() <-chan []*Lease { return nil } | |||
func (fl *FakeLessor) Recover(b backend.Backend, rd RangeDeleter) {} | |||
|
|||
func (fl *FakeLessor) Stop() {} | |||
|
|||
type FakeTxnDelete struct { | |||
backend.BatchTx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to include backend BatchTx? Can we simply makes End() to do nothing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need backend's lock.
- Benchmark save data to tmp file by backend, and backend only support unsafe op (no lock).
- Before save/delete data, we need call
RangeDeleter
to lock. - After save/delete data, we need call
End
to unlock.
And use lock of backend make bench more realistic.
lease/lessor_bench_test.go
Outdated
le.Grant(LeaseID(i), int64(100+i)) | ||
func benchmarkLessorGrant(benchSize int, b *testing.B) { | ||
// avoid lease expire when benchmark | ||
ttls := randomTTL(benchSize, 10, 100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are the min and max chosen? Do they always guarantee that no leases expire during the benchmark?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The best:
we chosen suitable min and max, make the repeat probability of ttls meet the prod situation.
This is too hard. So i choose two random values.
10 second can avoid expire under 99.99%.
And under other 0.01%, the benchmark will run too long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I do not fully understand. It seems that the safe value of min TTL should depend on benchSize and how fast the test machine runs? Maybe 10s is large enough to cover all cases? Can we use really large values to make it obvious that the leases won't expire during the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I change min ttl to 1000 second.
In benchmark, set ExpiredLeasesRetryInterval with 10 microsecond, so benchmark of findExpired will recheck expired lease. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm after nits.
lease/lessor_bench_test.go
Outdated
le.Grant(LeaseID(i+size), int64(100+i+size)) | ||
// MinLeaseTTL is negative, so we can grant expired lease in benchmark. | ||
// ExpiredLeasesRetryInterval should small, so benchmark of findExpired will recheck expired lease. | ||
le = newLessor(lg, be, LessorConfig{MinLeaseTTL: -100, ExpiredLeasesRetryInterval: 10 * time.Microsecond}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-100 is not enough for the TTLs in benchmarkLessorFindExpired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
lease/lessor_bench_test.go
Outdated
b.StartTimer() | ||
|
||
// refresh fixture after pop all expired lease | ||
for ;;i++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test failed due to formatting:
https://travis-ci.com/etcd-io/etcd/jobs/205296470#L588
lgtm |
Lease's benchmark were added when we optimize lease expiration check in #9418.
However, some problems exist.
For
benchmarkLessorFindExpired
, lease's ttl is always more than 100 second.When call
findExpiredLeases
, we always only compare leaseHeap's first element ttl with current time and return immediately.Lessor is not primary, so
Renew
do nothing andGrant
ignore ttl.We always call
Grant
with increasing ttl.leaseHeap
degenerates into sorted array.Wrong use size.
Such as BenchmarkLessorGrant1 and BenchmarkLessorGrant10000. Assume b.N is 500000. For BenchmarkLessorGrant1, we call
Grant
500001 times and observer last 500000 call. For BenchmarkLessorGrant10000, we callGrant
510000 times and observer last 500000 call. I don't think the results of these two tests will be too different.