lease: Persist remainingTTL to prevent indefinite auto-renewal of long lived leases #9924

Merged
merged 4 commits into etcd-io:master from jpbetz:persist-lease-deadline on Jul 24, 2018

Conversation

@jpbetz
Contributor

commented Jul 14, 2018

Fixes #9888 by introducing a "lease checkpointing" mechanism.

The basic idea is that for all leases with TTLs greater than 5 minutes, the remaining TTL will be checkpointed every 5 minutes, so that if a new leader is elected, the leases are not auto-renewed to their full TTL but only to the remaining TTL from the last checkpoint. A checkpoint is an entry persisted to the RAFT consensus log that records the remainingTTL as determined by the leader at the time the checkpoint occurred.
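A rough sketch of the scheduling side (LeaseWithTime and leaseCheckpointHeap follow the patch; the constant and helper names here are illustrative, not the exact code):

import (
    "container/heap"
    "time"
)

// Sketch only: when a long-lived lease is granted or promoted, schedule
// its next checkpoint 5 minutes out.
const leaseCheckpointInterval = 5 * time.Minute // illustrative name

func (le *lessor) scheduleNextCheckpoint(l *Lease) {
    if l.ttl <= int64(leaseCheckpointInterval/time.Second) {
        return // leases with short TTLs are never checkpointed
    }
    heap.Push(&le.leaseCheckpointHeap, &LeaseWithTime{
        id:   l.ID,
        time: time.Now().Add(leaseCheckpointInterval).UnixNano(),
    })
}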

If keep-alive is called on a lease that has been checkpointed, the remaining TTL will be cleared by a checkpoint entry in the RAFT consensus log with remainingTTL=0, indicating that it is unset and that the original TTL should be used.

All checkpointing is scheduled and performed by the leader, and when a new leader is elected, it takes over checkpointing as part of lease.Promote.

An advantage of this approach is that even leases whose keep-alive is called frequently will still write at most two entries to the RAFT consensus log every 5 minutes: only the first keep-alive after a checkpoint must be recorded, and all subsequent keep-alives can be ignored.
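Sketched in code, the keep-alive path looks something like the following (simplified; locking and error handling omitted, and exact signatures may differ from the patch):

// Sketch: restore the full TTL on keep-alive.
func (le *lessor) Renew(id LeaseID) (int64, error) {
    l, ok := le.leaseMap[id]
    if !ok {
        return -1, ErrLeaseNotFound
    }
    // If a checkpoint recorded a remaining TTL, persist one clearing
    // entry (remainingTTL=0); later keep-alives see remainingTTL == 0
    // and skip the log write entirely.
    if l.remainingTTL > 0 && le.cp != nil {
        le.cp(context.Background(), &pb.LeaseCheckpointRequest{
            Checkpoints: []*pb.LeaseCheckpoint{{ID: int64(id), Remaining_TTL: 0}},
        })
    }
    l.refresh(0) // renew to the full, original TTL
    return l.ttl, nil
}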

Additionally, to prevent this mechanism from degrading system performance, it is designed to be best effort. There is a limit on how many checkpoints can be persisted per second and on how many pending checkpoint operations can be scheduled. If these limits are reached, checkpoints may not be scheduled or written to the RAFT consensus log, preventing the checkpointing operations from overwhelming the system, which could otherwise occur if large volumes of long-lived leases were granted.
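One way to express the per-second limit, as a sketch (the patch itself uses plain constants and counters; golang.org/x/time/rate here is purely illustrative, and the pb import path follows the repo at the time):

import (
    "context"

    "golang.org/x/time/rate"

    pb "github.com/coreos/etcd/etcdserver/etcdserverpb"
)

// Best effort: drop checkpoints rather than block when over budget. A
// lost checkpoint only means a lease may renew to a somewhat longer
// remaining TTL after the next leader election.
var checkpointLimiter = rate.NewLimiter(1000, 1000) // illustrative values

func (le *lessor) trySubmitCheckpoints(cps []*pb.LeaseCheckpoint) {
    if !checkpointLimiter.Allow() {
        return // skip instead of degrading write throughput
    }
    le.cp(context.Background(), &pb.LeaseCheckpointRequest{Checkpoints: cps})
}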

cc @gyuho @wenjiaswe @jingyih

@gyuho
Member

left a comment

I will have another look next week as well. Just a quick question from a first pass: if findDueScheduledCheckpoints returns multiple leases, have we thought about batching them all in one raft request?

@@ -57,6 +70,10 @@ type TxnDelete interface {
// RangeDeleter is a TxnDelete constructor.
type RangeDeleter func() TxnDelete

// Checkpointer permits checkpointing of lease remaining TTLs to the concensus log. Defined here to

@gyuho

gyuho Jul 14, 2018

Member

s/concensus/consensus/?

@jpbetz

jpbetz Jul 16, 2018

Author Contributor

Fixed, thanks!

}

// checkpointScheduledLeases finds all scheduled lease checkpoints that are due and
// submits them to the checkpointer to persist them to the concensus log.

@gyuho

gyuho Jul 14, 2018

Member

s/concensus/consensus/ :)

@jpbetz

Contributor Author

commented Jul 16, 2018

@gyuho Batching only briefly crossed my mind, but it's something we should clearly do. I'll add it shortly.

    return l.remainingTTL
} else {
    return l.ttl
}

@gyuho

gyuho Jul 16, 2018

Member

no need for else? just return l.ttl, following Go idioms? :)

@jpbetz

jpbetz Jul 16, 2018

Author Contributor

sounds good! This is one of the hardest idioms to unlearn from other languages that do the opposite :)
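For reference, the else-less form being suggested (assuming the guard is on remainingTTL, which the diff above does not show):

if l.remainingTTL > 0 {
    return l.remainingTTL
}
return l.ttl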

@jpbetz jpbetz force-pushed the jpbetz:persist-lease-deadline branch 3 times, most recently from 9392bab to e463c07 Jul 16, 2018

@jpbetz jpbetz changed the title from "[WIP] lease: Persist remainingTTL to prevent indefinite auto-renewal of long lived leases" to "lease: Persist remainingTTL to prevent indefinite auto-renewal of long lived leases" Jul 17, 2018

@jpbetz jpbetz removed the WIP label Jul 17, 2018

@jpbetz

Contributor Author

commented Jul 17, 2018

Due to the size of this PR, I'll split it into three commits:

  • .proto change and resulting codegen
  • Lessor config and logging change
  • checkpointing mechanism
jpbetz added 2 commits Jul 17, 2018

@jpbetz jpbetz force-pushed the jpbetz:persist-lease-deadline branch from e463c07 to ec26ef2 Jul 17, 2018

        return cps
    }
    heap.Pop(&le.leaseCheckpointHeap)
    if l, ok := le.leaseMap[lt.id]; ok {

@xiang90

xiang90 Jul 17, 2018

Contributor

we probably need to remove a few indentations here.

if l, ok := ...; !ok {
    continue
}
...

@jpbetz

jpbetz Jul 17, 2018

Author Contributor

Sounds good. I'll flatten this down.
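Flattened with guard clauses, the loop body would look roughly like this (a sketch; names follow the patch where shown above, the rest is illustrative):

now := time.Now().UnixNano()
for le.leaseCheckpointHeap.Len() > 0 {
    lt := le.leaseCheckpointHeap[0]
    if lt.time > now {
        return cps // nothing further is due yet
    }
    heap.Pop(&le.leaseCheckpointHeap)
    l, ok := le.leaseMap[lt.id]
    if !ok {
        continue // lease was revoked; drop its scheduled checkpoint
    }
    cps = append(cps, &pb.LeaseCheckpoint{
        ID:            int64(lt.id),
        Remaining_TTL: int64(l.Remaining().Seconds()),
    })
}
return cps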


// Limit the total number of scheduled checkpoints, checkpoint should be best effort and it is
// better to throttle checkpointing than to degrade performance.
maxScheduledCheckpoints = 10000

@xiang90

xiang90 Jul 17, 2018

Contributor

how do we come up with these default values? have you done any benchmark?

would it be helpful if we make the checkpoint api accept multiple leases as a batch?

@jpbetz

jpbetz Jul 17, 2018

Author Contributor

how do we come up with these default values? have you done any benchmark?

Not yet, but I need to. I'm betting these numbers can be much higher. I'll do some benchmarking this week. For the scheduling, we just need to keep the heap to a reasonable size, so I might look at typical etcd memory footprints and use that to help establish a limit based on the worst-case memory utilization we're able to accept.

would it be helpful if we make the checkpoint api accept multiple leases as a batch?

We just added the batching of lease checkpointing yesterday (proto change) per @gyuho's suggestion. Since this is not clear from how the leaseCheckpointRate constant is defined, I'll clear that up with some code changes, maybe by defining a maxLeaseCheckpointBatchSize and using leaseCheckpointRate to define how many batched checkpoint operations can occur per second, which I might set quite low once we have batching.
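Something along these lines (a sketch; the values shown are the ones the benchmarks below eventually settled on):

const (
    // maxLeaseCheckpointBatchSize is the maximum number of lease
    // checkpoints packed into a single consensus log entry.
    maxLeaseCheckpointBatchSize = 1000

    // leaseCheckpointRate bounds how many lease checkpoints may be
    // submitted to the consensus log per second.
    leaseCheckpointRate = 1000
)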

@xiang90

Contributor

commented Jul 17, 2018

The approach looks good to me. We need to have some benchmarks to show the overhead is acceptable in normal cases.

@jpbetz jpbetz force-pushed the jpbetz:persist-lease-deadline branch from ec26ef2 to 904b906 Jul 17, 2018

@jpbetz

Contributor Author

commented Jul 17, 2018

The approach looks good to me. We need to have some benchmarks to show the overhead is acceptable in normal cases.

Thanks @xiang90. I'll post a full benchmark shortly.

@jpbetz jpbetz force-pushed the jpbetz:persist-lease-deadline branch from 904b906 to c939c0a Jul 18, 2018

@codecov-io


commented Jul 18, 2018

Codecov Report

Merging #9924 into master will increase coverage by 0.03%.
The diff coverage is 90%.

@@            Coverage Diff             @@
##           master    #9924      +/-   ##
==========================================
+ Coverage   68.99%   69.03%   +0.03%     
==========================================
  Files         386      386              
  Lines       35792    35891      +99     
==========================================
+ Hits        24695    24776      +81     
- Misses       9296     9300       +4     
- Partials     1801     1815      +14
Impacted Files Coverage Δ
etcdserver/config.go 79.51% <ø> (ø) ⬆️
lease/lease_queue.go 100% <100%> (ø) ⬆️
integration/cluster.go 82.17% <100%> (+0.05%) ⬆️
etcdserver/server.go 73.6% <100%> (+0.05%) ⬆️
clientv3/snapshot/v3_snapshot.go 64.75% <100%> (ø) ⬆️
etcdserver/apply.go 88.87% <75%> (-0.19%) ⬇️
lease/lessor.go 87.62% <90.35%> (+0.83%) ⬆️
client/keys.go 73.86% <0%> (-17.59%) ⬇️
pkg/tlsutil/tlsutil.go 86.2% <0%> (-6.9%) ⬇️
pkg/netutil/netutil.go 63.11% <0%> (-6.56%) ⬇️
... and 20 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3f725e1...d1de41e. Read the comment docs.

    index int
    id    LeaseID
    // Unix nanos timestamp.
    time int64

@gyuho

gyuho Jul 19, 2018

Member

Can we comment this time field? It can be either expiration timestamp or checkpoint timestamp. Took me a while to find how time is used :)

@jpbetz

jpbetz Jul 19, 2018

Author Contributor

Looks like the field rename from expiration to time only got me from misleading to unclear. I'll add a comment and see if there is anything else I should do to make this more obvious.

@jpbetz

jpbetz Jul 23, 2018

Author Contributor

Added a couple comments to both lease_queue.go and the two places where the time field is used in lessor.go.
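The added comment reads roughly like this (a paraphrased sketch, not the verbatim patch):

type LeaseWithTime struct {
    id LeaseID
    // time is a Unix-nanos timestamp whose meaning depends on the heap
    // holding the item: the lease's expiration time in the expiry heap,
    // or the time the next checkpoint is due in leaseCheckpointHeap.
    time  int64
    index int
}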

@jpbetz jpbetz force-pushed the jpbetz:persist-lease-deadline branch from c939c0a to 37b7484 Jul 23, 2018

@jpbetz jpbetz force-pushed the jpbetz:persist-lease-deadline branch from 37b7484 to d1de41e Jul 23, 2018

@jpbetz

Contributor Author

commented Jul 23, 2018

@xiang90 @gyuho

Ran two benchmarks:

Checkpoint heap size benchmark

Checked etcd server heap size up to 10,000,000 live leases.

  • With checkpointing: 3.3GB
  • Without checkpointing: 3.3GB

This makes sense given that the heap is a slice of structs containing only three int64s, so the total memory usage for all the entries is only about 40MB, a bit more than 1% of the total memory utilization. I've removed the limit on this heap as it does not appear to be needed.

Checkpoint rate limit benchmark

Set leases to checkpoint every 1s, created 15k of them, and then checked server performance with benchmark put while the checkpointing was happening concurrently. This was with a 3-member etcd cluster on localhost.

  • Without checkpointing - write latency ~0.006ms
  • With checkpointing, maxLeaseCheckpointBatchSize=1 (no batching of checkpoints in RAFT log) - write latency ~0.015ms
  • With checkpointing, maxLeaseCheckpointBatchSize=1000 - write latency ~0.008ms
  • With checkpointing, maxLeaseCheckpointBatchSize=1000, leaseCheckpointRate=10000 - write latency ~0.008ms
  • With checkpointing, maxLeaseCheckpointBatchSize=1000, leaseCheckpointRate=1000 - write latency ~0.006ms

Since 1,000,000 checkpoints per sec seems sufficient, and the limits of maxLeaseCheckpointBatchSize=1000, leaseCheckpointRate=1000 appear to have negligible impact on performance, I've gone with those settings.

@gyuho

Member

commented Jul 23, 2018

With checkpointing, maxLeaseCheckpointBatchSize=1000, leaseCheckpointRate=1000 - write latency ~0.006ms

Results look good to me. Thanks for benchmarks!

@gyuho
gyuho approved these changes Jul 23, 2018
Member

left a comment

lgtm /cc @xiang90

@gyuho

Member

commented Jul 23, 2018

@jpbetz Also, can you add this to CHANGELOG? Just separate commit or PR should be fine. Thanks.

@xiang90

Contributor

commented Jul 24, 2018

LGTM

@jpbetz jpbetz merged commit 750b87d into etcd-io:master Jul 24, 2018

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
semaphoreci The build passed on Semaphore.
@hexfusion hexfusion referenced this pull request Sep 13, 2018