etcd: extend rate limiting to consider the number of inflight requests #25817
/test
Related etcd issue: etcd-io/etcd#15993
2d44c0b
to
5d74763
Compare
5d74763
to
496ca38
Compare
@@ -1162,7 +1214,6 @@ func (e *etcdClient) determineEndpointStatus(ctx context.Context, endpointAddres
 	e.getLogger().Debugf("Checking status to etcd endpoint %s", endpointAddress)

-	e.limiter.Wait(ctxTimeout)
I've dropped the rate limiter wait here to prevent the status check from failing due to rate limiting, since that would put high pressure on etcd if the connection gets restarted.
496ca38
to
2dcb6ca
Compare
The last push adds support for increasing the metrics when the rate limiter wait fails.
2dcb6ca
to
18b4f0d
Compare
Rebased onto main to pick up the CI fixes. I've additionally added a new commit to assign
/test
pkg/kvstore/etcd.go
Outdated
-	e.limiter.Wait(ctx)
+	lr, err := e.limiter.Wait(ctx)
+	if err != nil {
+		increaseMetric(key, metricRead, "GetLocked", duration.EndError(err).Total(), err)
Can we do `defer increaseMetric(key, metricRead, "GetLocked", duration.EndError(err).Total(), err)` two lines above (not inside the `if`) and remove the other `increaseMetric` call below?
We cannot with the current `increaseMetric` implementation, since `duration.EndError(err).Total()` would be evaluated immediately, when the `defer` statement runs, so the duration would not be measured correctly. It should work if we modify `increaseMetric` to take the span as input (rather than the pre-computed duration), but the change tends to be on the large side. WDYT?
Ah yeah, we could do that in a separate PR.
I'm giving it a try. If it ends up being too large I'll open a separate PR
I've updated the PR, adding one more commit that defers the `increaseMetric` call, to avoid having to repeat it on each error path. @marseel PTAL
lgtm except for small nits.
18b4f0d
to
2f2bc63
Compare
/test
This commit extracts the APILimiterObserver implementation from the daemon package to a separate one, to allow for its reuse. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Let's assign the newly introduced `pkg/rate/metrics` folder to the @cilium/metrics team. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
This commit slightly refactors the approach adopted to increase the kvstore metrics, deferring this operation at the beginning of each function (similarly to how tracing is handled). This transparently handles early returns due to errors. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Currently, the etcd rate limiter operates based on a token bucket of a fixed size. Yet, this is problematic when etcd is overloaded, since we may continue to issue new requests even if it cannot keep up. Hence, let's start using the custom APILimiter, which also takes into account the number of currently inflight requests. The maximum number of inflight requests can be configured through the etcd.maxInflight parameter, and defaults to the same value as etcd.qps if unset. The rate limiter wait is removed when checking the status of the etcd endpoints, to prevent the check from failing due to rate limiting, as that would put high pressure on etcd in case the connection gets restarted. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
2f2bc63
to
e84ba5b
Compare
Rebased onto main to fix conflicts
/test
/ci-awscni
/ci-eks
Reporting the description of the last commit for convenience: