Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcd: extend rate limiting to consider the number of inflight requests #25817

Merged
merged 4 commits into from Jun 9, 2023

Conversation

giorio94
Copy link
Member

@giorio94 giorio94 commented Jun 1, 2023

Reporting the description of the last commit for convenience:

Currently, the etcd rate limiter operates based on a token bucket of a
fixed size. Yet, this is problematic when etcd is overloaded, since we
may continue to issue new requests even if it cannot keep up. Hence,
let's start using the custom APILimiter, which also takes into account
the number of currently inflight requests. The maximum amount of
inflight requests can configured through the etcd.maxInflight parameter,
and defaults to the same value as etcd.qps if unset.
etcd: extend rate limiting to consider the number of inflight requests

@giorio94 giorio94 added kind/performance There is a performance impact of this. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/kvstore Impacts the KVStore package interactions. labels Jun 1, 2023
@giorio94 giorio94 requested review from a team as code owners June 1, 2023 09:45
@giorio94
Copy link
Member Author

giorio94 commented Jun 1, 2023

/test

@marseel
Copy link
Contributor

marseel commented Jun 1, 2023

Related Etcd issue: etcd-io/etcd#15993

pkg/kvstore/etcd.go Outdated Show resolved Hide resolved
@@ -1162,7 +1214,6 @@ func (e *etcdClient) determineEndpointStatus(ctx context.Context, endpointAddres

e.getLogger().Debugf("Checking status to etcd endpoint %s", endpointAddress)

e.limiter.Wait(ctxTimeout)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've dropped the rate limiter wait here, to prevent that the status check fails due to rate limiting, as that would put a high pressure on etcd in case the connection gets restarted.

@giorio94
Copy link
Member Author

giorio94 commented Jun 7, 2023

The last push added the support to increase the metrics when the rate limit wait process fails.

@giorio94 giorio94 requested a review from a team as a code owner June 7, 2023 13:35
@giorio94
Copy link
Member Author

giorio94 commented Jun 7, 2023

Rebased onto main to pick the CI fixes. I've additionally added a new commit to assign pkg/rate to @cilium/sig-agent and pkg/rate/metrics to @cilium/metrics (please let me know if a different team would fit better). Pulling in @cilium/metrics for review given the above.

@giorio94 giorio94 requested review from a team and chancez and removed request for a team June 7, 2023 13:38
@giorio94
Copy link
Member Author

giorio94 commented Jun 7, 2023

/test

pkg/kvstore/etcd.go Outdated Show resolved Hide resolved
e.limiter.Wait(ctx)
lr, err := e.limiter.Wait(ctx)
if err != nil {
increaseMetric(key, metricRead, "GetLocked", duration.EndError(err).Total(), err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do defer increaseMetric(key, metricRead, "GetLocked", duration.EndError(err).Total(), err) two lines above (not inside if) and remove another increaseMetric below?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot with the current increaseMetric implementation, since duration.EndError(err).Total() would be executed immediately, hence not counting correctly the duration. It should work modifying increaseMetric to take the span as input (rather than the pre-computed duration), but the change tends to be on the large side. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yea, we could do that in separate PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm giving it a try. If it ends up being too large I'll open a separate PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the PR adding one more commit which does the increaseMetrics deferring, to avoid having to repeat that in case of errors. @marseel PTAL

pkg/kvstore/etcd.go Outdated Show resolved Hide resolved
pkg/kvstore/etcd.go Outdated Show resolved Hide resolved
pkg/kvstore/etcd.go Outdated Show resolved Hide resolved
pkg/kvstore/etcd.go Outdated Show resolved Hide resolved
Copy link
Contributor

@marseel marseel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm except for small nits.

@giorio94 giorio94 marked this pull request as draft June 7, 2023 15:08
@giorio94 giorio94 marked this pull request as ready for review June 8, 2023 06:47
@giorio94
Copy link
Member Author

giorio94 commented Jun 8, 2023

/test

This commit extracts the APILimiterObserver implementation from the
daemon package to a separate one, to allow for its reuse.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Let's assign the newly introduced `pkg/rate/metrics` folder to
the @cilium/metrics team.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
This commit slightly refactors the approach adopted to increase the
kvstore metrics, deferring this operation at the beginning of each
function (similarly to how tracing is handled). This allows to
transparently handle early returns due to errors.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Currently, the etcd rate limiter operates based on a token bucket of a
fixed size. Yet, this is problematic when etcd is overloaded, since we
may continue to issue new requests even if it cannot keep up. Hence,
let's start using the custom APILimiter, which also takes into account
the number of currently inflight requests. The maximum amount of
inflight requests can configured through the etcd.maxInflight parameter,
and defaults to the same value as etcd.qps if unset.

The rate limiter wait is removed when checking the status of the etcd
endpoints, to prevent that the check fails due to rate limiting, as that
would put a high pressure on etcd in case the connection gets restarted.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94
Copy link
Member Author

giorio94 commented Jun 8, 2023

Rebased onto main to fix conflicts

@giorio94
Copy link
Member Author

giorio94 commented Jun 8, 2023

/test

@giorio94
Copy link
Member Author

giorio94 commented Jun 9, 2023

/ci-awscni

@giorio94
Copy link
Member Author

giorio94 commented Jun 9, 2023

/ci-eks

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 9, 2023
@dylandreimerink dylandreimerink merged commit 7908f20 into cilium:main Jun 9, 2023
63 of 64 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/performance There is a performance impact of this. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/kvstore Impacts the KVStore package interactions.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants