antrea-agent readiness probe tolerates longer disconnection #2535

Merged (1 commit) on Aug 5, 2021

@tnqn (Member) commented Aug 4, 2021

In large-scale clusters, antrea-agent may take 40 to 50 seconds to
reconnect to the antrea service after antrea-controller restarts.
antrea-agent shouldn't be reported as NotReady in this scenario;
otherwise the DaemonSet controller would restart all agents at once
instead of performing a rolling update. Set failureThreshold to 8 so
the readiness probe can tolerate 70s of disconnection.

Signed-off-by: Quan Tian <qtian@vmware.com>

Fixes #2534

The purpose of antrea-agent's readiness probe is to make the status of antrea-agent visible, e.g. its connection with antrea-controller; it is not used for service load balancing or failover. So it should be fine to be less aggressive about updating its ready status.

@tnqn (Member, Author) commented Aug 4, 2021

/skip-all

@antoninbas (Contributor) left a comment

should we backport it to 1.2?

build/yamls/base/agent.yml (outdated, resolved)

@codecov-commenter commented Aug 4, 2021

Codecov Report

Merging #2535 (788d6be) into main (17b0a7b) will decrease coverage by 0.11%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #2535      +/-   ##
==========================================
- Coverage   59.90%   59.79%   -0.12%     
==========================================
  Files         281      281              
  Lines       22207    22239      +32     
==========================================
- Hits        13304    13297       -7     
- Misses       7481     7524      +43     
+ Partials     1422     1418       -4     
| Flag | Coverage Δ |
|---|---|
| e2e-tests | ∅ <ø> (?) |
| kind-e2e-tests | 46.85% <ø> (-0.16%) ⬇️ |
| unit-tests | 42.19% <ø> (-0.02%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| pkg/controller/networkpolicy/status_controller.go | 70.96% <0.00%> (-5.81%) ⬇️ |
| pkg/agent/openflow/pipeline.go | 71.34% <0.00%> (-2.26%) ⬇️ |
| pkg/agent/flowexporter/flowrecords/flow_records.go | 81.53% <0.00%> (-1.49%) ⬇️ |
| pkg/agent/openflow/client.go | 58.59% <0.00%> (-0.49%) ⬇️ |
| pkg/agent/agent.go | 50.22% <0.00%> (ø) |
| ...agent/flowexporter/connections/deny_connections.go | 83.01% <0.00%> (+0.09%) ⬆️ |
| pkg/agent/flowexporter/connections/connections.go | 75.55% <0.00%> (+0.55%) ⬆️ |
| .../flowexporter/connections/conntrack_connections.go | 81.30% <0.00%> (+1.12%) ⬆️ |
| pkg/apiserver/storage/ram/store.go | 80.45% <0.00%> (+1.50%) ⬆️ |
| pkg/apiserver/handlers/endpoint/handler.go | 70.58% <0.00%> (+11.76%) ⬆️ |

@tnqn force-pushed the increase-failure-threshold branch from 0f55a62 to 788d6be on August 5, 2021 01:19

@tnqn (Member, Author) commented Aug 5, 2021

/test-all

@tnqn (Member, Author) commented Aug 5, 2021

> should we backport it to 1.2?

Yes, will do.

Successfully merging this pull request may close these issues.

antrea-agent Pods didn't update in rolling update fashion in large scale cluster
3 participants