Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agent: stop endpoints in parallel on exit #15447

Merged

Conversation

jaffcheng
Copy link
Contributor

@jaffcheng jaffcheng commented Mar 24, 2021

agent: stop endpoints in parallel on exit

Endpoints are stopped on exit signals in an agent cleanup function.
This patch does this in goroutines to speed it up, reduces the probability
of agent exiting timeout, that is, reduces the possibility of pod network
disconnection caused by interrupted regeneration.

Related: #15446
Signed-off-by: Jaff Cheng jaff.cheng.sh@gmail.com

Wait for endpoints to be stopped on agent shutdown

@jaffcheng jaffcheng requested a review from a team March 24, 2021 09:22
@jaffcheng jaffcheng requested a review from a team as a code owner March 24, 2021 09:22
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 24, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.10.0 Mar 24, 2021
@pchaigno pchaigno added kind/bug This is a bug in the Cilium logic. sig/loader Impacts the loading of BPF programs into the kernel. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Mar 24, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 24, 2021
@pchaigno pchaigno self-requested a review March 24, 2021 09:27
pkg/endpoint/endpoint.go Outdated Show resolved Hide resolved
@jaffcheng jaffcheng force-pushed the stop-ep-event-queues-before-exit-upstream branch from f8a9bf9 to 74db935 Compare March 26, 2021 04:43
@jaffcheng jaffcheng changed the title daemon, endpoint: stop endpoint event queues before exiting daemon agent: stop endpoints in parallel on exit Mar 26, 2021
@jaffcheng jaffcheng force-pushed the stop-ep-event-queues-before-exit-upstream branch from 74db935 to adb01e8 Compare March 26, 2021 04:46
@jaffcheng
Copy link
Contributor Author

I guess this should be a minor change now.

@aditighag
Copy link
Member

aditighag commented Mar 26, 2021

This is a nice improvement, but I don't see how this fixes the issue you've tagged.

This happens when cilium-agent is killed (crash/upgrade/operational restart) during endpoint regeneration, more specifically, during synchronizeDirectories function call.

If an agent crashes or killed, then wait groups are not going to add any protection anyway.

In the scenario of RegenerateWithoutDatapath (e.g. one or more identities created or deleted), next dir is empty at this step. So after the following code, we have an empty state dir 1111 and a complete backup 1111_stale
If the agent stops at this point, after the agent restarts, we lose track of this endpoint, tail call maps will get GCed, and the pod will lose connectivity.

The way we take backup across 2 steps (os.Rename is an atomic call) should let us restore from the backup upon agent restart. 🤔 I don't quite follow why "tail call maps will get GC'd" as they would still being referenced in BPF programs.

@jaffcheng
Copy link
Contributor Author

jaffcheng commented Mar 26, 2021

This is a nice improvement, but I don't see how this fixes the issue you've tagged.

Yes, I've commented in the issue: #15446 (comment), also changed the Fixes tag to Related in commit msg.

The way we take backup across 2 steps (os.Rename is an atomic call) should let us restore from the backup upon agent restart. 🤔 I don't quite follow why "tail call maps will get GC'd" as they would still being referenced in BPF programs.

I think we just ignored backups while restoring endpoints. I'm not sure if this is intended or just an overlook.

if _, ok := possibleEPs[ep.ID]; ok {
// If the endpoint already exists then give priority to the directory
// that contains an endpoint that didn't fail to be build.
if strings.HasSuffix(ep.DirectoryPath(), epDirName) {
possibleEPs[ep.ID] = ep
}
} else {
possibleEPs[ep.ID] = ep
}

I think maps are GCed because the endpoint is not restored properly

for _, m := range mapPrefix {
if strings.HasPrefix(filename, m) {
if endpointID := strings.TrimPrefix(filename, m); endpointID != filename {
ms.deleteMapIfStale(path, filename, endpointID)
}
}
}

@pchaigno
Copy link
Member

pchaigno commented Mar 26, 2021

I think we just ignored backups while restoring endpoints. I'm not sure if this is intended or just an overlook.

We should probably fix that.

@jaffcheng jaffcheng force-pushed the stop-ep-event-queues-before-exit-upstream branch 2 times, most recently from 7301291 to d9818e2 Compare March 30, 2021 03:46
@jaffcheng
Copy link
Contributor Author

I think we just ignored backups while restoring endpoints. I'm not sure if this is intended or just an overlook.

We should probably fix that.

I'll try to fix that in a separate PR.

Endpoints are stopped on exit signals in an agent cleanup function.
This patch does this in goroutines to speed it up, reduces the probability
of agent exiting timeout, that is, reduces the possibility of pod network
disconnection caused by interrupted regeneration.

Related: cilium#15446
Signed-off-by: Jaff Cheng jaff.cheng.sh@gmail.com
@jaffcheng jaffcheng force-pushed the stop-ep-event-queues-before-exit-upstream branch from d9818e2 to 2003901 Compare April 2, 2021 11:59
@jaffcheng
Copy link
Contributor Author

Rebased

@pchaigno
Copy link
Member

pchaigno commented Apr 5, 2021

test-me-please

@pchaigno
Copy link
Member

pchaigno commented Apr 6, 2021

@aanm aanm merged commit c83dca2 into cilium:master Apr 7, 2021
1.10.0 automation moved this from In progress to Done Apr 7, 2021
@jaffcheng jaffcheng deleted the stop-ep-event-queues-before-exit-upstream branch April 7, 2021 03:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/loader Impacts the loading of BPF programs into the kernel.
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

6 participants