Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(executor/pns): remove sleep before sigkill #2995

Merged
merged 1 commit into from May 11, 2020

Conversation

adamglt
Copy link

@adamglt adamglt commented May 11, 2020

Bug in PNS executor when sidecars need to be SIGKILL'd (likely a missed line).

@sonarcloud
Copy link

sonarcloud bot commented May 11, 2020

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities (and Security Hotspot 0 Security Hotspots to review)
Code Smell A 0 Code Smells

0.0% 0.0% Coverage
0.0% 0.0% Duplication

@adamglt
Copy link
Author

adamglt commented May 11, 2020

While we're here - there's another kind-of-bug with images that start more than one process in the entrypoint.
For example: https://hub.docker.com/layers/wurstmeister/zookeeper/latest/images/sha256-3f43f72cb2832e7a5fbed7f7dbcd0c43004357974d8f32555d101bd53e81e74f?context=explore

The CMD is: CMD ["/bin/sh" "-c" "/usr/sbin/sshd && bash /usr/bin/start-zk.sh"]

Which causes this in the executor:

containerID bfe12d3f9b15caaf9925a89f36d23971b589f663c841c6f5e8a644a4394472a4 mapped to pid 33
containerID 8dd313815f821a33bfee35e5ddffba2e92f4b6d39d2fbf4587471864d6e8ccd6 mapped to pid 59
containerID da155e91637665af4a557fdb2339826a88d7940366aa83f9c4770ff596683612 mapped to pid 102
containerID da155e91637665af4a557fdb2339826a88d7940366aa83f9c4770ff596683612 mapped to pid 113
containerID bb98784aa673f4c622f994ca0d4b73e40b3b41e02f3016cab7f5c317ca3fea9a mapped to pid 156

The container PID for da155e... is overridden in the map here:

p.ctrIDToPid[containerID] = pid

Which causes killContainer to only SIGTERMing one PID, which leads to a hanging container.
I'm not sure how to fix this - we can change ctrIDToPid from map[string]int to map[string][]int and change all of the PID handling around that - but it's quite a bit of changes (mainPID -> mainPIDs logic and so on)

@adamglt
Copy link
Author

adamglt commented May 11, 2020

Another option that seems fine is this:

@@ -362,9 +361,13 @@ func (p *PNSExecutor) updateCtrIDMap() {
                                log.Warnf("Failed to identify containerID for process %d", pid)
                                continue
                        }
-                       log.Infof("containerID %s mapped to pid %d", containerID, pid)
-                       p.ctrIDToPid[containerID] = pid
+                       log.Infof("pid %d mapped to containerID %s", pid, containerID)
                        p.pidToCtrID[pid] = containerID
+                       if _, ok = p.ctrIDToPid[containerID]; !ok {
+                               // Only map containers to the first (lowest) pids
+                               log.Infof("containerID %s mapped to pid %d", containerID, pid)
+                               p.ctrIDToPid[containerID] = pid
+                       }

By only tracking the first containerID->pid and all pid->containerID - nothing else needs to change. It basically assumes that the first pid is the parent, which isn't 100% bulletproof but it's better than the current situation. It works with the ZK image above.

@simster7
Copy link
Member

It seems that the wait is in place to allow enough time for the wait container to upload all artifacts before the main container is killed.

I'll confirm this with @jessesuen today since he wrote the code, but if true we can just check if there are any outputs before running the sleep statement.

@simster7 simster7 self-assigned this May 11, 2020
@simster7
Copy link
Member

You can open a separate PR for the PID issue, we'll review that independently.

Copy link
Member

@simster7 simster7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spoke to @jessesuen and he confirmed that this was debug code that was never removed. Thanks for this great catch!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants