Server does not clean up driver instances when the driver dies #42

Closed
smola opened this issue Jun 28, 2017 · 13 comments · Fixed by #79

Comments

@smola
Member

smola commented Jun 28, 2017

No description provided.

@bzz
Contributor

bzz commented Jul 5, 2017

Could this be the reason why a single-threaded UAST conversion of every file in 2k repos hangs forever after some time, without any messages in the logs?

Does anybody know the advised way to get debug information to verify this issue?

@juanjux
Contributor

juanjux commented Jul 5, 2017

@smola I think this problem was the same as #36, which was fixed by PR bblfsh/sdk#135, so I'm closing it tentatively. Please reopen if you encounter, or receive reports of, new instances of this problem.

@juanjux juanjux closed this as completed Jul 5, 2017
@juanjux
Contributor

juanjux commented Jul 5, 2017

@bzz let's try with the new Docker images once they are published with the latest fixes.

@abeaumont
Contributor

@juanjux does that PR prevent the driver from dying, or does it clean it up if it's dead? If it's only the former, I'd still keep this open as an enhancement.

@juanjux
Contributor

juanjux commented Jul 5, 2017

I think we confused a driver that wasn't working because it was blocked waiting for its stdout to be consumed with a dead driver. We could certainly test this by forcing a driver to die with an exit(1) and checking whether the server cleans up that container instance. Reopening until I can check.
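
As an illustration of the cleanup half, here is a minimal, hedged sketch of how a server could watch a driver process and drop its instance when the driver dies. This is not the server's actual code; driverPool, its fields, and the "python-driver" id are hypothetical names used only for this example.

package main

import (
    "log"
    "os/exec"
    "sync"
    "time"
)

// driverPool is a hypothetical container for running driver processes.
type driverPool struct {
    mu      sync.Mutex
    drivers map[string]*exec.Cmd
}

// start launches a driver process and removes it from the pool when it exits,
// so a driver that dies (e.g. with an exit(1)) does not leave a stale instance.
func (p *driverPool) start(id string, cmd *exec.Cmd) error {
    if err := cmd.Start(); err != nil {
        return err
    }
    p.mu.Lock()
    p.drivers[id] = cmd
    p.mu.Unlock()

    go func() {
        err := cmd.Wait() // Wait also reaps the child process
        log.Printf("driver %s exited: %v; removing its instance", id, err)
        p.mu.Lock()
        delete(p.drivers, id)
        p.mu.Unlock()
    }()
    return nil
}

func main() {
    p := &driverPool{drivers: map[string]*exec.Cmd{}}
    // Start a fake "driver" that dies immediately with exit(1).
    if err := p.start("python-driver", exec.Command("sh", "-c", "exit 1")); err != nil {
        log.Fatal(err)
    }
    time.Sleep(time.Second) // give the watcher goroutine time to log the cleanup
}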

@bzz
Contributor

bzz commented Jul 5, 2017

@juanjux

@bzz let's try with the new Docker images once they are published with the latest fixes.

I have tried with the latest build of everything and this is still reproducible :(
I have put details on how this happens with a 380 KB file in bblfsh/sdk#130 (comment).

@juanjux
Contributor

juanjux commented Jul 5, 2017

@bzz this is about the server not cleaning up container instances when the driver dies; in that case the message is something like "no more instances available". The hangs are related to bblfsh/sdk#130.

@abeaumont
Contributor

@zurk has been hit by this issue and reported it at #78

@juanjux
Contributor

juanjux commented Jul 21, 2017

I'll take a look at this today.

@juanjux
Contributor

juanjux commented Jul 21, 2017

Update

I've been able to reproduce #78 with the steps included (after fixing some imports and command line parameters, I guess because it's the @develop version of ast2vec). After running the provided script, there are 4 zombie runc processes spawned from the server, and they are cleared when you close the server.

There weren't any errors during parsing, so my previous theory that this happens when a container stops with a fatal error doesn't seem to hold in this case.

Running the provided script leaves 4 zombie processes. They are created together very quickly, within the same second. After that, if I don't stop the server and run the script any number of times, sometimes there are new zombies but most of the time there aren't (the accumulated ones don't go away and keep the same PIDs). The first time, the number of zombies is always 4 on my machine.

Now, if I do a simple call with the client-python, I get one zombie. The curious thing is that if, after running my test and without restarting the server, I then run the provided script, I won't get the pack of 4 zombies but most of the time none, and sometimes 1-2 (like subsequent calls of the same script in the previous case).

So it looks like the first usage of the server is the one that generates zombies in a really reproducible way (and with the same number for the same client code), but after that it's random.

Still looking into it.
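
For illustration, here is a minimal standalone sketch (not part of the server, Linux-only) of one way to list zombie children of a given parent PID by parsing /proc/<pid>/stat, where the field after the command name is the process state ('Z' for zombie) and the next one is the parent PID:

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: zombies <parent-pid>")
        os.Exit(1)
    }
    parent := os.Args[1]
    stats, _ := filepath.Glob("/proc/[0-9]*/stat")
    for _, stat := range stats {
        data, err := os.ReadFile(stat)
        if err != nil {
            continue // the process may already be gone
        }
        // /proc/<pid>/stat format: pid (comm) state ppid ...
        // comm may contain spaces, so split after the closing parenthesis.
        line := string(data)
        fields := strings.Fields(line[strings.LastIndex(line, ")")+1:])
        if len(fields) >= 2 && fields[0] == "Z" && fields[1] == parent {
            fmt.Println("zombie:", stat)
        }
    }
}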

@juanjux
Contributor

juanjux commented Jul 25, 2017

So, after much debugging and many loud WTFs, I found that it's a runc issue/misfeature of not reaping zombie child processes: opencontainers/runc#1443

It looks like it's normal for runc to generate a zombie process at that point (?), but it should be reaped (the zombie process is generated exactly here). I've tested this PR and it works perfectly.
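
For context, this is a minimal sketch of the general child-reaping technique the linked PR is about (not the actual runc patch): the parent listens for SIGCHLD and calls wait4 with WNOHANG in a loop until every exited child has been collected, so none of them lingers as a zombie.

package main

import (
    "log"
    "os"
    "os/exec"
    "os/signal"
    "syscall"
)

func main() {
    sigs := make(chan os.Signal, 32)
    signal.Notify(sigs, syscall.SIGCHLD)

    // Spawn a child that exits immediately; without reaping it would
    // stay as a zombie until this process exits.
    if err := exec.Command("true").Start(); err != nil {
        log.Fatal(err)
    }

    for range sigs {
        // Reap every child that has exited since the last SIGCHLD.
        for {
            var status syscall.WaitStatus
            pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
            if pid <= 0 || err != nil {
                break
            }
            log.Printf("reaped child %d (exit status %d)", pid, status.ExitStatus())
        }
        return // demo only: a real reaper keeps running for the parent's lifetime
    }
}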

@zurk: I'll now investigate how to make Glide use a branch from another repo and open a PR so @abeaumont can release a new Docker image when he can, but if you want to test it on your own, do this:

go get github.com/opencontainers/runc
cd $GOPATH/src/github.com/opencontainers/runc
git remote add zombies https://github.com/LittleLightLittleFire/runc.git
git fetch --all
git rebase zombies/1443-runc-reap-child-process

Now, to test the server without the Docker image:

go get -u github.com/bblfsh/server/...
cd $GOPATH/src/github.com/bblfsh/server
make build
# wait...
cd cmd/bblfsh
go build
sudo ./bblfsh server --log-level=debug 

Now, if you connect with the client-python, remember to add a parameter to use the existing server:

python -m bblfsh --disable-bblfsh-autorun --file whatever.py

@juanjux
Contributor

juanjux commented Jul 25, 2017

Fixed by #79

@juanjux
Contributor

juanjux commented Jul 26, 2017

I've also tested with the script provided by @zurk and no zombie processes were created. Please note that the script doesn't work anymore with the current develop branch of ast2vec.
