
Growing number of driver processes #130

Closed
bzz opened this issue Nov 16, 2017 · 9 comments · Fixed by #138
@bzz
Contributor

bzz commented Nov 16, 2017

While getting UASTs and filtering for identifiers in the Python files of a single project using Engine, after 30 min I can see 350+ driver processes inside the bblfshd container.

Logs in detail

ps

root      2169  0.0  0.0  18188  3188 pts/0    Ss   08:43   0:00 bash
root      2177  0.0  0.0      0     0 ?        Z    08:43   0:00 [runc:[1:CHILD]] <defunct>
root      2203  0.0  0.0      0     0 ?        Z    08:43   0:00 [runc:[1:CHILD]] <defunct>
root      2228  0.0  0.0      0     0 ?        Z    08:43   0:00 [runc:[1:CHILD]] <defunct>
root      2249  0.0  0.0      0     0 ?        Z    08:43   0:00 [runc:[1:CHILD]] <defunct>
root      2269  0.0  0.0      0     0 ?        Z    08:44   0:00 [runc:[1:CHILD]] <defunct>
root      2473  0.0  0.0      0     0 ?        Z    08:44   0:00 [runc:[1:CHILD]] <defunct>
root      2561  0.0  0.0      0     0 ?        Z    08:44   0:00 [runc:[1:CHILD]] <defunct>
root      2562 28.4  0.7  36036 28552 ?        Ssl  08:44   0:01 /opt/driver/bin/driver --log-level info --log-format text -
root      2572 30.7  0.6  81092 27848 ?        S    08:44   0:02 /usr/bin/python3.6 /usr/bin/python_driver

Container log

time="2017-11-16T08:48:35Z" level=info msg="python-driver version: dev-1908ca8 (build: 2017-11-14T11:31:28Z)" id=01bz207xxmc18dppgxwgywr5zs language=python
time="2017-11-16T08:48:35Z" level=info msg="server listening in /tmp/rpc.sock (unix)" id=01bz207xxmc18dppgxwgywr5zs language=python
time="2017-11-16T08:48:36Z" level=info msg="new driver instance started bblfsh/python-driver:latest (01bz207xxmc18dppgxwgywr5zs)"
time="2017-11-16T08:49:07Z" level=info msg="python-driver version: dev-1908ca8 (build: 2017-11-14T11:31:28Z)" id=01bz208xey432evff0pga9dnxr language=python
time="2017-11-16T08:49:07Z" level=info msg="server listening in /tmp/rpc.sock (unix)" id=01bz208xey432evff0pga9dnxr language=python
time="2017-11-16T12:24:51Z" level=error msg="error re-scaling pool: container is not destroyed" language=python

Apache Spark thread dump

org.bblfsh.client.BblfshClient.filter(BblfshClient.scala:33)
tech.sourced.engine.udf.QueryXPathUDF$$anonfun$queryXPath$2.apply(QueryXPathUDF.scala:45)
tech.sourced.engine.udf.QueryXPathUDF$$anonfun$queryXPath$2.apply(QueryXPathUDF.scala:44)
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
tech.sourced.engine.udf.QueryXPathUDF$.queryXPath(QueryXPathUDF.scala:44)

Steps to reproduce, using 30 concurrent clients:

# get Borges from https://github.com/src-d/borges/releases/tag/v0.8.3
echo -e "https://github.com/src-d/borges.git\nhttps://github.com/erizocosmico/borges.git\nhttps://github.com/jelmer/dulwich.git" > repos.txt
borges pack --loglevel=debug --workers=2 --to=./repos -f repos.txt

# get Apache Spark https://github.com/src-d/engine#quick-start
$SPARK_HOME/bin/spark-shell --driver-memory=4g --packages "tech.sourced:engine:0.1.7"

and then run :paste, paste the code below, and hit Ctrl+D:

import tech.sourced.engine._

// read the repositories packed by Borges above
val engine = Engine(spark, "repos")
val repos = engine.getRepositories
val refs = repos.getHEAD.withColumnRenamed("hash", "commit_hash")

// keep only Python files, extract their UASTs and query them for identifiers
val langs = refs.getFiles.classifyLanguages
val pyTokens = langs
  .where('lang === "Python")
  .extractUASTs.queryUAST("//*[@roleIdentifier]", "uast", "result")
  .extractTokens("result", "tokens")

val tokensToWrite = pyTokens
  .join(refs, "commit_hash")
  .select('repository_id, 'name, 'commit_hash, 'file_hash, 'path, 'lang, 'tokens)

spark.conf.set("spark.sql.shuffle.partitions", "30") // instead of the default 200
tokensToWrite.show

Then, if you exec into the bblfshd container, you can see the number of driver processes growing:

apt-get update && apt-get install -y procps
ps aux | wc -l
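
To watch the count from the host without keeping an interactive shell open, a minimal sketch could look like the following (it assumes the bblfshd container is named bblfshd and is run via Docker; adjust the name to your setup, and note that procps must already be installed inside the container as shown above):

# sample the number of defunct (zombie) processes inside the container once a minute
while true; do
  date
  docker exec bblfshd ps aux | grep -c defunct
  sleep 60
done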
@bzz
Contributor Author

bzz commented Nov 16, 2017

Update: the same happens with just 3 clients instead of 30; it only takes longer, ~30 min, to reproduce.

A related issue exists in Engine, src-d/sourced-ce#196; this one is just about the zombie processes in the bblfshd container.

@bzz
Contributor Author

bzz commented Nov 16, 2017

Noticed two new errors in the logs that are not posted above:

time="2017-11-16T12:20:44Z" level=error msg="request processed content 3487 bytes, status Fatal" elapsed=43.828518ms language=python
time="2017-11-16T12:24:51Z" level=error msg="error re-scaling pool: container is not destroyed" language=python

Going to post logs with debug enabled.

@bzz
Contributor Author

bzz commented Nov 16, 2017

Here are debug logs with 93 processes inside the container: 93-process-bblfshd.log

@juanjux
Contributor

juanjux commented Nov 30, 2017

Yes, I can reproduce it, thanks for the steps and the (as always in your case) really awesome bug report, @bzz.

The processes are runc zombie processes that don't use resources, so this should not be a performance problem, but it's certainly not pretty to have all those zombies around until the bblfshd container is restarted (the bblfshd PID is the parent of the zombie herd, as you can see with ps -l, which shows the PPID). It's odd because libcontainer merged a PR some time ago that reaped the zombie processes and fixed a similar issue we had. I'll try to update the dependency to the latest version in bblfshd, and if that doesn't fix the problem I'll investigate whether we're doing something wrong in our process management.
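
For reference, a quick sketch of checking that parent/child relationship inside the container, assuming procps is installed as in the reproduction steps and that the daemon shows up as bblfshd in the process list:

# list zombie processes together with the PID of their parent
ps -eo pid,ppid,stat,comm | awk 'NR == 1 || $3 ~ /^Z/'
# the PPID column above should match the PID of the bblfshd server process
ps -eo pid,comm | grep bblfshd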

@juanjux
Contributor

juanjux commented Nov 30, 2017

Looks like libcontainer from ~master avoids this problem (I've left it running for 20 minutes and there isn't a single defunct process). I'll upload an exported docker image of this version of bblfshd for you to test, and if you confirm that it works we can close this after the PR. More details on Slack.
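
For whoever tests the exported image, a rough sketch of loading and running it (the tarball and tag names here are hypothetical placeholders, and the --privileged flag and port 9432 are assumptions based on how bblfshd is normally run; adjust to whatever docker load reports):

# load the exported image; docker load prints the resulting image name/tag
docker load -i bblfshd-libcontainer-master.tar
# run it like the regular bblfshd image (names, port and flags are assumptions)
docker run -d --name bblfshd-test --privileged -p 9432:9432 bblfsh/bblfshd:dev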

@juanjux
Contributor

juanjux commented Dec 1, 2017

After more tests, I still see some defunct driver processes after leaving this test running for a while, but there are around 3-5 of them after 40 minutes, while previously there were hundreds; so while it's not totally fixed, it's a huge improvement.

Considering that the change was just updating the libcontainer dependency, the problem surely lies there.

@bzz
Contributor Author

bzz commented Dec 1, 2017

@juanjux It will take a few days for me to get back to this and reproduce it, so we can either re-open it later or keep this open for a while and see.

Sorry, closed by mistake :/

@bzz bzz closed this as completed Dec 1, 2017
@bzz bzz reopened this Dec 1, 2017
@juanjux
Contributor

juanjux commented Dec 1, 2017

OK, so if the maintainer @abeaumont agrees we can merge #138 and close this; feel free to reopen if you find the problem again.

@abeaumont
Contributor

Done
