Growing number of driver processes #130

bzz · 2017-11-16T11:23:25Z

While getting UASTs and filtering for identifiers in the Python files of a single project using Engine, I can see 350+ driver processes inside the bblfshd container after 30 minutes.

Detailed logs

ps

root      2169  0.0  0.0  18188  3188 pts/0    Ss   08:43   0:00 bash
root      2177  0.0  0.0      0     0 ?        Z    08:43   0:00 [runc:[1:CHILD]] <defunct>
root      2203  0.0  0.0      0     0 ?        Z    08:43   0:00 [runc:[1:CHILD]] <defunct>
root      2228  0.0  0.0      0     0 ?        Z    08:43   0:00 [runc:[1:CHILD]] <defunct>
root      2249  0.0  0.0      0     0 ?        Z    08:43   0:00 [runc:[1:CHILD]] <defunct>
root      2269  0.0  0.0      0     0 ?        Z    08:44   0:00 [runc:[1:CHILD]] <defunct>
root      2473  0.0  0.0      0     0 ?        Z    08:44   0:00 [runc:[1:CHILD]] <defunct>
root      2561  0.0  0.0      0     0 ?        Z    08:44   0:00 [runc:[1:CHILD]] <defunct>
root      2562 28.4  0.7  36036 28552 ?        Ssl  08:44   0:01 /opt/driver/bin/driver --log-level info --log-format text -
root      2572 30.7  0.6  81092 27848 ?        S    08:44   0:02 /usr/bin/python3.6 /usr/bin/python_driver

Container log

time="2017-11-16T08:48:35Z" level=info msg="python-driver version: dev-1908ca8 (build: 2017-11-14T11:31:28Z)" id=01bz207xxmc18dppgxwgywr5zs language=python
time="2017-11-16T08:48:35Z" level=info msg="server listening in /tmp/rpc.sock (unix)" id=01bz207xxmc18dppgxwgywr5zs language=python
time="2017-11-16T08:48:36Z" level=info msg="new driver instance started bblfsh/python-driver:latest (01bz207xxmc18dppgxwgywr5zs)"
time="2017-11-16T08:49:07Z" level=info msg="python-driver version: dev-1908ca8 (build: 2017-11-14T11:31:28Z)" id=01bz208xey432evff0pga9dnxr language=python
time="2017-11-16T08:49:07Z" level=info msg="server listening in /tmp/rpc.sock (unix)" id=01bz208xey432evff0pga9dnxr language=python
time="2017-11-16T12:24:51Z" level=error msg="error re-scaling pool: container is not destroyed" language=python

Apache Spark thread dump

org.bblfsh.client.BblfshClient.filter(BblfshClient.scala:33)
tech.sourced.engine.udf.QueryXPathUDF$$anonfun$queryXPath$2.apply(QueryXPathUDF.scala:45)
tech.sourced.engine.udf.QueryXPathUDF$$anonfun$queryXPath$2.apply(QueryXPathUDF.scala:44)
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
tech.sourced.engine.udf.QueryXPathUDF$.queryXPath(QueryXPathUDF.scala:44)

Steps to reproduce, using 30 concurrent clients:

# get Borges from https://github.com/src-d/borges/releases/tag/v0.8.3
echo -e "https://github.com/src-d/borges.git\nhttps://github.com/erizocosmico/borges.git\nhttps://github.com/jelmer/dulwich.git" > repos.txt
borges pack --loglevel=debug --workers=2 --to=./repos -f repos.txt

# get Apache Spark https://github.com/src-d/engine#quick-start
$SPARK_HOME/bin/spark-shell --driver-memory=4g --packages "tech.sourced:engine:0.1.7"

Then run :paste, paste the code below, and hit Ctrl+D:

import tech.sourced.engine._

val engine = Engine(spark, "repos")
val repos = engine.getRepositories
val refs = repos.getHEAD.withColumnRenamed("hash","commit_hash")

val langs = refs.getFiles.classifyLanguages
val pyTokens = langs
  .where('lang === "Python")
  .extractUASTs.queryUAST("//*[@roleIdentifier]", "uast", "result")
  .extractTokens("result", "tokens")

val tokensToWrite = pyTokens
  .join(refs, "commit_hash")
  .select('repository_id, 'name, 'commit_hash, 'file_hash, 'path, 'lang, 'tokens)

spark.conf.set("spark.sql.shuffle.partitions", "30") //instead of default 200
tokensToWrite.show

Then, after exec'ing into the bblfshd container, one can see the number of driver processes growing:

apt-get update && apt-get install -y procps
ps aux | wc -l
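Since `ps aux | wc -l` counts every process plus the header line, a count restricted to defunct entries may be easier to track over time. The following is only a sketch, assuming procps is installed as above; the function name `count_zombies` is made up for illustration and is not part of bblfshd.

```shell
# Count defunct (zombie) processes.
# Reads lines of "STAT PPID COMMAND" on stdin, as produced by
#   ps -e -o stat= -o ppid= -o comm=
# and prints how many of them have a state starting with Z.
count_zombies() {
  awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}
```

Inside the container it could be used as `ps -e -o stat= -o ppid= -o comm= | count_zombies`. The separate `-o` options with empty headers suppress the header line, so nothing needs to be subtracted from the result.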


bzz · 2017-11-16T11:53:56Z

Update: the same happens with just 3 clients instead of 30; it only takes longer, ~30 min, to reproduce.

A related issue exists in Engine (src-d/sourced-ce#196); this one is just about zombie processes in the bblfshd container.

bzz · 2017-11-16T12:27:16Z

Noticed two new errors in the logs that are not posted above:

time="2017-11-16T12:20:44Z" level=error msg="request processed content 3487 bytes, status Fatal" elapsed=43.828518ms language=python
time="2017-11-16T12:24:51Z" level=error msg="error re-scaling pool: container is not destroyed" language=python

Going to post logs w/ debug enabled.

bzz · 2017-11-16T12:56:05Z

Here are the debug logs w/ 93 processes inside the container: 93-process-bblfshd.log

juanjux · 2017-11-30T12:59:31Z

Yes, I can reproduce it, thanks for the steps and the (as always in your case) really awesome bug report, @bzz.

The processes are runc zombie processes that don't use resources, so this should not be a performance problem, but it's certainly not pretty having all those zombies until a bblfshd container restart (the bblfshd pid is the parent of the zombie herd, as you can see with a ps -l to show the PPID). It's odd because libcontainer some time ago merged a PR that reaped the zombie processes and fixed a similar issue we had. I'll try to update the dependency to the latest version in bblfshd, and if that doesn't fix the problem I'll investigate whether we're doing something wrong in our process management.
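To check which process the zombies belong to without reading through the `ps -l` listing by eye, the defunct entries can be grouped by PPID. This is a rough sketch, assuming procps is available in the container; `zombies_by_parent` is a hypothetical helper name, not something shipped with bblfshd.

```shell
# Group defunct processes by parent PID.
# Reads lines of "STAT PPID COMMAND" on stdin, as produced by
#   ps -e -o stat= -o ppid= -o comm=
# and prints one "PPID COUNT" line per parent, sorted by PPID.
zombies_by_parent() {
  awk '$1 ~ /^Z/ { c[$2]++ } END { for (p in c) print p, c[p] }' | sort -n
}
```

Run as `ps -e -o stat= -o ppid= -o comm= | zombies_by_parent`. If the description above holds, nearly all zombies should be listed under a single PPID, which can then be compared against the bblfshd pid.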

juanjux · 2017-11-30T14:53:31Z

Looks like libcontainer from ~master avoids this problem (I've left it running for 20 minutes and there isn't a single defunct process). I'll upload an exported docker image of this version of bblfshd for you to test, and if you confirm that it works we can close this after the PR. More details on Slack.

juanjux · 2017-12-01T08:44:25Z

After more tests, I still see some defunct driver processes after leaving this test running for a while, but there are about 3-5 after 40 minutes where previously there were hundreds, so while not totally fixed, it's a huge improvement.

Considering that the change was just updating the libcontainer dependency, the problem is surely there.

bzz · 2017-12-01T12:14:30Z

@juanjux It will take a few days for me to get back to this and reproduce it, so we can either re-open or keep it here for a while and see.

Sorry, closed by mistake :/

juanjux · 2017-12-01T15:09:58Z

Ok, so if the maintainer @abeaumont agrees, we can merge #138 and close this; feel free to reopen if you find the problem again.

abeaumont · 2017-12-01T16:00:27Z

Done

zurk mentioned this issue Nov 24, 2017

error while running how_to_use_ast2vec.ipynb src-d/ml#120

Closed

abeaumont assigned juanjux Nov 29, 2017

juanjux mentioned this issue Nov 30, 2017

Update libcontainer dependency #138

Merged

bzz closed this as completed Dec 1, 2017

bzz reopened this Dec 1, 2017

abeaumont closed this as completed in #138 Dec 1, 2017
