OPT: WitlessRunner -- reuse event loop #4527
Conversation
I am not sure why, but so far I have not run into the BlockingIOError promised in the comments on my laptop... that is why I decided to try and see if there would be any positive OPT effect from reusing the same loop.
It'd be great if these BlockingIOError lines just vanished. I recall being flooded with them when working on my attempt at the asyncio runner and when playing around with Michael's earlier version. But sadly, with your change, Travis still shows them: https://travis-ci.org/github/datalad/datalad/jobs/686818492#L2700 And locally I still see them. Here's one example. With this PR:
And with its base (344c628):
"Fun"... thanks for pointing to the specific test file. Reproduced locally!
(git)lena:~datalad/datalad-master[enh-witless]git
$> python -m nose -vs -x datalad/interface/tests/test_rerun.py
datalad.interface.tests.test_rerun.test_new_or_modified ... ok
datalad.interface.tests.test_rerun.test_placeholders ... ok
datalad.interface.tests.test_rerun.test_rerun_explicit ... ok
datalad.interface.tests.test_rerun.test_rerun_subdir ... ok
datalad.interface.tests.test_rerun.test_run_inputs_outputs ... Exception ignored when trying to write to the signal wakeup fd:
BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:
BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:
BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:
BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:
...
^C
Versions: appdirs=1.4.3 boto=2.49.0 cmd:7z=16.02 cmd:annex=8.20200501+git61-g64e081d58-1~ndall+1 cmd:bundled-git=2.24.0 cmd:git=2.24.0 cmd:system-git=2.26.2 cmd:system-ssh=8.1p1 exifread=2.1.2 git=3.1.0 gitdb=4.0.2 humanize=2.3.0 iso8601=0.1.12 keyring=18.0.1 keyrings.alt=3.4.0 msgpack=0.6.2 mutagen=1.40.0 requests=2.23.0 tqdm=4.43.0 wrapt=1.11.2
Obscure filename: str=b' "\';a&b&c\xce\x94\xd0\x99\xd7\xa7\xd9\x85\xe0\xb9\x97\xe3\x81\x82 `| ' repr=' "\';a&b&cΔЙקم๗あ `| '
Encodings: default='utf-8' filesystem='utf-8' locale.prefered='UTF-8'
Environment: PATH='/home/yoh/proj/datalad/datalad-master/venvs/dev3/bin:/home/yoh/gocode/bin:/home/yoh/gocode/bin:/home/yoh/bin:/home/yoh/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin' LANG='en_US.UTF-8' GIT_PAGER='less --no-init --quit-if-one-screen' GIT_PYTHON_GIT_EXECUTABLE='/usr/lib/git-annex.linux/git'

The fun part is that Travis seems to be happy. One of the AppVeyor runs failed, but I do not see any error message; it ends with normal
but stops after the first test sweep, without running the next ones... strange
FWIW I don't recall ever seeing an actual failure related to these messages. They're just alarming and, er, plentiful. In my notes from working on gh-3660, I pointed to https://bugs.python.org/issue21595, but I'm not sure how relevant it actually is. (See datalad/runner-asyncio-notes.org, lines 2 to 16 at f79b5f2.)
Yeah, that's one test file I happened to recall showing a lot of them. But I don't think it's limited to that one.
FWIW, I eyeballed the number of open files while running that test with this abominable cmdline
and it never goes too high. I found https://stackoverflow.com/a/52391791/1265472, which points to the same issue on the Python bug tracker and provides a summary which fits the bill for us: "It looks like the limiting factor is running lots of short-lived subprocesses." It also extracts the relevant information from the report: 'The "Exception ignored when trying to write to the signal wakeup fd" message comes from the signal handler in Modules/signalmodule.c. The problem is that Python gets a lot of SIGCHLD signals (the test script creates 300+ processes per second on my computer). The producer (the signal handler writing the signal number into the "self" pipe) is faster than the consumer (the BaseSelectorEventLoop._read_from_self callback).' That is probably the case for us with all our quick subprocess invocations. The answer continues: "I changed my code to restrict how many create_subprocess_execes could run at once. I stopped seeing the errors when I got below 35, though I'll probably set it to 20 just to be sure. Your mileage may vary."

```python
import asyncio

async def myTask(stuff, semaphore):
    # "async with", not plain "with": the semaphore must be acquired in the loop
    async with semaphore:
        process = await asyncio.create_subprocess_exec('short_program')
        await process.wait()

def taskRunner(stuffs):
    loop = asyncio.get_event_loop()
    semaphore = asyncio.Semaphore(20)  # limit how many can run at a time
    tasks = [
        asyncio.ensure_future(myTask(stuff, semaphore))
        for stuff in stuffs
    ]
    loop.run_until_complete(asyncio.gather(*tasks))
    loop.close()
```

I am yet to give something like that a shot -- I am yet to grasp the async constructs better. Someone is most welcome to see if we could do something like that (add a semaphore), and I am not sure whether it somehow wouldn't slow us down even further.
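To make the idea above concrete, here is a minimal runnable sketch of the semaphore-throttled pattern. The helper names (`run_one`, `run_all`), the use of the `true` command, and the concurrency cap are my own illustration (assuming a POSIX system), not code from this PR:

```python
import asyncio

MAX_CONCURRENT = 20  # cap on simultaneously live subprocesses

async def run_one(cmd, semaphore):
    # Holding the semaphore while the subprocess is alive throttles the rate
    # of SIGCHLD delivery, so the loop's self-pipe reader can keep up.
    async with semaphore:
        proc = await asyncio.create_subprocess_exec(*cmd)
        return await proc.wait()

async def run_all(cmds):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    return list(await asyncio.gather(*(run_one(c, semaphore) for c in cmds)))
```

For example, `asyncio.run(run_all([["true"]] * 200))` would run 200 short-lived processes while keeping at most 20 of them alive at once.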
Another interesting issue in pytest-asyncio, with a proposed workaround which is pretty much what was done in our case -- just restart the loop: pytest-dev/pytest-asyncio#114. It provides a detailed reason for how the problem comes about, although I am really not sure why something like that should apply in our case. Maybe it is the "producer/consumer" issue mentioned above? I wonder if there is a way to check the state of the "consumer" and let it consume once in a while, to ensure that the queue gets emptied before we run a subprocess again...
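For reference, the loop-restart workaround from that pytest-asyncio issue can be sketched as below. The helper name is my own, and this is the opposite of what this PR does (which reuses one loop); the point is that a fresh loop discards any self-pipe backlog along with the old loop:

```python
import asyncio

def run_with_fresh_loop(coro_factory):
    # Create a brand-new event loop for this one run and close it afterwards;
    # whatever was queued in the previous loop's self-pipe dies with that loop.
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(coro_factory())
    finally:
        loop.close()
```

The cost of this approach is paying loop setup/teardown on every call, which is exactly the overhead this PR tries to avoid.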
… handlers for logging. Decided to do it this way, instead of creating a logger with its own handling of ASYNCIO_* env variables via our LogHelper: even though that would add more granular control, it would require setting all the env vars and, who knows, might conflict with some "native" ones if they somehow manage to provide similar convenience ;) Also, we need to explicitly enable debug for the loop anyway. Since only this branch relies on a single loop, I did not cherry-pick this into a separate PR against stock master, but we might decide to do that later. [ci skip]
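The "explicitly enable debug for the loop" part needs two pieces: verbose handling on the `asyncio` logger and debug mode on the loop itself. A minimal sketch, where the environment variable name is made up for illustration (not an actual DataLad config option):

```python
import asyncio
import logging
import os

def maybe_enable_asyncio_debug(loop, env_var="DATALAD_ASYNCIO_DEBUG"):
    # env_var is a hypothetical name; the point is that both steps are needed:
    # the asyncio logger must be made verbose AND the loop itself must be
    # switched into debug mode for slow-callback/pending-task diagnostics.
    if os.environ.get(env_var):
        logging.getLogger("asyncio").setLevel(logging.DEBUG)
        loop.set_debug(True)
    return loop
```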
Some day a proper "datalad config --list-variables" should come about. [ci skip]
FWIW, I was hopeful that
I see no asyncio log messages anywhere around the point where the pukes happen, which is actually quite surprising (if it is not a matter of logs-vs-stdout flushing etc.). Full log is at http://www.onerussian.com/tmp/asyncioblocking-20200515-1.txt.gz . Timing:
[This is a meta-comment. I haven't looked at any of your updates.] I would vote strongly against having any substantial change in this area merged before the 0.13.0 release.
yeap, I think so too, no worries ;) but we need to get to the bottom of it
Filed a Python bug report: https://bugs.python.org/issue40634
No activity on the bug report, and we have not figured it out -- so I will close this for now.
Comments in the code (removed in this WiP PR) mentioned BlockingIOError. I have tried datalad/tests/test_witless_runner.py and a limited selection of unittests, and then other Pythons (using conda), and didn't trigger it. The only strange side effect I got, not present on master, is with Python 3.5.5: an ignored TypeError from `__del__` deeper within Python.
So I decided to test across our CIs while hopefully they aren't too busy doing useful stuff.
I also moved the loop into a class variable since, I guess, if we eventually introduce more async code we would need a central one. On a local run of the benchmarks, no notable differences were detected.
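The class-variable idea can be sketched like this. This is a toy stand-in to illustrate the shape of the change, not WitlessRunner's actual code; the class and method names are mine:

```python
import asyncio

class MiniRunner:
    # One event loop shared by all instances, created lazily and reused across
    # run() calls, instead of paying loop setup/teardown for every command.
    _loop = None

    @classmethod
    def _get_loop(cls):
        if cls._loop is None or cls._loop.is_closed():
            cls._loop = asyncio.new_event_loop()
        return cls._loop

    def run(self, coro):
        return self._get_loop().run_until_complete(coro)
```

Every `MiniRunner` instance drives its coroutines through the same loop object, which is the "central one" that further async code could also share.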