OPT: WitlessRunner -- reuse event loop #4527

yarikoptic · 2020-05-14T01:11:17Z

Comments in the code (removed in this WiP PR) mentioned BlockingIOError. I have tried on datalad/tests/test_witless_runner.py and a limited selection of unittests and then with other pythons (using conda) and didn't trigger it.

the only strange side effect I got is with python 3.5.5 which was not present on master is an ignored TypeError from `__del__` deeper within python

$> python -m nose -s -v datalad/tests/test_witless_runner.py
datalad.tests.test_witless_runner.test_runner_stderr_capture ... ok
datalad.tests.test_witless_runner.test_runner_stdout_capture ... ok
datalad.tests.test_witless_runner.test_runner_parametrized_protocol ... ok
datalad.tests.test_witless_runner.test_runner_cwd_encoding ... ok
datalad.tests.test_witless_runner.test_runner_failure ... ok
datalad.tests.test_witless_runner.test_runner_fix_PWD ... ok
datalad.tests.test_witless_runner.test_runner_stdin ... ok
datalad.tests.test_witless_runner.test_runner ... ok
Versions: appdirs=1.4.3 boto=2.49.0 cmd:7z=16.02 cmd:annex=8.20200226-g2d3ef2c07 cmd:bundled-git=2.25.1 cmd:git=2.25.1 cmd:system-git=2.26.2 cmd:system-ssh=8.1p1 exifread=2.1.2 git=3.1.2 gitdb=4.0.5 humanize=0.5.1 iso8601=0.1.12 keyring=13.2.1 keyrings.alt=3.1 msgpack=0.5.6 mutagen=1.41.1 requests=2.23.0 wrapt=1.10.11
Obscure filename: str=b' "\';a&b&c\xce\x94\xd0\x99\xd7\xa7\xd9\x85\xe0\xb9\x97\xe3\x81\x82 `| ' repr=' "\';a&b&cΔЙקم๗あ `| '
Encodings: default='utf-8' filesystem='utf-8' locale.prefered='UTF-8'
Environment: GIT_PYTHON_GIT_EXECUTABLE='/home/yoh/anaconda-5.2.0-2.7/envs/datalad-py3.5/share/git-annex-8.20200226-1/git' GIT_PAGER='less --no-init --quit-if-one-screen' PATH='/home/yoh/anaconda-5.2.0-2.7/envs/datalad-py3.5/bin:/home/yoh/anaconda-5.2.0-2.7/bin/:/home/yoh/gocode/bin:/home/yoh/gocode/bin:/home/yoh/bin:/home/yoh/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin' LANG='en_US.UTF-8'

----------------------------------------------------------------------
Ran 8 tests in 1.745s

OK
Exception ignored in: <bound method BaseEventLoop.__del__ of <_UnixSelectorEventLoop running=False closed=True debug=False>>
Traceback (most recent call last):
  File "/home/yoh/anaconda-5.2.0-2.7/envs/datalad-py3.5/lib/python3.5/asyncio/base_events.py", line 511, in __del__
  File "/home/yoh/anaconda-5.2.0-2.7/envs/datalad-py3.5/lib/python3.5/asyncio/unix_events.py", line 65, in close
  File "/home/yoh/anaconda-5.2.0-2.7/envs/datalad-py3.5/lib/python3.5/asyncio/unix_events.py", line 146, in remove_signal_handler
  File "/home/yoh/anaconda-5.2.0-2.7/envs/datalad-py3.5/lib/python3.5/signal.py", line 47, in signal
TypeError: signal handler must be signal.SIG_IGN, signal.SIG_DFL, or a callable object

$> python --version
Python 3.5.5

So decided to test across our CIs while hopefully they aren't too busy doing useful stuff.
I also moved loop into class variable since I guess eventually if we introduce more of async, we would need a central one (I guess). On local run of benchmarks, no notable differences were detected.

I am not sure why but so far I have not ran into the promised in the comments BlockingIOError on my laptop... that is why decided to try and see if there would be any positive OPT effect from reusing the same loop

kyleam · 2020-05-14T19:10:08Z

It'd be great if these BlockIOError lines just vanished. I recall being flooded by those when working on my attempt at the asyncio runner and when playing around with Michael's earlier version. But sadly, with your change, Travis still shows them: https://travis-ci.org/github/datalad/datalad/jobs/686818492#L2700

And locally I still see them. Here's one example.

With this PR:

% python -m nose -vs -x datalad/interface/tests/test_rerun.py >/tmp/rerun-pr 2>&1
% grep -c Blocking /tmp/rerun-pr
115

And on with its base (344c628):

% python -m nose -vs -x datalad/interface/tests/test_rerun.py >/tmp/rerun-base 2>&1
% grep -c Blocking /tmp/rerun-base 
0

yarikoptic · 2020-05-14T21:11:42Z

"Fun"... thanks for pointing to specific test file.

Reproduced locally!

(git)lena:~datalad/datalad-master[enh-witless]git
$> python -m nose -vs -x datalad/interface/tests/test_rerun.py
datalad.interface.tests.test_rerun.test_new_or_modified ... ok
datalad.interface.tests.test_rerun.test_placeholders ... ok
datalad.interface.tests.test_rerun.test_rerun_explicit ... ok
datalad.interface.tests.test_rerun.test_rerun_subdir ... ok
datalad.interface.tests.test_rerun.test_run_inputs_outputs ... Exception ignored when trying to write to the signal wakeup fd:
BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:
BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:
BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:
BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:
...
^CVersions: appdirs=1.4.3 boto=2.49.0 cmd:7z=16.02 cmd:annex=8.20200501+git61-g64e081d58-1~ndall+1 cmd:bundled-git=2.24.0 cmd:git=2.24.0 cmd:system-git=2.26.2 cmd:system-ssh=8.1p1 exifread=2.1.2 git=3.1.0 gitdb=4.0.2 humanize=2.3.0 iso8601=0.1.12 keyring=18.0.1 keyrings.alt=3.4.0 msgpack=0.6.2 mutagen=1.40.0 requests=2.23.0 tqdm=4.43.0 wrapt=1.11.2
Obscure filename: str=b' "\';a&b&c\xce\x94\xd0\x99\xd7\xa7\xd9\x85\xe0\xb9\x97\xe3\x81\x82 `| ' repr=' "\';a&b&cΔЙקم๗あ `| '
Encodings: default='utf-8' filesystem='utf-8' locale.prefered='UTF-8'
Environment: PATH='/home/yoh/proj/datalad/datalad-master/venvs/dev3/bin:/home/yoh/gocode/bin:/home/yoh/gocode/bin:/home/yoh/bin:/home/yoh/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin' LANG='en_US.UTF-8' GIT_PAGER='less --no-init --quit-if-one-screen' GIT_PYTHON_GIT_EXECUTABLE='/usr/lib/git-annex.linux/git'

The fun part is that travis seems to be happy. one of the appveyor runs failed but I do not see any error message, ends with normal

Ran 384 tests in 1183.819s
OK (SKIP=119)

but stops after the first tests sweep, without running next ones... strange

kyleam · 2020-05-14T21:21:03Z

fun part is that travis seems to be happy

FWIW I don't recall ever seeing an actual failure related to these messages. They're just alarming and, er, plentiful.

In my notes when working on gh-3660, I pointed to https://bugs.python.org/issue21595, but I'm not sure how relevant it actually is.

datalad/runner-asyncio-notes.org

Lines 2 to 16 in f79b5f2

    
           * BlockingIOError wakeup messages 
        
           When running some tests (e.g., test_gitrepo.py), the test pass but 
        
           there are lots of errors like this 
        
           #+begin_quote 
        
             Exception ignored when trying to write to the signal wakeup fd: 
        
             BlockingIOError: [Errno 11] Resource temporarily unavailable 
        
           #+end_quote 
        
           For more information, see [[https://bugs.python.org/issue21595][here]], where it was somewhat mitigated but 
        
           not solved. 
        
           I've blindly tried various ways of cleaning things up with asyncio, 
        
           but haven't been able to stumble onto anything that removes the error.

thanks for pointing to specific test file

Yeah, that's one test file I happened to recall showing a lot. But I don't think it's limited to that one.

yarikoptic · 2020-05-15T04:20:43Z

FWIW, eyeballed number of open files while running that test with this abominable cmdline

$> while true; do total=0; pstree -l -a --compact-not -T -p `pgrep -f 'python -m nose'` | sed -e 's,--library-path /[^ ]*,,g' -e 's,/usr/lib/git-annex.linux/shimmed/git/,,g' -e 's,--git-dir=.git --work-tree=. --literal-pathspecs -c annex.dotfiles=true,,g' | nl | while read l; do pid=$(echo "$l" | sed -e 's/.*,\([0-9][0-9]*\).*/\1/g'); of=$(lsof -Ki -p $pid 2>/dev/null|grep -v COMMAND | wc -l); echo "$l" | sed -e "s/,$pid/,$pid = $of/g"; total=$(($total + $of)) ; done; echo "Total: $total open files across all processes"; done

never goes too high...

I found https://stackoverflow.com/a/52391791/1265472 which points to the same issue on python bug tracker, provides summary which fits the bill for us:

"It looks like the limiting factor is running lots of short-lived subprocesses. " and extracts relevant information from the report:

"Exception ignored when trying to write to the signal wakeup fd" message comes from the signal handler in Modules/signalmodule.c. The problem is that Python gets a lot of SIGCHLD signals (the test scripts creates +300 processes per second on my computer). The producer (signal handler writing the signal number into the "self" pipe) is faster than the consumer (BaseSelectorEventLoop._read_from_self callback)."

which is probably the case for us with all quick config calls ;) and then provides a specific workaround:

"I changed my code to restrict how many create_subprocess_execes could run at once. I stopped seeing the errors when I got below 35, though I'll probably set it to 20 just to be sure. Your mileage may vary."

async def myTask(stuff, semaphore, loop):
    with semaphore:
        process = await asyncio.create_subprocess_exec('short_program', loop=loop)

def taskRunner(stuffs):
    loop = asyncio.get_event_loop()
    semaphore = asyncio.Semaphore(20)  # limit how many can run at a time
    tasks = [
        asyncio.ensure_future(myTask(semaphore, loop))
        for i in range(20000)
    ]
    loop.run_until_complete(asyncio.gather(*tasks))
    loop.close()

I am yet to give something like that a shot -- yet to grasp better the async constructs , someone is most welcome to see if we could do smth like that (add semaphore), and not sure if that somehow wouldn't slow us down even further.

yarikoptic · 2020-05-15T04:27:11Z

another interesting issue in pytest-asyncio, with proposed workaround which is pretty much what was done in our case - just restart the loop: pytest-dev/pytest-asyncio#114 but providing detailed Reason on how it comes about,although I am really not sure why something like that should be the case in our case. May be that is the "producer/consumer" issue mentioned above? I wonder if there is a way to check the state of the "consumer" and let it consume once in a while to assure that that queue gets emptied before we run subprocess again...

… handlers for logging Decided to do it this way, instead of just creating a logger with desired handling of its own ASYNCIO_* env variables using our LogHelper, since even though it would add more granular control, it would require setting all the env vars, and who knows -- might conflict with some "native" ones if somehow they manage to provide similar convenience ;) also anyways we need to explicitly enable debug for the loop. Since only in this branch we rely on a single loop, I did not cherry pick into a separate PR against stock master, but we might decide to do that later [ci skip]

Some time some proper "datalad config --list-variables" would come about [ci skip]

yarikoptic · 2020-05-15T15:09:07Z

FWIW, I was hopeful that asyncio debug log messages would tell us something, so I pushed c001220 (and also tune up to docs about _NAME variable). But unfortunately with running

DATALAD_LOG_PID=1 DATALAD_LOG_TIMESTAMP=1 DATALAD_LOG_NAME=1 DATALAD_LOG_LEVEL=2 python -m nose -vs -x datalad/interface/tests/test_rerun.py 2>&1 | tee /tmp/asyncioblocking.log

I see no asyncio log messages anywhere around the point where the pukes happen, which is quite surprising actually (if it is not a matter of logs-vs-stdout flushing etc). Full log is at http://www.onerussian.com/tmp/asyncioblocking-20200515-1.txt.gz .
So -- those pukes happen around git config calls where we do not even use WitlessRunner... so my main suspect is that it is indeed just some flushing issue since unlikely it is some that much delayed reaction from asyncio (although -- who knows) -- if only there were time stamps with those exception messages (should we add them to exc_str at least for our own? ;)).

timing:

preceding last asyncio msg:

2020-05-15 10:44:35,320 asyncio         {497115}[INFO] <_UnixSubprocessTransport pid=548254 running stdout=<_UnixReadPipeTransport closed fd=8 closed> stderr=<_UnixReadPipeTransport closed fd=10 closed>> exited with return code 0

puke

2020-05-15 10:44:45,505 datalad.cmd     {497115}[Level 5] Running: ['git', 'config', '-z', '-l', '--show-origin', '--file', '/home/yoh/.tmp/datalad_temp_test_run_inputs_outputstwmb22g4/s0/s1_1/s2/.datalad/config']
Exception ignored when trying to write to the signal wakeup fd:
BlockingIOError: [Errno 11] Resource temporarily unavailable
2020-05-15 10:44:45,543 datalad.cmd     {497115}[Level 8] Finished running ['git', 'config', '-z', '-l', '--show-origin', '--file', '/home/yoh/.tmp/datalad_temp_test_run_inputs_outputstwmb22g4/s0/s1_1/s2/.datalad/config'] with status 0

next asyncio below

2020-05-15 10:44:47,598 asyncio         {497115}[DEBUG] execute program 'git' stdout=<pipe> stderr=<pipe>

kyleam · 2020-05-15T15:18:44Z

[ This is a meta-comment. I haven't looked at any of your updates. ]

I would vote strongly against having any substantial change in this area being merged before the 0.13.0 release.

yarikoptic · 2020-05-15T15:25:06Z

yeap, I think so too, no worries ;) but we need to get to the bottom of it

yarikoptic · 2020-05-15T15:30:47Z

filed python bug report: https://bugs.python.org/issue40634

yarikoptic · 2020-06-15T12:49:17Z

No activity on bug report, and we have not figured it out -- so will close for now

yarikoptic added 2 commits May 13, 2020 20:02

ENH: reuse EventLoop instance within WitlessRunner instance

18c1b07

I am not sure why but so far I have not ran into the promised in the comments BlockingIOError on my laptop... that is why decided to try and see if there would be any positive OPT effect from reusing the same loop

RF: make event loop class level variable -- still works for me

3f38440

This was referenced May 14, 2020

FOI: system git vs "bundled" git; use conda more? #4535

Closed

idea: Be "pro-active" in config (re)loading? #4536

Closed

yarikoptic added 2 commits May 15, 2020 10:56

DOC: Add a note to CONTRIBUTING.md about DATALAD_LOG_NAME

6b3f547

Some time some proper "datalad config --list-variables" would come about [ci skip]

yarikoptic closed this Jun 15, 2020

yarikoptic added the stale-PR-closed-without-merge label Jun 15, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OPT: WitlessRunner -- reuse event loop #4527

OPT: WitlessRunner -- reuse event loop #4527

yarikoptic commented May 14, 2020

kyleam commented May 14, 2020 •

edited

yarikoptic commented May 14, 2020

kyleam commented May 14, 2020

yarikoptic commented May 15, 2020

yarikoptic commented May 15, 2020

yarikoptic commented May 15, 2020

kyleam commented May 15, 2020 •

edited

yarikoptic commented May 15, 2020

yarikoptic commented May 15, 2020

yarikoptic commented Jun 15, 2020

OPT: WitlessRunner -- reuse event loop #4527

OPT: WitlessRunner -- reuse event loop #4527

Conversation

yarikoptic commented May 14, 2020

kyleam commented May 14, 2020 • edited

yarikoptic commented May 14, 2020

kyleam commented May 14, 2020

yarikoptic commented May 15, 2020

yarikoptic commented May 15, 2020

yarikoptic commented May 15, 2020

kyleam commented May 15, 2020 • edited

yarikoptic commented May 15, 2020

yarikoptic commented May 15, 2020

yarikoptic commented Jun 15, 2020

kyleam commented May 14, 2020 •

edited

kyleam commented May 15, 2020 •

edited