Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPT: WitlessRunner -- reuse event loop #4527

Closed
wants to merge 4 commits into from

Conversation

yarikoptic
Copy link
Member

Comments in the code (removed in this WiP PR) mentioned BlockingIOError. I have tried on datalad/tests/test_witless_runner.py and a limited selection of unittests and then with other pythons (using conda) and didn't trigger it.

the only strange side effect I got is with python 3.5.5 which was not present on master is an ignored TypeError from `__del__` deeper within python
$> python -m nose -s -v datalad/tests/test_witless_runner.py
datalad.tests.test_witless_runner.test_runner_stderr_capture ... ok
datalad.tests.test_witless_runner.test_runner_stdout_capture ... ok
datalad.tests.test_witless_runner.test_runner_parametrized_protocol ... ok
datalad.tests.test_witless_runner.test_runner_cwd_encoding ... ok
datalad.tests.test_witless_runner.test_runner_failure ... ok
datalad.tests.test_witless_runner.test_runner_fix_PWD ... ok
datalad.tests.test_witless_runner.test_runner_stdin ... ok
datalad.tests.test_witless_runner.test_runner ... ok
Versions: appdirs=1.4.3 boto=2.49.0 cmd:7z=16.02 cmd:annex=8.20200226-g2d3ef2c07 cmd:bundled-git=2.25.1 cmd:git=2.25.1 cmd:system-git=2.26.2 cmd:system-ssh=8.1p1 exifread=2.1.2 git=3.1.2 gitdb=4.0.5 humanize=0.5.1 iso8601=0.1.12 keyring=13.2.1 keyrings.alt=3.1 msgpack=0.5.6 mutagen=1.41.1 requests=2.23.0 wrapt=1.10.11
Obscure filename: str=b' "\';a&b&c\xce\x94\xd0\x99\xd7\xa7\xd9\x85\xe0\xb9\x97\xe3\x81\x82 `| ' repr=' "\';a&b&cΔЙקم๗あ `| '
Encodings: default='utf-8' filesystem='utf-8' locale.prefered='UTF-8'
Environment: GIT_PYTHON_GIT_EXECUTABLE='/home/yoh/anaconda-5.2.0-2.7/envs/datalad-py3.5/share/git-annex-8.20200226-1/git' GIT_PAGER='less --no-init --quit-if-one-screen' PATH='/home/yoh/anaconda-5.2.0-2.7/envs/datalad-py3.5/bin:/home/yoh/anaconda-5.2.0-2.7/bin/:/home/yoh/gocode/bin:/home/yoh/gocode/bin:/home/yoh/bin:/home/yoh/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin' LANG='en_US.UTF-8'

----------------------------------------------------------------------
Ran 8 tests in 1.745s

OK
Exception ignored in: <bound method BaseEventLoop.__del__ of <_UnixSelectorEventLoop running=False closed=True debug=False>>
Traceback (most recent call last):
  File "/home/yoh/anaconda-5.2.0-2.7/envs/datalad-py3.5/lib/python3.5/asyncio/base_events.py", line 511, in __del__
  File "/home/yoh/anaconda-5.2.0-2.7/envs/datalad-py3.5/lib/python3.5/asyncio/unix_events.py", line 65, in close
  File "/home/yoh/anaconda-5.2.0-2.7/envs/datalad-py3.5/lib/python3.5/asyncio/unix_events.py", line 146, in remove_signal_handler
  File "/home/yoh/anaconda-5.2.0-2.7/envs/datalad-py3.5/lib/python3.5/signal.py", line 47, in signal
TypeError: signal handler must be signal.SIG_IGN, signal.SIG_DFL, or a callable object

$> python --version
Python 3.5.5

So decided to test across our CIs while hopefully they aren't too busy doing useful stuff.
I also moved loop into class variable since I guess eventually if we introduce more of async, we would need a central one (I guess). On local run of benchmarks, no notable differences were detected.

I am not sure why but so far I have not ran into the promised in the
comments BlockingIOError on my laptop... that is why decided to try and
see if there would be any positive OPT effect from reusing the same loop
@kyleam
Copy link
Contributor

kyleam commented May 14, 2020

It'd be great if these BlockIOError lines just vanished. I recall being flooded by those when working on my attempt at the asyncio runner and when playing around with Michael's earlier version. But sadly, with your change, Travis still shows them: https://travis-ci.org/github/datalad/datalad/jobs/686818492#L2700

And locally I still see them. Here's one example.

With this PR:

% python -m nose -vs -x datalad/interface/tests/test_rerun.py >/tmp/rerun-pr 2>&1
% grep -c Blocking /tmp/rerun-pr
115

And on with its base (344c628):

% python -m nose -vs -x datalad/interface/tests/test_rerun.py >/tmp/rerun-base 2>&1
% grep -c Blocking /tmp/rerun-base 
0

@yarikoptic
Copy link
Member Author

"Fun"... thanks for pointing to specific test file.

Reproduced locally!
(git)lena:~datalad/datalad-master[enh-witless]git
$> python -m nose -vs -x datalad/interface/tests/test_rerun.py
datalad.interface.tests.test_rerun.test_new_or_modified ... ok
datalad.interface.tests.test_rerun.test_placeholders ... ok
datalad.interface.tests.test_rerun.test_rerun_explicit ... ok
datalad.interface.tests.test_rerun.test_rerun_subdir ... ok
datalad.interface.tests.test_rerun.test_run_inputs_outputs ... Exception ignored when trying to write to the signal wakeup fd:
BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:
BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:
BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:
BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:
...
^CVersions: appdirs=1.4.3 boto=2.49.0 cmd:7z=16.02 cmd:annex=8.20200501+git61-g64e081d58-1~ndall+1 cmd:bundled-git=2.24.0 cmd:git=2.24.0 cmd:system-git=2.26.2 cmd:system-ssh=8.1p1 exifread=2.1.2 git=3.1.0 gitdb=4.0.2 humanize=2.3.0 iso8601=0.1.12 keyring=18.0.1 keyrings.alt=3.4.0 msgpack=0.6.2 mutagen=1.40.0 requests=2.23.0 tqdm=4.43.0 wrapt=1.11.2
Obscure filename: str=b' "\';a&b&c\xce\x94\xd0\x99\xd7\xa7\xd9\x85\xe0\xb9\x97\xe3\x81\x82 `| ' repr=' "\';a&b&cΔЙקم๗あ `| '
Encodings: default='utf-8' filesystem='utf-8' locale.prefered='UTF-8'
Environment: PATH='/home/yoh/proj/datalad/datalad-master/venvs/dev3/bin:/home/yoh/gocode/bin:/home/yoh/gocode/bin:/home/yoh/bin:/home/yoh/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin' LANG='en_US.UTF-8' GIT_PAGER='less --no-init --quit-if-one-screen' GIT_PYTHON_GIT_EXECUTABLE='/usr/lib/git-annex.linux/git'

The fun part is that travis seems to be happy. one of the appveyor runs failed but I do not see any error message, ends with normal

Ran 384 tests in 1183.819s
OK (SKIP=119)

but stops after the first tests sweep, without running next ones... strange

@kyleam
Copy link
Contributor

kyleam commented May 14, 2020

fun part is that travis seems to be happy

FWIW I don't recall ever seeing an actual failure related to these messages. They're just alarming and, er, plentiful.

In my notes when working on gh-3660, I pointed to https://bugs.python.org/issue21595, but I'm not sure how relevant it actually is.

* BlockingIOError wakeup messages
When running some tests (e.g., test_gitrepo.py), the test pass but
there are lots of errors like this
#+begin_quote
Exception ignored when trying to write to the signal wakeup fd:
BlockingIOError: [Errno 11] Resource temporarily unavailable
#+end_quote
For more information, see [[https://bugs.python.org/issue21595][here]], where it was somewhat mitigated but
not solved.
I've blindly tried various ways of cleaning things up with asyncio,
but haven't been able to stumble onto anything that removes the error.

thanks for pointing to specific test file

Yeah, that's one test file I happened to recall showing a lot. But I don't think it's limited to that one.

@yarikoptic
Copy link
Member Author

FWIW, eyeballed number of open files while running that test with this abominable cmdline

$> while true; do total=0; pstree -l -a --compact-not -T -p `pgrep -f 'python -m nose'` | sed -e 's,--library-path /[^ ]*,,g' -e 's,/usr/lib/git-annex.linux/shimmed/git/,,g' -e 's,--git-dir=.git --work-tree=. --literal-pathspecs -c annex.dotfiles=true,,g' | nl | while read l; do pid=$(echo "$l" | sed -e 's/.*,\([0-9][0-9]*\).*/\1/g'); of=$(lsof -Ki -p $pid 2>/dev/null|grep -v COMMAND | wc -l); echo "$l" | sed -e "s/,$pid/,$pid = $of/g"; total=$(($total + $of)) ; done; echo "Total: $total open files across all processes"; done

never goes too high...

I found https://stackoverflow.com/a/52391791/1265472 which points to the same issue on python bug tracker, provides summary which fits the bill for us:

"It looks like the limiting factor is running lots of short-lived subprocesses. " and extracts relevant information from the report:

"Exception ignored when trying to write to the signal wakeup fd" message comes from the signal handler in Modules/signalmodule.c. The problem is that Python gets a lot of SIGCHLD signals (the test scripts creates +300 processes per second on my computer). The producer (signal handler writing the signal number into the "self" pipe) is faster than the consumer (BaseSelectorEventLoop._read_from_self callback)."

which is probably the case for us with all quick config calls ;) and then provides a specific workaround:

"I changed my code to restrict how many create_subprocess_execes could run at once. I stopped seeing the errors when I got below 35, though I'll probably set it to 20 just to be sure. Your mileage may vary."

async def myTask(stuff, semaphore, loop):
    with semaphore:
        process = await asyncio.create_subprocess_exec('short_program', loop=loop)

def taskRunner(stuffs):
    loop = asyncio.get_event_loop()
    semaphore = asyncio.Semaphore(20)  # limit how many can run at a time
    tasks = [
        asyncio.ensure_future(myTask(semaphore, loop))
        for i in range(20000)
    ]
    loop.run_until_complete(asyncio.gather(*tasks))
    loop.close()

I am yet to give something like that a shot -- yet to grasp better the async constructs , someone is most welcome to see if we could do smth like that (add semaphore), and not sure if that somehow wouldn't slow us down even further.

@yarikoptic
Copy link
Member Author

another interesting issue in pytest-asyncio, with proposed workaround which is pretty much what was done in our case - just restart the loop: pytest-dev/pytest-asyncio#114 but providing detailed Reason on how it comes about,although I am really not sure why something like that should be the case in our case. May be that is the "producer/consumer" issue mentioned above? I wonder if there is a way to check the state of the "consumer" and let it consume once in a while to assure that that queue gets emptied before we run subprocess again...

… handlers for logging

Decided to do it this way, instead of just creating a logger with
desired handling of its own ASYNCIO_* env variables using our
LogHelper, since even though it would add more granular control, it
would require setting all the env vars, and who knows -- might
conflict with some "native" ones if somehow they manage to provide
similar convenience ;)  also anyways we need to explicitly enable
debug for the loop.

Since only in this branch we rely on a single loop, I did not cherry
pick into a separate PR against stock master, but we might decide to
do that later

[ci skip]
Some time some proper "datalad config --list-variables" would come
about

[ci skip]
@yarikoptic
Copy link
Member Author

FWIW, I was hopeful that asyncio debug log messages would tell us something, so I pushed c001220 (and also tune up to docs about _NAME variable). But unfortunately with running

DATALAD_LOG_PID=1 DATALAD_LOG_TIMESTAMP=1 DATALAD_LOG_NAME=1 DATALAD_LOG_LEVEL=2 python -m nose -vs -x datalad/interface/tests/test_rerun.py 2>&1 | tee /tmp/asyncioblocking.log

I see no asyncio log messages anywhere around the point where the pukes happen, which is quite surprising actually (if it is not a matter of logs-vs-stdout flushing etc). Full log is at http://www.onerussian.com/tmp/asyncioblocking-20200515-1.txt.gz .
So -- those pukes happen around git config calls where we do not even use WitlessRunner... so my main suspect is that it is indeed just some flushing issue since unlikely it is some that much delayed reaction from asyncio (although -- who knows) -- if only there were time stamps with those exception messages (should we add them to exc_str at least for our own? ;)).

timing:

  • preceding last asyncio msg:
2020-05-15 10:44:35,320 asyncio         {497115}[INFO] <_UnixSubprocessTransport pid=548254 running stdout=<_UnixReadPipeTransport closed fd=8 closed> stderr=<_UnixReadPipeTransport closed fd=10 closed>> exited with return code 0
  • puke
2020-05-15 10:44:45,505 datalad.cmd     {497115}[Level 5] Running: ['git', 'config', '-z', '-l', '--show-origin', '--file', '/home/yoh/.tmp/datalad_temp_test_run_inputs_outputstwmb22g4/s0/s1_1/s2/.datalad/config']
Exception ignored when trying to write to the signal wakeup fd:
BlockingIOError: [Errno 11] Resource temporarily unavailable
2020-05-15 10:44:45,543 datalad.cmd     {497115}[Level 8] Finished running ['git', 'config', '-z', '-l', '--show-origin', '--file', '/home/yoh/.tmp/datalad_temp_test_run_inputs_outputstwmb22g4/s0/s1_1/s2/.datalad/config'] with status 0
  • next asyncio below
2020-05-15 10:44:47,598 asyncio         {497115}[DEBUG] execute program 'git' stdout=<pipe> stderr=<pipe>

@kyleam
Copy link
Contributor

kyleam commented May 15, 2020

[ This is a meta-comment. I haven't looked at any of your updates. ]

I would vote strongly against having any substantial change in this area being merged before the 0.13.0 release.

@yarikoptic
Copy link
Member Author

yeap, I think so too, no worries ;) but we need to get to the bottom of it

@yarikoptic
Copy link
Member Author

filed python bug report: https://bugs.python.org/issue40634

@yarikoptic
Copy link
Member Author

No activity on bug report, and we have not figured it out -- so will close for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants