Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BF: explicitely terminate the Popen process if exception (e.g. KeyboardInterrupt) #3794

Merged
merged 3 commits into from Oct 20, 2019

Conversation

yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Oct 18, 2019

Otherwise it was observed (#3613) that batched parallel git annex get instances might still remain
running consuming bandwidth and resources.

I have added try/except around .terminate() call itself, although it might be
an overkill since in normal scenarios I do not expect that to fail.

I have tested locally by interrupting datalad get invocations, and it seems to do the trick!

Closes #3613

TODOs

  • test well, could be done, even if only for smoke testing that code logic, but wasn't in the best traditions of the fast progress I think I will not be lazy, making a test

…rdInterrupt)

Otherwise it was observed that batched git annex instances might still remain
running consuming bandwidth and resources.

I have added try/except around .terminate() call itself, although it might be
an overkill since in normal scenarios I do not expect that to fail.

Closes datalad#3613
@mih
Copy link
Member

@mih mih commented Oct 18, 2019

Thanks for doing that! This was a rather annoying behavior that has hit me frequently. Only the PY2 tests are failing (beloved unicode) -- those are not relevant for master.

I now realized that targeting 0.11 for such fixes will bypass all the extended windows tests, such that any potential failure would have to be dealt with in bulk merges from 0.11 into master -- this could easily turn into a nightmare, because of the division of labor we have ATM (I am doing most of the windows work, but I do not touch 0.11). Also, this further introduces more six usage into the codebase that needs to be taken out again. How shall we deal with this?

@codecov
Copy link

@codecov codecov bot commented Oct 18, 2019

Codecov Report

No coverage uploaded for pull request base (0.11.x@d5e7f6d). Click here to learn what that means.
The diff coverage is 33.33%.

Impacted file tree graph

@@            Coverage Diff            @@
##             0.11.x    #3794   +/-   ##
=========================================
  Coverage          ?   81.21%           
=========================================
  Files             ?      256           
  Lines             ?    33962           
  Branches          ?        0           
=========================================
  Hits              ?    27581           
  Misses            ?     6381           
  Partials          ?        0
Impacted Files Coverage Δ
datalad/support/exceptions.py 72.67% <100%> (ø)
datalad/cmd.py 88.25% <20%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d5e7f6d...aa43256. Read the comment docs.

@yarikoptic
Copy link
Member Author

@yarikoptic yarikoptic commented Oct 18, 2019

Only the PY2 tests are failing (beloved unicode) -- those are not relevant for master.

and I don't think all these crashes are really "caused" by these changes -- only brought up to surface.

What confuses me is that even though we have `__str__` defined for CommandError, it is not used when we/I do `str(exc)`:
*(Pdb) p exc.__str__()
u'CommandError: command \'[\'ssh\', \'-o\', \'ControlPath="/home/yoh/.cache/datalad/sockets/ef2d633e"\', \'localhost\', \'export "PATH=/usr/lib/git-annex.linux:$PATH"; mkdir -p /home/yoh/.tmp/datalad_temp_test_failon_no_permissions2LNBd3/ds\']\' failed with exitcode 1\nFailed to run [\'ssh\', \'-o\', \'ControlPath="/home/yoh/.cache/datalad/sockets/ef2d633e"\', \'localhost\', \'export "PATH=/usr/lib/git-annex.linux:$PATH"; mkdir -p /home/yoh/.tmp/datalad_temp_test_failon_no_permissions2LNBd3/ds\'] under None. Exit code=1. out= err=mkdir: cannot create directory \u2018/home/yoh/.tmp/datalad_temp_test_failon_no_permissions2LNBd3/ds\u2019: Permission denied\n'
*(Pdb) p str(exc)
*** UnicodeEncodeError: UnicodeEncodeError('ascii', u'CommandError: command \'[\'ssh\', \'-o\', \'ControlPath="/home/yoh/.cache/datalad/sockets/ef2d633e"\', \'localhost\', \'export "PATH=/usr/lib/git-annex.linux:$PATH"; mkdir -p /home/yoh/.tmp/datalad_temp_test_failon_no_permissions2LNBd3/ds\']\' failed with exitcode 1\nFailed to run [\'ssh\', \'-o\', \'ControlPath="/home/yoh/.cache/datalad/sockets/ef2d633e"\', \'localhost\', \'export "PATH=/usr/lib/git-annex.linux:$PATH"; mkdir -p /home/yoh/.tmp/datalad_temp_test_failon_no_permissions2LNBd3/ds\'] under None. Exit code=1. out= err=mkdir: cannot create directory \u2018/home/yoh/.tmp/datalad_temp_test_failon_no_permissions2LNBd3/ds\u2019: Permission denied\n', 542, 543, 'ordinal not in range(128)')

ok -- figured it out. It was failing while converting to bytes after __str__ returned unicode. Fix pushed, and 6bce96d 992ee4d should simply be reverted upon merge into master

... How shall we deal with this?

I guess only via the suboptimal workflow you have described. And probably you know that, it will be happening until 0.12 is released and I could use it without reverting to use/maintain 0.11 whenever it takes way too long to accomplish a basic operation.

@yarikoptic
Copy link
Member Author

@yarikoptic yarikoptic commented Oct 18, 2019

What I am curious about also is how to test this fix so whenever we RF the Runner we do not bring this issue back. Any ideas?

@mih
Copy link
Member

@mih mih commented Oct 18, 2019

and I could use it without reverting to use/maintain 0.11 whenever it takes way too long to accomplish a basic operation.

Please remember to file issues for these cases. At the moment there is only a single issue with a percentage difference. I cannot fix what I dont know.

@yarikoptic
Copy link
Member Author

@yarikoptic yarikoptic commented Oct 18, 2019

another aspect is that we have more Popen calls, but I guess it should be a separate issue, so here is #3800 to either have it fixed or just closed because of old age.

@yarikoptic
Copy link
Member Author

@yarikoptic yarikoptic commented Oct 18, 2019

Please remember to file issues for these cases. At the moment there is only a single issue with a percentage difference.

yeap -- AFAIK the most grave exponential growth was addressed.

I cannot fix what I dont know.

sure thing!

Otherwise __str__ was returning unicode, which could fail to encode into bytes.

Should be reverted in master where there is no PY2 compatibility
@yarikoptic yarikoptic force-pushed the bf-terminate-subprocess branch from 6bce96d to 992ee4d Compare Oct 18, 2019
@yarikoptic
Copy link
Member Author

@yarikoptic yarikoptic commented Oct 19, 2019

grr , for some reason datalad-crawler run now gets stuck right away:

$ http_proxy= PATH=$PWD/../tools/coverage-bin:$PATH $NOSE_WRAPPER `which nosetests` $NOSE_OPTS -A "$NOSE_SELECTION_OP($NOSE_SELECTION)" --with-doctest --with-cov --cover-package datalad --logging-level=INFO $TESTS_TO_PERFORM
...
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received

whenever 4 days ago 0.11.x completed just fine.

@yarikoptic yarikoptic mentioned this pull request Oct 19, 2019
@mih mih added this to to-be-categorized in Release 0.12 via automation Oct 19, 2019
@mih mih removed this from to-be-categorized in Release 0.12 Oct 19, 2019
@yarikoptic yarikoptic force-pushed the bf-terminate-subprocess branch 3 times, most recently from 2e41e24 to 362c0a0 Compare Oct 19, 2019
@yarikoptic yarikoptic force-pushed the bf-terminate-subprocess branch from 362c0a0 to aa43256 Compare Oct 19, 2019
@yarikoptic
Copy link
Member Author

@yarikoptic yarikoptic commented Oct 20, 2019

ok, I restricted to non-CommandError and the stall went away, so ready to be merged

@mih mih merged commit 8eefa55 into datalad:0.11.x Oct 20, 2019
5 checks passed
mih added a commit to mih/datalad that referenced this issue Feb 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants