Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

run_async_cmd: Work around dropped output of speedy commands #4835

Merged
merged 2 commits into from Aug 27, 2020

Conversation

kyleam
Copy link
Collaborator

@kyleam kyleam commented Aug 26, 2020

These patches hopefully work around the WitlessRunner failing to capture the output of quickly exiting commands.

#4773 (comment)

datalad-based reproducer mentioned in second commit
import sys
import tempfile

import datalad.api as dl
from datalad.support.exceptions import CommandError
from datalad.support.exceptions import FileNotInRepositoryError

ntimes = int(sys.argv[1]) if len(sys.argv) > 1 else 40

ds = dl.Dataset(tempfile.mkdtemp(prefix="dl-dropped-stderr-"))
ds.create()
print(ds)

for i in range(ntimes):
    print(i)
    try:
        ds.repo.commit(files="not-existing")
    except FileNotInRepositoryError:
        pass   # expected result
    except CommandError as e:
        print(f"\n\nError on iteration {i + 1}\n\n"
              f"e.stdout {e.stdout!r}\n\n"
              f"e.stderr {e.stderr!r}\n")
        raise

print("no unexpected errors")

kyleam added 2 commits Aug 26, 2020
WitlessProtocol.process_exited() stores the result in the Future
object that run_async_cmd() waits for.  That usually works, but if the
command completes quickly, process_exited() may be called before
pipe_data_received(), in which case the result will not have the
output.

Instead make run_async_cmd() retrieve the result directly from the
protocol.  This makes it possible to add another await that is not
tied to process_exited() call, which will be done in the next commit.
And it probably isn't too strange of an approach because it follows
what's done in the DateProtocol example of the Python documentation.

https://docs.python.org/3/library/asyncio-protocol.html#loop-subprocess-exec-and-subprocessprotocol
run_async_cmd() waits for a Future that is marked as done in
WitlessProtocol.process_exited() and then collects the output---which
was received via WitlessProtocol.pipe_data_received()---and returns
it.  However, if the command exits quickly, process_exited() may be
called before pipe_data_received(), in which case run_async_cmd()
returns empty output.  Some of our calls to git seem fast enough to
trigger this issue (dataladgh-4773).

This appears to be a bug in asyncio [1] and can be triggered using a
Protocol class that is identical to one in the Python documentation
[2] aside from using a faster command.  Even if this is resolved
upstream, though, we need a workaround for the Python versions that we
support.

Attempt to reliably capture the output by waiting for
`transport._wait()`.  While there may be more proper ways to solve
this, it's the only workaround I've been able to come up.  The
following suggests that it might do the trick:

  * create_subprocess_{exec,shell} don't seem to suffer from this
    problem.  The Process object they return has a .wait method that
    waits for `transport._wait`.  However, those functions use a very
    different protocol than WitlessProtocol, so it's possible that
    that wait isn't the key difference.  Still, it suggests it's a
    safe thing to do.

  * An open PR, dataladgh-4699, switches the ConfigManager over to using the
    WitlessRunner and is triggering a spread of failures that is
    plausibly due to losing output of the quick 'git config' calls.
    Merging the change here into that PR resulted in a passing Travis
    build [3].

  * With the script posted at the Python bug report, I've been unable
    to trigger any dropped output on two runs with 10000 iterations if
    I add `await asyncio.ensure_future(transport._wait())`.  Without
    that line, I've consistently been able to trigger it with many
    fewer iterations.

  * In a reproducer script (included in this PR) that involves Datalad
    and `git commit -- non-existing-file` (inspired by dataladgh-4773), I've
    been unable to trigger the error in hundreds of iterations with
    the changes in this series.  Before these changes, I've been able
    to consistently trigger it using fewer than 100 iterations.

This reverts 2955ba1 (TST: Skip known failure in
test_AnnexRepo_commit, 2020-08-04), as it should resolve the source of
that flaky test failure.

[1] Filed at https://bugs.python.org/issue41594
[2] https://docs.python.org/3/library/asyncio-protocol.html#loop-subprocess-exec-and-subprocessprotocol
[3] https://travis-ci.org/github/datalad/datalad/builds/721449200

Closes datalad#4773
@codecov
Copy link

codecov bot commented Aug 26, 2020

Codecov Report

Merging #4835 into maint will increase coverage by 0.05%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##            maint    #4835      +/-   ##
==========================================
+ Coverage   89.63%   89.68%   +0.05%     
==========================================
  Files         288      288              
  Lines       40371    40371              
==========================================
+ Hits        36186    36207      +21     
+ Misses       4185     4164      -21     
Impacted Files Coverage Δ
datalad/cmd.py 92.40% <100.00%> (+0.04%) ⬆️
datalad/support/tests/test_annexrepo.py 96.73% <100.00%> (-0.01%) ⬇️
datalad/core/distributed/clone.py 88.62% <0.00%> (-0.60%) ⬇️
datalad/support/gitrepo.py 90.40% <0.00%> (ø)
datalad/downloaders/base.py 78.92% <0.00%> (+0.35%) ⬆️
datalad/downloaders/http.py 84.94% <0.00%> (+3.08%) ⬆️
datalad/downloaders/tests/test_http.py 91.15% <0.00%> (+3.43%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bb327c9...b61bb51. Read the comment docs.

@mih
Copy link
Member

mih commented Aug 27, 2020

Ah! Wonderful! Now that you figured it out, it makes total sense ;-)

@mih
Copy link
Member

mih commented Aug 27, 2020

The reproducer worked perfectly for me -- both in triggering the failure in master and passing with this PR.

I am excited and I will merge this now.

Thx very much @kyleam

@mih mih merged commit 160d16a into datalad:maint Aug 27, 2020
4 checks passed
@mih
Copy link
Member

mih commented Aug 27, 2020

It has the desired effect on #4699 too!

@kyleam
Copy link
Collaborator Author

kyleam commented Aug 27, 2020

It has the desired effect on #4699 too!

To add another data point to that, I observed the same thing when I pushed a scratch branch before submitting the PR:

  * An open PR, gh-4699, switches the ConfigManager over to using the
    WitlessRunner and is triggering a spread of failures that is
    plausibly due to losing output of the quick 'git config' calls.
    Merging the change here into that PR resulted in a passing Travis
    build [3].

...

[3] https://travis-ci.org/github/datalad/datalad/builds/721449200

@kyleam kyleam deleted the asyncio-too-fast branch Aug 27, 2020
@yarikoptic
Copy link
Member

yarikoptic commented Aug 27, 2020

Awesome, thank you @kyleam ! I would even say - worth a release!

@yarikoptic
Copy link
Member

yarikoptic commented Aug 27, 2020

Note: benchmarks seems to be unaffected

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants