Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BF: don't write output from within sshrun #7072

Merged
merged 1 commit into from Oct 12, 2022

Conversation

bpoldrack
Copy link
Member

@bpoldrack bpoldrack commented Oct 7, 2022

datalad sshrun explicitly calls SSH with log_output=False which results in the use of NoCapture protocol with the runner. Meaning, stdout/stderr of SSH is written out anyway already. When SSH returns, sshrun tried to write both to its stdout/stderr. But: It could not possibly have anything to write. That would not be an issue in and of itself, but sshrun is not necessarily used directly. In particular it is called by git (due to GIT_SSH_COMMAND=datalad sshrun). This resulted in a problem when apparently git has closed the pipe to its ssh executable (sshrun) already and we tried to write to it (although we really didn't even have something to write).

This ultimately led to issue #6599, where the actual ssh ... git-upload-pack execution succeeded and returned 0, but datalad sshrun itself produced a broken pipe error trying to write to stdout and hence returning non-zero.

It's not entirely clear when exactly this happens. When exactly the pipe is closed may be depend on git version as the failing builds are running 2.35.1 (MacOS on appveyor) whereas otherbuilds have either newer or older versions of git. In any case:
There can't be anything to write out to begin with, so don't even try.

Closes #6599
Closes #7078

Copy link
Member

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a plain removal of these lines is not good. It is all but clear that there could not be any output to write from the code in this file. We are doomed to swallow something important at some point.

If this assertion hold (and needs to hold), an assertion should be in place.

But why not simple guard the write call, and turn it into a best-effort attempt? If a pipe is closed, it wont matter, it could not go anywhere anyways. But of it is open, and there is something, it will come out.

And this seems better than an assertion that does not hold, as it will leave the user with a crash.

WDYT?

@bpoldrack
Copy link
Member Author

bpoldrack commented Oct 7, 2022

@mih : Well, I could do that obviously, but I don't see why. log_output=False is unconditional and it should be. A possibly captured output of the SSH call is currently piped through already. Why would we want to change that eventually in favor of a delayed output at the end (same content)? It sabotages any stdin/stdout communication with the caller for no obvious benefit.

Edit: The only reason I can see, how it's "all but clear that there could not be any output to write from the code" is, that the behavior of the log_output option to SSHConnection's __call__ is not documented. That's certainly worth changing. Do you see another reason why it's not clear?

@bpoldrack bpoldrack added the semver-patch Increment the patch version when merged label Oct 7, 2022
@bpoldrack
Copy link
Member Author

@mih : How about this 2231ba2 ? (Will squash if you agree).
Thing is: If out and err exist, future debugging efforts can easily be set on the wrong track, assuming that there should be something in there, since the return value is stored (and even tried to use for writing).

@codecov
Copy link

codecov bot commented Oct 7, 2022

Codecov Report

Base: 74.76% // Head: 75.90% // Increases project coverage by +1.13% 🎉

Coverage data is based on head (2231ba2) compared to base (748e5c6).
Patch coverage: 50.00% of modified lines in pull request are covered.

❗ Current head 2231ba2 differs from pull request most recent head e0b357d. Consider uploading reports for the commit e0b357d to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##            maint    #7072      +/-   ##
==========================================
+ Coverage   74.76%   75.90%   +1.13%     
==========================================
  Files         354      354              
  Lines       58945    58940       -5     
  Branches     6310     6310              
==========================================
+ Hits        44072    44736     +664     
+ Misses      14858    14189     -669     
  Partials       15       15              
Impacted Files Coverage Δ
datalad/core/distributed/tests/test_clone.py 76.31% <ø> (+0.82%) ⬆️
datalad/support/sshrun.py 90.24% <33.33%> (-7.26%) ⬇️
datalad/support/sshconnector.py 65.13% <100.00%> (-0.97%) ⬇️
datalad/_version.py 45.68% <0.00%> (-0.28%) ⬇️
datalad/support/tests/test_annexrepo.py 70.49% <0.00%> (+0.09%) ⬆️
datalad/__init__.py 86.00% <0.00%> (+2.50%) ⬆️
datalad/tests/utils.py 51.98% <0.00%> (+20.65%) ⬆️
... and 1 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@mih
Copy link
Member

mih commented Oct 8, 2022

Added documentation is essential. Other than that I think the new implementation is as problematic re wrong assumptions about behavior as the one before. Your rational is based on assertions that are not in the code.

Anyways, no need to convince me.

@bpoldrack
Copy link
Member Author

bpoldrack commented Oct 8, 2022

Your rational is based on assertions that are not in the code.

I honestly don't understand what that assertion is. The now documented behavior of log_output=False is the implemented behavior of it and it's the entire point of the switch existing to begin with. I have no idea what "is not in the code". The switch says: "do not grab the output".

To me it's kinda the opposite. If I in 6 months time see sshrun implementing something like this:

out, err = ssh(...)

try:
   os.write(1, out)
except
   pass

I wonder how long it will take me to realize again, that with current implementation there can't be anything in that out and err.

May be I'm not seeing something. I really don't know what it is.

WDYT, @datalad/developers ?

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my thinking about it -- I think we better assert the assumptions of the code especially if there was prior code thinking otherwise

changelog.d/20221007_180414_benjaminpoldrack_fix_6599.md Outdated Show resolved Hide resolved
datalad/support/sshrun.py Outdated Show resolved Hide resolved
`datalad sshrun` explicitly calls SSH with `log_output=False` which
results in the use of `NoCapture` protocol with the runner. Meaning,
stdout/stderr of SSH is written out anyway already. When SSH returns,
`sshrun` tried to write both to its stdout/stderr. But: It could not
possibly have anything to write. That would not be an issue in and of
itself, but `sshrun` is not necessarily used directly. In particular it
is called by `git` (due to `GIT_SSH_COMMAND=datalad sshrun`). This
resulted in a problem when apparently `git` has closed the pipe to its
ssh executable (`sshrun`) already and we tried to write to it (although
we really didn't even have something to write).

This ultimately led to issue datalad#6599, where the actual `ssh ...
git-upload-pack` execution succeeded and returned 0, but `datalad
sshrun` itself produced a broken pipe error trying to write to stdout
and hence returning non-zero.

It's not entirely clear when exactly this happens. It may be depend on
git version when the pipe is closed as the failing builds are running
2.35.1 (MacOS on appveyor) whereas otherbuilds have either newer or
older versions of git. In any case: There can't be anything to write out
to begin with, so don't even try.

Also: Make it clear in the code, that and why we don't expect any
captured output from the SSH subprocess by not storing the empty return
value, so future changes (and debuggers) don't falsely assume that
1. Output can simply be captured (with existing protocols) or
2. The returned value would currently be of any use simply b/c it's
   there.

(Closes datalad#6599)
(Closes datalad#7078)
@yarikoptic
Copy link
Member

Thank you @bpoldrack - let's proceed!

@yarikoptic-gitmate
Copy link
Collaborator

PR released in 0.17.7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-patch Increment the patch version when merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants