Flaky SSH cloning failure in tests #6599
Comments
Same test, different failure, this time during
Something seems off with the SSH setup.
probably unrelated, since also didn't stop it, but why
I only flew over it, since that host is a generated docker - so I thought we can't know it. But worth having a closer look (especially what exactly is in that known_hosts).
If the GIN test would be running #6605 one could see a similar thing:
Hm. I'm confused. Just leaving a record here while I try to find out what's wrong, since the sessions are limited to an hour. Logging into a new AppVeyor build, first thing I notice:
Why exactly is Second:
This is not in a venv and not called by datalad, Next thing confusing:
Note that this is complaining about the same entry (154) that allegedly is offending for Now:
This seems all about gitlab, how does it relate to gin or localhost? Generally
Ok. So, it seems that these However, that's unlikely to address the original issue.
Just saw another manifestation of the problem. Again, same test, same build, failing at a different spot: https://ci.appveyor.com/project/mih/datalad/builds/43196583/job/qdbf2j1ss2vme01h#L4493
I will mark it as a
Added skip in 2f1c01f AKA 0.16.5-14-g2f1c01f69.
`datalad sshrun` explicitly calls SSH with `log_output=False`, which results in the use of the `NoCapture` protocol with the runner. Meaning, stdout/stderr of SSH is written out anyway already. When SSH returns, `sshrun` tried to write both to its stdout/stderr. But: It could not possibly have anything to write. That would not be an issue in and of itself, but `sshrun` is not necessarily used directly. In particular it is called by `git` (due to `GIT_SSH_COMMAND=datalad sshrun`). This resulted in a problem when apparently `git` has closed the pipe to its ssh executable (`sshrun`) already and we tried to write to it (although we really didn't even have something to write). This ultimately led to issue datalad#6599, where the actual `ssh ... git-upload-pack` execution succeeded and returned 0, but `datalad sshrun` itself produced a broken pipe error trying to write to stdout and hence returned non-zero. It's not entirely clear when exactly this happens. When the pipe is closed may depend on the git version, as the failing builds are running 2.35.1 (macOS on AppVeyor) whereas other builds have either newer or older versions of git. In any case: There can't be anything to write out to begin with, so don't even try. (Closes datalad#6599)
`datalad sshrun` explicitly calls SSH with `log_output=False`, which results in the use of the `NoCapture` protocol with the runner. Meaning, stdout/stderr of SSH is written out anyway already. When SSH returns, `sshrun` tried to write both to its stdout/stderr. But: It could not possibly have anything to write. That would not be an issue in and of itself, but `sshrun` is not necessarily used directly. In particular it is called by `git` (due to `GIT_SSH_COMMAND=datalad sshrun`). This resulted in a problem when apparently `git` has closed the pipe to its ssh executable (`sshrun`) already and we tried to write to it (although we really didn't even have something to write). This ultimately led to issue datalad#6599, where the actual `ssh ... git-upload-pack` execution succeeded and returned 0, but `datalad sshrun` itself produced a broken pipe error trying to write to stdout and hence returned non-zero. It's not entirely clear when exactly this happens. When the pipe is closed may depend on the git version, as the failing builds are running 2.35.1 (macOS on AppVeyor) whereas other builds have either newer or older versions of git. In any case: There can't be anything to write out to begin with, so don't even try. (Closes datalad#6599) (cherry picked from commit 3112fb5)
`datalad sshrun` explicitly calls SSH with `log_output=False`, which results in the use of the `NoCapture` protocol with the runner. Meaning, stdout/stderr of SSH is written out anyway already. When SSH returns, `sshrun` tried to write both to its stdout/stderr. But: It could not possibly have anything to write. That would not be an issue in and of itself, but `sshrun` is not necessarily used directly. In particular it is called by `git` (due to `GIT_SSH_COMMAND=datalad sshrun`). This resulted in a problem when apparently `git` has closed the pipe to its ssh executable (`sshrun`) already and we tried to write to it (although we really didn't even have something to write). This ultimately led to issue datalad#6599, where the actual `ssh ... git-upload-pack` execution succeeded and returned 0, but `datalad sshrun` itself produced a broken pipe error trying to write to stdout and hence returned non-zero. It's not entirely clear when exactly this happens. When the pipe is closed may depend on the git version, as the failing builds are running 2.35.1 (macOS on AppVeyor) whereas other builds have either newer or older versions of git. In any case: There can't be anything to write out to begin with, so don't even try. Also: Make it clear in the code that, and why, we don't expect any captured output from the SSH subprocess by not storing the empty return value, so future changes (and debuggers) don't falsely assume that 1. Output can simply be captured (with existing protocols) or 2. The returned value would currently be of any use simply b/c it's there. (Closes datalad#6599) (Closes datalad#7078)
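For readers unfamiliar with the failure mode described in the commit message, here is a minimal sketch of what writing to a pipe does once the reader has gone away. It is not the datalad code: `write_if_any` and `demo` are hypothetical names made up for illustration, and the example assumes a POSIX system running a regular Python interpreter (where SIGPIPE is ignored and surfaces as `BrokenPipeError`). Whether a zero-length write to such a pipe fails at all may depend on the platform, which is one more reason not to attempt it.

```python
import os


def write_if_any(fd: int, data: bytes) -> int:
    """Write data to fd, skipping the system call entirely when data is empty.

    Hypothetical helper for illustration only; the actual fix simply
    stopped writing, since there was never any captured output.
    """
    if not data:
        return 0
    return os.write(fd, data)


def demo() -> dict:
    """Show how a pipe behaves after its read end has been closed."""
    read_end, write_end = os.pipe()
    os.close(read_end)  # the reader (git, in the issue) is gone
    result = {}
    try:
        # Nothing to write: no system call is made, so the closed
        # pipe cannot cause an error here.
        result["empty"] = write_if_any(write_end, b"")
        try:
            # An actual write to a pipe without a reader fails with EPIPE.
            os.write(write_end, b"data")
            result["nonempty"] = "ok"
        except BrokenPipeError:
            result["nonempty"] = "BrokenPipeError"
    finally:
        os.close(write_end)
    return result


if __name__ == "__main__":
    print(demo())
```

The guard mirrors the reasoning in the commit message: if there cannot be anything to write, not touching the file descriptor at all avoids any dependence on when the other side closes the pipe.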
Extracted from #6550 (comment):
This only happens on AppVeyor (macOS again):
This seems flaky. Logging into that AppVeyor build showed that this happens at different spots in this test.
Sometimes this clone seems to work out fine, but then the subsequent `get` on a subdataset fails the same way.
So, currently the failure happens at line 958 in `test_clone.py`, and on the previous run (exact same commit) it only failed at line 1017.
Moreover, this should not be the only test where we clone from RIA via SSH. It is not clear to me yet how this one is different.
Looking into this, I am seeing a Broken Pipe Error:
And `git-upload-pack` seems a bit off indeed:
hanging at this point
And, of course, there's no problem running this right afterwards: