Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky SSH cloning failure in tests #6599

Closed
bpoldrack opened this issue Mar 30, 2022 · 10 comments
Closed

Flaky SSH cloning failure in tests #6599

bpoldrack opened this issue Mar 30, 2022 · 10 comments
Labels
spurious-test-failure Tests that fail unreliably

Comments

@bpoldrack
Copy link
Member

Extracted from #6550 (comment):

This only happens on AppVeyor (macOS again):

======================================================================
ERROR: datalad.core.distributed.tests.test_clone.test_ria_postclonecfg('ssh://datalad-test:/Users/appveyor/DLTMP/datalad_temp_ix8umpb9', '07c27167-6fef-443c-bbb7-3eec35daddc3')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/appveyor/venv3.8.12/lib/python3.8/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/Users/appveyor/projects/datalad/datalad/tests/utils.py", line 288, in _wrap_skip_ssh
    return func(*args, **kwargs)
  File "/Users/appveyor/projects/datalad/datalad/tests/utils.py", line 874, in _wrap_with_tempfile
    return t(*(arg + (filename,)), **kw)
  File "/Users/appveyor/projects/datalad/datalad/tests/utils.py", line 874, in _wrap_with_tempfile
    return t(*(arg + (filename,)), **kw)
  File "/Users/appveyor/projects/datalad/datalad/core/distributed/tests/test_clone.py", line 958, in _test_ria_postclonecfg
    riaclone = clone('ria+{}#{}'.format(url, dsid), clone_path)
  File "/Users/appveyor/projects/datalad/datalad/interface/utils.py", line 447, in eval_func
    return return_func(*args, **kwargs)
  File "/Users/appveyor/projects/datalad/datalad/interface/utils.py", line 439, in return_func
    results = list(results)
  File "/Users/appveyor/projects/datalad/datalad/interface/utils.py", line 424, in generator_func
    raise IncompleteResultsError(
datalad.support.exceptions.IncompleteResultsError: Command did not complete successfully. 1 failed:
[{'action': 'install',
  'message': ('Failed to clone from any candidate source URL. Encountered '
              'errors per each url were:\n'
              '- %s',
              'ssh://datalad-test/Users/appveyor/DLTMP/datalad_temp_ix8umpb9/07c/27167-6fef-443c-bbb7-3eec35daddc3\n'
              "  CommandError: 'git -c diff.ignoreSubmodules=none clone "
              '--progress '
              'ssh://datalad-test/Users/appveyor/DLTMP/datalad_temp_ix8umpb9/07c/27167-6fef-443c-bbb7-3eec35daddc3 '
              "/Users/appveyor/DLTMP/datalad_temp__test_ria_postclonecfgw5zk_49p' "
              "failed with exitcode 128 [err: 'Cloning into "
              "'/Users/appveyor/DLTMP/datalad_temp__test_ria_postclonecfgw5zk_49p'...\n"
              '\r'
              'remote: Total 37 (delta 7), reused 0 (delta 0)        \n'
              "fatal: remote transport reported error']"),
  'path': '/Users/appveyor/DLTMP/datalad_temp__test_ria_postclonecfgw5zk_49p',
  'source_url': 'ria+ssh://datalad-test:/Users/appveyor/DLTMP/datalad_temp_ix8umpb9#07c27167-6fef-443c-bbb7-3eec35daddc3',
  'status': 'error',
  'type': 'dataset'}]

This seems flaky. Logging into that AppVeyor build, showed, that this happens at different spots in this test.
Sometimes this clone seems to work out fine but then the subsequent get on a subdataset fails the same way.
So, currently the failure happens at line 958 in test_clone.py and on previous run (exact same commit) it only failed at line 1017.
Moreover, this should not be the only test where we clone from RIA via SSH. Not clear to me yet, how this is one is different.

Looking into this, I am seeing a Broken Pipe Error:

[DEBUG] ...>runner:192  Finished ['ssh', '-o', 'ControlPath=/Users/appveyor/Library/Caches/datalad/sockets/fb3f4327', '-o', 'SendEnv=GIT_PROTOCOL', 'datalad-test', "git-upload-pack '/private/var/folders/5s/g225f6nd6jl4g8tshbh1ltk40000gn/T/datalad_temp_vlst5whp/376/1c829-d43c-420a-95fb-4467944477c4'"] with status 0 
[ERROR] ...>main:136,185  [Errno 32] Broken pipe (BrokenPipeError) 
fatal: remote transport reported error']

And git-upload-pack seems a bit off indeed:

appveyor$ git-upload-pack '/private/var/folders/5s/g225f6nd6jl4g8tshbh1ltk40000gn/T/datalad_temp_vlst5whp/376/1c829-d43c-420a-95fb-4467944477c4'
010d74b31e4b1f6a81373783c6520507909436ca0f3b HEADmulti_ack thin-pack side-band side-band-64k ofs-delta shallow deepen-since deepen-not deepen-relative no-progress include-tag multi_ack_detailed symref=HEAD:refs/heads/dl-test-branch object-format=sha1 agent=git/2.35.1
004774b31e4b1f6a81373783c6520507909436ca0f3b refs/heads/dl-test-branch
0000

hanging at this point

And, of course, there's no problem running this right afterwards:

appveyor$ datalad clone "ria+ssh://datalad-test/private/var/folders/5s/g225f6nd6jl4g8tshbh1ltk40000gn/T/datalad_temp_vlst5whp#3761c829-d43c-420a-95fb-4467944477c4" test5                                                                 
Clone attempt:   0%|                                                                                                                                                                                                              | 0.00/1.00 [00:00<?, ? Candidate locations/s]@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:b7Q9hN2pEJGEvu/BlO2GUD/EV+H/xlmDqx7oCUosGbg.
Please contact your system administrator.
Add correct host key in /Users/appveyor/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /Users/appveyor/.ssh/known_hosts:154
Password authentication is disabled to avoid man-in-the-middle attacks.
Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
install(ok): /Users/appveyor/projects/test5 (dataset) 
@bpoldrack
Copy link
Member Author

bpoldrack commented Mar 30, 2022

Same test, different failure, this time during get (again only one mac build):

======================================================================
ERROR: datalad.core.distributed.tests.test_clone.test_ria_postclonecfg('ssh://datalad-test:/Users/appveyor/DLTMP/datalad_temp_m6dmumeo', '3ff6adaf-3906-4de5-8bfb-9bca892dd5db')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/appveyor/venv3.8.12/lib/python3.8/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/Users/appveyor/projects/datalad/datalad/tests/utils.py", line 288, in _wrap_skip_ssh
    return func(*args, **kwargs)
  File "/Users/appveyor/projects/datalad/datalad/tests/utils.py", line 874, in _wrap_with_tempfile
    return t(*(arg + (filename,)), **kw)
  File "/Users/appveyor/projects/datalad/datalad/tests/utils.py", line 874, in _wrap_with_tempfile
    return t(*(arg + (filename,)), **kw)
  File "/Users/appveyor/projects/datalad/datalad/core/distributed/tests/test_clone.py", line 1057, in _test_ria_postclonecfg
    res = subds.get('test.txt')
  File "/Users/appveyor/projects/datalad/datalad/distribution/dataset.py", line 502, in apply_func
    return f(*args, **kwargs)
  File "/Users/appveyor/projects/datalad/datalad/interface/utils.py", line 447, in eval_func
    return return_func(*args, **kwargs)
  File "/Users/appveyor/projects/datalad/datalad/interface/utils.py", line 439, in return_func
    results = list(results)
  File "/Users/appveyor/projects/datalad/datalad/interface/utils.py", line 424, in generator_func
    raise IncompleteResultsError(
datalad.support.exceptions.IncompleteResultsError: Command did not complete successfully. 1 failed:
[{'action': 'get',
  'annexkey': 'MD5E-s4--03d59e663c1af9ac33a9949d1193505a.txt',
  'message': 'not available',
  'path': '/Users/appveyor/DLTMP/datalad_temp__test_ria_postclonecfgaavli6ff/sub/test.txt',
  'refds': '/Users/appveyor/DLTMP/datalad_temp__test_ria_postclonecfgaavli6ff/sub',
  'status': 'error',
  'type': 'file'}]

Something seems off with the SSH setup.

@yarikoptic
Copy link
Member

probably unrelated, since also didn't stop it, but why WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!? so is there some process which re-installs/reconfigures ssh in the background? (could potentially explain some odd behaviors)

@bpoldrack
Copy link
Member Author

I only flew over it, since that host is a generated docker - so I thought we can't know it. But worth having a closer look (especially what exactly is in that known_hosts).

@mih
Copy link
Member

mih commented Mar 31, 2022

probably unrelated, since also didn't stop it, but why WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!? so is there some process which re-installs/reconfigures ssh in the background? (could potentially explain some odd behaviors)

If the GIN test would be running #6605 one could see a similar thing:

  File "/home/appveyor/dlvenv/lib/python3.8/site-packages/datalad/distributed/tests/test_create_sibling_gin.py", line 27, in check_push
687    ds.push(to='gin', result_renderer='disabled'),
688  File "/home/appveyor/dlvenv/lib/python3.8/site-packages/datalad/distribution/dataset.py", line 502, in apply_func
689    return f(*args, **kwargs)
690  File "/home/appveyor/dlvenv/lib/python3.8/site-packages/datalad/interface/utils.py", line 447, in eval_func
691    return return_func(*args, **kwargs)
692  File "/home/appveyor/dlvenv/lib/python3.8/site-packages/datalad/interface/utils.py", line 439, in return_func
693    results = list(results)
694  File "/home/appveyor/dlvenv/lib/python3.8/site-packages/datalad/interface/utils.py", line 357, in generator_func
695    for r in _process_results(
696  File "/home/appveyor/dlvenv/lib/python3.8/site-packages/datalad/interface/utils.py", line 544, in _process_results
697    for res in results:
698  File "/home/appveyor/dlvenv/lib/python3.8/site-packages/datalad/core/distributed/push.py", line 260, in __call__
699    yield from _push(
700  File "/home/appveyor/dlvenv/lib/python3.8/site-packages/datalad/core/distributed/push.py", line 654, in _push
701    repo.call_git(fetch_cmd)
702  File "/home/appveyor/dlvenv/lib/python3.8/site-packages/datalad/dataset/gitrepo.py", line 432, in call_git
703    return "\n".join(
704  File "/home/appveyor/dlvenv/lib/python3.8/site-packages/datalad/dataset/gitrepo.py", line 478, in call_git_items_
705    for file_no, line in self._generator_call_git(
706  File "/home/appveyor/dlvenv/lib/python3.8/site-packages/datalad/dataset/gitrepo.py", line 345, in _generator_call_git
707    for file_no, content in generator:
708  File "/usr/lib/python3.8/_collections_abc.py", line 317, in __next__
709    return self.send(None)
710  File "/home/appveyor/dlvenv/lib/python3.8/site-packages/datalad/runner/nonasyncrunner.py", line 98, in send
711    self._check_result()
712  File "/home/appveyor/dlvenv/lib/python3.8/site-packages/datalad/runner/nonasyncrunner.py", line 80, in _check_result
713    self.runner._check_result()
714  File "/home/appveyor/dlvenv/lib/python3.8/site-packages/datalad/runner/nonasyncrunner.py", line 256, in _check_result
715    raise CommandError(cmd=self.cmd,
716datalad.runner.exception.CommandError: CommandError: 'git -c diff.ignoreSubmodules=none -c url.git@gin.g-node.org:/dataladtester/dltst_gincvaziv_6.git.insteadof=https://gin.g-node.org/dataladtester/dltst_gincvaziv_6 fetch gin git-annex' failed with exitcode 128 [err: '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
717@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
718@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
719IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
720Someone could be eavesdropping on you right now (man-in-the-middle attack)!
721It is also possible that a host key has just been changed.
722The fingerprint for the ECDSA key sent by the remote host is
723SHA256:E35RRG3bhoAm/WD+0dqKpFnxJ9+yi0uUiFLi+H/lkdU.
724Please contact your system administrator.
725Add correct host key in /home/appveyor/.ssh/known_hosts to get rid of this message.
726Offending ECDSA key in /home/appveyor/.ssh/known_hosts:154
727  remove with:
728  ssh-keygen -f "/home/appveyor/.ssh/known_hosts" -R "gin.g-node.org"
729ECDSA host key for gin.g-node.org has changed and you have requested strict checking.
730Host key verification failed.

@bpoldrack
Copy link
Member Author

Hm. I'm confused. Just leaving a record here, while try to find what's wrong since the sessions are limited to an hour.

Logging into a new AppVeyor build, first thing I notice:

appveyor$ ls -la /Users/appveyor/.ssh
total 128
drwx------   7 appveyor  staff    224 Mar 31 10:20 .
drwxr-xr-x+ 74 appveyor  staff   2368 Mar 31 10:47 ..
-rw-------   1 appveyor  staff   1180 Mar 31 10:20 authorized_keys
-rw-r--r--   1 appveyor  staff    203 Mar 31 10:20 config
-rw-------   1 appveyor  staff   2622 Mar 31 10:20 id_rsa
-rw-r--r--   1 appveyor  staff    585 Mar 31 10:20 id_rsa.pub
-rw-r--r--   1 appveyor  staff  47959 Feb 25 21:48 known_hosts

Why exactly is known_hosts that old? Is that even persistent for an appveyor user? (Will check)

Second:

appveyor$ git clone git@gin.g-node.org:/dataladtester/dltst_gincvaziv_6.git
Cloning into 'dltst_gincvaziv_6'...
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:E35RRG3bhoAm/WD+0dqKpFnxJ9+yi0uUiFLi+H/lkdU.
Please contact your system administrator.
Add correct host key in /Users/appveyor/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /Users/appveyor/.ssh/known_hosts:154
ECDSA host key for gin.g-node.org has changed and you have requested strict checking.
Host key verification failed.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
appveyor$ echo $?
128

This is not in a venv and not called by datalad, GIT_SSH_* is not set, so it shouldn't go through datalad sshrun or involve any datalad code. (Which seems good - indicating it's just the setup, not datalad's handling of it)

Next thing confusing:

appveyor$ ssh localhost
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:b7Q9hN2pEJGEvu/BlO2GUD/EV+H/xlmDqx7oCUosGbg.
Please contact your system administrator.
Add correct host key in /Users/appveyor/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /Users/appveyor/.ssh/known_hosts:154
Password authentication is disabled to avoid man-in-the-middle attacks.
Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
Last login: Thu Mar 31 10:51:10 2022 from 127.0.0.1
Project:       datalad
Build Version: 1.0.7798
URL:           https://ci.appveyor.com/project/mih/datalad/build/job/572yny22wi5q5uel

The default interactive shell is now zsh.
To update your account to use zsh, please run `chsh -s /bin/zsh`.
For more details, please visit https://support.apple.com/kb/HT208050.

Note, that this is complaining about the same entry (154) that allegedly is offending for localhost as well as gin.g-node.org.

Now:

appveyor$ ssh-keygen -f "/Users/appveyor/.ssh/known_hosts" -R "gin.g-node.org"
# Host gin.g-node.org found: line 152
# Host gin.g-node.org found: line 153
# Host gin.g-node.org found: line 154
/Users/appveyor/.ssh/known_hosts updated.
Original contents retained as /Users/appveyor/.ssh/known_hosts.old
appveyor$ diff /Users/appveyor/.ssh/known_hosts /Users/appveyor/.ssh/known_hosts.old 
151a152,154
> gitlab.com,* ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAfuCHKVTjquxvt6CM6tdG4SLp1Btn/nOeHHE5UOzRdf
> gitlab.com,* ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCsj2bNKTBSpIYDEGk9KxsGh3mySTRgMtXL583qmBpzeQ+jqCMRgBqB98u3z++J1sKlXHWfM9dyhSevkMwSbhoR8XIq/U0tCNyokEi/ueaBMCvbcTHhO7FcwzY92WK4Yt0aGROY5qX2UKSeOvuP4D6TPqKF1onrSzH9bx9XUf2lEdWT/ia1NEKjunUqu1xOB/StKDHMoX4/OKyIzuS0q/T1zOATthvasJFoPrAjkohTyaDUz2LN5JoH839hViyEG82yB+MjcFV5MU3N1l1QL3cVUCh93xSaua1N85qivl+siMkPGbO5xR/En4iEY6K2XPASUEMaieWVNTRCtJ4S8H+9
> gitlab.com,* ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBFSMqzJeV9rUzU4kWitGjeR4PWSa29SPqJ1fVkhtj3Hw9xjLVXVYrU9QlYWrOLXBpQ6KWjbjTDTdDkoohFzgbEY=

This seems all about gitlab, how does it relate to gin or localhost?

Generally known_hosts has tons of github, gitlab, bitbucket addresses - looks like boilerplate.

@bpoldrack
Copy link
Member Author

bpoldrack commented Apr 1, 2022

Ok. So, it seems that these known_host entries are faulty in that they have a * as the matching IP entry instead of something like 1.2.3.*.
Given the file's mtime, it's not us touching it somehow and generating these entries. Appears to be AppVeyor who is at fault here. Reported: appveyor/ci#3792

However, that's unlikely to address the original issue.

@bpoldrack
Copy link
Member Author

Just saw another manifestation of the problem. Again, same test same build failing at a different spot: https://ci.appveyor.com/project/mih/datalad/builds/43196583/job/qdbf2j1ss2vme01h#L4493

@yarikoptic
Copy link
Member

I will mark it as a @known_failure_osx within #6273 where it keeps manifesting quite reliably.

@yarikoptic
Copy link
Member

Fresh fail (that #6273 was for master -- not yet released, was not CPed into maint I guess) within maint for #6692

@yarikoptic
Copy link
Member

added skip in 2f1c01f AKA 0.16.5-14-g2f1c01f69 .

bpoldrack added a commit to bpoldrack/datalad that referenced this issue Oct 7, 2022
bpoldrack added a commit to bpoldrack/datalad that referenced this issue Oct 7, 2022
bpoldrack added a commit to bpoldrack/datalad that referenced this issue Oct 7, 2022
bpoldrack added a commit to bpoldrack/datalad that referenced this issue Oct 7, 2022
`datalad sshrun` explicitly calls SSH with `log_output=False` which
results in the use of `NoCapture` protocol with the runner. Meaning,
stdout/stderr of SSH is written out anyway already. When SSH returns,
`sshrun` tried to write both to its stdout/stderr. But: It could not
possibly have anything to write. That would not be an issue in and of
itself, but `sshrun` is not necessarily used directly. In particular it
is called by `git` (due to `GIT_SSH_COMMAND=datalad sshrun`). This
resulted in a problem when apparently `git` has closed the pipe to its
ssh executable (`sshrun`) already and we tried to write to it (although
we really didn't even have something to write).

This ultimately led to issue datalad#6599, where the actual `ssh ...
git-upload-pack` execution succeeded and returned 0, but `datalad
sshrun` itself produced a broken pipe error trying to write to stdout
and hence returning non-zero.

It's not entirely clear when exactly this happens. It may be depend on
git version when the pipe is closed as the failing builds are running
2.35.1 (MacOS on appveyor) whereas otherbuilds have either newer or
older versions of git. In any case: There can't be anything to write out
to begin with, so don't even try.

(Closes datalad#6599)
bpoldrack added a commit to bpoldrack/datalad that referenced this issue Oct 7, 2022
`datalad sshrun` explicitly calls SSH with `log_output=False` which
results in the use of `NoCapture` protocol with the runner. Meaning,
stdout/stderr of SSH is written out anyway already. When SSH returns,
`sshrun` tried to write both to its stdout/stderr. But: It could not
possibly have anything to write. That would not be an issue in and of
itself, but `sshrun` is not necessarily used directly. In particular it
is called by `git` (due to `GIT_SSH_COMMAND=datalad sshrun`). This
resulted in a problem when apparently `git` has closed the pipe to its
ssh executable (`sshrun`) already and we tried to write to it (although
we really didn't even have something to write).

This ultimately led to issue datalad#6599, where the actual `ssh ...
git-upload-pack` execution succeeded and returned 0, but `datalad
sshrun` itself produced a broken pipe error trying to write to stdout
and hence returning non-zero.

It's not entirely clear when exactly this happens. It may be depend on
git version when the pipe is closed as the failing builds are running
2.35.1 (MacOS on appveyor) whereas otherbuilds have either newer or
older versions of git. In any case: There can't be anything to write out
to begin with, so don't even try.

(Closes datalad#6599)

(cherry picked from commit 3112fb5)
bpoldrack added a commit to bpoldrack/datalad that referenced this issue Oct 7, 2022
`datalad sshrun` explicitly calls SSH with `log_output=False` which
results in the use of `NoCapture` protocol with the runner. Meaning,
stdout/stderr of SSH is written out anyway already. When SSH returns,
`sshrun` tried to write both to its stdout/stderr. But: It could not
possibly have anything to write. That would not be an issue in and of
itself, but `sshrun` is not necessarily used directly. In particular it
is called by `git` (due to `GIT_SSH_COMMAND=datalad sshrun`). This
resulted in a problem when apparently `git` has closed the pipe to its
ssh executable (`sshrun`) already and we tried to write to it (although
we really didn't even have something to write).

This ultimately led to issue datalad#6599, where the actual `ssh ...
git-upload-pack` execution succeeded and returned 0, but `datalad
sshrun` itself produced a broken pipe error trying to write to stdout
and hence returning non-zero.

It's not entirely clear when exactly this happens. It may be depend on
git version when the pipe is closed as the failing builds are running
2.35.1 (MacOS on appveyor) whereas otherbuilds have either newer or
older versions of git. In any case: There can't be anything to write out
to begin with, so don't even try.

(Closes datalad#6599)

(cherry picked from commit 3112fb5)
bpoldrack added a commit to bpoldrack/datalad that referenced this issue Oct 8, 2022
`datalad sshrun` explicitly calls SSH with `log_output=False` which
results in the use of `NoCapture` protocol with the runner. Meaning,
stdout/stderr of SSH is written out anyway already. When SSH returns,
`sshrun` tried to write both to its stdout/stderr. But: It could not
possibly have anything to write. That would not be an issue in and of
itself, but `sshrun` is not necessarily used directly. In particular it
is called by `git` (due to `GIT_SSH_COMMAND=datalad sshrun`). This
resulted in a problem when apparently `git` has closed the pipe to its
ssh executable (`sshrun`) already and we tried to write to it (although
we really didn't even have something to write).

This ultimately led to issue datalad#6599, where the actual `ssh ...
git-upload-pack` execution succeeded and returned 0, but `datalad
sshrun` itself produced a broken pipe error trying to write to stdout
and hence returning non-zero.

It's not entirely clear when exactly this happens. It may be depend on
git version when the pipe is closed as the failing builds are running
2.35.1 (MacOS on appveyor) whereas otherbuilds have either newer or
older versions of git. In any case: There can't be anything to write out
to begin with, so don't even try.

Also: Make it clear in the code, that and why we don't expect any
captured output from the SSH subprocess by not storing the empty return
value, so future changes (and debuggers) don't falsely assume that
1. Output can simply be captured (with existing protocols) or
2. The returned value would currently be of any use simply b/c it's
   there.

(Closes datalad#6599)
(Closes datalad#7078)
bpoldrack added a commit to bpoldrack/datalad that referenced this issue Oct 11, 2022
`datalad sshrun` explicitly calls SSH with `log_output=False` which
results in the use of `NoCapture` protocol with the runner. Meaning,
stdout/stderr of SSH is written out anyway already. When SSH returns,
`sshrun` tried to write both to its stdout/stderr. But: It could not
possibly have anything to write. That would not be an issue in and of
itself, but `sshrun` is not necessarily used directly. In particular it
is called by `git` (due to `GIT_SSH_COMMAND=datalad sshrun`). This
resulted in a problem when apparently `git` has closed the pipe to its
ssh executable (`sshrun`) already and we tried to write to it (although
we really didn't even have something to write).

This ultimately led to issue datalad#6599, where the actual `ssh ...
git-upload-pack` execution succeeded and returned 0, but `datalad
sshrun` itself produced a broken pipe error trying to write to stdout
and hence returning non-zero.

It's not entirely clear when exactly this happens. It may be depend on
git version when the pipe is closed as the failing builds are running
2.35.1 (MacOS on appveyor) whereas otherbuilds have either newer or
older versions of git. In any case: There can't be anything to write out
to begin with, so don't even try.

Also: Make it clear in the code, that and why we don't expect any
captured output from the SSH subprocess by not storing the empty return
value, so future changes (and debuggers) don't falsely assume that
1. Output can simply be captured (with existing protocols) or
2. The returned value would currently be of any use simply b/c it's
   there.

(Closes datalad#6599)
(Closes datalad#7078)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spurious-test-failure Tests that fail unreliably
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants