Flaky SSH cloning failure in tests #6599

@bpoldrack

Extracted from #6550 (comment):

This only happens on AppVeyor (macOS again):

======================================================================
ERROR: datalad.core.distributed.tests.test_clone.test_ria_postclonecfg('ssh://datalad-test:/Users/appveyor/DLTMP/datalad_temp_ix8umpb9', '07c27167-6fef-443c-bbb7-3eec35daddc3')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/appveyor/venv3.8.12/lib/python3.8/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/Users/appveyor/projects/datalad/datalad/tests/utils.py", line 288, in _wrap_skip_ssh
    return func(*args, **kwargs)
  File "/Users/appveyor/projects/datalad/datalad/tests/utils.py", line 874, in _wrap_with_tempfile
    return t(*(arg + (filename,)), **kw)
  File "/Users/appveyor/projects/datalad/datalad/tests/utils.py", line 874, in _wrap_with_tempfile
    return t(*(arg + (filename,)), **kw)
  File "/Users/appveyor/projects/datalad/datalad/core/distributed/tests/test_clone.py", line 958, in _test_ria_postclonecfg
    riaclone = clone('ria+{}#{}'.format(url, dsid), clone_path)
  File "/Users/appveyor/projects/datalad/datalad/interface/utils.py", line 447, in eval_func
    return return_func(*args, **kwargs)
  File "/Users/appveyor/projects/datalad/datalad/interface/utils.py", line 439, in return_func
    results = list(results)
  File "/Users/appveyor/projects/datalad/datalad/interface/utils.py", line 424, in generator_func
    raise IncompleteResultsError(
datalad.support.exceptions.IncompleteResultsError: Command did not complete successfully. 1 failed:
[{'action': 'install',
  'message': ('Failed to clone from any candidate source URL. Encountered '
              'errors per each url were:\n'
              '- %s',
              'ssh://datalad-test/Users/appveyor/DLTMP/datalad_temp_ix8umpb9/07c/27167-6fef-443c-bbb7-3eec35daddc3\n'
              "  CommandError: 'git -c diff.ignoreSubmodules=none clone "
              '--progress '
              'ssh://datalad-test/Users/appveyor/DLTMP/datalad_temp_ix8umpb9/07c/27167-6fef-443c-bbb7-3eec35daddc3 '
              "/Users/appveyor/DLTMP/datalad_temp__test_ria_postclonecfgw5zk_49p' "
              "failed with exitcode 128 [err: 'Cloning into "
              "'/Users/appveyor/DLTMP/datalad_temp__test_ria_postclonecfgw5zk_49p'...\n"
              '\r'
              'remote: Total 37 (delta 7), reused 0 (delta 0)        \n'
              "fatal: remote transport reported error']"),
  'path': '/Users/appveyor/DLTMP/datalad_temp__test_ria_postclonecfgw5zk_49p',
  'source_url': 'ria+ssh://datalad-test:/Users/appveyor/DLTMP/datalad_temp_ix8umpb9#07c27167-6fef-443c-bbb7-3eec35daddc3',
  'status': 'error',
  'type': 'dataset'}]

This seems flaky. Logging into that AppVeyor build showed that the failure occurs at different points in this test.
Sometimes this clone works fine, but the subsequent get on a subdataset then fails the same way.
Currently the failure happens at line 958 in test_clone.py, while on the previous run (the exact same commit) it only failed at line 1017.
Moreover, this should not be the only test where we clone from RIA via SSH. It is not yet clear to me how this one is different.
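
Since the failure moves around between runs, one way to gauge how often it occurs would be to repeat the clone in a loop and count the non-zero exits. This is only a minimal sketch: count_failures is a hypothetical helper (not part of DataLad or its test suite), and the command to run is supplied by the caller.

```python
import os
import subprocess
import tempfile


def count_failures(make_cmd, attempts=20):
    """Run a command repeatedly, each time with a fresh target path.

    make_cmd is a callable that receives a not-yet-existing target path
    and returns the argv list to execute (hypothetical interface).
    Returns a list of (attempt, returncode, stderr_tail) for failed runs.
    """
    failures = []
    for attempt in range(attempts):
        # A new temporary directory per attempt, so an earlier clone
        # never blocks a later one.
        with tempfile.TemporaryDirectory() as tmp:
            cmd = make_cmd(os.path.join(tmp, "clone"))
            proc = subprocess.run(cmd, capture_output=True, text=True)
            if proc.returncode != 0:
                # Keep only the end of stderr, where git reports the error.
                failures.append(
                    (attempt, proc.returncode, proc.stderr.strip()[-300:])
                )
    return failures


# Example call (the URL is a placeholder, fill in the RIA store location):
# count_failures(lambda target: ["git", "clone", "--progress", "<ssh-url>", target])
```

Counting exit code 128 across, say, 20 attempts would show whether the plain git clone over SSH is already flaky on its own, or only when it runs inside the test.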

Looking into this, I am seeing a Broken Pipe Error:

[DEBUG] ...>runner:192  Finished ['ssh', '-o', 'ControlPath=/Users/appveyor/Library/Caches/datalad/sockets/fb3f4327', '-o', 'SendEnv=GIT_PROTOCOL', 'datalad-test', "git-upload-pack '/private/var/folders/5s/g225f6nd6jl4g8tshbh1ltk40000gn/T/datalad_temp_vlst5whp/376/1c829-d43c-420a-95fb-4467944477c4'"] with status 0 
[ERROR] ...>main:136,185  [Errno 32] Broken pipe (BrokenPipeError) 
fatal: remote transport reported error']

And git-upload-pack seems a bit off indeed:

appveyor$ git-upload-pack '/private/var/folders/5s/g225f6nd6jl4g8tshbh1ltk40000gn/T/datalad_temp_vlst5whp/376/1c829-d43c-420a-95fb-4467944477c4'
010d74b31e4b1f6a81373783c6520507909436ca0f3b HEADmulti_ack thin-pack side-band side-band-64k ofs-delta shallow deepen-since deepen-not deepen-relative no-progress include-tag multi_ack_detailed symref=HEAD:refs/heads/dl-test-branch object-format=sha1 agent=git/2.35.1
004774b31e4b1f6a81373783c6520507909436ca0f3b refs/heads/dl-test-branch
0000

It hangs at this point.
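
For reading the output above: git frames this stream as pkt-lines, where each line starts with a four-hex-digit length that includes the prefix itself, and 0000 is a flush packet that ends the ref advertisement. After the flush, git-upload-pack waits for the client's request on stdin, so a pause at that spot when the command is run by hand is not conclusive on its own. A minimal sketch of a decoder, where parse_pkt_lines is a hypothetical helper name (not part of git or DataLad):

```python
def parse_pkt_lines(data: bytes):
    """Yield the payload of each pkt-line in data; yield None for a flush packet.

    Simplified sketch: only the flush packet (0000) is treated specially;
    other lengths below 4 (used by protocol v2) are rejected.
    """
    pos = 0
    while pos < len(data):
        # The first four bytes are the total length in hex, prefix included.
        length = int(data[pos:pos + 4], 16)
        if length == 0:
            yield None  # flush packet
            pos += 4
            continue
        if length < 4:
            raise ValueError(f"unsupported pkt-line length {length}")
        yield data[pos + 4:pos + length]
        pos += length


# The second advertised line from the output above, followed by the flush:
sample = (
    b"004774b31e4b1f6a81373783c6520507909436ca0f3b refs/heads/dl-test-branch\n"
    b"0000"
)
packets = list(parse_pkt_lines(sample))
```

Here 0x47 is 71 bytes: 4 for the prefix, 40 for the object id, 1 for the space, 25 for the ref name, and 1 for the newline, so the advertisement itself looks well-formed.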

And, of course, there's no problem running this right afterwards:

appveyor$ datalad clone "ria+ssh://datalad-test/private/var/folders/5s/g225f6nd6jl4g8tshbh1ltk40000gn/T/datalad_temp_vlst5whp#3761c829-d43c-420a-95fb-4467944477c4" test5
Clone attempt:   0%|          | 0.00/1.00 [00:00<?, ? Candidate locations/s]
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:b7Q9hN2pEJGEvu/BlO2GUD/EV+H/xlmDqx7oCUosGbg.
Please contact your system administrator.
Add correct host key in /Users/appveyor/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /Users/appveyor/.ssh/known_hosts:154
Password authentication is disabled to avoid man-in-the-middle attacks.
Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
install(ok): /Users/appveyor/projects/test5 (dataset) 
