Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NF: Make multiplexed SSH connections an optional feature #5042

Merged
merged 16 commits into from Oct 16, 2020

Conversation

mih
Copy link
Member

@mih mih commented Oct 15, 2020

A new switch datalad.ssh.multiplex-connections is introduced (off by default on windows).

A leaner manager and a different SSHConnection object are used in the non-multiplexed mode.

Manager and Connection classes have been RF'ed substantially.

Also streamlined the handling of SSH args. Instead of keeping individual ones, let the constructor(s) lump them into categories "for opening" and "for any call". This seems to be sufficient.

ATM this is targeting maint to have a chance to get SSH cloning working on windows prior 0.14.0 (e.g. #5031). but the necessary changes are substantial. So this will have to be more than a "fix".

TODO

  • Verify that solution is working on a regular win10 box, not just tests.
  • Consider added a travis run that performs SSH tests (only) without connection multiplexing to disentangle windows issues from SSH issues.
    Related separate out SSH test builds #4166

Fixes #2575
Fixes #5031

@mih mih marked this pull request as draft October 15, 2020 07:42
Make room for a representation of a non-multiplexed SSH connection.
Anticipates a leaner SSHManager too, which does not need to keep track
of single-process SSH connections.

Also streamline the handling of SSH args. Instead of keeping individual
ones, let the constructor(s) lump them into categories "for opening" and
"for any call". This seems to be sufficient.

Ping datalad#2575
Off on windows, due to current platform limitations, and on anywhere
else, by default.
RF MultiplexSSHConnection some more to make this new class absolutely
minimal.
@mih mih changed the title RF: MultiplexSSHConnection(BaseSSHConnection) NF: Make multiplexed SSH connections and optional feature Oct 15, 2020
Copy link
Member

@bpoldrack bpoldrack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Didn't try yet, but from looking through the code - LGTM!

@bpoldrack
Copy link
Member

I guess currently missing things like not actually changing PATH for remote execution but directly calling git/annex and replacing Runner better be part of a follow up to keep changes conceivable.

@mih mih marked this pull request as ready for review October 15, 2020 11:50
@codecov
Copy link

codecov bot commented Oct 15, 2020

Codecov Report

Merging #5042 into maint will decrease coverage by 0.08%.
The diff coverage is 78.37%.

Impacted file tree graph

@@            Coverage Diff             @@
##            maint    #5042      +/-   ##
==========================================
- Coverage   89.97%   89.88%   -0.09%     
==========================================
  Files         292      292              
  Lines       40742    40799      +57     
==========================================
+ Hits        36656    36674      +18     
- Misses       4086     4125      +39     
Impacted Files Coverage Δ
datalad/support/annexrepo.py 86.78% <50.00%> (-0.06%) ⬇️
datalad/support/sshconnector.py 82.35% <78.16%> (-3.50%) ⬇️
datalad/__init__.py 90.78% <100.00%> (ø)
datalad/interface/common_cfg.py 100.00% <100.00%> (ø)
datalad/support/gitrepo.py 90.67% <100.00%> (ø)
datalad/downloaders/http.py 81.85% <0.00%> (-3.09%) ⬇️
datalad/downloaders/tests/test_http.py 88.22% <0.00%> (-2.89%) ⬇️
datalad/downloaders/base.py 80.70% <0.00%> (-0.36%) ⬇️
datalad/cmd.py 93.71% <0.00%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c5e9edb...59df029. Read the comment docs.

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would have been valuable to get at least some basic smoke test explicitly for each (where supported) of the managers... and/(or at least) I would have also made one of the SSH runs of travis to set multiplex-connections to 0 so we have a run through with that manager even on linux.

Left various comments/suggestions, didn't try to run yet

'ui': ('question', {
'title': "Whether to use a single shared connection for multiple SSH processes aiming at the same target."}),
'destination': 'global',
'default': False if on_windows else True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'default': False if on_windows else True,
'default': not on_windows,

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to tune.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, will do now and will add travis run for non-multiplexed.

@@ -268,6 +269,13 @@
'destination': 'global',
'default': None,
},
'datalad.ssh.multiplex-connections': {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be could also be useful for #4705 ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And others, that is the reason why this is not implemented as a windows-specific fix.

return False

def open(self):
return False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no docstring here, only in the "original" implementation where it says

        bool
          True when SSH reports success opening the connection, False when
          a ControlMaster for an open connection already exists.

so -- does it mean that we the ControlMaster always exists there? should docs be tuned up or should we introduce returning None if there is no ControlMaster?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, but it works this way.

"""Interface for an SSHManager
"""
def ensure_initialized(self):
"""Assures that manager is initialized"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Assures that manager is initialized"""
"""Ensures that manager is initialized"""

;)

-------
BaseSSHConnection
"""
raise NotImplementedError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we start using abc and @abstractmethod more? not that I foresee other subclasses to come but IMHO getting consistent use of abc for interface/base classes would be of advantage

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familiar with them. But maybe not in the context of an urgent fix

# by itself
self.open()
@property
def _runner(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unlikely anyone relies on it but especially since PR is against maint -- what is the reason for making runner protected ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It made no sense to expose it when building the base class. I you feel like this should be split, I am OK with that.

"""Assures that manager is initialized - knows socket_dir, previous connections
"""
if self._socket_dir is not None:
return
from ..config import ConfigManager
cfg = ConfigManager()
from datalad import cfg
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a note to myself: didn't check the logs etc, but may be that original code was on purpose to avoid circular imports? or probably be indeed no longer the case since ensure_initialized is commented out in the constructor... (code coverage annotation is silent for me for this PR ATM for some reason :-( )

@@ -604,6 +796,11 @@ def close(self, allow_fail=True, ctrl_path=None):
self._connections = dict()


# retain backward compat with 0.13.4 and earlier
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not to move that check of cfg from __init__ here and bind those two accordingly on the cfg setting right here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know.

Copy link
Member Author

@mih mih Oct 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be stripped also. I don't know of any consumer.

# we presently do not support any interactive authentication
# at the time of process execution
'-o', 'PasswordAuthentication=no',
'-o', 'KbdInteractiveAuthentication=no',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, I know that it is "inefficient", "annoying", and might be just infeasible in proper real-life cases etc but why to disable completely ? Especially for maint... I wonder if may be worth adding a dedicated config option, e.g. datalad.ssh.allow-password. We could even set it to False right away since it would be hit only on Windows (by default)... That still would allow where/when really needed for password based authentication, even though demanding serial execution, multiple entries, etc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have discussed that in the call this week. It is not possible for most calls, because stdout/err is consumed by some kind of handler. I am sure somehow this could be made possible. But I don't know how, and I have no plans to implement it, in this context of this urgent fix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.


def close(self):
# we perform blocking execution, we should not return from __call__ until
# the connection is already closed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused by this comment in this context of pass content

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no open connections (unless a remote process is still running), hence no need to close any, ever. Feel free to remove, if it confuses.

@mih
Copy link
Member Author

mih commented Oct 15, 2020

Thanks for the review @yarikoptic Feel free (anyone) to make any changes that you see fit, and still keep it working ;-)

I will not be able to get back at this any time soon. Sorry.

@mih
Copy link
Member Author

mih commented Oct 15, 2020

Re travis run comment: That was suggested in the top-comment, and there is a link to another issue suggesting a separate run is a good idea. But again, I think getting this out ASAP has higher prior than most of these aspects.

@yarikoptic
Copy link
Member

On "urgency": FWIW trying it out in Win7 VM with anaconda and python 3.8.5. With or without this PR, if I do have an ssh key setup, I can `datalad clone`, just getting those mux error messages printed out
(datalad-dev) c:\tmp>datalad  clone smaug.datalad.org:/tmp/testds2
Clone attempt:   0%|          | 0.00/1.00 [00:00<?, ? Candidate locations/s]mm_receive_fd: no message header
mux_master_process_new_session: failed to receive fd 0 from slave
mm_receive_fd: no message header
mux_master_process_new_session: failed to receive fd 0 from slave
[INFO] Detected a filesystem without fifo support.
[INFO] Disabling ssh connection caching.
[INFO] Detected a crippled filesystem.
[INFO] Scanning for unlocked files (this may take some time)
[INFO] Entering an adjusted branch where files are unlocked as this filesystem does not support locked files.
[INFO] Switched to branch 'adjusted/master(unlocked)'
mm_receive_fd: no message header
mux_master_process_new_session: failed to receive fd 0 from slave
install(ok): c:\tmp\testds2 (dataset)

To minimize diff, and retain public API within maint
For consistency in naming with SSHManager and the fact that even with Multiplex
someone might have just a single process, so imho NoMultiplex is more pertinent
So if we introduce some logging etc, it is just inherited
@mih
Copy link
Member Author

mih commented Oct 15, 2020

Ok, no urgency then. Win7 VMs and Python 3.8.5 for all ;-) And sunglasses.

@mih mih changed the title NF: Make multiplexed SSH connections and optional feature NF: Make multiplexed SSH connections an optional feature Oct 15, 2020
@yarikoptic
Copy link
Member

I took the invitation and pushed some changes, including to make tests work whenever a non-multiplex ssh is used.

test_ssh_custom_identity_file relies on captured log messages to see if
-i is passed to the actual ssh invocation.  With multiplexed ssh it would
have obtained that log message from "open" which does nothing for non-multiplexed.
Instead of adding a dummy log message to open of non-multiplexed, I decided that
it might be useful to just log actual ssh command ran for any of them, and what
connection had ran it.  So I added another log message.

Alternative fix could ha been: Used deeper to actually invoke ssh Runner
logs only at level 5.  Making swallow_logs swallow at level 5 works too, but
I like log messages, so went for solution above

Also, while at the code I spotted that lgr.debug for open was interpolating
the log string, so fixed that as well
@mih
Copy link
Member Author

mih commented Oct 16, 2020

Thx @yarikoptic !

@mih mih merged commit c728888 into datalad:maint Oct 16, 2020
0 of 2 checks passed
@mih mih deleted the enh-sshnomultiplex branch October 16, 2020 06:06
@yarikoptic
Copy link
Member

FTR -- a few tests were still failing but seems unrelated

appveyor - test_create.test_saving_prior
======================================================================
ERROR: datalad.core.local.tests.test_create.test_saving_prior
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\nose\case.py", line 198, in runTest
    self.test(*self.arg)
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\tests\utils.py", line 573, in _wrap_with_tree
    return t(*(arg + (d,)), **kw)
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\core\local\tests\test_create.py", line 372, in test_saving_prior
    ds1 = create(topdir, force=True)
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\interface\utils.py", line 494, in eval_func
    return return_func(generator_func)(*args, **kwargs)
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\interface\utils.py", line 482, in return_func
    results = list(results)
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\interface\utils.py", line 413, in generator_func
    allkwargs):
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\interface\utils.py", line 552, in _process_results
    for res in results:
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\core\local\create.py", line 406, in __call__
    fake_dates=fake_dates
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\support\repo.py", line 151, in __call__
    instance = type.__call__(cls, *new_args, **new_kwargs)
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\support\annexrepo.py", line 276, in __init__
    self._init(version=version, description=description)
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\support\annexrepo.py", line 1281, in _init
    annex_options=opts)
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\support\annexrepo.py", line 1104, in _run_annex_command
    **kwargs)
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\cmd.py", line 144, in run_gitcommand_on_file_list_chunks
    results.append(func(cmd, *args, **kwargs))
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\cmd.py", line 509, in run
    **results,
datalad.support.exceptions.CommandError: CommandError: '"git" "annex" "init" "-c" "annex.dotfiles=true" "--version" "6"' failed with exitcode 1 under C:\Users\appveyor\AppData\Local\Temp\1\datalad_temp_tree_jqqixz23
-------------------- >> begin captured logging << --------------------
asyncio: DEBUG: Using proactor: IocpProactor
asyncio: DEBUG: taking long time to close proactor
--------------------- >> end captured logging << ---------------------
travis - nfs run test_archives.test_decompress_file('strip',)
======================================================================
ERROR: datalad.tests.test_archives.test_decompress_file('strip',)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/datalad/tests/utils.py", line 573, in _wrap_with_tree
    return t(*(arg + (d,)), **kw)
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/datalad/tests/test_archives.py", line 79, in check_decompress_file
    leading_directories=leading_directories)
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/datalad/support/archives.py", line 99, in decompress_file
    rmdir(widow_dir)
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/datalad/utils.py", line 468, in rmdir
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/tmp/nfsmount/datalad-rbhc9bsf/simple-extracted/ |;&%b5{}\'"ΔЙקم๗あ .dbtc '

kyleam added a commit to kyleam/datalad that referenced this pull request Oct 16, 2020
The non-trivial conflicts are between master's a575224 (RF: SSH
connector code uses WitlessRunner instead of Runner, 2020-07-16) and
maint's c728888 (Merge pull request datalad#5042 from
mih/enh-sshnomultiplex, 2020-10-16).  Resolve these by taking the
changes from c728888 and then reintegrating the
Runner->WitlessRunner changes.
@yarikoptic yarikoptic added this to the 0.13.5 milestone Dec 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants