Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tell git-annex to go through datalad-sshrun #5389

Merged
merged 4 commits into from Mar 4, 2021

Conversation

kyleam
Copy link
Contributor

@kyleam kyleam commented Jan 28, 2021

This series addresses gh-5382. The first commit fixes a test bug that was revealed by the datalad-sshrun switch, the second does the actual switch, and the rest are non-essential tweaks on top.

@codecov
Copy link

codecov bot commented Jan 28, 2021

Codecov Report

Merging #5389 (bf6a773) into master (1b986aa) will decrease coverage by 3.42%.
The diff coverage is 94.73%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5389      +/-   ##
==========================================
- Coverage   89.59%   86.16%   -3.43%     
==========================================
  Files         296      296              
  Lines       41711    45308    +3597     
==========================================
+ Hits        37372    39041    +1669     
- Misses       4339     6267    +1928     
Impacted Files Coverage Δ
datalad/support/tests/test_annexrepo.py 94.74% <81.81%> (-2.25%) ⬇️
datalad/__init__.py 85.03% <100.00%> (+0.63%) ⬆️
datalad/cmd.py 92.33% <100.00%> (+0.02%) ⬆️
datalad/interface/common_cfg.py 100.00% <100.00%> (ø)
datalad/support/annexrepo.py 90.34% <100.00%> (-0.26%) ⬇️
datalad/support/gitrepo.py 91.10% <100.00%> (-0.85%) ⬇️
...talad/distributed/tests/test_create_sibling_ria.py 58.08% <0.00%> (-41.92%) ⬇️
datalad/downloaders/tests/test_docker_registry.py 61.76% <0.00%> (-38.24%) ⬇️
datalad/interface/tests/test_run_procedure.py 61.86% <0.00%> (-38.14%) ⬇️
datalad/core/local/tests/test_run.py 64.19% <0.00%> (-34.99%) ⬇️
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1b986aa...bf6a773. Read the comment docs.

@yarikoptic
Copy link
Member

Thank you @kyleam !! I am yet to better grasp the situation, but I wonder if we force annex to disregard annex-ssh-options (by always setting GIT_SSH_COMMAND) what side effects we cause? Eg, in reproman then there would be no way forward to make data getable from ec2 with pure datalad get (instead of orchestrator which points to the identity file ATM).

Copy link
Member

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me and the consolidating all this logic outside annexrepo.py is very attractive.

Thank you @kyleam !! I am yet to better grasp the situation, but I wonder if we force annex to disregard annex-ssh-options (by always setting GIT_SSH_COMMAND) what side effects we cause? Eg, in reproman then there would be no way forward to make data getable from ec2 with pure datalad get (instead of orchestrator which points to the identity file ATM).

AFAICS datalad sshrun supports the config switch datalad.ssh.identityfile (by using SSHManager.get_connection() which queries it). I am not sure what other configurability is needed for reproman, but at least your given example should not be an obstacle.

Thx @kyleam !

Edit: I now see that precisely the support for that switch was the original motivation for this PR #5382

@yarikoptic
Copy link
Member

AFAICS datalad sshrun supports the config switch datalad.ssh.identityfile (by using SSHManager.get_connection() which queries it). I am not sure what other configurability is needed for reproman, but at least your given example should not be an obstacle.

right - somehow I have missed this obvious solution! Thanks

There would remain the issue of rendering git config's annex-ssh-options possibly set by user "manually" (did we set it? did git-annex?) ineffective... but FWIW I do not think I personally have ever set/relied on it.

@kyleam
Copy link
Contributor Author

kyleam commented Jan 29, 2021

Thanks @yarikoptic @mih @bpoldrack for the reviews/feedback.

@yarikoptic says:

There would remain the issue of rendering git config's annex-ssh-options possibly set by user "manually" (did we set it? did git-annex?) ineffective... but FWIW I do not think I personally have ever set/relied on it

Yes, true, that's a breaking change here, and I didn't give it much consideration, being a bit too focused on "fix things to make git-annex use sshrun as intended". Even without the annex-ssh-options behavior change, this is aggressive enough that I think it should be against master, and there is a semantic conflict, so I'll change the base (though with 0.14 quickly approaching, it of course doesn't really matter).

As for how to deal with the change in behavior, I guess the options are:

  • Discard support for annex-ssh-options (what this series does at the moment), taking that as the tradeoff for the changes in this series (i.e. using sshrun, as originally intended, which comes along with the nice "consolidating all this logic outside annexrepo.py" that @mih noted).

    From my viewpoint, that's a net win, but I'm not at all confident we won't break someone's workflow. We can point to .ssh/config and datalad.ssh.identityfile as alternatives, but it's of course still breakage. (And datalad.ssh.identityfile isn't remote-specific.)

  • Don't make git-annex go through sshrun (i.e. drop most of this series). That means that datalad.ssh.identityfile option is only sometimes in effect, and any logic we have in sshconnector won't apply to git-annex (except the stuff we communicate through -c arguments).

  • Make whether we set GIT_ANNEX_USE_GIT_SSH configurable, retaining the _set_shared_connection stuff for now. We could go through a deprecation cycle ("announce and keep current behavior as default" -> "change default/deprecate old behavior" -> "remove", perhaps dropping the first step).

Thoughts? Other options to consider?

@kyleam kyleam changed the base branch from maint to master January 29, 2021 19:17
setup_package() overrides $HOME during the tests.  This invalidates
the default values for a good number of configuration options.  One
particularly problematic case is datalad.locations.sockets because
tests inspect SSHManager's socket_dir, which still points to the
non-test location (e.g., $HOME/.cache/datalad/sockets/), _but_ when
'datalad sshrun' is executed via git or git-annex, the test location
is in effect.

test_annex_ssh() is the test that should hit into the above scenario.
The reason it doesn't fail is because git-annex isn't actually going
through datalad-sshrun despite GIT_SSH_COMMAND being set, so
everything is operating with the non-test location.  That will change
with the next commit.

To fix this, store the defaults that are subject to change as
functions and add a helper that uses these to set "default" in
common_cfg.definitions.  This gives setup_package() and
teardown_package() an easy hook to use.
git-annex started honoring GIT_SSH and GIT_SSH_COMMAND in 6.20170321.
However, since 6.20170510, it won't do so unless the caller sets
GIT_ANNEX_USE_GIT_SSH to confirm that the custom SSH command supports
`-n`.  So, although get_git_environ_adjusted() sets GIT_SSH_COMMAND to
'datalad-ssrun' (which supports `-n`), git-annex doesn't actually use
it.

Update get_git_environ_adjusted() to set GIT_ANNEX_USE_GIT_SSH as
well.

Fixes datalad#5382.
With the last commit wiring up git-annex to go through datalad-sshrun,
_set_shared_connection() is no longer relevant.  git-annex ignores
.annex-ssh-options when GIT_SSH_COMMAND is set.  And nothing is gained
by priming things with ssh_manager.get_connection(url); it's whether
the connection is open that matters for it staying alive after a given
process ends.
There are already some spots that call
ssh_manager.get_connection(url).open() to enable SSH connection
caching (e.g., Git.Repo._fetch_push_helper() and GitRepo.pull()).  If
they didn't, each datalad-sshrun subprocess (invoked via git) would
close the connections because _it_ was the one that opened them.

Add a helper that handles the common case, and use it in a few more
obvious spots, including AnnexRepo.get() and AnnexRepo.copy_to().  (As
of a few commits back, git-annex now goes through datalad-sshrun.)
These are probably the lowest hanging fruit, but there are almost
certainly other spots that would benefit from opening up a connection,
and perhaps there's a better level to address this than individual
methods.
@yarikoptic
Copy link
Member

sad that we have not settled down on this PR before releasing. As for approaches: it is not feasible/difficult to not set GIT_ANNEX_USE_GIT_SSH if remote has annex-ssh-options set, right?

@bpoldrack
Copy link
Member

it is not feasible/difficult to not set GIT_ANNEX_USE_GIT_SSH if remote has annex-ssh-options set, right?

Would disable sshrun functionality and thereby render respective datalad configs inactive. Not obvious for users and probably harder to find out when and how to use it correctly than simply saying: datalad sets GIT_ANNEX_USE_GIT_SSH.

However, if we decide users should be able to separate datalad and annex wrt to SSH, we arguably would also need to check whether or not GIT_ANNEX_USE_GIT_SSH and GIT_SSH_COMMAND are already set and leave them untouched.
But this currently would result in effects that are not at all easy to see, I think. While such an approach technically doesn't break user configs it's probably still surprising behavior (like: I set the datalad config for identity file, but when I run a datalad get it doesn't use it?).
And incorporating annex-ssh-options and GIT_SSH_COMMAND to our SSH call(s) from within sshrun instead, leads down a rabbit hole we don't want to get into. Plus: There's no general solution since GIT_SSH_COMMAND can be anything. We don't know what this thing would do, what configs it considers etc.
Seems to me - one way or another we break sophisticated user setups for SSH w/ git+git-annex.

@kyleam
Copy link
Contributor Author

kyleam commented Feb 4, 2021

Not obvious for users and probably harder to find out when and how to use it correctly than simply saying: datalad sets GIT_ANNEX_USE_GIT_SSH.

That's pretty much where I'm landing. On top of that, I think from a developer perspective the datalad code around annex ssh handling is fairly hard to troubleshoot and reason about; adding a branch in the logic would make matters worse (whereas the current state of this PR drops much of the custom annexrepo logic).

I do think it is unfortunate that git annex get and datalad get could diverge in their handling, but I'm not sure it matters much in practice.

@yarikoptic
Copy link
Member

just another observation: if I prescribe ControlPath etc in ~/.ssh/config (e.g. to ease my life while interacting with systems requiring 2FA, where i am doomed to login interactively first) that setting would effectively be ignored by datalad which would provide its own ControlPath settings.

@kyleam
Copy link
Contributor Author

kyleam commented Feb 17, 2021

if I prescribe ControlPath etc in ~/.ssh/config (e.g. to ease my life while interacting with systems requiring 2FA, where i am doomed to login interactively first) that setting would effectively be ignored by datalad which would provide its own ControlPath settings.

Isn't that already the current state of things with DataLad [*]? Say I have this entry in my ~/.ssh/config:

Host datalad-test
HostName localhost
User dl
ControlPersist 600
ControlMaster auto
ControlPath /tmp/dl-test-socket-%r@%h:%p
Port 42241
datalad-get script
set -eu

cd "$(mktemp -d /tmp/dl-ssh-XXXXXXX)"
tdir="$(pwd)"

find ~/.cache/datalad/sockets -type s
ls /tmp/dl-test-socket-* || :

datalad create a
(
    cd a
    echo one >one
    datalad save
)

datalad clone datalad-test:"$tdir"/a b
(
    cd b
    datalad -l 8 get .
)

find ~/.cache/datalad/sockets -type s
ls /tmp/dl-test-socket-* || :

If I run that on the base of this PR (1b986aa), the socket specified in ssh_config isn't used.

ls: cannot access '/tmp/dl-test-socket-*': No such file or directory
[INFO] Creating a new annex repo at /tmp/dl-ssh-frswTdt/a 
[... 59 lines ...]
[DEBUG] Async run ['git', '-c', 'annex.merge-annex-branches=false', 'annex', 'find', '--not', '--in', '.', '--json', '--json-error-messages', '--debug', '-c', 'annex.dotfiles=true', '-c', 'remote.origin.annex-ssh-options=-o ControlMaster=auto -S /home/kyle/.cache/datalad/sockets/bddf1a95', '-c', 'annex.retry=3', '--', '.'] 
[DEBUG] Launching process ['git', '-c', 'annex.merge-annex-branches=false', 'annex', 'find', '--not', '--in', '.', '--json', '--json-error-messages', '--debug', '-c', 'annex.dotfiles=true', '-c', 'remote.origin.annex-ssh-options=-o ControlMaster=auto -S /home/kyle/.cache/datalad/sockets/bddf1a95', '-c', 'annex.retry=3', '--', '.'] 
[... 11 lines ...]
[DEBUG] from origin... [get(/tmp/dl-ssh-frswTdt/b/one)] 
get(ok): one (file) [from origin...]
[DEBUG] Closing 1 SSH connections... 
[DEBUG] Not closing MultiplexSSHConnection(ctrl_path=<<PosixPath('/ho++37 chars++95')>>, sshri=SSHRI(hostname='datalad-test')) since was not opened by itself 
/home/kyle/.cache/datalad/sockets/bddf1a95
ls: cannot access '/tmp/dl-test-socket-*': No such file or directory

[*] as well as with git-annex directly, unless annex.sshcaching=false is set to false

@yarikoptic
Copy link
Member

Isn't that already the current state of things with DataLad [*]?

yes... it was just an observation, and indeed it attests to the fact that datalad (and possibly git-annex, I do not remember if I have tried) already is not playing 100% nice with system/use-wide ssh configuration... so most likely this PR would not widen the already existing rift.

@bpoldrack
Copy link
Member

bpoldrack commented Feb 18, 2021

As for how to deal with the change in behavior, I guess the options are:

Discard support for annex-ssh-options (what this series does at the moment), taking that as the tradeoff for the changes in this series (i.e. using sshrun, as originally intended, which comes along with the nice "consolidating all this logic outside annexrepo.py" that @mih noted).

From my viewpoint, that's a net win, but I'm not at all confident we won't break someone's workflow. We can point to .ssh/config and datalad.ssh.identityfile as alternatives, but it's of course still breakage. (And datalad.ssh.identityfile isn't remote-specific.)

My vote is on this option and therefore just merge. I don't think it's possible to not break any possible setup around SSH+git+git-annex that is designed w/o datalad. Neither the create-sibling-* madness nor the special remotes can possibly invoke SSH via git/git-annex which means that we would need to reimplement everything git/git-annex does in that regard. That's just not maintainable even if we would get there for a particular combination of versions (which I doubt).
Ultimate point where we would prob. always fail to be aligned with git/git-annex is GIT_SSH_COMMAND pointing to something that may not even use openssh underneath leaving us with no idea how exactly to use it for calls that requiring more arguments than git's specification of the interface for this has foreseen (i.e. any actual use of a remote shell).

@yarikoptic
Copy link
Member

I am also thinking about having a config item which would instruct datalad to not bother doing anything about SSH, but ideally that one should be set'able per remote. This way users would be able to configure via .git/config or cmdline (-c remote.origin.datalad-sshrun=false) to disable it (under assumption that remote name is consistent across subdatasets, but we already have that assumption with all the sibling name specifications for recursive operation)

@bpoldrack
Copy link
Member

I am also thinking about having a config item which would instruct datalad to not bother doing anything about SSH, but ideally that one should be set'able per remote. This way users would be able to configure via .git/config or cmdline (-c remote.origin.datalad-sshrun=false) to disable it (under assumption that remote name is consistent across subdatasets, but we already have that assumption with all the sibling name specifications for recursive operation)

I see the appeal of such an approach, but more complicated than it sounds, I think. What would that mean? Not touching any of the env vars when calling git/git-annex? Not even sufficient. How do you know which remote git-annex-get is about to use? Duplicating annex' logic to find out beforehand?
And even if we get there, it's GitWitlessRunner that ultimately needs to do it (or NOT do it). And the runner doesn't know anything about what remote will be involved. So would need to be passed down from every toplevel command at least. Again probably insufficient, since then lower level functions wouldn't know when they are called via python API.
Maybe there's a way I don't see, but it looks like a huge thing that lots of places need to be aware of.

@mih
Copy link
Member

mih commented Mar 4, 2021

Looking at the discussion again after a month, and weighing the pros and cons I still think the benefits dominate the penalties. So I stand by my original approval. Looking into a better way to make datalad and SSH play with each other is much desirable, but this particular PR does not make things worse at the behavior level...code-wise it is a substantial improvement.

I restarted the error-test on Travis FTR, but I would merge once things are green.

@mih mih merged commit b15ebe9 into datalad:master Mar 4, 2021
@kyleam kyleam deleted the git-annex-use-git-ssh branch March 4, 2021 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants