Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NF: Helper to support Git-style URL rewriting #4064

Merged
merged 6 commits into from Jan 22, 2020
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
46 changes: 46 additions & 0 deletions datalad/config.py
Expand Up @@ -722,3 +722,49 @@ def unset(self, var, where='dataset', reload=True):

# use unset all as it is simpler for now
self._run(['--unset-all', var], where=where, reload=reload)


def rewrite_url(cfg, url):
"""Any matching 'url.<base>.insteadOf' configuration is applied

Any URL that starts with such a configuration will be rewritten
to start, instead, with <base>. When more than one insteadOf
strings match a given URL, the longest match is used.

Parameters
----------
cfg : ConfigManager or dict
dict-like with configuration variable name/value-pairs.
url : str
URL to be rewritten, if matching configuration is found.

Returns
-------
str
Rewritten or unmodified URL.
"""
insteadof = {
# only leave the base url
k[4:-10]: v
for k, v in cfg.items()
if k.startswith('url.') and k.endswith('.insteadof')
}
prev_url = None
while url != prev_url:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, git itself doesn't recursively resolve the insteadof values. Is there a concrete use case for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it to be an awesome thing to do, but now I cannot think of an actual need for it. Will remove (but keep the memory of having done it close to my heart).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

prev_url = url
# all config that applies
matches = {k: len(v) for k, v in insteadof.items()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do see one problem here: You match all configs (well, the longest). However, what if I have a --global config that is long and a --local overwrite which is shorter? I think this should use ConfigManager's get, check for a returned tuple and consider order therein.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not how gitconfig works:

% git config --show-origin -l | grep 'github.*insteadof'
file:/home/mih/.gitconfig       url.git@github.com:.insteadof=github:
% git config --add 'url.git@github.com:.insteadof' 'mike:'
% git config --show-origin -l | grep 'github.*insteadof'                                                      
file:/home/mih/.gitconfig       url.git@github.com:.insteadof=github:
file:.git/config        url.git@github.com:.insteadof=mike:
% git config --get 'url.git@github.com:.insteadof'
mike:

Local overwrites global, it does not append.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, my suggestion is wrong. The usual tuple considers the keys, not the values. So, get doesn't work. Still, I think same replacements should overwrite each other. So, assuming items yielding in correct order (least significant first), that insteadof dict, needs to be based on the values, I guess.

Copy link
Member Author

@mih mih Jan 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I don't get it yet. What is it that this function does or doesn't that git itself does when rewriting URLs. The point here is to be as clever (no more, no less) than Git.
Here is the definition: https://git-scm.com/docs/git-config#Documentation/git-config.txt-urlltbasegtinsteadOf

Copy link
Member

@bpoldrack bpoldrack Jan 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point here is to be as clever (no more, no less) than Git.

I absolutely agree.
I was talking about using the same "label" with different replacements (basically ria usecase). So, by definition the matches don't differ in length. And I thought that git itself would overwrite them, if specified at different levels. However, that doesn't seem to be the case.

/tmp/some$ git config --add --global url."example.com:".insteadOf ex:
/tmp/some$ git config --add --local url."other.com:".insteadOf ex:
/tmp/some$ git remote add test ex:/some/where
/tmp/some$ git remote -v
test    example.com:/some/where (fetch)
test    example.com:/some/where (push)

That's quite unexpected to me.

Edit: Also note, that on same length of the real URL piece the same thing happens: .git/config isn't applied, but ~/.gitconfig is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might also well be, that the two of user interpret differently what length is considered by git.

When more than one insteadOf strings match a given URL, the longest match is used.

In my example I'd consider ex: the thing that needs to be matched and whose length is relevant. If it's actually the length of the replacement it's a different story. But then we need to have better wording than git itself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last for now: Tried to somehow apply local config for that ex: label above. So, length of replacement doesn't matter, length of label does (as it should judging by my understanding of the docs). However, at equal length apparently --global takes precedence over --local, which I find weird.
Not sure, whether we really account for that here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The length of the match is relevant, not the length of the replacement (see len(v) where v is the value of the config).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like te discussion ended up being about unexpected behavior that is a consequence of git's decision to use the replacement as the key (discussed here). But I think Ben's point about the value potentially being a tuple still needs to be considered to be safe. Consider for example the unlikely event that the user has repeat values for the same key:

$ cd $(mktemp -d --tmpdir ga-XXXXXXX)
$ git init
$ git config url.'bar:'.insteadof foo:
$ git config --add url.'bar:'.insteadof foo:
$ git remote add test bar:abc
$ git remote -v
test	bar:abc (fetch)
test	bar:abc (push)

Here's what rewrite_url would currently return.

$  python -c 'from datalad import config; cfg = config.ConfigManager(); print(config.rewrite_url(cfg, "foo:abc"))'
bar:o:abc

That's because the value is a tuple and the length is calculated as 2:

$ python -c 'from datalad import cfg; print(cfg.get("url.bar:.insteadof"))'
('foo:', 'foo:')

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

if url.startswith(v)}
# find longest match, like Git does
if matches:
rewrite_base, match_len = sorted(
matches.items(),
key=lambda x: x[1],
reverse=True,
)[0]
url = '{}{}'.format(rewrite_base, url[match_len:])
return url


# for convenience, bind to class too
ConfigManager.rewrite_url = rewrite_url
3 changes: 3 additions & 0 deletions datalad/core/distributed/clone.py
Expand Up @@ -771,6 +771,9 @@ def decode_source_spec(spec, cfg=None):
props['type'] = 'dataladri'
props['giturl'] = source_ri.as_git_url()
elif isinstance(source_ri, URL) and source_ri.scheme.startswith('ria+'):
# Git never gets to see these URLs, so let's manually apply any
# rewrite configuration Git might know about
source_ri = RI(cfg.rewrite_url(spec))
# parse a RIA URI
dsid, version = source_ri.fragment.split('@', maxsplit=1) \
if '@' in source_ri.fragment else (source_ri.fragment, None)
Expand Down
24 changes: 24 additions & 0 deletions datalad/core/distributed/tests/test_clone.py
Expand Up @@ -64,6 +64,7 @@
with_sameas_remote,
known_failure,
known_failure_appveyor,
patch_config,
)
from datalad.core.distributed.clone import (
decode_source_spec,
Expand Down Expand Up @@ -686,6 +687,29 @@ def test_ria_http(lcl, storepath, url):
lcl / 'clone_failed')
assert_in("not found in upstream", str(cme.exception))

# lastly test if URL rewriting is in effect
# on the surface we clone from an SSH source identified by some custom
# label, no full URL, but URL rewriting setup maps it back to the
# HTTP URL used above
with patch_config({
'url.ria+{}#.insteadof'.format(url): 'ria+ssh://somelabel#'}):
cloned_by_label = clone(
'ria+ssh://somelabel#{}'.format(origds.id),
lcl / 'cloned_by_label',
)
# so we get the same setup as above, but....
eq_(origds.id, cloned_by_label.id)
if not ds.repo.is_managed_branch():
# test logic cannot handle adjusted branches
eq_(origds.repo.get_hexsha(), cloned_by_label.repo.get_hexsha())
ok_(cloned_by_label.config.get('remote.origin.url').startswith(url))
eq_(cloned_by_label.config.get('remote.origin.annex-ignore'), 'true')
# ... the clone candidates go with the label-based URL such that
# future get() requests acknowlege a (system-wide) configuration
# update
eq_(cloned_by_label.config.get('datalad.get.subdataset-source-candidate-origin'),
'ria+ssh://somelabel#{id}')


@skip_if_no_network
@with_tempfile()
Expand Down
30 changes: 29 additions & 1 deletion datalad/tests/test_config.py
Expand Up @@ -32,7 +32,10 @@

from datalad.distribution.dataset import Dataset
from datalad.api import create
from datalad.config import ConfigManager
from datalad.config import (
ConfigManager,
rewrite_url,
)
from datalad.cmd import CommandError

from datalad.support.external_versions import external_versions
Expand Down Expand Up @@ -373,3 +376,28 @@ def test_overrides():
cfg._cfgfiles,
[Path(f).read_text() for f in cfg._cfgfiles if Path(f).exists()],
))


def test_rewrite_url():
test_cases = (
# no match
('unicorn', 'unicorn'),
# custom label replacement
('example:datalad/datalad.git', 'git@example.com:datalad/datalad.git'),
# protocol enforcement
('git://example.com/some', 'https://example.com/some'),
# chained rewrite
('ria+ssh://myinst#SOMEID', 'ria+file://some/path#SOMEID'),
)
cfg_in = {
'git@example.com:': 'example:',
'https://example': 'git://example',
'ria+ssh://fully.qualified.com': 'ria+ssh://myinst',
'ria+file://some/path': 'ria+ssh://fully.qualified.com',
}
cfg = {
'url.{}.insteadof'.format(k): v
for k, v in cfg_in.items()
}
for input, output in test_cases:
assert_equal(rewrite_url(cfg, input), output)