Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clone URL mapping ability #5749

Merged
merged 5 commits into from Jun 30, 2021
Merged

Clone URL mapping ability #5749

merged 5 commits into from Jun 30, 2021

Conversation

mih
Copy link
Member

@mih mih commented Jun 16, 2021

This implements a simple feature and fix for #4822. It allows for configuring an arbitrary number of URL mapping specification. With this change one can configure datalad to transfer URLs like https://osf.io/q8xnk/ (copied from a browser window), which do not work, but can be reliably mapped into working URLs (osf://q8xnk in this case) via a regular expression specification.

In the example above, this would do the trick:

datalad -c 'datalad.url.substitute.osf=,^https://osf.io/([^/]+)[/]*$,osf://\1' clone https://osf.io/q8xnk/

as usual, configuration is read from all supported places. Here is an example of a mapping rule
for GitHub

% datalad x-configuration
...
# GitHub URL substitution rule
# Substitution specification defined as a string with a match and
# substitution expression, each following Python's regular expression
# syntax. Both expressions are concatenated to a single string with an
# arbitrary delimiter character. The delimiter is defined by prefixing
# the string with the delimiter. Prefix and delimiter are stripped
# from the expressions (Example: ",^http://(.*)$,https://\1"). This
# setting can be defined multiple times. Substitutions will be applied
# incrementally, in order of their definition. The first substitution
# in such a series must match, otherwise no further substitutions in a
# series will be considered. However, following the first match all
# further substitutions in a series are processed, regardless whether
# intermediate expressions match or not.
datalad.clone.url-substitute.github=(',https?://github.com/([^/]+)/(.*)$,\\1###\\2', ',[/\\\\]+,-', ',\\s+|(%2520)+|(%20)+,_', ',([^#]+)###(.*),https://github.com/\\1/\\2')
...

It would be possible to implement a mechanism to read additional specifications from an entrypoint that can be provided by extensions. However, it would make also sense to finally support extension-provided (default) configuration and use that mechanism.

TODO:

  • Implement sketch for discussion
  • Documentation
  • Test
  • should it rather be datalad.clone.url-substitute? (to limit the scope)
  • Institutionalize OSF-related mapping from example above?

Closes #4822

@codecov
Copy link

codecov bot commented Jun 16, 2021

Codecov Report

Merging #5749 (14ba2cf) into master (f5b6db0) will decrease coverage by 35.40%.
The diff coverage is 50.39%.

❗ Current head 14ba2cf differs from pull request most recent head 3864a53. Consider uploading reports for the commit 3864a53 to get more accurate results
Impacted file tree graph

@@             Coverage Diff             @@
##           master    #5749       +/-   ##
===========================================
- Coverage   90.63%   55.23%   -35.41%     
===========================================
  Files         308      306        -2     
  Lines       42181    42345      +164     
===========================================
- Hits        38231    23389    -14842     
- Misses       3950    18956    +15006     
Impacted Files Coverage Δ
datalad/config.py 75.00% <0.00%> (-21.72%) ⬇️
datalad/core/local/create.py 97.16% <ø> (-0.71%) ⬇️
datalad/core/local/diff.py 90.72% <ø> (ø)
datalad/core/local/run.py 98.15% <ø> (ø)
datalad/core/local/status.py 96.42% <ø> (ø)
datalad/distributed/create_sibling_ria.py 67.81% <ø> (-14.95%) ⬇️
datalad/distributed/export_archive_ora.py 38.46% <ø> (-43.08%) ⬇️
datalad/distributed/ora_remote.py 29.07% <0.00%> (-2.74%) ⬇️
datalad/distribution/create_sibling.py 51.81% <ø> (-29.40%) ⬇️
datalad/distribution/dataset.py 92.46% <ø> (-1.99%) ⬇️
... and 255 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f5b6db0...3864a53. Read the comment docs.

# MIH: unclear if readding the original URL as a fallback,
# in case the mapping was stupid is desirable.
# if the candidate URL was hopeless to begin with, it would
# result in two failed attempts instead of one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re comment - I agree. Don't "fix" explicitly given configs. However, we could use the hint mechanism to point out, that a candidate failed that was the result of such a replacement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear: By agreeing I mean: Do NOT add the original one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re comment - I agree. Don't "fix" explicitly given configs. However, we could use the hint mechanism to point out, that a candidate failed that was the result of such a replacement.

I am not so sure about that. Config could come from all kinds of places, also sources that are not under a user's control. If any of those substitutions is too loose, what then? File a bug and wait for a release?

Copy link
Member

@bpoldrack bpoldrack Jun 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any of those substitutions is too loose, what then? File a bug and wait for a release?

Put a high-priority noop substitution in your confgi? ;-)

Pointing to one issue: If the config is read from everywhere (as it should), explicit priorities seem to be needed (rather than just implicitly by git-config-return order. I think that is true independently of the question at hand.
There could be conflicting substitutions and you don't want to try the wrong one first everytime.

@yarikoptic
Copy link
Member

  • should it rather be datalad.clone.url-substitute? (to limit the scope)

FWIW - +1 on that. I just wonder if we would have some other 'clone.url' related config options (.backends, etc) and might want to make it datalad.clone.url.substitute although we do have only few of such 4 level'ed configs and generally they are more of the 3 level form with last level composing into a single word.

@yarikoptic
Copy link
Member

  • should it rather be datalad.clone.url-substitute? (to limit the scope)

FWIW - +1 on that. ...

FWIW, I thought that may be we could/should parallel how git does it, e.g. I have

[url "git@github.com:"]
    pushinsteadOf = https://yarikoptic@github.com/
    pushinsteadOf = https://github.com/
    pushinsteadOf = http://github.com/
    pushInsteadOf = git://github.com/

but trying it out, with

[datalad "url.osf://\1"]
    cloneinsteadOf = "https://osf.io/([^/]*)/?"

points to shortcoming of swallowed backslash

$> git config -f /tmp/urls.config --get-regex 'url\.*'
datalad.url.osf://1.cloneinsteadof https://osf.io/([^/]*)/?

but also if target pattern has . in it, I guess splitting apart would also be tricky

$> cat /tmp/urls.config
[datalad "url.http://something.com/\1/download"]
    cloneinsteadOf = "https://buga.io/([^/]*)/?"

$> git config -f /tmp/urls.config --get-regex 'url\.*'
datalad.url.http://something.com/1/download.cloneinsteadof https://buga.io/([^/]*)/?

although we could always assume that it is datalad.url.<str>.whatinsteadof and thus .rsplit(1) everything after datalad.url.

@mih
Copy link
Member Author

mih commented Jun 21, 2021

re config style @yarikoptic

I was toying with the idea to adopt that pattern a bit more

[url "osf://\\1"]
        datalad-clone-insteadOf = "https://osf.io/([^/]*)/?"
% git config -l | grep osf
url.osf://\1.datalad-clone-insteadof=https://osf.io/([^/]*)/?

(also note the escaped backslash)

I don't see why we could not use the url. namespace directly in this fashion, and it would make things look nicer and better grouped, per URL (i.e. github handling, OSF handling ...).

The counter argument is that we need/require(?) a different URL specification style -- which actually makes it a similar, but different beast.

However, which could extend (instead of replacing) the specification, such that if we see something like this:

[url "git@github.com:"]
    datalad-clone-insteadof = "https://yarikoptic@github.com/"

we replace it internally with:

[url "git@github.com:\\1"]
    datalad-clone-insteadof = "https://yarikoptic@github.com/(.*)$"

i.e. mimic the base URL replacement done by Git.

re priorities @bpoldrack

Pointing to one issue: If the config is read from everywhere (as it should), explicit priorities seem to be needed (rather than just implicitly by git-config-return order. I think that is true independently of the question at hand. There could be conflicting substitutions and you don't want to try the wrong one first everytime.

I am struggling to come up with a real use case for this -- if you had some in mind, please share. If there is no real use case, I think we should not implement prioritization. If we find out later that we need it, we could easily extend the setting name scheme ala datalad.get.subdataset-source-candidate-00...:

[url "osf://\\1"]
        datalad-clone-insteadof-100onlythisworks = "https://osf.io/([^/]*)/?"

My rational for "we don't need this" is:

  • for subdataset clone candidates we already have a mechanism that can do this -- no need to have two alternatives -- so we are talking about initial clones only
  • for initial clones the cost of a failed attempt is much lower (i.e. not attempted wrongly for each of 10k subdataset, but only once for the root)
  • we have a prioritization of config sources already, a system/user/repo-provided suboptimal configuration can already by overridden

So Q is: which use case cannot be handled by the levers that are already in place?

(NB: Reported typos are fixed locally)

@yarikoptic
Copy link
Member

I don't see why we could not use the url. namespace directly in this fashion, and it would make things look nicer and better grouped, per URL (i.e. github handling, OSF handling ...).

primarily for consistency since all datalad-specific settings reside under datalad. . What I also wonder -- may be should be a hybrid like

[datalad "url.datalad-osf"]
    cloneinsteadOf = ",https://osf.io/([^/]*)/?,url.osf://\1"

or generally

[datalad "url.<mapid>"]
    cloneinsteadOf = "<substitution-pair>"

which might help to avoid any possible oddity from sticking regexes into section names, easier overrides (we would need to disallow multiple entries per section I guess), etc?

@mih
Copy link
Member Author

mih commented Jun 22, 2021

[datalad "url.<mapid>"]
   cloneinsteadOf = "<substitution-pair>"

TBH, I don't really like it. The only thing is has in common with the git-construct is the name. I do like the <substitution-pair> syntax better than what git does. Unless there is a good use of the mapid, I think I came almost full circle to datalad.clone.url-substitute = "<substitution-pair>".

Update: I might have found a use for mapid. Stay tuned.

@mih
Copy link
Member Author

mih commented Jun 27, 2021

I have updated the PR with a new implementation that is capable of providing a fix analog to #5709, but without custom service dependent conditional. Instead, we can define and ship a default mapping for Github URL, as demo'd in the included unit test at https://github.com/datalad/datalad/pull/5749/files#diff-9789036616f2c57b1a1882ce86a6dc36cd5611c24e73a17eb1305da8335d98d8R1604

A full documentation update is still pending (which is pending a decision of match prioritization, see notes inside), but maybe this is already enough for you @yarikoptic to get an idea where I am going.

In contrast to #5709 this is a fairly generic feature with relatively wide applicability (e.g. #4822), but pretty much any kind of URL sanitization, redirection, reconfiguration. It also makes my suggestion in #5709 (comment) obsolete, as it offers the same functionality without the custom building block.

@mih

This comment has been minimized.

It allows for configuring an arbitrary number of URL mapping
specification. With this change one can configure datalad to transfer
URLs like https://osf.io/q8xnk/ (copied from a browser window), which do
not work, but can be reliably mapped into working URLs (osf://q8xnk in
this case) via a regular expression specification.

Fixes dataladgh-4822
@mih mih force-pushed the bf-4822 branch 2 times, most recently from a1956aa to 2759e66 Compare June 28, 2021 09:54
@mih mih marked this pull request as ready for review June 28, 2021 09:55
This is an alternative fix to datalad#5709 that makes dataset
hierarchies installable from GitHub -- as they are put up by
`create-sibling-github`. The main difference here is that there is
no service-specific code needed, only a configuration of the necessary
mapping(s). Consequently, the mechanism is applicable to other types of
mappings and other services.

The GitHub usecase is already rather sophisticated, but I believe the
approach of an initial critical match (to enter a mapping series), plus
an arbitrary number of additional may-also-match mappings is relative
powerful, and may be able to capture even more use cases.

What remains to be decided and implemented (or ignored) is a
prioritization of match series.

The tests provided by @yarikoptic with datalad#5709 are included
here too.
mih added 2 commits June 29, 2021 10:20
As noted in the code, this should move to the datalad-osf extension,
as soon as it could provide such configuration.
datalad#5769

But for now, it can meaningfully serve as a demo how multiple
subsitution series can coexist.
Leave a comment for a potential future sorting feature.
Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't try, left comments on TODO comments, but otherwise - love it

# if we have a matching mapping replace the URL in its position
for u in urls:
# we only permit a single match
# TODO we likely want to RF this to pick the longest match
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why longest would be desired

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what Git does, when deciding which insteadof setting to apply. The rational is probably, the longer the match the more specific. In the regex world things will not be that simple, but the analog "more specific before more generic" could still make sense and be implementable.

datalad/core/distributed/clone.py Show resolved Hide resolved
from datalad.interface.common_cfg import definitions
subst_keys.update(k for k in definitions if k.startswith(cfg_prefix))
# TODO a potential sorting of substitution series could be implemented
# here
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we would need to introduce datalad.clone.url-substitute.<series>.priority or alike?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That, or sorting based on series ID, or inspection of the regex match specification (see "longest match" comment). I think there are multiple possible ways forward.

@bpoldrack
Copy link
Member

@mih

My rational for "we don't need this" is:

  • for subdataset clone candidates we already have a mechanism that can do this -- no need to have two alternatives -- so we are talking about initial clones only
  • for initial clones the cost of a failed attempt is much lower (i.e. not attempted wrongly for each of 10k subdataset, but only once for the root)
  • we have a prioritization of config sources already, a system/user/repo-provided suboptimal configuration can already by overridden

Generally agree. What I had in mind is admittedly mostly hypothetical. There could be different handlers from different extensions that want to handle URL from some platform. I might then want to curate the order since I know that I'll mostly want A to handle it and only fall back to B if A can't succeed. Something like that. But I guess we can stick with current ways of overriding for now. I'm fine with deferring that until we find it's actually needed.

@mih
Copy link
Member Author

mih commented Jun 30, 2021

Thanks for the comments and the approval. I guess this is good to go now, and we will see when the need for priorities arises.

@mih mih merged commit 9ff9ce3 into datalad:master Jun 30, 2021
@mih mih deleted the bf-4822 branch June 30, 2021 06:06
@yarikoptic yarikoptic added the semver-minor Increment the minor version when merged label Jul 8, 2021
@yarikoptic yarikoptic added this to the 0.15.0 milestone Jul 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-minor Increment the minor version when merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

clone/install: introduce url mapping convenience?
3 participants