BF: annexrepo: Adjust overly selective is_special_annex_remote() #3499

kyleam · 2019-06-28T20:06:27Z

To decide if a remote is an annex special remote, we check whether
'annex-externaltype' or 'annex-webdav' is configured for the remote.
This mis-classifies nearly all internal special remotes (anything but
webdav) as an ordinary remote because those don't have
'annex-externaltype' configured.

There's no common annex- option for special remotes. As the
definition of annex's findSpecialRemotes shows, each special remote is
expected to have at least an annex-TYPE option. We could keep a list
of known internal special remotes and look for that or externaltype,
but the list of internal special remotes is a moving target.

findSpecialRemotes' documentation mentions that "special remotes don't
have a configured url", so let's instead identify them based on not
having a URL and having some annex- option aside from annex-uuid and
annex-ignore, two particularly common non-special-remote options.

Given that all valid Git remotes have a configured URL, this should be
a reliable classification for correctly configured remotes. It's of
course possible to get mis-classifications if the user either removes
or adds a URL to a remote's section.

Another option would be to rely on get_special_remotes(), which calls
git cat-file git-annex:remote.log to get the special remotes, but
that'd be a change in behavior because it also considers remotes that
aren't enabled. It'd also be slower:

# This patch
% python -m timeit -s "from datalad.support.annexrepo import AnnexRepo; ar = AnnexRepo('.')"
  "ar.is_special_annex_remote('origin')"
10000 loops, best of 3: 143 usec per loop

# Using get_special_remotes() in is_special_annex_remote()
% python -m timeit -s "from datalad.support.annexrepo import AnnexRepo; ar = AnnexRepo('.')"
  "ar.is_special_annex_remote('origin')"
100 loops, best of 3: 8.82 msec per loop

Fixes #3497.

To decide if a remote is an annex special remote, we check whether 'annex-externaltype' or 'annex-webdav' is configured for the remote. This mis-classifies nearly all internal special remotes (anything but webdav) as an ordinary remote because those don't have 'annex-externaltype' configured. There's no common annex- option for special remotes. As the definition of annex's findSpecialRemotes shows, each special remote is expected to have at least an annex-TYPE option. We could keep a list of known internal special remotes and look for that or externaltype, but the list of internal special remotes is a moving target. findSpecialRemotes' documentation mentions that "special remotes don't have a configured url", so let's instead identify them based on _not_ having a URL and having some annex- option aside from annex-uuid and annex-ignore, two particularly common non-special-remote options. Given that all valid Git remotes have a configured URL, this should be a reliable classification for correctly configured remotes. It's of course possible to get mis-classifications if the user either removes or adds a URL to a remote's section. Another option would be to rely on get_special_remotes(), which calls `git cat-file git-annex:remote.log` to get the special remotes, but that'd be a change in behavior because it also considers remotes that aren't enabled. It'd also be slower: # This patch % python -m timeit -s "from datalad.support.annexrepo import AnnexRepo; ar = AnnexRepo('.')" "ar.is_special_annex_remote('origin')" 10000 loops, best of 3: 143 usec per loop # Using get_special_remotes() in is_special_annex_remote() % python -m timeit -s "from datalad.support.annexrepo import AnnexRepo; ar = AnnexRepo('.')" "ar.is_special_annex_remote('origin')" 100 loops, best of 3: 8.82 msec per loop Fixes datalad#3497.

kyleam · 2019-06-28T20:09:30Z

@leej3, the patch has changed a bit from the one posted at gh-3497, so it'd be great if you could confirm that this PR fixes your issue as well. Thanks!

leej3 · 2019-06-28T22:28:21Z

Confirmed. This works too. Thanks

yarikoptic · 2019-06-29T13:55:41Z

Thank you! Is that such a frequent call to do that timing is critical? I would have gone for get_special_remotes() + taking intersection with enabled (present in .git/config) remotes so that there is no guess work

kyleam · 2019-07-01T14:40:06Z

Is that such a frequent call to do that timing is critical?

To my mind the question isn't whether it is critical (it's not, and very few things are), but what the extra cost gains us.

Within datalad itself we call this when looping over remotes and when traversing dataset hierarchies, so by not calling get_special_remotes() within is_special_annex_remote() we're avoiding some number of git cat-file calls that depends on the repository state and dataset layout. (Or alternatively we're avoiding reworking the code and maybe interface to use some kind of cache.)

I would have gone for get_special_remotes() + taking intersection with enabled (present in .git/config) remotes so that there is no guess work

So AFAICS the main benefit there is that, if the config file has an invalid Git remote without a URL and with some annex-* options, we don't mis-classify it as a special remote. A Git operation with the invalid remote would fail anyway, and I don't see an issue with is_special_annex_remote() assuming a valid remote.

Do you think there are any valid remotes where the intersection approach would give a different answer?

I was able to come up with one: gcrypt. The problem there is that the remote is actually a normal remote too. So while technically is_special_annex_remote() should return true, doing so would make the two current callers of is_special_annex_remote() behave in an unintended way because those callers assume a remote is either special or not.

yarikoptic · 2019-07-01T15:34:42Z

Similar case I guess might happen with a git special remote which we establish when using some ssh+http accessible server as a common data src? (--as-common-datasrc) Indeed semantic there is not clear -- those are "hybrid" remotes. In the long run I wonder if we should RF into get_remotes(..., type=(git|special|hybrid)) or get_remote_types or get_remote_features to make it possible to disambiguate more explicitly?

kyleam · 2019-07-01T15:56:21Z

Indeed semantic there is not clear -- those are "hybrid" remotes. In the long run I wonder if we should RF into get_remotes(..., type=(git|special|hybrid)) or get_remote_types or get_remote_features to make it possible to disambiguate more explicitly?

I certainly think it should be fixed somehow, but that seems well outside this PR.

yarikoptic · 2019-07-01T16:31:57Z

Sure, let's make things better incrementally ;-) Cheers!

0.11.6 (Jul 30, 2019) -- am I the last of 0.11.x? Primarily bug fixes to achieve more robust performance Fixes - Our tests needed various adjustments to keep up with upstream changes in Travis and Git. ([#3479][]) ([#3492][]) ([#3493][]) - `AnnexRepo.is_special_annex_remote` was too selective in what it considered to be a special remote. ([#3499][]) - We now provide information about unexpected output when git-annex is called with `--json`. ([#3516][]) - Exception logging in the `__del__` method of `GitRepo` and `AnnexRepo` no longer fails if the names it needs are no longer bound. ([#3527][]) - [addurls][] botched the construction of subdataset paths that were more than two levels deep and failed to create datasets in a reliable, breadth-first order. ([#3561][]) - Cloning a `type=git` special remote showed a spurious warning about the remote not being enabled. ([#3547][]) Enhancements and new features - For calls to git and git-annex, we disable automatic garbage collection due to past issues with GitPython's state becoming stale, but doing so results in a larger .git/objects/ directory that isn't cleaned up until garbage collection is triggered outside of DataLad. Tests with the latest GitPython didn't reveal any state issues, so we've re-enabled automatic garbage collection. ([#3458][]) - [rerun][] learned an `--explicit` flag, which it relays to its calls to [run][[]]. This makes it possible to call `rerun` in a dirty working tree ([#3498][]). - The [metadata][] command aborts earlier if a metadata extractor is unavailable. ([#3525][]) * tag '0.11.6': (56 commits) [DATALAD RUNCMD] make update-changelog finalize CHANGELOG.md entry and boost version BF(DOC): close [create] with [] to not cause WARNING by md-strict pandoc CHANGELOG.md: Link entry from b3e8adb CHANGELOG.md: Add entry for gh-3547 CHANGELOG.md: Add entry for gh-3561 CHANGELOG.md: Add link for addurls RF: inform about special remotes based on autoenable config CHANGELOG.md: Second batch for 0.11.6 BF: addurls: Process datasets in a stable, breadth-first order BF: addurls: Fix construction of nested subpaths TST: addurls: Don't hard-code path separator BF(TST): skip test_v7_detached_get in direct mode - fails to annex upgrade TST: benchmark-travis-pr: Swap 'pip install' and 'git show' TST: benchmark-travis-pr: Move repeated logic to run_asv() TST: benchmark-travis-pr: Support other bases TST: benchmark-travis-pr: Tweak message about current HEAD TST: benchmark-travis-pr: Simplify two git commands into one TST: benchmark-travis-pr: Reorder and break up lines TST: benchmark-travis-pr: Move command for running asv into function ...

yarikoptic merged commit 9a65227 into datalad:0.11.x Jul 1, 2019

kyleam deleted the special-remote-check branch July 1, 2019 16:54

yarikoptic added this to the Release 0.11.6 milestone Jul 26, 2019

kyleam mentioned this pull request Feb 26, 2020

BF: LFS special remote has "url" in .git/config #4199

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BF: annexrepo: Adjust overly selective is_special_annex_remote() #3499

BF: annexrepo: Adjust overly selective is_special_annex_remote() #3499

kyleam commented Jun 28, 2019

kyleam commented Jun 28, 2019

leej3 commented Jun 28, 2019

yarikoptic commented Jun 29, 2019

kyleam commented Jul 1, 2019

yarikoptic commented Jul 1, 2019

kyleam commented Jul 1, 2019 •

edited

Loading

yarikoptic commented Jul 1, 2019

BF: annexrepo: Adjust overly selective is_special_annex_remote() #3499

BF: annexrepo: Adjust overly selective is_special_annex_remote() #3499

Conversation

kyleam commented Jun 28, 2019

kyleam commented Jun 28, 2019

leej3 commented Jun 28, 2019

yarikoptic commented Jun 29, 2019

kyleam commented Jul 1, 2019

yarikoptic commented Jul 1, 2019

kyleam commented Jul 1, 2019 • edited Loading

yarikoptic commented Jul 1, 2019

kyleam commented Jul 1, 2019 •

edited

Loading