Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BF: annexrepo: Adjust overly selective is_special_annex_remote() #3499

Merged
merged 1 commit into from Jul 1, 2019

Conversation

@kyleam
Copy link
Contributor

@kyleam kyleam commented Jun 28, 2019

To decide if a remote is an annex special remote, we check whether
'annex-externaltype' or 'annex-webdav' is configured for the remote.
This mis-classifies nearly all internal special remotes (anything but
webdav) as an ordinary remote because those don't have
'annex-externaltype' configured.

There's no common annex- option for special remotes. As the
definition of annex's findSpecialRemotes shows, each special remote is
expected to have at least an annex-TYPE option. We could keep a list
of known internal special remotes and look for that or externaltype,
but the list of internal special remotes is a moving target.

findSpecialRemotes' documentation mentions that "special remotes don't
have a configured url", so let's instead identify them based on not
having a URL and having some annex- option aside from annex-uuid and
annex-ignore, two particularly common non-special-remote options.

Given that all valid Git remotes have a configured URL, this should be
a reliable classification for correctly configured remotes. It's of
course possible to get mis-classifications if the user either removes
or adds a URL to a remote's section.

Another option would be to rely on get_special_remotes(), which calls
git cat-file git-annex:remote.log to get the special remotes, but
that'd be a change in behavior because it also considers remotes that
aren't enabled. It'd also be slower:

# This patch
% python -m timeit -s "from datalad.support.annexrepo import AnnexRepo; ar = AnnexRepo('.')"
  "ar.is_special_annex_remote('origin')"
10000 loops, best of 3: 143 usec per loop

# Using get_special_remotes() in is_special_annex_remote()
% python -m timeit -s "from datalad.support.annexrepo import AnnexRepo; ar = AnnexRepo('.')"
  "ar.is_special_annex_remote('origin')"
100 loops, best of 3: 8.82 msec per loop

Fixes #3497.

To decide if a remote is an annex special remote, we check whether
'annex-externaltype' or 'annex-webdav' is configured for the remote.
This mis-classifies nearly all internal special remotes (anything but
webdav) as an ordinary remote because those don't have
'annex-externaltype' configured.

There's no common annex- option for special remotes.  As the
definition of annex's findSpecialRemotes shows, each special remote is
expected to have at least an annex-TYPE option.  We could keep a list
of known internal special remotes and look for that or externaltype,
but the list of internal special remotes is a moving target.

findSpecialRemotes' documentation mentions that "special remotes don't
have a configured url", so let's instead identify them based on _not_
having a URL and having some annex- option aside from annex-uuid and
annex-ignore, two particularly common non-special-remote options.

Given that all valid Git remotes have a configured URL, this should be
a reliable classification for correctly configured remotes.  It's of
course possible to get mis-classifications if the user either removes
or adds a URL to a remote's section.

Another option would be to rely on get_special_remotes(), which calls
`git cat-file git-annex:remote.log` to get the special remotes, but
that'd be a change in behavior because it also considers remotes that
aren't enabled.  It'd also be slower:

    # This patch
    % python -m timeit -s "from datalad.support.annexrepo import AnnexRepo; ar = AnnexRepo('.')"
      "ar.is_special_annex_remote('origin')"
    10000 loops, best of 3: 143 usec per loop

    # Using get_special_remotes() in is_special_annex_remote()
    % python -m timeit -s "from datalad.support.annexrepo import AnnexRepo; ar = AnnexRepo('.')"
      "ar.is_special_annex_remote('origin')"
    100 loops, best of 3: 8.82 msec per loop

Fixes datalad#3497.
@kyleam
Copy link
Contributor Author

@kyleam kyleam commented Jun 28, 2019

@leej3, the patch has changed a bit from the one posted at gh-3497, so it'd be great if you could confirm that this PR fixes your issue as well. Thanks!

@leej3
Copy link

@leej3 leej3 commented Jun 28, 2019

Confirmed. This works too. Thanks

@yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Jun 29, 2019

Thank you! Is that such a frequent call to do that timing is critical? I would have gone for get_special_remotes() + taking intersection with enabled (present in .git/config) remotes so that there is no guess work

@kyleam
Copy link
Contributor Author

@kyleam kyleam commented Jul 1, 2019

Is that such a frequent call to do that timing is critical?

To my mind the question isn't whether it is critical (it's not, and very few things are), but what the extra cost gains us.

Within datalad itself we call this when looping over remotes and when traversing dataset hierarchies, so by not calling get_special_remotes() within is_special_annex_remote() we're avoiding some number of git cat-file calls that depends on the repository state and dataset layout. (Or alternatively we're avoiding reworking the code and maybe interface to use some kind of cache.)

I would have gone for get_special_remotes() + taking intersection with enabled (present in .git/config) remotes so that there is no guess work

So AFAICS the main benefit there is that, if the config file has an invalid Git remote without a URL and with some annex-* options, we don't mis-classify it as a special remote. A Git operation with the invalid remote would fail anyway, and I don't see an issue with is_special_annex_remote() assuming a valid remote.

Do you think there are any valid remotes where the intersection approach would give a different answer?

I was able to come up with one: gcrypt. The problem there is that the remote is actually a normal remote too. So while technically is_special_annex_remote() should return true, doing so would make the two current callers of is_special_annex_remote() behave in an unintended way because those callers assume a remote is either special or not.

@yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Jul 1, 2019

Similar case I guess might happen with a git special remote which we establish when using some ssh+http accessible server as a common data src? (--as-common-datasrc) Indeed semantic there is not clear -- those are "hybrid" remotes. In the long run I wonder if we should RF into get_remotes(..., type=(git|special|hybrid)) or get_remote_types or get_remote_features to make it possible to disambiguate more explicitly?

@kyleam
Copy link
Contributor Author

@kyleam kyleam commented Jul 1, 2019

Indeed semantic there is not clear -- those are "hybrid" remotes. In the long run I wonder if we should RF into get_remotes(..., type=(git|special|hybrid)) or get_remote_types or get_remote_features to make it possible to disambiguate more explicitly?

I certainly think it should be fixed somehow, but that seems well outside this PR.

@yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Jul 1, 2019

Sure, let's make things better incrementally ;-) Cheers!

@yarikoptic yarikoptic merged commit 9a65227 into datalad:0.11.x Jul 1, 2019
2 of 3 checks passed
@kyleam kyleam deleted the special-remote-check branch Jul 1, 2019
@yarikoptic yarikoptic added this to the Release 0.11.6 milestone Jul 26, 2019
yarikoptic added a commit that referenced this issue Jul 31, 2019
0.11.6 (Jul 30, 2019) -- am I the last of 0.11.x?

Primarily bug fixes to achieve more robust performance

Fixes

- Our tests needed various adjustments to keep up with upstream
  changes in Travis and Git. ([#3479][]) ([#3492][]) ([#3493][])

- `AnnexRepo.is_special_annex_remote` was too selective in what it
  considered to be a special remote.  ([#3499][])

- We now provide information about unexpected output when git-annex is
  called with `--json`.  ([#3516][])

- Exception logging in the `__del__` method of `GitRepo` and
  `AnnexRepo` no longer fails if the names it needs are no longer
  bound.  ([#3527][])

- [addurls][] botched the construction of subdataset paths that were
  more than two levels deep and failed to create datasets in a
  reliable, breadth-first order.  ([#3561][])

- Cloning a `type=git` special remote showed a spurious warning about
  the remote not being enabled.  ([#3547][])

Enhancements and new features

- For calls to git and git-annex, we disable automatic garbage
  collection due to past issues with GitPython's state becoming stale,
  but doing so results in a larger .git/objects/ directory that isn't
  cleaned up until garbage collection is triggered outside of DataLad.
  Tests with the latest GitPython didn't reveal any state issues, so
  we've re-enabled automatic garbage collection.  ([#3458][])

- [rerun][] learned an `--explicit` flag, which it relays to its calls
  to [run][[]].  This makes it possible to call `rerun` in a dirty
  working tree ([#3498][]).

- The [metadata][] command aborts earlier if a metadata extractor is
  unavailable.  ([#3525][])

* tag '0.11.6': (56 commits)
  [DATALAD RUNCMD] make update-changelog
  finalize CHANGELOG.md entry and boost version
  BF(DOC): close [create] with [] to not cause WARNING by md-strict pandoc
  CHANGELOG.md: Link entry from b3e8adb
  CHANGELOG.md: Add entry for gh-3547
  CHANGELOG.md: Add entry for gh-3561
  CHANGELOG.md: Add link for addurls
  RF: inform about special remotes based on autoenable config
  CHANGELOG.md: Second batch for 0.11.6
  BF: addurls: Process datasets in a stable, breadth-first order
  BF: addurls: Fix construction of nested subpaths
  TST: addurls: Don't hard-code path separator
  BF(TST): skip test_v7_detached_get in direct mode - fails to annex upgrade
  TST: benchmark-travis-pr: Swap 'pip install' and 'git show'
  TST: benchmark-travis-pr: Move repeated logic to run_asv()
  TST: benchmark-travis-pr: Support other bases
  TST: benchmark-travis-pr: Tweak message about current HEAD
  TST: benchmark-travis-pr: Simplify two git commands into one
  TST: benchmark-travis-pr: Reorder and break up lines
  TST: benchmark-travis-pr: Move command for running asv into function
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants