BF: annexrepo: Adjust overly selective is_special_annex_remote() #3499
To decide if a remote is an annex special remote, we check whether 'annex-externaltype' or 'annex-webdav' is configured for the remote. This mis-classifies nearly all internal special remotes (anything but webdav) as an ordinary remote because those don't have 'annex-externaltype' configured.

There's no common annex- option for special remotes. As the definition of annex's findSpecialRemotes shows, each special remote is expected to have at least an annex-TYPE option. We could keep a list of known internal special remotes and look for that or externaltype, but the list of internal special remotes is a moving target.

findSpecialRemotes' documentation mentions that "special remotes don't have a configured url", so let's instead identify them based on _not_ having a URL and having some annex- option aside from annex-uuid and annex-ignore, two particularly common non-special-remote options. Given that all valid Git remotes have a configured URL, this should be a reliable classification for correctly configured remotes. It's of course possible to get mis-classifications if the user either removes or adds a URL to a remote's section.

Another option would be to rely on get_special_remotes(), which calls `git cat-file git-annex:remote.log` to get the special remotes, but that'd be a change in behavior because it also considers remotes that aren't enabled. It'd also be slower:

    # This patch
    % python -m timeit -s "from datalad.support.annexrepo import AnnexRepo; ar = AnnexRepo('.')" "ar.is_special_annex_remote('origin')"
    10000 loops, best of 3: 143 usec per loop

    # Using get_special_remotes() in is_special_annex_remote()
    % python -m timeit -s "from datalad.support.annexrepo import AnnexRepo; ar = AnnexRepo('.')" "ar.is_special_annex_remote('origin')"
    100 loops, best of 3: 8.82 msec per loop

Fixes datalad#3497.
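As a rough illustration of the classification rule described above (no URL, plus at least one annex- option other than annex-uuid and annex-ignore), here is a minimal sketch. It is not DataLad's actual implementation: the helper name `looks_like_special_remote`, the flat-dict representation of the Git config, and the example remotes and values are all assumptions made for the example.

```python
# Options that ordinary (non-special) remotes commonly carry, per the
# description above.
NON_SPECIAL_ANNEX_OPTIONS = {"annex-uuid", "annex-ignore"}


def looks_like_special_remote(remote, config):
    """Hypothetical helper, not DataLad's code.

    `config` is assumed to be a flat dict mapping full Git config keys
    (e.g. "remote.origin.url") to their values.
    """
    prefix = "remote.{}.".format(remote)
    options = {key[len(prefix):] for key in config if key.startswith(prefix)}
    if "url" in options:
        # A valid ordinary Git remote always has a configured URL.
        return False
    # Special if any annex- option remains once the common
    # non-special-remote options are set aside.
    return any(
        opt.startswith("annex-") and opt not in NON_SPECIAL_ANNEX_OPTIONS
        for opt in options
    )


# Illustrative config values (made up for this sketch).
config = {
    "remote.origin.url": "https://example.com/repo.git",
    "remote.origin.annex-uuid": "11111111-1111-1111-1111-111111111111",
    "remote.backup.annex-directory": "/mnt/backup",
    "remote.backup.annex-uuid": "22222222-2222-2222-2222-222222222222",
}

print(looks_like_special_remote("origin", config))
print(looks_like_special_remote("backup", config))
```

A remote with only annex-uuid and annex-ignore (and no URL) is not treated as special here, which matches the stated intent of excluding those two options.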
Confirmed. This works too. Thanks
Thank you! Is that such a frequent call to do that timing is critical? I would have gone for get_special_remotes() + taking intersection with enabled (present in .git/config) remotes so that there is no guess work
To my mind the question isn't whether it is critical (it's not, and very few things are), but what the extra cost gains us. Within datalad itself we call this when looping over remotes and when traversing dataset hierarchies, so by not calling get_special_remotes() within is_special_annex_remote() we're avoiding some number of
So AFAICS the main benefit there is that, if the config file has an invalid Git remote without a URL and with some annex-* options, we don't mis-classify it as a special remote. A Git operation with the invalid remote would fail anyway, and I don't see an issue with is_special_annex_remote() assuming a valid remote. Do you think there are any valid remotes where the intersection approach would give a different answer? I was able to come up with one: gcrypt. The problem there is that the remote is actually a normal remote too. So while technically is_special_annex_remote() should return true, doing so would make the two current callers of is_special_annex_remote() behave in an unintended way because those callers assume a remote is either special or not.
Similar case I guess might happen with a
I certainly think it should be fixed somehow, but that seems well outside this PR.
Sure, let's make things better incrementally ;-) Cheers!
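To make the gcrypt discussion in this thread concrete, the following sketch compares the URL-based heuristic with the suggested intersection approach. Everything here is an assumption for illustration: the function names, the idea of passing in the remote names recorded in remote.log as a plain list, and the example option values. Neither function is DataLad's code.

```python
NON_SPECIAL_ANNEX_OPTIONS = {"annex-uuid", "annex-ignore"}


def special_by_heuristic(options):
    """URL-based rule: no URL and some annex- option beyond the common ones.

    `options` is assumed to be a dict of one remote's config options.
    """
    if "url" in options:
        return False
    return any(
        opt.startswith("annex-") and opt not in NON_SPECIAL_ANNEX_OPTIONS
        for opt in options
    )


def special_by_intersection(name, remote_log_names, configured_names):
    """Intersection rule: recorded in remote.log AND present in .git/config."""
    return name in set(remote_log_names) & set(configured_names)


# A gcrypt remote is both a normal Git remote (it has a URL) and a special
# remote. The option values below are illustrative.
gcrypt_options = {
    "url": "gcrypt::ssh://example.com/repo.git",
    "annex-gcrypt": "true",
    "annex-uuid": "33333333-3333-3333-3333-333333333333",
}

heuristic_answer = special_by_heuristic(gcrypt_options)
intersection_answer = special_by_intersection(
    "crypt", remote_log_names=["crypt"], configured_names=["origin", "crypt"]
)
print(heuristic_answer, intersection_answer)
```

Under these assumptions the two rules disagree for the gcrypt remote, which is the divergence described in the thread: the heuristic treats it as ordinary because it has a URL, while the intersection approach reports it as special.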
0.11.6 (Jul 30, 2019) -- am I the last of 0.11.x?

Primarily bug fixes to achieve more robust performance

Fixes

- Our tests needed various adjustments to keep up with upstream changes in Travis and Git. ([#3479][]) ([#3492][]) ([#3493][])
- `AnnexRepo.is_special_annex_remote` was too selective in what it considered to be a special remote. ([#3499][])
- We now provide information about unexpected output when git-annex is called with `--json`. ([#3516][])
- Exception logging in the `__del__` method of `GitRepo` and `AnnexRepo` no longer fails if the names it needs are no longer bound. ([#3527][])
- [addurls][] botched the construction of subdataset paths that were more than two levels deep and failed to create datasets in a reliable, breadth-first order. ([#3561][])
- Cloning a `type=git` special remote showed a spurious warning about the remote not being enabled. ([#3547][])

Enhancements and new features

- For calls to git and git-annex, we disable automatic garbage collection due to past issues with GitPython's state becoming stale, but doing so results in a larger .git/objects/ directory that isn't cleaned up until garbage collection is triggered outside of DataLad. Tests with the latest GitPython didn't reveal any state issues, so we've re-enabled automatic garbage collection. ([#3458][])
- [rerun][] learned an `--explicit` flag, which it relays to its calls to [run][[]]. This makes it possible to call `rerun` in a dirty working tree ([#3498][]).
- The [metadata][] command aborts earlier if a metadata extractor is unavailable. ([#3525][])

* tag '0.11.6': (56 commits)
  [DATALAD RUNCMD] make update-changelog finalize CHANGELOG.md entry and boost version
  BF(DOC): close [create] with [] to not cause WARNING by md-strict pandoc
  CHANGELOG.md: Link entry from b3e8adb
  CHANGELOG.md: Add entry for gh-3547
  CHANGELOG.md: Add entry for gh-3561
  CHANGELOG.md: Add link for addurls
  RF: inform about special remotes based on autoenable config
  CHANGELOG.md: Second batch for 0.11.6
  BF: addurls: Process datasets in a stable, breadth-first order
  BF: addurls: Fix construction of nested subpaths
  TST: addurls: Don't hard-code path separator
  BF(TST): skip test_v7_detached_get in direct mode - fails to annex upgrade
  TST: benchmark-travis-pr: Swap 'pip install' and 'git show'
  TST: benchmark-travis-pr: Move repeated logic to run_asv()
  TST: benchmark-travis-pr: Support other bases
  TST: benchmark-travis-pr: Tweak message about current HEAD
  TST: benchmark-travis-pr: Simplify two git commands into one
  TST: benchmark-travis-pr: Reorder and break up lines
  TST: benchmark-travis-pr: Move command for running asv into function
  ...