Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BF: check special remotes of type git by uuid to (not) inform if they are not enabled #3547

Merged
merged 2 commits into from Jul 26, 2019

Conversation

yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Jul 19, 2019

Matching by the name is not sufficient, since when we just clone such
repository it would get an arbitrary (by default "origin") name which
would not match the one git-annex calls it by. Checking by annex UUID
seems to be the best way to proceed

FOI: test takes 1.7sec on my laptop (on network close to the datasets.datalad.org)

… are not enabled

Matching by the name is not sufficient, since when we just clone such
repository it would get an arbitrary (by default "origin") name which
would not match the one git-annex calls it by.  Checking by annex UUID
seems to be the best way to proceed
@codecov
Copy link

@codecov codecov bot commented Jul 19, 2019

Codecov Report

Merging #3547 into 0.11.x will increase coverage by 0.01%.
The diff coverage is 79.31%.

Impacted file tree graph

@@            Coverage Diff             @@
##           0.11.x    #3547      +/-   ##
==========================================
+ Coverage   81.29%   81.31%   +0.01%     
==========================================
  Files         255      255              
  Lines       33588    33639      +51     
==========================================
+ Hits        27307    27355      +48     
- Misses       6281     6284       +3
Impacted Files Coverage Δ
datalad/distribution/tests/test_clone.py 99.54% <100%> (+0.01%) ⬆️
datalad/distribution/utils.py 85.56% <71.42%> (-1.02%) ⬇️
datalad/downloaders/tests/test_http.py 88.86% <0%> (-2.23%) ⬇️
datalad/plugin/tests/test_addurls.py 100% <0%> (ø) ⬆️
datalad/plugin/addurls.py 18.34% <0%> (+0.02%) ⬆️
datalad/support/tests/test_annexrepo.py 96.49% <0%> (+0.03%) ⬆️
datalad/utils.py 64.51% <0%> (+0.53%) ⬆️
datalad/support/annexrepo.py 58.59% <0%> (+0.59%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d633340...e6c3d9e. Read the comment docs.

Copy link
Contributor

@kyleam kyleam left a comment

Subject: [PATCH] BF: check special remotes of type git by uuid to (not) inform
if they are not enabled

Matching by the name is not sufficient, since when we just clone such
repository it would get an arbitrary (by default "origin") name which
would not match the one git-annex calls it by.

Makes sense. So when we are cloning from a type=git special remote, we see an unwarranted message:

$ datalad clone ///repronim/containers
[INFO   ] Cloning http://datasets.datalad.org/repronim/containers [1 other candidates] into '/tmp/dl-Kv9PJVz/containers'
[INFO   ] warning: Not setting branch master as its own upstream.
[INFO   ] access to dataset sibling "datasets.datalad.org" not auto-enabled, enable with:
| 		datalad siblings -d "/tmp/dl-Kv9PJVz/containers" enable -s datasets.datalad.org

Checking by annex UUID seems to be the best way to proceed

Doing so introduces an unwarranted message when cloning from other sources:

$ datalad clone http://github.com/ReproNim/containers
[INFO   ] Cloning http://github.com/ReproNim/containers [1 other candidates] into '/tmp/dl-96hiJI2/containers'
[INFO   ] warning: Not setting branch master as its own upstream.
[INFO   ]   Remote origin not usable by git-annex; setting annex-ignore
[INFO   ] access to dataset sibling "datasets.datalad.org" not auto-enabled, enable with:
| 		datalad siblings -d "/tmp/dl-96hiJI2/containers" enable -s datasets.datalad.org
install(ok): /tmp/dl-96hiJI2/containers (dataset)

The problem comes from assuming that the git annex init call (triggered by _handle_possible_annex_dataset) configures the UUID when enabling the remote. That isn't (at least always) the case:

# in the last cloned repo from above
$ cat .git/config | tail -3
[remote "datasets.datalad.org"]
        url = http://datasets.datalad.org/repronim/containers/.git
        fetch = +refs/heads/*:refs/remotes/datasets.datalad.org/*

It isn't until later, say getting a file, that the UUID shows up:

$ datalad get scripts/tests/arg-test.simg
get(ok): /tmp/dl-96hiJI2/containers/scripts/tests/arg-test.simg (file) [from datasets.datalad.org...]
$ cat .git/config | tail -5
[remote "datasets.datalad.org"]
        url = http://datasets.datalad.org/repronim/containers/.git
        fetch = +refs/heads/*:refs/remotes/datasets.datalad.org/*
        annex-bare = false
        annex-uuid = 71c620b5-997f-4849-bb30-c42dbb48a51

Why don't we just check the value of the "autoenable" key? That might miss cases where git annex init failed to enable the remote, but presumably the datalad sibling snippet we give for enabling will fail for the same reason, so dropping the instructions isn't really hurting the situation.

@yarikoptic
Copy link
Member Author

@yarikoptic yarikoptic commented Jul 22, 2019

THANK YOU @kyleam for the review and catching the regression. I should have thought about it ;)

Why don't we just check the value of the "autoenable" key?

But to check it for what remote exactly, i.e. how (if not using uuid) we can associate any special remote with a known to git remote?

may be upon clone and seeing an annex we should run annex info which would contact those remotes and assign uuid:

/tmp > datalad clone http://github.com/ReproNim/containers && cd containers
[INFO   ] Cloning http://github.com/ReproNim/containers [1 other candidates] into '/tmp/containers' 
[INFO   ]   Remote origin not usable by git-annex; setting annex-ignore                                         
[INFO   ] access to dataset sibling "datasets.datalad.org" not auto-enabled, enable with:
| 		datalad siblings -d "/tmp/containers" enable -s datasets.datalad.org 
install(ok): /tmp/containers (dataset)

/tmp/containers > git config remote.datasets.datalad.org.annex-uuid                   

/tmp/containers > git annex info | grep datasets.datalad.org
 	71c620b5-997f-4849-bb30-c42dbb48a51e -- [datasets.datalad.org]

/tmp/containers > git config remote.datasets.datalad.org.annex-uuid
71c620b5-997f-4849-bb30-c42dbb48a51e

?

@kyleam
Copy link
Contributor

@kyleam kyleam commented Jul 22, 2019

@yarikoptic
Copy link
Member Author

@yarikoptic yarikoptic commented Jul 22, 2019

I'm asking why not get rid of the .git/config inspection altogether, warning based on the autoenable key for each item in get_special_remotes().

ah, got it -- but would not that assume (without checking) that the remote was enabled?

@kyleam
Copy link
Contributor

@kyleam kyleam commented Jul 22, 2019

ah, got it -- but would not that assume (without checking) that the remote was enabled?

Yes, but as I said above, I don't see how the check really helps:

Why don't we just check the value of the "autoenable" key? That might miss cases where git annex init failed to enable the remote, but presumably the datalad sibling snippet we give for enabling will fail for the same reason, so dropping the instructions isn't really hurting the situation.

@yarikoptic
Copy link
Member Author

@yarikoptic yarikoptic commented Jul 26, 2019

observation: our "install" is not really causing git-annex to autoenable any of those special remotes. They do get added to git configuration as remotes, but annex doesn't even try to reach them to establish their annex UUID.

annex UUID fetching and "autoenable" happens later (or with the initial version of the this PR here due to `annex info`) by e.g. invoking either `git annex info` or `git annex get` (we had some discussion on this topic recently).
/tmp > datalad install https://github.com/ReproNim/containers
[INFO   ] Cloning https://github.com/ReproNim/containers [1 other candidates] into '/tmp/containers' 
[INFO   ]   Remote origin not usable by git-annex; setting annex-ignore                                                        
install(ok): /tmp/containers (dataset)

/tmp > git -C containers config remote.datasets.datalad.org.annex-uuid  

/tmp > git -C containers annex get images/bids/bids-validator--1.2.3.sing
get images/bids/bids-validator--1.2.3.sing (from datasets.datalad.org...) 
4%    1.3 MiB        1018 KiB/s 35s^C

/tmp > git -C containers config remote.datasets.datalad.org.annex-uuid   
71c620b5-997f-4849-bb30-c42dbb48a51e

It somewhat also relates to our discussion on "either should we enable special remotes on a remote while publishing?".

I think to stay more consistent with git-annex and also avoid spitting out some warnings and failures whenever it is not yet relevant (may be I don't need to get any files in the installed dataset?), I will indeed RF this PR to avoid info call, and just provide analysis of special remotes based on autoenable.

But we might need to improve reporting of failures of autoneabling remotes from git-annex (see #3564) later in the game (info or get or ... whatever triggers it calls)

@yarikoptic
Copy link
Member Author

@yarikoptic yarikoptic commented Jul 26, 2019

hm, while coding for it realized that the situation is more fun. git special remote which has no autoenable might also be the one we just cloned! ;) so I will need to do some check by UUID (without calling info - just from config) to prune those out

+ provide debug level msg about autoenable"ed remotes
+ collide messages on how to enable special remotes into a single message
  (no need to flood the screen)
+ do not report anything about remotes UUIDs for which already known
@kyleam
Copy link
Contributor

@kyleam kyleam commented Jul 26, 2019

observation: our "install" is not really causing git-annex to autoenable any of those special remotes. They do get added to git configuration as remotes, but annex doesn't even try to reach them to establish their annex UUID.
annex UUID fetching and "autoenable" happens later (or with the initial version of the this PR here due to annex info) by e.g. invoking either git annex info or git annex get (we had some discussion on this topic recently).

It's not accurate to say install isn't causing git-annex to autoenable the special remotes. It is autoenabling them in the same way as git annex init does, which is why get can find the files on the special remote.

AFAICT all the remaining information you provide about a UUID not showing up in the config is restating what I said in my initial comment:

The problem comes from assuming that the git annex init call (triggered by _handle_possible_annex_dataset) configures the UUID when enabling the remote. That isn't (at least always) the case:

# in the last cloned repo from above
$ cat .git/config | tail -3
[remote "datasets.datalad.org"]
        url = http://datasets.datalad.org/repronim/containers/.git
        fetch = +refs/heads/*:refs/remotes/datasets.datalad.org/*

It isn't until later, say getting a file, that the UUID shows up:

kyleam
kyleam approved these changes Jul 26, 2019
if srs[True]:
lgr.debug(
"configuration for %s %s added because of autoenable,"
" but no UUIDs for them yet known for dataset %s",
Copy link
Contributor

@kyleam kyleam Jul 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see much value in this because it seems to be the norm, but it's of course not a big deal given that it's logged at the debug level.

Copy link
Member Author

@yarikoptic yarikoptic Jul 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All Debug messages have only hypothetical value of helping to troubleshoot when something goes wrong. Given the tricky nature of all this, might come handy IMHO

@kyleam kyleam merged commit e6c3d9e into datalad:0.11.x Jul 26, 2019
4 of 5 checks passed
kyleam added a commit that referenced this issue Jul 26, 2019
yarikoptic added a commit that referenced this issue Jul 31, 2019
0.11.6 (Jul 30, 2019) -- am I the last of 0.11.x?

Primarily bug fixes to achieve more robust performance

Fixes

- Our tests needed various adjustments to keep up with upstream
  changes in Travis and Git. ([#3479][]) ([#3492][]) ([#3493][])

- `AnnexRepo.is_special_annex_remote` was too selective in what it
  considered to be a special remote.  ([#3499][])

- We now provide information about unexpected output when git-annex is
  called with `--json`.  ([#3516][])

- Exception logging in the `__del__` method of `GitRepo` and
  `AnnexRepo` no longer fails if the names it needs are no longer
  bound.  ([#3527][])

- [addurls][] botched the construction of subdataset paths that were
  more than two levels deep and failed to create datasets in a
  reliable, breadth-first order.  ([#3561][])

- Cloning a `type=git` special remote showed a spurious warning about
  the remote not being enabled.  ([#3547][])

Enhancements and new features

- For calls to git and git-annex, we disable automatic garbage
  collection due to past issues with GitPython's state becoming stale,
  but doing so results in a larger .git/objects/ directory that isn't
  cleaned up until garbage collection is triggered outside of DataLad.
  Tests with the latest GitPython didn't reveal any state issues, so
  we've re-enabled automatic garbage collection.  ([#3458][])

- [rerun][] learned an `--explicit` flag, which it relays to its calls
  to [run][[]].  This makes it possible to call `rerun` in a dirty
  working tree ([#3498][]).

- The [metadata][] command aborts earlier if a metadata extractor is
  unavailable.  ([#3525][])

* tag '0.11.6': (56 commits)
  [DATALAD RUNCMD] make update-changelog
  finalize CHANGELOG.md entry and boost version
  BF(DOC): close [create] with [] to not cause WARNING by md-strict pandoc
  CHANGELOG.md: Link entry from b3e8adb
  CHANGELOG.md: Add entry for gh-3547
  CHANGELOG.md: Add entry for gh-3561
  CHANGELOG.md: Add link for addurls
  RF: inform about special remotes based on autoenable config
  CHANGELOG.md: Second batch for 0.11.6
  BF: addurls: Process datasets in a stable, breadth-first order
  BF: addurls: Fix construction of nested subpaths
  TST: addurls: Don't hard-code path separator
  BF(TST): skip test_v7_detached_get in direct mode - fails to annex upgrade
  TST: benchmark-travis-pr: Swap 'pip install' and 'git show'
  TST: benchmark-travis-pr: Move repeated logic to run_asv()
  TST: benchmark-travis-pr: Support other bases
  TST: benchmark-travis-pr: Tweak message about current HEAD
  TST: benchmark-travis-pr: Simplify two git commands into one
  TST: benchmark-travis-pr: Reorder and break up lines
  TST: benchmark-travis-pr: Move command for running asv into function
  ...
@yarikoptic yarikoptic added this to the Release 0.11.7 milestone Aug 1, 2019
@yarikoptic yarikoptic removed this from the Release 0.11.7 milestone Aug 25, 2019
@yarikoptic yarikoptic added this to the Release 0.11.6 milestone Aug 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants