Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RF+BF+TST: Discontinue "quick" annex (availability) checks #4736

Merged
merged 1 commit into from Jul 23, 2020

Conversation

mih
Copy link
Member

@mih mih commented Jul 16, 2020

These quick alternative checks yield invalid result on crippled FS.
Moreover, they are not used in non-test code.

Consequently, their removal should improve validity of tests without
impacting the performance of user-facing code.

Fixes gh-4714

These quick alternative checks yield invalid result on crippled FS.
Moreover, they are not used in non-test code.

Consequently, their removal should improve validity of tests without
impacting the performance of user-facing code.

Fixes dataladgh-4714
@bpoldrack
Copy link
Member

bpoldrack commented Jul 16, 2020

Just FTR: This not only affects crippled FS. annex can commit the locked/unlocked state of a file for a long time now. Symlink assumptions are long overdue to be removed.

@kyleam
Copy link
Collaborator

kyleam commented Jul 16, 2020

AFAICT the entire discussion around this is confused. allow_quick=True should only be in effect for v5 repos (if not, there's a bug elsewhere). Crippled FS should be upgraded past v5 (if not, there's a bug elsewhere). Plus, this is only relevant for a narrow range of git-annex versions: those above our minimum version (7.20190503) that do not auto-upgrade to a repository version above 5 (below 7.20190912), so any mention of failures should include the git-annex version.

Given the restricted git-annex range above, I have no objections to removing the quick=True handling. It came from a time when v5 was the default. But I can't see how the explanations so far are valid descriptions of the situation.

@kyleam
Copy link
Collaborator

kyleam commented Jul 16, 2020

allow_quick=True should only be in effect for v5 repos (if not, there's a bug elsewhere)

It looks like the elsewhere is the assumption that supports_unlocked_pointers returning False is a valid indication that a repo is v5. It's not because of the KeyError -> False treatment in supports_unlocked_pointers:

try:
return self.config.getint("annex", "version") >= 6
except KeyError:
# If annex.version isn't set (e.g., an uninitialized repo), assume
# that unlocked pointers aren't supported.
return False

Note that the failing test from gh-4714 is for an uninitialized annex repo.

There's very little point in keeping allow_quick=True around (our next git-annex version bump will likely take us over any git-annex version that defaults to v5), so I think we should go ahead with this PR.

kyleam
kyleam approved these changes Jul 16, 2020
@mih
Copy link
Member Author

mih commented Jul 16, 2020

@kyleam Then there are more bugs. I was debugging today and saw it using the quick checks with:

mih@meiner /media/mih/Samsung_T5/tmp/some % git annex version |head -n1
git-annex version: 8.20200330
mih@meiner /media/mih/Samsung_T5/tmp/some % git --version
git version 2.27.0
mih@meiner /media/mih/Samsung_T5/tmp/some % git init
Initialized empty Git repository in /media/mih/Samsung_T5/tmp/some/.git/
mih@meiner /media/mih/Samsung_T5/tmp/some (git)-[master] % git annex init
init  
  Detected a filesystem without fifo support.

  Disabling ssh connection caching.

  Detected a crippled filesystem.
(scanning for unlocked files...)

  Entering an adjusted branch where files are unlocked as this filesystem does not support locked files.

Switched to branch 'adjusted/master(unlocked)'
ok
(recording state in git...)
mih@meiner /media/mih/Samsung_T5/tmp/some (git)-[adjusted/master(unlocked)] % git annex version  | tail -n3
supported repository versions: 8
upgrade supported from repository versions: 0 1 2 3 4 5 6 7
local repository version: 8

@bpoldrack
Copy link
Member

bpoldrack commented Jul 16, 2020

FTR: Even for v5 it wasn't exactly valid, since there was direct mode at the time.

Re auto-upgrade: FWIW we still have the datalad.repo.version config, that would cause annex-init to explicitly called with the version. I think this prevents auto-upgrade and for testing purposes it's a good thing to have in my estimation.

@kyleam
Copy link
Collaborator

kyleam commented Jul 16, 2020

Re auto-upgrade: FWIW we still have the datalad.repo.version config, that would cause annex-init to explicitly called with the version. I think this prevents auto-upgrade and for testing purposes it's a good thing to have in my estimation.

No, it doesn't prevent auto-upgrade.

@kyleam
Copy link
Collaborator

kyleam commented Jul 16, 2020

@mih

@kyleam Then there are more bugs. I was debugging today and saw it using the quick checks with:

My point is that the discussion around this is completely missing that, if the repo support unlocked files (i.e. is v6 or above), then quick should never be used. In the comment following my initial one, I described where the faulty logic is that lets the quick path be taken even though it shouldn't on v6+.

@kyleam
Copy link
Collaborator

kyleam commented Jul 16, 2020

@bpoldrack

FTR: Even for v5 it wasn't exactly valid, since there was direct mode at the time.

I haven't dug back to when we supported direct mode to see what the handling was like back then, but perhaps we didn't account for it. Or perhaps we did and pruned it or otherwise reworked things with the removal of direct mode support. Either way, at the moment, we of course have a code base that doesn't support direct mode but does support a git-annex version that defaults to v5.

@yarikoptic
Copy link
Member

yarikoptic commented Jul 16, 2020

had only a brief look: I expect such a change could have tremendous performance hit on a number scenarios, including run time of the tests.
Anyone timed it?

... Moreover, they are not used in non-test code.

default was to allow quick checks, so any of the uses below could get effected, in particular I expect major impact on auto (either it is popular/used - is a separate question):

$> git grep -A2 'file_has_content' | grep -v test_| grep '\.py[-:]'
datalad/auto.py:        if (under_annex or under_annex is None) and not annex.file_has_content(filepath):
datalad/auto.py-            lgr.info("AutomagicIO: retrieving file content of %s", filepath)
datalad/auto.py-            out = annex.get(filepath)
datalad/interface/add_archive_content.py:                # if annex.file_has_content(target_path):
datalad/interface/add_archive_content.py-                #     # if not --  it was added to git, if in annex, it is present and output is True
datalad/interface/add_archive_content.py-                #     annex.add_url_to_file(target_file, url, options=['--relaxed'], batch=True)
datalad/interface/ls.py:                    if self.repo.file_has_content(str(self._path)) \
datalad/interface/ls.py-                    else 0
datalad/interface/ls.py-            # else ask fs for node size (= ondisk_size)
datalad/metadata/aggregate.py:            if not repo.file_has_content(relpath):
datalad/metadata/aggregate.py-                present = False
datalad/metadata/aggregate.py-                backends.append(repo.get_key_backend(key))
datalad/metadata/metadata.py:        content_info = zip(paths, ds.repo.file_has_content(paths), ds.repo.is_under_annex(paths))
datalad/metadata/metadata.py-        paths = [p for p, c, a in content_info if not a or c]
datalad/metadata/metadata.py-        nocontent = len(fullpathlist) - len(paths)
datalad/plugin/export_archive.py:                has_content = repo.file_has_content(
datalad/plugin/export_archive.py-                    repo_files, allow_quick=True, batch=True)
datalad/plugin/export_archive.py-            else:
... removed its own definition
datalad/tests/utils.py:    has_content = ar.file_has_content(files)
datalad/tests/utils.py-    if isinstance(has_content, bool):
datalad/tests/utils.py-        ok_(has_content)
datalad/tests/utils.py:def ok_file_has_content(path, content, strip=False, re_=False,
datalad/tests/utils.py-                        decompress=False, **kwargs):
datalad/tests/utils.py-    """Verify that file exists and has expected content"""

I wonder if it shouldn't be just fixed (if not a symlink -- do "proper" check, if a symlink to .git/annex/objects/ -- leave at current "quick" logic) ?

@kyleam
Copy link
Collaborator

kyleam commented Jul 16, 2020

had only a brief look: I expect such a change could have tremendous performance hit on a number scenarios, including run time of the tests.
Anyone timed it?

While timing would be fine to see, any performance benefits would be restricted to git annex versions below 7.20190912. Relevant for the neurodebian and the conda versions we test with at the moment, but we can't stay stuck behind that forever, so it's just delaying the inevitable.

I wonder if it shouldn't be just fixed

An alternative fix to the failure would be something like this (would need doc/comment adjustment):

diff --git a/datalad/support/annexrepo.py b/datalad/support/annexrepo.py
index c740500b9..31a292697 100644
--- a/datalad/support/annexrepo.py
+++ b/datalad/support/annexrepo.py
@@ -1187,7 +1187,7 @@ def supports_unlocked_pointers(self):
         except KeyError:
             # If annex.version isn't set (e.g., an uninitialized repo), assume
             # that unlocked pointers aren't supported.
-            return False
+            return None
 
     # TODO: RM DIRECT  might be gone but pieces might be useful for establishing
     #       migration to v6+ mode and testing. For now is made protected to
@@ -1742,7 +1742,7 @@ def _check_files(self, fn, quick_fn, files, allow_quick, batch):
         # `is_under_annex`. `fn` is the annex command used to do the check, and
         # `quick_fn` is the non-annex variant.
         pointers = self.supports_unlocked_pointers
-        if pointers or batch or not allow_quick:
+        if pointers is not False or batch or not allow_quick:
             # We're only concerned about modified files in V6+ mode. In V5
             # `find` returns an empty string for unlocked files.
             modified = [

@yarikoptic
Copy link
Member

yarikoptic commented Jul 16, 2020

... any performance benefits would be restricted to git annex versions below 7.20190912.

sorry, could you elaborate on this?

@kyleam
Copy link
Collaborator

kyleam commented Jul 16, 2020

... any performance benefits would be restricted to git annex versions below 7.20190912.

sorry, could you elaborate on this?

Starting with that git-annex version, v5 repos are auto-upgraded to v7, in which case supports_unlocked_pointers is True (except when inside uninitialized annex repos, which is the logic bug responsible for the linked test_AnnexRepo_on_uninited_annex failure). If supports_unlocked_pointers is True, the non-quick code path is taken regardless of whether allow_quick is True.

@yarikoptic
Copy link
Member

yarikoptic commented Jul 16, 2020

THANK YOU!
supports_unlocked_pointers was introduced in 209897c of #2975.. I do not see any discussion of pointers unconditionally (supporting pointers does not make all symlinks go away) "masking away" allow_quick in the commit message or PR. So may be that PR already has contributed to some slow downs when using those recent git-annex versions, and may be we could regain some speed up if allow_quick was reintroduced (as patch suggested above + with necessary fixes detected on crippled FSs) even in cases of annex with support for pointers, since even in v7 we still have symlinks in most typical IMHO use cases.

@kyleam
Copy link
Collaborator

kyleam commented Jul 16, 2020

supports_unlocked_pointers was introduced in 209897c of #2975.. I do not see any discussion of pointers unconditionally (supporting pointers does not make all symlinks go away) "masking away" allow_quick in the commit message or PR.

The commit you point to didn't change the behavior. It is just introducing the property. The behavior changed in 5dc2eb4 (BF: annexrepo.file_has_content: Fix check for unlocked files in V6, 2018-09-19). It does discuss sending everything through the slower check:

Note: If we're in V6, we send everything through the slower annex
check, but we could still check locked files with the quick check,
which would probably improve performance.  Punt on this for now
though.

That aligns with what you propose next...

So may be that PR already has contributed to some slow downs when using those recent git-annex versions, and may be we could regain some speed up if allow_quick was reintroduced (as patch suggested above + with necessary fixes detected on crippled FSs) even in cases of annex with support for pointers, since even in v7 we still have symlinks in most typical IMHO use cases.

... though if we want to go with the patch above, I don't see any reason to couple it to yet-to-materialize performance enhancements.

@yarikoptic
Copy link
Member

yarikoptic commented Jul 16, 2020

The commit you point to didn't change the behavior. It is just introducing the property.

RRIGHT! I have missed that it just replaces already present is_v6. thanks!

@bpoldrack
Copy link
Member

bpoldrack commented Jul 17, 2020

No, it doesn't prevent auto-upgrade.

Well, that seems to not be true for v5. We explicitly test it: https://github.com/datalad/datalad/blob/master/datalad/support/tests/test_annexrepo.py#L1274

@kyleam
Copy link
Collaborator

kyleam commented Jul 17, 2020

@bpoldrack

No, it doesn't prevent auto-upgrade.

Well, that seems to not be true for v5. We explicitly test it: https://github.com/datalad/datalad/blob/master/datalad/support/tests/test_annexrepo.py#L1274

Auto-upgrade is about moving from an unsupported version to a supported one. Note the condition on the test you point to:

# Assuming specified version is a supported version...
if 5 in supported_versions:
# ...parameter `version` still has priority over default config:
annex = AnnexRepo(path3, create=True, version=5)
version = int(annex.config.get('annex.version'))
eq_(version, 5)

And, as I mentioned above, v5 is no longer the default as of 7.20190912. In that version, 5 is no longer in the supported version list:

git-annex version: 7.20190912-gab739242a
build flags: Assistant Webapp Pairing S3 WebDAV Inotify TorrentParser Feeds Testsuite
dependency versions: aws-0.21.1 bloomfilter-2.0.1.0 cryptonite-0.25 DAV-1.3.3 feed-1.0.1.0 ghc-8.6.5 http-client-0.5.14 persistent-sqlite-2.9.3 torrent-10000.1.1 uuid-1.3.13 yesod-1.6.0
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs hook external
operating system: linux x86_64
supported repository versions: 7
upgrade supported from repository versions: 0 1 2 3 4 5 6

And here is what passing --version=5 to git-annex-init does:

$ git init
Initialized empty Git repository in /tmp/gx-ALhWOUr/.git/
$ git annex init --version=5
init  (scanning for unlocked files...)
ok
(recording state in git...)
$ git config annex.version
6

Yes, it's odd/wrong that it landed on v6 when v7 is the only supported on. That's due to an annex bug that was fixed a while ago.

@bpoldrack
Copy link
Member

bpoldrack commented Jul 17, 2020

In that version, 5 is no longer in the supported version list:

Uh. I missed that. Thx!

@mih
Copy link
Member Author

mih commented Jul 23, 2020

If I am picking up the vibe correctly, this can be merged.

@mih mih merged commit 2633d5e into datalad:maint Jul 23, 2020
1 of 2 checks passed
@mih mih deleted the bf-annexavail branch Jul 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants