Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPT+BF: annotate read_only git/annex invocations, thread lock read-write git/annex and git-config invocations #5070

Merged
merged 10 commits into from
Dec 3, 2020

Conversation

yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Oct 22, 2020

old description which got it going: While looking at py-spy stack from long running status I noticed that majority
of time is spent on git config call, triggered by checking for fake dates
config setting. AFAIK for read-only operations, which cannot produce a commit,
it might benefit to skip that check. And _call_git already has that kwarg,
but it is not proxied by "higher-level" helpers. To avoid breeding workarounds
(like the one RFed within this commit) I decided that we might benefit from
exposing that option in higher-level helpers as well. Then "read-only" invocations could set it to False and we might avoid needless config read in some cases (I have not done any timing on any prototypical use case which might benefit from this, but I would assume that status or diff might).

What is in the PR

@yarikoptic yarikoptic added the performance Improve performance of an existing feature label Oct 22, 2020
@kyleam
Copy link
Contributor

kyleam commented Oct 22, 2020

While looking at py-spy stack from long running status I noticed that majority
of time is spent on git config call, triggered by checking for fake dates
config setting.

Hmm, I'd need to look into this again, but that doesn't sound right (i.e. if that's happening, I think it's a bug). For a repo instance, the value should be cached at _fake_dates_enabled.

AFAIK for read-only operations, which cannot produce a commit,
it might benefit to skip that check. And _call_git already has that kwarg,
but it is not proxied by "higher-level" helpers.

Note that this was explicitly discussed when the original series was proposed: #3791 (comment)

@kyleam
Copy link
Contributor

kyleam commented Oct 22, 2020

For a repo instance, the value should be cached at _fake_dates_enabled.

That appears to be working as expected, at least for a simple case.

test
diff --git a/datalad/support/gitrepo.py b/datalad/support/gitrepo.py
index eab25a48e8..d5af0c3733 100644
--- a/datalad/support/gitrepo.py
+++ b/datalad/support/gitrepo.py
@@ -1603,6 +1603,7 @@ def fake_dates_enabled(self):
         """Is the repository configured to use fake dates?
         """
         if self._fake_dates_enabled is None:
+            print("--- config.getbool()")
             self._fake_dates_enabled = \
                 self.config.getbool('datalad', 'fake-dates', default=False)
         return self._fake_dates_enabled
import tempfile
import datalad.api as dl

ds = dl.Dataset(tempfile.mkdtemp(prefix="dl-")).create()
ds.repo.call_git(["status"])
ds.repo.call_git(["status"])
[INFO   ] Creating a new annex repo at /tmp/dl-jwsbo0v0 
--- config.getbool()
[INFO   ] Scanning for unlocked files (this may take some time) 

@yarikoptic
Copy link
Member Author

Hmm, I'd need to look into this again, but that doesn't sound right (i.e. if that's happening, I think it's a bug). For a repo instance, the value should be cached at _fake_dates_enabled.

yes, it is !! But in my case it is LOTS (thousands) of small subdatasets. So it does cache per dataset, but for a single dataset it doesn't need to do much ;) Actually, I was wrong about the status (or diff?): here is the flamegraph I got: http://www.onerussian.com/tmp/busy-datalad.svg so it comes from _save_add_submodules in call to get_hexsha (read only, no need for fake date handling)

            subm_commit = subm.get_hexsha()
            if not subm_commit:
                yield get_status_dict(
                    action='add_submodule',
                    ds=self,
                    path=path,
                    status='error',
                    message=('cannot add subdataset %s with no commits', subm),
                    logger=lgr)
                continue

and then continues to the next one if not to be committed. No need to consult config anywhere AFAIK, so no need to load all 4k of them if no submodule to be added.

@yarikoptic
Copy link
Member Author

yarikoptic commented Oct 22, 2020

hm, benchmarks reported

      7.75±0.04ms         9.30±1ms    ~1.20  support.path.get_parent_paths.time_allsubmods_toplevel

which is strange... I will restart to see if reproduces

edit, restart reported

       7.92±0.3ms       8.22±0.3ms     1.04  support.path.get_parent_paths.time_allsubmods_toplevel

so still > 1, but neither of positive or negative excursions were found significant / noted... so so far "it doesn't matter" according to the benchmarks. Nevertheless I think it could be of help in some scenarios as described in initial submission.

@yarikoptic
Copy link
Member Author

took a longer py-spy trace: http://www.onerussian.com/tmp/busy-datalad-3.svg (for some reason no colors) -- different code path, just shows that: 1. caching of config worked and since it was first loaded, subsequent fake_dates_enabled does not trigger reload; 2. this PR will not be a "full solution" for that long running step since there is other long running steps (there is no logs at INFO or progressbar to tell them apart ATM).

@codecov
Copy link

codecov bot commented Oct 22, 2020

Codecov Report

Merging #5070 (4004943) into master (7e0b020) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5070      +/-   ##
==========================================
- Coverage   89.47%   89.47%   -0.01%     
==========================================
  Files         300      301       +1     
  Lines       42153    42243      +90     
==========================================
+ Hits        37718    37797      +79     
- Misses       4435     4446      +11     
Impacted Files Coverage Δ
datalad/support/annexrepo.py 88.89% <ø> (ø)
datalad/config.py 96.73% <100.00%> (+0.01%) ⬆️
datalad/support/gitrepo.py 90.47% <100.00%> (+0.03%) ⬆️
datalad/support/tests/test_network.py 98.61% <0.00%> (-1.39%) ⬇️
datalad/distribution/tests/test_create_sibling.py 79.85% <0.00%> (-0.68%) ⬇️
datalad/install.py 14.91% <0.00%> (-0.34%) ⬇️
datalad/tests/utils.py 85.79% <0.00%> (ø)
datalad/interface/tests/test_run_procedure.py 100.00% <0.00%> (ø)
datalad/distribution/tests/test_drop.py 100.00% <0.00%> (ø)
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7e0b020...4004943. Read the comment docs.

@mih
Copy link
Member

mih commented Oct 29, 2020

Can we close this after #5076 was merged?

@yarikoptic
Copy link
Member Author

IMHO this one is largely orthogonal. Even though subdatasets use case was solved at higher level, and I don't have another affected use case, this pr changes make sense to me

@kyleam
Copy link
Contributor

kyleam commented Oct 29, 2020

Hmm, I'm on the fence about whether we should wait until having something concrete before deciding to expose this in each call_git* signature.

If we do go with these changes, I think the commit message really should be updated to point to the previous discussion/decision on this and mention why it should be reversed (to account for the "many repos that wouldn't otherwise trigger a config read" usecase, IIUC).

@yarikoptic
Copy link
Member Author

yeap, adjusted commit message locally (not pushing for now) to

new commit msg
OPT: pass and use check_fake_dates kwargs through call call_git helpers

While looking at py-spy stack from long running status I noticed that majority
of time is spent on git config call, triggered by checking for fake dates
config setting. AFAIK for read-only operations, which cannot produce a commit,
it might benefit to skip that check. And _call_git already has that kwarg,
but it is not proxied by "higher-level" helpers.  To avoid breeding workarounds
(like the one RFed within this commit) I decided that we might benefit from
exposing that option in higher-level helpers as well. Then "read-only" invocations
could set it to False and we might avoid needless  config read in some cases
(I have not done any timing on any prototypical use case which might benefit
from this, but I would assume that status or diff might).

We had an original discussion on either to expose check_fake_dates in higher
level *Repo interfaces before:
https://github.com/datalad/datalad/pull/3791#discussion_r335355781 and decided
to not do it at that point.

As for 'subdatasets' call, performance issue was now addressed with
https://github.com/datalad/datalad/pull/5076 , so this change is not strictly
necessary to optimize "subdatasets", but would still be generally benefitial
for possible other invocations of *Repo methods across many instances without
unnecessarily triggering loading of the config, typically in the case of
read-only git operations (git-annex might need to merge git-annex branch, so
typically annex calls should not disable fake dates).

but I wonder if any such RF to be done, should we may be introduce read_only=False instead?
check_fake_dates is just current single (although nicely specific) operation, but it really boils down to either invoked git command is expected to be just "reading" or might "write" (anything which creates a commit -- commit, merge, etc) and thus then fake dates check should be invoked.

@kyleam
Copy link
Contributor

kyleam commented Oct 29, 2020

I wonder if any such RF to be done, should we may be introduce read_only=False instead? [...]

Ah, good point. I like that much better.

While looking at py-spy stack from long running status I noticed that majority
of time is spent on git config call, triggered by checking for fake dates
config setting. AFAIK for read-only operations, which cannot produce a commit,
it might benefit to skip that check. And _call_git already has that kwarg,
but it is not proxied by "higher-level" helpers.  To avoid breeding workarounds
(like the one RFed within this commit) I decided that we might benefit from
exposing that option in higher-level helpers as well. Then "read-only" invocations
could set it to False and we might avoid needless  config read in some cases
(I have not done any timing on any prototypical use case which might benefit
from this, but I would assume that status or diff might).

We had an original discussion on either to expose check_fake_dates in higher
level *Repo interfaces before:
datalad#3791 (comment) and decided
to not do it at that point.

As for 'subdatasets' call, performance issue was now addressed with
datalad#5076 , so this change is not strictly
necessary to optimize "subdatasets", but would still be generally benefitial
for possible other invocations of *Repo methods across many instances without
unnecessarily triggering loading of the config, typically in the case of
read-only git operations (git-annex might need to merge git-annex branch, so
typically annex calls should not disable fake dates).
Per suggestion and a nod of agreement in datalad#5070
for some reason it was check_fake_dates=False for _git_custom_command which made it
into read_only=True after RF. AFAIK we should not assume that no fake dates
handling is needed by default so flipping this bool.  But it should not really
matter because this function is no longer in use within core and will likely to be
removed soon
…y non read_only op

ATM I just took a simple non-reentrable lock, since used solely aound actual call to
run_gitcommand_on_file_list_chunks . But in principle it could be generalized into an
RLock to be acquired higher in the stack to ensure atomicity of some operations.
I have decided to go with this simplest solution for now
InterProcesLock is of no effect across threads of the same process, so we
need to couple it with thread lock
@yarikoptic
Copy link
Member Author

ok, redone it for read_only=False instead of check_fake_dates=True and then added locking on top for parallel execution in non-read_only mode in GitRepo and ConfigManager . So it somewhat piled up two aspects - if desired can split the 2nd one in a follow up PR to keep our CIs warm.

@yarikoptic yarikoptic changed the title OPT: pass and use check_fake_dates kwargs through call call_git helpers OPT+BF: annotate read_only git/annex invocations, thread lock read-write git/annex and git-config invocations Nov 20, 2020
@yarikoptic
Copy link
Member Author

heh, forgot that @mih just reported that crawler uses those older, now bad helpers,

  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/datalad_crawler/nodes/annex.py", line 909, in _commit
    check_fake_dates=True)
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/datalad/support/gitrepo.py", line 329, in _wrap_normalize_paths
    result = func(self, files_new, *args, **kwargs)
TypeError: _git_custom_command() got an unexpected keyword argument 'check_fake_dates'
-------------------- >> begin captured logging << --------------------
asyncio: DEBUG: Using selector: EpollSelector
asyncio: DEBUG: Using selector: EpollSelector

I guess will need to fix up/release there, but cannot since then it would need first datalad released, unless we do go through a deprecation cycle here with a kludge for check_fake_dates... eh heh... sure thing I could provide signature based kludge in the crawler -- not sure what would be better ;-)

@yarikoptic yarikoptic added the merge-if-ok OP considers this work done, and requests review/merge label Nov 20, 2020
@yarikoptic
Copy link
Member Author

for the crawler -- picking up on @mih's datalad/datalad-crawler#82 but unfortunately it caused infamous stalls of batched processes, which we thought are gone, and for which we have mitigation (abandon) only in master ATM... might need to bring it in also into maint.

@yarikoptic
Copy link
Member Author

fixedup crawler released, restarted workflows.

@yarikoptic yarikoptic added this to the 0.14.0 milestone Nov 20, 2020
datalad/support/gitrepo.py Outdated Show resolved Hide resolved
@kyleam
Copy link
Contributor

kyleam commented Nov 20, 2020

With the lock changes, I've been unable to trigger the config issue from gh-5131. As you say there, I also don't think using thread.Lock will add noticeable overhead. So that part looks good to me.

check_fake_dates : bool, optional
If enabled (default), inspect if dates should be faked for dataset
read_only : bool, optional
If False (default), do not assume that command does not introduce changes and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the double negative hard to parse here. When trying to come up with a rewrite, I stumbled on how to phrase it accurately. I started to think about the larger issue with read_only: some commands in git are semantically read-only but actually modify the repo in ways that can clash with other running git commands. For example, status and diff will refresh and write the index. These writes are optimizations to avoid recomputing the same thing on the next git call.

In the case of status, the global --no-optional-locks option can be used to avoid these. 27344d6a6c (git: add --no-optional-locks option, 2017-09-27) makes it clear that this option doesn't cover (at least yet) all the commands that do such an "opportunistic" refresh.

So, while I think read_only (versus a more specific check_fake_dates) is appealing because it adds a single high-level knob that conceptually covers check_fake_dates (and now locking), my concern is that it's misleading because it doesn't reflect how git behaves underneath.

Anyway, perhaps not a reason to avoid adding read_only, but I think it's worth thinking about and making the docstring clearer. Also, given the support for parallel operations now, our status calls in .dirty should use --no-optional-locks, and we should audit our diff calls to see if we can switch to a lower-level command (diff-files, ...) that doesn't refresh the index. (But neither of those things need to be tied to this PR of course.)

As an unrelated note, this parameter should be documented in the other call_git* commands, perhaps with some fancy formatting like config's _where_reload, though I think nearby comments reminding future updaters to cover the other spots would do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I was not 100% happy with read_only as well, but other ones I thought about are either not generic enough (e.g, no_commit) or vague (e.g. changes_state).
Please suggest changes to the wording - I did think about making it also more specific, like "no commits etc", but just makes it longer for no good value. We could add a disclaimer/note that "Note: even if 'read-only' commands could refresh index etc." may be, but again -- not 100% certain if such level of detail is really needed here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adjusted description a bit, hopefully it is better

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did think about making it also more specific, like "no commits etc", but just makes it longer for no good value.

Or it actually makes it incorrect. The problem isn't that the parameter name is inaccurate; it's that the caller may think some commands are clearly read-only, but in fact they write to the repository. In those cases, it wouldn't be appropriate to skip the locks needed for parallel operations.

datalad/support/gitrepo.py Outdated Show resolved Hide resolved
datalad/support/annexrepo.py Outdated Show resolved Hide resolved
yarikoptic and others added 2 commits November 23, 2020 10:46
Co-authored-by: Kyle Meyer <kyle@kyleam.com>
@yarikoptic
Copy link
Member Author

Thank you @kyleam - applied adjusted wording (i like it), and switch from diff

Copy link
Contributor

@kyleam kyleam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Looks good to me.

I think all of the call_git* docstrings should reference the parameter, perhaps just with call_git holding the actual text and the description for all of the other methods pointing to call_gits description. But I can propose it in a follow-up PR (and either way it's not a big deal).

@yarikoptic
Copy link
Member Author

Restarted failed osx and appveyor - don't think the failures are related to pr but who knows

@yarikoptic
Copy link
Member Author

OSX failures persist so must be related since otherwise I think OSX runs are green. grr... I will restart once more (only OSX) and will try to look into it - this time recording a sample failed test for comparison
======================================================================
FAIL: datalad.support.tests.test_annexrepo.test_AnnexRepo_file_has_content(False,)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/runner/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/Users/runner/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/datalad/tests/utils.py", line 1090, in _wrap_with_testrepos
    t(*(arg + (uri,)), **kw)
  File "/Users/runner/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/datalad/tests/utils.py", line 755, in _wrap_with_tempfile
    return t(*(arg + (filename,)), **kw)
  File "/Users/runner/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/datalad/support/tests/test_annexrepo.py", line 319, in test_AnnexRepo_file_has_content
    [ar.supports_unlocked_pointers])
AssertionError: [False] != [True]
-------------------- >> begin captured logging << --------------------
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
    return t(*(arg + (filename,)), **kw)
  File "/Users/runner/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/datalad/support/tests/test_annexrepo.py", line 354, in test_AnnexRepo_is_under_annex
    [ar.supports_unlocked_pointers])
AssertionError: [False] != [True]
-------------------- >> begin captured logging << --------------------
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector
asyncio: DEBUG: Using selector: KqueueSelector

@yarikoptic
Copy link
Member Author

FWIW, the same failures happen for me on OSX on our test OSX box when I run the same full selection of tests on master, but they do not happen if I run just a single test (e.g. python -m nose -s -v --with-cov --cover-package datalad --cover-xml ../datalad/support/tests/test_annexrepo.py:test_AnnexRepo_file_has_content ).

@yarikoptic
Copy link
Member Author

yarikoptic commented Dec 1, 2020

it is "fun": apparently the important difference in either specifying by path or by module path, i.e. python -m nose -s -v --with-cov --cover-package datalad --cover-xml datalad.support.tests.test_annexrepo:test_AnnexRepo_file_has_content would fail whenever python -m nose -s -v --with-cov --cover-package datalad --cover-xml ../datalad/support/tests/test_annexrepo.py:test_AnnexRepo_file_has_content would be ok (edit: but I failed to reproduce locally)

@yarikoptic
Copy link
Member Author

grrr, f.cking bug ran away somehow.

here is some incomplete (forgot to copy the last line but I 99% certain it was True ;)) copy/paste I managed to catch into my clipboard history
> /Users/yoh/proj/datalad/datalad/venvs/dev3/lib/python3.8/site-packages/datalad/support/tests/test_annexrepo.py(318)test_AnnexRepo_file_has_content()
-> eq_(ar.file_has_content(["test-annex.dat"], batch=batch),
(Pdb) p ar.file_has_content(["test-annex.dat"], batch=batch)
[False]
(Pdb) p ar.file_has_content(["test-annex.dat"], batch=batch)
[False]
(Pdb) p ar.file_has_content(["test-annex.dat"], batch=batch)
[False]
(Pdb) p batch
False
(Pdb) p ar.supports_unlocked_pointers
True
(Pdb) p ar.file_has_content(["test-annex.dat"], batch=batch)
[False]
(Pdb) p ar.file_has_content(["test-annex.dat"], batch=batch)
[False]
(Pdb) p ar
AnnexRepo('/private/var/folders/wr/944pwmkx3yg0r2sf1g3zpkrh0000gn/T/datalad_temp_test_AnnexRepo_file_has_content30u1oh06')
(Pdb) 
[1]+  Stopped                 DATALAD_TESTS_TEMP_KEEP=1 python -m nose -s -v --with-cov --cover-package datalad --pdb --pdb-failures --cover-xml datalad.support.tests.test_annexrepo:test_AnnexRepo_file_has_content
content30u1oh06evel) bash-3.2$ cd /private/var/folders/wr/944pwmkx3yg0r2sf1g3zpkrh0000gn/T/datalad_temp_test_AnnexRepo_file_has_ 
(dev3) (pyout-devel) bash-3.2$ git status
On branch master
Your branch is up to date with 'origin/master'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        typechange: test-annex.dat

(dev3) (pyout-devel) bash-3.2$ fg
DATALAD_TESTS_TEMP_KEEP=1 python -m nose -s -v --with-cov --cover-package datalad --pdb --pdb-failures --cover-xml datalad.support.tests.test_annexrepo:test_AnnexRepo_file_has_content (wd: ~/proj/datalad/datalad/__testhome__)
^[[A
p ar
AnnexRepo('/private/var/folders/wr/944pwmkx3yg0r2sf1g3zpkrh0000gn/T/datalad_temp_test_AnnexRepo_file_has_content30u1oh06')
(Pdb) p ar.file_has_content(["test-annex.dat"], batch=batch)

which suggests that somehow git status made things straight, i.e. so that file_has_content( started to report correct True after I ran git status outside.
My hypothesis is that with this OPT we managed to avoid some git config call which had similar "positive" effect and avoided some kind of a race condition somewhere... but not yet sure where/how

note: appveyor failure is also about file_has_content - just in a different test
======================================================================
FAIL: datalad.support.tests.test_annexrepo.test_AnnexRepo_web_remote
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Miniconda37\lib\site-packages\nose\case.py", line 198, in runTest
    self.test(*self.arg)
  File "C:\Miniconda37\lib\site-packages\datalad\tests\utils.py", line 576, in _wrap_with_tree
    return t(*(arg + (d,)), **kw)
  File "C:\Miniconda37\lib\site-packages\datalad\tests\utils.py", line 688, in _wrap_serve_path_via_http
    return tfunc(*(args + (path, url)), **kwargs)
  File "C:\Miniconda37\lib\site-packages\datalad\tests\utils.py", line 755, in _wrap_with_tempfile
    return t(*(arg + (filename,)), **kw)
  File "C:\Miniconda37\lib\site-packages\datalad\support\tests\test_annexrepo.py", line 384, in test_AnnexRepo_web_remote
    ok_(ar.file_has_content(testfile))
AssertionError: None
-------------------- >> begin captured logging << --------------------
asyncio: DEBUG: Using proactor: IocpProactor
may be it could be the one to track down instead...

@yarikoptic
Copy link
Member Author

yarikoptic commented Dec 2, 2020

on windows it boils down to the switch from diff to diff-files and here is the difference in the logged output

-datalad.cmd: DEBUG: Async run ['git', 'diff', '--name-only', '-z']^M
-datalad.cmd: DEBUG: Launching process ['git', 'diff', '--name-only', '-z']^M
-datalad.cmd: DEBUG: Process 19716 started^M
-datalad.cmd: DEBUG: Waiting for process 19716 to complete^M
-datalad.cmd: Level 5: Read 132 bytes from 19716[stderr]:^M
-datalad.cmd: Level 5: stderr| warning: LF will be replaced by CRLF in 127.0.0.1_about.txt.^M
-| The file will have its original line endings in your working directory^M
-|  ^M
-datalad.cmd: DEBUG: Process 19716 exited with return code 0^M
-datalad.cmd: Level 8: Finished running ['git', 'diff', '--name-only', '-z'] with status 0^M
-datalad.gitrepo: Level 11: stderr| warning: LF will be replaced by CRLF in 127.0.0.1_about.txt.^M
-datalad.gitrepo: Level 11: stderr| The file will have its original line endings in your working directory^M
+datalad.cmd: DEBUG: Async run ['git', 'diff-files', '--name-only', '-z']^M
+datalad.cmd: DEBUG: Launching process ['git', 'diff-files', '--name-only', '-z']^M
+datalad.cmd: DEBUG: Process 10040 started^M
+datalad.cmd: DEBUG: Waiting for process 10040 to complete^M
+datalad.cmd: Level 5: Read 20 bytes from 10040[stdout]:^M
+datalad.cmd: Level 5: stdout| 127.0.0.1_about.txt^@ ^M
+datalad.cmd: DEBUG: Process 10040 exited with return code 0^M
+datalad.cmd: Level 8: Finished running ['git', 'diff-files', '--name-only', '-z'] with status 0^M

and what confuses me is that int he case of diff (the one where test passes) there is no stdout logged at all, just all the warnings in the stderr (none of that in diff-files). any immediate clues @kyleam ?

edit: this is windows specific -- OSX tests were failing even at 4d84e6e so before this switch to diff-files was introduced

@kyleam
Copy link
Contributor

kyleam commented Dec 2, 2020

on windows it boils down to the switch from diff to diff-files and here is the difference in the logged output [...] any immediate clues @kyleam ?

No, but I can trigger it with tools/eval_under_testloopfs, so something with adjusted branches (maybe to do with filters). I'll dig more.

@kyleam
Copy link
Contributor

kyleam commented Dec 2, 2020

Under tools/eval_under_testloopfs, here's the repo from test_AnnexRepo_web_remote immediately before the first file_has_content call:

$ git diff-files -p
$ git diff-files
:100755 100755 6712b5d7d14789b99e55540b0dec07fe2fc0d421 0000000000000000000000000000000000000000 M	127.0.0.1_about.txt
$ git diff
$ git diff-files

So, we're switching to diff-files because it doesn't do a refresh of the index, but I suppose that that comes along that comes along with not triggering the clean filter.

In this case, I'd say it makes sense to switch this spot back "diff", read_only=False.

@yarikoptic
Copy link
Member Author

eh, ok, will do... since anything else (checking if we are in adjusted branch etc) would only take more time, thus ruining any OPT effect. I would expect OSX also to turn green, but I still think it would only be hiding the issue - I think there is more to it there.

As discovered while testing in adjusted branches on windows,
see datalad#5070 (comment)
clean filter needs to be invoked/index rewritten, which makes it a
non-readonly operation.  So this at large it reverts 7b50e35 and removes
read_only=True since diff could modify the index.

This might also chase away some odd behavior we see on OSX with
file_has_content check which uses this _check_files, so may be altogether
there is more to it really
@kyleam
Copy link
Contributor

kyleam commented Dec 2, 2020

since anything else (checking if we are in adjusted branch etc) would only take more time, thus ruining any OPT effect.

I don't think any of this is for this PR but...

Another option would be to switch to ls-files --modified PATH, which seems to correctly run the filter yet doesn't opportunistically refresh the index. I don't think that's a step worth taking though, because in my view file_has_content should just be rewritten to use a combination of diffstatus (which calls ls-files --modified) and get_content_annexinfo. Also, it seems like we should consider whether get_content_annexinfo ought to report that content is present if a file is modified, as it does now; if we decide it shouldn't, then we don't really need file_has_content and it can be reduced to a very thin (perhaps deprecated) wrapper around get_content_annexinfo.

And I think it's worth keeping in mind that avoiding the thread lock by disabling the opportunistic refresh of the index might not be a net gain because downstream the index update would need to be redone. Perhaps tricky, but it'd be good to measure in some way rather than assuming there's actually a benefit.

Unrelated to any of that: the fact that ls-files --modified and diff-files when combined with flags like -p and -s do seem to run the filter makes me think that diff-files not running the filter in some cases should be looked into more and reported upstream.

I would expect OSX also to turn green, but I still think it would only be hiding the issue - I think there is more to it there.

I'm confused by this. Based on you saying "OSX tests were failing even at 4d84e6e so before this switch to diff-files was introduced", I thought those were unrelated.

@yarikoptic
Copy link
Member Author

I would expect OSX also to turn green, but I still think it would only be hiding the issue - I think there is more to it there.

I'm confused by this. Based on you saying "OSX tests were failing even at 4d84e6e so before this switch to diff-files was introduced", I thought those were unrelated.

but they were read_only=True. My hypothesis was that some git config call we do in case of non-read_only does something "magically necessary" to avoid those failures. And BTW indeed those OSX builds turned green! Travis is still cranking, and crippledFS failures were closed eyes upon for awhile so would be too much effort to say if related ATM.

@yarikoptic
Copy link
Member Author

ok, let's proceed!

@yarikoptic yarikoptic merged commit 7da7b46 into datalad:master Dec 3, 2020
@yarikoptic yarikoptic deleted the enh-nofakedatescheck branch December 4, 2020 14:35
kyleam added a commit to kyleam/datalad that referenced this pull request Jan 5, 2021
dataladgh-5070 switched over some call_git* invocations to use
read_only=True.  Do a wider audit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-in-parallel merge-if-ok OP considers this work done, and requests review/merge performance Improve performance of an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

parallel install of subdatasets needs some thread locking while operating on superdataset config etc
3 participants