
Big overhaul of ConfigManager #4829

Merged
merged 20 commits into master from nf-cfgsrcquery on Aug 28, 2020

Conversation

@mih (Member) commented Aug 24, 2020:

This includes the changes from #4808 (I will rebase this PR, once it is merged).

Primarily, this PR fixes #4813. Along the way, various RFs have led to optimizations that reduce the number of calls to git-config and, hopefully, to more readable code.

@yarikoptic (Member) commented:

FWIW: a common gotcha from travis (below is just one sample failed test among a good number):

======================================================================
ERROR: datalad.tests.test_tests_utils.test_probe_known_failure
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/datalad/tests/test_tests_utils.py", line 633, in test_probe_known_failure
    with patch_config({'datalad.tests.knownfailures.probe': True}):
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/datalad/tests/utils.py", line 1921, in patch_config
    return patch.dict(dl_cfg._store, vars)
AttributeError: 'ConfigManager' object has no attribute '_store'
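
(The failure is apparently due to the rename of the single internal `_store` dict; with this PR the class keeps per-source stores plus a merged dict, see the commit messages below. A minimal sketch of how the test helper could be adapted, assuming the merged representation lives in a dict attribute -- `_merged_store` is an assumed name, not a confirmed attribute:)

```python
# Hypothetical adaptation of patch_config from datalad/tests/utils.py;
# `_merged_store` is an assumed name for the merged config dict.
from unittest.mock import patch

from datalad import cfg as dl_cfg


def patch_config(vars):
    """Temporarily patch the in-memory merged config with `vars`."""
    return patch.dict(dl_cfg._merged_store, vars)
```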

@yarikoptic (Member) left a comment:

Some comments after initial superficial review

'local',
'{}.call-json'.format(hook_basevar),
None
)
Member:

I do not use procedures much (if at all, although I probably should have), but it feels like such a change should not go in without either or both of

  • some documentation on procedures, stating that they will be ignored in dataset config
  • a warning if any procedure configuration is found in the dataset config

otherwise I would expect procedure-savvy users to spend a good amount of time wondering "why doesn't my procedure run?"

Member:

Also, isn't this measure just "half a solution"? E.g., if malicious procedure code is provided within the dataset (as configured via datalad.locations.dataset-procedures, defaulting to .datalad/procedures), and then triggered by a "local" config?

@mih (Member Author) commented Aug 25, 2020:

FTR: This is not a change in behavior. The only change is how expensively the past (and future) behavior is implemented. And yes, a user can configure a setup that is still insecure. The point of the previous and present safeguards is to not be extremely insecure by default. But again, this PR doesn't change any aspect of this.

@mih (Member Author) commented Aug 25, 2020:

So this seems to be functional in general. In order to get #4794 incorporated as well, something like the following patch should be sufficient (the difference to #4794 is that it avoids the need for manual exception handling):

diff --git a/datalad/config.py b/datalad/config.py
index bf921f16c..92223a584 100644
--- a/datalad/config.py
+++ b/datalad/config.py
@@ -9,6 +9,7 @@
 """
 """
 
+from concurrent.futures import ThreadPoolExecutor
 from fasteners import InterProcessLock
 from functools import lru_cache
 import datalad
@@ -279,9 +280,10 @@ class ConfigManager(object):
                 else run_args
 
         # reload everything that was found todo
-        while to_run:
-            store_id, runargs = to_run.popitem()
-            self._stores[store_id] = self._reload(runargs)
+        with ThreadPoolExecutor(max_workers=len(to_run)) as executor:
+            store_ids = list(to_run.keys())
+            res = executor.map(self._reload, (to_run[k] for k in store_ids))
+            self._stores = dict(zip(store_ids, res))
 
         # always update the merged representation, even if we did not reload
         # anything from a file. ENV or overrides could change independently

However, that just hangs. If I use the same approach as in #4794, adjusted for this PR, I get the Python segfault that I am already familiar with from previous parallelization attempts.

Any insight would be much appreciated.

In any case, I think this PR has utility even without parallelization.
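
(For reference, the executor pattern from the patch behaves fine when fed plain subprocess calls; a self-contained sketch like the one below -- illustrative store ids, not datalad code -- should run without hanging, which would point at the interaction with datalad's runner rather than at ThreadPoolExecutor itself:)

```python
# Standalone sketch of the parallel-reload pattern from the patch above,
# using plain subprocess calls in place of ConfigManager._reload().
import subprocess
from concurrent.futures import ThreadPoolExecutor


def reload_one(args):
    # one git-config invocation per store
    return subprocess.run(
        ['git', 'config', '-z', '-l'] + list(args),
        capture_output=True, check=False).stdout


to_run = {'git': [], 'origin': ['--show-origin']}  # illustrative store ids
with ThreadPoolExecutor(max_workers=len(to_run)) as executor:
    store_ids = list(to_run)
    res = executor.map(reload_one, (to_run[k] for k in store_ids))
    stores = dict(zip(store_ids, res))
print({k: len(v) for k, v in stores.items()})
```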

codecov bot commented Aug 25, 2020:

Codecov Report

Merging #4829 into master will decrease coverage by 0.05%.
The diff coverage is 92.45%.


@@            Coverage Diff             @@
##           master    #4829      +/-   ##
==========================================
- Coverage   89.71%   89.66%   -0.06%     
==========================================
  Files         289      289              
  Lines       40455    40464       +9     
==========================================
- Hits        36295    36282      -13     
- Misses       4160     4182      +22     
Impacted Files Coverage Δ
datalad/cmd.py 91.90% <66.66%> (-0.51%) ⬇️
datalad/config.py 96.59% <93.02%> (-1.22%) ⬇️
datalad/core/local/resulthooks.py 70.49% <100.00%> (ø)
datalad/interface/run_procedure.py 90.90% <100.00%> (+0.11%) ⬆️
datalad/interface/tests/test_ls_webui.py 100.00% <100.00%> (ø)
datalad/interface/utils.py 94.11% <100.00%> (-0.05%) ⬇️
datalad/support/gitrepo.py 90.49% <100.00%> (ø)
datalad/tests/test_config.py 100.00% <100.00%> (ø)
datalad/tests/utils.py 87.15% <100.00%> (ø)
datalad/downloaders/http.py 81.85% <0.00%> (-2.71%) ⬇️
... and 5 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3d4d788...383819e.

@mih (Member Author) commented Aug 26, 2020:

I looked into the stalled crawler test by raising the log level. At level 5 I see the stalling (more info below). At level 1 there is no stalling and the test passes 🙄 This smells like an async IO problem again. Here is why:

with log level 5:

2020-08-26T09:22:13.7729059Z [DEBUG] Async run ['git', '-c', 'annex.alwayscommit=false', 'annex', 'lookupkey', '-c', 'annex.dotfiles=true', '-c', 'annex.retry=3', '--debug', '--', '1-copy.dat'] 
2020-08-26T09:22:13.7730095Z [DEBUG] Launching process ['git', '-c', 'annex.alwayscommit=false', 'annex', 'lookupkey', '-c', 'annex.dotfiles=true', '-c', 'annex.retry=3', '--debug', '--', '1-copy.dat'] 
2020-08-26T09:22:13.7790392Z [DEBUG] Closing stdin of <subprocess.Popen object at 0x7f4c00471410> and waiting process to finish 

this is the last output before the stall. Fractions of a second earlier there was a successful command execution (same command different file):

2020-08-26T09:22:12.6683512Z [DEBUG] Async run ['git', '-c', 'annex.alwayscommit=false', 'annex', 'lookupkey', '-c', 'annex.dotfiles=true', '-c', 'annex.retry=3', '--debug', '--', 'sample.txt'] 
2020-08-26T09:22:12.6684546Z [DEBUG] Launching process ['git', '-c', 'annex.alwayscommit=false', 'annex', 'lookupkey', '-c', 'annex.dotfiles=true', '-c', 'annex.retry=3', '--debug', '--', 'sample.txt'] 
2020-08-26T09:22:12.6731619Z [DEBUG] Process 18683 started 
2020-08-26T09:22:12.6733804Z [DEBUG] Waiting for process 18683 to complete 
2020-08-26T09:22:12.7020802Z [Level 5] Read 219 bytes from 18683[stderr] 
2020-08-26T09:22:12.7069603Z [Level 5] Read 57 bytes from 18683[stderr] 
2020-08-26T09:22:12.7070271Z [Level 5] Read 195 bytes from 18683[stderr] 
2020-08-26T09:22:12.7074682Z [Level 5] Read 243 bytes from 18683[stderr] 
2020-08-26T09:22:12.7261366Z [DEBUG] Process 18683 exited with return code 1 
2020-08-26T09:22:12.7266622Z [Level 8] Finished running ['git', '-c', 'annex.alwayscommit=false', 'annex', 'lookupkey', '-c', 'annex.dotfiles=true', '-c', 'annex.retry=3', '--debug', '--', 'sample.txt'] with status 1 

I assume that raising to log level 1 causes the call frequency to drop, due to much more processing of output being done.

An alternative source of the problem could be (again) a batched process. The final message:

Closing stdin of <subprocess.Popen object at 0x7f4c00471410> and waiting process to finish

is from BatchedCommand.close(). The only code that could hang is this:

            process.stdin.close()
            process.stdout.close()
            process.wait()

otherwise we would see the following debug log (lgr.debug("Process %s has finished", process)).

I do not see how the changes made here or in the previous PR could trigger this scenario (directly).

Looking a bit more, it seems that the immediate history isn't predictive of a stall. Here is the tail end of the log of another test run, where I tried to determine which of the three lines is the stalling one:

2020-08-26T10:47:40.0725618Z [Level 8] Finished running ['git', '-c', 'annex.alwayscommit=false', 'annex', 'init', '-c', 'annex.dotfiles=true', '--debug'] with status 0 
2020-08-26T10:47:40.0732034Z [Level 5] Running: ['git', 'config', '-z', '-l', '--show-origin'] 
2020-08-26T10:47:40.0905849Z [Level 8] Finished running ['git', 'config', '-z', '-l', '--show-origin'] with status 0 
2020-08-26T10:47:40.0925832Z [DEBUG] Closing stdin of <subprocess.Popen object at 0x7fa2c7847d90> and waiting process to finish 
2020-08-26T10:47:40.0926573Z [DEBUG] STDIN closed 
2020-08-26T10:47:40.0927334Z [DEBUG] STDOUT closed 

again the stall is in BatchedCommand.close() and specifically at process.wait(). Which presents an opportunity, because wait() supports a timeout... Will try that.
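
(For illustration, the bounded close that the follow-up commit describes -- three attempts of 3 seconds with a 1 second pause, then abandoning the process -- could look roughly like this sketch; names are illustrative, not the exact datalad implementation:)

```python
# Sketch of a BatchedCommand.close() with a bounded wait; the parameters
# follow the commit message (3 attempts x 3s, with 1s pauses in between).
import logging
import subprocess
import time

lgr = logging.getLogger(__name__)


def close_with_timeout(process, attempts=3, timeout=3.0, pause=1.0):
    process.stdin.close()
    process.stdout.close()
    for trial in range(1, attempts + 1):
        try:
            process.wait(timeout=timeout)
            lgr.debug("Process %s has finished", process)
            return True
        except subprocess.TimeoutExpired as exc:
            if trial < attempts:
                lgr.warning("Caught %s on trial #%d. Sleeping %f and retrying",
                            exc, trial, pause)
                time.sleep(pause)
    lgr.warning("Batched process %s did not finish, "
                "abandoning it without killing it", process)
    return False
```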

@mih mih force-pushed the nf-cfgsrcquery branch 2 times, most recently from 8b55b97 to d21d86c Compare August 26, 2020 11:28
mih added a commit to mih/datalad that referenced this pull request Aug 26, 2020
This can happen (even though in some cases it is unclear why a
(git-annex) process is not exiting). This patch changes the
unconditional wait to three attempts of 3 seconds with a 1 second wait
in between (i.e. 12s). I have no idea whether this is sensible.

Noisy log warnings announce this condition. If the process hasn't exited
at the end of this period, it is kept running, but is abandoned.

This addresses a stall in crawler tests that was observed in dataladgh-4829,
but did not affect test performance, as it happened during the take-down
phase of the test.

This is also related to the general need for a take-down helper dataladgh-4827
@mih (Member Author) commented Aug 26, 2020:

So after my "fix" here is what the logs say (before I turn the log level back to normal):

2020-08-26T11:44:43.1602034Z [DEBUG] Async run ['git', '-c', 'annex.alwayscommit=false', 'annex', 'init', '-c', 'annex.dotfiles=true', '--debug'] 
2020-08-26T11:44:43.1602342Z [DEBUG] Launching process ['git', '-c', 'annex.alwayscommit=false', 'annex', 'init', '-c', 'annex.dotfiles=true', '--debug'] 
2020-08-26T11:44:43.1629761Z [DEBUG] Closing stdin of <subprocess.Popen object at 0x7fe51f32f9d0> and waiting process to finish 
2020-08-26T11:44:46.1652556Z [WARNING] Caught Command '['git', '-c', 'annex.merge-annex-branches=false', 'annex', 'info', '--bytes', '--json', '--json-error-messages', '--batch']' timed out after 3.0 seconds [subprocess.py:_wait:1645] on trial #1. Sleeping 1.000000 and retrying 
2020-08-26T11:44:50.1667437Z [WARNING] Caught Command '['git', '-c', 'annex.merge-annex-branches=false', 'annex', 'info', '--bytes', '--json', '--json-error-messages', '--batch']' timed out after 3.0 seconds [subprocess.py:_wait:1645] on trial #2. Sleeping 1.000000 and retrying 
2020-08-26T11:44:54.1681564Z [WARNING] Batched process <subprocess.Popen object at 0x7fe51f32f9d0> did not finish, abandoning it without killing it 
2020-08-26T11:44:54.1684485Z [DEBUG] Closing stdin of <subprocess.Popen object at 0x7fe51f300450> and waiting process to finish 
2020-08-26T11:44:57.1694455Z [WARNING] Caught Command '['git', '-c', 'annex.merge-annex-branches=false', 'annex', 'info', '--bytes', '--json', '--json-error-messages', '--batch']' timed out after 3.0 seconds [subprocess.py:_wait:1645] on trial #1. Sleeping 1.000000 and retrying 
2020-08-26T11:45:01.1710080Z [WARNING] Caught Command '['git', '-c', 'annex.merge-annex-branches=false', 'annex', 'info', '--bytes', '--json', '--json-error-messages', '--batch']' timed out after 3.0 seconds [subprocess.py:_wait:1645] on trial #2. Sleeping 1.000000 and retrying 
2020-08-26T11:45:05.1723954Z [WARNING] Batched process <subprocess.Popen object at 0x7fe51f300450> did not finish, abandoning it without killing it 
2020-08-26T11:45:05.1733630Z [DEBUG] Process 18882 started 

This time it happened twice, not with lookupkey but with info. I do not think this has anything to do with the scope of this PR.

mih added a commit to mih/datalad that referenced this pull request Aug 26, 2020
@bpoldrack (Member) left a comment:

Just reading the code leaves me with nothing to complain about ;-)

@kyleam (Contributor) left a comment:

Based on a read-through, paying closest attention to "RF: Keep read config separated by source", this looks like a nice improvement to me.

A minor complaint, which I think is fine left unaddressed, is that "RF: Prepare for parallel execution of git-config calls" makes the code less readable to me, and I think it would make sense to hold off on the complexity until the actual functional change introducing parallel calls. (That's in contrast to the earlier commits in this series, which IMO made the code easier to follow.)

The need for "ENH: When closing batched processes avoid waiting forever" makes me uneasy, but I think the approach makes sense (and it's hard to see how the changes in this PR would be directly connected to the hanging crawler test).

@kyleam (Contributor) commented Aug 26, 2020:

> In any case, I think this PR has utility even without parallelization.

In my view it'd be better to not tack on that change here. Ideally that'd come in a dedicated PR with timings so that we have some sense what the added complexity is buying us.
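
(A rough timing harness along these lines -- purely illustrative, not datalad API -- would give a first sense of the serial vs. threaded cost of repeated git-config calls:)

```python
# Micro-benchmark sketch: N serial git-config calls vs. the same N calls
# through a thread pool; run inside a git repository.
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

CMD = ['git', 'config', '-z', '-l', '--show-origin']
N = 20


def run_once(_=None):
    subprocess.run(CMD, capture_output=True, check=True)


t0 = time.perf_counter()
for _ in range(N):
    run_once()
serial = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as ex:
    list(ex.map(run_once, range(N)))
threaded = time.perf_counter() - t0

print(f"serial: {serial:.3f}s  threaded: {threaded:.3f}s")
```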

@mih (Member Author) commented Aug 26, 2020:

After I figured out #4832, I tried the patch from #4829 (comment) again. Unfortunately, it still hangs trying to acquire a thread lock.

I then tried a port of @yarikoptic's original approach, which now also hangs at the same step (instead of segfaulting):

diff --git a/datalad/config.py b/datalad/config.py
index bf921f16c..37ee64e1e 100644
--- a/datalad/config.py
+++ b/datalad/config.py
@@ -279,9 +279,17 @@ class ConfigManager(object):
                 else run_args
 
         # reload everything that was found todo
-        while to_run:
-            store_id, runargs = to_run.popitem()
-            self._stores[store_id] = self._reload(runargs)
+        if to_run:
+            res = {}
+            with ThreadPoolExecutor(max_workers=len(to_run)) as executor:
+                for k, args in to_run.items():
+                    res[k] = executor.submit(self._reload, args)
+                for r in res.values():
+                    exc = r.exception()
+                    if exc:
+                        raise exc
+                for k, r in res.items():
+                    self._stores[k] = r.result()
 
         # always update the merged representation, even if we did not reload
         # anything from a file. ENV or overrides could change independently

@mih (Member Author) commented Aug 26, 2020:

Thanks for the reviews @kyleam @bpoldrack !

@kyleam I think I would prefer to keep the RF that establishes the point where parallel execution could happen -- even though I cannot get it to work yet.

@yarikoptic (Member) commented:

On one hand I am ok with delaying and pretty much ignoring a failure to close a batched process (but then why delay?)...
but on the other -- it seems to be some kind of regression where the changes cause an unrelated (? BatchedAnnex is not for config reading, right?) process to stall. Maybe running 'strace' (at the log level appropriate to cause a stall) and looking at that git-annex process -- what is it waiting for? (some lock? a not-closed file? dunno) -- could shine some light and inform us. Otherwise it is not unlikely to resurface in the future in some other fashion, and we would spend time chasing it down "afresh" before relating it to these changes. I will try now to reproduce as well and strace it... will report on success/fail shortly

@yarikoptic (Member) commented:

Initial observations. TL;DR: two annex info batched processes are out there; disabling coverage avoids the stall (racy racy?)

  • stalls consistently only if I have the full invocation as on travis:

    mkdir -p __testhome__
    cd __testhome__
    python -m nose -s -v --with-cov --cover-package datalad $(echo datalad-crawler | tr '-' '_')

  • if I disable coverage -- no stall
  • if I point to the test file -- no stall
  • if I specify the test via --tests -- no stall
  • running from upstairs (not __testhome__) -- stalls, so unrelated

stracing

$> strace -f -o /tmp/stall.log -ttt python -m nose -s -v --with-cov --cover-package datalad datalad_crawler
datalad_crawler.dbs.tests.test_files.test_AnnexDBs(<class 'datalad_crawler.dbs.files.PhysicalFileStatusesDB'>,) ... ok
datalad_crawler.dbs.tests.test_files.test_AnnexDBs(<class 'datalad_crawler.dbs.files.JsonFileStatusesDB'>,) ... ok
datalad_crawler.nodes.tests.test_annex.test_annex_file('full',) ... 

looking at strace:

$> tail -f stall.log       
317776 1598468686.314884 poll([{fd=16, events=POLLIN}], 1, 500) = 0 (Timeout)
317776 1598468686.815950 poll([{fd=16, events=POLLIN}], 1, 500) = 0 (Timeout)
317776 1598468687.317215 poll([{fd=16, events=POLLIN}], 1, 500) = 0 (Timeout)
317776 1598468687.818528 poll([{fd=16, events=POLLIN}], 1, 500) = 0 (Timeout)

and, more interestingly, ps auxw -H:

yoh       316711  2.7  0.0   7652  3772 pts/12   S+   15:03   0:04         strace -f -o /tmp/stall.log -ttt python -m nose -s -v --with-cov --cover-package datalad datalad_crawler
yoh       316714  1.5  0.2 190616 75748 pts/12   S+   15:03   0:02           python -m nose -s -v --with-cov --cover-package datalad datalad_crawler
yoh       317224  0.0  0.0   8824  3548 pts/12   S+   15:03   0:00             /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git -c annex.merge-annex-branches=false annex info --bytes --json --json-error-messages --batch
yoh       317239  0.0  0.1 1074079100 40644 pts/12 Sl+ 15:03   0:00               /usr/lib/git-annex.linux/exe/git-annex --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git-annex/git-annex info --bytes --json --json-error-messages --batch
yoh       317246  0.0  0.0   8900  4248 pts/12   S+   15:03   0:00                 /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch
yoh       317247  0.0  0.0   8824  3564 pts/12   S+   15:03   0:00                 /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch-check=%(objectname) %(objecttype) %(objectsize)
yoh       317248  0.0  0.0   8896  3956 pts/12   S+   15:03   0:00                 /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch
yoh       317249  0.0  0.0   8824  4080 pts/12   S+   15:03   0:00                 /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch-check=%(objectname) %(objecttype) %(objectsize)
yoh       317689  0.0  0.0   8824  3568 pts/12   S+   15:03   0:00             /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git -c annex.merge-annex-branches=false annex info --bytes --json --json-error-messages --batch
yoh       317704  0.0  0.1 1074079100 39284 pts/12 Sl+ 15:03   0:00               /usr/lib/git-annex.linux/exe/git-annex --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git-annex/git-annex info --bytes --json --json-error-messages --batch
yoh       317711  0.0  0.0   8900  4100 pts/12   S+   15:03   0:00                 /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch
yoh       317712  0.0  0.0   8824  3560 pts/12   S+   15:03   0:00                 /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch-check=%(objectname) %(objecttype) %(objectsize)
yoh       317713  0.0  0.0   8824  3560 pts/12   S+   15:03   0:00                 /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch
yoh       317714  0.0  0.0   8824  3836 pts/12   S+   15:03   0:00                 /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch-check=%(objectname) %(objecttype) %(objectsize)
yoh       317776  0.0  0.1 190708 62564 pts/12   Sl+  15:03   0:00             python -m nose -s -v --with-cov --cover-package datalad datalad_crawler
yoh       317854  0.0  0.0      0     0 pts/12   Z+   15:03   0:00             [git] <defunct>

so -- my guess is that we are waiting on one of those two git annex info processes:

yoh       317224  0.0  0.0   8824  3548 pts/12   S+   15:03   0:00             /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git -c annex.merge-annex-branches=false annex info --bytes --json --json-error-messages --batch
yoh       317689  0.0  0.0   8824  3568 pts/12   S+   15:03   0:00             /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git -c annex.merge-annex-branches=false annex info --bytes --json --json-error-messages --batch

both "working" in now deleted temporary different directories so unrelated runs?

$> ls -l /proc/{317224,317689}/cwd
lrwxrwxrwx 1 yoh yoh 0 Aug 26 15:08 /proc/317224/cwd -> /home/yoh/.tmp/datalad_temp_tree__test_AnnexDBs4h2bub1\ (deleted)
lrwxrwxrwx 1 yoh yoh 0 Aug 26 15:08 /proc/317689/cwd -> /home/yoh/.tmp/datalad_temp_tree__test_AnnexDB8hzkgm8y\ (deleted)

and stracing them doesn't work:

lena:/tmp
$> sudo strace -f -p 317689
[sudo] password for yoh: 
strace: attach: ptrace(PTRACE_SEIZE, 317689): Operation not permitted
now I will try to run

@yarikoptic (Member) commented:

FWIW

  • for me it doesn't stall with strace + log level 5; racy racy
  • it is having those test_files tests first, which create the two batched processes we saw, that matters; python -m nose -s -v --with-cov --cover-package datalad ../datalad_crawler/dbs/tests/test_files.py ../datalad_crawler/nodes/tests/test_annex.py reproduces the stall
  • it is those two annex info batched processes which get "cleaned up" only after those tests complete (so the directory is removed); we then get into test_annex_file, which (via some datalad/config.py(284)reload()) only then causes garbage collection of those batched processes:
traceback when failing to wait already in `test_annex_file`
$> python -m nose -s -v --with-cov --cover-package datalad ../datalad_crawler/dbs/tests/test_files.py ../datalad_crawler/nodes/tests/test_annex.py
datalad_crawler.dbs.tests.test_files.test_AnnexDBs(<class 'datalad_crawler.dbs.files.PhysicalFileStatusesDB'>,) ... ok
datalad_crawler.dbs.tests.test_files.test_AnnexDBs(<class 'datalad_crawler.dbs.files.JsonFileStatusesDB'>,) ... ok
datalad_crawler.nodes.tests.test_annex.test_annex_file('full',) ... > /home/yoh/proj/datalad/datalad-master/datalad/cmd.py(1343)close()
-> lgr.warning(
(Pdb) p self
BatchedAnnex(cmd=<<['git', '-c', ++105 chars++ch']>>, output_proc=<function>, path=<<'/home/yoh/.tm++38 chars++_aa'>>)
(Pdb) p self.path
'/home/yoh/.tmp/datalad_temp_tree__test_AnnexDBf1z92_aa'
(Pdb) bt
  /usr/lib/python3.8/runpy.py(194)_run_module_as_main()
-> return _run_code(code, main_globals, None,
  /usr/lib/python3.8/runpy.py(87)_run_code()
-> exec(code, run_globals)
  /usr/lib/python3/dist-packages/nose/__main__.py(8)<module>()
-> run_exit()
  /usr/lib/python3/dist-packages/nose/core.py(118)__init__()
-> unittest.TestProgram.__init__(
  /usr/lib/python3.8/unittest/main.py(101)__init__()
-> self.runTests()
  /usr/lib/python3/dist-packages/nose/core.py(207)runTests()
-> result = self.testRunner.run(self.test)
  /usr/lib/python3/dist-packages/nose/core.py(62)run()
-> test(result)
  /usr/lib/python3/dist-packages/nose/suite.py(177)__call__()
-> return self.run(*arg, **kw)
  /usr/lib/python3/dist-packages/nose/suite.py(224)run()
-> test(orig)
  /usr/lib/python3/dist-packages/nose/suite.py(177)__call__()
-> return self.run(*arg, **kw)
  /usr/lib/python3/dist-packages/nose/suite.py(224)run()
-> test(orig)
  /usr/lib/python3/dist-packages/nose/suite.py(177)__call__()
-> return self.run(*arg, **kw)
  /usr/lib/python3/dist-packages/nose/suite.py(224)run()
-> test(orig)
  /usr/lib/python3/dist-packages/nose/suite.py(177)__call__()
-> return self.run(*arg, **kw)
  /usr/lib/python3/dist-packages/nose/suite.py(224)run()
-> test(orig)
  /usr/lib/python3/dist-packages/nose/case.py(45)__call__()
-> return self.run(*arg, **kwarg)
  /usr/lib/python3/dist-packages/nose/case.py(133)run()
-> self.runTest(result)
  /usr/lib/python3/dist-packages/nose/case.py(151)runTest()
-> test(result)
  /usr/lib/python3.8/unittest/case.py(736)__call__()
-> return self.run(*args, **kwds)
  /usr/lib/python3.8/unittest/case.py(676)run()
-> self._callTestMethod(testMethod)
  /usr/lib/python3.8/unittest/case.py(633)_callTestMethod()
-> method()
  /usr/lib/python3/dist-packages/nose/case.py(197)runTest()
-> self.test(*self.arg)
  /home/yoh/proj/datalad/datalad-master/datalad/tests/utils.py(554)newfunc()
-> return t(*(arg + (d,)), **kw)
  /home/yoh/proj/datalad/datalad-master/datalad/tests/utils.py(666)newfunc()
-> return tfunc(*(args + (path, url)), **kwargs)
  /home/yoh/proj/datalad/datalad-master/datalad/tests/utils.py(731)newfunc()
-> return t(*(arg + (filename,)), **kw)
  /home/yoh/proj/datalad/datalad-crawler/datalad_crawler/nodes/tests/test_annex.py(160)_test_annex_file()
-> output = list(annex(input_with_stats))
  /home/yoh/proj/datalad/datalad-crawler/datalad_crawler/nodes/annex.py(574)__call__()
-> changed = set().union(*self._get_status(args=[fpath]))
  /home/yoh/proj/datalad/datalad-crawler/datalad_crawler/nodes/annex.py(942)_get_status()
-> if isinstance(self.repo, AnnexRepo) and self.repo.is_direct_mode():
  /home/yoh/proj/datalad/datalad-master/datalad/support/annexrepo.py(1158)is_direct_mode()
-> self._direct_mode = self._is_direct_mode_from_config()
  /home/yoh/proj/datalad/datalad-master/datalad/support/annexrepo.py(1144)_is_direct_mode_from_config()
-> self.config.reload()
  /home/yoh/proj/datalad/datalad-master/datalad/config.py(284)reload()
-> self._stores[store_id] = self._reload(runargs)
  /home/yoh/proj/datalad/datalad-master/datalad/config.py(317)_reload()
-> stdout, stderr = self._run(
  /home/yoh/proj/datalad/datalad-master/datalad/config.py(635)_run()
-> return self._runner.run(self._config_cmd + args, **kwargs)
  /home/yoh/proj/datalad/datalad-master/datalad/cmd.py(1139)run()
-> out, err = super().run(
  /home/yoh/proj/datalad/datalad-master/datalad/cmd.py(905)run()
-> proc = subprocess.Popen(cmd,
  /usr/lib/python3.8/subprocess.py(854)__init__()
-> self._execute_child(args, executable, preexec_fn, close_fds,
  /usr/lib/python3.8/subprocess.py(1650)_execute_child()
-> self._close_pipe_fds(p2cread, p2cwrite,
  /usr/lib/python3.8/subprocess.py(1118)_close_pipe_fds()
-> stack.callback(os.close, errwrite)
  /usr/lib/python3.8/contextlib.py(449)callback()
-> _exit_wrapper = self._create_cb_wrapper(callback, *args, **kwds)
  /home/yoh/proj/datalad/datalad-master/datalad/support/annexrepo.py(382)__del__()
-> self._batched.close()
  /home/yoh/proj/datalad/datalad-master/datalad/support/annexrepo.py(3733)close()
-> p.close()
> /home/yoh/proj/datalad/datalad-master/datalad/cmd.py(1343)close()
-> lgr.warning(

So

  • I wonder if maybe config entities now manage to hold off on the __del__ of those AnnexRepo instances (with their batched annexes)?

A removed path is not the cause for the stall

I thought that we had already seen some prior odd git-annex behaviors whenever a directory is removed from under its feet while it is still running, which is what is happening here and may be what causes annex to not die gracefully (but so far I have failed to find a relevant report/discussion in annex, and didn't try to reproduce with a minimal example)

  • I incorrectly assumed that DATALAD_TESTS_TEMP_KEEP=1 is not in effect with @with_tree since I do not see those log msgs:
no log msgs
$> DATALAD_TESTS_TEMP_KEEP=1 python -m nose -s -v --with-cov --cover-package datalad ../datalad_crawler/dbs/tests/test_files.py ../datalad_crawler/nodes/tests/test_annex.py
[INFO   ] Keeping temp file: /home/yoh/.tmp/datalad_temp_get_most_obscure_supported_namewdl6khdx 
[INFO   ] Keeping temp file: /home/yoh/.tmp/datalad_temp_m5fubgx2 
datalad_crawler.dbs.tests.test_files.test_AnnexDBs(<class 'datalad_crawler.dbs.files.PhysicalFileStatusesDB'>,) ... ok
datalad_crawler.dbs.tests.test_files.test_AnnexDBs(<class 'datalad_crawler.dbs.files.JsonFileStatusesDB'>,) ... ok
datalad_crawler.nodes.tests.test_annex.test_annex_file('full',) ... > /home/yoh/proj/datalad/datalad-master/datalad/cmd.py(1343)close()
-> lgr.warning(
(Pdb) 
...
- but attaching with pdb and seeing those directories still present, while we still fail to 'wait' for annex even with the directories present, suggests that the issue is different!
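
(The reference-holding hypothesis is easy to illustrate in isolation: in CPython, __del__ -- and thus any batched-process cleanup it triggers -- only runs once the last reference disappears, so a lingering reference, e.g. in some config-related cache, pushes cleanup into a later test:)

```python
# Minimal illustration: a held reference delays __del__-based cleanup.
class Repo:
    def __del__(self):
        print("closing batched processes")  # stand-in for _batched.close()


cache = {}
r = Repo()
cache['repo'] = r   # e.g. a reference kept alive somewhere else
del r               # prints nothing: the cache still holds the instance
print("test finished")
cache.clear()       # only now __del__ fires and cleanup happens
```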

@yarikoptic (Member) commented:

damn, giving up for now -- I do not really see an excuse for git-annex to stall. There is an overall change in behavior from before, though -- those batched processes were closed at the end of the _test_AnnexDB tests, and now that is delayed into the runtime of the test_annex_file test. Overall I am ready to blame either python or git-annex (BTW using 8.20200810+git47-g27329f0bb-1~ndall+1 for all of this ATM), but unfortunately I cannot come up with a minimal example to demonstrate it.

@joeyh -- what information could we provide to help troubleshoot a batched annex process not exiting when we close our stdout? stderr goes into a file which our process also "closes", although it remains open by the git-annex process.

Some additional observations:

  • with --debug added to its invocation -- it goes through fine, no stalling
  • if I annex._batched.close() at the end of those _test_AnnexDB tests -- all good. The output in stderr from --debug etc. looks identical (well -- subprocesses exit with 128 due to fatal: Not a valid object name ...; regardless of what it is, it shouldn't stall).

@yarikoptic (Member) commented:

crap -- I thought that maybe in the "non-stuck" cases all those "Exit 127" from git-annex children would be gone... but my modifications somehow (or evening) lead just to the effect of the --debug invocation getting stuck now too... bleh

some details with a diff (even if pdb is commented out -- stalls)
  • when we get "stuck" we do get those stderr about invalid
yoh       461646 11.8  0.2 190572 74436 pts/12   S+   19:18   0:01  |   |   \_ python -m nose -s -v --with-cov --cover-package datalad ../datalad_crawler/dbs/tests/test_files.py ../datalad_crawler/nodes/tests/test_annex.py
yoh       462157  0.0  0.0   8824  3504 pts/12   S+   19:18   0:00  |   |       \_ /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git -c annex.merge-annex-branches=false annex info --bytes --json --json-error-messages --batch
yoh       462172  0.2  0.1 1074079100 40420 pts/12 Sl+ 19:18   0:00  |   |       |   \_ /usr/lib/git-annex.linux/exe/git-annex --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git-annex/git-annex info --bytes --json --json-error-messages --batch
yoh       462179  0.0  0.0   8900  4128 pts/12   S+   19:18   0:00  |   |       |       \_ /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch
yoh       462180  0.0  0.0   8824  3532 pts/12   S+   19:18   0:00  |   |       |       \_ /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch-check=%(objectname) %(objecttype) %(
yoh       462181  0.0  0.0   8896  4152 pts/12   S+   19:18   0:00  |   |       |       \_ /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch
yoh       462182  0.0  0.0   8824  4240 pts/12   S+   19:18   0:00  |   |       |       \_ /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch-check=%(objectname) %(objecttype) %(
yoh       462632  0.0  0.0   8824  3500 pts/12   S+   19:18   0:00  |   |       \_ /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git -c annex.merge-annex-branches=false annex info --bytes --json --json-error-messages --batch
yoh       462647  0.2  0.1 1074079100 39540 pts/12 Sl+ 19:18   0:00  |   |       |   \_ /usr/lib/git-annex.linux/exe/git-annex --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git-annex/git-annex info --bytes --json --json-error-messages --batch
yoh       462654  0.0  0.0   8900  4100 pts/12   S+   19:18   0:00  |   |       |       \_ /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch
yoh       462655  0.0  0.0   8824  3540 pts/12   S+   19:18   0:00  |   |       |       \_ /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch-check=%(objectname) %(objecttype) %(
yoh       462656  0.0  0.0   8824  3532 pts/12   S+   19:18   0:00  |   |       |       \_ /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch
yoh       462657  0.0  0.0   8824  3796 pts/12   S+   19:18   0:00  |   |       |       \_ /usr/lib/git-annex.linux/exe/git --library-path /usr/lib/git-annex.linux//lib/x86_64-linux-gnu: /usr/lib/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch-check=%(objectname) %(objecttype) %(
yoh       462730  0.0  0.1 187828 59840 pts/12   Sl+  19:18   0:00  |   |       \_ python -m nose -s -v --with-cov --cover-package datalad ../datalad_crawler/dbs/tests/test_files.py ../datalad_crawler/nodes/tests/test_annex.py
some straces
$> strace -f -p 462157
strace: Process 462157 attached
wait4(462172, 0x7ffddea7f894, 0, NULL)  = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
wait4(462172, 0x7ffddea7f894, 0, NULL)  = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
wait4(462172, 0x7ffddea7f894, 0, NULL)  = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
wait4(462172, 0x7ffddea7f894, 0, NULL)  = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
wait4(462172, 0x7ffddea7f894, 0, NULL)  = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
wait4(462172, 0x7ffddea7f894, 0, NULL)  = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
wait4(462172, 0x7ffddea7f894, 0, NULL)  = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
wait4(462172, 0x7ffddea7f894, 0, NULL)  = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
wait4(462172, ^Cstrace: Process 462157 detached
 <detached ...>

(dev3) 1 17419 ->130.....................................:Wed 26 Aug 2020 07:20:54 PM EDT:.
(git)lena:~datalad/datalad-master[nf-cfgsrcquery]git
$> strace -f -p 462172
strace: Process 462172 attached with 5 threads
[pid 462177] futex(0x7f10cc000bac, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 462175] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 462173] futex(0x41c61cc, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 462174] futex(0x41c61cc, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 462172] futex(0x555556c209fc, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 462173] <... futex resumed>)       = -1 EAGAIN (Resource temporarily unavailable)
[pid 462174] <... futex resumed>)       = 0
[pid 462173] futex(0x41c6160, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 462174] futex(0x41c6160, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 462173] <... futex resumed>)       = 0
[pid 462173] futex(0x41c6160, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 462174] epoll_wait(4,  <unfinished ...>
[pid 462173] <... futex resumed>)       = 0
[pid 462174] <... epoll_wait resumed>[], 64, 0) = 0
[pid 462173] read(3,  <unfinished ...>
[pid 462174] epoll_wait(4,  <unfinished ...>
[pid 462173] <... read resumed>"\2402\0\0\0\0\0\0", 8) = 8
[pid 462174] <... epoll_wait resumed>[], 64, 0) = 0
[pid 462173] read(3,  <unfinished ...>
[pid 462174] epoll_wait(4,  <unfinished ...>
[pid 462173] <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] write(10, "\377\0\0\0\0\0\0\0", 8 <unfinished ...>
[pid 462175] <... restart_syscall resumed>) = 1
[pid 462173] <... write resumed>)       = 8
[pid 462175] read(10,  <unfinished ...>
[pid 462173] read(3,  <unfinished ...>
[pid 462175] <... read resumed>"\377\0\0\0\0\0\0\0", 8) = 8
[pid 462175] futex(0x7f10cc000bac, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 462177] <... futex resumed>)       = 0
[pid 462177] futex(0x7f10cc000bb0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 462175] futex(0x7f10cc000bb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 462177] <... futex resumed>)       = 0
[pid 462175] <... futex resumed>)       = 1
[pid 462177] futex(0x7f10cc000bb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 462175] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}], 2, -1 <unfinished ...>
[pid 462177] <... futex resumed>)       = 0
[pid 462177] rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
[pid 462177] clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=49224083}) = 0
[pid 462173] <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] read(3,  <unfinished ...>
[pid 462177] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 462177] futex(0x7f10cc000ba8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 462173] <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8
[pid 462173] futex(0x41c61c8, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 462172 detached
strace: Process 462173 detached
 <detached ...>
strace: Process 462174 detached
strace: Process 462175 detached
strace: Process 462177 detached

(dev3) 1 17420 ->130.....................................:Wed 26 Aug 2020 07:21:17 PM EDT:.
(git)lena:~datalad/datalad-master[nf-cfgsrcquery]git
$> ls -l /proc/462172/fd/3
lrwx------ 1 yoh yoh 64 Aug 26 19:21 /proc/462172/fd/3 -> anon_inode:[timerfd]
crap -- with this diff for pdb, in the problematic case it started to stall even with --debug!
$> git diff
diff --git a/datalad/cmd.py b/datalad/cmd.py
index 43e8fede5..77327963e 100644
--- a/datalad/cmd.py
+++ b/datalad/cmd.py
@@ -1195,7 +1195,7 @@ class BatchedCommand(SafeDelCloseMixin):
     def __init__(self, cmd, path=None, output_proc=None):
         if not isinstance(cmd, list):
             cmd = [cmd]
-        self.cmd = cmd
+        self.cmd = cmd  + ['--debug']
         self.path = path
         self.output_proc = output_proc if output_proc else readline_rstripped
         self._process = None
@@ -1317,7 +1317,14 @@ class BatchedCommand(SafeDelCloseMixin):
         ret = None
         if self._stderr_out:
             # close possibly still open fd
-            os.fdopen(self._stderr_out).close()
+            f = os.fdopen(self._stderr_out)
+            with open(self._stderr_out_fname) as f2:
+                stderr = f2.read()
+            lgr.warning(" STDERR: %s", stderr)
+            # our problem!
+            if 'fatal: Not a valid object name d/file2.txt' in stderr:
+                import pdb; pdb.set_trace()
+            f.close()
             lgr.warning(" CLOSED STDERR at %s", self._stderr_out)
             self._stderr_out = None
         if self._process:

mih added a commit to mih/datalad that referenced this pull request Aug 27, 2020
As described here:
datalad#450 (comment)
the manipulation of configuration is a collision point for multiple
processes interacting with a single dataset.

This change introduces an inter-process locking mechanism using the
`fasteners` package. Relevant bits from the docs:

- does not require any cleanup
- lock gets dropped automatically if the process crashes, even if
  `__exit__` is not executed

Locking behavior is logged at level 5 and looks like this:

```
[Level 5] Acquired file lock `b'/tmp/datalad_temp_test_obtaink1uf9sf9/.git/config_dataladlock'` after waiting 0.000s [1 attempts were required]
[Level 5] Unlocked and closed file lock open on `b'/tmp/datalad_temp_test_obtaink1uf9sf9/.git/config_dataladlock'`
```

The `fasteners` package is already a direct dependency. It seems to
provide the standard solution for this problem.

Doing locking only at config-write, instead of more global "dataset is
in use" locks, has the advantage of minimizing bottlenecks with external
parallelization approaches (i.e. via `xargs -P`). However, it is
obviously not sufficient in general.

One could debate whether the lock should also cover read operations, to
avoid missing a partial update done by process A while process B is
reloading a dataset configuration.
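
(A minimal usage sketch of the described locking; the lock-file path mirrors the log excerpt above, while the function itself is illustrative, not the actual ConfigManager code:)

```python
# Sketch: serialize config writes across processes with fasteners.
from pathlib import Path

from fasteners import InterProcessLock


def locked_config_write(git_dir, write_callback):
    lockfile = Path(git_dir) / 'config_dataladlock'
    # released automatically on __exit__, and dropped if the process
    # crashes -- no cleanup needed
    with InterProcessLock(str(lockfile)):
        write_callback()  # e.g. run `git config --add ...`
```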
All path operations now use pathlib. The internal variables of
ConfigManager have been reduced in favor of keeping a repository
instance.

A local git config file location is now updated on `reload()` to
be able to handle moves of the git-dir, e.g. when fixing up
submodules.

A few tests were adjusted to actually run on an existing repository
instead of a directory with just files. These tests specifically
verify behavior in 'dataset' mode, so this seems appropriate.
A dataset procedure might change a dataset's config, and `run_procedure()`
should take responsibility for triggering an unforced reload.
mih and others added 15 commits August 27, 2020 15:08
This change is motivated by dataladgh-4813, where a dedicated ConfigManager
instance needs to be built only to be able to determine if the local
repo config has a particular setting (while ignoring the committed
config). With the changes here it becomes easier to enhance the class
with source-aware query methods.

For the time being, a merged config representation (identical to the
only representation there was before) is kept as a dedicated dict. This
is done in order to avoid the runtime cost of multi-dict lookups.

The internal book-keeping in the class has also been changed to ease
RF and save runtime. Based on that, mtime tests are now run tailored
to each potential git-config call, only triggering it when the relevant
files did change. This is in contrast to the previous behavior that
always triggered two git-config calls whenever any file changed.
Current minimum version is 2.19.1 already
This was done to have a safe way of looking for result hooks. This
change keeps things safe, but avoids loading config from scratch
just for this single query. Instead, it uses the separate internal
representation of local and committed config to perform a constrained
query. This is almost as fast as a normal query in all standard cases
(standard being: very few keys in the committed config).

Fixes dataladgh-4813
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
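
(Conceptually, the constrained query amounts to consulting only the per-source dicts instead of spawning git-config. A sketch under the assumption that `_stores` maps a source name to a plain key/value dict -- both the method name and the store layout are assumptions, not confirmed datalad API:)

```python
# Hypothetical source-constrained lookup on the per-source stores;
# assumes self._stores maps source ids ('git', 'local', 'dataset', ...)
# to plain key -> value dicts (the real layout may differ).
def get_from_source(self, source, key, default=None):
    """Return `key` from a single config source, without a git-config call."""
    return self._stores[source].get(key, default)
```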
Read of a non-existent `--file` is already prevented. git-config will happily
create one on write. It wasn't nice to create .datalad/config just to be able
to read nothing from it anyways.
This follows the spirit of dataladgh-4794 without actually implementing
parallelization, yet.
It should be sufficient to place this in the ConfigManager constructor.
Although it is protected by a conditional, it makes sense to have this
placed in a location that isn't executed as often as `reload()`.

So now it runs once, after the initial config load.
        except subprocess.TimeoutExpired:
            lgr.warning(
                "Batched process %s did not finish, abandoning it without killing it",
                process)
Member:

should we just kill it at this point? I am more afraid of runaway git-annex processes lingering around and consuming an increasing amount of resources in long-running datalad processes

Member Author:

I just had this discussion with @bpoldrack too (who also leans towards killing). I am personally not in favor of such a move. The fact that these processes do not finish is a sign of an error. If we kill them, we create the impression that everything is splendid, but that is not really the case. Moreover, this is happening in BatchedCommand. There are no guarantees that the underlying process is a git-annex instance, that it wasn't supposed to write something to disk and is stalled by IO, or that higher-level code would verify that all is kosher despite the process not having finished yet.

IMHO the only safe way to continue at this layer is to detach and leave these processes alone.

@mih (Member Author) commented Aug 28, 2020:

OK, I think we have reached a meaningful endpoint for this PR. I will merge this now.

@mih mih merged commit be47e8d into datalad:master Aug 28, 2020
@mih mih deleted the nf-cfgsrcquery branch August 28, 2020 08:58
yarikoptic added a commit to yarikoptic/datalad that referenced this pull request Oct 11, 2020
This reverts commit 383819e.

That workaround, introduced in datalad#4829, just hides the problem away, possibly
leading to data loss (e.g. the batched process would not finish its processing by
then and we proceed to e.g. save or alike), and makes it hard to impossible to
spot/debug in real-life cases where it manifests itself.

This is unlikely to fix the issue, but I want to see if it still
reproduces (I failed to reproduce it in the past 2 days).

More references:

- datalad issue datalad#4998
- git-annex report https://git-annex.branchable.com/bugs/some_annex_addurl_--fast_--with-files_--json_--json-error-messages_--batch__do_not_quit/

Conflicts:
	datalad/cmd.py - melded with added later logging of stderr
yarikoptic added a commit to yarikoptic/datalad that referenced this pull request Oct 28, 2020
… of batched processes to complete by default

That workaround, introduced in datalad#4829, just hides the problem away, possibly
leading to data loss (e.g. the batched process would not finish its processing by
then and we proceed to e.g. save or alike), and makes it hard to impossible to
spot/debug in real-life cases where it manifests itself.

This is unlikely to fix the issue, but I want to see if it still
reproduces (I failed to reproduce it in the past 2 days).

More references:

- datalad issue datalad#4998
- git-annex report https://git-annex.branchable.com/bugs/some_annex_addurl_--fast_--with-files_--json_--json-error-messages_--batch__do_not_quit/

I cannot think of any reasonable "timeout" we could specify and be
"correct", since in some cases (very large index) it could take a while for e.g.
`addurl --batch` to complete, and I do not know how long exactly. What we could
do is add some "activity indication", e.g. if it did not complete within X
seconds, we bring up a "progress bar like" information bar which would report the
PID of the batched child process, and maybe its process tree with CPU/IO
information. That would ease troubleshooting and alert the user about still-ongoing
(or stalled?) activity. In interactive mode we could maybe even add some shortcut
(e.g. if Ctrl-C is pressed and thus we get an exception) to jump into pdb...
just ideas

Adding a configuration variable would allow overcoming "stalling" in some rare
use cases, as a temporary measure until we figure out what leads to it and fix
(or work around) it.
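
(A sketch of that configuration-variable escape hatch; the variable name is made up for illustration:)

```python
# Hypothetical config knob for the batched-close timeout; the variable
# name 'datalad.runtime.batched-close-timeout' is invented for this sketch.
from datalad import cfg


def wait_for_batched(process):
    timeout = cfg.obtain('datalad.runtime.batched-close-timeout', default=None)
    # None preserves the old behavior: block until the process exits
    process.wait(timeout=timeout)
```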
Successfully merging this pull request may close these issues:

eval_func() leads to one git-config call per cmd call