test_creatsubdatasets: RuntimeError: ThreadedRunner.run() was re-entered #7138

yarikoptic · 2022-11-02T19:52:10Z

spotted in the recent build on OSX using snapshot version of git-annex build:

https://github.com/datalad/datalad/actions/runs/3379647930/jobs/5611473605

>       assert_raises(IncompleteResultsError, list, ProducerConsumer(paths[::-1], create_, jobs=5))

../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/tests/test_parallel.py:129: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/parallel.py:265: in __iter__
    yield from self._iter_threads(self._jobs)
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/parallel.py:417: in _iter_threads
    self.shutdown(force=True, exception=self._producer_exception or interrupted_by_exception)
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/parallel.py:233: in shutdown
    raise exception
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/parallel.py:401: in _iter_threads
    done_useful |= self._pop_done_futures(lgr)
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/parallel.py:463: in _pop_done_futures
    raise exception
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/concurrent/futures/thread.py:57: in run
    result = self.fn(*self.args, **self.kwargs)
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/parallel.py:329: in consumer_worker
    for r in res:
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/interface/utils.py:452: in _execute_command_
    allkwargs):
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/interface/utils.py:619: in _process_results
    for res in results:
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/core/local/create.py:291: in __call__
    paths=[check_path.relative_to(parentds_path)])
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/gitrepo.py:2874: in status
    eval_submodule_state=eval_submodule_state)
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/gitrepo.py:2997: in diffstatus
    from_state = self.get_content_info(paths=paths, ref=fr)
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/gitrepo.py:2767: in get_content_info
    read_only=True)
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/dataset/gitrepo.py:461: in call_git
    keep_ends=True))
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/dataset/gitrepo.py:522: in call_git_items_
    sep=sep):
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/dataset/gitrepo.py:355: in _generator_call_git
    for file_no, content in generator:
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/runner/gitrunner.py:305: in run_on_filelist_chunks_items_
    **kwargs):
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/runner/gitrunner.py:189: in _get_chunked_results
    **kwargs)
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/runner/runner.py:194: in run
    results_or_iterator = self.threaded_runner.run()
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/runner/nonasyncrunner.py:320: in run
    return self._locked_run()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def _locked_run(self) -> dict | _ResultGenerator:
        if self.generator is not None:
>           raise RuntimeError("ThreadedRunner.run() was re-entered")
E           RuntimeError: ThreadedRunner.run() was re-entered

../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/runner/nonasyncrunner.py:324: RuntimeError

---------- coverage: platform darwin, python 3.7.15-final-0 ----------
Coverage XML written to file coverage.xml

=========================== short test summary info ============================
FAILED ../tests/test_parallel.py::test_creatsubdatasets - RuntimeError: ThreadedRunner.run() was re-entered
============ 1 failed, 392 passed, 35 skipped in 3114.32s (0:51:54) ============

and that exception was seen raised (test seems to fail or error also for other reasons in the past, so grepping for the exception)

(git)smaug:/mnt/datasets/datalad/ci/logs/2022[master]git
$> datalad foreach-dataset --o-s relpath git grep 'ThreadedRunner.run() was re-entered'                                      
06/02/pr/6737/60b6f94/appveyor-8048-failed/o7ip9u3udu7odsg6.txt:[00:15:51]     raise RuntimeError("ThreadedRunner.run() was re-entered")                                                                                                                           
06/02/pr/6737/60b6f94/appveyor-8048-failed/o7ip9u3udu7odsg6.txt:[00:15:51] RuntimeError: ThreadedRunner.run() was re-entered
06/02/pr/6737/abbf61b/appveyor-8049-failed/fla2okr02r6snseh.txt:[00:17:18]     raise RuntimeError("ThreadedRunner.run() was re-entered")
06/02/pr/6737/abbf61b/appveyor-8049-failed/fla2okr02r6snseh.txt:[00:17:18] RuntimeError: ThreadedRunner.run() was re-entered
06/03/pr/6737/81f8b2d/appveyor-8059-failed/61o073tsjurpnalc.txt:[00:14:43]     raise RuntimeError("ThreadedRunner.run() was re-entered")
06/03/pr/6737/81f8b2d/appveyor-8059-failed/61o073tsjurpnalc.txt:[00:14:43] RuntimeError: ThreadedRunner.run() was re-entered
06/03/pr/6737/a4519d8/github-Test on macOS-3326-failed/2_test (snapshot).txt:2022-06-03T09:42:09.4228590Z     raise RuntimeError("ThreadedRunner.run() was re-entered")
06/03/pr/6737/a4519d8/github-Test on macOS-3326-failed/2_test (snapshot).txt:2022-06-03T09:42:09.4228990Z RuntimeError: ThreadedRunner.run() was re-entered
06/03/pr/6737/a4519d8/github-Test on macOS-3326-failed/test (snapshot)/8_Run tests.txt:2022-06-03T09:42:09.4228590Z     raise RuntimeError("ThreadedRunner.run() was re-entered")
06/03/pr/6737/a4519d8/github-Test on macOS-3326-failed/test (snapshot)/8_Run tests.txt:2022-06-03T09:42:09.4228990Z RuntimeError: ThreadedRunner.run() was re-entered
06/03/pr/6740/d6f1cdd/appveyor-8060-failed/vf6voxcyv67lwaox.txt:[00:03:16]     raise RuntimeError("ThreadedRunner.run() was re-entered")
06/03/pr/6740/d6f1cdd/appveyor-8060-failed/vf6voxcyv67lwaox.txt:[00:03:16] RuntimeError: ThreadedRunner.run() was re-entered
09/19/push/maint/27c18e8/appveyor-8543-failed/4endjtn7v53adf2h.txt:[00:17:12] >           raise RuntimeError("ThreadedRunner.run() was re-entered")                                                                                                                
09/19/push/maint/27c18e8/appveyor-8543-failed/4endjtn7v53adf2h.txt:[00:17:12] E           RuntimeError: ThreadedRunner.run() was re-entered
09/19/push/maint/27c18e8/appveyor-8543-failed/4endjtn7v53adf2h.txt:[00:17:12] FAILED ../datalad/support/tests/test_parallel.py::test_creatsubdatasets - RuntimeError: ThreadedRunner.run() was re-entered
10/12/pr/7084/44d7557/appveyor-8646-failed/hsb8f27cmg1m57g0.txt:[00:19:56] >           raise RuntimeError("ThreadedRunner.run() was re-entered")                                                                                                                   
10/12/pr/7084/44d7557/appveyor-8646-failed/hsb8f27cmg1m57g0.txt:[00:19:56] E           RuntimeError: ThreadedRunner.run() was re-entered
10/12/pr/7084/44d7557/appveyor-8646-failed/hsb8f27cmg1m57g0.txt:[00:19:56] FAILED ../datalad/support/tests/test_parallel.py::test_creatsubdatasets - RuntimeError: ThreadedRunner.run() was re-entered
10/24/pr/7111/ca436bf/github-Test on macOS-3958-failed/1_test (brew).txt:2022-10-24T19:12:27.9900650Z >           raise RuntimeError("ThreadedRunner.run() was re-entered")
10/24/pr/7111/ca436bf/github-Test on macOS-3958-failed/1_test (brew).txt:2022-10-24T19:12:27.9901570Z E           RuntimeError: ThreadedRunner.run() was re-entered
10/24/pr/7111/ca436bf/github-Test on macOS-3958-failed/test (brew)/8_Run tests.txt:2022-10-24T19:12:27.9900650Z >           raise RuntimeError("ThreadedRunner.run() was re-entered")
10/24/pr/7111/ca436bf/github-Test on macOS-3958-failed/test (brew)/8_Run tests.txt:2022-10-24T19:12:27.9901560Z E           RuntimeError: ThreadedRunner.run() was re-entered
10/24/pr/UNK/ca436bf/github-Test on macOS-3958-failed/1_test (brew).txt:2022-10-24T19:12:27.9900650Z >           raise RuntimeError("ThreadedRunner.run() was re-entered")
10/24/pr/UNK/ca436bf/github-Test on macOS-3958-failed/1_test (brew).txt:2022-10-24T19:12:27.9901570Z E           RuntimeError: ThreadedRunner.run() was re-entered
10/24/pr/UNK/ca436bf/github-Test on macOS-3958-failed/test (brew)/8_Run tests.txt:2022-10-24T19:12:27.9900650Z >           raise RuntimeError("ThreadedRunner.run() was re-entered")
10/24/pr/UNK/ca436bf/github-Test on macOS-3958-failed/test (brew)/8_Run tests.txt:2022-10-24T19:12:27.9901560Z E           RuntimeError: ThreadedRunner.run() was re-entered
datalad foreach-dataset --o-s relpath git grep   57.95s user 48.87s system 461% cpu 23.163 total

so happened on OSX but also on appveyor on debian/ubuntu:

$> datalad foreach-dataset --o-s relpath git grep -l 'ThreadedRunner.run() was re-entered' | grep -v macOS | xargs grep -e '## system' -A1 
06/02/pr/6737/60b6f94/appveyor-8048-failed/o7ip9u3udu7odsg6.txt:[00:02:38] ## system 
06/02/pr/6737/60b6f94/appveyor-8048-failed/o7ip9u3udu7odsg6.txt-[00:02:38]   - distribution: debian/bullseye/sid
--
06/02/pr/6737/abbf61b/appveyor-8049-failed/fla2okr02r6snseh.txt:[00:02:30] ## system 
06/02/pr/6737/abbf61b/appveyor-8049-failed/fla2okr02r6snseh.txt-[00:02:30]   - distribution: debian/bullseye/sid
--
06/03/pr/6737/81f8b2d/appveyor-8059-failed/61o073tsjurpnalc.txt:[00:02:54] ## system 
06/03/pr/6737/81f8b2d/appveyor-8059-failed/61o073tsjurpnalc.txt-[00:02:54]   - distribution: debian/bullseye/sid
--
06/03/pr/6740/d6f1cdd/appveyor-8060-failed/vf6voxcyv67lwaox.txt:[00:02:55] ## system 
06/03/pr/6740/d6f1cdd/appveyor-8060-failed/vf6voxcyv67lwaox.txt-[00:02:55]   - distribution: debian/bullseye/sid
--
09/19/push/maint/27c18e8/appveyor-8543-failed/4endjtn7v53adf2h.txt:[00:02:02] ## system 
09/19/push/maint/27c18e8/appveyor-8543-failed/4endjtn7v53adf2h.txt-[00:02:02]   - distribution: ubuntu/20.04/focal
--
10/12/pr/7084/44d7557/appveyor-8646-failed/hsb8f27cmg1m57g0.txt:[00:02:09] ## system 
10/12/pr/7084/44d7557/appveyor-8646-failed/hsb8f27cmg1m57g0.txt-[00:02:09]   - distribution: debian/bullseye/sid
datalad foreach-dataset --o-s relpath git grep -l   77.93s user 46.18s system 1042% cpu 11.904 total
grep --color=auto -d skip -v macOS  0.00s user 0.00s system 0% cpu 11.904 total
xargs grep -e '## system' -A1  0.01s user 0.00s system 0% cpu 11.909 total

edit: and it was on pythons 3.7.12, 3.7.13, and 3.8.10

The text was updated successfully, but these errors were encountered:

yarikoptic · 2022-11-02T19:59:14Z

I see that we set self.generator in https://github.com/datalad/datalad/blob/HEAD/datalad/runner/nonasyncrunner.py#L474 but I do not see us ever resetting it back to None so if we reenter (after waiting and getting the lock at https://github.com/datalad/datalad/blob/HEAD/datalad/runner/nonasyncrunner.py#L319) I don't see why it shouldn't fail like this for two "parallel" executions of the .run. @christian-monch - I need your input

christian-monch · 2022-11-10T15:16:51Z

We reset it in https://github.com/datalad/datalad/blob/HEAD/datalad/runner/nonasyncrunner.py#L128. So if the iterator is exhausted, the generator-attribute of the connected ThreadedRunner-instance is set to None.

christian-monch · 2022-11-10T15:18:23Z

I will add some thread id logging and check the tests

christian-monch · 2022-11-11T08:58:49Z

Just for clarification: ThreadedRunner does not use thread-local storage. If ThreadedRunner.run(self, ...) is entered by multiple threads on the same object, the attribute values will contain invalid information, i.e. information from both threads in case of collection-attributes or values from either thread in case of scalar attributes.

The mutual exclusion has to be guaranteed as long as the thread that "owns" the ThreadedRunner-instance (the thread that first called run() on an un-owned ThreadedRunner-instance), uses the ThreadedRunner-instance. In the non-generator protocol case this is as long as the thread executes the run()-method. In the generator protocol case, that is as long as the generator is not exhausted because the generator uses the ThreadedRunner-instance to progress.

In the non-generator case, mutual exclusion is automatically guaranteed by the lock that protects ThreadedRunner.run(). Multiple threads can randomly call ThreadedRunner.run() and the execution will automatically be serialized.

In the generator case, things are different because the generator exists in a non-trivial state, i.e. a non-exhausted state that still uses the ThreadedRunner-instance, after ThreadedRunner.run() has returned. In this scenario, it is currently the user's responsibility to ensure that no thread is calling ThreadedRunner.run() on the instance until the currently active generator is exhausted.

Having said that, there is a possibility to protect the generator case with a lock as well. That might lead to long blocking of threads, but we could probably issue a warning in this case, It would also open the possbilitiy for deadlocks, if thread 1 holds the generator and waits for results from thread 2, but thread 2 is calling ThreadedRunner.run() on the object that owns the generator and would therefore block until thread 1 exhausts the generator, i.e. forever since thread 1 waits for thread 2

yarikoptic-gitmate · 2022-12-14T15:45:34Z

Issue fixed in 0.17.10

yarikoptic added spurious-test-failure Tests that fail unreliably tests Add or improve existing tests labels Nov 2, 2022

christian-monch self-assigned this Nov 18, 2022

christian-monch mentioned this issue Nov 23, 2022

Fix runner hanging, fix an error in thread handling, and serialize concurrent generator-runner invocations #7201

Merged

christian-monch closed this as completed in #7201 Nov 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test_creatsubdatasets: RuntimeError: ThreadedRunner.run() was re-entered #7138

test_creatsubdatasets: RuntimeError: ThreadedRunner.run() was re-entered #7138

yarikoptic commented Nov 2, 2022 •

edited

yarikoptic commented Nov 2, 2022

christian-monch commented Nov 10, 2022

christian-monch commented Nov 10, 2022

christian-monch commented Nov 11, 2022 •

edited

yarikoptic-gitmate commented Dec 14, 2022

test_creatsubdatasets: RuntimeError: ThreadedRunner.run() was re-entered #7138

test_creatsubdatasets: RuntimeError: ThreadedRunner.run() was re-entered #7138

Comments

yarikoptic commented Nov 2, 2022 • edited

yarikoptic commented Nov 2, 2022

christian-monch commented Nov 10, 2022

christian-monch commented Nov 10, 2022

christian-monch commented Nov 11, 2022 • edited

yarikoptic-gitmate commented Dec 14, 2022

yarikoptic commented Nov 2, 2022 •

edited

christian-monch commented Nov 11, 2022 •

edited