Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_creatsubdatasets: RuntimeError: ThreadedRunner.run() was re-entered #7138

Closed
yarikoptic opened this issue Nov 2, 2022 · 5 comments · Fixed by #7201
Closed

test_creatsubdatasets: RuntimeError: ThreadedRunner.run() was re-entered #7138

yarikoptic opened this issue Nov 2, 2022 · 5 comments · Fixed by #7201
Assignees
Labels
spurious-test-failure Tests that fail unreliably tests Add or improve existing tests

Comments

@yarikoptic
Copy link
Member

yarikoptic commented Nov 2, 2022

spotted in the recent build on OSX using snapshot version of git-annex build:

https://github.com/datalad/datalad/actions/runs/3379647930/jobs/5611473605

>       assert_raises(IncompleteResultsError, list, ProducerConsumer(paths[::-1], create_, jobs=5))

../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/tests/test_parallel.py:129: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/parallel.py:265: in __iter__
    yield from self._iter_threads(self._jobs)
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/parallel.py:417: in _iter_threads
    self.shutdown(force=True, exception=self._producer_exception or interrupted_by_exception)
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/parallel.py:233: in shutdown
    raise exception
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/parallel.py:401: in _iter_threads
    done_useful |= self._pop_done_futures(lgr)
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/parallel.py:463: in _pop_done_futures
    raise exception
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/concurrent/futures/thread.py:57: in run
    result = self.fn(*self.args, **self.kwargs)
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/parallel.py:329: in consumer_worker
    for r in res:
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/interface/utils.py:452: in _execute_command_
    allkwargs):
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/interface/utils.py:619: in _process_results
    for res in results:
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/core/local/create.py:291: in __call__
    paths=[check_path.relative_to(parentds_path)])
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/gitrepo.py:2874: in status
    eval_submodule_state=eval_submodule_state)
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/gitrepo.py:2997: in diffstatus
    from_state = self.get_content_info(paths=paths, ref=fr)
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/support/gitrepo.py:2767: in get_content_info
    read_only=True)
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/dataset/gitrepo.py:461: in call_git
    keep_ends=True))
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/dataset/gitrepo.py:522: in call_git_items_
    sep=sep):
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/dataset/gitrepo.py:355: in _generator_call_git
    for file_no, content in generator:
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/runner/gitrunner.py:305: in run_on_filelist_chunks_items_
    **kwargs):
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/runner/gitrunner.py:189: in _get_chunked_results
    **kwargs)
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/runner/runner.py:194: in run
    results_or_iterator = self.threaded_runner.run()
../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/runner/nonasyncrunner.py:320: in run
    return self._locked_run()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def _locked_run(self) -> dict | _ResultGenerator:
        if self.generator is not None:
>           raise RuntimeError("ThreadedRunner.run() was re-entered")
E           RuntimeError: ThreadedRunner.run() was re-entered

../../../../hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad/runner/nonasyncrunner.py:324: RuntimeError

---------- coverage: platform darwin, python 3.7.15-final-0 ----------
Coverage XML written to file coverage.xml

=========================== short test summary info ============================
FAILED ../tests/test_parallel.py::test_creatsubdatasets - RuntimeError: ThreadedRunner.run() was re-entered
============ 1 failed, 392 passed, 35 skipped in 3114.32s (0:51:54) ============

and that exception was seen raised (test seems to fail or error also for other reasons in the past, so grepping for the exception)

(git)smaug:/mnt/datasets/datalad/ci/logs/2022[master]git
$> datalad foreach-dataset --o-s relpath git grep 'ThreadedRunner.run() was re-entered'                                      
06/02/pr/6737/60b6f94/appveyor-8048-failed/o7ip9u3udu7odsg6.txt:[00:15:51]     raise RuntimeError("ThreadedRunner.run() was re-entered")                                                                                                                           
06/02/pr/6737/60b6f94/appveyor-8048-failed/o7ip9u3udu7odsg6.txt:[00:15:51] RuntimeError: ThreadedRunner.run() was re-entered
06/02/pr/6737/abbf61b/appveyor-8049-failed/fla2okr02r6snseh.txt:[00:17:18]     raise RuntimeError("ThreadedRunner.run() was re-entered")
06/02/pr/6737/abbf61b/appveyor-8049-failed/fla2okr02r6snseh.txt:[00:17:18] RuntimeError: ThreadedRunner.run() was re-entered
06/03/pr/6737/81f8b2d/appveyor-8059-failed/61o073tsjurpnalc.txt:[00:14:43]     raise RuntimeError("ThreadedRunner.run() was re-entered")
06/03/pr/6737/81f8b2d/appveyor-8059-failed/61o073tsjurpnalc.txt:[00:14:43] RuntimeError: ThreadedRunner.run() was re-entered
06/03/pr/6737/a4519d8/github-Test on macOS-3326-failed/2_test (snapshot).txt:2022-06-03T09:42:09.4228590Z     raise RuntimeError("ThreadedRunner.run() was re-entered")
06/03/pr/6737/a4519d8/github-Test on macOS-3326-failed/2_test (snapshot).txt:2022-06-03T09:42:09.4228990Z RuntimeError: ThreadedRunner.run() was re-entered
06/03/pr/6737/a4519d8/github-Test on macOS-3326-failed/test (snapshot)/8_Run tests.txt:2022-06-03T09:42:09.4228590Z     raise RuntimeError("ThreadedRunner.run() was re-entered")
06/03/pr/6737/a4519d8/github-Test on macOS-3326-failed/test (snapshot)/8_Run tests.txt:2022-06-03T09:42:09.4228990Z RuntimeError: ThreadedRunner.run() was re-entered
06/03/pr/6740/d6f1cdd/appveyor-8060-failed/vf6voxcyv67lwaox.txt:[00:03:16]     raise RuntimeError("ThreadedRunner.run() was re-entered")
06/03/pr/6740/d6f1cdd/appveyor-8060-failed/vf6voxcyv67lwaox.txt:[00:03:16] RuntimeError: ThreadedRunner.run() was re-entered
09/19/push/maint/27c18e8/appveyor-8543-failed/4endjtn7v53adf2h.txt:[00:17:12] >           raise RuntimeError("ThreadedRunner.run() was re-entered")                                                                                                                
09/19/push/maint/27c18e8/appveyor-8543-failed/4endjtn7v53adf2h.txt:[00:17:12] E           RuntimeError: ThreadedRunner.run() was re-entered
09/19/push/maint/27c18e8/appveyor-8543-failed/4endjtn7v53adf2h.txt:[00:17:12] FAILED ../datalad/support/tests/test_parallel.py::test_creatsubdatasets - RuntimeError: ThreadedRunner.run() was re-entered
10/12/pr/7084/44d7557/appveyor-8646-failed/hsb8f27cmg1m57g0.txt:[00:19:56] >           raise RuntimeError("ThreadedRunner.run() was re-entered")                                                                                                                   
10/12/pr/7084/44d7557/appveyor-8646-failed/hsb8f27cmg1m57g0.txt:[00:19:56] E           RuntimeError: ThreadedRunner.run() was re-entered
10/12/pr/7084/44d7557/appveyor-8646-failed/hsb8f27cmg1m57g0.txt:[00:19:56] FAILED ../datalad/support/tests/test_parallel.py::test_creatsubdatasets - RuntimeError: ThreadedRunner.run() was re-entered
10/24/pr/7111/ca436bf/github-Test on macOS-3958-failed/1_test (brew).txt:2022-10-24T19:12:27.9900650Z >           raise RuntimeError("ThreadedRunner.run() was re-entered")
10/24/pr/7111/ca436bf/github-Test on macOS-3958-failed/1_test (brew).txt:2022-10-24T19:12:27.9901570Z E           RuntimeError: ThreadedRunner.run() was re-entered
10/24/pr/7111/ca436bf/github-Test on macOS-3958-failed/test (brew)/8_Run tests.txt:2022-10-24T19:12:27.9900650Z >           raise RuntimeError("ThreadedRunner.run() was re-entered")
10/24/pr/7111/ca436bf/github-Test on macOS-3958-failed/test (brew)/8_Run tests.txt:2022-10-24T19:12:27.9901560Z E           RuntimeError: ThreadedRunner.run() was re-entered
10/24/pr/UNK/ca436bf/github-Test on macOS-3958-failed/1_test (brew).txt:2022-10-24T19:12:27.9900650Z >           raise RuntimeError("ThreadedRunner.run() was re-entered")
10/24/pr/UNK/ca436bf/github-Test on macOS-3958-failed/1_test (brew).txt:2022-10-24T19:12:27.9901570Z E           RuntimeError: ThreadedRunner.run() was re-entered
10/24/pr/UNK/ca436bf/github-Test on macOS-3958-failed/test (brew)/8_Run tests.txt:2022-10-24T19:12:27.9900650Z >           raise RuntimeError("ThreadedRunner.run() was re-entered")
10/24/pr/UNK/ca436bf/github-Test on macOS-3958-failed/test (brew)/8_Run tests.txt:2022-10-24T19:12:27.9901560Z E           RuntimeError: ThreadedRunner.run() was re-entered
datalad foreach-dataset --o-s relpath git grep   57.95s user 48.87s system 461% cpu 23.163 total  

so happened on OSX but also on appveyor on debian/ubuntu:

$> datalad foreach-dataset --o-s relpath git grep -l 'ThreadedRunner.run() was re-entered' | grep -v macOS | xargs grep -e '## system' -A1 
06/02/pr/6737/60b6f94/appveyor-8048-failed/o7ip9u3udu7odsg6.txt:[00:02:38] ## system 
06/02/pr/6737/60b6f94/appveyor-8048-failed/o7ip9u3udu7odsg6.txt-[00:02:38]   - distribution: debian/bullseye/sid
--
06/02/pr/6737/abbf61b/appveyor-8049-failed/fla2okr02r6snseh.txt:[00:02:30] ## system 
06/02/pr/6737/abbf61b/appveyor-8049-failed/fla2okr02r6snseh.txt-[00:02:30]   - distribution: debian/bullseye/sid
--
06/03/pr/6737/81f8b2d/appveyor-8059-failed/61o073tsjurpnalc.txt:[00:02:54] ## system 
06/03/pr/6737/81f8b2d/appveyor-8059-failed/61o073tsjurpnalc.txt-[00:02:54]   - distribution: debian/bullseye/sid
--
06/03/pr/6740/d6f1cdd/appveyor-8060-failed/vf6voxcyv67lwaox.txt:[00:02:55] ## system 
06/03/pr/6740/d6f1cdd/appveyor-8060-failed/vf6voxcyv67lwaox.txt-[00:02:55]   - distribution: debian/bullseye/sid
--
09/19/push/maint/27c18e8/appveyor-8543-failed/4endjtn7v53adf2h.txt:[00:02:02] ## system 
09/19/push/maint/27c18e8/appveyor-8543-failed/4endjtn7v53adf2h.txt-[00:02:02]   - distribution: ubuntu/20.04/focal
--
10/12/pr/7084/44d7557/appveyor-8646-failed/hsb8f27cmg1m57g0.txt:[00:02:09] ## system 
10/12/pr/7084/44d7557/appveyor-8646-failed/hsb8f27cmg1m57g0.txt-[00:02:09]   - distribution: debian/bullseye/sid
datalad foreach-dataset --o-s relpath git grep -l   77.93s user 46.18s system 1042% cpu 11.904 total
grep --color=auto -d skip -v macOS  0.00s user 0.00s system 0% cpu 11.904 total
xargs grep -e '## system' -A1  0.01s user 0.00s system 0% cpu 11.909 total

edit: and it was on pythons 3.7.12, 3.7.13, and 3.8.10

@yarikoptic yarikoptic added spurious-test-failure Tests that fail unreliably tests Add or improve existing tests labels Nov 2, 2022
@yarikoptic
Copy link
Member Author

I see that we set self.generator in https://github.com/datalad/datalad/blob/HEAD/datalad/runner/nonasyncrunner.py#L474 but I do not see us ever resetting it back to None so if we reenter (after waiting and getting the lock at https://github.com/datalad/datalad/blob/HEAD/datalad/runner/nonasyncrunner.py#L319) I don't see why it shouldn't fail like this for two "parallel" executions of the .run. @christian-monch - I need your input

@christian-monch
Copy link
Contributor

We reset it in https://github.com/datalad/datalad/blob/HEAD/datalad/runner/nonasyncrunner.py#L128. So if the iterator is exhausted, the generator-attribute of the connected ThreadedRunner-instance is set to None.

@christian-monch
Copy link
Contributor

I will add some thread id logging and check the tests

@christian-monch
Copy link
Contributor

christian-monch commented Nov 11, 2022

Just for clarification: ThreadedRunner does not use thread-local storage. If ThreadedRunner.run(self, ...) is entered by multiple threads on the same object, the attribute values will contain invalid information, i.e. information from both threads in case of collection-attributes or values from either thread in case of scalar attributes.

The mutual exclusion has to be guaranteed as long as the thread that "owns" the ThreadedRunner-instance (the thread that first called run() on an un-owned ThreadedRunner-instance), uses the ThreadedRunner-instance. In the non-generator protocol case this is as long as the thread executes the run()-method. In the generator protocol case, that is as long as the generator is not exhausted because the generator uses the ThreadedRunner-instance to progress.

In the non-generator case, mutual exclusion is automatically guaranteed by the lock that protects ThreadedRunner.run(). Multiple threads can randomly call ThreadedRunner.run() and the execution will automatically be serialized.

In the generator case, things are different because the generator exists in a non-trivial state, i.e. a non-exhausted state that still uses the ThreadedRunner-instance, after ThreadedRunner.run() has returned. In this scenario, it is currently the user's responsibility to ensure that no thread is calling ThreadedRunner.run() on the instance until the currently active generator is exhausted.

Having said that, there is a possibility to protect the generator case with a lock as well. That might lead to long blocking of threads, but we could probably issue a warning in this case, It would also open the possbilitiy for deadlocks, if thread 1 holds the generator and waits for results from thread 2, but thread 2 is calling ThreadedRunner.run() on the object that owns the generator and would therefore block until thread 1 exhausts the generator, i.e. forever since thread 1 waits for thread 2

@yarikoptic-gitmate
Copy link
Collaborator

Issue fixed in 0.17.10

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spurious-test-failure Tests that fail unreliably tests Add or improve existing tests
Projects
None yet
3 participants