NF+OPT+ENH: a helper to parallelize operations (via threads) #5022
Conversation
Force-pushed from ca87b8a to 297d422 (compare).
ok -- here are some examples for those who are not prone to epileptic seizures (a good number of jumping progress bars -- "result" record rendering is not yet de-interferenced from logs in any way, so it interferes from time to time). This is running from my laptop on my not-so-fast broadband; didn't try yet on a beefier connection. Change in behavior also: now the progressbar appears right from the beginning, but "Total" grows as we discover more subdatasets to install ;-) (note -- the hcp one already had some pre-installed locally, so those got skipped)
BUG: if nothing is to be installed (all subdatasets are already in place), it never exits:
$> datalad install -J 10 -r ///hcp-openaccess
[ERROR ] target path already exists and not empty, refuse to clone into target path [install(/mnt/scrap/tmp/hcp-openaccess)]
install(error): /mnt/scrap/tmp/hcp-openaccess (dataset) [target path already exists and not empty, refuse to clone into target path]
[INFO ] Installing Dataset(/mnt/scrap/tmp/hcp-openaccess) to get /mnt/scrap/tmp/hcp-openaccess recursively
Installing: 0.00 datasets [00:00, ? datasets/s]
cpu isn't busy; py-spy ...
edit: so the producer never finished, strange... and it is not specific to having the super already pre-installed -- it does not even work on ...
I thought that it was related to the stall, but it is not. Without the fix of #4910 (fasteners version), running test_parallel.py showed:
so part of the bug -- if there was an exception, we might not want to resort to a similar strategy of "continue by default" but really re-raise it as soon as it appears? But in general it is something to be "resilient" about in either case -- it must not stall, and it should finish if all consumers blow up, so TODO (a sketch of the re-raise idea follows the log below).

edit 2: confirmed locally -- it stalls in a conda env with Python 3.6.10

edit 3: eh crap -- seems like some interference between Thread/ThreadPoolExecutor and WitlessRunner's async implementation:

$> DATALAD_LOG_LEVEL=debug python -m nose -s -v -x datalad/support/tests/test_parallel.py:test_creatsubdatasets
...
[DEBUG ] Async run ['git', 'config', '-z', '-l', '--show-origin']
[DEBUG ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds0/subsub1
[DEBUG ] Launching process ['git', 'config', '-z', '-l', '--show-origin']
[DEBUG ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds0/subsub0
[DEBUG ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds9
[DEBUG ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds8
[DEBUG ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds7
[DEBUG ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds6
[DEBUG ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds5
[DEBUG ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds4
[DEBUG ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds3
[DEBUG ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds2
[DEBUG ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds1
[DEBUG ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds0
[DEBUG ] Process 3810732 started
[DEBUG ] Waiting for process 3810732 to complete
^C^C[DEBUG ] Printing versioning information collected so far
[DEBUG ] Async run ['/home/yoh/anaconda-5.2.0-2.7/envs/datalad-py3.6/bin/git', 'version']
[DEBUG ] Launching process ['/home/yoh/anaconda-5.2.0-2.7/envs/datalad-py3.6/bin/git', 'version']
[DEBUG ] Process 3810732 exited with return code 0
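Coming back to the "re-raise as soon as it appears" thought above, here is a minimal sketch (a hypothetical helper, not this PR's code) of surfacing the first consumer exception instead of continuing by default, so a blown-up consumer cannot turn into a silent stall:

```python
# Minimal sketch (hypothetical helper, not this PR's code) of the
# "re-raise as soon as it appears" idea: propagate the first consumer
# exception instead of continuing by default.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_all_or_raise(consumer, items, jobs=3):
    results = []
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        pending = {pool.submit(consumer, it) for it in items}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                # .result() re-raises any exception from the consumer right away
                results.append(fut.result())
    return results
```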
eh heh. edit: the stall is present with Python 3.7.9 and is gone with 3.8.0
uff... tried to bisect and it was a pain since many points in history aren't "usable"... Narrowed it down and spotted it in the logs (great that the fix to the build was near the fix for "my" issue): apparently it is a documented ("RTFM") limitation prior to 3.8. From the docs: "To handle signals and to execute subprocesses, the event loop must be run in the main thread." -- which is actually still there in the current version... BUT there was v3.8.0b2~37 "bpo-35621: Support running subprocesses in asyncio when loop is executed in non-main thread (GH-14344)". Here is how I used my (yet to be pushed) helper to build cpython, create a virtualenv, and pip install datalad into it -- so in theory (in my case it ended up mostly as manual jumping) it could be used for bisection:
(git)lena:~/proj/misc/cpython[tags/v3.8.0b2~29]git
$> CP_COMMIT=cb083f7cdf604c1d9d264f387f9e8846bc953eb3 ~/proj/datalad/datalad-master/tools/ci/bisect-python.sh 'python3 ~/proj/datalad/datalad-master/datalad/support/tests/test_parallel.py && exit 1 || exit 0'
Python source: /home/yoh/proj/misc/cpython DataLad: /home/yoh/proj/datalad/datalad-master
INFO: Building python
[detached HEAD ec8b8cacdf0] bpo-37221: Add PyCode_NewWithPosOnlyArgs to be used internally and set PyCode_New as a compatibility wrapper (GH-13959) (#14505)
Author: Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com>
Date: Mon Jul 1 04:29:14 2019 -0700
8 files changed, 79 insertions(+), 34 deletions(-)
create mode 100644 Misc/NEWS.d/next/C API/2019-06-11-02-50-38.bpo-37221.4tClQT.rst
INFO: Creating virtualenv
All ready:
build: /home/yoh/proj/misc/cpython-builds/v3.8.0b1-180-gbf8cb318035+cb083f
venv: /home/yoh/proj/datalad/datalad-master/venvs/build-v3.8.0b1-180-gbf8cb318035+cb083f
source: source "/home/yoh/proj/datalad/datalad-master/venvs/build-v3.8.0b1-180-gbf8cb318035+cb083f/bin/activate"
python: /home/yoh/proj/datalad/datalad-master/venvs/build-v3.8.0b1-180-gbf8cb318035+cb083f/bin/python3
ver: Python 3.8.0b1+
datalad: 0.13.4-264-gf47d566e4
INFO: running bisection command python3 ~/proj/datalad/datalad-master/datalad/support/tests/test_parallel.py && exit 1 || exit 0
1
1
CP_COMMIT=cb083f7cdf604c1d9d264f387f9e8846bc953eb3 84.48s user 19.64s system 178% cpu 58.434 total
1 30241 ->1 [1].....................................:Wed 14 Oct 2020 07:35:23 PM EDT:.
(git)lena:~/proj/misc/cpython[tags/v3.8.0b2~29]git
$> cd .
CODE_OF_CONDUCT.md Mac/ Objects/ README.rst config.status* m4/ python-config.py
Doc/ Makefile PC/ Tools/ config.sub* pybuilddir.txt python-gdb.py
Grammar/ Makefile.pre PCbuild/ aclocal.m4 configure* pyconfig.h setup.py
Include/ Makefile.pre.in Parser/ build/ configure.ac pyconfig.h.in
LICENSE Misc/ Programs/ config.guess* install-sh* python*
Lib/ Modules/ Python/ config.log libpython3.8.a python-config
1 30242 [1].....................................:Wed 14 Oct 2020 07:35:31 PM EDT:.
(git)lena:~/proj/misc/cpython[undefined]git
$> git co bf8cb31803558f1105efb15b0ee4bd184f3218c8^
Warning: you are leaving 1 commit behind, not connected to
any of your branches:
ec8b8cacdf0 bpo-37221: Add PyCode_NewWithPosOnlyArgs to be used internally and set PyCode_New as a compatibility wrapper (GH-13959) (#14505)
If you want to keep it by creating a new branch, this may be a good time
to do so with:
git branch <new-branch-name> ec8b8cacdf0
HEAD is now at ffcc161c753 bpo-29505: Add more fuzzing for re.compile, re.load and csv.reader (GH-14255)
1 30243 [1].....................................:Wed 14 Oct 2020 07:35:38 PM EDT:.
(git)lena:~/proj/misc/cpython[tags/v3.8.0b2~30]git
$> CP_COMMIT=cb083f7cdf604c1d9d264f387f9e8846bc953eb3 ~/proj/datalad/datalad-master/tools/ci/bisect-python.sh 'python3 ~/proj/datalad/datalad-master/datalad/support/tests/test_parallel.py && exit 1 || exit 0'
Python source: /home/yoh/proj/misc/cpython DataLad: /home/yoh/proj/datalad/datalad-master
INFO: Building python
[detached HEAD 1576d612592] bpo-37221: Add PyCode_NewWithPosOnlyArgs to be used internally and set PyCode_New as a compatibility wrapper (GH-13959) (#14505)
Author: Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com>
Date: Mon Jul 1 04:29:14 2019 -0700
8 files changed, 79 insertions(+), 34 deletions(-)
create mode 100644 Misc/NEWS.d/next/C API/2019-06-11-02-50-38.bpo-37221.4tClQT.rst
INFO: Creating virtualenv
All ready:
build: /home/yoh/proj/misc/cpython-builds/v3.8.0b1-179-gffcc161c753+cb083f
venv: /home/yoh/proj/datalad/datalad-master/venvs/build-v3.8.0b1-179-gffcc161c753+cb083f
source: source "/home/yoh/proj/datalad/datalad-master/venvs/build-v3.8.0b1-179-gffcc161c753+cb083f/bin/activate"
python: /home/yoh/proj/datalad/datalad-master/venvs/build-v3.8.0b1-179-gffcc161c753+cb083f/bin/python3
ver: Python 3.8.0b1+
datalad: 0.13.4-264-gf47d566e4
INFO: running bisection command python3 ~/proj/datalad/datalad-master/datalad/support/tests/test_parallel.py && exit 1 || exit 0
1
1
exceeded
Terminated
CP_COMMIT=cb083f7cdf604c1d9d264f387f9e8846bc953eb3 89.67s user 20.66s system 173% cpu 1:03.72 total
So, the easiest way forward is probably simply to make this new thread-based execution available starting with Python 3.8. It would require some RF, some of which I was going to do anyway (e.g. move logging outside), so it would be easy to just provide a "serial" producer/consumer tandem. Might also be handy to have it "serial" in case of ...
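For illustration, a minimal standalone sketch (my own repro attempt, not the DataLad test itself) of the pre-3.8 limitation discussed above -- running an asyncio subprocess from a non-main thread -- would look something like this; before bpo-35621 (merged for 3.8) the default child watcher relied on a loop in the main thread, so this tends to hang or raise on 3.6/3.7 while completing fine on 3.8+:

```python
# Minimal repro sketch (not the DataLad test itself): run an asyncio
# subprocess from a non-main thread.  On Python < 3.8 the default child
# watcher relies on SIGCHLD being handled by a loop in the main thread,
# so this tends to hang or raise; Python 3.8 (bpo-35621) made it work.
import asyncio
import sys
import threading


async def git_version():
    proc = await asyncio.create_subprocess_exec(
        "git", "version", stdout=asyncio.subprocess.PIPE)
    out, _ = await proc.communicate()
    return out.decode().strip()


def worker():
    loop = asyncio.new_event_loop()  # each thread needs its own event loop
    try:
        print(loop.run_until_complete(git_version()))
    finally:
        loop.close()


if __name__ == "__main__":
    print(sys.version)
    t = threading.Thread(target=worker)
    t.start()
    t.join()
```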
Force-pushed from ce8f207 to fdb98cb (compare).
tests were failing since the code pulled in from dandi wanted pytest. So I rebased, removing ...
Discovered by chance while troubleshooting failing datalad#5022, which due to "out of order" completion of get swapped datasets between noannex and annexed. Was fun to debug... heh ;)
Codecov Report
@@ Coverage Diff @@
## master #5022 +/- ##
==========================================
+ Coverage 89.79% 89.96% +0.17%
==========================================
Files 293 297 +4
Lines 41274 41779 +505
==========================================
+ Hits 37060 37585 +525
+ Misses 4214 4194 -20
Continue to review full report at Codecov.
…mer tandem The idea is to feed the consumer with items from the producer as soon as they arrive, and to update log_progress with the total only when the producer has finished producing (well -- we could have updated the total upon each iteration, but I am not sure that is worthwhile since it would provide an ETA which keeps rising)
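To make the producer/consumer tandem idea concrete, here is a minimal sketch of the pattern the commit message describes -- not the actual ProducerConsumer implementation from this PR: the producer generator is drained in a separate thread, items are handed to a small pool of consumer threads as soon as they arrive, and the total only becomes known once the producer is exhausted.

```python
# Minimal sketch of the producer/consumer tandem described above -- not the
# actual ProducerConsumer implementation in this PR.  Items are consumed as
# soon as the producer yields them; the total only becomes known once the
# producer is exhausted (which is when a progress bar total could be set).
import concurrent.futures
import queue
import threading


def produce_consume(producer, consumer, jobs=3):
    """Yield consumer(item) results as they complete, in arbitrary order."""
    results = queue.Queue()
    state = {"submitted": 0, "done_producing": False}

    with concurrent.futures.ThreadPoolExecutor(max_workers=jobs) as pool:

        def _produce():
            for item in producer:
                fut = pool.submit(consumer, item)
                state["submitted"] += 1
                fut.add_done_callback(results.put)
            state["done_producing"] = True  # total is now known
            results.put(None)               # wake up the loop below

        threading.Thread(target=_produce, daemon=True).start()

        yielded = 0
        while True:
            fut = results.get()
            if fut is not None:
                yielded += 1
                yield fut.result()
            if state["done_producing"] and yielded == state["submitted"]:
                break


if __name__ == "__main__":
    import time

    def slow_producer():
        for i in range(10):
            time.sleep(0.05)  # items trickle in over time
            yield i

    for res in produce_consume(slow_producer(), lambda x: x * x, jobs=4):
        print(res)
```

Exception handling, result ordering, and progress reporting are what the actual helper would layer on top of a skeleton like this.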
…if returned value is a generator
In general -- "works", but it shows the problem of interference from logs (and possible stdout from running procedures) with progress reporting. Delaying dealing with it anyhow for now
…rogress Based on Threads, it would parallelize execution of the producer (e.g. of paths) and the consumer (e.g. create) to operate on subdatasets in parallel threads. Given that the underlying git/git-annex processes would also consume CPU, overall we can escape the 1-CPU "performance" bottleneck in cases of parallel operation across (sub)datasets.
…ions that we do not touch super-dataset until we are done creating sub-datasets, and then just use a single save within super dataset to save all freshly created subdatasets
…s, sleep time will increment upon no activity
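The "sleep time will increment upon no activity" bit in the commit above amounts to a small backoff loop; a standalone sketch (the constants are my own assumptions, not this PR's values) could look like:

```python
# Standalone sketch of an incrementing-sleep ("backoff") poll loop; the
# constants are assumptions for illustration, not the values used in this PR.
import time


def wait_for_activity(check_activity, start=0.001, factor=1.5, ceiling=0.5):
    """Poll check_activity(), sleeping progressively longer while nothing happens."""
    sleep_time = start
    while not check_activity():
        time.sleep(sleep_time)
        sleep_time = min(sleep_time * factor, ceiling)
```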
I played around with this quite a bit (mostly small-scale addurls and save operations) on both 3.7 and 3.8. Aside from what I consider to be an unwarranted warning shown on 3.7, I didn't notice any issues.
We already have a similar option for add_archive_content and a few places in the crawler. It is a very much needed feature when crawling/addurl-ing big datasets, which might not even fit on the local hard drive -- we need to drop content as soon as it is annexed (well -- later, maybe after we extract metadata). For now I am just adding it as a bool, but we might later indeed specify when to drop it (right after it is added, or after the dataset is saved and possibly metadata extracted via some hook or explicit operation, etc). This PR is based on datalad#5022 changes since that is where addurls was RFed to accommodate parallel execution, and I would prefer not to mess with resolving conflicts later on
…o serial For some reason (did not investigate) "datalad save" arrives here with jobs=1 and not "auto", see datalad#5022 (comment). So let's just not bother the user with such a warning if jobs == 1
* origin/master: (36 commits)
  ENH: addurl - store original urlfile, not resolved full path
  OPT: Delay Dataset.repo access until necessary
  TST: Unicode setup on appveyor is too special
  TST: Mark cached dataset tests as broken
  Discover and run all tests
  tests
  TST: Robustify re timezone reporting
  more tests
  TST: Minor tweaking of utility tests for a user win10 installation
  TST: Remove a testrepo setup special case on windows
  BF: Make sure final suppression message appears
  BF: Invaluable contribution
  BF: subdatasets: Fix recent path recoding change
  CLN: subdatasets: Fix a comment typo
  CLN: subdatasets: Drop inaccurate comment
  OPT: get_content_info: Speed up ls-files by dropping unneeded flags
  OPT: Reuse dataset instances, avoid repeated function calls
  TST: Adjust test for new suppression message rate limiter
  RF: Simplify submodule path recoding
  RF: Decomplicate call to resolve_path()
  BF: Ensure all paths are Path instances
  ...

Conflicts:
  datalad/plugin/addurls.py -- it was about '{url_file} -> {urlfile}' fix in master, adopted here
Thanks @kyleam for the review and commits. Since there is not a single approval yet, I am not sure how to proceed: I guess I will just patiently wait for some brave soul to approve and will merge then ;) Hopefully @mih or @bpoldrack find some 3.8 lying around to give it a spin as well
Well, you took care of my one objection (thanks), so consider me an approval. For this sort of PR, it might be worth waiting for more than one person to give feedback, but that's of course up to you.
I went through the code now, and only have a minor comment. I'll give it a go on a big dataset next.
'.', recursive=True, get_data=False, result_xfm='datasets')
subs = top.get('.', recursive=True, get_data=False, result_xfm='datasets')
# order for '.' should not be relied upon, so sort by path | ||
sub, subsub, subnoannex = sorted(subs, key=lambda ds: ds.path) |
Oi. That is a good point. More generally, this points to the fact that the order of results is now "random". I think we have a few places in the code that make use of (implicit) order assumptions. I cannot pinpoint any right now, but I am confident that there are a few. Something to be aware of, as those might fail in strange ways once parallelization is more widespread.
yeah, I was not 100% certain about this one, i.e. whether I should have introduced some additional option preserve_order=True to start with -- that would have been the most "reliable and conservative" choice. It would cause a bit more code complexity and some minor run time effect (results reporting only, I guess, since results would need to be "staged" waiting for earlier ones to complete). But then I thought -- heck, it is against master, so we better spot those assumptions whenever testing breaks, and address them.
But yeah -- if you feel that we better have that option and preserve the current default behavior, let me know -- I am ok to implement it. Might come in handy in some use cases.
note: that option would then need to be channeled to the top interfaces (get etc.), an additional burden on the interface(s), etc... It might be easier to spot/fix, insofar as it was only a few tests (if not one) which had that assumption. For a single path/result it shouldn't matter. For recursive invocations or invocations on directories the order cannot be relied upon anyway... so I am still not sure if there is more pain than gain in adding it
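For the record, a rough sketch (a hypothetical helper, not this PR's API) of what such a preserve_order mode could amount to: results are staged until all earlier submissions have completed, so they come out in submission order at the cost of extra latency.

```python
# Rough sketch (hypothetical helper, not this PR's API) of a preserve_order
# mode: results are yielded strictly in submission order, which means a fast
# result may sit and wait ("stage") until all earlier ones have completed.
from concurrent.futures import ThreadPoolExecutor


def map_preserving_order(consumer, items, jobs=3):
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(consumer, item) for item in items]
        for fut in futures:        # iterate in submission order,
            yield fut.result()     # blocking until each earlier result is ready


if __name__ == "__main__":
    import random
    import time

    def slow_square(x):
        time.sleep(random.random() / 10)
        return x * x

    print(list(map_preserving_order(slow_square, range(10))))
```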
it might be worth, though, adding a test run (or making it the default?) that parallelizes by default (to at least 2 or 3). That could allow spotting other places where order might have been relied upon. (I think I got this one initially when I made get parallelize by default, but then I went for a conservative 1 as the default, so it would only run the producer in a separate thread; there is only one thread for consumers, so order is preserved now by default)
Thanks @mih for spotting that odd formatting in the diff
Gave it a first try. First I was disappointed. I tried running this in the root of the HCP superdataset
Runs in strictly serial fashion. However, this approach in the same dataset works and gives me parallel operation:
However, I see this at the top of the output:
I will try again with a fresh start. I did and the issue does not go away, even when there is no socket at the beginning.
FTR: Cleaning up my previous trials, I realized that ...
virtually any datalad command which has ...
Re ...: I wonder if it could possibly be some racing within annex. Please provide WTF and/or test with the most recent annex snapshot. Google does not give immediate clues on it being a known/fixed issue, although there are some linked to the assistant -- may be the same (maybe it does run in parallel across different repos too)
I am running git-annex 8.20201007 from Debian sid.

WTF (sections):
- configuration <SENSITIVE, report disabled by configuration>
- credentials
- datalad
- dataset
- dependencies
- environment
- extensions
- git-annex
- location
- metadata_extractors
- metadata_indexers
- python
- system
I will blame the use of:

    if self.ctrl_path.exists():
        return False
    # create ssh control master command
    cmd = ["ssh"] + self._ssh_open_args + self._ssh_args + [self.sshri.as_str()]
    # start control master:
    lgr.debug("Opening %s by calling %s", self, cmd)
    proc = Popen(cmd)
    stdout, stderr = proc.communicate(input="\n")  # why the f.. this is necessary?

Moreover, if I Ctrl-C that ... Anyways -- it better be addressed in a separate PR since it doesn't depend on this functionality and would be easier to review in isolation. I will attempt one quickly now (if it doesn't come today -- I failed ;))
It was reported in datalad#5022 (comment) that parallel install/get over ssh results in a flood of "ControlSocket ... already exists, disabling multiplexing" messages and abandoned "ssh -fN -o ControlMaster=auto ..." processes. This stems from the fact that MultiplexSSHConnection is in no way thread-safe, so the same instance is used across multiple threads and races in the check for the socket path existing and then acquiring it.

Note: we use threading.Lock here, so this solution is purely for safety across threads (and more lightweight), and not across multiple processes. In principle the construct we already have (not verified to work):

    from .locking import lock_if_check_fails

    with lock_if_check_fails(
            self.ctrl_path.exists,
            self._ctrl_path,
            blocking=False
    ) as (check, lock):
        if check or not (lock and lock.acquired):
            return False
        # ... the rest of the .open() call

could be used in addition to establish file-based locking across processes (but it would not work for thread safety since it uses InterprocessLock).
I only wonder now what other constructs aren't thread-safe. With parallelization across datasets we should be ok for dataset-bound configs etc.
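A minimal sketch (class and attribute names are made up, not DataLad's) of the thread-level fix the commit above describes -- re-checking the control-socket path under a per-connection threading.Lock so concurrent threads sharing one connection object cannot race on opening the control master:

```python
# Minimal sketch (class and attribute names are made up, not DataLad's) of
# guarding the "socket exists? / start control master" sequence with a
# per-connection threading.Lock, which protects against races between
# threads but not between processes.
import subprocess
import threading
from pathlib import Path


class MultiplexedConnection:
    def __init__(self, ctrl_path, ssh_open_cmd):
        self.ctrl_path = Path(ctrl_path)
        self._ssh_open_cmd = ssh_open_cmd   # e.g. ["ssh", "-fN", "-o", "ControlMaster=auto", ...]
        self._lock = threading.Lock()

    def open(self):
        # cheap unlocked check first, then re-check under the lock to close the race
        if self.ctrl_path.exists():
            return False
        with self._lock:
            if self.ctrl_path.exists():
                return False
            # only one thread per connection instance gets here
            subprocess.run(self._ssh_open_cmd, check=True)
            return True
```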
Ok, this PR had 1 approval, there seems to have been a good amount of "trials", the thread-unsafe ssh was fixed, and I bet we will discover more the more we use it. So if it stays green tomorrow morning (it is yellow now), I will be merging it unless strong objections are expressed while I am asleep ;)
if isinstance(recursion_limit, int) and recursion_limit <= 0:
    return
if jobs == "auto":
    # be safe -- there might be ssh logins etc. So no parallel
    jobs = 0
hm, re-reviewing my code I stumbled upon this. I guess it should be removed since I think it came from the time when I had "auto" by default making 5 jobs. ATM "auto" should be a single job, which should be ok for install. Or if not removed -- just making a mental note that I saw it ;)
and here we go....
We virtually never see datalad "busy". I believe that even with the simplest parallelization via threads, and thus simpler mechanisms to orchestrate, we can make datalad more efficient by working with multiple git/git-annex processes at once, e.g. in cases of multiple datasets to work on.
The primary "motivational" use case is addurls, which might need to create a large number of subdatasets. IMHO the most logical (single) point of "optimization" would be the create command if it was made to accept multiple datasets ("Make create take multiple PATHs arguments" #3233), but since that is not the case, the last commit (ATM ca87b8a) does it within addurls.
It also includes 39b03e7, an RF/ENH which is IMHO generally useful, and if agreed I can submit it in a separate PR. It removes log filtering and makes ProgressHandler redirect non-progress log messages to the default handler AFTER cleaning up all progress bars, and then refreshes them after that. That allows avoiding interference between regular log messages and progress bars. Even with DEBUG-heavy output, the progress bar nicely lingers at the bottom. What still interferes: 1) output from run processes; 2) result records. For 2), I think if we could add a "log handler" which would not format them at all, and then still use the ProgressHandler logger, we could have those records nicely displayed without interfering with progress bars even when some processes are still ongoing. Result records already go through ui.message, which stops all progress bars first and then shows them again (I just submitted a minor "BF: UI - .error also should go through .message to handle pbars" #5040, which unlikely relates). There are no thread safety measures there, so I think it might (have not analyzed) interfere with the clear/refresh of progress bars done by ProgressHandler (is logging thread safe?).

An aspect I wanted to retain is to keep generators generators, be it a producer (e.g. listing of subdatasets) or a consumer (some operation on those subdatasets), while also providing progress bars. With this implementation, we get a progress report as soon as the producer starts producing entries which get processed by a consumer, and only when the producer exhausts itself (and thus we know the total number) do we get a proper progressbar; as of 0ce386e the progress bar appears right away, but the total might grow as we get more stuff to do.

demo 1: simple
Here is a quick demo from a unittest (tuned up to make it run longer) with a "slow arange" whose output gets consumed by an even slower consumer ;) It shows that we start without a progressbar and then it appears. The 2nd run is done with DATALAD_LOG_LEVEL=DEBUG to show how the progressbar lingers at the bottom ;)

NB: just now noticed that some empty lines in the output start breeding -- I think it is some side-effect from progressbars (not logs), which I see from time to time in other cases -- that 'clean' doesn't fully clean?
And while testing that test_creatsubdatasets (just beef it up to more datasets) you finally can see datalad busy doing something useful ;)
demo 2: install
TODOs
- gain feedback on whether it is the right direction to pursue ;) seems it hasn't excited anyone, so plowing forward as envisioned
- ProducerConsumer etc.
- datalad.jobs common config variable which would be consulted if args/cmdline did not provide any (i.e. stayed None). -J in the cmdline will probably just stay as the one for both annex and datalad
- addurls - parallel create. Closes "addurls: OPT - parallelize creation of subdatasets" #4990 (blocked by "save: does not add subdataset(s) to .gitmodules if nested" #5021). "save: Fix handling of explicit paths for nested new subdatasets" #5049 resolves it to an addurls-sufficient degree. Merged in this PR for "completeness".
- install/get - for installation of already "registered" subdatasets (see: "RF to move closer towards parallelization of subdataset installation in get()" #4182; "install --recursive could run jobs in parallel" #450 - TODO)
- save ("datalad save could use parallelism more" #2880)
- IteratorWithAggregation (from dandi), which I first thought to reuse but then ended up just melding its code with parallelization + log_progress
- should the log_progress logic be moved "outside" (and a helper like ProducerConsumerLogged be provided)? (could probably just consume records from the ProducerConsumer; the only tricky point is to pass "total" into it, but that could be done by it being the only place to use lgr) That would allow using it for use cases where returned values are not our structured records

Crazy ideas (for future):
Not here:
- get/install over ssh with prompt. Maybe "sshrun: add locking?" #5034?
- remove, drop, and even subdatasets itself and other core commands like diff and status! maint as well