
NF+OPT+ENH: a helper to parallelize operations (via threads) #5022

Merged: 60 commits into datalad:master, Oct 30, 2020

Conversation

@yarikoptic commented Oct 13, 2020 (Member)

We virtually never see datalad "busy". I believe that even with the simplest parallelization via threads, and thus simpler mechanisms to orchestrate, we can make datalad more efficient by working with multiple git/git-annex processes at once, e.g. in cases of multiple datasets to work on:

  • the primary "motivational" use case is addurls, which might need to create a large number of subdatasets. IMHO the most logical (single) point of "optimization" would be the create command, if it were made to accept multiple datasets (Make create take multiple PATHs arguments #3233), but since that is not the case, the last commit (ATM ca87b8a) does it within addurls

  • also includes 39b03e7 RF/ENH which is IMHO generally useful and, if agreed, I can submit it in a separate PR. It removes log filtering and makes ProgressHandler redirect non-progress log messages to the default handler AFTER cleaning up all progress bars, refreshing them afterwards. That avoids interference between regular log messages and progress bars. Even with DEBUG-heavy output, the progress bar nicely lingers at the bottom

    • the only possible interferences with progress bars which would remain AFAIK are 1) direct unhandled output to stdout/stderr from run processes; 2) results records. For 2), I think if we could add a "log handler" which would not format them at all, and then still use the ProgressHandler logger, we could have those records nicely displayed without interfering with progress bars even while some processes are still ongoing. Result records already go through ui.message, which stops all progress bars first and then shows them again (I just submitted a minor BF: UI - .error also should go through .message to handle pbars #5040, which is unlikely to relate). There are no thread safety measures there, so I think it might (have not analyzed) interfere with the clear/refresh of progress bars done by ProgressHandler (is logging thread safe?)
  • an aspect I wanted to retain is to keep generators generators, be it the producer (e.g. listing of subdatasets) or the consumer (some operation on those subdatasets), while also providing progress bars. With this implementation we get progress reports as soon as the producer starts producing entries that the consumer processes, and only when the producer exhausts itself (and thus we know the total number) do we get a proper progress bar; as of 0ce386e the progress bar appears right away, but the total might grow as we get more stuff to do. A rough sketch of the pattern follows this list.
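
A minimal, hypothetical sketch of that pattern (made-up names, not the actual ProducerConsumer implementation): results stream out while the producer is still being drained, and the progress total keeps growing until the producer is exhausted.

from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def producer_consumer(producer, consumer, jobs=5):
    """Yield consumer(item) results as they complete, while producer still runs."""
    total = 0
    futures = set()
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        for item in producer:   # items may keep arriving for a while
            total += 1          # a progress bar's total would grow here
            futures.add(pool.submit(consumer, item))
            # opportunistically yield whatever has already completed
            done, futures = wait(futures, timeout=0, return_when=FIRST_COMPLETED)
            for f in done:
                yield f.result()
        for f in futures:       # producer exhausted: total is final; drain the rest
            yield f.result()

Note that completed futures are yielded as they finish, so result order is not guaranteed -- exactly the order caveat discussed later in this thread.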

demo 1: simple

Here is a quick demo from a unittest (tuned up to make it run longer) with a "slow arange" producer whose output gets consumed by an even slower consumer ;) It shows that we start without a progress bar and then it appears. The 2nd run is done with DATALAD_LOG_LEVEL=DEBUG to show how the progress bar lingers at the bottom ;)

asciicast

NB: just noticed that some empty lines in the output start breeding -- I think it is some side effect from progress bars (not logs), which I see from time to time in other cases too -- that 'clean' doesn't fully clean?

And while running that test_creatsubdatasets (just beef it up to more datasets), you can finally see datalad busy doing something useful ;)

demo 2: install

asciicast

TODOs

  • gain feedback on whether this is the right direction to pursue ;) seems it hasn't excited anyone, so plowing forward as envisioned
  • [-] better names? (e.g. for ProducerConsumer etc)
  • decide on how to "split" the jobs specification between what to pass to annex and what to be used by datalad.
    • I think we just need to add an additional datalad.jobs common config variable which would be consulted if args/cmdline did not provide any (i.e. stayed None). -J on the cmdline will probably just stay as the one for both annex and datalad (a sketch of that fallback follows this list)
  • initial adoption/testing
  • docstrings and examples
  • possibly remove iterators.py with IteratorWithAggregation (from dandi), which I first thought to reuse but then ended up just melding its code with parallelization + log_progress
  • should all log_progress logic be moved "outside" (and a helper like ProducerConsumerLogged be provided)? (could probably just consume records from the ProducerConsumer; the only tricky point is to pass "total" into it, but that could be done by making it the only place to use lgr for). That would allow using it for use cases where the returned values are not our structured records
  • [?] "properly" handle "on_failure='error'". I do feel that I have observed it not interrupting properly, but it might have been due to the outside higher-level loop we have gathering result records...
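
A sketch of that fallback ("datalad.jobs" is the proposed key, not an existing one; cfg.obtain is assumed to behave as datalad's usual config accessor with a default):

from datalad import cfg

def effective_jobs(jobs=None):
    # consult the (proposed) common config variable only if neither the API
    # argument nor the -J cmdline option provided a value
    if jobs is None:
        jobs = cfg.obtain("datalad.jobs", default="auto")
    return jobs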

Crazy ideas (for future):

  • auto scaling: ATM we pre-define the number of jobs... But in the code I already added a provision for delaying submission of a new job (e.g. waiting for the future creating a parent dataset to be done first). So we could use that to establish an upper limit (e.g. 2*number of cores) and then feed it with new "futures" as long as we see that the current PID's CPU utilization is still far from 100%. Alternatively, since this is intended for quite consistent tasks, we could time how long each future took (while operating with the minimal specified parallelization), and then balance so that the average time for a process to complete does not increase significantly (over-committing IO or CPU). There may already be known libraries/algorithms to do that. A rough sketch of the first idea follows.
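
A very rough sketch of the first idea, assuming psutil is available for CPU sampling (all names made up):

import os
import time
import psutil  # assumed dependency, for per-process CPU utilization

def autoscaled_submit(pool, consumer, items, max_inflight=None):
    """Submit futures, but hold back while at the cap or near CPU saturation."""
    max_inflight = max_inflight or 2 * (os.cpu_count() or 1)
    proc = psutil.Process()
    futures = []
    for item in items:
        while (sum(not f.done() for f in futures) >= max_inflight
               or proc.cpu_percent(interval=None) > 90):  # 90% is an arbitrary cutoff
            time.sleep(0.1)  # back off before re-checking
        futures.append(pool.submit(consumer, item))
    return futures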

Not here:

  • allow for parallel get/install over ssh with a prompt. Maybe sshrun: add locking? #5034?
  • pretty much any command which operates recursively across datasets could adopt this approach (remove, drop), and even subdatasets itself and other core commands like diff and status!
  • some late commits are BFs which might be applicable to maint as well

@yarikoptic yarikoptic added enhancement performance Improve performance of an existing feature UX user experience labels Oct 13, 2020
Review thread on datalad/plugin/addurls.py (outdated, resolved)
@mih commented Oct 13, 2020 (Member)

#4182 would be something of significant impact (related: #450),
but also: #2880

some info on previous exploration: #4180

@yarikoptic commented Oct 14, 2020 (Member, Author)

ok -- install/get of subdatasets -- done. Went a little "nuts" with it:

  • as of de36ea9 -- parallelizes only at the level of immediate subdatasets of a dataset
  • as of 297d422 (the actually interesting part is in ba9d6ff) -- parallelizes across the entire hierarchy.

Here are some examples for those who are not prone to epileptic seizures (a good number of jumping progress bars -- "result" record rendering is not yet de-interfered from the logs, so it interferes from time to time). This is running from my laptop on my not-so-fast broadband; didn't try yet on some beefier connection. Change in behavior also: now the progress bar appears right from the beginning, but "Total" grows as we discover more subdatasets to install ;-) (note -- the hcp one already had some pre-installed locally, so those got skipped)
asciicast

@yarikoptic commented Oct 14, 2020 (Member, Author)

BUG: if there is nothing to be installed (all subdatasets are already in place), it never exits:
$> datalad install -J 10 -r ///hcp-openaccess
[ERROR  ] target path already exists and not empty, refuse to clone into target path [install(/mnt/scrap/tmp/hcp-openaccess)]
install(error): /mnt/scrap/tmp/hcp-openaccess (dataset) [target path already exists and not empty, refuse to clone into target path]
[INFO   ] Installing Dataset(/mnt/scrap/tmp/hcp-openaccess) to get /mnt/scrap/tmp/hcp-openaccess recursively
Installing: 0.00 datasets [00:00, ? datasets/s]

CPU isn't busy; py-spy shows:

  2.00%   2.00%   0.160s    0.160s   __call__ (datalad/support/parallel.py:308)
  0.00%   2.00%   0.000s    0.160s   generator_func (datalad/interface/utils.py:400)
  0.00%   2.00%   0.000s    0.160s   __call__ (datalad/distribution/get.py:899)
  0.00%   2.00%   0.000s    0.160s   _process_results (datalad/interface/utils.py:539)
  0.00%   2.00%   0.000s    0.160s   __call__ (datalad/distribution/install.py:394)
  0.00%   2.00%   0.000s    0.160s   main (datalad/cmdline/main.py:526)
  0.00%   2.00%   0.000s    0.160s   _install_targetpath (datalad/distribution/get.py:687)
  0.00%   2.00%   0.000s    0.160s   __iter__ (datalad/support/parallel.py:259)
  0.00%   2.00%   0.000s    0.160s   <module> (datalad:11)
  0.00%   2.00%   0.000s    0.160s   call_from_parser (datalad/interface/base.py:786)
  0.00%   2.00%   0.000s    0.160s   _recursive_install_subds_underneath (datalad/distribution/get.py:588)
  0.00%   2.00%   0.000s    0.160s   __call__ (datalad/distribution/install.py:249)

edit:
Seems to be related to the producer, since the logs show

[DEBUG  ] Did nothing useful, sleeping. Have False {} True True

so the producer never finished, strange... and it is not specific to having the super already pre-installed -- it does not even work on ///openneuro on smaug! weird

@yarikoptic commented Oct 14, 2020 (Member, Author)

I thought this was related to the stall, but it is not. Without the fix of #4910 (fasteners version), running test_parallel.py showed

  File "/home/yoh/proj/datalad/datalad-master/datalad/interface/utils.py", line 539, in _process_results
    for res in results:
  File "/home/yoh/proj/datalad/datalad-master/datalad/core/local/create.py", line 408, in __call__
    fake_dates=fake_dates
  File "/home/yoh/proj/datalad/datalad-master/datalad/support/repo.py", line 154, in __call__
    instance = type.__call__(cls, *new_args, **new_kwargs)
  File "/home/yoh/proj/datalad/datalad-master/datalad/support/annexrepo.py", line 301, in __init__
    self._allow_local_urls()
  File "/home/yoh/proj/datalad/datalad-master/datalad/support/annexrepo.py", line 330, in _allow_local_urls
    'local')
  File "/home/yoh/proj/datalad/datalad-master/datalad/config.py", line 744, in add
    protocol=StdOutErrCapture)
  File "/home/yoh/proj/datalad/datalad-master/datalad/config.py", line 694, in _run
    with InterProcessLock(lockfile, logger=lgr):
TypeError: __init__() got an unexpected keyword argument 'logger'

so part of the bug: if there was an exception, we might not want to resort to the similar strategy of "continue by default" but really re-raise it as soon as it appears? But in general it is something to be "resilient" about in either case -- it must not stall; it should finish even if all consumers blow up, so TODO.
edit: Even with updated fasteners, datalad.support.tests.test_parallel.test_creatsubdatasets is stalling as well! I blame the Python version difference (smaug 3.7.3, local 3.8.5) ATM

edit 2: confirmed locally -- stalls in conda env with python 3.6.10

edit 3: eh crap -- seems like some interference between Thread/ThreadPoolExecutor and WitlessRunner's async implementation; the git config call seems to never return, and it takes a few Ctrl-C's for that to happen:

$> DATALAD_LOG_LEVEL=debug python -m nose -s -v -x datalad/support/tests/test_parallel.py:test_creatsubdatasets
...
[DEBUG  ] Async run ['git', 'config', '-z', '-l', '--show-origin'] 
[DEBUG  ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds0/subsub1 
[DEBUG  ] Launching process ['git', 'config', '-z', '-l', '--show-origin'] 
[DEBUG  ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds0/subsub0 
[DEBUG  ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds9 
[DEBUG  ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds8 
[DEBUG  ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds7 
[DEBUG  ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds6 
[DEBUG  ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds5 
[DEBUG  ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds4 
[DEBUG  ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds3 
[DEBUG  ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds2 
[DEBUG  ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds1 
[DEBUG  ] Submitting worker future for /home/yoh/.tmp/datalad_temp_test_creatsubdatasetsva5ht_g3/subds0 
[DEBUG  ] Process 3810732 started 
[DEBUG  ] Waiting for process 3810732 to complete 

^C^C[DEBUG  ] Printing versioning information collected so far 
[DEBUG  ] Async run ['/home/yoh/anaconda-5.2.0-2.7/envs/datalad-py3.6/bin/git', 'version'] 
[DEBUG  ] Launching process ['/home/yoh/anaconda-5.2.0-2.7/envs/datalad-py3.6/bin/git', 'version'] 
[DEBUG  ] Process 3810732 exited with return code 0 

eh heh

edit: stall present with python 3.7.9 and is gone with 3.8.0

@yarikoptic commented (Member, Author)

uff... tried to bisect and it was a pain since many points in history aren't "usable"... narrowed it down and spotted it in the logs (great that the fix to the build was near the fix for "my" issue). Apparently it is an RTFMed limitation prior to 3.8; from the docs: "To handle signals and to execute subprocesses, the event loop must be run in the main thread." -- a note which is actually still there in the current version... BUT there was v3.8.0b2~37 "bpo-35621: Support running subprocesses in asyncio when loop is executed in non-main thread (GH-14344)"
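
A minimal reproducer of that limitation (illustrative only, not datalad code): driving an asyncio subprocess from a non-main thread stalls or errors before bpo-35621 landed, but works on 3.8+.

import asyncio
from concurrent.futures import ThreadPoolExecutor

async def git_version():
    proc = await asyncio.create_subprocess_exec(
        "git", "version", stdout=asyncio.subprocess.PIPE)
    out, _ = await proc.communicate()
    return out.decode().strip()

def in_thread():
    # each thread drives its own event loop
    return asyncio.new_event_loop().run_until_complete(git_version())

with ThreadPoolExecutor(max_workers=1) as pool:
    # hangs/raises on Python < 3.8; prints the git version on 3.8+
    print(pool.submit(in_thread).result())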

Here is how I used my (yet to be pushed) helper to build cpython, create a virtualenv, and pip install datalad into it -- so in theory (in my case it ended up mostly manual jumping) it could be used for bisection:
(git)lena:~/proj/misc/cpython[tags/v3.8.0b2~29]git
$> CP_COMMIT=cb083f7cdf604c1d9d264f387f9e8846bc953eb3 ~/proj/datalad/datalad-master/tools/ci/bisect-python.sh 'python3 ~/proj/datalad/datalad-master/datalad/support/tests/test_parallel.py && exit 1 || exit 0' 
Python source: /home/yoh/proj/misc/cpython   DataLad: /home/yoh/proj/datalad/datalad-master
INFO: Building python
[detached HEAD ec8b8cacdf0] bpo-37221: Add PyCode_NewWithPosOnlyArgs to be used internally and set PyCode_New as a compatibility wrapper (GH-13959) (#14505)
 Author: Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com>
 Date: Mon Jul 1 04:29:14 2019 -0700
 8 files changed, 79 insertions(+), 34 deletions(-)
 create mode 100644 Misc/NEWS.d/next/C API/2019-06-11-02-50-38.bpo-37221.4tClQT.rst
INFO: Creating virtualenv
All ready:
build:   /home/yoh/proj/misc/cpython-builds/v3.8.0b1-180-gbf8cb318035+cb083f
venv:    /home/yoh/proj/datalad/datalad-master/venvs/build-v3.8.0b1-180-gbf8cb318035+cb083f
source:  source "/home/yoh/proj/datalad/datalad-master/venvs/build-v3.8.0b1-180-gbf8cb318035+cb083f/bin/activate"
python:  /home/yoh/proj/datalad/datalad-master/venvs/build-v3.8.0b1-180-gbf8cb318035+cb083f/bin/python3
ver:     Python 3.8.0b1+
datalad: 0.13.4-264-gf47d566e4

INFO: running bisection command python3 ~/proj/datalad/datalad-master/datalad/support/tests/test_parallel.py && exit 1 || exit 0
1
1
CP_COMMIT=cb083f7cdf604c1d9d264f387f9e8846bc953eb3    84.48s user 19.64s system 178% cpu 58.434 total
1 30241 ->1 [1].....................................:Wed 14 Oct 2020 07:35:23 PM EDT:.
(git)lena:~/proj/misc/cpython[tags/v3.8.0b2~29]git
$> cd .
CODE_OF_CONDUCT.md  Mac/             Objects/   README.rst     config.status*  m4/             python-config.py
Doc/                Makefile         PC/        Tools/         config.sub*     pybuilddir.txt  python-gdb.py
Grammar/            Makefile.pre     PCbuild/   aclocal.m4     configure*      pyconfig.h      setup.py
Include/            Makefile.pre.in  Parser/    build/         configure.ac    pyconfig.h.in
LICENSE             Misc/            Programs/  config.guess*  install-sh*     python*
Lib/                Modules/         Python/    config.log     libpython3.8.a  python-config
1 30242 [1].....................................:Wed 14 Oct 2020 07:35:31 PM EDT:.
(git)lena:~/proj/misc/cpython[undefined]git
$> git co bf8cb31803558f1105efb15b0ee4bd184f3218c8^
Warning: you are leaving 1 commit behind, not connected to
any of your branches:

  ec8b8cacdf0 bpo-37221: Add PyCode_NewWithPosOnlyArgs to be used internally and set PyCode_New as a compatibility wrapper (GH-13959) (#14505)

If you want to keep it by creating a new branch, this may be a good time
to do so with:

 git branch <new-branch-name> ec8b8cacdf0

HEAD is now at ffcc161c753 bpo-29505: Add more fuzzing for re.compile, re.load and csv.reader (GH-14255)
1 30243 [1].....................................:Wed 14 Oct 2020 07:35:38 PM EDT:.
(git)lena:~/proj/misc/cpython[tags/v3.8.0b2~30]git
$> CP_COMMIT=cb083f7cdf604c1d9d264f387f9e8846bc953eb3 ~/proj/datalad/datalad-master/tools/ci/bisect-python.sh 'python3 ~/proj/datalad/datalad-master/datalad/support/tests/test_parallel.py && exit 1 || exit 0'
Python source: /home/yoh/proj/misc/cpython   DataLad: /home/yoh/proj/datalad/datalad-master
INFO: Building python
[detached HEAD 1576d612592] bpo-37221: Add PyCode_NewWithPosOnlyArgs to be used internally and set PyCode_New as a compatibility wrapper (GH-13959) (#14505)
 Author: Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com>
 Date: Mon Jul 1 04:29:14 2019 -0700
 8 files changed, 79 insertions(+), 34 deletions(-)
 create mode 100644 Misc/NEWS.d/next/C API/2019-06-11-02-50-38.bpo-37221.4tClQT.rst
INFO: Creating virtualenv
All ready:
build:   /home/yoh/proj/misc/cpython-builds/v3.8.0b1-179-gffcc161c753+cb083f
venv:    /home/yoh/proj/datalad/datalad-master/venvs/build-v3.8.0b1-179-gffcc161c753+cb083f
source:  source "/home/yoh/proj/datalad/datalad-master/venvs/build-v3.8.0b1-179-gffcc161c753+cb083f/bin/activate"
python:  /home/yoh/proj/datalad/datalad-master/venvs/build-v3.8.0b1-179-gffcc161c753+cb083f/bin/python3
ver:     Python 3.8.0b1+
datalad: 0.13.4-264-gf47d566e4

INFO: running bisection command python3 ~/proj/datalad/datalad-master/datalad/support/tests/test_parallel.py && exit 1 || exit 0
1
1
exceeded
Terminated
CP_COMMIT=cb083f7cdf604c1d9d264f387f9e8846bc953eb3    89.67s user 20.66s system 173% cpu 1:03.72 total

So, the easiest way forward is probably simply to make this new thread-based execution available starting with Python 3.8. It would require some RF, some of which I was going to do anyway (e.g. move logging outside), so it would be easy to just provide a "serial" producer/consumer tandem. Might also be handy to have it "serial" in the case of -J 0 regardless of Python version, to make it easier to debug (a sketch of such gating below).
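
Roughly, the gating could look like this (a hypothetical sketch, not the eventual API), with -J 0 mapping to the serial tandem on any Python version:

import sys
from concurrent.futures import ThreadPoolExecutor

def run(producer, consumer, jobs):
    # threads + asyncio subprocesses only cooperate on Python >= 3.8
    # (bpo-35621); also honor an explicit -J 0 request for serial execution
    if jobs == 0 or sys.version_info < (3, 8):
        for item in producer:
            yield consumer(item)  # serial producer/consumer tandem
        return
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        for future in [pool.submit(consumer, i) for i in producer]:
            yield future.result()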

@yarikoptic commented (Member, Author)

tests were failing since the code pulled in from dandi wanted pytest. So I rebased, removing IteratorWithAggregation. I think ProducerConsumer can be used if something like IteratorWithAggregation is really needed (although at the cost of needing Python >= 3.8 and additional thread(s) for the consumer), or we could pull it back in whenever the need arises (ATM it was not used)

yarikoptic added a commit to yarikoptic/datalad that referenced this pull request Oct 18, 2020
Discovered by chance while troubleshooting failing
datalad#5022
which, due to "out of order" completion of get, swapped datasets between
noannex and annexed.  Was fun to debug .... heh ;)
@codecov bot commented Oct 19, 2020

Codecov Report

Merging #5022 into master will increase coverage by 0.17%.
The diff coverage is 95.09%.


@@            Coverage Diff             @@
##           master    #5022      +/-   ##
==========================================
+ Coverage   89.79%   89.96%   +0.17%     
==========================================
  Files         293      297       +4     
  Lines       41274    41779     +505     
==========================================
+ Hits        37060    37585     +525     
+ Misses       4214     4194      -20     
Impacted Files Coverage Δ
datalad/interface/common_cfg.py 100.00% <ø> (ø)
datalad/support/annexrepo.py 89.13% <ø> (ø)
datalad/support/collections.py 62.50% <62.50%> (ø)
datalad/support/tests/test_parallel.py 94.96% <94.96%> (ø)
datalad/support/parallel.py 95.71% <95.71%> (ø)
datalad/log.py 95.51% <97.82%> (+5.37%) ⬆️
datalad/core/distributed/tests/test_push.py 98.01% <100.00%> (+<0.01%) ⬆️
datalad/core/local/save.py 98.82% <100.00%> (+0.08%) ⬆️
datalad/core/local/tests/test_save.py 98.00% <100.00%> (ø)
datalad/distribution/get.py 97.65% <100.00%> (+0.18%) ⬆️
... and 20 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 635b6af...9045227.

…mer tandem

The idea is to feed the consumer with items from the producer as soon as they arrive,
and update log_progress with the total only when the producer is finished producing
(well -- we could have updated the total upon each iteration, but I am not sure that is
worthwhile since it would be providing an ETA which would keep rising).
In general it "works", but it shows the problem of interference from logs
(and possibly stdout from running procedures) with progress reporting.
Delaying dealing with that for now
…rogress

Based on threads, it parallelizes execution of the producer (e.g. of paths)
and the consumer (e.g. create) to operate on subdatasets in parallel threads.
Given that the underlying git/git-annex processes also consume CPU, overall
we can escape the 1-CPU "performance" bottleneck in cases of parallel operation
across (sub)datasets.
…ions

that we do not touch the super-dataset until we are done creating sub-datasets, and then
just use a single save within the super-dataset to save all freshly created subdatasets
…s, sleep time will increment upon no activity
@kyleam left a comment (Contributor)

I played around with this quite a bit (mostly small-scale addurls and save operations) on both 3.7 and 3.8. Aside from what I consider to be an unwarranted warning shown on 3.7, I didn't notice any issues.

Review thread on datalad/interface/common_cfg.py (outdated, resolved)
Review thread on datalad/support/parallel.py (resolved)
yarikoptic added a commit to yarikoptic/datalad that referenced this pull request Oct 23, 2020
We already have a similar option for add_archive_content and a few places in the
crawler.  It is a very much needed feature while crawling/addurl-ing big
datasets, which might not even fit on a local hard drive -- we need to drop
content as soon as it is annexed (well -- later, maybe after we extract
metadata).  For now I am just adding it as a bool, but we might later
indeed specify when to drop it (right after it is added, or after the dataset is
saved and possibly metadata extracted via some hook or explicit operation, etc).

This PR is based on datalad#5022 changes, since that is where addurls was RFed
to accommodate parallel execution, and I would prefer not to mess with resolving
conflicts later on
…o serial

For some reason (did not investigate) "datalad save" arrives here with jobs=1 and not
"auto", see datalad#5022 (comment).
So let's just not bother the user with such a warning if jobs == 1
* origin/master: (36 commits)
  ENH: addurl - store original urlfile, not resolved full path
  OPT: Delay Dataset.repo access until necessary
  TST: Unicode setup on appveyor is too special
  TST: Mark cached dataset tests as broken
  Discover and run all tests tests
  TST: Robustify re timezone reporting
  more tests
  TST: Minor tweaking of utility tests for a user win10 installation
  TST: Remove a testrepo setup special case on windows
  BF: Make sure final suppression message appears
  BF: Invaluable contribution
  BF: subdatasets: Fix recent path recoding change
  CLN: subdatasets: Fix a comment typo
  CLN: subdatasets: Drop inaccurate comment
  OPT: get_content_info: Speed up ls-files by dropping unneeded flags
  OPT: Reuse dataset instances, avoid repeated function calls
  TST: Adjust test for new suppression message rate limiter
  RF: Simplify submodule path recoding
  RF: Decomplicate call to resolve_path()
  BF: Ensure all paths are Path instances
  ...

 Conflicts:
	datalad/plugin/addurls.py -- it was about '{url_file} -> {urlfile}' fix in master, adopted here
@yarikoptic commented (Member, Author)

Thanks @kyleam for the review and commits. Since no single approval yet, not sure how to proceed: I guess I will just patiently wait for some brave soul to approve and will merge then ;) Hopefully @mih or @bpoldrack find some 3.8 lying around to give it a spin as well

@kyleam commented Oct 27, 2020 (Contributor)

Since no single approval yet, not sure how to proceed: I guess I will just patiently wait for some brave soul to approve and will merge then

Well, you took care of my one objection (thanks), so consider me an approval. For this sort of PR, it might be worth waiting for more than one person to give feedback, but that's of course up to you.

@mih left a comment (Member)

I went through the code now, and have only a minor comment. I'll give it a go on a big dataset next.

'.', recursive=True, get_data=False, result_xfm='datasets')
subs = top.get('.', recursive=True, get_data=False, result_xfm='datasets')
# order for '.' should not be relied upon, so sort by path
sub, subsub, subnoannex = sorted(subs, key=lambda ds: ds.path)
@mih commented (Member)

Oi. That is a good point. More generally, this points to the fact that the order of results is now "random". I think we have a few places in the code that make use of (implicit) order assumptions. I cannot pinpoint any right now, but I am confident that there are a few. Something to be aware of, as those might fail in strange ways once parallelization is more widespread.

@yarikoptic commented (Member, Author)

yeah, I was not 100% certain about this one, i.e. whether I should have introduced some additional option preserve_order=True to start with -- that would have been the most "reliable and conservative". It would cause a bit more code complexity and some minor run-time effect (on results reporting only, I guess, since results would need to "stage" while waiting for earlier ones to complete). But then I thought -- heck, it is against master, so we better spot those assumptions whenever testing breaks, and address them.
But yeah -- if you feel we better have that option and preserve the current default behavior, let me know -- I am ok to implement it (a sketch below). Might come in handy in some use cases.
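
For reference, a hypothetical preserve_order=True mode could be as simple as yielding strictly in submission order, at the cost of "staging" finished results behind slower earlier ones:

from concurrent.futures import ThreadPoolExecutor

def ordered_results(consumer, items, jobs=5):
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(consumer, item) for item in items]
        for f in futures:
            yield f.result()  # block until this item finishes: order preserved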

@yarikoptic commented (Member, Author)

note: that option would then need to be channeled to the top interfaces (get etc), an additional burden on the interface(s), etc... might be easier to spot/fix, insofar as it was only a few tests (if not one) which had that. For a single path/result it shouldn't matter. For recursive invocations, or on directories, order cannot be relied upon anyway... so still not sure if there is more pain than gain in adding it

@yarikoptic commented (Member, Author)

might be worth though adding a test run (or making it the default?) that parallelizes by default (to at least 2 or 3). That could allow spotting other places where order might have been relied upon. (I think I got this one initially when I made get parallelize by default, but then I went for a conservative 1 as the default, so it would only run the producer in a separate thread; there is only one thread for consumers, so order is preserved now by default)

Review thread on datalad/distribution/get.py (resolved)
Thanks @mih for spotting that odd formatting in the diff
@mih commented Oct 28, 2020 (Member)

Gave it a first try. First I was disappointed. I tried running this in the root of the HCP superdataset

datalad get -J 5 -n HCP1200/*

Runs in strictly serial fashion. However, this approach in the same dataset works and gives me parallel operation:

datalad get -J 5 -n -r -R 1 .

However, I see this at the top of the output:

[INFO   ] Installing Dataset(/tmp/d3) to get /tmp/d3 recursively 
Installing (6 skiControlSocket /home/mih/.cache/datalad/sockets/dfda567d already exists, disabling multiplexing                                                                         
ControlSocket /home/mih/.cache/datalad/sockets/dfda567d already exists, disabling multiplexing
ControlSocket /home/mih/.cache/datalad/sockets/dfda567d already exists, disabling multiplexing
ControlSocket /home/mih/.cache/datalad/sockets/dfda567d already exists, disabling multiplexing
[INFO   ] Scanning for unlocked files (this may take some time)                             

I will try again with a fresh start.

I did, and the issue does not go away, even when there is no socket at the beginning.

@mih commented Oct 28, 2020 (Member)

FTR: Cleaning up my previous trials, I realized that uninstall -r would greatly benefit from parallelization!

@yarikoptic commented (Member, Author)

FTR: Cleaning up my previous trials, I realized that uninstall -r would greatly benefit from parallelization!

virtually any datalad command which has a recursive option or otherwise operates across multiple (sub)datasets would benefit from parallelization, even subdatasets itself ;) But then we would get into "nested parallelization", which we might want to prevent altogether (this PR doesn't introduce any kind of guarding for that; a possible guard is sketched below).
We might even find some commands where a sequence of otherwise independent git/git-annex invocations (collecting different information) could benefit from parallelization of those independent steps, but the ProducerConsumer pattern might be suboptimal for those.
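
If we did want such a guard, one hypothetical approach (made-up names) would be a thread-local flag set in worker threads, so any nested invocation falls back to serial:

import threading
from concurrent.futures import ThreadPoolExecutor

_guard = threading.local()

def _consume(consumer, item):
    _guard.active = True  # mark this worker thread as already parallelized
    return consumer(item)

def run(producer, consumer, jobs=5):
    if getattr(_guard, "active", False):
        jobs = 0  # nested invocation from within a worker thread: go serial
    if jobs == 0:
        return [consumer(item) for item in producer]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return [f.result()
                for f in [pool.submit(_consume, consumer, i) for i in producer]]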

@yarikoptic commented (Member, Author)

Re

ControlSocket /home/mih/.cache/datalad/sockets/dfda567d already exists, disabling multiplexing

I wonder if it could possibly be some racing within annex. Please provide WTF and/or test with the most recent annex snapshot. Google does not give immediate clues on it being a known/fixed issue, although there are some linked to the assistant -- may be the same (maybe it does run in parallel across different repos too)

@mih commented Oct 29, 2020 (Member)

I am running git-annex 8.20201007 from Debian sid

WTF

configuration <SENSITIVE, report disabled by configuration>

credentials

  • keyring:
    • active_backends:
      • SecretService Keyring
      • PlaintextKeyring with no encyption v.1.0 at /home/mih/.local/share/python_keyring/keyring_pass.cfg
    • config_file: /home/mih/.config/python_keyring/keyringrc.cfg
    • data_root: /home/mih/.local/share/python_keyring

datalad

  • full_version: 0.13.4.dev432-g1bd6e
  • version: 0.13.4.dev432

dataset

  • id: 2e2a8a70-3eaa-11ea-a9a5-b4969157768c
  • metadata: <SENSITIVE, report disabled by configuration>
  • path: /tmp/d3
  • repo: GitRepo

dependencies

  • annexremote: 1.4.3
  • appdirs: 1.4.4
  • boto: 2.49.0
  • cmd:7z: 16.02
  • cmd:annex: 8.20201007
  • cmd:bundled-git: UNKNOWN
  • cmd:git: 2.28.0
  • cmd:system-git: 2.28.0
  • cmd:system-ssh: 8.1p1
  • humanize: 2.6.0
  • iso8601: 0.1.13
  • keyring: 21.4.0
  • keyrings.alt: 3.4.0
  • msgpack: 1.0.0
  • requests: 2.24.0
  • tqdm: 4.49.0
  • wrapt: 1.12.1

environment

  • GIT_EDITOR: vim
  • LANG: en_US.UTF-8
  • LANGUAGE: en_US:en
  • LC_CTYPE: en_US.UTF-8
  • PATH: /home/mih/env/datalad3-dev/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/home/mih/bin:/home/mih/bin/pycharm/bin

extensions

  • container:
    • description: Containerized environments
    • entrypoints:
      • datalad_container.containers_add.ContainersAdd:
        • class: ContainersAdd
        • load_error: None
        • module: datalad_container.containers_add
        • names:
          • containers-add
          • containers_add
      • datalad_container.containers_list.ContainersList:
        • class: ContainersList
        • load_error: None
        • module: datalad_container.containers_list
        • names:
          • containers-list
          • containers_list
      • datalad_container.containers_remove.ContainersRemove:
        • class: ContainersRemove
        • load_error: None
        • module: datalad_container.containers_remove
        • names:
          • containers-remove
          • containers_remove
      • datalad_container.containers_run.ContainersRun:
        • class: ContainersRun
        • load_error: None
        • module: datalad_container.containers_run
        • names:
          • containers-run
          • containers_run
    • load_error: None
    • module: datalad_container
    • version: 1.0.1
  • ebrains:
    • description: HBP/EBRAINS support
    • entrypoints:
      • datalad_ebrains.kg2ds.KnowledgeGraph2Dataset:
        • class: KnowledgeGraph2Dataset
        • load_error: None
        • module: datalad_ebrains.kg2ds
        • names:
          • ebrains-kg2ds
          • ebrains_kg2ds
    • load_error: None
    • module: datalad_ebrains
    • version: 0+untagged.3.gd20cc00
  • metalad:
    • description: DataLad semantic metadata command suite
    • entrypoints:
      • datalad_metalad.aggregate.Aggregate:
        • class: Aggregate
        • load_error: None
        • module: datalad_metalad.aggregate
        • names:
          • meta-aggregate
          • meta_aggregate
      • datalad_metalad.dump.Dump:
        • class: Dump
        • load_error: None
        • module: datalad_metalad.dump
        • names:
          • meta-dump
          • meta_dump
      • datalad_metalad.extract.Extract:
        • class: Extract
        • load_error: None
        • module: datalad_metalad.extract
        • names:
          • meta-extract
          • meta_extract
    • load_error: None
    • module: datalad_metalad
    • version: 0.2.1
  • osf:
    • description: DataLad extension for OSF support
    • entrypoints:
      • datalad_osf.create_sibling_osf.CreateSiblingOSF:
        • class: CreateSiblingOSF
        • load_error: None
        • module: datalad_osf.create_sibling_osf
        • names:
          • create-sibling-osf
          • create_sibling_osf
      • datalad_osf.credentials.OSFCredentials:
        • class: OSFCredentials
        • load_error: None
        • module: datalad_osf.credentials
        • names:
          • osf-credentials
          • osf_credentials
    • load_error: None
    • module: datalad_osf
    • version: 0.2.0+12.g8abcae0

git-annex

  • build flags:
    • Assistant
    • Webapp
    • Pairing
    • Inotify
    • DBus
    • DesktopNotify
    • TorrentParser
    • MagicMime
    • Feeds
    • Testsuite
    • S3
    • WebDAV
  • dependency versions:
    • aws-0.22
    • bloomfilter-2.0.1.0
    • cryptonite-0.26
    • DAV-1.3.4
    • feed-1.3.0.1
    • ghc-8.8.4
    • http-client-0.6.4.1
    • persistent-sqlite-2.10.6.2
    • torrent-10000.1.1
    • uuid-1.3.13
    • yesod-1.6.1.0
  • key/value backends:
    • SHA256E
    • SHA256
    • SHA512E
    • SHA512
    • SHA224E
    • SHA224
    • SHA384E
    • SHA384
    • SHA3_256E
    • SHA3_256
    • SHA3_512E
    • SHA3_512
    • SHA3_224E
    • SHA3_224
    • SHA3_384E
    • SHA3_384
    • SKEIN256E
    • SKEIN256
    • SKEIN512E
    • SKEIN512
    • BLAKE2B256E
    • BLAKE2B256
    • BLAKE2B512E
    • BLAKE2B512
    • BLAKE2B160E
    • BLAKE2B160
    • BLAKE2B224E
    • BLAKE2B224
    • BLAKE2B384E
    • BLAKE2B384
    • BLAKE2BP512E
    • BLAKE2BP512
    • BLAKE2S256E
    • BLAKE2S256
    • BLAKE2S160E
    • BLAKE2S160
    • BLAKE2S224E
    • BLAKE2S224
    • BLAKE2SP256E
    • BLAKE2SP256
    • BLAKE2SP224E
    • BLAKE2SP224
    • SHA1E
    • SHA1
    • MD5E
    • MD5
    • WORM
    • URL
    • X*
  • operating system: linux x86_64
  • remote types:
    • git
    • gcrypt
    • p2p
    • S3
    • bup
    • directory
    • rsync
    • web
    • bittorrent
    • webdav
    • adb
    • tahoe
    • glacier
    • ddar
    • git-lfs
    • httpalso
    • hook
    • external
  • supported repository versions:
    • 8
  • upgrade supported from repository versions:
    • 0
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
  • version: 8.20201007

location

  • path: /tmp/d3
  • type: dataset

metadata_extractors

  • annex (datalad 0.13.3):
    • distribution: datalad 0.13.3
    • load_error: None
    • module: datalad.metadata.extractors.annex
    • version: None
  • audio (datalad 0.13.3):
    • distribution: datalad 0.13.3
    • load_error: No module named 'mutagen' [audio.py::17]
    • module: datalad.metadata.extractors.audio
  • datacite (datalad 0.13.3):
    • distribution: datalad 0.13.3
    • load_error: None
    • module: datalad.metadata.extractors.datacite
    • version: None
  • datalad_core (datalad 0.13.3):
    • distribution: datalad 0.13.3
    • load_error: None
    • module: datalad.metadata.extractors.datalad_core
    • version: None
  • datalad_rfc822 (datalad 0.13.3):
    • distribution: datalad 0.13.3
    • load_error: None
    • module: datalad.metadata.extractors.datalad_rfc822
    • version: None
  • exif (datalad 0.13.3):
    • distribution: datalad 0.13.3
    • load_error: No module named 'exifread' [exif.py::16]
    • module: datalad.metadata.extractors.exif
  • frictionless_datapackage (datalad 0.13.3):
    • distribution: datalad 0.13.3
    • load_error: None
    • module: datalad.metadata.extractors.frictionless_datapackage
    • version: None
  • image (datalad 0.13.3):
    • distribution: datalad 0.13.3
    • load_error: No module named 'PIL' [image.py::16]
    • module: datalad.metadata.extractors.image
  • metalad_annex (datalad-metalad 0.2.1):
    • distribution: datalad-metalad 0.2.1
    • load_error: None
    • module: datalad_metalad.extractors.annex
    • version: None
  • metalad_core (datalad-metalad 0.2.1):
    • distribution: datalad-metalad 0.2.1
    • load_error: None
    • module: datalad_metalad.extractors.core
    • version: None
  • metalad_custom (datalad-metalad 0.2.1):
    • distribution: datalad-metalad 0.2.1
    • load_error: None
    • module: datalad_metalad.extractors.custom
    • version: None
  • metalad_runprov (datalad-metalad 0.2.1):
    • distribution: datalad-metalad 0.2.1
    • load_error: None
    • module: datalad_metalad.extractors.runprov
    • version: None
  • xmp (datalad 0.13.3):
    • distribution: datalad 0.13.3
    • load_error: No module named 'libxmp' [xmp.py::20]
    • module: datalad.metadata.extractors.xmp

metadata_indexers

python

  • implementation: CPython
  • version: 3.8.5

system

@yarikoptic commented (Member, Author)

I wonder if could possibly be some racing within annex.

I will blame the use of get as what threw me off into blaming annex, but it has nothing to do with it, since this comes up during git clone. In all likelihood it is our sshrun (or, to be precise, the sshconnector.py code) which doesn't have any locking against parallel execution and thus might try to bring up all those multiple muxes in parallel in this racy place:

        if self.ctrl_path.exists():
            return False

        # create ssh control master command
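        # NB: the window between the exists() check above and starting the
        # control master below is exactly where parallel threads race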
        cmd = ["ssh"] + self._ssh_open_args + self._ssh_args + [self.sshri.as_str()]

        # start control master:
        lgr.debug("Opening %s by calling %s", self, cmd)
        proc = Popen(cmd)
        stdout, stderr = proc.communicate(input="\n")  # why the f.. this is necessary?

Moreover, if I Ctrl-C that install process, some (probably those which "complained") ssh -fN -o ControlMaster=auto processes are left behind.

Anyway -- it had better be addressed in a separate PR, since it doesn't depend on this functionality and would be easier to review in isolation. I will attempt one quickly now (if it doesn't come today -- I failed ;))

It was reported in datalad#5022 (comment)
that parallel install/get over ssh results in a flood of

  ControlSocket ... already exists, disabling multiplexing

and abandoned ssh -fN -o ControlMaster=auto ... processes.  This stems from the fact
that MultiplexSSHConnection is in no way thread-safe, so the same instance is used across
multiple threads and races in the check of the socket path existing and then acquiring it.

Note: we use threading.Lock here, so this solution is purely for safety across
threads (and more lightweight), and not across multiple processes.  In
principle the construct we have already (not verified to work):

        from .locking import lock_if_check_fails
        with lock_if_check_fails(
            self.ctrl_path.exists,
            self._ctrl_path,
            blocking=False
        ) as (check, lock):
            if check or not (lock and lock.acquired):
                return False
            ... the rest of the .open() call

could be used to establish cross-process file-based locking in addition
(but that would not work for thread safety since it uses InterProcessLock).
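
For illustration, a minimal sketch of that threading.Lock guard (hypothetical class and host, not the actual MultiplexSSHConnection code):

import threading
from pathlib import Path
from subprocess import Popen

class Connection:
    def __init__(self, ctrl_path, host):
        self.ctrl_path = Path(ctrl_path)
        self.host = host
        self._lock = threading.Lock()  # per-instance, guards threads only

    def open(self):
        if self.ctrl_path.exists():
            return False  # fast path: the control master is already up
        with self._lock:
            # re-check under the lock -- another thread may have just won
            if self.ctrl_path.exists():
                return False
            cmd = ["ssh", "-fN", "-o", "ControlMaster=auto",
                   "-o", "ControlPath=%s" % self.ctrl_path, self.host]
            Popen(cmd).communicate()  # start the control master exactly once
            return True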
@yarikoptic commented (Member, Author)

ok, pushed 9045227 straight here since it is "multithread specific", I had one commit "staged" anyway, and it would not anyhow benefit multiple processes etc. @mih -- good to try now ;)

@yarikoptic commented (Member, Author)

I only wonder now what other constructs aren't thread-safe. With parallelization across datasets we should be ok for dataset-bound configs etc. Logging I believe is thread-safe too. But some global objects (e.g. datalad.cfg) might not be thread-safe and could be affected (though I guess dataset-specific code which is parallelized now should not modify the global cfg). What about our Flyweight? Insofar I have not run into any issue, but we might discover more gotchas as we use it more.

@yarikoptic commented (Member, Author)

Ok, this PR has 1 approval, seems like a good amount of "trials", the thread-unsafe ssh was fixed, and I bet we will discover more the more we use it. So if it stays green tomorrow morning (yellow now), I will be merging it unless strong objections are expressed while I am asleep ;)

    if isinstance(recursion_limit, int) and recursion_limit <= 0:
        return
    if jobs == "auto":
        # be safe -- there might be ssh logins etc. So no parallel
        jobs = 0
@yarikoptic commented (Member, Author)

hm, re-reviewing my code I stumbled upon this. I guess it should be removed, since I think it came from the time when "auto" by default meant 5 jobs. ATM "auto" should be a single job, which should be ok for install. Or if not removed -- just making a mental note that I saw it ;)

@yarikoptic commented (Member, Author)

and here we go....

@yarikoptic yarikoptic merged commit 6053a58 into datalad:master Oct 30, 2020
@yarikoptic yarikoptic deleted the enh-iter-progress branch November 2, 2020 14:28
Labels
do-in-parallel enhancement merge-if-ok OP considers this work done, and requests review/merge performance Improve performance of an existing feature UX user experience
Development

Successfully merging this pull request may close these issues.

addurls: OPT - parallelize creation of subdatasets
3 participants