Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BF: do not use BaseDownloader instance wide InterProcessLock - resolves stalling or errors during parallel installs #6507

Merged
merged 1 commit into from
Feb 27, 2022

Conversation

yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Feb 25, 2022

apparently that does not play nice with multi-threading parallelization,
where (I guess) we instantiate and reuse the same Downloader across multiple
threads, and thus reusing the lock.

Bu instantiating the lock right before using it, albeit may be adding even more runtime
overhead, we are avoiding the problems with InterProcessLock.

Closes #6483 and also see harlowja/fasteners#91 for more
information. Filed also #6506 which observed while troubleshooting.

I am yet to try it on actual original HCP use case, but with this script
http://www.onerussian.com/tmp/ts-parallel-download.py
and this list of urls for our store http://www.onerussian.com/tmp/store-configs.txt
/bin/rm -f /tmp/out/*; head -n 1000 /tmp/store-configs.txt | python ts-parallel-download.py now works ok

edit: HCP timing on smaug:

action summary:
  install (ok: 4547)
                 datalad get -n -r HCP1200 --jobs 32  11158.83s user 7123.09s system 176% cpu 2:52:38.36 total

apparently that does not play nice with multi-threading parallelization,
where (I guess) we instantiate and reuse the same Downloader across multiple
threads, and thus reusing the lock.

Bu instantiating the lock right before using it, albeit may be adding even more runtime
overhead, we are avoiding the problems with InterProcessLock.

Closes datalad#6483 and also see harlowja/fasteners#91 for more
information
@yarikoptic yarikoptic added the semver-patch Increment the patch version when merged label Feb 25, 2022
@adswa
Copy link
Member

adswa commented Feb 25, 2022

I was about to try it on the hcp dataset on our cluster, but it runs Python 3.7, where it won't use threads... :/

@yarikoptic
Copy link
Member Author

I was about to try it on the hcp dataset on our cluster, but it runs Python 3.7, where it won't use threads... :/

can merge into master locally -- that one should be ok with 3.7 and threads.

@yarikoptic
Copy link
Member Author

fails on macs are "known" (I think) due to newer annex. Otherwise seems to be green.

@yarikoptic
Copy link
Member Author

yarikoptic commented Feb 27, 2022

ok, tested also on a time datalad install -J32 -r https://github.com/datalad-datasets/human-connectome-project-openaccess which completed in 3:41:08.86 total . no stalls/errors. Let's consider it "the right fix" and merge/release it (what to wait for?)

edit: I will also merge into master

@yarikoptic yarikoptic added the release Create a release when this pr is merged label Feb 27, 2022
@yarikoptic yarikoptic changed the title BF: do not use BaseDownloader instance wide InterProcessLock BF: do not use BaseDownloader instance wide InterProcessLock - resolves stalling or errors during parallel installs Feb 27, 2022
@yarikoptic yarikoptic merged commit bed536d into datalad:maint Feb 27, 2022
@adswa
Copy link
Member

adswa commented Mar 3, 2022

I'm trying it locally, and it seems to run smoothly with --jobs 5. I saw a download locking thing with --jobs 10, but that may have been my computer being overwhelmed with processes.

@yarikoptic
Copy link
Member Author

Thank you @adswa for testing! Locking with 10 sounds worrisome, could be some other issue though. Any interesting log messages were displayed?

@adswa
Copy link
Member

adswa commented Mar 3, 2022

Nothing that caught my eye, but it was with default info log level. I accidentally filled my hard drive just now, but once I have space again I can retry.

@yarikoptic
Copy link
Member Author

I think looking at lsof and py-spy top were also useful for me to see where in datalad prices we might be at the moment, whenever it clearly wasn't done child annex process

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release Create a release when this pr is merged semver-patch Increment the patch version when merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants