
RF: Make default number of annex jobs configurable and 1 by default #4409

Merged
merged 3 commits into maint from bf-4404 on Apr 20, 2020

Conversation

Member

@mih mih commented Apr 17, 2020

The previous default behavior (min(8, max(3, nCPUs))) could cause
issues (crashes, system stalls) due to the large number of child
processes spawned and file descriptors opened.

This change introduces a new config switch
datalad.runtime.max-annex-jobs that can be used to tailor
parallelization behavior, and effectively replaces the previously
hard-coded maximum value of 8. Moreover, the default drops to 1,
disabling parallel annex jobs out of the box in favor of potentially
slower, but more robust behavior.
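
For illustration, the cap could be raised like any other DataLad
configuration item via Git config (the value 4 and the choice of
scopes are only examples):

$> git config --global datalad.runtime.max-annex-jobs 4
$> git config --file .datalad/config datalad.runtime.max-annex-jobs 4

The first form applies user-wide; the second writes to a dataset's
committed configuration and thereby tailors the setting to that
dataset.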

The config switch is specific to annex jobs in order to leave room for
an anticipated parallelization of some functionality in DataLad itself.
This will likely need separate configuration to be able to handle the
multiplicative impact of parallelization on two levels.
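
For example (numbers purely illustrative): if DataLad itself ran 4
parallel workers and each of them invoked git-annex with 4 jobs, up to
4 x 4 = 16 concurrent annex processes could result, which is why the
two levels will likely need separate knobs.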

Implementation detail: The number of jobs is now lazily evaluated on
first use. Getting the CPU count only takes ~100ns, and adding a config
query makes this somewhat longer (still in the microsecond range), so
the per-use cost is negligible. Without lazy evaluation, the
configuration scope would be limited to the user level and could not be
tailored to individual datasets, which may be desired for datasets with
a huge number of files.
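
A minimal, hypothetical sketch of this lazy pattern (class and
config-accessor names are illustrative, not DataLad's actual code; only
the min/max clamping mirrors the change in this PR):

from multiprocessing import cpu_count

class AnnexJobsExample:
    """Hypothetical stand-in for a repo object, not DataLad's AnnexRepo."""

    def __init__(self, config):
        # `config` stands in for a dataset-aware configuration mapping
        self._config = config
        self._n_auto_jobs = None  # not resolved until first use

    @property
    def n_auto_jobs(self):
        if self._n_auto_jobs is None:
            # the (possibly dataset-level) config is consulted only here,
            # on first use, rather than at import time
            cap = int(self._config.get('datalad.runtime.max-annex-jobs', 1))
            self._n_auto_jobs = min(cap, max(3, cpu_count()))
        return self._n_auto_jobs

With the default cap of 1 this always resolves to 1; raising the cap
re-enables CPU-scaled parallelism up to that cap.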

I left the previously defined N_AUTO_JOBS symbol in place to avoid
breaking consumer code (I am not aware of any). But it now also reflects
the changed default behavior.

Fixes gh-4404
Fixes gh-4141

PR is targeting maint based on this comment: #4404 (comment)

@mih mih linked 2 issues Apr 17, 2020 that may be closed by this pull request

@codecov codecov bot commented Apr 17, 2020

Codecov Report

Merging #4409 into maint will decrease coverage by 0.04%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##            maint    #4409      +/-   ##
==========================================
- Coverage   89.70%   89.66%   -0.05%     
==========================================
  Files         275      275              
  Lines       37055    37072      +17     
==========================================
- Hits        33240    33239       -1     
- Misses       3815     3833      +18     
Impacted Files Coverage Δ
datalad/interface/common_cfg.py 100.00% <ø> (ø)
datalad/interface/common_opts.py 100.00% <ø> (ø)
datalad/support/annexrepo.py 86.30% <100.00%> (+0.02%) ⬆️
datalad/downloaders/http.py 72.11% <0.00%> (-2.79%) ⬇️
datalad/downloaders/tests/test_http.py 58.79% <0.00%> (-2.17%) ⬇️
datalad/utils.py 83.97% <0.00%> (-0.10%) ⬇️
datalad/support/tests/test_annexrepo.py 95.66% <0.00%> (-0.03%) ⬇️
datalad/core/distributed/tests/test_clone.py 92.54% <0.00%> (-0.01%) ⬇️
datalad/tests/test_config.py 99.17% <0.00%> (+0.01%) ⬆️
datalad/support/network.py 86.66% <0.00%> (+0.03%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aa2da5c...f494d12.

Member

@bpoldrack bpoldrack commented Apr 17, 2020

LGTM

Contributor

@kyleam kyleam commented Apr 17, 2020

I didn't test it locally, but looks good to me as well.

Member

@yarikoptic yarikoptic left a comment

I feel that the config variable name should be different, and this might overall need more documentation adjustments and possibly behavior changes (see comments)

@@ -307,6 +307,13 @@
'text': 'Git-annex large files expression (see https://git-annex.branchable.com/tips/largefiles; given expression will be wrapped in parentheses)'}),
'default': 'anything',
},
'datalad.runtime.max-annex-jobs': {
Member

@yarikoptic yarikoptic Apr 17, 2020


I think it should be datalad.annex.jobs because it is an annex-specific setting, and we pretty much just pass it into the annex call, similarly to the retry setting, whereas runtime is more about our own behavior:

$> git grep datalad.annex datalad/interface/common_cfg.py
datalad/interface/common_cfg.py:    'datalad.annex.retry': {

$> git grep datalad.runtime datalad/interface/common_cfg.py
datalad/interface/common_cfg.py:    'datalad.runtime.raiseonerror': {
datalad/interface/common_cfg.py:    'datalad.runtime.report-status': {

# due to https://github.com/datalad/datalad/issues/4404)
jobs = self._n_auto_jobs or min(
    self.config.obtain('datalad.runtime.max-annex-jobs'),
    max(3, cpu_count()))
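
(For context, with the default of 1 this expression always yields 1,
since min(1, max(3, N)) == 1 for any CPU count N; if the cap were
raised to, say, 20, a 4-core machine would get min(20, max(3, 4)) == 4
and a 2-core machine min(20, max(3, 2)) == 3. Numbers are
illustrative.)
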
Member

@yarikoptic yarikoptic Apr 17, 2020


Why bother limiting here beyond what the user requested? I could have 1 CPU, but downloading in parallel could still be beneficial (it is not a CPU-bound job). I would just pass the value of the configuration here.
Although, indeed, that kind of diminishes the notion of any "auto" here -- it pretty much becomes "config". It would be worth adjusting the documentation wherever jobs is mentioned to say that it no longer does "auto" but rather consults the config variable; or, better, change "auto" to "config" and then leave everything here as it was (i.e., for "auto", do the previous behavior). So by default it would consult the config and thus become 1.

Member

@yarikoptic yarikoptic Apr 17, 2020


This way, if we later come up with some true "auto" behavior, we could change back while leaving the possibility to still use the "config" setting.

Member Author

@mih mih commented Apr 18, 2020

I am not confident that I understand your intentions. My intentions were to change the previous behavior as little as possible, while getting the issues fixed (hence I target maint). You seem to suggest that more fundamental changes should be made (new config option, direct pass-through).

I disagree with

I think it should be datalad.annex.jobs because it is an annex-specific setting, and we pretty much just pass it into the annex call, similarly to the retry setting, whereas runtime is more about our own behavior

DataLad is doing the "auto", not git-annex, and we are applying additional decision-making before coming up with a modified value to pass on; hence my choice of the variable name.

Overall, I think the current state of the PR is not really in conflict with your thoughts, but rather is the bug-fix (BF) part of them that could go into maint. Feel free to take this PR over.

Member

@yarikoptic yarikoptic commented Apr 19, 2020

Ok, if we are to pursue the "minimal change" approach here, I just pushed two commits adjusting the description of the newly added configuration item (to make it more factually correct) and of the --jobs parameter. But my point was: I do not see much value in doing the whole min(..., max(...)) dance if we now simply provide an explicit value; making it more explicit ("--jobs argument to pass to the annex invocation", and thus renaming this configuration option) would have made things clearer, not only for the sake of maint but for the future release as well.

Member Author

@mih mih commented Apr 20, 2020

@yarikoptic Thx for the edits! I am OK with changing the default to an explicitly configured number in a second step. Will merge this for now.

@mih mih merged commit a271d52 into datalad:maint Apr 20, 2020
10 of 11 checks passed
@mih mih deleted the bf-4404 branch Apr 20, 2020
@yarikoptic yarikoptic added this to the 0.12.6 milestone Apr 23, 2020