
BF: Assume 50% safety margin for max cmdline length #3001

Merged: 2 commits merged into datalad:master on Nov 21, 2018

Conversation

@yarikoptic (Member) commented Nov 20, 2018

Hit it again with current master ... not sure if this would have resolved it but it should ;)

@kyleam (Contributor) left a comment

I think this change is fine as is, particularly assuming it fixes real limits hit, but does it really make sense for the safety margin to scale with CMD_MAX_ARG?

@yarikoptic (Member, Author) commented Nov 20, 2018

> I think this change is fine as is, particularly assuming it fixes real limits hit, but does it really make sense for the safety margin to scale with CMD_MAX_ARG?

As it was discovered on Linux, the margin should be about 300k, which is greater than CMD_MAX_ARG itself on OSX or Windows ;) That is why I decided to go for a %
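The %-based margin discussed here can be sketched as follows. This is a hypothetical illustration, not DataLad's actual implementation -- the function name and the 32767 Windows fallback are assumptions:

```python
import os

def max_cmdline_with_margin(margin=0.5):
    """Estimate a usable maximum command-line length.

    Hypothetical sketch: query the OS limit via sysconf where
    available, fall back to a conservative default elsewhere, and
    keep only (1 - margin) of it as usable space.
    """
    try:
        # POSIX limit on the combined length of argv + environ
        max_arg = os.sysconf('SC_ARG_MAX')
        if max_arg <= 0:
            raise ValueError("sysconf returned a non-positive limit")
    except (ValueError, OSError, AttributeError):
        # e.g. on Windows, where os.sysconf is unavailable;
        # 32767 is the assumed conservative fallback here
        max_arg = 32767
    return int(max_arg * (1 - margin))
```

Scaling the margin with the detected limit (rather than subtracting a fixed amount) is what keeps the usable length proportional across OSes with very different SC_ARG_MAX values.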

@codecov (bot) commented Nov 20, 2018

Codecov Report

Merging #3001 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #3001      +/-   ##
==========================================
+ Coverage   90.36%   90.36%   +<.01%     
==========================================
  Files         245      245              
  Lines       32269    32270       +1     
==========================================
+ Hits        29160    29161       +1     
  Misses       3109     3109
Impacted Files                 Coverage        Δ
datalad/support/annexrepo.py   88.1% <ø>       (ø) ⬆️
datalad/utils.py               86.33% <100%>   (+0.01%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 094b83a...83b3640.

@yarikoptic (Member, Author) commented Nov 20, 2018

Pushed a rewrite of my silly untested quick change -- this time it must work ;-)

@kyleam (Contributor) commented Nov 20, 2018

> As it was discovered on Linux, the margin should be about 300k, which is greater than CMD_MAX_ARG itself on OSX or Windows ;) That is why I decided to go for a %

Sure, I didn't say you should just blindly subtract an absolute value. But if ~300k padding is required for a command on linux, isn't that command almost certainly going to fail elsewhere? It seems like git-annex just goes with a max length of 10240. Why not do that?

@yarikoptic (Member, Author) commented Nov 20, 2018

My reason is greed!
Annex probably generally doesn't invoke very long commands -- it probably passes `.` rather than all the files from under the current directory. 10k partitioning would result in an unnecessarily large number of separate commands to run in our case

@kyleam (Contributor) commented Nov 20, 2018

> 10k partitioning would result in an unnecessarily large number of separate commands to run in our case

OK, I'll trust your gut that this is true in practice.

@kyleam (Contributor) commented Nov 20, 2018

Quoting myself (@kyleam):

> But if ~300k padding is required for a command on linux, isn't that command almost certainly going to fail elsewhere?

Hmm, there's still this. You didn't answer, so I'm (perhaps incorrectly) assuming I'm not mistaken. If so, it seems like it'd be better to just drop to a very conservative absolute value for the maximum length if not on linux or SC_ARG_MAX is lower than a particular value. Then, you wouldn't take the hit you're worried about when on linux in the usual conditions, but there's an increased chance that commands will succeed in the other scenarios.
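The fallback being proposed here could look roughly like this. A hypothetical sketch: the 10240 value is the git-annex figure mentioned above, while the 4x-headroom threshold and the function name are assumptions:

```python
import os
import sys

# Fixed conservative limit, taken from the git-annex figure
# cited in this thread
CONSERVATIVE_MAX = 10240

def effective_cmdline_limit():
    """Use the SC_ARG_MAX-derived limit (with a 50% margin) only
    on Linux when it is comfortably large; otherwise fall back to
    a conservative absolute value.
    """
    if sys.platform.startswith('linux'):
        try:
            max_arg = os.sysconf('SC_ARG_MAX')
            # only trust the large limit when there is real headroom
            # (the 4x threshold is an assumption for illustration)
            if max_arg >= 4 * CONSERVATIVE_MAX:
                return max_arg // 2
        except (OSError, ValueError):
            pass
    return CONSERVATIVE_MAX
```

This keeps long command lines on a typical Linux (where SC_ARG_MAX is around 2MB) while avoiding over-long commands on platforms with small or unknown limits.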

@yarikoptic (Member, Author) commented Nov 20, 2018

> But if ~300k padding is required for a command on linux, isn't that command almost certainly going to fail elsewhere?

I am not sure what you mean here.
"Elsewhere" where? Within the git-annex call itself? (That one has its own ways -- batch calls to git, etc.)
Since we (or at least I) haven't figured out why those 300k are needed (even after I accounted for the environment etc.), this is all a heuristic. It could be more conservative (a fixed number, which might become more liberal for some OSes) or less (a fixed '%')... If you would like a more conservative '%', then which one -- 70? 60? Going all the way down to 0.5% sounds like overkill for my liking.

@kyleam (Contributor) commented Nov 20, 2018

> But if ~300k padding is required for a command on linux, isn't that command almost certainly going to fail elsewhere?
>
> I am not sure what you mean here.
> "elsewhere" where?

"elsewhere" was in direct contrast to the "linux" at the beginning of the sentence, so on other OSs.

@yarikoptic (Member, Author) commented Nov 20, 2018

Ah, OK... I am not sure why it would "certainly" fail; I have no prior experience of that kind ;-)

But if you want to bargain, what if we take a 100% - 10240/32767 =~ 70% margin (i.e. a 0.3 multiplier) and still stay friends? ;) This way at least we wouldn't go all the way to a 99.5% margin on linux ;-)

@kyleam (Contributor) commented Nov 20, 2018

> ah, ok... I am not sure why it would "certainly" fail, I have no prior experience of that kind ;-)

"certainly" was bad word choice on my part. I was just trying to reason through it.

> This way at least we wouldn't go all the way to 99.5% margin on linux ;-)

I'm confused by what that means given that above I said:

> If so, it seems like it'd be better to just drop to a very conservative absolute value for the maximum length if not on linux or SC_ARG_MAX is lower than a particular value. Then, you wouldn't take the hit you're worried about when on linux in the usual conditions, but there's an increased chance that commands will succeed in the other scenarios.

Anyway, I'm unconvinced a fixed scaling is a useful thing to do for low CMD_MAX_ARG values, but that's likely me being obtuse. Proceed without me :]

@yarikoptic (Member, Author) commented Nov 20, 2018

It might be that more complex OS-based decision making would indeed be more robust... OK, you made me paranoid; I will go for a 50% margin and be done with it for now.
Eventually any of these ad-hoc heuristics will fail (unicode, etc.) unless we figure out an exact way to estimate the limit reliably while still accounting for possible limits being hit by the tools we call, or add some progressive splitting if we hit "maximal length reached"...
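The "progressive splitting" idea mentioned here was not implemented in this PR, but it could be sketched as recursively halving the argument list whenever the OS rejects the command with E2BIG ("argument list too long"). A hypothetical sketch (the function name is an assumption):

```python
import errno
import subprocess

def run_with_progressive_splitting(base_cmd, files):
    """Run base_cmd over all files at once; if the OS rejects the
    command line as too long (E2BIG), split the file list in half
    and retry each half recursively.
    """
    try:
        subprocess.run(base_cmd + files, check=True)
    except OSError as exc:
        # only retry on 'argument list too long', and only if the
        # list can still be split further
        if exc.errno != errno.E2BIG or len(files) <= 1:
            raise
        mid = len(files) // 2
        run_with_progressive_splitting(base_cmd, files[:mid])
        run_with_progressive_splitting(base_cmd, files[mid:])
```

The appeal of this approach is that it makes no up-front guess about the limit at all: the common case runs one command, and splitting happens only when the kernel actually refuses.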

@yarikoptic yarikoptic changed the title BF: Assume 20% safety margin for max cmdline length BF: Assume 50% safety margin for max cmdline length Nov 20, 2018
@yarikoptic yarikoptic merged commit ec128a6 into datalad:master Nov 21, 2018
6 of 7 checks passed
@yarikoptic yarikoptic deleted the bf-long-names branch Nov 21, 2018
@yarikoptic yarikoptic added this to the Release 0.11.1 milestone Nov 24, 2018
yarikoptic added a commit that referenced this issue Nov 27, 2018
	## 0.11.1 (Nov 25, 2018) -- v7-better-than-v6

	Rushed out bugfix release to stay fully compatible with recent
	[git-annex] which introduced v7 to replace v6.

	### Fixes

	- [install]: be able to install recursively into a dataset ([#2982])
	- [save]: be able to commit/save changes whenever files potentially
	  could have swapped their storage between git and annex
	  ([#1651]) ([#2752]) ([#3009])
	- [aggregate-metadata]:
	  - the dataset itself is now not "aggregated" if specific paths are
		provided for aggregation ([#3002]). That resolves the issue of
		`-r` invocation aggregating all subdatasets of the specified dataset
		as well
	  - also compare/verify the actual content checksum of aggregated metadata
		while considering subdataset metadata for re-aggregation ([#3007])
	- `annex` commands are now chunked assuming a 50% "safety margin" on the
	  maximal command line length. Should resolve crashes while operating
	  on too many files at once ([#3001])
	- `run` sidecar config processing ([#2991])
	- no double trailing period in docs ([#2984])
	- correct identification of the repository with symlinks in the paths
	  in the tests ([#2972])
	- re-evaluation of dataset properties in case of dataset changes ([#2946])
	- [text2git] procedure to use `ds.repo.set_gitattributes`
	  ([#2974]) ([#2954])
	- Switch to use plain `os.getcwd()` if inconsistency with env var
	  `$PWD` is detected ([#2914])
	- Make sure that credential defined in env var takes precedence
	  ([#2960]) ([#2950])

	### Enhancements and new features

	- [shub://datalad/datalad:git-annex-dev](https://singularity-hub.org/containers/5663/view)
	  provides a Debian buster Singularity image with build environment for
	  [git-annex]. [tools/bisect-git-annex]() provides a helper for running
	  `git bisect` on git-annex using that Singularity container ([#2995])
	- Added [.zenodo.json]() for better integration with Zenodo for citation
	- [run-procedure] now provides names and help messages with a custom
	  renderer ([#2993])
	- Documentation: point to [datalad-revolution] extension (prototype of
	  the greater DataLad future)
	- [run]
	  - support injecting of a detached command ([#2937])
	- `annex` metadata extractor now extracts `annex.key` metadata record.
	  Should now allow identifying uses of specific files etc ([#2952])
	- Test that we can install from http://datasets.datalad.org
	- Proper rendering of `CommandError` (e.g. in case of "out of space"
	  error) ([#2958])

* tag '0.11.1':
  Adjust the date -- 25th fell through due to __version__ fiasco
  BF+ENH(TST): boost hardcoded version + provide a test to guarantee consistency in the future
  This (expensive) approach is not needed in v6+
  small tuneup to changelog