Initialize git annex special remote with chunking to circumvent file size limits #118

Closed
wants to merge 12 commits

Conversation

@adswa (Member) commented Jul 16, 2020

This aims to fix #116 by initializing the special remote with a default chunk size of 50MB. initremote's chunk parameter is exposed via a --chunk argument, so the chunk size can be configured, or chunking disabled, at the time of sibling creation.
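
For illustration, sibling creation with the proposed option would look like this (sibling names are placeholders, and using 0 to disable chunking is only one possible convention, not settled here):

datalad create-sibling-osf -s osf-chunked --mode annex --chunk 50mb
datalad create-sibling-osf -s osf-nochunk --mode annex --chunk 0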

Locally, I was able to push a >5GB file to the OSF with this change.

I'm unsure how to approach unit tests, though - are there related tests in datalad that test chunk sizes/chunking?

This allows the upload of files larger than 5GB for non-export* mode siblings.
The chunk size is exposed as a parameter in case chunking is undesired or needs
adjustment.
The default chunk size of 50MB is inspired by the box.com special remote example,
with some consideration for OSF's rate limiting (10,000 requests per day).
@adswa changed the title from "Enable special remote with chunking" to "Initialize git annex special remote with chunking to circumvent file size limits" on Jul 16, 2020
@yarikoptic (Member) commented
Why just 50MB if the limit is something like 5GB? Why not, e.g., 1GB? I have never tested the effect of chunking on transfer performance etc., but I assume there is some.

@adswa (Member, Author) commented Jul 16, 2020

I agree that we should test this on different infrastructure. Here's what the docs say:

Good chunk sizes will depend on the remote, but a good starting place is probably 1MiB. Very large chunks are problematic, both because git-annex needs to buffer one chunk in memory when uploading, and because a larger chunk will make resuming interrupted transfers less efficient. On the other hand, when a file is split into a great many chunks, there can be increased overhead of making many requests to the remote.

50MB is based on the webdav special remote example for box.com.
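
For reference, the initremote call from that git-annex tip looks roughly like this (reproduced from memory, so treat the URL, directory, and encryption settings as approximate placeholders; the relevant part is chunk=50mb):

WEBDAV_USERNAME=you@example.com WEBDAV_PASSWORD=yourpass \
git annex initremote box.com type=webdav url=https://dav.box.com/dav/annex \
    chunk=50mb encryption=hybrid keyid=you@example.com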

@adswa (Member, Author) commented Jul 16, 2020

With very large datasets and (small-ish) chunking, rate limiting (#28) may also be a point to consider.

@yarikoptic (Member) commented Jul 16, 2020

Thank you! So it feels like the best course of action is not to have chunking enabled by default, but to error out and instruct the user to enable chunking when any file to be uploaded is bigger than 5GB... I don't know whether chunking can be enabled on an existing remote and what effect it would have.

@adswa (Member, Author) commented Jul 16, 2020

I don't know whether chunking can be enabled on an existing remote and what effect it would have.

Yeah, that was what I was thinking about when implementing this, too. I would be fine with making the default 0 to disable chunking - changing to a different chunk size could be done afterwards with enableremote, if I am not mistaken.
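
If I read the git-annex chunking docs correctly, that would be a one-liner along these lines (osf-storage is a placeholder for the storage remote's name, and a changed chunk size only applies to content stored afterwards):

git annex enableremote osf-storage chunk=100MiB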

but to error out and instruct the user to enable chunking when any file to be uploaded is bigger than 5GB

That raises a good point; the current error when uploading too-large files is maximally uninformative at the moment:

datalad push --to osf-nochunk
Push to 'osf-nochunk':  25%|███         | 1.00/4.00 [00:00<00:00, 6.49k Steps/s
[ERROR  ] Could not create a new file at (MD5E-s5500000000--d7251c8712f1810b3fe77f571f1c4a91) nor update it.                                                     
| This could have failed because --fast is enabled. 
[copy(/home/adina/scratch/mytest2/areallylargefile)] 
copy(error): areallylargefile (file) [Could not create a new file at (MD5E-s5500000000--d7251c8712f1810b3fe77f571f1c4a91) nor update it.                        
This could have failed because --fast is enabled.]

@yarikoptic (Member) commented
FWIW

That raises a good point; the current error when uploading too-large files is maximally uninformative at the moment:

I wonder if OSF returns anything more informative in the response headers (something like ...), so maybe even hardcoding the 5GB size in the code would not be necessary... but looking at https://developer.osf.io/#tag/Errors-and-Error-Codes I see nothing relevant :-/ needs to be tried ;-)

@mih (Member) commented Jul 17, 2020

Chunking below the effective limit can be useful (git-annex can download chunks in parallel IIRC), but it increases API usage (more requests, reaching the daily limit faster), and the substantial request latency (more than a second, even on fast(ish) connections) quickly eats up any benefits when the chunk size is too small.
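
For a rough sense of the request overhead (back-of-envelope only, assuming one upload request per chunk and ignoring retries and metadata calls):

filesize=$((5 * 1000 * 1000 * 1000))   # ~5GB file
for chunk in $((50 * 1000 * 1000)) $((250 * 1000 * 1000)) $((1000 * 1000 * 1000)); do
    echo "$((chunk / 1000000))MB chunks: $(( (filesize + chunk - 1) / chunk )) upload requests"
done
# 50MB chunks -> 100 requests per 5GB file; at 10000 requests/day that is
# roughly 100 such files per day before hitting the limit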

I'd say, and agree, that things should be tried and benchmarked. Let's release 0.2.0 without this feature and target the next one.

@adswa (Member, Author) commented Jul 17, 2020

Let's release 0.2.0 without this feature and target the next one.

agreed.

@adswa (Member, Author) commented Jul 17, 2020

With regard to more informative error messages...

I wonder if OSF returns anything more informative in the response headers (something like ...), so maybe even hardcoding the 5GB size in the code would not be necessary... but looking at https://developer.osf.io/#tag/Errors-and-Error-Codes I see nothing relevant :-/ needs to be tried ;-)

I couldn't find anything informative, but both my git-annex special remote debugging and my HTTP request foo are weak ;-)

The obscure error message is raised by the OSF client code (which doesn't check for too-large files at all):

else:
    raise RuntimeError("Could not create a new file at "
                       "({}) nor update it.".format(path))

For better errors, would it make sense and is it possible to do something like git annex examinekey <key> --format='${bytesize}\n' in the special remote code on local files after an unsuccessful transfer if chunking is disabled? I.e., something like a hook?
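
examinekey does expose the size encoded in a key, so something along these lines seems doable (just a sketch using the key from the failed push above; the exact 5GB constant and where to hook this into the special remote are assumptions):

# sketch: warn when an annex key exceeds OSF's per-file limit (assumed 5GiB here)
key=MD5E-s5500000000--d7251c8712f1810b3fe77f571f1c4a91
size=$(git annex examinekey "$key" --format='${bytesize}\n')
if [ "$size" -gt $((5 * 1024 * 1024 * 1024)) ]; then
    echo "key $key exceeds the OSF file size limit; enable chunking on the sibling" >&2
fi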

adswa and others added 8 commits July 17, 2020 11:05
- Installation instructions
- generic and more prominent link to the docs
- shorter intro, instead point to the docs
This allows the upload of files larger than 5GB for non-export* mode siblings.
The chunk size is exposed as a parameter in case chunking is undesired or needs
adjustment.
The default chunk size of 50MB is inspired by the box.com special remote example,
with some consideration for OSF's rate limiting (10,000 requests per day).
@adswa (Member, Author) commented Aug 25, 2020

I have benchmarked a tiny bit using the following script:

#!/usr/bin/bash

set -eu
set -x

cd "$(mktemp -d ${TMPDIR:-/tmp}/dl-XXXXXXX)"
datalad create .
head -c 1G </dev/urandom >myfile1G
head -c 500MB </dev/urandom >myfile500MB
head -c 100MB </dev/urandom >myfile100MB
datalad save .
for chunk in 50mb 100mb 150mb 200mb 250mb; do
    datalad create-sibling-osf --mode annex --chunk "$chunk" -s "osf_${chunk}"
    time datalad push --to "osf_${chunk}"
done

chunk size     real         user        sys
no chunking    1m27.192s    0m5.389s    0m2.398s
50mb           4m7.911s     0m6.341s    0m4.899s
100mb          2m59.807s    0m6.102s    0m4.903s
150mb          2m29.905s    0m5.887s    0m4.598s
200mb          2m19.883s    0m5.794s    0m4.939s
250mb          2m10.201s    0m5.777s    0m5.299s
300mb          2m48.794s    0m6.083s    0m4.714s
500mb          1m37.938s    0m5.901s    0m5.014s

Improvements on how to benchmark this are very welcome :)

@adswa (Member, Author) commented Feb 1, 2021

Given the 5GB storage limit on OSF, this feature isn't really needed anymore, right? Shall we close this?

@adswa closed this on Feb 1, 2021
Development

Successfully merging this pull request may close these issues.

Make git-annex special remote config aware of 5GB filesize limit