Initialize git annex special remote with chunking to circumvent file size limits #118
Conversation
This allows the upload of files larger than 5GB for non-export* mode siblings. The chunk size is exposed as a parameter, should chunking be undesired or need adjustment. The default chunk size of 50MB is inspired by the box.com special remote example, with some consideration for OSF's rate limiting (10000 requests per day).
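For illustration, a minimal sketch of what this could look like via DataLad's Python API. The `chunk` keyword is assumed to mirror the `--chunk` argument described in this PR; exact parameter names and accepted values are assumptions:

```python
import datalad.api as dl

# Sketch: create an OSF sibling with an explicit chunk size
# ("50MB" is the default proposed in this PR; names are assumptions)
ds = dl.Dataset("/path/to/dataset")
ds.create_sibling_osf(title="my-osf-project", name="osf", chunk="50MB")

# Files larger than OSF's 5GB per-file limit would then be
# uploaded in 50MB pieces by git-annex
ds.push(to="osf")
```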
Why just 50MB if the limit is 5GB? Why not e.g. 1GB? I have never tested the effect of chunking on transfer performance etc., but I assume there should be some.
I agree that we should test this on different infrastructure. Here's what the docs say:
50MB is based on the webdav special remote example for box.com.
With very large datasets and (small-ish) chunking, rate limiting (#28) may also be a point to consider.
Thank you! So it feels like the best course of action is not to have chunking enabled by default, but to error out and instruct the user to enable chunking when any file to be uploaded is bigger than 5GB... I don't know if chunking could be enabled on an existing remote and what effect it would have.
Yeah, that was what I was thinking about when implementing this, too. I would be fine with making that the default.
That raises a good point - the current error when uploading files that are too large is maximally uninformative at the moment.
FWIW, I wonder if OSF returns anything more informative in the response headers (something like a maximum-size header), so maybe even hardcoding the 5GB size in the code would not be necessary... but looking at https://developer.osf.io/#tag/Errors-and-Error-Codes I see nothing relevant :-/ needs to be tried ;-)
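One quick way to try this could be to fire an upload request at the OSF API and dump whatever comes back. A sketch, assuming the WaterButler upload endpoint; the URL, project id, and token are placeholders:

```python
import requests

# Placeholders: a test project's osfstorage upload URL and an auth token
upload_url = "https://files.osf.io/v1/resources/<project>/providers/osfstorage/"
headers = {"Authorization": "Bearer <token>"}

with open("bigger_than_5gb.dat", "rb") as f:
    r = requests.put(upload_url, params={"kind": "file", "name": "big.dat"},
                     headers=headers, data=f)

# Inspect status, headers, and body for anything hinting at the size limit
print(r.status_code)
print(dict(r.headers))
print(r.text[:500])
```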
Chunking below the effective limit can be useful (git-annex can download chunks in parallel IIRC), but it increases API usage (more requests, reaching the daily limit faster), and the substantial request latency (more than a second, even on fast(ish) connections) quickly eats up any benefits when the chunk size is too small. I agree that things should be tried and benchmarked. Let's release 0.2.0 without this feature and target the next one.
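To put rough numbers on the rate-limit concern, using the 5GB file size and 10000 requests/day figures from above (one PUT per chunk, ignoring retries and metadata calls):

```python
# Back-of-the-envelope request budget per chunk size
file_size_mb = 5 * 1024  # a single 5GB file
daily_limit = 10_000     # OSF rate limit mentioned above

for chunk_mb in (50, 500, 1024):
    puts = -(-file_size_mb // chunk_mb)  # ceiling division
    print(f"{chunk_mb}MB chunks: {puts} uploads/file, "
          f"~{daily_limit // puts} such files/day")

# 50MB  -> 103 uploads/file, ~97 files/day
# 500MB -> 11 uploads/file, ~909 files/day
# 1GB   -> 5 uploads/file, ~2000 files/day
```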
agreed.
With regard to more informative error messages...
I couldn't find anything informative, but both my git-annex special remote debugging and my HTTP request foo are weak ;-) The obscure error message is raised by the OSF client code (which doesn't check for too-large files at all): `datalad_osf/osfclient/osfclient/models/storage.py`, lines 130 to 132 (as of 4218aaa).
For better errors, would it make sense (and would it be possible) to do something like a client-side size check before uploading?
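Something along these lines, perhaps. A sketch only; the constant, function name, and call site are hypothetical, not part of the actual osfclient code:

```python
import os

# Hypothetical constant: OSF's per-file limit discussed above
OSF_MAX_FILE_SIZE = 5 * 1024 ** 3  # 5 GiB

def ensure_uploadable(path, chunked=False):
    """Fail early with an actionable message instead of an obscure server error."""
    size = os.path.getsize(path)
    if size > OSF_MAX_FILE_SIZE and not chunked:
        raise ValueError(
            f"{path} is {size} bytes, exceeding OSF's 5GB per-file limit; "
            "enable chunking (e.g. chunk=50MB) to upload it in pieces")
```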
I have benchmarked a tiny bit with a small script, trying the following chunk sizes:

- no chunking
- 50MB
- 100MB
- 150MB
- 200MB
- 250MB
- 300MB
- 500MB
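For reference, a sketch of what such a benchmark could look like; the sibling name, test file, and the use of `git annex enableremote` to change the chunk size on an existing remote are assumptions:

```python
import subprocess
import time

SIBLING = "osf"           # assumed special-remote name
TESTFILE = "payload.dat"  # e.g. a 1GB annexed file

for chunk in ("0", "50MiB", "100MiB", "200MiB", "500MiB"):
    # git-annex allows changing the chunk size later via enableremote;
    # chunk=0 is assumed to disable chunking for newly stored content
    subprocess.run(["git", "annex", "enableremote", SIBLING, f"chunk={chunk}"],
                   check=True)
    # Remove any previous copy so each upload starts fresh
    subprocess.run(["git", "annex", "drop", "--from", SIBLING, TESTFILE],
                   check=True)
    start = time.time()
    subprocess.run(["git", "annex", "copy", "--to", SIBLING, TESTFILE],
                   check=True)
    print(f"chunk={chunk}: {time.time() - start:.1f}s")
```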
Improvements on how to benchmark this are very welcome :)
Given the 5GB space limit on OSF, this feature isn't really required anymore, right? Shall we close this?
This wants to fix #116 by initializing the special remote with a default chunk size of 50MB. `initremote`'s `chunk` parameter is exposed via a `--chunk` argument to allow configuring the chunk size or disabling chunking at the time of sibling creation. Locally, I was able to push a >5GB file to the OSF with this change.

I'm unsure how to approach unit tests, though - are there related tests in `datalad` that test chunk sizes/chunking?
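A test might end up looking something like the sketch below. Everything here is an assumption: it needs OSF credentials (or a mocked remote), and the parameter names follow the `--chunk` argument described above:

```python
import datalad.api as dl

def test_push_chunked(tmp_path):
    # Hypothetical end-to-end test; requires OSF credentials in the environment
    ds = dl.create(tmp_path / "ds")
    # A file a few multiples of the chunk size, so several chunks get uploaded
    (ds.pathobj / "big.dat").write_bytes(b"\0" * (120 * 1024 ** 2))
    ds.save(message="add test file")
    ds.create_sibling_osf(title="chunk-test", name="osf", chunk="50MB")
    ds.push(to="osf")  # should upload big.dat as ~3 chunks of 50MB
```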