Curl is not handling uploads with large number of URLs (100,000+) #1959
I did this
I'm trying to upload a large number of small files (100,000 to 1,000,000) over a single HTTPS connection
I expected the following
I expected curl to start the upload process in a reasonable amount of time, but the upload never starts.
If I try to download the files instead of uploading them, curl starts processing right away; see below.
It looks like curl is not processing a large number of uploads/URLs in an optimal way.
[curl -V output]
macOS Sierra 10.12.6
I tried this on Windows with several recent builds: it takes about 20 seconds to parse a config with the upload options (200,000 lines) and 10 seconds to parse a config with the download options (100,000 lines). If you build with -DDEBUG_CONFIG it will show you the parsing in real time; for example, I see this repeating without delay:
Check for a delay in parsing the entries.
edit: note I'm using debug builds of curl/libcurl; release builds would obviously be faster.
I'm not sure whether the Windows curl behaves any differently. But both the curl that ships with macOS and the latest build of curl from GitHub that I tried on Debian take forever processing the config and never finish in a reasonable time to start uploading. I can see with strace that it is processing the entries, but rather slowly:
read(4, "st/upload/34235\nupload-file /tmp"..., 4096) = 4096
I've also compiled it with the "-pg" option on Linux, and the profile shows it spends most of its time in the getparameter function.
This is the part of tool_getparam.c where I think all this time is spent. If I understand it correctly, it scans the list of URLs from beginning to end on every addition, so the more URLs get added, the longer each scan takes.
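The pattern described above can be sketched as follows. This is a simplified model, not curl's actual code; the `url_node` struct and `append_slow` name are illustrative. The point is that appending by re-walking from the head makes N appends cost O(N^2) node visits in total:

```c
#include <stdlib.h>

/* Simplified singly linked list of URL entries (illustrative;
 * curl's real list nodes carry URL/outfile data as well). */
struct url_node {
    struct url_node *next;
};

/* Append by walking from the head every time: the Nth append
 * traverses N-1 nodes, so N appends do ~N^2/2 visits total. */
static struct url_node *append_slow(struct url_node *head)
{
    struct url_node *node = calloc(1, sizeof(*node));
    if(!node)
        return head;               /* allocation failed; list unchanged */
    if(!head)
        return node;               /* first entry becomes the head */
    struct url_node *last = head;
    while(last->next)              /* full traversal on every call */
        last = last->next;
    last->next = node;
    return head;
}
```

With 100,000 URLs this is on the order of five billion node visits just to parse the command line, which matches the "it never finishes" behavior reported above.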
So I tried replacing the code at tool_getparam.c:1916,
and it seems to have fixed the issue :) I'm not a developer and I don't understand the logic with the GETOUT_UPLOAD flags at all, but hey, it worked :)
Lovely @arainchik, and thanks! Someone just needs to verify that it actually is a proper fix and we can land it. This is a pretty tricky issue to add a test case for, so I'd rather not have a hundred thousand transfers as a test. I think we can skip it this time.
By properly keeping track of the last entry in the list of URLs/uploads to handle, curl now avoids many meaningless traverses of the list, which speeds up many-URL handling *MASSIVELY* (several orders of magnitude on 100K URLs).

Added test 1291 to verify that it doesn't take ages, although the test suite has no detection of a "too slow" command.

Reported-by: arainchik on github
Fixes #1959
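The fix the commit describes, keeping track of the last entry, can be sketched as a cached tail pointer so each append becomes O(1). Again this is a simplified model with illustrative names, not curl's actual types (curl stores its "last" pointer in its config state rather than a separate list struct):

```c
#include <stdlib.h>

/* Simplified singly linked list with a cached tail pointer
 * (illustrative; stands in for curl's URL/upload entry list). */
struct url_node {
    struct url_node *next;
};

struct url_list {
    struct url_node *head;
    struct url_node *last;   /* remembers the end of the list */
};

/* O(1) append: no traversal, regardless of list length. */
static struct url_node *append_fast(struct url_list *list)
{
    struct url_node *node = calloc(1, sizeof(*node));
    if(!node)
        return NULL;         /* allocation failed */
    if(list->last)
        list->head ? (list->last->next = node) : 0, list->last->next = node;
    else
        list->head = node;   /* first entry */
    list->last = node;       /* cache the new tail */
    return node;
}
```

With this change, parsing N entries does N constant-time appends instead of N list traversals, which is the "several orders of magnitude" speedup on 100K URLs.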