Downloading a long sequence of URLs results in high CPU usage and slowness #429

Closed
greafhe opened this Issue Sep 11, 2015 · 9 comments

@greafhe
greafhe commented Sep 11, 2015

Here is what I did:
In one terminal run: python -m SimpleHTTPServer 8888
In another terminal run: curl "localhost:8888/[1-1000000]" > /dev/null
At the beginning the speed was over 2000 URLs/second. By the time it reached the 50000th URL, the speed was under 300 URLs/second, and curl's CPU usage was about twice as high as at the start.
Restarting curl (without restarting the server) again gives high speed at first, followed by a gradual drop. The same happens with a different range, e.g. [100000-1000000].

@greafhe
greafhe commented Sep 11, 2015

curl --version
curl 7.35.0 (x86_64-pc-linux-gnu) libcurl/7.35.0 OpenSSL/1.0.1f zlib/1.2.8 libidn/1.28 librtmp/2.3
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp smtp smtps telnet tftp
Features: AsynchDNS GSS-Negotiate IDN IPv6 Largefile NTLM NTLM_WB SSL libz TLS-SRP

(from the Ubuntu 14.04 repository)

@frenche
Contributor
frenche commented Sep 12, 2015

Interesting test. I've run it and got similar results with curl 7.45 (Fedora 21).
There seems to be a noticeable decrease in speed and increase in CPU usage over time.

On the other hand, a similar test using libcurl directly doesn't show this behaviour.
The test program below runs at a fairly constant speed (and about twice as fast).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

int main(int argc, char *argv[])
{
  CURLcode ret = CURLE_OK;
  CURL *hnd;
  int i;
  char base_url[] = "http://localhost:8888/";

  curl_global_init(CURL_GLOBAL_DEFAULT);
  hnd = curl_easy_init();

  /* room for the base URL plus up to six digits and the terminating NUL */
  char *url = (char *) malloc(sizeof(base_url) + 7);
  strcpy(url, base_url);

  /* reuse the same easy handle for every transfer, only the URL changes */
  for(i = 0; i < 1000000; i++) {
    sprintf(url + sizeof(base_url) - 1, "%i", i);
    curl_easy_setopt(hnd, CURLOPT_URL, url);
    ret = curl_easy_perform(hnd);
  }

  free(url);
  curl_easy_cleanup(hnd);
  curl_global_cleanup();

  return ret;
}

I can't tell why the tool behaves that way and I'm not sure I've considered all the aspects.

@jay
Member
jay commented Sep 12, 2015

 80.83%  curl  libcurl.so.4.3.0    [.] Curl_slist_append_nodup
 16.00%  curl  [kernel.kallsyms]   [k] 0xffffffff8114f406
  0.23%  curl  libcurl.so.4.3.0    [.] dprintf_formatf
  0.15%  curl  libc-2.19.so        [.] _IO_vfscanf
  0.10%  curl  libc-2.19.so        [.] malloc
  0.10%  curl  libcurl.so.4.3.0    [.] Curl_setopt
  0.09%  curl  curl                [.] operate_do

perf report output from Ubuntu 14 x64 running curl built from master fad9604 2015-09-11.

The Curl_slist_append_nodup calls trace back to the curl tool's libcurl source code generation (the --libcurl feature), so this is an issue exclusive to the curl tool. Code generation is enabled at build time by default, and when it is enabled at build time the source is always generated at runtime, whether or not --libcurl was actually given. So for every URL the tool generates the corresponding source, line by line. The lines are stored in a linked list with no tail pointer, so for each new line it has to traverse the whole list to reach the end before it can append. This conclusion is just based on my reading of the code; I did not dig deeper to prove that's what's eating the CPU. It's the type of thing that isn't noticeable with a URL or two, but with a million it certainly is.
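
To illustrate the cost (a rough sketch only, not curl's actual list code): appending to a singly-linked list that keeps only a head pointer means walking the entire list on every append, so N appends cost on the order of N^2 node visits.

  #include <stdlib.h>

  /* Hypothetical node, shaped roughly like curl's struct curl_slist. */
  struct line {
    char *data;
    struct line *next;
  };

  /* Append by walking from the head to the tail each time: O(length). */
  static struct line *append_line(struct line *head, char *data)
  {
    struct line *node = malloc(sizeof(*node));
    node->data = data;
    node->next = NULL;

    if(!head)
      return node;

    struct line *last = head;
    while(last->next)          /* visits every existing node */
      last = last->next;
    last->next = node;
    return head;
  }

  /* Appending N lines visits 0 + 1 + ... + (N-1) = N*(N-1)/2 nodes in
     total, which is why the per-URL cost keeps growing as more generated
     source lines accumulate. */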

If this is the case I think the fix would be not to do code generation unless it's requested. Though I wonder if this is one of those things that's a whole lot easier to say than do.

Until this is resolved you could build curl with configure option --disable-libcurl-option and it should perform better.

@bagder
Member
bagder commented Sep 12, 2015

Nice catch @jay, and indeed a silly flaw. We should of course only waste that time and memory if the user actually asked for code to get generated...

@bagder
Member
bagder commented Sep 12, 2015

Oh, and we should probably also consider using a linked list to which we can append strings at no extra cost even when the list gets very long, so that this use case would still work fast even when code generation is enabled.
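
A rough sketch of that idea (hypothetical types, not curl's actual API): keep a cached pointer to the last node so each append is O(1) no matter how long the list has grown.

  /* Hypothetical list that caches its tail so appends never re-walk it. */
  struct line {
    char *data;
    struct line *next;
  };

  struct line_list {
    struct line *head;
    struct line *last;   /* cached tail pointer */
  };

  static void append_line_fast(struct line_list *list, struct line *node)
  {
    node->next = NULL;
    if(list->last)
      list->last->next = node;   /* link after the cached tail */
    else
      list->head = node;         /* first node in the list */
    list->last = node;           /* O(1), independent of list length */
  }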

@frenche
Contributor
frenche commented Sep 12, 2015

Nice indeed (and thanks for the perf tip)!
I can confirm that after building with --disable-libcurl-option the speed is steady (and faster).

@bagder bagder added the cmdline tool label Sep 13, 2015
@bagder
Member
bagder commented Sep 16, 2015

Anyone volunteering to work on improving this piece of code?

@bagder bagder added a commit that closed this issue Sep 20, 2015
@gnawhleinad @bagder gnawhleinad + bagder tool: generate easysrc only on --libcurl
Code should only be generated when --libcurl is used.

Bug: #429
Reported-by: @greafhe, Jay Satiro

Closes #429
Closes #442
4d95491
@bagder bagder closed this in 4d95491 Sep 20, 2015
@gnawhleinad
Contributor

Oh, and we should probably also consider using a linked list to which we can append strings at no extra cost even when the list gets very long, so that this use case would still work fast even when code generation is enabled.

We should still pursue this proposal! Should I create a new issue for this?

@bagder
Member
bagder commented Sep 20, 2015

We should still pursue this proposal! Should I create a new issue for this?

Yes, that would be good, and a separate issue/pull-request makes it more convenient.

@jay jay added a commit that referenced this issue Sep 21, 2015
@jay jay tool_operate: Don't call easysrc cleanup unless --libcurl
- Review of 4d95491.

The author changed it so easysrc only initializes when --libcurl but did
not do the same for the call to easysrc cleanup.

Ref: #429
3f8d4e2
@gnawhleinad gnawhleinad added a commit to gnawhleinad/curl that referenced this issue Sep 22, 2015
@gnawhleinad gnawhleinad tool: remove redundant libcurl check
The easysrc generation is run only when --libcurl is initialized.

Ref: curl#429
bc5c432
@bagder bagder added a commit that referenced this issue Sep 22, 2015
@gnawhleinad @bagder gnawhleinad + bagder tool: remove redundant libcurl check
The easysrc generation is run only when --libcurl is initialized.

Ref: #429

Closes #448
1467dec
@gnawhleinad gnawhleinad added a commit to gnawhleinad/curl that referenced this issue Sep 25, 2015
@gnawhleinad @gnawhleinad gnawhleinad + gnawhleinad tool: Generate easysrc with last cache linked-list
Using a last cache linked-list improves the performance of easysrc generation.

Bug: curl#444
Ref: curl#429
2717cc8
@bagder bagder added a commit that referenced this issue Oct 17, 2015
@gnawhleinad @bagder gnawhleinad + bagder tool: Generate easysrc with last cache linked-list
Using a last cache linked-list improves the performance of easysrc
generation.

Bug: #444
Ref: #429

Closes #452
19cb0c4