Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change cookie storage to a top-level-domain-specific hash table #2440

Closed
wants to merge 7 commits into from

Conversation

clbr
Copy link
Contributor

@clbr clbr commented Mar 30, 2018

This series improves curl's performance greatly when there are a lot of cookies. All tests pass.
Since a big part changes whitespace only, IIRC you can add ?w=1 to github urls to ignore whitespace.

The testcase was about 4k cookies, and loading google.com in WebkitFLTK's test browser.

Before:

Curl_cookie_init for cookies.dat took 571 ms
Curl_cookie_getlist for /... took 1019 us, 3953 tested
Curl_cookie_getlist for /... took 846 us, 3954 tested
Curl_cookie_getlist for /?gws_rd... took 899 us, 3956 tested
Curl_cookie_getlist for /images/... took 281 us, 3956 tested
Curl_cookie_getlist for /status?... took 284 us, 3956 tested
Curl_cookie_getlist for /gb/imag... took 431 us, 3956 tested
Curl_cookie_getlist for /images/... took 446 us, 3956 tested
Curl_cookie_getlist for /xjs/_/j... took 458 us, 3956 tested
Curl_cookie_getlist for /gen_204... took 1131 us, 3956 tested
Curl_cookie_getlist for /og/_/js... took 547 us, 3956 tested
Curl_cookie_getlist for /xjs/_/j... took 543 us, 3956 tested
Curl_cookie_getlist for /inputto... took 350 us, 3957 tested
Curl_cookie_getlist for /gen_204... took 414 us, 3957 tested
Curl_cookie_getlist for /og/_/ss... took 574 us, 3957 tested
Curl_cookie_getlist for /og/_/js... took 320 us, 3957 tested
Curl_cookie_getlist for /xjs/_/j... took 344 us, 3957 tested
Curl_cookie_getlist for /_/scs/a... took 925 us, 3957 tested
Curl_cookie_getlist for /images/... took 542 us, 3957 tested
Curl_cookie_getlist for /gen_204... took 730 us, 3957 tested
Curl_cookie_getlist for /gen_204... took 455 us, 3957 tested
Curl_cookie_getlist for /adsid/g... took 844 us, 3957 tested
Curl_cookie_getlist for /domainl... took 443 us, 3957 tested
Curl_cookie_getlist for /domainl... took 273 us, 3958 tested

After

url_cookie_init for cookies.dat took 176 ms
Curl_cookie_getlist for /... took 308 us, 10 tested
Curl_cookie_getlist for /... took 309 us, 11 tested
Curl_cookie_getlist for /?gws_rd... took 280 us, 13 tested
Curl_cookie_getlist for /images/... took 103 us, 13 tested
Curl_cookie_getlist for /status?... took 342 us, 13 tested
Curl_cookie_getlist for /gb/imag... took 89 us, 6 tested
Curl_cookie_getlist for /images/... took 82 us, 13 tested
Curl_cookie_getlist for /xjs/_/j... took 81 us, 13 tested
Curl_cookie_getlist for /gen_204... took 385 us, 13 tested
Curl_cookie_getlist for /og/_/js... took 93 us, 6 tested
Curl_cookie_getlist for /xjs/_/j... took 301 us, 14 tested
Curl_cookie_getlist for /inputto... took 78 us, 6 tested
Curl_cookie_getlist for /_/scs/a... took 297 us, 14 tested
Curl_cookie_getlist for /gen_204... took 186 us, 14 tested
Curl_cookie_getlist for /og/_/ss... took 266 us, 6 tested
Curl_cookie_getlist for /xjs/_/j... took 87 us, 14 tested
Curl_cookie_getlist for /og/_/js... took 452 us, 6 tested
Curl_cookie_getlist for /images/... took 282 us, 14 tested
Curl_cookie_getlist for /gen_204... took 91 us, 14 tested
Curl_cookie_getlist for /gen_204... took 81 us, 14 tested
Curl_cookie_getlist for /domainl... took 91 us, 7 tested
Curl_cookie_getlist for /domainl... took 98 us, 14 tested
Curl_cookie_getlist for /adsid/g... took 99 us, 14 tested

I also checked the Curl_hash_str function's performance using a set of 8k real cookies. It performed quite well, spreading the domains around very evenly.

42, 30, 23, 75, 38, 17, 16, 47, 43, 16, 31, 26, 21, 29, 23, 24, 19, 63, 21, 39, 31, 36, 43, 27, 15, 19, 41, 49, 40, 26, 19, 49, 32, 17, 32, 22, 13, 23, 20, 21, 29, 20, 19, 15, 26, 45, 54, 34, 52, 19, 46, 28, 47, 30, 16, 38, 31, 21, 18, 35, 7, 30, 24, 36, 54, 23, 29, 47, 29, 49, 26, 28, 19, 24, 20, 16, 13, 21, 27, 20, 58, 37, 10, 30, 35, 25, 22, 77, 27, 15, 24, 18, 22, 20, 64, 7, 26, 58, 39, 15, 16, 21, 91, 36, 24, 20, 20, 25, 34, 11, 36, 28, 42, 19, 20, 27, 9, 47, 23, 48, 18, 12, 33, 30, 31, 30, 25, 20, 18, 40, 99, 21, 20, 45, 27, 12, 18, 37, 27, 22, 27, 24, 22, 18, 24, 3, 23, 24, 69, 25, 20, 43, 22, 15, 33, 44, 25, 52, 29, 39, 63, 37, 16, 29, 23, 22, 69, 32, 41, 20, 38, 31, 22, 24, 18, 19, 61, 34, 31, 22, 42, 28, 16, 31, 33, 12, 31, 30, 79, 31, 60, 20, 74, 18, 29, 38, 30, 30, 27, 32, 26, 65, 40, 9, 28, 29, 30, 19, 17, 23, 32, 21, 65, 21, 17, 42, 31, 125, 11, 28, 73, 22, 73, 20, 17, 21, 75, 17, 17, 11, 70, 29, 32, 31, 74, 36, 35, 18, 43, 49, 38, 71, 21, 59, 22, 52, 26, 5, 25, 22, 47, 37, 25, 27, 16, 64,
low high avg 3 125 31

@bagder
Copy link
Member

bagder commented Mar 30, 2018

Thank you, this looks like an awesome improvement as the cookie handling has been lacking in the performance department for a while!

If you run ./configure --enable-debug you'll get pickier compiler options and then you'll see the errors the CI found:

cookie.c: In function ‘get_top_domain’:
cookie.c:254:15: error: conversion to ‘unsigned int’ from ‘size_t’ may alter its value [-Werror=conversion]
   len = strlen(in);
               ^
cookie.c:266:34: error: conversion to ‘unsigned int’ from ‘long int’ may alter its value [-Werror=conversion]
     *outlen = len - (first - in) - 1;
                                  ^
cc1: all warnings being treated as errors

lib/cookie.c Outdated
/*
* Return the top-level domain, for optimal hashing.
*/
static const char *get_top_domain(const char * const in, unsigned *outlen)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd appreciate a comment explaining what 'in' points to here!

lib/cookie.c Outdated
/*
* Hash this domain.
*/
static size_t cookiehash(const char * const in)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... this also uses 'in' without much explanation. Use a better name or add a comment saying what it is?

lib/cookie.c Outdated
@@ -1304,9 +1363,12 @@ void Curl_cookie_clearsess(struct CookieInfo *cookies)
****************************************************************************/
void Curl_cookie_cleanup(struct CookieInfo *c)
{
unsigned i;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please spell out unsigned int fully if that's what you want!

lib/cookie.c Outdated
@@ -1355,6 +1417,7 @@ static int cookie_output(struct CookieInfo *c, const char *dumphere)
FILE *out;
bool use_stdout = FALSE;
char *format_ptr;
unsigned i;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here too

lib/cookie.c Outdated
@@ -1406,26 +1471,29 @@ static struct curl_slist *cookie_list(struct Curl_easy *data)
struct curl_slist *beg;
struct Cookie *c;
char *line;
unsigned i;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here

struct CookieInfo {
/* linked list of cookies we know of */
struct Cookie *cookies;
struct Cookie *cookies[COOKIE_HASH_SIZE];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot of linked lists. Remember that most users of libcurl still only use very few cookies, most have much fewer than 256 in total. Can you tell us something on why you want 256 here and perhaps some measurements with smaller numbers?

@clbr
Copy link
Contributor Author

clbr commented Apr 1, 2018

256 pointers is 2kb RAM on 64-bit and 1kb on 32-bit, so the overhead shouldn't harm even the lowest embedded targets curl is run on. 256 was selected being a power-of-two (for the mod), and for dividing the 8k cookies so each bucket is suitably little filled. As for 8k cookies being the target, that's about the number my browsing has plateaued at over the years - so normal browsing tends to result in that number, it doesn't increase much with additional use.

For my cookie set and google.com, it looks like each bucket accessed had little other domains, meaning higher numbers wouldn't help much. However, halving to 128 or even lower would on average double the amount in each bucket, and lower performance accordingly. So 256 seems the optimal number, and has little downsides.

@bagder
Copy link
Member

bagder commented Apr 2, 2018

Thanks!

@bagder bagder closed this in c990ead Apr 2, 2018
if(first == last)
return domain;

first = memrchr(domain, '.', (size_t)(last - domain - 1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why -1 ? This will miss character before the last dot, causing blabla..org to set first = NULL


first = memrchr(domain, '.', (size_t)(last - domain - 1));
if(outlen)
*outlen = len - (size_t)(first - domain) - 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same remark. *outlen will not include last character.

return 0;

top = get_top_domain(domain, &len);
return Curl_hash_str((void *) top, len, COOKIE_HASH_SIZE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should compute hash on lowercase domain to avoid missing a match in later strcasecompare().

@clbr
Copy link
Contributor Author

clbr commented Apr 3, 2018

Domain names with two consequtive dots are invalid. Lowecasing would indeed be necessary.

@monnerat
Copy link
Contributor

monnerat commented Apr 3, 2018

Domain names with two consequtive dots are invalid.

I agree, but an invalid domain name can be passed in anyway. See segfault in https://travis-ci.org/curl/curl/jobs/361105163.

@clbr
Copy link
Contributor Author

clbr commented Apr 3, 2018

Is this the right place to catch it? I'd think invalid domains would be caught before adding them to the cookie list, or querying the cookie list.

Same remark. *outlen will not include last character.

Not including -1 here would cause wrong results. "foo.bar.com" -> "bar.com" length 8

@monnerat
Copy link
Contributor

monnerat commented Apr 3, 2018

Is this the right place to catch it?

Not to catch, but tolerate.

Not including -1 here would cause wrong results. "foo.bar.com" -> "bar.com" length 8

My bad: OK for this one: you need to strip the initial dot. I still do not agree for the other.

@monnerat
Copy link
Contributor

monnerat commented Apr 3, 2018

I've been able to reproduce the segfault manually:
Create an injar file with line
domain..tld / FALSE 1739150993 mooo indeed
Then issue the command
curl --verbose --resolve domain..tld:80:<http-server-ip> -b injar http://domain..tld/
Where <http-server-ip> is the ip address of any responding http server.

@clbr
Copy link
Contributor Author

clbr commented Apr 3, 2018 via email

@monnerat
Copy link
Contributor

monnerat commented Apr 4, 2018

Can you test with this with the fuzzer?

No I can't: this is done by CI when submitting a pull request and I presume this is fed with random data. I don't know if it can be done manually.

However I have checked your fix locally as described above and it works.

That said, if you're chasing performance, this is only a matter of a few cycles and I'm not convinced this is more efficient than having a extra byte in memrchr and suppress a 1 subtraction.

We could even do better by saving the memchr call:

static const char *get_top_domain(const char * const domain, size_t *outlen)
{
  size_t len;
  const char *first = NULL, *last;

  if(!domain)
    return NULL;

  len = strlen(domain);
  last = memrchr(domain, '.', len);
  if(last) {
    first = memrchr(domain, '.', (size_t) (last - domain));
    if(first)
      len -= (size_t) (++first - domain);
  }

  if(outlen)
    *outlen = len;

  return first? first: domain;
}

@jay
Copy link
Member

jay commented Apr 4, 2018

Can you test with this with the fuzzer?

No I can't: this is done by CI when submitting a pull request and I presume this is fed with random data. I don't know if it can be done manually.

@cmeister2 are we fuzzing cookies at present

@bagder
Copy link
Member

bagder commented Apr 4, 2018

I've been able to reproduce the segfault manually:

@monnerat, how about turning that into a "real" test case or adding that to an existing cookie test?

@clbr
Copy link
Contributor Author

clbr commented Apr 4, 2018 via email

@cmeister2
Copy link
Contributor

@jay Generally yes: https://github.com/curl/curl-fuzzer/blob/master/curl_fuzzer.cc#L192 ensures that cookies that are set in HTTP testing are stored/parsed/etc.

The areas which are not tested by this are mostly related to reading cookies from file, which is not yet fuzzed.

monnerat added a commit that referenced this pull request Apr 4, 2018
This fixes a segfault occurring when a name of the (invalid) form "domain..tld"
is processed.

test46 updated to cover this case.

Follow-up to commit c990ead.

Ref: #2440
@monnerat
Copy link
Contributor

monnerat commented Apr 4, 2018

@clbr : segfault fix pushed 82dfdac
@bagder : test46 now also checks this segfault case.

@lock lock bot locked as resolved and limited conversation to collaborators Jul 3, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants