Change cookie storage to a top-level-domain-specific hash table #2440

clbr · 2018-03-30T15:03:50Z

This series improves curl's performance greatly when there are a lot of cookies. All tests pass.
Since a big part changes whitespace only, IIRC you can add ?w=1 to github urls to ignore whitespace.

The testcase was about 4k cookies, and loading google.com in WebkitFLTK's test browser.

Before:

Curl_cookie_init for cookies.dat took 571 ms
Curl_cookie_getlist for /... took 1019 us, 3953 tested
Curl_cookie_getlist for /... took 846 us, 3954 tested
Curl_cookie_getlist for /?gws_rd... took 899 us, 3956 tested
Curl_cookie_getlist for /images/... took 281 us, 3956 tested
Curl_cookie_getlist for /status?... took 284 us, 3956 tested
Curl_cookie_getlist for /gb/imag... took 431 us, 3956 tested
Curl_cookie_getlist for /images/... took 446 us, 3956 tested
Curl_cookie_getlist for /xjs/_/j... took 458 us, 3956 tested
Curl_cookie_getlist for /gen_204... took 1131 us, 3956 tested
Curl_cookie_getlist for /og/_/js... took 547 us, 3956 tested
Curl_cookie_getlist for /xjs/_/j... took 543 us, 3956 tested
Curl_cookie_getlist for /inputto... took 350 us, 3957 tested
Curl_cookie_getlist for /gen_204... took 414 us, 3957 tested
Curl_cookie_getlist for /og/_/ss... took 574 us, 3957 tested
Curl_cookie_getlist for /og/_/js... took 320 us, 3957 tested
Curl_cookie_getlist for /xjs/_/j... took 344 us, 3957 tested
Curl_cookie_getlist for /_/scs/a... took 925 us, 3957 tested
Curl_cookie_getlist for /images/... took 542 us, 3957 tested
Curl_cookie_getlist for /gen_204... took 730 us, 3957 tested
Curl_cookie_getlist for /gen_204... took 455 us, 3957 tested
Curl_cookie_getlist for /adsid/g... took 844 us, 3957 tested
Curl_cookie_getlist for /domainl... took 443 us, 3957 tested
Curl_cookie_getlist for /domainl... took 273 us, 3958 tested

After

url_cookie_init for cookies.dat took 176 ms
Curl_cookie_getlist for /... took 308 us, 10 tested
Curl_cookie_getlist for /... took 309 us, 11 tested
Curl_cookie_getlist for /?gws_rd... took 280 us, 13 tested
Curl_cookie_getlist for /images/... took 103 us, 13 tested
Curl_cookie_getlist for /status?... took 342 us, 13 tested
Curl_cookie_getlist for /gb/imag... took 89 us, 6 tested
Curl_cookie_getlist for /images/... took 82 us, 13 tested
Curl_cookie_getlist for /xjs/_/j... took 81 us, 13 tested
Curl_cookie_getlist for /gen_204... took 385 us, 13 tested
Curl_cookie_getlist for /og/_/js... took 93 us, 6 tested
Curl_cookie_getlist for /xjs/_/j... took 301 us, 14 tested
Curl_cookie_getlist for /inputto... took 78 us, 6 tested
Curl_cookie_getlist for /_/scs/a... took 297 us, 14 tested
Curl_cookie_getlist for /gen_204... took 186 us, 14 tested
Curl_cookie_getlist for /og/_/ss... took 266 us, 6 tested
Curl_cookie_getlist for /xjs/_/j... took 87 us, 14 tested
Curl_cookie_getlist for /og/_/js... took 452 us, 6 tested
Curl_cookie_getlist for /images/... took 282 us, 14 tested
Curl_cookie_getlist for /gen_204... took 91 us, 14 tested
Curl_cookie_getlist for /gen_204... took 81 us, 14 tested
Curl_cookie_getlist for /domainl... took 91 us, 7 tested
Curl_cookie_getlist for /domainl... took 98 us, 14 tested
Curl_cookie_getlist for /adsid/g... took 99 us, 14 tested

I also checked the Curl_hash_str function's performance using a set of 8k real cookies. It performed quite well, spreading the domains around very evenly.

42, 30, 23, 75, 38, 17, 16, 47, 43, 16, 31, 26, 21, 29, 23, 24, 19, 63, 21, 39, 31, 36, 43, 27, 15, 19, 41, 49, 40, 26, 19, 49, 32, 17, 32, 22, 13, 23, 20, 21, 29, 20, 19, 15, 26, 45, 54, 34, 52, 19, 46, 28, 47, 30, 16, 38, 31, 21, 18, 35, 7, 30, 24, 36, 54, 23, 29, 47, 29, 49, 26, 28, 19, 24, 20, 16, 13, 21, 27, 20, 58, 37, 10, 30, 35, 25, 22, 77, 27, 15, 24, 18, 22, 20, 64, 7, 26, 58, 39, 15, 16, 21, 91, 36, 24, 20, 20, 25, 34, 11, 36, 28, 42, 19, 20, 27, 9, 47, 23, 48, 18, 12, 33, 30, 31, 30, 25, 20, 18, 40, 99, 21, 20, 45, 27, 12, 18, 37, 27, 22, 27, 24, 22, 18, 24, 3, 23, 24, 69, 25, 20, 43, 22, 15, 33, 44, 25, 52, 29, 39, 63, 37, 16, 29, 23, 22, 69, 32, 41, 20, 38, 31, 22, 24, 18, 19, 61, 34, 31, 22, 42, 28, 16, 31, 33, 12, 31, 30, 79, 31, 60, 20, 74, 18, 29, 38, 30, 30, 27, 32, 26, 65, 40, 9, 28, 29, 30, 19, 17, 23, 32, 21, 65, 21, 17, 42, 31, 125, 11, 28, 73, 22, 73, 20, 17, 21, 75, 17, 17, 11, 70, 29, 32, 31, 74, 36, 35, 18, 43, 49, 38, 71, 21, 59, 22, 52, 26, 5, 25, 22, 47, 37, 25, 27, 16, 64,
low high avg 3 125 31

…implementation detail)

bagder · 2018-03-30T18:38:40Z

Thank you, this looks like an awesome improvement as the cookie handling has been lacking in the performance department for a while!

If you run ./configure --enable-debug you'll get pickier compiler options and then you'll see the errors the CI found:

cookie.c: In function ‘get_top_domain’:
cookie.c:254:15: error: conversion to ‘unsigned int’ from ‘size_t’ may alter its value [-Werror=conversion]
   len = strlen(in);
               ^
cookie.c:266:34: error: conversion to ‘unsigned int’ from ‘long int’ may alter its value [-Werror=conversion]
     *outlen = len - (first - in) - 1;
                                  ^
cc1: all warnings being treated as errors

bagder · 2018-03-31T19:42:38Z

lib/cookie.c

+/*
+ * Return the top-level domain, for optimal hashing.
+ */
+static const char *get_top_domain(const char * const in, unsigned *outlen)


I'd appreciate a comment explaining what 'in' points to here!

bagder · 2018-03-31T19:43:12Z

lib/cookie.c

+/*
+ * Hash this domain.
+ */
+static size_t cookiehash(const char * const in)


... this also uses 'in' without much explanation. Use a better name or add a comment saying what it is?

bagder · 2018-03-31T19:44:27Z

lib/cookie.c

@@ -1304,9 +1363,12 @@ void Curl_cookie_clearsess(struct CookieInfo *cookies)
 ****************************************************************************/
 void Curl_cookie_cleanup(struct CookieInfo *c)
 {
+  unsigned i;


please spell out unsigned int fully if that's what you want!

bagder · 2018-03-31T19:44:40Z

lib/cookie.c

@@ -1355,6 +1417,7 @@ static int cookie_output(struct CookieInfo *c, const char *dumphere)
  FILE *out;
  bool use_stdout = FALSE;
  char *format_ptr;
+  unsigned i;


bagder · 2018-03-31T19:45:10Z

lib/cookie.c

@@ -1406,26 +1471,29 @@ static struct curl_slist *cookie_list(struct Curl_easy *data)
  struct curl_slist *beg;
  struct Cookie *c;
  char *line;
+  unsigned i;


bagder · 2018-03-31T19:51:25Z

lib/cookie.h

 struct CookieInfo {
  /* linked list of cookies we know of */
-  struct Cookie *cookies;
+  struct Cookie *cookies[COOKIE_HASH_SIZE];


This is a lot of linked lists. Remember that most users of libcurl still only use very few cookies, most have much fewer than 256 in total. Can you tell us something on why you want 256 here and perhaps some measurements with smaller numbers?

clbr · 2018-04-01T07:49:04Z

256 pointers is 2kb RAM on 64-bit and 1kb on 32-bit, so the overhead shouldn't harm even the lowest embedded targets curl is run on. 256 was selected being a power-of-two (for the mod), and for dividing the 8k cookies so each bucket is suitably little filled. As for 8k cookies being the target, that's about the number my browsing has plateaued at over the years - so normal browsing tends to result in that number, it doesn't increase much with additional use.

For my cookie set and google.com, it looks like each bucket accessed had little other domains, meaning higher numbers wouldn't help much. However, halving to 128 or even lower would on average double the amount in each bucket, and lower performance accordingly. So 256 seems the optimal number, and has little downsides.

bagder · 2018-04-02T08:51:55Z

Thanks!

monnerat · 2018-04-03T12:26:47Z

lib/cookie.c

+  if(first == last)
+    return domain;
+
+  first = memrchr(domain, '.', (size_t)(last - domain - 1));


Why -1 ? This will miss character before the last dot, causing blabla..org to set first = NULL

monnerat · 2018-04-03T12:27:52Z

lib/cookie.c

+
+  first = memrchr(domain, '.', (size_t)(last - domain - 1));
+  if(outlen)
+    *outlen = len - (size_t)(first - domain) - 1;


Same remark. *outlen will not include last character.

monnerat · 2018-04-03T12:39:30Z

lib/cookie.c

+    return 0;
+
+  top = get_top_domain(domain, &len);
+  return Curl_hash_str((void *) top, len, COOKIE_HASH_SIZE);


Should compute hash on lowercase domain to avoid missing a match in later strcasecompare().

clbr · 2018-04-03T12:54:33Z

Domain names with two consequtive dots are invalid. Lowecasing would indeed be necessary.

monnerat · 2018-04-03T12:57:25Z

Domain names with two consequtive dots are invalid.

I agree, but an invalid domain name can be passed in anyway. See segfault in https://travis-ci.org/curl/curl/jobs/361105163.

clbr · 2018-04-03T13:06:30Z

Is this the right place to catch it? I'd think invalid domains would be caught before adding them to the cookie list, or querying the cookie list.

Same remark. *outlen will not include last character.

Not including -1 here would cause wrong results. "foo.bar.com" -> "bar.com" length 8

monnerat · 2018-04-03T13:13:10Z

Is this the right place to catch it?

Not to catch, but tolerate.

Not including -1 here would cause wrong results. "foo.bar.com" -> "bar.com" length 8

My bad: OK for this one: you need to strip the initial dot. I still do not agree for the other.

monnerat · 2018-04-03T13:35:16Z

I've been able to reproduce the segfault manually:
Create an injar file with line
domain..tld / FALSE 1739150993 mooo indeed
Then issue the command
curl --verbose --resolve domain..tld:80:<http-server-ip> -b injar http://domain..tld/
Where <http-server-ip> is the ip address of any responding http server.

clbr · 2018-04-03T15:02:47Z

Since this is on a fast path, I think putting the check in the if would be slightly faster; it's not a off-by-one per se, since valid domain parts are required to be at least one char long, it's a small optimization not to search the first char. Can you test with this with the fuzzer? The if changed from if(first == last) to if(first == last || first == last - 1)

monnerat · 2018-04-04T00:20:22Z

Can you test with this with the fuzzer?

No I can't: this is done by CI when submitting a pull request and I presume this is fed with random data. I don't know if it can be done manually.

However I have checked your fix locally as described above and it works.

That said, if you're chasing performance, this is only a matter of a few cycles and I'm not convinced this is more efficient than having a extra byte in memrchr and suppress a 1 subtraction.

We could even do better by saving the memchr call:

static const char *get_top_domain(const char * const domain, size_t *outlen)
{
  size_t len;
  const char *first = NULL, *last;

  if(!domain)
    return NULL;

  len = strlen(domain);
  last = memrchr(domain, '.', len);
  if(last) {
    first = memrchr(domain, '.', (size_t) (last - domain));
    if(first)
      len -= (size_t) (++first - domain);
  }

  if(outlen)
    *outlen = len;

  return first? first: domain;
}

jay · 2018-04-04T02:45:12Z

Can you test with this with the fuzzer?

No I can't: this is done by CI when submitting a pull request and I presume this is fed with random data. I don't know if it can be done manually.

@cmeister2 are we fuzzing cookies at present

bagder · 2018-04-04T07:05:30Z

I've been able to reproduce the segfault manually:

@monnerat, how about turning that into a "real" test case or adding that to an existing cookie test?

clbr · 2018-04-04T07:26:11Z

@monnerat I like your rewrite, can you send it yourself so it's attributed properly? I'll send the lowercasing later when I have time.

cmeister2 · 2018-04-04T07:50:53Z

@jay Generally yes: https://github.com/curl/curl-fuzzer/blob/master/curl_fuzzer.cc#L192 ensures that cookies that are set in HTTP testing are stored/parsed/etc.

The areas which are not tested by this are mostly related to reading cookies from file, which is not yet fuzzed.

This fixes a segfault occurring when a name of the (invalid) form "domain..tld" is processed. test46 updated to cover this case. Follow-up to commit c990ead. Ref: #2440

monnerat · 2018-04-04T13:42:06Z

@clbr : segfault fix pushed 82dfdac
@bagder : test46 now also checks this segfault case.

clbr added 3 commits March 30, 2018 17:42

Move isip higher in the file

2010edd

Change cookie storage to a top-level-domain-specific hash table

b8d5036

Update test 46 expected jar, it depended on specific printout order (…

ff9aee6

…implementation detail)

clbr mentioned this pull request Mar 30, 2018

When reading cookies from a file, only run remove_expired once #2441

Closed

Squash some -Wconversion warnings

0c10491

bagder reviewed Mar 31, 2018

View reviewed changes

clbr added 3 commits April 1, 2018 10:53

Rename get_top_domain's param

3315a08

Rename cookiehash's param

73985b2

Spell out unsigned int

fef35ae

bagder closed this in c990ead Apr 2, 2018

monnerat reviewed Apr 3, 2018

View reviewed changes

clbr mentioned this pull request Apr 5, 2018

Use case-insensitive hashing for the cookie domains #2458

Closed

lock bot locked as resolved and limited conversation to collaborators Jul 3, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change cookie storage to a top-level-domain-specific hash table #2440

Change cookie storage to a top-level-domain-specific hash table #2440

clbr commented Mar 30, 2018

bagder commented Mar 30, 2018

bagder Mar 31, 2018

bagder Mar 31, 2018

bagder Mar 31, 2018

bagder Mar 31, 2018

bagder Mar 31, 2018

bagder Mar 31, 2018

clbr commented Apr 1, 2018

bagder commented Apr 2, 2018

monnerat Apr 3, 2018

monnerat Apr 3, 2018

monnerat Apr 3, 2018

clbr commented Apr 3, 2018

monnerat commented Apr 3, 2018

clbr commented Apr 3, 2018

monnerat commented Apr 3, 2018

monnerat commented Apr 3, 2018

clbr commented Apr 3, 2018 via email

monnerat commented Apr 4, 2018

jay commented Apr 4, 2018

bagder commented Apr 4, 2018

clbr commented Apr 4, 2018 via email

cmeister2 commented Apr 4, 2018

monnerat commented Apr 4, 2018

Change cookie storage to a top-level-domain-specific hash table #2440

Change cookie storage to a top-level-domain-specific hash table #2440

Conversation

clbr commented Mar 30, 2018

bagder commented Mar 30, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

clbr commented Apr 1, 2018

bagder commented Apr 2, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

clbr commented Apr 3, 2018

monnerat commented Apr 3, 2018

clbr commented Apr 3, 2018

monnerat commented Apr 3, 2018

monnerat commented Apr 3, 2018

clbr commented Apr 3, 2018 via email

monnerat commented Apr 4, 2018

jay commented Apr 4, 2018

bagder commented Apr 4, 2018

clbr commented Apr 4, 2018 via email

cmeister2 commented Apr 4, 2018

monnerat commented Apr 4, 2018