Change cookie storage to a top-level-domain-specific hash table #2440
Conversation
Thank you, this looks like an awesome improvement as the cookie handling has been lacking in the performance department for a while! If you run
lib/cookie.c (outdated):

```c
/*
 * Return the top-level domain, for optimal hashing.
 */
static const char *get_top_domain(const char * const in, unsigned *outlen)
```
I'd appreciate a comment explaining what 'in' points to here!
lib/cookie.c (outdated):

```c
/*
 * Hash this domain.
 */
static size_t cookiehash(const char * const in)
```
... this also uses 'in' without much explanation. Use a better name or add a comment saying what it is?
lib/cookie.c (outdated):

```diff
@@ -1304,9 +1363,12 @@ void Curl_cookie_clearsess(struct CookieInfo *cookies)
  ****************************************************************************/
 void Curl_cookie_cleanup(struct CookieInfo *c)
 {
+  unsigned i;
```
please spell out `unsigned int` fully if that's what you want!
lib/cookie.c (outdated):

```diff
@@ -1355,6 +1417,7 @@ static int cookie_output(struct CookieInfo *c, const char *dumphere)
   FILE *out;
   bool use_stdout = FALSE;
   char *format_ptr;
+  unsigned i;
```
here too
lib/cookie.c (outdated):

```diff
@@ -1406,26 +1471,29 @@ static struct curl_slist *cookie_list(struct Curl_easy *data)
   struct curl_slist *beg;
   struct Cookie *c;
   char *line;
+  unsigned i;
```
and here
```diff
 struct CookieInfo {
   /* linked list of cookies we know of */
-  struct Cookie *cookies;
+  struct Cookie *cookies[COOKIE_HASH_SIZE];
```
This is a lot of linked lists. Remember that most users of libcurl still use very few cookies; most have far fewer than 256 in total. Can you tell us why you want 256 here, and perhaps share some measurements with smaller numbers?
256 pointers is 2 KB of RAM on 64-bit and 1 KB on 32-bit, so the overhead shouldn't harm even the lowest embedded targets curl runs on. 256 was selected for being a power of two (for the mod), and for dividing the 8k cookies so that each bucket stays lightly filled. As for 8k cookies being the target, that's about the number my browsing has plateaued at over the years: normal browsing tends to end up around that number, and it doesn't increase much with additional use. For my cookie set and google.com, each bucket accessed held few other domains, meaning higher numbers wouldn't help much. However, halving to 128 or lower would on average double the amount in each bucket, and lower performance accordingly. So 256 seems the optimal number, with few downsides.
Thanks!
```c
if(first == last)
  return domain;

first = memrchr(domain, '.', (size_t)(last - domain - 1));
```
Why -1? This will miss the character before the last dot, causing "blabla..org" to set first = NULL.
```c
first = memrchr(domain, '.', (size_t)(last - domain - 1));
if(outlen)
  *outlen = len - (size_t)(first - domain) - 1;
```
Same remark: *outlen will not include the last character.
```c
return 0;

top = get_top_domain(domain, &len);
return Curl_hash_str((void *) top, len, COOKIE_HASH_SIZE);
```
The hash should be computed on the lowercased domain, to avoid missing a match in the later strcasecompare().
Domain names with two consecutive dots are invalid. Lowercasing would indeed be necessary.
I agree, but an invalid domain name can be passed in anyway. See segfault in https://travis-ci.org/curl/curl/jobs/361105163.
Is this the right place to catch it? I'd think invalid domains would be caught before they are added to the cookie list, or before the list is queried.
Not including the -1 here would cause wrong results: "foo.bar.com" -> "bar.com" would get length 8 instead of 7.
Not to catch, but tolerate.
My bad, OK for this one: you need to strip the initial dot. I still do not agree about the other.
I've been able to reproduce the segfault manually:
Since this is on a fast path, I think putting the check in the if would be slightly faster. It's not an off-by-one per se: since valid domain parts are required to be at least one character long, it's a small optimization not to search the first character.
Can you test this with the fuzzer? The if changed from `if(first == last)` to `if(first == last || first == last - 1)`.
No, I can't: this is done by CI when submitting a pull request, and I presume it is fed with random data. I don't know if it can be done manually. However, I have checked your fix locally as described above, and it works. That said, if you're chasing performance, this is only a matter of a few cycles, and I'm not convinced it is more efficient than searching an extra byte in memrchr and dropping a subtraction. We could even do better by saving the memchr call:
@cmeister2 are we fuzzing cookies at present?
@monnerat, how about turning that into a "real" test case, or adding it to an existing cookie test?
@monnerat I like your rewrite, can you send it yourself so it's attributed properly? I'll send the lowercasing later when I have time.
@jay Generally yes: https://github.com/curl/curl-fuzzer/blob/master/curl_fuzzer.cc#L192 ensures that cookies that are set in HTTP testing are stored/parsed/etc. The areas which are not tested by this are mostly related to reading cookies from file, which is not yet fuzzed.
This fixes a segfault occurring when a name of the (invalid) form "domain..tld" is processed. test46 is updated to cover this case. Follow-up to commit c990ead. Ref: #2440
This series improves curl's performance greatly when there are a lot of cookies. All tests pass.
Since a big part of the change is whitespace-only, IIRC you can add ?w=1 to GitHub URLs to ignore whitespace.
The testcase was about 4k cookies, and loading google.com in WebkitFLTK's test browser.
Before:
After:
I also checked the Curl_hash_str function's performance using a set of 8k real cookies. It performed quite well, spreading the domains around very evenly.