[WIP] Avoid using free'd easy handles referenced from connection cache #2669

Closed
wants to merge 4 commits into
from

Conversation

Projects
None yet
2 participants
@dtzWill
Contributor

dtzWill commented Jun 18, 2018

For a while now I've been trying to sort out a problem with Nix's usage of libcurl,
one that is especially apparent when building with address sanitizer enabled
or building against musl.


The commits in this PR appear to "resolve" the issue but I don't believe are really the right answer,
as they only mask the real problem: free'd handles end up being referred to from the connection cache.


The usage is this:

  • we create a multi handle that lives more or less forever
  • single thread handles everything curl-related, processes
    a queue of requests populated by other threads and for each
    creates an easy handle and adds to multi.
  • A mapping of easy handle to our own datastructure is maintained, AFAIK this works
    (similar to the ASIO example or one of those, as I recall)
  • Loop is basically this:
    • curl_multi_perform
    • Read all available messages from curl_multi_info_read, freeing the handles as we go
    • curl_multi_wait (we add an extra FD that's a wake-up pipe when new requests are added)
    • Create handles for any new requests and add to the multi handle
    • (minor: check if thread has been asked to quit, otherwise keep looping)

We make many many requests that use http2/openssl and enable multiplexing (but not http1 pipelining).

If it matters we also enable PIPEWAIT and MAXCONNECTS is set to 25.


Eventually, and almost immediately with ASAN or quickly with musl, it appears the memory for a deallocated handle
is used for something else and everything goes south when curl attempts to "check if the connection is dead"
which involves -- for http2+ssl at least-- using data-structures and buffers that are no longer present.

I guess glibc is either less quick to recycle free'd memory or its use of thread-local arenas mostly
mitigates the problem since the thread creating/free'ing curl handles really does nothing else and so
is less likely to run into problems.

These commits workaround this by checking if the magic bit is unset, but that of course only works sometimes
and still requires accessing free'd memory and hoping for the best.


Here's an example report from address sanitizer: https://gist.github.com/dtzWill/4b02c8943ec3be4875ab54186e2cc176


I see that in general it's known to be problematic to remove handles possibly involved in pipelines,
but I thought I'd double-check that this particular problem is known/expected and to perhaps ask
if there's a recommendation for avoiding these problems.

As far as I can tell the easy handles added to a multi handle are expected to outlive the multi-handle
(although this is not mentioned elsewhere, so I'd appreciate confirmation) since they may be referenced
from the multi handle's connection cache indefinitely and there is no way to clear or query this cache.

I tried using a pool of ~100 connection handles --and basically never free them (curl_easy_cleanup) until we're entirely done.
This seems to mitigate the issue but I don't know that this can be relied upon as it may only appear to work
by making it less problematic when multi handle attempts to deference easy handles that have been removed
(since the pool ensures this memory will likely be unused for at least a while) but in the worst case
might break if the cache refers to a handle long enough for us to cycle through the pool, in which case
the connection check might do something silly like attempt to read into the buffer of a new handle
that may or may not really be part of the cached connection referencing it.

Anyway-- I suppose this is a bug report of sorts, but also a request for insights into what we're doing wrong
or how to restructure things to use the libcurl API in a more expected/supported manner.

Thank you for your time!


FWIW this issue seems particularly problematic in 7.60 (and still in latest git), it does not happen in 7.59 or the handful of previous versions I tested.

dtzWill added some commits Jun 8, 2018

connisalive: detect and avoid using invalid handle pointers
These checks are redundant on the http2 path,
but might be better to check here to be more "general".
don't try to recv if data handle is null or deallocated
Ideally such a pointer wouldn't still be present,
but seems to happen in some circumstances.

This papers over the issue--if a new handle (or other object!)
is now using this memory Bad Things (tm) will happen.
@bagder

This comment has been minimized.

Show comment
Hide comment
@bagder

bagder Jun 18, 2018

Member

connections are kept in the connection cache, not easy handles. Once you remove an easy handle from the multi handle it can be killed right away. That's a common pattern and one that we've always said works. When the multi interface is used, re-using easy handles is mostly just avoiding free() / malloc() since the connections, the DNS cache etc are then all owned by the multi handle anyway.

in general it's known to be problematic to remove handles possibly involved in pipelines

Yes, if that handle uses a live connection involved in a HTTP/1.1 pipeline - not when HTTP/2 multiplexing is used.

I would urge you to try to reproduce the problem with a stand-alone example that mimics how your "real" application works but with as much as possible of the application logic removed and the libcurl logic left that causes the problems.

given your log output, it looks like a struct connectdata or parts of that ends up freed while apparently still in use. I have not seen this happen.

Member

bagder commented Jun 18, 2018

connections are kept in the connection cache, not easy handles. Once you remove an easy handle from the multi handle it can be killed right away. That's a common pattern and one that we've always said works. When the multi interface is used, re-using easy handles is mostly just avoiding free() / malloc() since the connections, the DNS cache etc are then all owned by the multi handle anyway.

in general it's known to be problematic to remove handles possibly involved in pipelines

Yes, if that handle uses a live connection involved in a HTTP/1.1 pipeline - not when HTTP/2 multiplexing is used.

I would urge you to try to reproduce the problem with a stand-alone example that mimics how your "real" application works but with as much as possible of the application logic removed and the libcurl logic left that causes the problems.

given your log output, it looks like a struct connectdata or parts of that ends up freed while apparently still in use. I have not seen this happen.

@dtzWill

This comment has been minimized.

Show comment
Hide comment
@dtzWill

dtzWill Jun 18, 2018

Contributor

Okay, will do thank you. I say easy handles are referenced from connection cache since the connection structs have a data field that refers to an easy handle--I'm not sure if this is allowed to be NULL or the most-recently-used-handle or what, but I'm seeing these refer to handles that have been removed (curl_multi_remove_handle) and then free'd via curl_easy_cleanup; it sounds like this process certainly should be ensuring the handle is no longer referenced by the cached connection entries?

That's encouraging to hear! I'll see about creating a standalone example although it may take some time.

Thanks for the quick response!

Contributor

dtzWill commented Jun 18, 2018

Okay, will do thank you. I say easy handles are referenced from connection cache since the connection structs have a data field that refers to an easy handle--I'm not sure if this is allowed to be NULL or the most-recently-used-handle or what, but I'm seeing these refer to handles that have been removed (curl_multi_remove_handle) and then free'd via curl_easy_cleanup; it sounds like this process certainly should be ensuring the handle is no longer referenced by the cached connection entries?

That's encouraging to hear! I'll see about creating a standalone example although it may take some time.

Thanks for the quick response!

@bagder

This comment has been minimized.

Show comment
Hide comment
@bagder

bagder Jun 18, 2018

Member

the conn->data pointer should point to a legitimate easy handle whenever that connection is in use, otherwise it is the "most-recently-used-handle" in some cases. A cleaner approach would be to make sure to NULLify that pointer properly when the easy handle that it pointed to is gone or when the connection is detached from it.

Member

bagder commented Jun 18, 2018

the conn->data pointer should point to a legitimate easy handle whenever that connection is in use, otherwise it is the "most-recently-used-handle" in some cases. A cleaner approach would be to make sure to NULLify that pointer properly when the easy handle that it pointed to is gone or when the connection is detached from it.

@dtzWill

This comment has been minimized.

Show comment
Hide comment
@dtzWill

dtzWill Jun 19, 2018

Contributor

This example reliably produces the crash for me, although unsure why it's not happening on travis.

But log of such a crash is included, let me know if you can reproduce or if more information is helpful.

https://github.com/dtzWill/curl-repro

EDIT: oh hooray travis did reproduce it at least one time, now linked from README

Contributor

dtzWill commented Jun 19, 2018

This example reliably produces the crash for me, although unsure why it's not happening on travis.

But log of such a crash is included, let me know if you can reproduce or if more information is helpful.

https://github.com/dtzWill/curl-repro

EDIT: oh hooray travis did reproduce it at least one time, now linked from README

@bagder

This comment has been minimized.

Show comment
Hide comment
@bagder

bagder Jun 19, 2018

Member

I think it is a stale ->data pointer, much like we discussed earlier. I seem to have a hard time to reproduce this, can you check if this little patch makes any difference?

diff --git a/lib/url.c b/lib/url.c
index d29eddaea..0cab0a303 100644
--- a/lib/url.c
+++ b/lib/url.c
@@ -963,10 +963,11 @@ static bool extract_if_dead(struct connectdata *conn,
     /* The check for a dead socket makes sense only if there are no
        handles in pipeline and the connection isn't already marked in
        use */
     bool dead;
 
+    conn->data = data;
     if(conn->handler->connection_check) {
       /* The protocol has a special method for checking the state of the
          connection. Use it to check if the connection is dead. */
       unsigned int state;
 
@@ -977,11 +978,10 @@ static bool extract_if_dead(struct connectdata *conn,
       /* Use the general method for determining the death of a connection */
       dead = SocketIsDead(conn->sock[FIRSTSOCKET]);
     }
 
     if(dead) {
-      conn->data = data;
       infof(data, "Connection %ld seems to be dead!\n", conn->connection_id);
       Curl_conncache_remove_conn(conn, FALSE);
       return TRUE;
     }
   }

Possible extra thing to make the pointer not point to a dead handle:

diff --git a/lib/conncache.c b/lib/conncache.c
index 6bd06582a..066542915 100644
--- a/lib/conncache.c
+++ b/lib/conncache.c
@@ -449,10 +449,11 @@ bool Curl_conncache_return_conn(struct connectdata *conn)
       (void)Curl_disconnect(conn_candidate, /* dead_connection */ FALSE);
     }
   }
   CONN_LOCK(data);
   conn->inuse = FALSE; /* Mark the connection unused */
+  conn->data = NULL; /* no owner */
   CONN_UNLOCK(data);
 
   return (conn_candidate == conn) ? FALSE : TRUE;
 
 }
Member

bagder commented Jun 19, 2018

I think it is a stale ->data pointer, much like we discussed earlier. I seem to have a hard time to reproduce this, can you check if this little patch makes any difference?

diff --git a/lib/url.c b/lib/url.c
index d29eddaea..0cab0a303 100644
--- a/lib/url.c
+++ b/lib/url.c
@@ -963,10 +963,11 @@ static bool extract_if_dead(struct connectdata *conn,
     /* The check for a dead socket makes sense only if there are no
        handles in pipeline and the connection isn't already marked in
        use */
     bool dead;
 
+    conn->data = data;
     if(conn->handler->connection_check) {
       /* The protocol has a special method for checking the state of the
          connection. Use it to check if the connection is dead. */
       unsigned int state;
 
@@ -977,11 +978,10 @@ static bool extract_if_dead(struct connectdata *conn,
       /* Use the general method for determining the death of a connection */
       dead = SocketIsDead(conn->sock[FIRSTSOCKET]);
     }
 
     if(dead) {
-      conn->data = data;
       infof(data, "Connection %ld seems to be dead!\n", conn->connection_id);
       Curl_conncache_remove_conn(conn, FALSE);
       return TRUE;
     }
   }

Possible extra thing to make the pointer not point to a dead handle:

diff --git a/lib/conncache.c b/lib/conncache.c
index 6bd06582a..066542915 100644
--- a/lib/conncache.c
+++ b/lib/conncache.c
@@ -449,10 +449,11 @@ bool Curl_conncache_return_conn(struct connectdata *conn)
       (void)Curl_disconnect(conn_candidate, /* dead_connection */ FALSE);
     }
   }
   CONN_LOCK(data);
   conn->inuse = FALSE; /* Mark the connection unused */
+  conn->data = NULL; /* no owner */
   CONN_UNLOCK(data);
 
   return (conn_candidate == conn) ? FALSE : TRUE;
 
 }
@dtzWill

This comment has been minimized.

Show comment
Hide comment
@dtzWill

dtzWill Jun 20, 2018

Contributor

After some testing, I tentatively report the first patch (didn't test with/without second) seems to fix the issue!

Sorry for this being a PR and not an issue as seems more appropriate-- please close this when committing your suggested patch.

Contributor

dtzWill commented Jun 20, 2018

After some testing, I tentatively report the first patch (didn't test with/without second) seems to fix the issue!

Sorry for this being a PR and not an issue as seems more appropriate-- please close this when committing your suggested patch.

@bagder

This comment has been minimized.

Show comment
Hide comment
@bagder

bagder Jun 20, 2018

Member

That's just awesome! Thanks for confirming. I'll make a PR with my patches, just to make sure the CI doesn't find anything bad there and then merge.

Member

bagder commented Jun 20, 2018

That's just awesome! Thanks for confirming. I'll make a PR with my patches, just to make sure the CI doesn't find anything bad there and then merge.

bagder added a commit that referenced this pull request Jun 20, 2018

url: use a fresh easy handle with connections from the cache
... and make sure to NULLify the ->data pointer when the connection is put
into the cache to make this mistake easier to detect in the future.

Reported-by: Will Dietz
Closes #2669

bagder added a commit that referenced this pull request Jun 21, 2018

url: use a fresh easy handle with connections from the cache
... and make sure to NULLify the ->data pointer when the connection is put
into the cache to make this mistake easier to detect in the future.

Reported-by: Will Dietz
Fixes #2669
Closes #2672

@bagder bagder closed this in 2c15693 Jun 21, 2018

@dtzWill dtzWill deleted the dtzWill:experimental/dont-use-invalid-easy-handles-referenced-from-conn-cache branch Jun 22, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment