HTTP pipelining and reusing easy handles causes mismatched request/response #1380

Closed
ahatala opened this Issue Apr 3, 2017 · 7 comments

ahatala commented Apr 3, 2017

I've narrowed down an issue in our use of libcurl to this repro case. Getting the repro to trigger seems to be timing-dependent, so it might take a couple of runs to hit.

This is CURL verbose output for a failed test run: fail.txt

The user-visible problem is that the response data for a simple HTTP request occasionally gets associated with another HTTP request in flight (to the same host) when multiple requests are pipelined over the same connection. curl's pipelining maintains a queue of easy handles to feed the results to, and under some conditions it seems that the queue ends up in the wrong order. A necessary condition for triggering the error seems to be that there are requests both in sending and receiving states; the repro case accomplishes this by pushing in new requests in response to reading responses.
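
For reference, the driver loop described above looks roughly like the sketch below. This is not the reporter's actual repro: the URL and request counts are placeholders, fresh easy handles are created rather than reused, and error checking is omitted. It only exercises the same pattern of pipelining requests over one connection and adding a new request each time a response completes.

/* Minimal sketch only, not the reporter's repro.  Response bodies go to
 * stdout by default; error checking omitted for brevity. */
#include <curl/curl.h>

static void add_request(CURLM *multi, const char *url)
{
  CURL *easy = curl_easy_init();
  curl_easy_setopt(easy, CURLOPT_URL, url);
  curl_easy_setopt(easy, CURLOPT_PIPEWAIT, 1L);      /* prefer pipelining over opening a new connection */
  curl_multi_add_handle(multi, easy);
}

int main(void)
{
  const char *url = "http://example.com/";           /* placeholder */
  int running = 0, outstanding = 0, issued = 0;
  CURLM *multi;

  curl_global_init(CURL_GLOBAL_DEFAULT);
  multi = curl_multi_init();
  curl_multi_setopt(multi, CURLMOPT_PIPELINING, 1L); /* enable HTTP/1.1 pipelining */

  while(issued < 4) {                                /* seed a few requests */
    add_request(multi, url);
    issued++;
    outstanding++;
  }

  do {
    CURLMsg *msg;
    int queued, numfds;
    curl_multi_perform(multi, &running);
    curl_multi_wait(multi, NULL, 0, 100, &numfds);
    while((msg = curl_multi_info_read(multi, &queued))) {
      if(msg->msg == CURLMSG_DONE) {
        curl_multi_remove_handle(multi, msg->easy_handle);
        curl_easy_cleanup(msg->easy_handle);
        outstanding--;
        if(issued < 64) {                            /* push in a new request in
                                                        response to a completed one */
          add_request(multi, url);
          issued++;
          outstanding++;
        }
      }
    }
  } while(outstanding);

  curl_multi_cleanup(multi);
  curl_global_cleanup();
  return 0;
}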

I'm using the current GitHub master (commit 55f4aba), but the issue triggers with stable versions as well (I tried 7_39_0 and 7_53_1). The issue seems platform-independent; FWIW, I'm running on Win10 with the "LIB Debug - DLL Windows SSPI" configuration.

I can get the issue to repro against a number of different HTTP servers and I've verified that there's nothing suspicious in the data returned by the server on the wire.

I'm happy to provide any further data that may be useful for tracking down the root cause.

ahatala commented Apr 3, 2017

I spent some time root-causing this, and I believe I know what is happening.

The culprit seems to be USE_RECV_BEFORE_SEND_WORKAROUND. Received data may be buffered in connectdata.postponed, and it might get silently discarded on connection reuse in the call to conn_reset_all_postponed_data() at url.c:6232:

conn_reset_all_postponed_data(conn); /* reset unprocessed data */

This causes the next read to return the response of a more recent request. I suppose with larger payloads the buffered data might not be limited to full response boundaries either, so garbage would be read in place of valid HTTP headers.

Since this requires USE_RECV_BEFORE_SEND_WORKAROUND, my original statement about the issue being platform-independent is clearly wrong; the issue is Windows-specific.

I don't understand the motivation behind discarding buffered data upon connection reuse, but just commenting out that offending line does seem to resolve the issue.
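
To make the failure mode concrete, here is a self-contained sketch of what discarding the postponed buffer does to the byte stream. This is not curl's code: fake_conn and its fields are made up, standing in for connectdata and the recv-before-send workaround buffer.

/* NOT curl's code.  Response #1 was received early and postponed; resetting
 * the buffer on connection reuse makes the next read start later in the byte
 * stream, so the reader for request #1 gets response #2 (or, with larger
 * payloads, bytes from the middle of a response instead of valid headers). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct fake_conn {
  char *postponed;        /* bytes received early, not yet handed to the reader */
  size_t postponed_len;
  const char *wire;       /* what would be read from the socket next */
};

static size_t fake_read(struct fake_conn *c, char *buf, size_t len)
{
  size_t n;
  if(c->postponed_len) {  /* drain postponed bytes first, as the workaround intends */
    n = c->postponed_len < len ? c->postponed_len : len;
    memcpy(buf, c->postponed, n);
    memmove(c->postponed, c->postponed + n, c->postponed_len - n);
    c->postponed_len -= n;
    return n;
  }
  n = strlen(c->wire) < len ? strlen(c->wire) : len;  /* otherwise "read from the network" */
  memcpy(buf, c->wire, n);
  return n;
}

int main(void)
{
  struct fake_conn c;
  char buf[256];
  size_t n;
  const char *resp1 = "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok";

  c.postponed = malloc(strlen(resp1));               /* response #1, postponed (checks omitted) */
  c.postponed_len = strlen(resp1);
  memcpy(c.postponed, resp1, c.postponed_len);
  c.wire = "HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\n\r\n"; /* response #2 */

  /* what the reset on connection reuse effectively does;
   * without it, fake_read() would return the postponed 200 for request #1 */
  free(c.postponed);
  c.postponed = NULL;
  c.postponed_len = 0;

  n = fake_read(&c, buf, sizeof(buf) - 1);           /* read for request #1 */
  buf[n] = '\0';
  printf("request #1 got:\n%s\n", buf);              /* prints the 404 meant for request #2 */
  return 0;
}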

jay added the HTTP label Apr 3, 2017

jay commented Apr 3, 2017

Yeah, that looks strange; it probably needs a move/copy_all_postponed_data instead of freeing it.

> I suppose with larger payloads the buffered data might not be limited to full response boundaries either, so garbage would be read in place of valid HTTP headers.

Can you explain how that could happen?

bagder commented Apr 3, 2017

That code section comes from commit 72d5e14, issue #657 and PR #668 ... Maybe we just need to add a condition to not do that if pipelining is used?
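
For illustration only, the guard being suggested would look roughly like the fragment below; pipelining_in_use() is a hypothetical stand-in for whatever check curl would actually use, not a real libcurl function.

if(!pipelining_in_use(conn)) /* hypothetical pipelining check */
  conn_reset_all_postponed_data(conn); /* reset unprocessed data */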

jay commented Apr 3, 2017

I think it would be better to just keep it regardless, so it doesn't come back to bite us some other way. I don't understand why it's being freed there.

jay added a commit to jay/curl that referenced this issue Apr 4, 2017

url: fix conn reuse when postponed data. draft1
- Don't free the postponed struct on conn reuse; transfer it to the
  replacement conn instead.

- Encapsulate postponed struct initialization.

Bug: curl#1380
Reported-by: Antti Hätälä

ahatala commented Apr 4, 2017

@jay I can test later, but I don't think that patch is going to work. The original problem is that there is postponed data in "conn" that gets reset. There is no postponed data in "old_conn", and how could there be, since that connection was just created? Your change overwrites existing postponed data with no postponed data, leaking the buffer(s) in doing so. Right?
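
The leak being described can be shown with a tiny stand-alone example; struct postponed and its fields are hypothetical stand-ins, not curl's connectdata bookkeeping.

/* Hypothetical stand-in for the postponed bookkeeping, not curl's struct. */
#include <stdlib.h>

struct postponed {
  char *buf;              /* owned heap buffer */
  size_t len;
};

int main(void)
{
  struct postponed conn_pp = { malloc(128), 128 };  /* reused conn: holds buffered data */
  struct postponed old_conn_pp = { NULL, 0 };       /* just-created conn: empty */

  /* Overwriting the populated side with the empty side drops the only
   * pointer to the 128-byte buffer: the buffered response bytes are gone
   * and the memory is leaked. */
  conn_pp = old_conn_pp;

  (void)conn_pp;
  return 0;
}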

jay closed this in 36e604f Apr 4, 2017

jay commented Apr 4, 2017

> Your change overwrites existing postponed data with no postponed data, leaking the buffer(s) in doing so. Right?

Yes, I had it backwards: old_conn is a pointer to the temporary connection struct created after conn, and conn holds the connection, so there's no reason to transfer. Your fix is fine then. Thanks for the great analysis and repro case.

lock bot locked as resolved and limited conversation to collaborators May 6, 2018
