
Crashes in DoH requests (consistent, unknown cause) #6604

Closed
arvids-kokins-bidstack opened this issue Feb 14, 2021 · 17 comments

@arvids-kokins-bidstack

arvids-kokins-bidstack commented Feb 14, 2021

OS: Windows (probably doesn't matter though)
libcurl version: 7.65.2-DEV (based on 40259ca with minor changes)

call stack:

  • Curl_strncasecompare(const char * first, const char * second, unsigned __int64 max) Line 139
    • while(*first && *second && max) { -- first is an invalid value
  • Curl_checkheaders(const connectdata * conn, const char * thisheader) Line 101
    • the data->set.headers list head seems to point to wrong memory (dangling pointer?) but often the memory is still usable enough to crash a bit deeper
    • conn->ip_addr_str = "1.1.1.1"
    • conn->connection_id = 18 (from one of the dumps, some are higher) so it's not a guaranteed failure
  • Curl_http(connectdata * conn, bool * done) Line 2082
    • path = "/dns-query"
    • host = "1.1.1.1"
  • multi_do(Curl_easy *) Line 1214
  • multi_runsingle(Curl_multi * multi, curltime now, Curl_easy * data) Line 1585
  • curl_multi_perform(Curl_multi * multi, int * running_handles) Line 2094

sadly the minidump does not include the entire connectdata or Curl_easy data structures (they seem to be too big), but feel free to ask for any data early in the structures; I may be able to provide some
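
For context, DoH is enabled on our side roughly like this (a simplified sketch, not our exact code):

#include <curl/curl.h>

/* Minimal sketch: an easy handle with DoH enabled. When a host name needs
 * resolving, libcurl runs internal DoH sub-requests that POST to
 * https://1.1.1.1/dns-query, which is where the stack above crashes. */
static CURL *make_request_handle(const char *url)
{
  CURL *h = curl_easy_init();
  if(h) {
    curl_easy_setopt(h, CURLOPT_URL, url);
    /* the DoH resolver; 1.1.1.1 matches the addresses seen in the dumps */
    curl_easy_setopt(h, CURLOPT_DOH_URL, "https://1.1.1.1/dns-query");
  }
  return h;
}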

I've also noticed one possibly related issue that is marked as a known bug: #4592

2 questions:

  • is there a path in the state machine which would restart/reuse the DoH request after it calls doh_done?
    • I've tried using a proxy (set using env.var.) but ran into some other bugs which were fixed according to changelog
    • also tried redirects but those didn't seem to help either
  • why are DoH request headers freed in doh_done separately, before freeing the request handles? this seems to be true even in the latest version: https://github.com/curl/curl/blob/master/lib/doh.c#L200

this seems to be a relatively rare issue but I'd still like to at least understand what's causing it since it's difficult to reproduce

@jay added the crash and name lookup (DNS and related tech) labels on Feb 14, 2021
@jay
Member

jay commented Feb 14, 2021

  • is there a path in the state machine which would restart/reuse the DoH request after it calls doh_done

AFAICT no. If you are threading please read thread safety. Can you make a minimal self-contained program that can be used to reproduce?

@arvids-kokins-bidstack
Author

If you are threading please read thread safety.

Yes, we're threading and I have read that page.

  • You can pass the handles around among threads, but you must never use a single handle from more than one thread at any given time. ✔️
  • curl_global_* is used but only in the very beginning/end
  • not using curl_share_*

One thread can make requests (configuring the easy handle and passing it off to another thread).
Another thread can perform them using the multi interface, then detach them, read the result and pass them back.
Handle and all associated data are passed between threads together.
Also, all the handles are reused and there's a limit on max. connections for the multi interface but it's not being hit in this case.
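
A heavily simplified, hypothetical sketch of that hand-off (pthreads used purely for illustration; the real code has proper queuing, a stop condition, handle reuse and error handling):

#include <curl/curl.h>
#include <pthread.h>
#include <stddef.h>

/* One slot guarded by a mutex: the request thread parks a configured easy
 * handle in "pending", the transfer thread adds it to its multi handle,
 * performs it and parks it back in "finished" when done. Only one thread
 * touches a given easy handle at any time. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static CURL *pending;   /* configured, waiting to be performed */
static CURL *finished;  /* performed, waiting to be read and reused */

static void submit(CURL *h)              /* called by the request thread */
{
  pthread_mutex_lock(&lock);
  pending = h;
  pthread_mutex_unlock(&lock);
}

static void *transfer_thread(void *arg)  /* owns the multi handle */
{
  CURLM *multi = curl_multi_init();
  for(;;) {
    CURLMsg *msg;
    CURL *h;
    int running, queued;

    pthread_mutex_lock(&lock);
    h = pending;
    pending = NULL;
    pthread_mutex_unlock(&lock);
    if(h)
      curl_multi_add_handle(multi, h);

    curl_multi_perform(multi, &running);
    curl_multi_wait(multi, NULL, 0, 100, NULL);

    /* detach completed transfers and hand them back */
    while((msg = curl_multi_info_read(multi, &queued))) {
      if(msg->msg == CURLMSG_DONE) {
        CURL *done = msg->easy_handle;
        curl_multi_remove_handle(multi, done);
        pthread_mutex_lock(&lock);
        finished = done;
        pthread_mutex_unlock(&lock);
      }
    }
  }
  curl_multi_cleanup(multi);             /* never reached in this sketch */
  return arg;
}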

That said, with threading issues I'd have expected the crash to be more random, but it's always here (Curl_checkheaders(conn, "User-Agent")).

Can you make a minimal self-contained program that can be used to reproduce?

Unfortunately not at the moment, as I haven't seen the crash in person; we only receive reports about it.

We're also not yet in a position to tell if it's something computer-specific or is simply rare (might get that data in a week or two) but it does seem to happen far more frequently with corporate network setups.

Are there any settings other than proxy which might be controlled by the environment/network instead of the code that uses libcurl?
In the past we've had issues with configurations where 1.1.1.1 was intercepted by the company's internal network, for example, but we didn't look too much into those at the time since they figured it out quickly.

@bagder
Member

bagder commented Feb 14, 2021

We've fixed several issues in DoH (and elsewhere) since 40259ca. It seems a worthwhile investment to first bump up the version to the latest libcurl before continuing the search.

is there a path in the state machine which would restart/reuse the DoH request after it calls doh_done?

No. The easy handles used for DoH requests are never reused and those transfers are never restarted.

why are DoH request headers freed in doh_done separately, before freeing the request handles

Why not? When there's no more pending DoH request they're not needed anymore. If you think you have a better way to do it, then by all means provide a PR! We can always improve the code.

@arvids-kokins-bidstack
Author

It seems a worthwhile investment to first bump up the version to the latest libcurl before continuing the search.

I'd like that too but we're not in control of the release/update schedules. We'll try to update on our end sometime soon but it may take months to see the results.

Why not? When there's no more pending DoH request they're not needed anymore.

Sure, it just seems easier to reason about the code if all the resources were freed together, seeing as they were allocated and otherwise used together. Currently it looks like the headers are freed in one state, by whichever of the two requests finishes last, while the handles are freed in a different state of a different request, and to someone like me in this situation it's not immediately obvious whether this works in all cases.

If you think you have a better way to do it, then by all means provide a PR! We can always improve the code.

Thanks, I might look into that once there's time to update.

@bagder
Member

bagder commented Feb 15, 2021

I would urge you to work on creating a way to force the problem to trigger, probably with a stand-alone separate app that maybe mimics what your real-life application does. Then you can build that with the latest libcurl, and once you manage to show the crash, you can share the code with us.

A reported crash that nobody can reproduce with a libcurl version with several DoH fixes missing is hard to act on.

@arvids-kokins-bidstack
Author

Sorry, yeah, I know it's hard without the data and I don't expect much here. Just wanted to see if you had any ideas on what could potentially be the issue (or if it was already fixed), and to ask about the headers.

I would urge you to work on creating a way to force the problem to trigger, probably with a stand-alone separate app that maybe mimics what your real-life application does.

I've been doing that and will continue, but do you have any suggestions on e.g. fuzzers and their configurations that could be used for this purpose?
Since the header-freeing theory has not given any results, the next best theory is that there's a rare double free happening on the same address (though it's unclear why it only happens on some devices). I don't see a practical way to guess which branches must be taken in what order to trigger the issue, hence approaching this analytically.

@bagder
Member

bagder commented Feb 15, 2021

do you have any suggestions on e.g. fuzzers and their configurations that could be used for this purpose?

I think valgrind and the memory/address sanitizers are the most useful tools for this. Fuzzing this area of the code more would be awesome too, and was done to trigger #4592 if I remember correctly, but I don't have anything prepared or set up to reproduce that other than what is already mentioned in that issue.

@jay
Member

jay commented Feb 15, 2021

In general we've occasionally seen weird arbitrary crashes in libcurl when users don't do thread synchronization properly. Though your crash has some consistency, I would still double check that. Some kind of synchronization primitive is needed (e.g. a critical section); volatile doesn't cut it and is too often misused.

@arvids-kokins-bidstack
Author

Thanks, will try fuzzing once I have time to boot into Linux.

I would still double check that.

Yep, doing that regularly.

Some kind of synchronization primitive is needed (e.g. a critical section); volatile doesn't cut it and is too often misused.

It's a good thing that we don't use volatile for multithreading then. 😄 Mutexes/condition variables are used to transfer request objects.


I've started looking into a new theory (non-safe free > allocating headers in the same spot > dangling pointer freed by other code > crash), dumping all the frees that later turn into headers.
I figure the frees have to be quite close to guarantee this degree of consistency, and at least one of the frees should always be present (though probably a free(NULL) most of the time). Also, after stress-testing with 100'000+ requests, while it's no guarantee of anything, I'm starting to think that it's more likely to be an edge case than a threading/ordering/timing issue.
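
One way to dump the frees is through libcurl's memory callbacks; a rough sketch (not our exact instrumentation; only malloc/free/strdup are shown and the logging is simplified):

#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Log every allocation and free so that an address which is later handed
 * out again for a header allocation can be matched against its earlier
 * free in the log. */
static void *logging_malloc(size_t size)
{
  void *p = malloc(size);
  fprintf(stderr, "malloc %zu -> %p\n", size, p);
  return p;
}

static void logging_free(void *ptr)
{
  fprintf(stderr, "free %p\n", ptr);
  free(ptr);
}

static char *logging_strdup(const char *str)
{
  char *p = strdup(str);  /* header strings typically come through here */
  fprintf(stderr, "strdup -> %p\n", (void *)p);
  return p;
}

int main(void)
{
  /* the hooks must be installed before any other libcurl call */
  curl_global_init_mem(CURL_GLOBAL_DEFAULT,
                       logging_malloc, logging_free, realloc,
                       logging_strdup, calloc);
  /* ... application code using libcurl goes here ... */
  curl_global_cleanup();
  return 0;
}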

P.S. Things I've found so far:

jay added a commit to jay/curl that referenced this issue Feb 16, 2021
Prior to this change if the user specified a default protocol and a
separately allocated non-absolute URL was used then it was freed
prematurely, before it was then used to make the replacement URL.

Bug: curl#6604 (comment)
Reported-by: arvids-kokins-bidstack@users.noreply.github.com

Closes #xxxx
@jay
Member

jay commented Feb 16, 2021

#6613 thanks

jay added a commit that referenced this issue Feb 17, 2021
Prior to this change if the user specified a default protocol and a
separately allocated non-absolute URL was used then it was freed
prematurely, before it was then used to make the replacement URL.

Bug: #6604 (comment)
Reported-by: arvids-kokins-bidstack@users.noreply.github.com

Closes #6613
@arvids-kokins-bidstack
Author

arvids-kokins-bidstack commented Mar 8, 2021

haven't had a lot of time to look into this lately, but there is good news and bad news:

there was an increase in crashes from the increased user count, but a relatively small one, so very few people are having this issue, which also makes it extremely difficult to debug

the percentage of users affected by it is about as much as one would expect from random network failures
each affected user seems to experience the crash just once

in addition, the same issue seems to be present on Macs too, though our crash dumps currently are a bit too imprecise to confirm this with 100% accuracy (same thread/error/frequency)

another curious observation is that I'm not getting any reports from development builds anymore (I asked them to try disabling DoH but no idea yet if they actually did)

@bagder
Member

bagder commented Mar 26, 2021

@arvids-kokins-bidstack do you think there's actual value in keeping this issue open further? I don't see how we can do anything more from our end...

@arvids-kokins-bidstack
Author

@bagder I was about to write, actually; some new information has shown up

basically I'd like to pick your brain on another matter and then feel free to close the issue afterwards

so lately the crashes have increased, and the only events we've correlated them with were:

  • request count (somehow we managed to start sending several requests in short bursts; with DOH at default settings, as I understand it, that meant 15 new connections for the first group of 5 of those requests - 5 IPv4 DOH, 5 IPv6 DOH and 5 to the requested servers)
    • this will of course be fixed (roughly as sketched after this list), but there seems to be some underlying issue that may reappear either way
  • network pressure - as it got lower, presumably the request times became shorter and the crash count increased as well
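
For the burst itself, one obvious mitigation on our side is to cap connections on the multi handle and spread the requests out; a hypothetical sketch (the option values are for illustration only, not what we'll actually ship):

#include <curl/curl.h>

/* Cap how many connections the multi handle may open, in total and per
 * host, so that a burst of requests cannot open 15 connections at once. */
static CURLM *make_capped_multi(void)
{
  CURLM *multi = curl_multi_init();
  if(multi) {
    curl_multi_setopt(multi, CURLMOPT_MAX_TOTAL_CONNECTIONS, 6L);
    curl_multi_setopt(multi, CURLMOPT_MAX_HOST_CONNECTIONS, 2L);
  }
  return multi;
}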

so the question currently on my mind is - do you know of any firewall software that could in theory:

  • look at 10 near-simultaneous DOH connections (which may happen in various places in Europe/US at around the same time if it changes anything)
  • and think that's an attack of some sort...
  • then tamper with the socket or force an error code/reach some unexpected state...
  • which then would result in cURL entering a very rarely utilized branch and leaving a dangling pointer...
  • which subsequently results in the chain of events I described above?

and also:

  • any experience with firewalls resulting in unusual behavior/crashes?
  • is there some kind of code coverage data for 7.65.2-DEV (or a close enough version) somewhere? it could help narrow down the focus of a code review

@bagder
Member

bagder commented Mar 26, 2021

know of any firewall software that could

No "firewall software" can do that.

7.65.2-DEV

I would advise you not to stick to this old version when you have issues. We've fixed DoH issues since, and other things. At least make sure that the latest version also has this problem, as otherwise you've been hunting this in vain.

experience with firewalls resulting in unusual behavior/crashes?

I think you should scrap that idea. This is just a bug.

is there some kind of code coverage data

No, but all the tests are available so you can generate that data yourself. Code-reviewing to find this is of course a rather long shot, though; I would focus my efforts on making a recipe for reproducing it.

@arvids-kokins-bidstack
Author

No "firewall software" can do that.

Which part of the process seems most unlikely to you?
Similar things have happened but typically I'd assume that the failure case was simply mishandled:

https://steamcommunity.com/app/455820/discussions/1/1696048879951565073
https://community.mcafee.com/t5/Consumer-General-Discussions/McAfee-interferes-with-games/td-p/614598

Also we've had experience with firewalls not letting unusual ports through in a strange way - the first few packets do go through but then they stop coming and the connection isn't closed either. Some can even be configured to inspect HTTPS traffic.

That said, I've already tried reproducing by toggling Windows firewall rules and using clumsy and those were not sufficient.

I would advise you not to stick to this old version when you have issues.

Will upgrade eventually but if we don't know what causes the issue, there's really no guarantee that it would be solved by an upgrade. Not a fan of blindly doing things and praying that they work.

At least make sure that the latest version also has this problem

That would mean at least a month or two of doing nothing and waiting for upgrades to trickle down, hoping that they don't create new issues, all the while invalidating any existing data we may have on this issue, which adds another month of subsequent data collection.

This is just a bug.
I would focus my efforts on making a recipe for reproducing it.

It's a bug that normally does not appear. Most people (including me) are completely unaffected, but there are a few who have been getting it repeatedly with relatively few attempts.

Reproducing the bug has been the focus already. Various differentiators have already been ruled out but until we find the right one, I don't see how there's any chance of reproducing it.

@bagder
Member

bagder commented Apr 14, 2021

I'll ask again: is there a purpose to keep this open? Without more details or a way to reproduce we have nothing to act on here...

@arvids-kokins-bidstack
Author

I'll ask again: is there a purpose to keep this open? Without more details or a way to reproduce we have nothing to act on here...

I did say that you can feel free to close it but I guess I can do it too 😄

p.s. updating to 7.76.0 and turning off DOH for the upcoming release, so there probably won't be much I can add anyway

thanks for bearing with me, and if we do have something more concrete to share (like a bug reproduction example), I'll let you know
