Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: use interruptible async getaddrinfo wrapper from libevent for DNS #27505

Closed
wants to merge 2 commits into from

Conversation

pinheadmz
Copy link
Member

Closes #16778

Bitcoin uses getaddrinfo to make DNS requests for DNS seed servers and for adding peers with -addnode, -seednode and -connect. Depending on the platform this can be clunky and a system issue could prevent name resolution from completing at all, blocking the thread and in some cases preventing a clean shutdown.

An attempt was made to switch to the asynchronous getaddrinfo_a in #4421 but that was reverted in #9229 after discovering that function has a segfault!

Taking BlueMatt's suggestion in #10215 (comment), this PR modifies our g_dns_lookup function to use evdns_getaddrinfo() from libevent. This is an asynchronous function but I've implemented it in a polling loop so it still blocks -- but now will timeout after 2 seconds.

TODO:

  • We could make the polling loop interruptible but I'm not sure the best approach to that, since these functions don't live in a class with a flag like interruptNet
  • Is it possible to add functional tests that mess with DNS? Probably not.
  • The unit test I added makes a live DNS request for nic.com that test will fail the platform has no DNS

Future work:

  • libevent has more low-level DNS functions as well, we could ultimately use those to (for example) request TXT records with onion addresses from our DNS seeders

@DrahtBot
Copy link
Contributor

DrahtBot commented Apr 20, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #27530 (Remove now-unnecessary poll, fcntl includes from net(base).cpp by Empact)
  • #27405 (util: Use steady clock instead of system clock to measure durations by MarcoFalke)
  • #27375 (net: support unix domain sockets for -proxy and -onion by pinheadmz)
  • #27071 (Handle CJDNS from LookupSubNet() by vasild)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@dergoegge
Copy link
Member

Afaik we currently do not expose libevent on any public facing interface (e.g. p2p) and only use it for things that aren't supposed to be exposed (e.g. rpc or rest). By using it for DNS queries we would be changing that (e.g. a malicious DNS seeder) and I'm not sure if that is the best idea given that libevent is pretty archaic (I think the MSan, ASan, TSan and LSan failures in the CI are kind of proving my point).

@pinheadmz
Copy link
Member Author

pinheadmz commented Apr 21, 2023

@dergoegge I think at least some of the CI failures are memory leaks from my code, I'm going to fix that. But I hear your point about the arcane library. Any suggestions?

@dergoegge
Copy link
Member

Any suggestions?

The issue seems somewhat stale so maybe no need to fix? Judging by your last comment, you were also not able to reproduce? (#16778 (comment))

@pinheadmz
Copy link
Member Author

I was able to somewhat reproduce -- by sending all DNS requests on my machine to a blackhole resolver, I observed a 30 second pause during shutdown. But I think that 30 seconds may be platform-specific, or whatever the local getaddrinfo() implementation is.


thread.join();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this also wait indefinitely for getaddrinfo to finish before the thread can join?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 sec timeout waiting for the callback and then the loop will close. With my black hole dns resolver, the patch saves 30 seconds against master on the unit test in the second commit.

It took me some work to figure out that using ..._once() to start the event loop would allow me to quit the loop manually which allows join() to succeed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(libevent does not call the platform getaddrinfo for dns requests, it has its own async methods)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libevent does not call the platform getaddrinfo for dns requests, it has its own async methods

That explains why it doesn't hang, thanks!

But it does seem like it uses getaddrinfo in some cases: https://github.com/libevent/libevent/blob/75208132d5b7a8fff59ca3bf47253179ec314951/evutil.c#L1686

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think thats just to decode numeric addresses like "100.200.30.40" etc. It looked to me like if an actual DNS request is necessary it has its own methods to make reqests directly to the nameservers provided by the platform? When things get "serious":

https://github.com/libevent/libevent/blob/master/evdns.c#L5623

@fanquake
Copy link
Member

fanquake commented May 2, 2023

In general, I'm ~0 on leaning into further libevent usage, especially for something like this. The upstream code is currently not necessarily very well maintained or tested. There is also at least one open issue which reports evdns_getaddrinfo just "crashing occasionally": libevent/libevent#1130. Although it's not entirely clear if this is user error.

@dergoegge
Copy link
Member

As an alternative to this PR, could we give the DNS thread time to join cleanly (e.g. 5s) and if it doesn't we just detach it and let the OS handle the clean up? That risks memory leaks but that shouldn't really matter when the program is about to exit anyway.

I would also be fine with not addressing this at all, because it seems like this only happens when a system generally fails to make DNS requests? In the absence of nice solutions, it seems like that isn't our problem and the user should fix their system instead.

@pinheadmz
Copy link
Member Author

@fanquake @dergoegge good feedback, thanks. I'll look into a different approach moving the lib call getaddrinfo to a detachable thread before abandoning, and then we may just have to close #16778 as "won't fix"

There is another thought I had about name resolution moving forward: If bitcoin can make more interesting DNS queries (either using a library or just implementing some bare necessities) we could ask the DNS seeders for I2P and onion addresses as TXT records

@fanquake
Copy link
Member

fanquake commented May 2, 2023

either using a library

It's very unlikely we are going to add a new external dependency to make "interesting" DNS queries.

@pinheadmz
Copy link
Member Author

Closing this now for alternative in #27557 which just calls getaddrinfo() from a detachable thread. No libevent at all and, yeeeeeesh it is way simpler.

@pinheadmz pinheadmz closed this May 2, 2023
@bitcoin bitcoin locked and limited conversation to collaborators May 1, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ThreadDNSAddressSeed hangs on sk_wait_data and doesn't stop on exit
4 participants