avahi-browse bug with large LAN --terminate and --cache randomly never terminates #264

Open
bwfisher82 opened this issue Feb 19, 2020 · 6 comments

@bwfisher82

Hello,

Reporting an issue discussed in #avahi on freenode with lathiat:

I am experiencing an issue where, on a large network, the avahi-browse command intermittently never terminates when it should.

I have a network with ~300 publishing devices. With about 100 devices it is fine, at around 150 we start noticing the issue, and with 300+ it is quite noticeable and easily reproducible, although it doesn't always happen.

My automation software regularly runs avahi-browse to pull detected node information, then connects to the devices and performs various operations. As a work-around I currently have it time out after a few seconds, but with this many devices (this much mDNS traffic?) the hang happens often enough that the web UI for the software becomes noticeably slow while it waits for detection attempts to time out and retry.

The detection software I wrote is fairly basic: it calls avahi-browse from Python with a timeout, and it runs on a CentOS 7 server. The server is always up. Regarding the other bugs related to time sync, it uses chronyd with the CentOS Internet time sources, so I think a clock problem is highly unlikely.

The command I am using is: avahi-browse -ltrp ._._tcp

This times out (my timeout is 15s) about 3% of the time. If I exchange the -t option (terminate) for --cache it does the same thing. I suspect the -r (resolve) step is where the problem lies.
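For readers trying to reproduce the work-around described above, here is a minimal sketch of calling avahi-browse from Python with a timeout and keeping whatever was printed before the timeout hit. It is not the reporter's actual code. The function names `parse_resolved` and `browse` are made up for illustration, and the field layout of the `-p` output (event, interface, protocol, name, type, domain, then hostname, address, port and TXT data on `=` lines) is an assumption based on how the parsable output is commonly described.

```python
import subprocess

# Assumed order of the fields that follow the "=" marker on a resolved line
# of `avahi-browse -p` output. Treat this layout as an assumption.
RESOLVED_FIELDS = ("interface", "protocol", "name", "type", "domain",
                   "hostname", "address", "port", "txt")


def parse_resolved(output):
    """Return one dict per resolved ('=') line found in the given text."""
    records = []
    for line in output.splitlines():
        # Split into at most 10 parts so the TXT field is kept in one piece.
        parts = line.split(";", 9)
        if len(parts) < 10 or parts[0] != "=":
            continue  # skip '+' / '-' events and anything malformed
        records.append(dict(zip(RESOLVED_FIELDS, parts[1:])))
    return records


def browse(cmd, timeout=15.0):
    """Run cmd and return (records, timed_out).

    On timeout the child is killed and the output captured so far is parsed,
    so a hung run still yields the services that were already resolved.
    """
    try:
        done = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.DEVNULL, timeout=timeout)
        raw, timed_out = done.stdout, False
    except subprocess.TimeoutExpired as exc:
        raw, timed_out = exc.stdout or b"", True
    return parse_resolved(raw.decode("utf-8", errors="replace")), timed_out
```

A call would look like `records, timed_out = browse(["avahi-browse", "-ltrp", service_type])`, where `service_type` is a placeholder for whatever type is being browsed. Logging `timed_out` over many runs is one way to measure how often the hang occurs.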

When I do have the issue, I see output where it just continually re-resolves things that it has already displayed as if -t was not used.
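One way to confirm this symptom in a captured log is to count how many times each service shows up as resolved. The sketch below is only an illustration: `repeated_resolutions` is a hypothetical helper, and it assumes the `-p` output puts interface, protocol, name, type and domain in the five fields after the `=` marker.

```python
from collections import Counter


def repeated_resolutions(output):
    """Map (interface, protocol, name, type, domain) to the number of times
    it appears on a resolved ('=') line, keeping only keys seen more than once."""
    counts = Counter()
    for line in output.splitlines():
        parts = line.split(";")
        if len(parts) >= 6 and parts[0] == "=":
            counts[tuple(parts[1:6])] += 1
    return {key: n for key, n in counts.items() if n > 1}
```

An empty result for a run that terminated normally and a non-empty one for a run that hung would match the behaviour described here.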

This is on a large, fully 10GigE network, if that matters. The server has 16 cores and 32 GiB of memory, if that matters. Probably not.

The avahi-daemon is started with -s and --debug currently.

The config file:

[server]
use-ipv4=no
use-ipv6=yes
allow-interfaces=ens256
deny-interfaces=ens192,ens224
enable-dbus=yes
disallow-other-stacks=yes
objects-per-client-max=2048
ratelimit-interval-usec=1000000
ratelimit-burst=1000
cache-entries-max=2048

[wide-area]
enable-wide-area=no

[publish]

[reflector]

[rlimits]
rlimit-core=0
rlimit-data=4194304
rlimit-fsize=0
rlimit-nofile=768
rlimit-stack=4194304
rlimit-nproc=3

@monsrud

monsrud commented Mar 22, 2023

@bwfisher82 - did you ever find a way to better tune avahi for a large network?

@bwfisher82
Author

bwfisher82 commented Mar 22, 2023 via email

@monsrud

monsrud commented Mar 22, 2023 via email

@evverx
Member

evverx commented Mar 25, 2024

I reproduced this issue by announcing a service pointing to an unresolvable host name and then sending a goodbye packet before the resolver timed out. Could anyone apply the following patch to see if it helps:

diff --git a/avahi-utils/avahi-browse.c b/avahi-utils/avahi-browse.c
index 4028ca0..f7542ff 100644
--- a/avahi-utils/avahi-browse.c
+++ b/avahi-utils/avahi-browse.c
@@ -284,8 +284,10 @@ static void remove_service(Config *c, ServiceInfo *i) {
 
     AVAHI_LLIST_REMOVE(ServiceInfo, info, services, i);
 
-    if (i->resolver)
+    if (i->resolver) {
         avahi_service_resolver_free(i->resolver);
+        n_resolving--;
+    }
 
     avahi_free(i->name);
     avahi_free(i->type);
@@ -331,6 +333,7 @@ static void service_browser_callback(
                 return;
 
             remove_service(c, info);
+            check_terminate(c);
 
             print_service_line(c, '-', interface, protocol, name, type, domain, 1);
             break;

?
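To make the reasoning behind the patch easier to follow, here is a toy model of the counter logic in Python. It is not the avahi-browse source. It only assumes, based on the diff above, that the tool keeps a count of resolvers in flight (`n_resolving`) and quits once the browser has reported everything it has for now and that count reaches zero. The class and method names are invented for the illustration.

```python
class BrowseModel:
    """Simplified stand-in for the termination bookkeeping in avahi-browse."""

    def __init__(self, patched):
        self.patched = patched
        self.n_resolving = 0          # resolvers still in flight
        self.all_for_now_pending = 1  # browsers that have not reported "all for now"
        self.resolvers = set()
        self.terminated = False

    def check_terminate(self):
        if self.all_for_now_pending <= 0 and self.n_resolving <= 0:
            self.terminated = True

    def service_new(self, name):
        # A new service starts a resolver.
        self.resolvers.add(name)
        self.n_resolving += 1

    def resolver_finished(self, name):
        # Resolver succeeded or failed: it is freed and the counter drops.
        if name in self.resolvers:
            self.resolvers.discard(name)
            self.n_resolving -= 1
            self.check_terminate()

    def all_for_now(self):
        self.all_for_now_pending -= 1
        self.check_terminate()

    def service_removed(self, name):
        # Goodbye packet: the resolver is freed before it ever called back.
        if name in self.resolvers:
            self.resolvers.discard(name)
            if self.patched:
                self.n_resolving -= 1  # the decrement added by the patch
        if self.patched:
            self.check_terminate()     # the extra check added by the patch


def run_scenario(patched):
    """Service appears, browser reports all-for-now, goodbye arrives early."""
    model = BrowseModel(patched)
    model.service_new("unresolvable-host")
    model.all_for_now()
    model.service_removed("unresolvable-host")
    return model.terminated
```

In this model `run_scenario(False)` leaves the counter stuck at 1 with no resolver left to bring it down, so the run never terminates, while `run_scenario(True)` terminates as soon as the goodbye is processed.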

@fisherbe

@evverx I'm not sure how to apply a patch exactly. We just install avahi from RHEL / CentOS repos. I can grab this repo and jump on a given branch / tag (latest branch?) and apply the patch if I have instructions for that part. It may take some time because we don't currently have relevant devices in our data centers, but should soon ish.

evverx added a commit to evverx/avahi that referenced this issue Mar 26, 2024
…e resolvers fail/time out

Related to avahi#264

This PR addresses one particular scenario. There can be other scenarios
preventing avahi-browse from stopping:
avahi#444 (comment) but
they should be identified and fixed one by one.
@evverx
Member

evverx commented Mar 26, 2024

@fisherbe I opened #583 so it should be possible to get that patch by running the following commands:

git clone https://github.com/avahi/avahi
cd avahi
git fetch origin pull/583/head:browse-cache-terminate
git checkout browse-cache-terminate

installing the build dependencies and running

./bootstrap.sh
make

avahi-browse can be run directly from the avahi-utils directory without having to install anything:

./avahi-utils/avahi-browse -arpt

As mentioned there, the PR fixes one particular issue, and there can be other issues preventing avahi-browse from stopping. I found another way to trigger it, but it's unlikely to happen in practice unless something really malfunctions somewhere (or does that deliberately). If the patch doesn't help, it would be great if you could attach the output of avahi-browse when it happens, along with tcpdump/wireshark output showing the incoming/outgoing mDNS packets.
