avahi-browse bug with large LAN --terminate and --cache randomly never terminates #264

bwfisher82 · 2020-02-19T02:08:41Z

Hello,

Reporting an issue discussed in #avahi on freenode with lathiat:

I am experiencing an issue where the avahi-browse command never terminates when it should, randomly, on a large network.

I have a network with ~300 publishing devices. With ~100 devices it is fine, at ~150 we start noticing the issue, and with 300+ it is quite noticeable and easily reproducible, though it doesn't happen every time.

My automation software regularly runs avahi-browse to pull detected node information, then connects to the devices and performs various operations. As a work-around I currently have it time out after a few seconds, but with this many devices (this much mDNS traffic?) it happens often enough that the web UI for the software becomes noticeably slow, waiting for detection attempts to time out and retry.

The detection software I wrote is fairly basic: it calls avahi-browse from Python with a timeout, and it runs on a CentOS 7 server. The server is always up and, regarding the other time-sync bugs, it uses chronyd with the CentOS Internet time sources, so I think a clock problem is highly unlikely.

The command I am using is: avahi-browse -ltrp ._._tcp

This seems to be timing out (my timeout of 15s) about 3% of the time. If I exchange the -t option (terminate) for --cache it does the same thing. I believe it is the -r resolving action that probably has the issues.
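For readers who want to reproduce the work-around described above, here is a minimal sketch of calling a command from Python with a hard timeout and a few retries. It is not the reporter's actual code: the function name `run_with_timeout` and the `attempts` parameter are assumptions made for illustration. Only the command line and the 15 s timeout come from the report.

```python
import subprocess
import sys

# Command line copied verbatim from the report above.
BROWSE_CMD = ["avahi-browse", "-ltrp", "._._tcp"]


def run_with_timeout(cmd, timeout_s=15.0, attempts=3):
    """Run cmd and return its stdout, or None if every attempt timed out.

    Hypothetical helper: subprocess.run() kills the child process when the
    timeout expires and raises TimeoutExpired, so a hung avahi-browse
    cannot block the caller forever.
    """
    for attempt in range(1, attempts + 1):
        try:
            done = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.DEVNULL,
                universal_newlines=True,  # decode stdout to str
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt} timed out after {timeout_s}s",
                  file=sys.stderr)
            continue
        return done.stdout
    return None
```

A caller would use `output = run_with_timeout(BROWSE_CMD)` and treat `None` as "detection failed this round". This only hides the hang; it does not explain why avahi-browse fails to exit.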

When I do have the issue, I see output where it just continually re-resolves things that it has already displayed as if -t was not used.
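One way to confirm the re-resolving behaviour from captured output is to count how often each resolved record appears. The sketch below assumes the `-p` (parsable) output uses `;`-separated fields, with resolved records starting with `=` and carrying ten fields (event, interface, protocol, name, type, domain, host name, address, port, TXT). The helper names and the sample lines in the usage note are hypothetical, and escaped characters inside service names are not handled.

```python
from collections import Counter


def count_resolved(lines):
    """Count occurrences of each resolved ('=') record in parsable output.

    Assumes ten ';'-separated fields per resolved record. The split is
    capped at nine so that a ';' inside the trailing TXT field does not
    create extra fields.
    """
    seen = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(";", 9)
        if len(fields) < 10 or fields[0] != "=":
            continue  # skip '+' / '-' events and malformed lines
        _, iface, proto, name, stype, domain = fields[:6]
        seen[(iface, proto, name, stype, domain)] += 1
    return seen


def repeated(seen):
    """Return only the records that were resolved more than once."""
    return {key: n for key, n in seen.items() if n > 1}
```

Feeding it the saved output of a run that hung, for example `repeated(count_resolved(open("browse.log")))` with a hypothetical log file name, would list the services that were printed repeatedly.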

This is on a large fully 10GigE network if that matters. 16 core by 32 GiB memory on the server if that matters. Probably not.

The avahi-daemon is started with -s and --debug currently.

The config file:

[server]
use-ipv4=no
use-ipv6=yes
allow-interfaces=ens256
deny-interfaces=ens192,ens224
enable-dbus=yes
disallow-other-stacks=yes
objects-per-client-max=2048
ratelimit-interval-usec=1000000
ratelimit-burst=1000
cache-entries-max=2048

[wide-area]
enable-wide-area=no

[publish]

[reflector]

[rlimits]
rlimit-core=0
rlimit-data=4194304
rlimit-fsize=0
rlimit-nofile=768
rlimit-stack=4194304
rlimit-nproc=3

monsrud · 2023-03-22T14:33:52Z

@bwfisher82 - did you ever find a way to better tune avahi for a large network?

bwfisher82 · 2023-03-22T15:02:05Z

Nope. I'm not sure whether it's multicast in general or this mDNS implementation, because when pinging the group address you stop getting replies from everything as you add more and more devices. At just 200 devices it takes a 30-second ping to ff02::fb to be reasonably sure I got them all. I doubt it's an IPv6 issue, by the way. We had to stop using this altogether and require users to provide static IPs for the automation program. We briefly considered multiple network segments, but never made it that far before we went with static addressing and no automatic detection on the network; we basically removed avahi/bonjour/mdns completely. :/

monsrud · 2023-03-22T15:06:15Z

We are seeing that hosts don't show up for a very long time (up to an hour) via avahi-browse. However, if we restart the avahi-daemon on the remote host or the one on which we are running avahi-browse, the host shows up right away. This is on a network with only ~50 devices. Not a lot of leads out there on Google.

evverx · 2024-03-25T22:18:37Z

I reproduced this issue by announcing a service pointing to an unresolvable host name and then sending a goodbye packet before the resolver timed out. Could anyone apply the following patch to see if it helps:

diff --git a/avahi-utils/avahi-browse.c b/avahi-utils/avahi-browse.c
index 4028ca0..f7542ff 100644
--- a/avahi-utils/avahi-browse.c
+++ b/avahi-utils/avahi-browse.c
@@ -284,8 +284,10 @@ static void remove_service(Config *c, ServiceInfo *i) {
 
     AVAHI_LLIST_REMOVE(ServiceInfo, info, services, i);
 
-    if (i->resolver)
+    if (i->resolver) {
         avahi_service_resolver_free(i->resolver);
+        n_resolving--;
+    }
 
     avahi_free(i->name);
     avahi_free(i->type);
@@ -331,6 +333,7 @@ static void service_browser_callback(
                 return;
 
             remove_service(c, info);
+            check_terminate(c);
 
             print_service_line(c, '-', interface, protocol, name, type, domain, 1);
             break;

?
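To see why the missing decrement can keep avahi-browse running, here is a simplified Python model of the bookkeeping the patch touches. It is a sketch, not avahi's code: the class and method names are invented, and termination is reduced to "the browser has reported everything for now and no resolver is outstanding". The scenario follows the reproduction described above: a service whose resolver is still pending is removed by a goodbye packet.

```python
class BrowseModel:
    """Toy model of the n_resolving counter in avahi-browse (assumed logic)."""

    def __init__(self, patched):
        self.patched = patched
        self.n_resolving = 0      # resolvers believed to be in flight
        self.pending = set()      # services whose resolver has not finished
        self.all_for_now = False  # browser has reported everything so far

    def add_service(self, name):
        # A resolver is started for each newly discovered service.
        self.pending.add(name)
        self.n_resolving += 1

    def resolver_done(self, name):
        # Resolver callback fired (success or failure).
        if name in self.pending:
            self.pending.discard(name)
            self.n_resolving -= 1

    def remove_service(self, name):
        # Goodbye packet: the resolver is freed, so its callback never fires.
        if name in self.pending:
            self.pending.discard(name)
            if self.patched:
                self.n_resolving -= 1  # the decrement added by the patch

    def can_terminate(self):
        return self.all_for_now and self.n_resolving <= 0


def goodbye_before_timeout(patched):
    """Replay the reproduction scenario and report whether -t could exit."""
    m = BrowseModel(patched)
    m.add_service("good")
    m.add_service("unresolvable")
    m.all_for_now = True
    m.resolver_done("good")
    m.remove_service("unresolvable")  # goodbye arrives before the timeout
    return m.can_terminate()
```

In this model `goodbye_before_timeout(False)` never reaches a terminating state because the counter stays at 1, while `goodbye_before_timeout(True)` does. The real patch also calls `check_terminate(c)` right after the removal, which the model represents by polling `can_terminate()`.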

fisherbe · 2024-03-26T03:10:35Z

@evverx I'm not sure exactly how to apply a patch. We just install avahi from the RHEL / CentOS repos. I can grab this repo, jump on a given branch / tag (latest branch?), and apply the patch if I have instructions for that part. It may take some time because we don't currently have relevant devices in our data centers, but we should soon-ish.

…e resolvers fail/time out Related to avahi#264 This PR addresses one particular scenario. There can be other scenarios preventing avahi-browse from stopping: avahi#444 (comment) but they should be identified and fixed one by one.

evverx · 2024-03-26T11:06:09Z

@fisherbe I opened #583 so it should be possible to get that patch by running the following commands:

git clone https://github.com/avahi/avahi
cd avahi
git fetch origin pull/583/head:browse-cache-terminate
git checkout browse-cache-terminate

installing the build dependencies and running

./bootstrap.sh
make

avahi-browse can be run directly from the avahi-utils directory without having to install anything:

./avahi-utils/avahi-browse -arpt

As mentioned there, the PR fixes one particular issue, and there can be other issues preventing avahi-browse from stopping. I found another way to trigger it, but it's unlikely to happen in practice unless something really malfunctions somewhere (or does that deliberately). If the patch doesn't help, it would be great if you could attach the output of avahi-browse when it happens, and also the output of tcpdump/wireshark showing incoming/outgoing mDNS packets.

evverx mentioned this issue Sep 30, 2023

avahi-browse sometimes never stop despite terminate option #444

Closed

evverx added the bug label Mar 25, 2024

evverx added the needs-reporter-feedback label Mar 25, 2024

evverx mentioned this issue Mar 26, 2024

avahi-browse: make -t/-c work when goodbye packets are received before resolvers fail/time out #583

Draft
