mDNS sends reply to multicast instead of unicast if request source port is 5353 (breaks avahi-resolve) (IDFGH-5375) #7124

Closed
DaStoned opened this issue Jun 9, 2021 · 8 comments
Labels
Resolution: Done Issue is done internally Status: Done Issue is done internally

Comments

@DaStoned

DaStoned commented Jun 9, 2021

Environment

  • Development Kit: ESP32-DevKitC
  • Kit version (for WroverKit/PicoKit/DevKitC): v4
  • Module or chip used: ESP32-WROOM-32D
  • IDF version: v4.2.1
  • Build System: idf.py
  • Compiler version: xtensa-esp32-elf-gcc (crosstool-NG esp-2020r3) 8.4.0
  • Operating System: Linux (Debian 11)
  • Using an IDE?: No
  • Power Supply: USB

Problem Description

The mDNS resolution of device name to IP has stopped working in v4.2.1 (don't remember in which version it was OK, maybe v4.0?). Specifically, the ESP IDF mDNS service responds to queries from avahi-resolve -n -4 UBIK-71F164.local with a packet destined to the multicast group 224.0.0.251 which is not received by avahi-resolve. As a result avahi-resolve doesn't resolve the name.

In contrast, running dig +short -p 5353 @224.0.0.251 UBIK-71F164.local resolves the name perfectly. Details and packet capture below.

NB! The problem only appears about 2 minutes after the ESP32 boots, probably because on boot mDNS announces the relevant info, which gets cached by my machine.

The network setup is fairly common, a Technicolor domestic router/AP with ESP32 connected via WiFi and my computer via Ethernet.

Expected Behavior

Running avahi-resolve 2 minutes after ESP32 boot resolves the name to IP:

tarmo@lumi:~$ avahi-resolve -n -4 UBIK-71F164.local -v
Server version: avahi 0.8; Host name: lumi.local
UBIK-71F164.local       192.168.224.120

Actual Behavior

avahi-resolve fails to resolve the name:

tarmo@lumi:~$ avahi-resolve -n -4 UBIK-71F164.local -v
Server version: avahi 0.8; Host name: lumi.local
Failed to resolve host name 'UBIK-71F164.local': Timeout reached

At the same time dig correctly resolves the name:

tarmo@lumi:~$ dig +short -p 5353 @224.0.0.251 UBIK-71F164.local
192.168.224.120

Steps to reproduce

  1. Start ESP32, init mDNS and set name per official instructions
  2. Wait 2+ minutes
  3. Run avahi-resolve -n -4 <name>.local
  4. Resolution times out

Code to reproduce this issue

bool NetworkHub::runMdns() {
    esp_err_t result = mdns_init();
    if (ESP_OK != result) {
        err("Fail mDNS init: %s (%d)", esp_err_to_name(result), result);
        return false;
    }
    result = mdns_hostname_set(mTrait.id());
    if (ESP_OK != result) {
        err("Fail mDNS set hostname to [%s]: %s (%d)", mTrait.id(), esp_err_to_name(result), result);
        return false;
    }
    char buf[50];
    const int printed = snprintf(buf, sizeof(buf), "Solar panel inverter/converter %s", mTrait.id());
    assert(printed > 0 && static_cast<size_t>(printed) < sizeof(buf));
    result = mdns_instance_name_set(buf);
    if (ESP_OK != result) {
        err("Fail mDNS set instance name to [%s]: %s (%d)", buf, esp_err_to_name(result), result);
        return false;
    }
    result = mdns_service_add(nullptr, "_https", "_tcp", 443, nullptr, 0);
    if (ESP_OK != result) {
        err("Fail mDNS add service HTTPS: %s (%d)", esp_err_to_name(result), result);
        return false;
    }
    return true;
}

Debug Logs

There is nothing relevant to mDNS in the ESP32 logs.

Initial analysis

A screenshot from Wireshark shows two different mDNS queries (request-response pairs):

[Image: Wireshark screenshot of the two mDNS query-response pairs]

First query-response pair is from running dig +short -p 5353 @224.0.0.251 UBIK-71F164.local which successfully resolves the name. You can see that the request's source port is a random ephemeral port 44238. The response is sent as unicast to source IP of resolve query (192.168.224.100). All good.

Second query-response pair is from running avahi-resolve -n -4 UBIK-71F164.local -v which fails. The request's source port is 5353. The response is sent to multicast group 224.0.0.251 instead of source IP of resolve query. Not good.

The packet capture for those 4 packets is here:
packetcapture.zip

I can trace the problem in mDNS source code to this line, which seems to make a decision to "flush" the response if source port is 5353:

bool send_flush = parsed_packet->src_port == MDNS_SERVICE_PORT;

@espressif-bot espressif-bot added the Status: Opened Issue is new label Jun 9, 2021
@github-actions github-actions bot changed the title mDNS sends reply to multicast instead of unicast if request source port is 5353 (breaks avahi-resolve) mDNS sends reply to multicast instead of unicast if request source port is 5353 (breaks avahi-resolve) (IDFGH-5375) Jun 9, 2021
@negativekelvin
Contributor

negativekelvin commented Jun 9, 2021

The cache-flush bit is only set in records in the Resource Record
Sections of Multicast DNS responses sent to UDP port 5353.

The cache-flush bit MUST NOT be set in any resource records in a
response message sent in legacy unicast responses to UDP ports other
than 5353.
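
To make the quoted rule concrete, here is a minimal sketch in C. It assumes the cache-flush bit is the most significant bit of the 16-bit rrclass field, as RFC 6762 section 10.2 describes. The helper name rr_class_for_response and the macro names are hypothetical and do not come from the esp-idf mDNS component.

```c
#include <stdbool.h>
#include <stdint.h>

#define MDNS_PORT          5353u    /* standard mDNS UDP port */
#define RR_CLASS_IN        0x0001u  /* DNS class IN */
#define RR_CACHE_FLUSH_BIT 0x8000u  /* top bit of rrclass (RFC 6762, section 10.2) */

/* Hypothetical helper: returns the rrclass value to write on the wire.
 * The cache-flush bit is set only for unique records in responses that are
 * sent to UDP port 5353; for any other destination port it is cleared. */
static uint16_t rr_class_for_response(uint16_t rr_class,
                                      bool record_is_unique,
                                      uint16_t dst_port)
{
    /* Start from the plain class with the cache-flush bit cleared. */
    uint16_t value = (uint16_t)(rr_class & ~RR_CACHE_FLUSH_BIT);

    if (record_is_unique && dst_port == MDNS_PORT) {
        value = (uint16_t)(value | RR_CACHE_FLUSH_BIT);
    }
    return value;
}
```

With this sketch, a unique A record answered to port 5353 would carry class 0x8001, while the same record answered to an ephemeral port such as 44238 would carry plain 0x0001.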

Note that neither request sets the unicast response bit. I don't know if the fact that a unicast response is forced when an alternate port is used is per spec or not.

esp-idf/components/mdns/mdns.c

Lines 1334 to 1337 in 21ecef5

if (unicast || !send_flush) {
    memcpy(&packet->dst, &parsed_packet->src, sizeof(esp_ip_addr_t));
    packet->port = parsed_packet->src_port;
}
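
The effect of that condition on the two captured queries can be sketched in isolation. This is a simplified illustration, not the actual esp-idf code. The endpoint_t type and pick_reply_destination function are hypothetical stand-ins. The sketch assumes that unicast reflects the unicast-response bit of the question, and that the reply otherwise defaults to the multicast group 224.0.0.251 on port 5353.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MDNS_SERVICE_PORT 5353
#define MDNS_GROUP_IPV4   "224.0.0.251"

/* Simplified stand-in for the address/port pair of a packet (illustrative). */
typedef struct {
    char     addr[16];  /* dotted-quad IPv4 address, NUL-terminated */
    uint16_t port;      /* UDP port */
} endpoint_t;

/* Hypothetical helper mirroring the quoted condition:
 * reply to the sender if a unicast response was requested or the query
 * did not come from port 5353; otherwise reply to the multicast group. */
static endpoint_t pick_reply_destination(const endpoint_t *src, bool unicast)
{
    bool send_flush = (src->port == MDNS_SERVICE_PORT);

    /* Assumed default: multicast group on the mDNS port. */
    endpoint_t dst;
    memset(&dst, 0, sizeof(dst));
    strncpy(dst.addr, MDNS_GROUP_IPV4, sizeof(dst.addr) - 1);
    dst.port = MDNS_SERVICE_PORT;

    if (unicast || !send_flush) {
        dst = *src;  /* answer directly to the query's source address and port */
    }
    return dst;
}
```

Under these assumptions, the dig query (source port 44238) is answered by unicast to its source address and port, while a query from port 5353 without the unicast-response bit is answered to the multicast group. That matches the behavior seen in the capture.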

@Alvin1Zhang
Collaborator

Thanks for sharing the detailed report, we will look into it.

@david-cermak
Collaborator

Thanks again for this report and sharing the analysis, this is helpful 👍

The actual difference between the dig query and the one sent by avahi is not only the source port but, more importantly, the fact that dig (like some other libraries, including lwIP) sends so-called one-shot queries, defined in https://datatracker.ietf.org/doc/html/rfc6762#section-5.1, whereas avahi is a fully compliant mDNS querier.

The mDNS library doesn't (correctly) support both scenarios. The attempt to behave as a full-featured mDNS responder while still responding to lwIP mDNS queries is what introduced the MDNS_REPEAT_QUERY_IN_RESPONSE and CONFIG_MDNS_STRICT_MODE configuration options.
Enabling strict mode in the project configuration menu would make it respond to avahi's queries (after those 2-minute TTLs expire) but stop responding to dig's. Note that avahi also deviates a little from the spec here: https://datatracker.ietf.org/doc/html/rfc6762#page-14

Here's a quick fix, which should work for both queriers:
mdns-Support-for-One-Shot-mDNS-queries.patch.txt

negativekelvin wrote: "I don't know if the fact that a unicast response is forced when an alternate port is used is per spec or not."

https://datatracker.ietf.org/doc/html/rfc6762#section-6.7
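
As a rough illustration of what that section asks of a responder, the sketch below picks the reply parameters from the query's source port. It follows my reading of RFC 6762: a legacy unicast response repeats the query ID and question, must not set the cache-flush bit, and should cap the TTL at ten seconds, while a multicast response uses ID zero and carries no question. The names reply_params_t and plan_reply are hypothetical and do not reflect the esp-idf implementation or the attached patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define MDNS_PORT          5353u  /* standard mDNS UDP port */
#define LEGACY_MAX_TTL_SEC 10u    /* TTL cap for legacy unicast responses */

/* Hypothetical summary of how a reply should be built. */
typedef struct {
    uint16_t id;                /* DNS message ID for the reply header */
    bool     repeat_question;   /* copy the question section into the reply */
    bool     allow_cache_flush; /* whether the cache-flush bit may be set */
    uint32_t ttl;               /* TTL in seconds for the answer record */
} reply_params_t;

static reply_params_t plan_reply(uint16_t query_id, uint16_t src_port,
                                 uint32_t record_ttl)
{
    reply_params_t p;

    if (src_port != MDNS_PORT) {
        /* One-shot (legacy) querier: answer like a conventional DNS server. */
        p.id = query_id;
        p.repeat_question = true;
        p.allow_cache_flush = false;
        p.ttl = (record_ttl > LEGACY_MAX_TTL_SEC) ? LEGACY_MAX_TTL_SEC : record_ttl;
    } else {
        /* Full mDNS querier: regular mDNS response. */
        p.id = 0;
        p.repeat_question = false;
        p.allow_cache_flush = true;
        p.ttl = record_ttl;
    }
    return p;
}
```

For a query from port 44238 with ID 0x1234 and a 120-second record, this sketch yields ID 0x1234, a repeated question, no cache-flush bit, and a TTL of 10 seconds. The same query from port 5353 yields ID 0, no question, and the full 120-second TTL.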

@DaStoned
Author

DaStoned commented Jun 10, 2021

Thank you very much for looking into it. My conclusions probably threw a bunch of red herrings on your path, sorry for that.

I confirm that the patch works. I tested on Linux with every tool I could think of (avahi-resolve, dig, ping, Chrome, Firefox). All successfully resolve the name of the patched device. At the same time the unpatched device is not resolved.

Just in case I also booted Windows and tested there (ping, Chrome, Edge) - works like a charm using whatever native thing that does mDNS on Windows. Unfortunately I don't have any Apple devices, so can't test with Mac or iOS.

Side note: dig is just a diagnostic tool, but the system name resolver on Linux uses avahi (most frequently, systemd also has something which I haven't tested), so if that doesn't resolve my device's name it's quite unfortunate. Can't access the web interface by name, can't ping to verify connectivity. If I access the HTTPS web interface by IP, the certificate no longer applies so the browser is screaming murder and hackers (and not using the stored password)...

@espressif-bot espressif-bot added Status: In Progress Work is in progress and removed Status: Opened Issue is new labels Jun 11, 2021
@espressif-bot espressif-bot added Status: Done Issue is done internally Resolution: Done Issue is done internally and removed Status: In Progress Work is in progress labels Jun 24, 2021
@Alvin1Zhang
Collaborator

Thanks for reporting. The fix is available on the master branch in f601cb0. We are backporting the fix to release/4.3 and release/4.2. Thanks.

@DaStoned
Author

Great news! Can't test/confirm with master, as it's frankly a pain in the butt to build. Waiting for the backports.

@Alvin1Zhang
Collaborator

@DaStoned Thanks for being patient while waiting for the fix. The fix is available on release/4.3 in 3a588d7. The fix for release/4.2 is now being synchronized onto GitHub; we will update once it is available. Thanks.

@Alvin1Zhang
Collaborator

Thanks for being patient while waiting for the fix. The fix is available on release/4.2 in 93921e0. Feel free to reopen if the issue still happens. Thanks.
