APs send out ARP broadcast with anycast IP to mesh (and clients) with local MAC address #1488

Closed
ecsv opened this issue Jul 21, 2018 · 4 comments
Labels
0. type: bug This is a bug

Comments

@ecsv
Contributor

ecsv commented Jul 21, 2018

It seems that ARP packets are sent around in the mesh which announce the anycast IPv4 address with MAC addresses other than the anycast MAC address. Clients receive these packets and are then no longer able to communicate with their connected AP over IPv4. This is especially problematic when dnsmasq is used as DNS forwarder on the node.


I wanted to ask about this on IRC, but it seems hackint thought that I was a spammer. So here is the full text:

[2018-07-21 19:34:41] <ecsv> I've deployed here the dnsmasq forwarder on all nodes (gluon 2018.1) - but it made quite some problems and ruined my friday. but at least I've noticed an interesting property: problem only happens over ipv4
[2018-07-21 19:34:41] <ecsv> ok, the clients request the DNS records via ipv4 anycast address or via ipv6 anycast address. the problem only happens via ipv4
[2018-07-21 19:34:41] <ecsv> a similar problem can be seen via icmp echo requests. I have roughly a 12% packet loss when pinging the ipv4 anycast address (300x 1s ping) but 0% loss when using the link local anycast address
[2018-07-21 19:34:41] <ecsv> arp-limiter was disabled btw.
[2018-07-21 19:34:41] <ecsv> does anybody else notice the same problem?
[2018-07-22 00:06:54] <ecsv> ok, problem seems to be that from time to time, the client thinks that the anycast ipv4 address is on a completely different device (on the other end of the mesh)
[2018-07-22 00:06:54] <ecsv> and why does it seem to think so? because APs on the other end of the mesh send out an ARP broadcast (with the anycast ip address) which asks about some client mac which is not known to them
[2018-07-22 00:06:54] <ecsv> now my client device thinks "hey, cool, the anycast ipv4 address just changed its mac address. now I have to send all my traffic to the other one. probing is the cool kids way to handle it"
[2018-07-22 00:06:54] <ecsv> important question now would be: why do APs send out ARP requests with the anycast IP address but not anycast mac address (16:41:95:40:f7:dc) to get the mac address for some clients (maybe because they wanted to send some DNS replies)
[2018-07-22 00:06:54] <ecsv> and why is this traffic allowed to enter the mesh. the last question can be answered using ebtables. (and the first one is a little harder and I will just ignore it for now)
[2018-07-22 00:06:54] <ecsv> i would say that all ARP traffic from the IPv4 anycast address in any arp ip field should be dropped before entering bat0 in the OUTPUT chain (done in FORWARD but not in OUTPUT chain)
[2018-07-22 00:06:54] <ecsv> and I would say that all traffic from bat0 with  any arp ip field equal to the IPv4 anycast address must be dropped in the ebtables INPUT chain
[2018-07-22 00:06:54] <ecsv> and I would say that all traffic from bat0 with  any arp ip field equal to the IPv4 anycast address must be dropped in the ebtables FORWARD chain
[2018-07-22 00:06:54] <ecsv> neoraider/T_X: would you agree?
[2018-07-22 00:06:54] <ecsv> I have now added following six rules to my local node to work around this problem:
[2018-07-22 00:06:54] <ecsv> ebtables-tiny -I OUTPUT 1 -p ARP --logical-out br-client -o bat0 --arp-ip-src 10.204.32.1 -j DROP 
[2018-07-22 00:06:54] <ecsv> ebtables-tiny -I OUTPUT 1 -p ARP --logical-out br-client -o bat0 --arp-ip-dst 10.204.32.1 -j DROP 
[2018-07-22 00:06:54] <ecsv> ebtables-tiny -I INPUT 1 -p ARP -i bat0 --arp-ip-src 10.204.32.1 -j DROP 
[2018-07-22 00:06:54] <ecsv> ebtables-tiny -I INPUT 1 -p ARP -i bat0 --arp-ip-dst 10.204.32.1 -j DROP 
[2018-07-22 00:06:54] <ecsv> ebtables-tiny -I FORWARD 1 -p ARP -i bat0 --arp-ip-src 10.204.32.1 -j DROP 
[2018-07-22 00:06:54] <ecsv> ebtables-tiny -I FORWARD 1 -p ARP -i bat0 --arp-ip-dst 10.204.32.1 -j DROP 
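The mechanism described in the log can be illustrated with a small simulation. This is only a sketch under stated assumptions, not how any real ARP stack or ebtables is implemented: the remote AP MAC and the queried client IP are hypothetical values, the cache update follows the generic RFC 826 rule (an existing entry is overwritten from the sender fields of any received ARP packet), and the filter is modelled as a simple predicate in front of the client instead of on the AP.

```python
from dataclasses import dataclass

ANYCAST_IP = "10.204.32.1"           # anycast IPv4 address from the rules above
ANYCAST_MAC = "16:41:95:40:f7:dc"    # anycast MAC address from the log above
REMOTE_AP_MAC = "02:00:00:00:00:01"  # hypothetical local MAC of a distant AP
QUERIED_CLIENT_IP = "10.204.40.7"    # hypothetical client the distant AP asks about


@dataclass(frozen=True)
class Arp:
    sender_mac: str
    sender_ip: str
    target_ip: str


def drop_anycast_arp(pkt: Arp) -> bool:
    """Model of the proposed filter: drop ARP with the anycast IP in any IP field."""
    return ANYCAST_IP in (pkt.sender_ip, pkt.target_ip)


def receive(cache: dict, pkt: Arp, filtered: bool) -> None:
    """Overwrite an existing cache entry from the ARP sender fields (RFC 826 style)."""
    if filtered and drop_anycast_arp(pkt):
        return
    if pkt.sender_ip in cache:
        cache[pkt.sender_ip] = pkt.sender_mac


def simulate(filtered: bool) -> str:
    """Return the MAC the client associates with the anycast IP after a stray ARP."""
    cache = {ANYCAST_IP: ANYCAST_MAC}  # learned earlier from the local AP
    stray = Arp(REMOTE_AP_MAC, ANYCAST_IP, QUERIED_CLIENT_IP)
    receive(cache, stray, filtered)
    return cache[ANYCAST_IP]


print("unfiltered:", simulate(filtered=False))
print("filtered:  ", simulate(filtered=True))
```

Without the filter the client ends up with the distant AP's MAC for the anycast address, which matches the observed redirect of traffic; with the filter the entry keeps the anycast MAC.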
@ecsv ecsv changed the title from "APs send out ARP broadcast to mesh (and clients) with local MAC address" to "APs send out ARP broadcast with anycast IP to mesh (and clients) with local MAC address" Jul 21, 2018
@ecsv
Contributor Author

ecsv commented Jul 21, 2018

Here is a pcap (captured on client0 of the AP) which shows this problem quite clearly: anycast_ipv4_icmp_redirect.pcap.gz

First you see two packets of a working IPv4 ping. Then you will notice the problematic ARP from the other AP (somewhere in Plauen - but I am currently not even in Vogtland and they are only connected via the VPN servers). The next seven packets are not really interesting (I just didn't remove them, to keep the packets in their original order). Packet 11 is then received by the AP, but its destination MAC is obviously no longer the anycast MAC address - it is the MAC address of another AP in the mesh.
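To spot such packets in a capture, one can check every ARP frame for the anycast IPv4 address in the sender field combined with a sender MAC that is not the anycast MAC. The following is a minimal sketch using only the standard library; it works on raw Ethernet frames and builds its own synthetic test frames, so the remote AP MAC and the queried client IP are hypothetical, and reading the actual pcap file is left out.

```python
import socket
import struct

ANYCAST_IP = "10.204.32.1"         # anycast IPv4 address from this issue
ANYCAST_MAC = "16:41:95:40:f7:dc"  # anycast MAC address from this issue
ETHERTYPE_ARP = 0x0806


def mac_bytes(mac: str) -> bytes:
    return bytes.fromhex(mac.replace(":", ""))


def mac_str(raw: bytes) -> str:
    return ":".join(f"{b:02x}" for b in raw)


def build_arp_request(sender_mac: str, sender_ip: str, target_ip: str) -> bytes:
    """Build a broadcast Ethernet frame carrying an IPv4 ARP request."""
    eth = b"\xff" * 6 + mac_bytes(sender_mac) + struct.pack("!H", ETHERTYPE_ARP)
    # htype=Ethernet, ptype=IPv4, hlen=6, plen=4, oper=1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += mac_bytes(sender_mac) + socket.inet_aton(sender_ip)
    arp += b"\x00" * 6 + socket.inet_aton(target_ip)
    return eth + arp


def is_suspicious(frame: bytes) -> bool:
    """True if an ARP frame claims the anycast IP with a non-anycast sender MAC."""
    if len(frame) < 42:
        return False
    if struct.unpack("!H", frame[12:14])[0] != ETHERTYPE_ARP:
        return False
    sender_mac = mac_str(frame[22:28])          # ARP sender hardware address
    sender_ip = socket.inet_ntoa(frame[28:32])  # ARP sender protocol address
    return sender_ip == ANYCAST_IP and sender_mac != ANYCAST_MAC


# Hypothetical frame from a distant AP asking for a client it does not know.
stray = build_arp_request("02:00:00:00:00:01", ANYCAST_IP, "10.204.40.7")
print(is_suspicious(stray))
```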

@ecsv
Contributor Author

ecsv commented Jul 21, 2018

Gluon was modified in the following way to integrate the workaround:

--- a/package/gluon-mesh-batman-adv/files/lib/gluon/ebtables/250-next-node
+++ b/package/gluon-mesh-batman-adv/files/lib/gluon/ebtables/250-next-node
@@ -15,6 +15,14 @@ rule('OUTPUT --logical-out br-client -o bat0 -s ' .. macaddr .. ' -j DROP')
 if next_node.ip4 then
 	rule('FORWARD --logical-out br-client -o bat0 -p ARP --arp-ip-src ' .. next_node.ip4 .. ' -j DROP')
 	rule('FORWARD --logical-out br-client -o bat0 -p ARP --arp-ip-dst ' .. next_node.ip4 .. ' -j DROP')
+	rule('FORWARD --logical-out br-client -i bat0 -p ARP --arp-ip-src ' .. next_node.ip4 .. ' -j DROP')
+	rule('FORWARD --logical-out br-client -i bat0 -p ARP --arp-ip-dst ' .. next_node.ip4 .. ' -j DROP')
+
+	rule('OUTPUT --logical-out br-client -o bat0 -p ARP --arp-ip-src ' .. next_node.ip4 .. ' -j DROP')
+	rule('OUTPUT --logical-out br-client -o bat0 -p ARP --arp-ip-dst ' .. next_node.ip4 .. ' -j DROP')
+
+	rule('INPUT -i bat0 -p ARP --arp-ip-src ' .. next_node.ip4 .. ' -j DROP')
+	rule('INPUT -i bat0 -p ARP --arp-ip-dst ' .. next_node.ip4 .. ' -j DROP')
 
 	rule('FORWARD --logical-out br-client -o bat0 -p IPv4 --ip-destination ' .. next_node.ip4 .. ' -j DROP')
 	rule('OUTPUT --logical-out br-client -o bat0 -p IPv4 --ip-destination ' .. next_node.ip4 .. ' -j DROP')

@ecsv
Contributor Author

ecsv commented Jul 22, 2018

@T-X: I just had a look at the history, and my guess is that the cause of both problems (anycast IPv4 traffic on the mesh and ) is b3762fc ("gluon-client-bridge: move IPv4 local subnet route to br-client (#1312)"). Please think about reverting it (with a proper upgrade script) for now.

Here (duplicate_use_of_anycast_ipv4_arp.pcapng.gz) is, for example, a pcap from an offline node which uses two different MAC addresses for the anycast IPv4 address. One is the default gluon anycast MAC and the other one is the OpenMesh.com MAC address of the device. My device first tries to ping the anycast IPv4 address (10.204.32.1) and the device answers with the anycast MAC address. The device then also tries to send an ICMP reply and therefore also transmits an ARP. This time, the ARP is transmitted via the br-client interface (and therefore uses the wrong MAC address).

I have introduced the following additional changes (and ran gluon-reconfigure) to work around the problem on my (offline) local test node for 2018.1:

--- a/package/gluon-client-bridge/luasrc/lib/gluon/upgrade/310-gluon-client-bridge-local-node
+++ b/package/gluon-client-bridge/luasrc/lib/gluon/upgrade/310-gluon-client-bridge-local-node
@@ -23,7 +23,8 @@ uci:section('network', 'device', 'local_node_dev', {
 local ip4, ip6
 
 if next_node.ip4 then
-	ip4 = next_node.ip4 .. '/32'
+	local plen = site.prefix4():match('/%d+$')
+	ip4 = next_node.ip4 .. plen
 end
 
 if next_node.ip6 then
--- a/package/gluon-mesh-batman-adv/luasrc/lib/gluon/upgrade/320-gluon-mesh-batman-adv-client-bridge
+++ b/package/gluon-mesh-batman-adv/luasrc/lib/gluon/upgrade/320-gluon-mesh-batman-adv-client-bridge
@@ -25,10 +25,6 @@ uci:section('network', 'interface', 'client', {
 uci:delete('network', 'client_lan')
 
 uci:delete('network', 'local_node_route')
-uci:section('network', 'route', 'local_node_route', {
-	interface = 'client',
-	target = site.prefix4(),
-})
 
 uci:delete('network', 'local_node_route6')
 uci:section('network', 'route6', 'local_node_route6', {

But the ebtables filter rules are still necessary because these broken nodes will be around for a while. We must make sure that other nodes are not affected by them.
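For readers not fluent in Lua patterns, the first hunk of the diff above takes the prefix length from the site prefix and appends it to the next-node address instead of a fixed `/32`. Below is a rough Python equivalent of that one step. It is a sketch only: `local_node_ip4` is a hypothetical helper name, and the `/20` site prefix is an assumed example value, since the real prefix is not given in this issue.

```python
import ipaddress
import re


def local_node_ip4(next_node_ip4: str, site_prefix4: str) -> str:
    """Append the prefix length of the site prefix to the next-node address.

    Mirrors the Lua pattern '/%d+$' from the diff with the regex r'/\\d+$'.
    """
    match = re.search(r"/\d+$", site_prefix4)
    if match is None:
        raise ValueError(f"no prefix length in {site_prefix4!r}")
    return next_node_ip4 + match.group(0)


SITE_PREFIX4 = "10.204.32.0/20"  # assumed example, not taken from the issue

old = ipaddress.ip_interface("10.204.32.1/32")
new = ipaddress.ip_interface(local_node_ip4("10.204.32.1", SITE_PREFIX4))

print(new)          # address with the site prefix length
print(old.network)  # /32: covers only the address itself
print(new.network)  # covers the whole client subnet
```

With the full prefix length on the address, the subnet is covered by the address itself, which is presumably why the explicit `local_node_route` section on the client interface is removed in the second hunk.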

@ecsv
Contributor Author

ecsv commented Jul 22, 2018

#1489 (2018.1.x) and #1490 (master) were merged

@ecsv ecsv closed this as completed Jul 22, 2018
@rotanid rotanid added the 0. type: bug This is a bug label Jul 23, 2018