Gluon on VMWare does not get DHCP on WAN interface #496

Open
nmaas87 opened this issue Sep 13, 2015 · 26 comments
Labels
0. type: bug This is a bug 0. type: regression 9. meta: known issue Known issue which should be mentioned in release notes

Comments

@nmaas87

nmaas87 commented Sep 13, 2015

As written here (https://forum.freifunk.net/t/bug-in-gluon-0-7-2-x86-vmware/7801), I have version 0.7.2 of the Freifunk Trier VMWare image (https://github.com/freifunktrier/firmware_store/blob/master/firmware/stable/factory/gluon-fftr-0.7.2-x86-vmware.vmdk), which does not get a DHCP lease on the WAN (eth1) interface after initial setup. I left everything in the web configuration at its defaults, activated mesh_via_vpn and checked in the VPN key, which succeeded. However, my appliance can't get an IP on the WAN interface and does not connect to the internet. The same thing happens with the image from ggrz. Any ideas?
I can reproduce the error on Workstation 11 as well as on VMWare ESXi 4.1, with multiple Gluon 0.7.2 versions. The WAN interface is correctly configured as eth1, as described in the core config of Gluon, but running udhcpc on eth1 does not return an IP. The UNOFFICIAL Gluon 0.3.3 image from IT-KL.eu (https://www.it-kl.eu/2015/08/gluon-x86-unter-vmware/ ) works correctly in the same configuration and environment.

@jplitza
Member

jplitza commented Sep 13, 2015

Could you verify the layer 2 connectivity of that interface, i.e. use something like ping6 -I eth1 ff02::1 and see if your host responds? Also, more details about the network setup of your VMWare setup would probably be needed.
Also, what Gluon versions do those community version numbers correspond to?

@nmaas87
Author

nmaas87 commented Sep 13, 2015

Hi there,

ping6 -I eth1 ff02::1 only gives me ping6: sendto: Operation not permitted.
Same for eth0. However, a ping6 on br-client, br-wan and bat0 gets an answer.
The system also hands out IPv6 addresses on eth0, the client network, but no IPv4 addresses, because it is not online.

Network:
Freifunk VM, eth0 (E1000) for client port to vlan 5
Freifunk VM, eth1 (E1000) for wan port to vlan 1
Internal pfSense, which gives LAN and dhcp on vlan 1
Internal pfSense, which takes WAN on vlan 2
VMWare ESXi 4.1, attached with trunk port to switch
Switch with trunk port to VMWare
Switch with vlan 2, WAN access is attached here
Switch with vlan 1, LAN port, Freifunk is attached here with Uplink
Switch with vlan 5, Client port of Freifunk

I can plug my notebook into vlan 1 and it gets an IP address successfully, so trunking, VLANs and everything else are working. I can plug it into vlan 5 and it gets an autogenerated IPv6 address from the Freifunk client port.
However, the Freifunk VM does not get a WAN IP...

In short, Freifunk is attached like this:
WAN with dhcpdv4 ----- eth1 (Freifunk) eth0 ------ Client Laptop

The Image is based on Gluon 2015.1.2.

@neocturne
Member

I've confirmed this issue. It is caused by Gluon explicitly setting the MAC address of br-wan (to avoid address conflicts in the case mesh-on-WAN is used). VMware blocks unknown MAC addresses though...

This is a regression introduced in v2015.1.2.
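
To illustrate the diagnosis above, here is a minimal Python sketch of a simplified receive filter. It assumes that a virtual switch port only delivers unicast frames addressed to the MAC the hypervisor assigned to the virtual NIC, unless promiscuous mode is allowed on that port. The class name, its fields and the example addresses are hypothetical; this is a toy model, not VMware's implementation.

```python
from dataclasses import dataclass


def is_group_address(mac: str) -> bool:
    """True for broadcast/multicast MACs (lowest bit of the first octet set)."""
    return bool(int(mac.split(":")[0], 16) & 0x01)


@dataclass
class VSwitchPort:
    """Toy model of a virtual switch port (hypothetical, heavily simplified)."""
    assigned_mac: str          # MAC the hypervisor generated for the vNIC
    promiscuous: bool = False  # whether promiscuous mode is allowed on the port

    def delivers(self, dst_mac: str) -> bool:
        """Return True if a frame with this destination reaches the guest."""
        if self.promiscuous or is_group_address(dst_mac):
            return True
        return dst_mac.lower() == self.assigned_mac.lower()


# Example addresses are made up for illustration.
NIC_MAC = "00:0c:29:aa:bb:cc"     # address assigned to eth1 by the hypervisor
BRIDGE_MAC = "02:11:22:33:44:55"  # address set explicitly on br-wan

port = VSwitchPort(assigned_mac=NIC_MAC)
# A unicast reply to the bridge MAC (such as a unicast DHCP offer) is dropped ...
reply_dropped = not port.delivers(BRIDGE_MAC)
# ... until promiscuous mode is allowed on the port.
port.promiscuous = True
reply_delivered = port.delivers(BRIDGE_MAC)
```

In this model the guest can still send from br-wan, but unicast answers never arrive, which would match a DHCP client that keeps retrying without getting a lease.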

@neocturne neocturne added this to the 2015.2 milestone Sep 19, 2015
@adlerweb
Contributor

For ESXi: Did you try to allow promiscuous mode in VMware's network security settings [1]? Gluon uses a network bridge (br-wan) for WAN, which causes the interface to enter promiscuous mode; this, however, is forbidden in VMware's default configuration. This also affects static configuration and is not related to DHCP.

[1] http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004099

@Little-Ben

nmaas87, this is Freifunk Westpfalz.

Our official repositories live only under:
https://github.com/freifunk-westpfalz/

Official Downloads solely via:
http://westpfalz.freifunk.net/firmware/
or
http://download.westpfalz.freifunk.net/

Please do not announce other locations as official references!

@nmaas87
Author

nmaas87 commented Sep 20, 2015

@adlerweb That was the right hint. I got it working by enabling promiscuous mode on the ESXi software switch. However, it would be better if there were a software fix in the Freifunk firmware, as promiscuous mode is more of a workaround and not secure, if I recall correctly.

@Little-Ben Sorry, I rewrote the issue to say "unofficial" and removed the reference to Freifunk Westpfalz.

@nmaas87
Author

nmaas87 commented Sep 21, 2015

OK, promiscuous mode does not solve the problem completely. After enabling it, br-wan got an IP address, and the VPN and everything else worked (eth1). On eth0 (client net), I also got a client IP from the network and had access to the internet. However, it turned out to be VERY slow (loading www.google.de took about one minute, and connections to other sites like www.web.de were even dropped).
On my console, I found the message
br-wan: received packet on eth1 with own address as source address
repeated at millisecond intervals, spamming the console.
And sometimes something like net_ratelimit: 77 callbacks suppressed.
I don't know whether that is because of promiscuous mode, but it seems like either a) the VPN network already has a host with the same MAC address, or b) I'm somehow looping the Freifunk network at my end.
However, I can guarantee that only eth1 is online on my WAN side, that eth0 is on its own switch port, and that the two never connect at any point. And it is somewhat odd that the WAN interface br-wan sees its own source address on the WAN port (eth1).

Anyhow, something is still broken and I'll let the appliance switched off until further advice.

@neocturne
Member

@nmaas87, could you post the complete outputs of ip a, brctl show and batctl if?

@nmaas87
Author

nmaas87 commented Sep 21, 2015

@NeoRaider Will do. Is there any easy way to access that box from wan/lan interface? Can I enable ssh login via console? (Would make things easier) ( I think I will have time to get the outputs this evening :))

@dracoTrier
Contributor

Sure, during config mode you can enable SSH and set a password or an SSH key for login.

@nmaas87
Author

nmaas87 commented Sep 21, 2015

@NeoRaider Okay, now comes the entertaining part: I installed the VM again from scratch, and this time I got errors about duplicate MAC addresses on the client side (eth0). It turns out the VM fried one of the FFTR supernodes due to a MAC address collision, so I shut it down for good. Somehow the VM got the same MAC as the supernode; however, the addresses randomly generated by VMWare for eth0 and eth1 are not identical to the supernode's addresses and should not cause this problem...

@dracoTrier
Contributor

My conclusion: the current Gluon x86vm is buggy. Needs more testing.

@adlerweb
Contributor

@nmaas87 as already said on the forum: I don't think this would help much. It could be possible to get rid of the bridge on the WAN interface (but this would make switching mesh-on-wan a nightmare), but since we are using L2-Routing it is not really possible to avoid promisc on the LAN interface

@nmaas87
Author

nmaas87 commented Sep 21, 2015

@adlerweb That is OK, I found out that you can enable promiscuous mode on a per-interface basis instead of for the whole switch. Which is fine by me :).

@neocturne
Member

Is it possible that your image was not clean when you experienced the MAC collisions? Gluon derives all MAC addresses from the address of the LAN interface (eth0) on first boot. If the image has already been booted once, it should not be migrated to another host, as the MAC addresses won't be regenerated.
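
The consequence of deriving every address from one primary address on first boot can be sketched in Python. The derivation below (hash the primary MAC together with an index, then mark the result as locally administered) is a hypothetical stand-in and is not claimed to be Gluon's actual scheme; `derive_mac`, the interface names and the example addresses are assumptions. The point is only that stored addresses travel with a copied image.

```python
import hashlib


def derive_mac(primary_mac: str, index: int) -> str:
    """Derive a stable, locally administered unicast MAC (illustrative only)."""
    digest = hashlib.sha256(f"{primary_mac.lower()}/{index}".encode()).digest()
    octets = bytearray(digest[:6])
    octets[0] = (octets[0] | 0x02) & 0xFE  # set local bit, clear multicast bit
    return ":".join(f"{o:02x}" for o in octets)


# First boot on host A: addresses are derived once and stored in the image.
stored = {name: derive_mac("00:0c:29:aa:bb:cc", i)
          for i, name in enumerate(["br-wan", "mesh", "client"])}

# Copying the already-booted image to host B keeps `stored` unchanged,
# even though the hypervisor gives the new VM a different eth0 address.
clone = dict(stored)
collision = clone["br-wan"] == stored["br-wan"]

# A clean image would derive fresh addresses from the new eth0 MAC instead.
fresh = derive_mac("00:0c:29:dd:ee:ff", 0)
```

Under these assumptions, two VMs started from the same already-booted image present identical derived addresses to the mesh, while two VMs started from a clean image do not.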

@nmaas87
Author

nmaas87 commented Sep 23, 2015

That is possible for the MAC collisions on WAN. However, after I set up the VM again from scratch, it worked, except for collisions on the LAN side, which killed the FFTR supernode along the way. And that is surely not supposed to happen in production :/.

2015-09-23 2:12 GMT+02:00 Matthias Schiffer notifications@github.com:

Is it possible that your image was not clean when you experienced the MAC
collisions? Gluon derives all MAC addresses from the address of the LAN
interface (eth0) on first boot. If the image has already been booted once,
it should not be migrated to another host, as the MAC addresses won't be
regenerated.



@neocturne
Member

I currently see three options to set up the WAN MAC address:

  1. Always explicitly set the WAN MAC address (current solution). Needs promiscuous mode permission in VMware even without mesh-on-wan. Simple to set up, won't lead to address conflicts (if the virtual NICs' MAC addresses are generated randomly by VMware).
  2. Only set the WAN MAC address explicitly when mesh-on-wan is enabled. MAC address conflicts on the WAN interface are only relevant when mesh-on-wan is enabled. Will make enabling/disabling mesh-on-wan more complex, as it can't be configured using a single UCI option anymore. Won't be a problem for the expert mode interface, but the current command line commands won't work anymore.
  3. Don't set the WAN MAC address at all (instead take the primary address from the eth1 MAC address). Probably not a good idea. Using eth1 won't work when there's only eth0; using eth1 seems arbitrary; would be a VMware-specific hack.

My current plan is keeping the current solution 1 for Gluon 2015.2, and switching to 2 as soon as #284 is solved.
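
The three options can be written down as a small decision function. This is only a sketch of the reasoning in the list above: the function names, the integer option codes and the example address are hypothetical, and nothing here reflects how Gluon actually stores or applies the setting.

```python
from typing import Optional


def wan_mac_to_set(option: int, mesh_on_wan: bool, derived_mac: str) -> Optional[str]:
    """Return the MAC to set explicitly on br-wan, or None to keep the NIC's own.

    option 1: always set the derived address (current behaviour).
    option 2: set it only while mesh-on-WAN is enabled.
    option 3: never set it; the NIC's address is used as-is.
    """
    if option == 1:
        return derived_mac
    if option == 2:
        return derived_mac if mesh_on_wan else None
    if option == 3:
        return None
    raise ValueError(f"unknown option: {option}")


def needs_promiscuous_mode(explicit_mac: Optional[str]) -> bool:
    """Following the list above: VMware must allow it whenever the MAC is overridden."""
    return explicit_mac is not None


DERIVED = "02:11:22:33:44:55"  # made-up example address
for option in (1, 2, 3):
    for mesh in (False, True):
        mac = wan_mac_to_set(option, mesh, DERIVED)
        print(option, mesh, mac, needs_promiscuous_mode(mac))
```

The table this prints makes the trade-off visible: only option 2 with mesh-on-wan disabled, and option 3, avoid overriding the address, at the cost described in the list.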

@tcatm

tcatm commented Oct 26, 2015

I think the plan is reasonable.

@neocturne neocturne modified the milestones: next, 2015.2 Oct 26, 2015
@neocturne neocturne modified the milestones: network-rewrite, 2016.2 Feb 24, 2016
@FFS-Roland

On VMware ESXi you need an additional setting when using promiscuous mode if the vSwitch has more than one physical NIC connected: "Net.ReversePathFwdCheckPromisc" must be set to 1. You will find it under Configuration - Software - Advanced Settings.

@nmaas87
Author

nmaas87 commented Oct 3, 2016

@FFS-Roland Perfect Answer. I tried this now with 0.8.4 Gluon (from Freifunk Trier) - worked like a charm 👍

@nmaas87
Author

nmaas87 commented Oct 3, 2016

I documented the whole thing on https://www.nico-maas.de/?p=1320 :). Maybe this helps someone in the future.

@Ranlvor
Contributor

Ranlvor commented Oct 3, 2016

Trier 0.8.4 is gluon 2016.1.6-3-g9300421, it's just 2016.1.6 + ee597c6 + Webinterface-color-patches

@rotanid rotanid added the 2. status: blocked Marked as blocked because it's waiting on something label Oct 13, 2017
@rotanid rotanid removed the 2. status: blocked Marked as blocked because it's waiting on something label Feb 17, 2018
@rotanid rotanid added the 9. meta: known issue Known issue which should be mentioned in release notes label Jun 5, 2018
SvenRoederer added a commit to SvenRoederer/freifunk-gluon_core that referenced this issue Sep 29, 2019
830440d nodogsplash: Backport Version 4.0.1. (freifunk-gluon#494)
a93e684 alfred: Merge bugfixes from 2019.3
6ea9e9b batctl: Upgrade hardif settings patches to upstream version
d65d6f1 batctl: Merge bugfixes from 2019.3
9d559fd batman-adv: Merge bugfixes from 2019.3
784ae0e Merge pull request freifunk-gluon#496 from ecsv/batadv-for-19.07
@mweinelt mweinelt changed the title gluon-0.7.2-x86-vmware does not get dhcp on wan interface Gluon on VMWare does not get DHCP on WAN interface Mar 2, 2020
neocturne added a commit that referenced this issue May 8, 2020
As a partial fix to #496, do not touch the MAC address of the WAN
interface when using VXLANs (as only the MAC address of the VXLAN
interface matters to batman-adv).
@neocturne
Member

A partial fix for this issue now exists in #2015.

mweinelt pushed a commit that referenced this issue May 12, 2020
As a partial fix to #496, do not touch the MAC address of the WAN
interface when using VXLANs (as only the MAC address of the VXLAN
interface matters to batman-adv).
@RalfJung
Contributor

RalfJung commented Jun 19, 2021

> A partial fix for this issue now exists in #2015.

FWIW, this also has some unfortunate side-effects: now the MAC can change when switching between a legacy non-vxlan domain and a modern vxlan domain. It would be better if the MAC would be consistent in all our domains, even if that MAC might not be the one printed on the box.
The current situation means that domain migrations can lead to MAC address changes which can lead to nodes being unable to reconnect (e.g. when their IP changes since the DHCP server does not recognize them any more, and the firewall then blocks them).
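
The failure mode described here can be sketched with a toy DHCP server that keys its leases on the client MAC. Everything in this example is hypothetical (the class, the documentation prefix 192.0.2.0/24 and both addresses); it only shows why a node that changes its MAC looks like a new client and receives a different IP, which a firewall rule written for the old IP would then no longer match.

```python
from ipaddress import IPv4Address, IPv4Network
from typing import Dict, Iterator


class ToyDhcpServer:
    """Hands out addresses keyed by client MAC (hypothetical, simplified)."""

    def __init__(self, network: str) -> None:
        self._free: Iterator[IPv4Address] = iter(IPv4Network(network).hosts())
        self._leases: Dict[str, IPv4Address] = {}

    def request(self, mac: str) -> IPv4Address:
        """Return the existing lease for a known MAC, else allocate the next address."""
        key = mac.lower()
        if key not in self._leases:
            self._leases[key] = next(self._free)
        return self._leases[key]


server = ToyDhcpServer("192.0.2.0/24")
before = server.request("02:11:22:33:44:55")  # MAC used in the legacy domain
again = server.request("02:11:22:33:44:55")   # same MAC: same address
after = server.request("00:0c:29:aa:bb:cc")   # MAC after the domain migration

allowed_by_firewall = {before}                # rule written for the old address
still_allowed = after in allowed_by_firewall
```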

@neocturne
Member

> > A partial fix for this issue now exists in #2015.
>
> FWIW, this also has some unfortunate side-effects: now the MAC can change when switching between a legacy non-vxlan domain and a modern vxlan domain. It would be better if the MAC would be consistent in all our domains, even if that MAC might not be the one printed on the box.
> The current situation means that domain migrations can lead to MAC address changes which can lead to nodes being unable to reconnect (e.g. when their IP changes since the DHCP server does not recognize them any more, and the firewall then blocks them).

That is unfortunate, but at some point the migration needs to happen if we want to fix this at all. Support for non-VXLAN domains will be dropped from Gluon eventually, at which point things will be consistent again.

@RalfJung
Contributor

> Support for non-VXLAN domains will be dropped from Gluon eventually

Oh, that's interesting news. We don't really have plans currently for changing our legacy domain (and our numbering scheme would make it hard to do so...).

> That is unfortunate, but at some point the migration needs to happen if we want to fix this at all.

It would have helped if we could configure this in the site.conf, then the migration can be independent of whether a domain is using vxlan or not. But I guess that's water under the bridge at this point.
