daemon: Skip devices without hardware address during device detection #12321

Merged
joestringer merged 1 commit into master from pr/pchaigno/device-detection-skip-no-hw-addr on Jun 29, 2020

@pchaigno pchaigno commented Jun 29, 2020

We need NodePort and direct routing devices to have a MAC address. If they don't, init.sh fails with the following error:

level=warning msg="+ for NATIVE_DEV in ${NATIVE_DEVS//;/ }" subsys=datapath-loader
level=warning msg="++ cat /sys/class/net/lo/ifindex" subsys=datapath-loader
level=warning msg="+ IDX=1" subsys=datapath-loader
level=warning msg="++ ip link show lo" subsys=datapath-loader
level=warning msg="++ grep ether" subsys=datapath-loader
level=warning msg="++ awk '{print $2}'" subsys=datapath-loader
level=warning msg="+ MAC=" subsys=datapath-loader
level=error msg="Error while initializing daemon" error="exit status 1" subsys=daemon
level=fatal msg="Error while creating daemon" error="exit status 1" subsys=daemon
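
To illustrate why `MAC=` ends up empty in the trace above, here is a minimal Go sketch that approximates the `grep ether | awk '{print $2}'` step. The function name `macFromIPLink` and both sample `ip link show` outputs are assumptions made up for this illustration; they are not taken from init.sh.

```go
package main

import (
	"fmt"
	"strings"
)

// macFromIPLink approximates `grep ether | awk '{print $2}'`: it returns the
// second whitespace-separated field of the first line containing "ether",
// or "" if no line matches.
func macFromIPLink(output string) string {
	for _, line := range strings.Split(output, "\n") {
		if !strings.Contains(line, "ether") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) >= 2 {
			return fields[1]
		}
	}
	return ""
}

// Illustrative outputs only (assumed, not captured from a real machine).
// The loopback device reports "link/loopback", so nothing matches "ether".
const loOutput = `1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00`

const ethOutput = `3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 08:00:27:aa:bb:cc brd ff:ff:ff:ff:ff:ff`

func main() {
	fmt.Printf("lo MAC=%q\n", macFromIPLink(loOutput))
	fmt.Printf("enp0s8 MAC=%q\n", macFromIPLink(ethOutput))
}
```

Any device whose link line does not read `link/ether` (loopback here, and presumably tunnel-style devices as well) would produce the same empty result.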

Thus, we need to skip auto-detected devices that don't have a MAC address. This commit implements that and was tested by injecting a loopback interface with an IP address in the code, in the dev. VM:

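// Test-only injection: pretend the loopback device owns a node IP address.
loAddr, err := netlink.ParseAddr("192.168.33.11/32")
if err == nil {
    loAddr.LinkIndex = 1 // ifindex 1 is lo (see IDX=1 in the log above)
    addrs = append(addrs, *loAddr)
}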
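
The skip described above can be sketched roughly as follows. This is not the code from the commit: the `device` type, `skipDevicesWithoutMAC`, and the sample data are hypothetical, and how the real detection decides that a device lacks a hardware address is not shown here.

```go
package main

import (
	"fmt"
	"net"
)

// device is a hypothetical stand-in for an auto-detected network device.
type device struct {
	Name         string
	HardwareAddr net.HardwareAddr
}

// skipDevicesWithoutMAC returns the names of the candidates that have a
// non-empty hardware address and logs the ones it drops.
func skipDevicesWithoutMAC(candidates []device) []string {
	var kept []string
	for _, d := range candidates {
		if len(d.HardwareAddr) == 0 {
			fmt.Printf("skipping %s: no hardware address\n", d.Name)
			continue
		}
		kept = append(kept, d.Name)
	}
	return kept
}

// sampleDevices mirrors the test setup: an injected loopback device without
// a MAC address and enp0s8 with a made-up one.
var sampleDevices = []device{
	{Name: "lo"},
	{Name: "enp0s8", HardwareAddr: net.HardwareAddr{0x08, 0x00, 0x27, 0xaa, 0xbb, 0xcc}},
}

func main() {
	fmt.Println(skipDevicesWithoutMAC(sampleDevices))
}
```

With the sample data, only `enp0s8` survives the filter, which matches the outcome reported for the dev VM test.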

Fixes: #12228
Fixes: #12304
Fixes: #11894
/cc @brb

Fix failure to start agent when detected devices don't have hardware addresses

We need NodePort and direct routing devices to have a MAC address. If
they don't, init.sh fails with the following error:

    level=warning msg="+ for NATIVE_DEV in ${NATIVE_DEVS//;/ }" subsys=datapath-loader
    level=warning msg="++ cat /sys/class/net/lo/ifindex" subsys=datapath-loader
    level=warning msg="+ IDX=1" subsys=datapath-loader
    level=warning msg="++ ip link show lo" subsys=datapath-loader
    level=warning msg="++ grep ether" subsys=datapath-loader
    level=warning msg="++ awk '{print $2}'" subsys=datapath-loader
    level=warning msg="+ MAC=" subsys=datapath-loader
    level=error msg="Error while initializing daemon" error="exit status 1" subsys=daemon
    level=fatal msg="Error while creating daemon" error="exit status 1" subsys=daemon

Thus, we need to skip auto-detected devices that don't have a MAC
address. This commit implements that and was tested by injecting a
loopback interface with an IP address in the code, in the dev. VM:

    loAddr, err := netlink.ParseAddr("192.168.33.11/32")
    if err == nil {
        loAddr.LinkIndex = 1
        addrs = append(addrs, *loAddr)
    }

Fixes: #12228
Fixes: #12304
Fixes: 6730d0f ("daemon: Extend BPF NodePort device auto-detection")
Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno pchaigno added the kind/bug, sig/loader, area/daemon, release-note/bug, and needs-backport/1.8 labels Jun 29, 2020
@pchaigno pchaigno requested review from borkmann and a team June 29, 2020 13:34
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.1 Jun 29, 2020
@pchaigno pchaigno changed the title from "daemon: Skip devices without hw address during device detection" to "daemon: Skip devices without hardware address during device detection" Jun 29, 2020
@pchaigno
Member Author

test-me-please

@coveralls

Coverage Status

Coverage increased (+0.005%) to 36.94% when pulling 29f5e31 on pr/pchaigno/device-detection-skip-no-hw-addr into 48f8e79 on master.

@joestringer
Member

Were you able to reproduce this locally to validate the fix?

Member

@borkmann borkmann left a comment

LGTM, we can revisit adding support by adding an all-zero HW address later (plus checking that redirect does the right thing).

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge label Jun 29, 2020
@pchaigno
Member Author

> Were you able to reproduce this locally to validate the fix?

Yes, I reproduced it by adding a bit of code to inject a loopback device with an IP address such that it would be selected by the device detection (see "tested by injecting a loopback interface with an IP address in the code, in the dev. VM" in the OP). I got the same error as reported by users. With this fix applied, the loopback device is excluded from detection, enp0s8 is selected instead, and there is no error.

If preferred, I think I could now reproduce it without adding code, since I understand the different steps of the device detection (I think I need to assign a specific IP address to an interface with an index higher than enp0s8's; otherwise it's overwritten).

> LGTM, we can revisit adding support by adding an all-zero HW address later (plus checking that redirect does the right thing).

One thing that's important to note here is that I expect Cilium will still fail to start if a user explicitly configures a device without a HW address. This PR only fixes the detection.

My rationale is that excluding devices without HW addresses is a good way to avoid corner cases. If a user purposely wants to use a device without a HW address (common case is WireGuard), they should set it explicitly and we will need to provide a proper fix such as the all-zero HW address you mention.
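
For reference, the all-zero HW address idea mentioned above could look roughly like the following Go sketch. It is not part of this PR, `macOrZero` and `sampleMAC` are hypothetical names, and whether redirect behaves correctly with such an address is exactly the open question raised in the review.

```go
package main

import (
	"fmt"
	"net"
)

// macOrZero returns hw unchanged when it is present, and a 6-byte all-zero
// address otherwise (hypothetical fallback, not implemented by this PR).
func macOrZero(hw net.HardwareAddr) net.HardwareAddr {
	if len(hw) == 0 {
		return make(net.HardwareAddr, 6)
	}
	return hw
}

// sampleMAC is a made-up address used only for the demonstration below.
var sampleMAC = net.HardwareAddr{0x08, 0x00, 0x27, 0xaa, 0xbb, 0xcc}

func main() {
	fmt.Println(macOrZero(nil))       // device without a hardware address
	fmt.Println(macOrZero(sampleMAC)) // regular Ethernet device
}
```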

There's a bit more work required for that fix (in particular, need to reproduce and maybe document) and I wanted to get the device-detection fix out quickly since a lot of users seem to be hitting that. Maybe I should also exclude devices explicitly set by users if they don't have a HW address now? Unless we expect to have a fix for that soon and it's not worth it?

@joestringer
Member

I agree with fixing the auto-detection for most users where they're just incidentally hitting this without explicitly specifying devices, hence why I'm happy to get this in as-is.

In a lot of cases today, Cilium fails out early to signal to users that the configuration is wrong. In this case, I think the problem is detected too late and the log messages are uninterpretable, so at a minimum it'd be nice to add a check to the --devices side of this that fails out with a clear message. It could certainly be argued that we should instead warn loudly in the logs and push through without configuring such devices. I guess it depends on how explicitly we expect users to be specifying such devices.
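
A check of that kind on the --devices side might be sketched as below. Everything here is an assumption for illustration: `validateDevices`, `lookupFunc`, `fakeLookup`, the device name `wg0`, and the error wording are hypothetical and do not reflect existing Cilium code.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// lookupFunc resolves a device name to its hardware address (hypothetical
// abstraction so the sketch does not depend on real interfaces).
type lookupFunc func(name string) (net.HardwareAddr, error)

// validateDevices fails early, with a readable message, if any explicitly
// configured device is missing or has no hardware address.
func validateDevices(names []string, lookup lookupFunc) error {
	for _, name := range names {
		hw, err := lookup(name)
		if err != nil {
			return fmt.Errorf("device %q: %w", name, err)
		}
		if len(hw) == 0 {
			return fmt.Errorf("device %q has no hardware address and cannot be used for NodePort or direct routing", name)
		}
	}
	return nil
}

// fakeLookup simulates one Ethernet device and one WireGuard-style device
// without a hardware address.
func fakeLookup(name string) (net.HardwareAddr, error) {
	switch name {
	case "enp0s8":
		return net.HardwareAddr{0x08, 0x00, 0x27, 0xaa, 0xbb, 0xcc}, nil
	case "wg0":
		return nil, nil
	}
	return nil, errors.New("no such device")
}

func main() {
	fmt.Println(validateDevices([]string{"enp0s8"}, fakeLookup))
	fmt.Println(validateDevices([]string{"enp0s8", "wg0"}, fakeLookup))
}
```

The alternative raised in the same comment, warning loudly and continuing without such devices, would replace the second `return` with a log line and a `continue`.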

@joestringer joestringer merged commit 089060b into master Jun 29, 2020
@joestringer joestringer deleted the pr/pchaigno/device-detection-skip-no-hw-addr branch June 29, 2020 20:18
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.1 Jun 30, 2020
@joestringer joestringer moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.1 Jun 30, 2020
Successfully merging this pull request may close these issues.

Error while creating daemon when NodePort device is TUN or WireGuard interface
Failure on first import
5 participants