kernel: 4.2.x infinite loop with bond interfaces and bridge fdb command #980
Comments
There's a very easy way to reproduce this:
#!/bin/bash
set -x
modprobe bonding
modprobe dummy numdummies=2
echo "+bond0" > /sys/class/net/bonding_masters
echo "+dummy0" > /sys/class/net/bond0/bonding/slaves
echo "+dummy1" > /sys/class/net/bond0/bonding/slaves
bridge fdb
|
That's the TL;DR. Now the details: It looks like in 4.2 the bonding driver started using the fdb ops from switchdev, which return EOPNOTSUPP. This error value is propagated to the main fdb dump function as the idx value, which is not expected to be negative, and is forwarded to netlink. On a 4.1 kernel idx is always > 0. This code also changed between 4.2 and upstream tip. I'm looking into how this could be fixed. I'm not sure whether the callbacks should be made to never return an error, or whether rtnl_fdb_dump should check for negative values before assigning to idx. |
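To make the two options above concrete, here is a minimal user-space sketch in C. It is not kernel code: the callback type and every function name in it (`fdb_dump_fn`, `dump_stub_error`, `dump_stub_passthrough`, `dump_two_entries`, `walk_unguarded`, `walk_guarded`) are hypothetical and exist only for illustration. It assumes each per-device dump callback takes the running index and returns the updated one, as described above.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-device fdb dump callback:
 * takes the running index, returns the updated index. */
typedef int (*fdb_dump_fn)(int idx);

/* Models a stub that reports "not supported" instead of an index. */
static int dump_stub_error(int idx)
{
	(void)idx;
	return -EOPNOTSUPP;
}

/* Models option 1: the stub never fails and passes the index through. */
static int dump_stub_passthrough(int idx)
{
	return idx;
}

/* Models a device that dumped two entries. */
static int dump_two_entries(int idx)
{
	return idx + 2;
}

/* Assigns every callback result to idx, so a negative errno
 * ends up where a non-negative index is expected. */
static int walk_unguarded(const fdb_dump_fn *devs, size_t n)
{
	int idx = 0;

	for (size_t i = 0; i < n; i++)
		idx = devs[i](idx);
	return idx;
}

/* Models option 2: check for a negative value before assigning to idx. */
static int walk_guarded(const fdb_dump_fn *devs, size_t n)
{
	int idx = 0;

	for (size_t i = 0; i < n; i++) {
		int ret = devs[i](idx);

		if (ret >= 0)
			idx = ret;
	}
	return idx;
}
```

With the device list `{ dump_two_entries, dump_stub_error }`, `walk_unguarded` ends with a negative idx, while `walk_guarded` keeps the last valid index. Swapping in `dump_stub_passthrough` keeps idx valid even without the guard.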
It looks like the main issue has been fixed in 4.3. However, that works only if you have CONFIG_NET_SWITCHDEV turned on. Otherwise, you'll still get an error from the unimplemented switchdev_port_obj_dump, which returns an error instead of returning an index. Let's see what netdev has to say about the patch below:
diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index bc865e2..bc5765a 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -323,7 +323,7 @@ static inline int switchdev_port_fdb_dump(struct sk_buff *skb,
 					  struct net_device *filter_dev,
 					  int idx)
 {
-	return -EOPNOTSUPP;
+	return idx;
 }
 static inline void switchdev_port_fwd_mark_set(struct net_device *dev,
|
Fix was applied to the net-next kernel. |
@dtatulea thank you, sir! I've pulled both the patches in coreos/coreos-overlay#1640, just in case we enable CONFIG_NET_SWITCHDEV. |
/cc @dtatulea
I've discovered that the issue appeared in Linux kernel 4.2.x. It caused the flannel OOM issue coreos/flannel#367, which has already been fixed in flannel but not in the kernel.
It is possible to reproduce the issue by running this script: https://github.com/kayrus/scripts/blob/master/deploy_ubuntu_cluster.sh (ssh ubuntu@ubuntu1 with password: passw0rd)
Just run
bridge fdb
and it will run forever. The problem is probably somewhere here