Add MII monitoring during bond creation #6598

Closed
wants to merge 1 commit

Conversation

MatteoManzoni

This PR adds the miimon polling rate during creation of the bond interface. The missing miimon configuration prevents failover of the interface.

This fix can be improved: the selection of the monitoring type and monitoring rate is currently static, so an addition to the UI could be useful for selecting the desired monitoring type (ARP or MII), polling rate, and/or ARP target.

This issue is referenced here in the forum

@pcbsd-commit-bot

Can one of the admins verify this patch?

@william-gr
Member

ok to test

@themylogin
Contributor

@amotin asking for your help here: what is the FreeBSD behavior? I don't know much about the subject, but it seems we should detect and use either ARP or MII monitoring, depending on the driver and hardware.

@themylogin themylogin requested a review from amotin March 11, 2021 12:18
@yocalebo
Contributor

I'm curious how this affects LACP (802.3ad) type bonds.

@MatteoManzoni
Author

Hi @yocalebo, miimon can be, and is recommended to be, used with LACP. In this case the LACP slave can go down if the MII monitoring reports the interface down or if the LACP PDUs aren't received. The difference is that the LACP PDU is sent every 30s (slow rate) or 1s (fast rate), while the MII status is polled every 100ms.

The following is example output from an LACP bond with miimon (under Proxmox, Debian-based), from cat /proc/net/bonding/bondX:

Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2+3 (2)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 48:df:37:6b:86:18
Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 2
        Actor Key: 15
        Partner Key: 32802
        Partner Mac Address: 00:23:04:ee:be:02

Slave Interface: ens1f1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 48:df:37:6b:86:18
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 48:df:37:6b:86:18
    port key: 15
    port priority: 255
    port number: 1
    port state: 61
details partner lacp pdu:
    system priority: 4096
    system mac address: 00:23:04:ee:be:02
    oper key: 32802
    port priority: 32768
    port number: 290
    port state: 61

Slave Interface: eno6
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 20:67:7c:ee:c1:74
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 48:df:37:6b:86:18
    port key: 15
    port priority: 255
    port number: 2
    port state: 61
details partner lacp pdu:
    system priority: 4096
    system mac address: 00:23:04:ee:be:02
    oper key: 32802
    port priority: 32768
    port number: 16674
    port state: 61

@yocalebo
Contributor

@MatteoManzoni thanks for the detailed information. Apologies, I should have clarified my vague question. My main question is: does the MII monitoring supersede the LACP PDU settings?

@MatteoManzoni
Author

No problem @yocalebo, I'll try to articulate my answer better.

MII monitoring works in conjunction with the LACP PDUs; they operate at different levels. MII checks whether the carrier is present (Layer 1). The LACP PDUs, on the other hand, communicate how the peers are load balancing the traffic, along with other aggregation data (Layer 2).

E.g. 1: the link is up (the MII monitor senses the carrier) but misbehaving (wrong LACP PDUs/no LACP PDUs); the resulting LACP slave is down.
E.g. 2: the link is down (no carrier); the resulting LACP slave is down.

Let me know if I was clearer this time.

@yocalebo
Contributor

@MatteoManzoni very much appreciate the clear answer. Anyway, this sounds like MII is (loosely) the equivalent of BFD for routing?

@MatteoManzoni
Author

@yocalebo I think LACP is more like BFD; MII is more like me going into the datacenter and looking at the cable to see if it is connected to my server (10 times a second, or even more often). There is no session establishment between the peers.

@MatteoManzoni
Author

Hi all,
Could you give me some feedback on this PR?

@william-gr william-gr requested review from yocalebo and removed request for william-gr and amotin March 19, 2021 11:59
@yocalebo
Contributor

@MatteoManzoni I agree with adding this; however, what happens in the rare circumstance that the underlying NIC does not support it? Am I right in understanding that this option must be supported by the NIC's driver? I realize that most modern NICs will support this, but TrueNAS is installed on a large assortment of devices in the wild 😄 I'm mostly just curious what will happen if we try to enable this on a NIC that doesn't support MII link monitoring.

@MatteoManzoni
Author

Hi @yocalebo, I've done some quick research into MII support across in-tree kernel drivers, and I didn't find any instance of a driver not supporting MII. I'll try disabling MII at the firmware level on an old ConnectX NIC and report back what happens.

@amotin
Contributor

amotin commented Mar 22, 2021

@amotin asking for your help here: what is FreeBSD behavior?

FreeBSD NIC drivers do not expose hardware MII interfaces directly (MII/GMII/etc. are really hardware interfaces, and there are plenty of NICs not using them), so there is no concept of an MII monitoring interval in FreeBSD LAGG. Instead, every NIC driver that is able to detect link presence and speed just reports that information to the network stack in an abstracted way. The LAGG code uses those abstracted link status change reports. LACP uses that in combination with PDU reception, since an immediate report from the NIC hardware via interrupt is always faster than a PDU timeout or even link status polling.

According to https://www.kernel.org/doc/Documentation/networking/bonding.txt, the use_carrier option controls whether the MII or the abstracted netif_carrier_ok() interface should be used; the latter should be the default, while MII is deprecated. I'd investigate whether the NIC really doesn't implement the proper KPI. I don't like the idea of polling in general.
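For reference, both knobs can be inspected through sysfs on a Linux bond (a sketch, assuming a bond named bond0 already exists):

cat /sys/class/net/bond0/bonding/use_carrier  # 1 = netif_carrier_ok() (the default), 0 = deprecated MII/ethtool ioctls
cat /sys/class/net/bond0/bonding/miimon       # link polling interval in ms; 0 = link monitoring disabled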

@william-gr
Member

Given what Alexander said, I am inclined to close this PR, unless you have different thoughts @MatteoManzoni

@MatteoManzoni
Author

Hi all, sorry for the delay; after some time off I forgot about the issue. I've reimaged my cluster (with the newly released 21.04) and the problem is still present. Only adding miimon 100 and installing ifenslave with apt made my failover bond work.
I agree with @amotin, so I think the reason the bond doesn't work is the missing ifenslave package. @william-gr, for me the issue is still present, and I'm available if more information is needed.

Unrelated: I've noticed that on 2 different machines the computed MAC address of the bond is the same, causing collisions, as shown in the attached screenshot.
[Screenshot attached: Screen Shot 2021-04-24 at 15 59 58]

@william-gr
Member

Do you mean the ifenslave package alone is enough? Or do you need both that and the mii change?

Perhaps more investigation is needed upstream into why it doesn't work without mii.

Have you ever installed other OSes? I wonder how they handle it.

@MatteoManzoni
Author

Hi, I've run some tests on a dev server with a ConnectX-4 and Ubuntu 20.04. With netplan I created an active-backup bond, and it didn't work either. Once I installed the ifenslave package the bond became operational, even without setting a miimon rate.

I'll try reimaging my TrueNAS SCALE cluster to test this theory.

@MatteoManzoni
Author

I've reimaged the cluster; the problem is resolved after installing the ifenslave package.

@william-gr
Member

To make it clear: all we need to do is include that package in the base system, and this change is no longer required?

@MatteoManzoni
Author

After the tests I've done, yes.

themylogin added a commit that referenced this pull request May 5, 2021
@themylogin themylogin closed this May 5, 2021
themylogin added a commit that referenced this pull request May 5, 2021
@Hyffer

Hyffer commented Feb 4, 2023

Sorry to bother you guys, but it seems the problem still exists. I searched the Internet and found little information, so I decided to leave a comment here continuing the discussion.

System info
TrueNAS-SCALE-22.12.0
According to commit 48ee577, ifenslave is now included in the base system.

How to reproduce
Add a link aggregation network interface (in my case named "bond0") with the protocol set to FAILOVER (in the Web UI or the Console setup menu).

In a Linux shell, run cat /proc/net/bonding/bond0; it shows MII Polling Interval (ms): 0. So if I unplug an ethernet cable, although the system detects a device-down event (as shown in dmesg), the output still says MII Status: up under that slave interface, and the bond cannot switch to the backup network card.
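The check, concretely (a sketch; the bond name bond0 matches this report):

grep "MII Polling Interval" /proc/net/bonding/bond0
# prints "MII Polling Interval (ms): 0" when link monitoring is disabled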

My temporary workaround
Through the Linux sysfs virtual filesystem, network interface parameters can be edited manually. Changing the content of the file /sys/class/net/bond0/bonding/miimon from 0 to 100 solves the problem. (It is definitely not a recommended method.)
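In shell terms the workaround amounts to the following (a sketch; the change is applied at runtime and does not survive a reboot):

echo 100 > /sys/class/net/bond0/bonding/miimon  # enable MII polling every 100 ms (was 0 = disabled)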


And as for:

Hi, I've run some tests on a dev server with a ConnectX-4 and Ubuntu 20.04. With netplan I created an active-backup bond, and it didn't work either. Once I installed the ifenslave package the bond became operational, even without setting a miimon rate.

I don't know the details. But when using netplan to create a bond interface, if mii-monitor-interval is not specified, it is set to 100 by default (tested on Ubuntu 22.04). So using netplan without setting a miimon rate should just work.
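For illustration, a minimal netplan bond definition (a sketch; the member interface names eno1/eno2 are placeholders, and mii-monitor-interval is spelled out even though netplan would default it to 100):

network:
  version: 2
  ethernets:
    eno1: {}
    eno2: {}
  bonds:
    bond0:
      interfaces: [eno1, eno2]
      parameters:
        mode: active-backup
        mii-monitor-interval: 100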

However, when using the ip link command, which is the way TrueNAS SCALE manages link aggregation, if the miimon rate is not specified, 0 is set as the default (tested on Ubuntu 22.04).
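A sketch of the equivalent iproute2 commands (interface names are placeholders); note that without the explicit miimon argument the interval stays at 0:

ip link add bond0 type bond mode active-backup miimon 100
ip link set eno1 down && ip link set eno1 master bond0  # slaves must be down before enslaving
ip link set eno2 down && ip link set eno2 master bond0
ip link set bond0 up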

@VladFlorinIlie

The problem still exists, which makes the LACP Failover mechanism (as used by SCALE) unusable.
