set tx_queuelen to 0 when create veth device #193
Conversation
@hustcat Thanks! But your signature is not good for Travis. There should be
@LK4D4 Should I recreate the PR?
@hustcat no, you can just do
@LK4D4 Thank you! I have done it.
Thanks for the contribution. I am going to run it by some networking experts and get back. I just want to be sure that there are no side effects if we disable queueing. |
Sorry about the delay. I have reached out again and expect a response soon. |
This seems to be a kernel bug; I posted a fix: http://patchwork.ozlabs.org/patch/396190/ It should be safe to use your workaround until the fix appears in a kernel "close-to-you"...
Thanks @netoptimizer :) Really appreciate you looking into this! |
@crosbymichael Do we want to wait for the kernel fix or merge this in for now? |
My kernel fix was rejected. You might still be able to take advantage of this, but you don't want to put this change into a release without adding other changes too... The main problem is that we/you are breaking userspace setups that add a qdisc to a veth device, because some of these qdiscs inherit and use the tx_queue_len as their packet limit (and qdiscs with a zero queue length cause packet drops). If you are in control of the userspace config, as you are, the fix is simply to set the tx_queue_len (ifconfig txqueuelen) back if you will be adding a real qdisc to the "veth" device.
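A minimal sketch of the restore step Jesper describes, run before attaching a real qdisc (the device name `veth0` and the htb parameters are illustrative, not from this PR; requires root):

```shell
# Restore a sane queue length first: qdiscs such as pfifo, bfifo and htb
# copy tx_queue_len as their packet limit at attach time, so attaching one
# while the length is 0 would silently drop packets.
ip link set dev veth0 txqueuelen 1000

# Now a real qdisc (e.g. htb for bandwidth shaping) gets a usable limit.
tc qdisc add dev veth0 root handle 1: htb default 10
```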
Maybe we could add a parameter, just like '--mtu'. |
Jesper, thanks for the update. |
@hustcat I think adding a flag may mean going down the path of introducing settings for qdisc. I think that we may be okay to choose a default of no queueing here since we don't expose NET_ADMIN by default so most of the users won't be touching this setting. @crosbymichael WDYT? |
@hustcat @mrunalp @crosbymichael I think we should set the tx_queuelen to 0 if the qdisc on the device is pfifo_fast. We should also make this tx_queuelen configurable in libcontainer to expose it in the future as a configurable setting. |
@unclejack As per the tc man page, an interface's default qdisc is pfifo_fast, so I don't think that we need to add that check. I don't mind the idea of adding a configurable txqueuelen.
@mrunalp The default qdisc could be something else, not pfifo_fast. Either way, I think this PR is a good start. This could be improved in future PRs if there's a need. |
@unclejack Agreed |
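The qdisc check discussed above can be done with tc (the interface name `veth0` is illustrative):

```shell
# Show the qdisc currently attached to the interface; on kernels of this
# era the default root qdisc is pfifo_fast (or mq on multiqueue devices).
tc qdisc show dev veth0
```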
@unclejack can you help with testing this and give it a LGTM if seems right to you? If we want to make this a setting in the network config then now is the time to do it right and not wait. |
I have pushed an image mrunalp/fedora-netperf to help test this. |
I tested the changes with libcontainer/nsinit and they LGTM. @hustcat Could you add a tx_queuelen setting in the network configuration? |
@mrunalp, I have added the code for tx_queuelen setting. |
Thanks! One small nit -- could you use the full name txqueuelen instead of txquelen? I will test this first thing tomorrow morning.
OK @mrunalp, I have done it.
@unclejack it is very dangerous to set txqueuelen to zero on a pfifo_fast qdisc; it will cause packet drops! I'm a little afraid that you have not 100% understood the hack involved in getting the qdisc "noqueue" attached to a device, which is what you are trying to achieve... This "noqueue" qdisc can only be seen by looking at the output of the ip link listing.

Notice: the "noqueue" qdisc MUST only be used for virtual/software devices (i.e. those which have an underlying real device that will take care of queueing if needed).

Important: the hack only works if tx_queue_len is zero when the device is brought up, that is, before it gets assigned a queue disc (the ip link listing would have the dev "marked" with the qdisc "noop"). After the device is up (and it has already been assigned the default qdisc "pfifo_fast", or "mq", the multiqueue version of pfifo_fast), setting tx_queue_len=0 is dangerous.

The next problem that you need to handle is: when a device with "noqueue" (this also includes VLANs) wants to have a real qdisc attached, you need to restore tx_queue_len=1000, or else e.g. bandwidth shaping on that device will be wrong/broken.
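The creation-time ordering Jesper describes can be sketched as follows (the names `veth0`/`veth1` are hypothetical; requires root):

```shell
# Create the veth pair with tx_queue_len=0 BEFORE it is brought up, so
# qdisc init attaches "noqueue" instead of the default pfifo_fast.
ip link add dev veth0 txqueuelen 0 type veth peer name veth1

# Bring the device up; it now gets the "noqueue" qdisc.
ip link set dev veth0 up

# Verify: the listing should show "qdisc noqueue" for veth0.
ip link show dev veth0
```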
@netoptimizer Thanks for clearing that up. I was under the impression that setting the tx_queue_len to 0 during setup was equivalent to changing it later and that the qdisc would also get disabled. |
@hustcat This PR needs to be rebased against master. It can't be merged as it is. (I'm not a maintainer, but that's still a problem which needs to be addressed) |
@unclejack good thing I stopped you then! @hustcat @LK4D4 @mrunalp Why don't you simply introduce a fake qdisc called "noqueue" that you allow users to configure? And I'll teach you an ugly "tc" hack-dance that allows you to get "noqueue" on a device which is already up, if you promise to hide/shield it from the users. This "noqueue" setting would also allow you to restore the txqueuelen when a user requests to change the qdisc away from "noqueue".
@netoptimizer The intention is to have this as an option in the code, not as a user-facing setting. We should set the veths to "noqueue" by default and allow that to be changed later via a tc script. How does that sound to you?
@unclejack sounds good with a "noqueue" option for veth. You should consider "marking" other virtual interfaces (like VLANs) with the same "noqueue" config option. Here is the "noqueue" qdisc dance for a device which is already up:

```shell
# Change to a non-default qdisc (here one not depending on tx_queue_len)
tc qdisc replace dev eth1 root pfifo limit 42
# Change tx_queue_len to zero
ifconfig eth1 txqueuelen 0
# Delete the root qdisc, resulting in "noqueue" because txqueuelen was zero
tc qdisc del dev eth1 root
# Verify the qdisc changed to "noqueue" by listing with:
ip link show eth1
```

We are abusing the function attach_one_default_qdisc().
@netoptimizer Your method is very clever. But I think only veth needs the txqueuelen handling; the other virtual interfaces don't need to care about the txqueuelen setting. Of course, if they really do need it, that can be handled separately. Since each interface type is implemented with its own Create function, the other virtual interfaces can simply ignore txqueuelen.
@netoptimizer Thank you for taking the time to explain this. I think we've got a very clear view now. Based on what you've told us regarding switching to "noqueue" and initializing a veth pair with tx_queue_len=0, this is what we need to do:
tx_queue_len is already made configurable by this PR.
Requiring the tc "noqueue dance" to switch veths to noqueue feels wrong; this should be the default. @crosbymichael LGTM
@netoptimizer Thanks for your invaluable inputs again :) |
@mrunalp @unclejack Should I re-clone from docker/libcontainer, then patch it; or merge from docker/libcontainer, then resolve conflicts? |
I suggest using a branch for any work instead of using your master. It makes life much easier.
Signed-off-by: Ye Yin <hustcat@gmail.com>
@mrunalp Thank you, but I don't know why the Travis CI build failed.
LGTM, but this needs commits squashed. |
@unclejack I acknowledge that this is a kernel bug, as userspace should not need to jump through these hoops to get the expected behaviour when attaching a qdisc. I'm going to track this in Red Hat Bugzilla: BZ 1152231
Closing in favor of #221 |
It is a clear misconfiguration to attach a qdisc to a device with tx_queue_len zero, because some qdiscs (namely pfifo, bfifo, gred, htb, plug and sfb) inherit/copy this value as their queue length.

Why should the kernel catch such a misconfiguration? Because prior to introducing the IFF_NO_QUEUE device flag, userspace found a loophole in the qdisc config system that allowed it to achieve the equivalent of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from a device. The loophole on older kernels is setting tx_queue_len=0 *prior* to device qdisc init (the config time is significant; simply setting tx_queue_len=0 doesn't trigger the loophole).

This loophole is currently used by Docker[1] to get better performance and scalability out of the veth device. The Docker developers were warned[1] that they needed to adjust the tx_queue_len if ever attaching a qdisc. The OpenShift project didn't remember this warning and attached a qdisc; this was caught and fixed in [2].

[1] docker-archive/libcontainer#193
[2] openshift/origin#11126

Instead of fixing every userspace program that used this loophole and forgot to reset the tx_queue_len prior to attaching a qdisc, let's catch the misconfiguration on the kernel side.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A veth network device creates only one qdisc queue by default, and this becomes a performance bottleneck. If tx_queuelen is set to 0 when the veth device is created, the kernel will not create a qdisc queue for it, which improves network performance.
perf outputs are as follows:
```
Samples: 1M of event 'cycles', Event count (approx.): 237661920980
```
As we can see, spin lock consume a lot of CPU. When set tx_queuelen to 0, the result are as follows:
```
Samples: 3M of event 'cycles', Event count (approx.): 464264275376
br_dev_queue_push_xmit
br_forward_finish
__br_forward
br_forward
br_handle_frame_finish
br_handle_frame
__netif_receive_skb
process_backlog
net_rx_action
__do_softirq
call_softirq
```
CPU consumption by spin locks decreased. I tested with netperf/netserver in TCP_RR mode, and performance improved from 46W+ to 70W+ transactions (where "W" is the Chinese 万, ten thousand, i.e. from roughly 460,000+ to 700,000+).
Signed-off-by: Ye Yin <hustcat@gmail.com>
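The TCP_RR measurement described above can be reproduced with a command pair along these lines (the host address `192.168.1.2` and the 60-second duration are illustrative, not from this PR):

```shell
# On the server side (behind the veth under test), start the listener:
netserver

# On the client side, run a request/response transaction-rate test;
# compare transactions/s with txqueuelen 1000 vs 0 on the veth.
netperf -H 192.168.1.2 -t TCP_RR -l 60
```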