This repository has been archived by the owner on Dec 13, 2018. It is now read-only.

set tx_queuelen to 0 when creating veth device #193

Closed
wants to merge 5 commits

Conversation

hustcat

@hustcat hustcat commented Sep 17, 2014

By default, a veth network device is created with a single qdisc queue, which can become a performance bottleneck. If tx_queuelen is set to 0 when the veth device is created, the kernel will not create a qdisc queue for the device, which improves network performance.
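For illustration, a minimal sketch of the same idea using the third-party vishvananda/netlink library (the PR itself patches libcontainer's own netlink bindings; the device names here are made up):

```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// Create a veth pair with tx_queue_len = 0. Because the value is
	// zero at creation time, the kernel skips attaching the default
	// pfifo_fast qdisc and the device comes up with "noqueue".
	veth := &netlink.Veth{
		LinkAttrs: netlink.LinkAttrs{
			Name:   "veth0", // illustrative name
			TxQLen: 0,       // the crux of this PR
		},
		PeerName: "veth1", // illustrative name
	}
	if err := netlink.LinkAdd(veth); err != nil {
		log.Fatalf("failed to create veth pair: %v", err)
	}
}
```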

perf outputs are as follows:

Samples: 1M of event 'cycles', Event count (approx.): 237661920980

  • 13.97% [kernel] [k] _spin_lock
  • _spin_lock
    • 37.64% dev_queue_xmit
      • 74.94% br_dev_queue_push_xmit
      • 25.06% neigh_resolve_output
    • 19.41% try_to_wake_up
    • 17.95% sch_direct_xmit
      • 74.23% __qdisc_run
        • 97.47% dev_queue_xmit
          • 84.43% br_dev_queue_push_xmit
          • 15.57% neigh_resolve_output
        • 2.53% net_tx_action
      • 25.77% dev_queue_xmit
        • 68.81% br_dev_queue_push_xmit
        • 31.19% neigh_resolve_output
    • 17.53% task_rq_lock
    • 3.71% mod_timer
    • 0.86% enqueue_to_backlog
  • 3.82% [kernel] [k] update_curr
  • 1.95% netserver [.] 0x000000000001dd31
  • 1.86% [kernel] [k] update_cfs_shares
  • 1.81% [kernel] [k] tcp_ack
  • 1.76% [kernel] [k] schedule
  • 1.74% [kernel] [k] _spin_lock_irq
  • 1.67% [kernel] [k] dev_queue_xmit

As we can see, spin locks consume a lot of CPU. With tx_queuelen set to 0, the results are as follows:

Samples: 3M of event 'cycles', Event count (approx.): 464264275376

  • 8.23% [kernel] [k] _spin_lock
  • _spin_lock
    • 54.43% dev_queue_xmit
      br_dev_queue_push_xmit
      br_forward_finish
      __br_forward
      br_forward
      br_handle_frame_finish
      br_handle_frame
      __netif_receive_skb
      process_backlog
      net_rx_action
      __do_softirq
      call_softirq
    • 25.78% sch_direct_xmit
      • 86.24% __qdisc_run
        • 95.66% dev_queue_xmit
        • 4.34% net_tx_action
      • 13.76% dev_queue_xmit
    • 7.12% task_rq_lock
    • 3.56% try_to_wake_up
    • 2.29% mod_timer
    • 2.04% ipt_do_table
    • 1.57% enqueue_to_backlog
    • 0.60% tcp_v4_rcv
  • 2.28% netserver [.] 0x000000000001a0ca
  • 2.13% [kernel] [k] tcp_ack
  • 2.00% [kernel] [k] update_curr
  • 1.68% [igb] [k] igb_poll

CPU consumption in spin locks decreased. I tested with netperf/netserver in TCP_RR mode, and performance improved from 460,000+ to 700,000+ transactions per second.

Signed-off-by: Ye Yin hustcat@gmail.com

@LK4D4
Contributor

LK4D4 commented Sep 17, 2014

@hustcat Thanks! But your signature is not valid for Travis. There should be <> around the email. I personally generate signatures with git commit -s

@hustcat
Author

hustcat commented Sep 17, 2014

@LK4D4 Should I recreate the PR?

@LK4D4
Contributor

LK4D4 commented Sep 17, 2014

@hustcat no, you can just amend and push -f.

@hustcat
Author

hustcat commented Sep 17, 2014

@LK4D4 Thank you! I have done it.

@mrunalp
Contributor

mrunalp commented Sep 18, 2014

Thanks for the contribution. I am going to run it by some networking experts and get back. I just want to be sure that there are no side effects if we disable queueing.

@mrunalp
Contributor

mrunalp commented Oct 2, 2014

Sorry about the delay. I have reached out again and expect a response soon.

@netoptimizer

This seems to be a kernel bug; I have posted a fix: http://patchwork.ozlabs.org/patch/396190/

It should be safe to use your workaround until the fix appears in a kernel "close-to-you"...

@mrunalp
Contributor

mrunalp commented Oct 3, 2014

Thanks @netoptimizer :) Really appreciate you looking into this!

@mrunalp
Contributor

mrunalp commented Oct 3, 2014

@crosbymichael Do we want to wait for the kernel fix or merge this in for now?

@netoptimizer

My kernel fix was rejected.
http://thread.gmane.org/gmane.linux.network/333349/focus=333456

You might still be able to take advantage of this, but you don't want to put this change into a release without adding other changes too...

The main problem is that we/you are breaking userspace setups that add some qdisc to a veth device, because some of these qdiscs inherit and use the tx_queue_len as their packet limit (and qdiscs with a zero queue length cause packet drops). If you are in control of the userspace config, as you are, the fix is simply to set the tx_queue_len (ifconfig txqueuelen) if you will be adding a real qdisc to the "veth" device.
Is that understandable?

@hustcat
Author

hustcat commented Oct 5, 2014

Maybe we could add a parameter, just like '--mtu'.

@mrunalp
Contributor

mrunalp commented Oct 5, 2014

Jesper, thanks for the update.

@mrunalp
Contributor

mrunalp commented Oct 5, 2014

@hustcat I think adding a flag may mean going down the path of introducing settings for qdisc. I think we may be okay choosing a default of no queueing here, since we don't expose NET_ADMIN by default, so most users won't be touching this setting. @crosbymichael WDYT?

@unclejack
Contributor

@hustcat @mrunalp @crosbymichael I think we should set tx_queuelen to 0 if the qdisc on the device is pfifo_fast. We should also make tx_queuelen configurable in libcontainer so it can be exposed as a setting in the future.
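A rough sketch of what such a configurable setting could look like in the network config (the field and JSON tag names here are assumptions, not the PR's actual diff):

```go
package config

// Network describes a container network interface. TxQueueLen is the
// hypothetical new knob discussed above; the existing Mtu field is
// shown for comparison.
type Network struct {
	// Type is the network strategy, e.g. "veth" or "loopback".
	Type string `json:"type,omitempty"`

	// Mtu sets the maximum transmission unit on the device.
	Mtu int `json:"mtu,omitempty"`

	// TxQueueLen sets the transmit queue length. A value of 0 at
	// device-creation time yields the "noqueue" behavior.
	TxQueueLen int `json:"txqueuelen,omitempty"`
}
```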

@mrunalp
Contributor

mrunalp commented Oct 8, 2014

@unclejack As per the tc man page, an interface's default qdisc is pfifo_fast, so I don't think we need to add that check. I don't mind the idea of adding a configurable txqueuelen.

@unclejack
Contributor

@mrunalp The default qdisc could be something else, not pfifo_fast. Either way, I think this PR is a good start. This could be improved in future PRs if there's a need.

@mrunalp
Contributor

mrunalp commented Oct 8, 2014

@unclejack Agreed

@crosbymichael
Contributor

@unclejack can you help with testing this and give it a LGTM if it seems right to you? If we want to make this a setting in the network config, then now is the time to do it right and not wait.

@mrunalp
Contributor

mrunalp commented Oct 8, 2014

I have pushed an image mrunalp/fedora-netperf to help test this.

@mrunalp
Contributor

mrunalp commented Oct 8, 2014

I tested the changes with libcontainer/nsinit and they LGTM. @hustcat Could you add a tx_queuelen setting in the network configuration?

@hustcat
Author

hustcat commented Oct 9, 2014

@mrunalp, I have added the code for tx_queuelen setting.

@mrunalp
Contributor

mrunalp commented Oct 9, 2014

Thanks! One small nit -- could you use the full name txqueuelen instead of txquelen? I will test this first thing tomorrow morning.


@hustcat
Author

hustcat commented Oct 9, 2014

OK @mrunalp, I have done it.

@netoptimizer

@unclejack it is very dangerous to set txqueuelen to zero on a pfifo_fast qdisc; it will cause packet drops!

I'm a little afraid that you have not 100% understood the hack involved in getting the "noqueue" qdisc attached to a device, which is what you are trying to achieve...

This "noqueue" qdisc can only be seen by looking at the output from:
ip link | grep qdisc

Notice: the "noqueue" qdisc MUST only be used for virtual/software devices (i.e. devices which have an underlying real device that will take care of queueing if needed).

Important: The hack only works if tx_queue_len is zero when the device is brought up, that is, before it gets assigned a queueing discipline (the ip link listing would show the device "marked" with the qdisc "noop").

After the device is up (and it has already been assigned the default qdisc "pfifo_fast", or "mq", the multiqueue version of pfifo_fast), setting tx_queue_len=0 is dangerous.

The next problem you need to handle: when a device with "noqueue" (this also includes VLANs) needs a real qdisc attached, you must restore tx_queue_len=1000, or else e.g. bandwidth shaping on that device will be wrong/broken.
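In other words, any tooling that later attaches a real qdisc must first restore the queue length. A minimal sketch of that step using the vishvananda/netlink library (the library choice, device name, and the value 1000 are illustrative):

```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// Look up the "noqueue" veth device (name is illustrative).
	link, err := netlink.LinkByName("veth0")
	if err != nil {
		log.Fatalf("lookup failed: %v", err)
	}
	// Restore a sane tx_queue_len before attaching a real qdisc,
	// since several qdiscs copy this value as their packet limit.
	if err := netlink.LinkSetTxQLen(link, 1000); err != nil {
		log.Fatalf("failed to set txqueuelen: %v", err)
	}
	// Only now is it safe to attach e.g. a shaping qdisc via tc.
}
```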

@unclejack
Contributor

@netoptimizer Thanks for clearing that up. I was under the impression that setting the tx_queue_len to 0 during setup was equivalent to changing it later and that the qdisc would also get disabled.

@unclejack
Contributor

@hustcat This PR needs to be rebased against master. It can't be merged as it is. (I'm not a maintainer, but that's still a problem which needs to be addressed)

@netoptimizer

@unclejack good thing I stopped you then!

@hustcat @LK4D4 @mrunalp
AFAIU you are planning to expose txqueuelen as a user-configurable setting because you want to allow users to disable the qdisc on a device... which I think is the wrong approach and will cause support issues later.

Why don't you simply introduce a fake qdisc called "noqueue" that you allow users to configure?
This is really what you want to achieve, and then you can hide the ugly details of setting txqueuelen while the device is in the correct state.

And I'll teach you an ugly "tc" hack-dance that allows you to get "noqueue" on a device which is already up, if you promise to hide/shield it from the users.

This "noqueue" setting would also allow you to restore the txqueuelen, then some user request to change the qdisc away from "noqueue".

@unclejack
Contributor

@netoptimizer The intention is to have this as an option in the code, not as a user-facing setting. We should set the veths to "noqueue" by default and allow that to be changed later via a tc script.

How does that sound to you?

@netoptimizer

@unclejack sounds good with a "noqueue" option for veth.

You should consider "marking" other virtual interfaces (like VLANs) with the same "noqueue" config option.
The TC script should make sure to change the tx_queue_len to something other than zero (default 1000) when adding a new qdisc to a device that is in the "noqueue" state.

The "noqueue" qdisc dance, for a device which is already "up".
(notice this is unsafe for real devices) :

 # Change to a non-default qdisc (here one not depending on tx_queue_len)
 tc qdisc replace dev eth1 root pfifo limit 42

 # Change tx_queue_len to zero
 ifconfig eth1 txqueuelen 0

 # Delete root qdisc, resulting in "noqueue" because txqueuelen was zero
 tc qdisc del dev eth1 root

 # Verify the qdisc changed to "noqueue" by listing with:
 ip link show eth1

We are abusing the function attach_one_default_qdisc().
https://github.com/torvalds/linux/blob/master/net/sched/sch_generic.c#L726
And the (dev->tx_queue_len == 0) check in attach_default_qdiscs()
https://github.com/torvalds/linux/blob/master/net/sched/sch_generic.c#L752

@hustcat
Author

hustcat commented Oct 9, 2014

@netoptimizer Your method is very clever. But I think only veth needs the txqueuelen setting; the other virtual interfaces don't need to care about it. Of course, if they turn out to need it, we can make an exception.

Different interface types are implemented with different Create functions, so the other virtual interfaces can ignore txqueuelen.
https://github.com/docker/libcontainer/blob/master/namespaces/exec.go#L182
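A rough sketch of the per-type dispatch being described, with stand-in type names (the actual libcontainer interfaces may differ):

```go
package network

// Network and NetworkState stand in for libcontainer's config and
// runtime-state types; the names are assumed for illustration.
type Network struct {
	Type       string
	TxQueueLen int
}

type NetworkState struct{}

// NetworkStrategy is implemented once per interface type ("veth",
// "loopback", ...), so only the veth strategy needs to honor
// TxQueueLen; the other strategies can simply ignore it.
type NetworkStrategy interface {
	Create(config *Network, nspid int, state *NetworkState) error
}
```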

@unclejack
Contributor

@netoptimizer Thank you for taking the time to explain this. I think we've got a very clear view now.

Based on what you've told us regarding switching to "noqueue" and initializing a veth pair with tx_queue_len=0, this is what we need to do:

  1. make the tx_queue_len a configurable option in libcontainer itself; this would make it easy for projects like Docker which rely on libcontainer to set the txqueuelen to 0 or 1000. Having this as a configurable option for libcontainer would allow us to avoid having to make changes to libcontainer if we need to switch Docker's veths back to tx_queue_len=1000 for some reason.

tx_queue_len is already made configurable by this PR.

  2. we're not going to support changing qdiscs in any way in Docker containers (to keep things simple; this is advanced usage and DIY)

Requiring the tc "noqueue dance" to switch veths to noqueue feels wrong; this should be the default.
I think veths should have a tx_queue_len of 0 by default in the kernel. Whenever a qdisc is added to a veth device with tx_queue_len=0, the kernel should bump the tx_queue_len to 1000. I'm not sure whether the kernel could detect that a qdisc relies on the tx_queue_len, but I think having a sane default for this would be the right thing to do.

@crosbymichael
I have tested a "static" version of this patch which simply set tx_queue_len to 0 all the time (like the previous version of this PR), and that worked just fine. The veths had txqueuelen=0 and the "noqueue" qdisc.

LGTM

@mrunalp
Contributor

mrunalp commented Oct 9, 2014

@netoptimizer Thanks for your invaluable inputs again :)
@hustcat Please rebase to latest.

@hustcat
Author

hustcat commented Oct 10, 2014

@mrunalp @unclejack Should I re-clone from docker/libcontainer, then patch it; or merge from docker/libcontainer, then resolve conflicts?

@mrunalp
Contributor

mrunalp commented Oct 10, 2014

@hustcat

  1. Check if you have a remote for the upstream by doing a git remote -v. If not, then do
    git remote add upstream https://github.com/docker/libcontainer
  2. git fetch upstream
  3. git rebase upstream/master // deal with conflicts.
  4. git push origin master // assuming origin is pointing to your fork.

I suggest using a branch for any work, as opposed to using your master. It makes life much easier.

Signed-off-by: Ye Yin <hustcat@gmail.com>
Signed-off-by: Ye Yin <hustcat@gmail.com>
Signed-off-by: Ye Yin <hustcat@gmail.com>
Signed-off-by: Ye Yin <hustcat@gmail.com>
@hustcat
Author

hustcat commented Oct 10, 2014

@mrunalp Thank you, but I don't know why the Travis CI build failed?
----I have found the reason, please ignore it.

Signed-off-by: Ye Yin <hustcat@gmail.com>
@mrunalp
Contributor

mrunalp commented Oct 10, 2014

LGTM, but this needs commits squashed.

@netoptimizer

@unclejack I acknowledge that this is a kernel bug, as userspace should not need to jump through these hoops to get the expected behaviour when attaching a qdisc.

I'm going to track this in Red Hat bugzilla: BZ 1152231
https://bugzilla.redhat.com/show_bug.cgi?id=1152231
Titled: "qdisc: address unexpected behavior when attaching qdisc to virtual device"

@mrunalp
Contributor

mrunalp commented Oct 15, 2014

Closing in favor of #221

@mrunalp mrunalp closed this Oct 15, 2014
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Nov 3, 2016
It is a clear misconfiguration to attach a qdisc to a device with
tx_queue_len zero, because some qdiscs (namely pfifo, bfifo, gred,
htb, plug and sfb) inherit/copy this value as their queue length.

Why should the kernel catch such a misconfiguration?  Because prior to
introducing the IFF_NO_QUEUE device flag, userspace found a loophole
in the qdisc config system that allowed them to achieve the equivalent
of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from
a device.  The loophole on older kernels is setting tx_queue_len=0,
*prior* to device qdisc init (the config time is significant, simply
setting tx_queue_len=0 doesn't trigger the loophole).

This loophole is currently used by Docker[1] to get better performance
and scalability out of the veth device.  The Docker developers were
warned[1] that they needed to adjust the tx_queue_len if ever
attaching a qdisc.  The OpenShift project didn't remember this warning
and attached a qdisc; this was caught and fixed in [2].

[1] docker-archive/libcontainer#193
[2] openshift/origin#11126

Instead of fixing every userspace program that used this loophole and
forgot to reset the tx_queue_len prior to attaching a qdisc, let's
catch the misconfiguration on the kernel side.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Nov 8, 2016
hohoxu pushed a commit to hohoxu/n5kernel that referenced this pull request Aug 22, 2018