This repository has been archived by the owner on Dec 13, 2018. It is now read-only.

set tx_queuelen to 0 when creating veth device #193

Closed
wants to merge 5 commits

Conversation

hustcat

@hustcat hustcat commented Sep 17, 2014

By default, a veth network device is created with a single qdisc queue, which can become a performance bottleneck. If tx_queuelen is set to 0 when the veth device is created, the kernel will not create a qdisc queue for the device, which improves network performance.
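For illustration, a minimal sketch of the same idea using the third-party vishvananda/netlink library (the PR itself patches libcontainer's own netlink bindings; the device names here are made up):

```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// Create a veth pair with tx_queue_len = 0. Because the value is
	// zero at creation time, the kernel skips attaching the default
	// pfifo_fast qdisc and the device comes up with "noqueue".
	veth := &netlink.Veth{
		LinkAttrs: netlink.LinkAttrs{
			Name:   "veth0", // illustrative name
			TxQLen: 0,       // the crux of this PR
		},
		PeerName: "veth1", // illustrative name
	}
	if err := netlink.LinkAdd(veth); err != nil {
		log.Fatalf("failed to create veth pair: %v", err)
	}
}
```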

perf outputs are as follows:

Samples: 1M of event 'cycles', Event count (approx.): 237661920980

  • 13.97% [kernel] [k] _spin_lock
  • _spin_lock
    • 37.64% dev_queue_xmit
      • 74.94% br_dev_queue_push_xmit
      • 25.06% neigh_resolve_output
    • 19.41% try_to_wake_up
    • 17.95% sch_direct_xmit
      • 74.23% __qdisc_run
        • 97.47% dev_queue_xmit
          • 84.43% br_dev_queue_push_xmit
          • 15.57% neigh_resolve_output
        • 2.53% net_tx_action
      • 25.77% dev_queue_xmit
        • 68.81% br_dev_queue_push_xmit
        • 31.19% neigh_resolve_output
    • 17.53% task_rq_lock
    • 3.71% mod_timer
    • 0.86% enqueue_to_backlog
  • 3.82% [kernel] [k] update_curr
  • 1.95% netserver [.] 0x000000000001dd31
  • 1.86% [kernel] [k] update_cfs_shares
  • 1.81% [kernel] [k] tcp_ack
  • 1.76% [kernel] [k] schedule
  • 1.74% [kernel] [k] _spin_lock_irq
  • 1.67% [kernel] [k] dev_queue_xmit

As we can see, spin locks consume a lot of CPU. With tx_queuelen set to 0, the results are as follows:

Samples: 3M of event 'cycles', Event count (approx.): 464264275376

  • 8.23% [kernel] [k] _spin_lock
  • _spin_lock
    • 54.43% dev_queue_xmit
      br_dev_queue_push_xmit
      br_forward_finish
      __br_forward
      br_forward
      br_handle_frame_finish
      br_handle_frame
      __netif_receive_skb
      process_backlog
      net_rx_action
      __do_softirq
      call_softirq
    • 25.78% sch_direct_xmit
      • 86.24% __qdisc_run
        • 95.66% dev_queue_xmit
        • 4.34% net_tx_action
      • 13.76% dev_queue_xmit
    • 7.12% task_rq_lock
    • 3.56% try_to_wake_up
    • 2.29% mod_timer
    • 2.04% ipt_do_table
    • 1.57% enqueue_to_backlog
    • 0.60% tcp_v4_rcv
  • 2.28% netserver [.] 0x000000000001a0ca
  • 2.13% [kernel] [k] tcp_ack
  • 2.00% [kernel] [k] update_curr
  • 1.68% [igb] [k] igb_poll

CPU consumption in spin locks decreased. I tested with netperf/netserver in TCP_RR mode, and performance improved from 460,000+ to 700,000+ transactions per second.

Signed-off-by: Ye Yin hustcat@gmail.com

@LK4D4
Contributor

LK4D4 commented Sep 17, 2014

@hustcat Thanks! But your signature is not valid for Travis. There should be <> around the email. I personally generate signatures with git commit -s

@hustcat
Author

hustcat commented Sep 17, 2014

@LK4D4 Should I recreate the PR?

@LK4D4
Contributor

LK4D4 commented Sep 17, 2014

@hustcat no, you can just amend and push -f.

@hustcat
Author

hustcat commented Sep 17, 2014

@LK4D4 Thank you! I have done it.

@mrunalp
Contributor

mrunalp commented Sep 18, 2014

Thanks for the contribution. I am going to run it by some networking experts and get back. I just want to be sure that there are no side effects if we disable queueing.

@mrunalp
Contributor

mrunalp commented Oct 2, 2014

Sorry about the delay. I have reached out again and expect a response soon.

@netoptimizer

This seems to be a kernel bug; I have posted a fix: http://patchwork.ozlabs.org/patch/396190/

It should be safe to use your workaround until the fix appears in a kernel "close-to-you"...

@mrunalp
Contributor

mrunalp commented Oct 3, 2014

Thanks @netoptimizer :) Really appreciate you looking into this!

@mrunalp
Contributor

mrunalp commented Oct 3, 2014

@crosbymichael Do we want to wait for the kernel fix or merge this in for now?

@netoptimizer

My kernel fix was rejected.
http://thread.gmane.org/gmane.linux.network/333349/focus=333456

You might still be able to take advantage of this, but you don't want to put this change into a release without adding other changes too...

The main problem is that we/you are breaking userspace setups that add some qdisc to a veth device, because some of these qdiscs inherit and use the tx_queue_len as their packet limit (and qdiscs with a zero queue length cause packet drops). If you are in control of the userspace config, as you are, the fix is simply to set the tx_queue_len (ifconfig txqueuelen) if you will be adding a real qdisc to the "veth" device.
Is that understandable?

@hustcat
Author

hustcat commented Oct 5, 2014

Maybe we could add a parameter, just like '--mtu'.

@mrunalp
Contributor

mrunalp commented Oct 5, 2014

Jesper, thanks for the update.

@mrunalp
Contributor

mrunalp commented Oct 5, 2014

@hustcat I think adding a flag may mean going down the path of introducing settings for qdisc. I think we may be okay choosing a default of no queueing here, since we don't expose NET_ADMIN by default, so most users won't be touching this setting. @crosbymichael WDYT?

@unclejack
Contributor

@hustcat @mrunalp @crosbymichael I think we should set tx_queuelen to 0 if the qdisc on the device is pfifo_fast. We should also make tx_queuelen configurable in libcontainer so it can be exposed as a setting in the future.
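A rough sketch of what such a configurable setting could look like in the network config (the field and JSON tag names here are assumptions, not the PR's actual diff):

```go
package config

// Network describes a container network interface. TxQueueLen is the
// hypothetical new knob discussed above; the existing Mtu field is
// shown for comparison.
type Network struct {
	// Type is the network strategy, e.g. "veth" or "loopback".
	Type string `json:"type,omitempty"`

	// Mtu sets the maximum transmission unit on the device.
	Mtu int `json:"mtu,omitempty"`

	// TxQueueLen sets the transmit queue length. A value of 0 at
	// device-creation time yields the "noqueue" behavior.
	TxQueueLen int `json:"txqueuelen,omitempty"`
}
```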

@mrunalp
Contributor

mrunalp commented Oct 8, 2014

@unclejack As per the tc man page, an interface's default qdisc is pfifo_fast, so I don't think we need to add that check. I don't mind the idea of adding a configurable txqueuelen.

@unclejack
Contributor

@mrunalp The default qdisc could be something else, not pfifo_fast. Either way, I think this PR is a good start. This could be improved in future PRs if there's a need.

@mrunalp
Contributor

mrunalp commented Oct 8, 2014

@unclejack Agreed

@crosbymichael
Contributor

@unclejack can you help with testing this and give it a LGTM if it seems right to you? If we want to make this a setting in the network config, then now is the time to do it right and not wait.

@mrunalp
Contributor

mrunalp commented Oct 8, 2014

I have pushed an image mrunalp/fedora-netperf to help test this.

@mrunalp
Contributor

mrunalp commented Oct 8, 2014

I tested the changes with libcontainer/nsinit and they LGTM. @hustcat Could you add a tx_queuelen setting in the network configuration?

@hustcat
Author

hustcat commented Oct 9, 2014

@mrunalp, I have added the code for tx_queuelen setting.

@mrunalp
Contributor

mrunalp commented Oct 9, 2014

Thanks! One small nit -- could you use the full name txqueuelen instead of txquelen? I will test this first thing tomorrow morning.


@hustcat
Author

hustcat commented Oct 9, 2014

OK @mrunalp, I have done it.

@netoptimizer

@unclejack it is very dangerous to set txqueuelen to zero on a pfifo_fast qdisc; it will cause packet drops!

I'm a little afraid that you have not 100% understood the hack involved in getting the "noqueue" qdisc attached to a device, which is what you are trying to achieve...

This "noqueue" qdisc can only be seen by looking at the output from:
ip link | grep qdisc

Notice: the "noqueue" qdisc MUST only be used for virtual/software devices (i.e. devices which have an underlying real device that will take care of queueing if needed).

Important: The hack only works if tx_queue_len is zero when the device is brought up, that is, before it gets assigned a queueing discipline (the ip link listing would show the device "marked" with the qdisc "noop").

After the device is up (and it has already been assigned the default qdisc "pfifo_fast", or "mq", the multiqueue version of pfifo_fast), setting tx_queue_len=0 is dangerous.

The next problem you need to handle: when a device with "noqueue" (this also includes VLANs) needs a real qdisc attached, you must restore tx_queue_len=1000, or else e.g. bandwidth shaping on that device will be wrong/broken.
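In other words, any tooling that later attaches a real qdisc must first restore the queue length. A minimal sketch of that step using the vishvananda/netlink library (the library choice, device name, and the value 1000 are illustrative):

```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// Look up the "noqueue" veth device (name is illustrative).
	link, err := netlink.LinkByName("veth0")
	if err != nil {
		log.Fatalf("lookup failed: %v", err)
	}
	// Restore a sane tx_queue_len before attaching a real qdisc,
	// since several qdiscs copy this value as their packet limit.
	if err := netlink.LinkSetTxQLen(link, 1000); err != nil {
		log.Fatalf("failed to set txqueuelen: %v", err)
	}
	// Only now is it safe to attach e.g. a shaping qdisc via tc.
}
```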

@unclejack
Contributor

@netoptimizer Thanks for clearing that up. I was under the impression that setting the tx_queue_len to 0 during setup was equivalent to changing it later and that the qdisc would also get disabled.

@unclejack
Contributor

@hustcat This PR needs to be rebased against master. It can't be merged as it is. (I'm not a maintainer, but that's still a problem which needs to be addressed)

@netoptimizer

@unclejack good thing I stopped you then!

@hustcat @LK4D4 @mrunalp
AFAIU you are planning to expose txqueuelen as a user-configurable setting because you want to allow users to disable the qdisc on a device... which I think is the wrong approach and will cause support issues later.

Why don't you simply introduce a fake qdisc called "noqueue" that you allow users to configure?
This is really what you want to achieve, and then you can hide the ugly details of setting txqueuelen while the device is in the correct state.

And I'll teach you an ugly "tc" hack-dance that allows you to get "noqueue" on a device which is already up, if you promise to hide/shield it from the users.

This "noqueue" setting would also allow you to restore the txqueuelen, then some user request to change the qdisc away from "noqueue".

@unclejack
Contributor

@netoptimizer The intention is to have this as an option in the code, not as a user-facing setting. We should set the veths to "noqueue" by default and allow that to be changed later via a tc script.

How does that sound to you?

@netoptimizer

@unclejack sounds good with a "noqueue" option for veth.

You should consider "marking" other virtual interfaces (like VLANs) with the same "noqueue" config option.
The TC script should make sure to change the tx_queue_len to something other than zero (default 1000) when adding a new qdisc to a device that is in the "noqueue" state.

The "noqueue" qdisc dance, for a device which is already "up".
(notice this is unsafe for real devices) :

 # Change to a non-default qdisc (here one not depending on tx_queue_len)
 tc qdisc replace dev eth1 root pfifo limit 42

 # Change tx_queue_len to zero
 ifconfig eth1 txqueuelen 0

 # Delete root qdisc, resulting in "noqueue" because txqueuelen was zero
 tc qdisc del dev eth1 root

 # Verify the qdisc changed to "noqueue" by listing with:
 ip link show eth1

We are abusing the function attach_one_default_qdisc().
https://github.com/torvalds/linux/blob/master/net/sched/sch_generic.c#L726
And the (dev->tx_queue_len == 0) check in attach_default_qdiscs()
https://github.com/torvalds/linux/blob/master/net/sched/sch_generic.c#L752

@hustcat
Author

hustcat commented Oct 9, 2014

@netoptimizer Your method is very clever. But I think only veth needs the txqueuelen setting; the other virtual interfaces don't need to care about it. Of course, if they turn out to need it, we can make an exception.

Different interface types are implemented with different Create functions, so the other virtual interfaces can ignore txqueuelen.
https://github.com/docker/libcontainer/blob/master/namespaces/exec.go#L182
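A rough sketch of the per-type dispatch being described, with stand-in type names (the actual libcontainer interfaces may differ):

```go
package network

// Network and NetworkState stand in for libcontainer's config and
// runtime-state types; the names are assumed for illustration.
type Network struct {
	Type       string
	TxQueueLen int
}

type NetworkState struct{}

// NetworkStrategy is implemented once per interface type ("veth",
// "loopback", ...), so only the veth strategy needs to honor
// TxQueueLen; the other strategies can simply ignore it.
type NetworkStrategy interface {
	Create(config *Network, nspid int, state *NetworkState) error
}
```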

@unclejack
Contributor

@netoptimizer Thank you for taking the time to explain this. I think we've got a very clear view now.

Based on what you've told us regarding switching to "noqueue" and initializing a veth pair with tx_queue_len=0, this is what we need to do:

  1. make the tx_queue_len a configurable option in libcontainer itself; this would make it easy for projects like Docker which rely on libcontainer to set the txqueuelen to 0 or 1000. Having this as a configurable option for libcontainer would allow us to avoid having to make changes to libcontainer if we need to switch Docker's veths back to tx_queue_len=1000 for some reason.

tx_queue_len is already made configurable by this PR.

  2. we're not going to support changing qdiscs in any way in Docker containers (to keep things simple; this is advanced usage and DIY)

Requiring the tc "noqueue dance" to switch veths to noqueue feels wrong; this should be the default.
I think veths should have a tx_queue_len of 0 by default in the kernel. Whenever a qdisc is added to a veth device with tx_queue_len=0, the kernel should bump the tx_queue_len to 1000. I'm not sure whether the kernel could detect that a qdisc relies on the tx_queue_len, but I think having a sane default for this would be the right thing to do.

@crosbymichael
I have tested a "static" version of this patch which simply set tx_queue_len to 0 all the time (like the previous version of this PR), and that worked just fine. The veths had txqueuelen=0 and the "noqueue" qdisc.

LGTM

@mrunalp
Contributor

mrunalp commented Oct 9, 2014

@netoptimizer Thanks for your invaluable inputs again :)
@hustcat Please rebase to latest.

@hustcat
Author

hustcat commented Oct 10, 2014

@mrunalp @unclejack Should I re-clone from docker/libcontainer, then patch it; or merge from docker/libcontainer, then resolve conflicts?

@mrunalp
Contributor

mrunalp commented Oct 10, 2014

@hustcat

  1. Check if you have a remote for the upstream by doing a git remote -v. If not, then do
    git remote add upstream https://github.com/docker/libcontainer
  2. git fetch upstream
  3. git rebase upstream/master // deal with conflicts.
  4. git push origin master // assuming origin is pointing to your fork.

I suggest using a branch for any work, as opposed to using your master. It makes life much easier.

Signed-off-by: Ye Yin <hustcat@gmail.com>
Signed-off-by: Ye Yin <hustcat@gmail.com>
Signed-off-by: Ye Yin <hustcat@gmail.com>
Signed-off-by: Ye Yin <hustcat@gmail.com>
@hustcat
Author

hustcat commented Oct 10, 2014

@mrunalp Thank you, but I don't know why the Travis CI build failed?
----I have found the reason, please ignore it.

Signed-off-by: Ye Yin <hustcat@gmail.com>
@mrunalp
Contributor

mrunalp commented Oct 10, 2014

LGTM, but this needs commits squashed.

@netoptimizer

@unclejack I acknowledge that this is a kernel bug, as userspace should not need to jump through these hoops to get the expected behaviour when attaching a qdisc.

I'm going to track this in Red Hat bugzilla: BZ 1152231
https://bugzilla.redhat.com/show_bug.cgi?id=1152231
Titled: "qdisc: address unexpected behavior when attaching qdisc to virtual device"

@mrunalp
Contributor

mrunalp commented Oct 15, 2014

Closing in favor of #221

@mrunalp mrunalp closed this Oct 15, 2014
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Nov 3, 2016
It is a clear misconfiguration to attach a qdisc to a device with
tx_queue_len zero, because some qdiscs (namely pfifo, bfifo, gred,
htb, plug and sfb) inherit/copy this value as their queue length.

Why should the kernel catch such a misconfiguration?  Because prior to
introducing the IFF_NO_QUEUE device flag, userspace found a loophole
in the qdisc config system that allowed them to achieve the equivalent
of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from
a device.  The loophole on older kernels is setting tx_queue_len=0,
*prior* to device qdisc init (the config time is significant, simply
setting tx_queue_len=0 doesn't trigger the loophole).

This loophole is currently used by Docker[1] to get better performance
and scalability out of the veth device.  The Docker developers were
warned[1] that they needed to adjust the tx_queue_len if ever
attaching a qdisc.  The OpenShift project didn't remember this warning
and attached a qdisc; this was caught and fixed in [2].

[1] docker-archive/libcontainer#193
[2] openshift/origin#11126

Instead of fixing every userspace program that used this loophole and
forgot to reset the tx_queue_len prior to attaching a qdisc, let's
catch the misconfiguration on the kernel side.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Nov 8, 2016
hohoxu pushed a commit to hohoxu/n5kernel that referenced this pull request Aug 22, 2018