Alarm "system.softnet_stat" is very strict. #1076

Closed
LukasMa opened this Issue Oct 5, 2016 · 18 comments


@LukasMa
Contributor
LukasMa commented Oct 5, 2016 edited

I find myself in the situation that I get the system.softnet_stat alarm most of the time. Here is a reminder of what I'm talking about:
[screenshot: the system.softnet_stat alarm]

So at first I thought I might have a serious issue with my server, but after reading the Red Hat guide linked on the dashboard I think the alarm is just "too strict". On page nine the guide says:

This value can be doubled if the 3rd column in /proc/net/softnet_stat is increasing, which indicates that the SoftIRQ did not get enough CPU time. Small increments are normal and do not require tuning.

On my two-core server, cat /proc/net/softnet_stat looks like this:

000008c5 00000000 00000006 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0000acd2 00000000 00000006 00000000 00000000 00000000 00000000 00000000 00000000 00000000
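Each line is one CPU and the values are hexadecimal; the third column is the squeezed counter the alarm watches. A small bash sketch to decode it (assuming the standard column layout):

# decode /proc/net/softnet_stat: one line per CPU, hex values,
# column 1 = processed packets, column 2 = dropped, column 3 = squeezed
cpu=0
while read -r processed dropped squeezed _; do
    printf 'cpu%d: processed=%d dropped=%d squeezed=%d\n' \
        "$cpu" "0x$processed" "0x$dropped" "0x$squeezed"
    cpu=$((cpu + 1))
done < /proc/net/softnet_stat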

Let's take a look at the graph:
[screenshot: the system.softnet_stat chart]

Most of the time the blue squeezed line is 0, and I have never seen a dropped packet. Let's zoom in a bit:
[screenshot: the chart zoomed in]

Ah, there it is: squeezed = 0.001 :)

Maybe the alarm should only trigger when the squeezed value is bigger than 1.

@ktsaou
Member
ktsaou commented Oct 5, 2016 edited

The value just prior to the 0.001 should be 0.999, so together they add up to one. These fractions appear because netdata calculates the rate (e.g. 1 packet in 1.1 seconds shows as 0.999 on the first second and 0.001 on the second).

If you look closer at the alarm in the netdata screenshot you posted, it says:

sum of all values of dimension squeezed, of chart system.softnet_stat, starting 1 hour ago and up to now, with options abs, unaligned.

So, it calculates the 1 hour sum. How much is it on your system?

[screenshot: the 1 hour sum of squeezed]

It is 10. So making this > 1 will not help.

When I realized this check is not easy to overcome (I have such a server too), I had 2 paths:

  1. loosen it so the alarm stops triggering, for example > 5/hour,

  2. or keep it complaining (there is something wrong when this happens: the system is not powerful enough for the work it has), but somehow work around the notifications. So I decided to add this:

    [screenshot: the alarm configuration]

    So, it is silent. It does not send any notifications; you see it only when you have the dashboard open.
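For reference, netdata alarms are defined in health.d/*.conf. This is not a verbatim copy of the shipped file, just a sketch of what such a silent check looks like; the lookup line mirrors the alarm description quoted above, while the warn threshold and the to: silent routing are illustrative:

 alarm: 1hour_netdev_budget_ran_outs
    on: system.softnet_stat
lookup: sum -1h unaligned absolute of squeezed
 units: events
 every: 1m
  warn: $this > 0
  info: number of times the netdev budget ran out during the last hour
    to: silent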

On one of my servers that had a lot of issues, including dropped packets, I had to install this as /etc/sysctl.d/99-network-tuning.conf:

# http://www.nateware.com/linux-network-tuning-for-2013.html
# Increase Linux autotuning TCP buffer limits
# Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE
# Don't set tcp_mem itself! Let the kernel scale it based on RAM.
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 40960

# cloudflare uses this for balancing latency and throughput
# https://blog.cloudflare.com/the-story-of-one-latency-spike/
net.ipv4.tcp_rmem = 4096 1048576 2097152

net.ipv4.tcp_wmem = 4096 65536 16777216

# Also increase the max packet backlog
net.core.netdev_max_backlog = 100000
net.core.netdev_budget = 50000

# Make room for more TIME_WAIT sockets due to more clients,
# and allow them to be reused if we run out of sockets
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 10

# Disable TCP slow start on idle connections
net.ipv4.tcp_slow_start_after_idle = 0

# If your servers talk UDP, also up these limits
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192
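If you install this file, the settings can be applied without a reboot:

# load every file under /etc/sysctl.d/
# (or just this one: sysctl -p /etc/sysctl.d/99-network-tuning.conf)
sysctl --system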

Using the above settings, I do not have dropped packets any more, and the budget issue is rare. For sure, the machine that produces these alarms is not powerful enough for what it does. I believe that if you check the softirq charts of your machine, you will see something like this on a core:

[screenshot: softirq chart of a single core at 100%]

This means the core was spending 100% of its time delivering packets, but it was still not fast enough to handle them all.
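To cross-check this outside netdata (just a suggestion, not something the alarm needs), per-core softirq time shows up as %soft in mpstat from the sysstat package:

# per-CPU utilisation once per second; a core stuck near 100 in the %soft
# column spends nearly all of its time in softirq (packet) processing
mpstat -P ALL 1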

Check also #984.

Of course, if you have better ideas, I am open...

@LukasMa
Contributor
LukasMa commented Oct 5, 2016

Thanks for the clarification. I just discovered the 0.999 values; I had to zoom in to detect them.

But this is still confusing: the server is a fresh installation, the CPU is idling most of the time (0-10%), and the system load is far below 0.5 at all times. There is barely any network traffic, just a very, very small amount.

I will apply some different kernel settings to tune the TCP/IP stack, see if that helps, and report back. Since this is a virtual server (KVM) and the host machine is shared with other VMs, this could be an issue.

@LukasMa LukasMa closed this Oct 5, 2016
@LukasMa LukasMa reopened this Oct 5, 2016
@ktsaou
Member
ktsaou commented Oct 5, 2016

Try the sysctl settings I gave you above. They may solve the issue.

@ktsaou
Member
ktsaou commented Oct 11, 2016

Did it solve it?
I am evaluating the idea of increasing the threshold, although I am not exactly sure about the value.

@ktsaou ktsaou added the question label Oct 11, 2016
@LukasMa
Contributor
LukasMa commented Oct 11, 2016

No, I tried multiple settings over the last week and didn't see any difference! I also haven't noticed any problems. I couldn't see that softnet_stat influences any other metrics/values.

@ktsaou
Member
ktsaou commented Oct 11, 2016

Is 10/hour a good threshold for your case?

@LukasMa
Contributor
LukasMa commented Oct 12, 2016

I can't really tell if it is useful to introduce a threshold just because there have been two issues about it. Can you give me some more time? I will spin up a DigitalOcean droplet with the same setup and do some testing.
I'm still thinking there might be an issue because I'm running in a KVM environment.

@ktsaou
Member
ktsaou commented Oct 12, 2016

ok. nice!

@kachkaev
kachkaev commented Oct 19, 2016 edited

I also see the 1hour netdev budget ran outs warning pretty much all the time. In the last hour there were only 2 squeezed events, which did not even match the peaks in the load.

[screenshot: netdata softirqs chart]

Apart from that, the server looks healthy and there is very little pressure on the resources overall.

[screenshot: the alarm]

How are your tests going @LukasMa?

@LukasMa
Contributor
LukasMa commented Oct 19, 2016 edited

I couldn't really isolate the issue. It is fascinating: on a $20 DigitalOcean VPS I have not seen a single squeezed event in the last week (6d4h to be honest)! But my "personal" server with comparable hardware, hosted by a different provider, is showing the 1hour netdev budget ran outs all the time. It seems to be exactly like your problem: little pressure on resources, but still ~3 events per hour. The highest value I witnessed was 11, but only once. The squeezed events don't correlate with load peaks, disk utilization, or other monitored stats.

@ktsaou
Member
ktsaou commented Oct 19, 2016

So, shall I set the threshold to > 10/hour?

@LukasMa
Contributor
LukasMa commented Oct 19, 2016

Might be worth a try 👍
But it still annoys me that I can't identify the issue. It might have something to do with my provider and their hardware/drivers (my VPS is KVM virtualized).

@ktsaou ktsaou added enhancement fixed and removed question labels Oct 23, 2016
@ktsaou
Member
ktsaou commented Oct 23, 2016

ok, did that.
I'll merge it later today.

@ktsaou ktsaou added a commit that closed this issue Oct 23, 2016
@ktsaou ktsaou softnet budget is now triggered when it is > 10/hour; fixes #1076 9920443
@ktsaou ktsaou closed this in 9920443 Oct 23, 2016
@Thumpxr
Thumpxr commented Dec 13, 2016

I'm getting the same errors:
[screenshot: the alarms]

Unfortunately I have 350+ events. My server is also virtualized; maybe there are underlying issues in the way the virtualization works?

@ktsaou
Member
ktsaou commented Dec 13, 2016

Have you tried increasing the budget via sysctl?
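For example (the values are the ones from the config posted earlier in this thread):

# check the current budget, then raise it at runtime; add the line to a file
# under /etc/sysctl.d/ to make it permanent across reboots
sysctl net.core.netdev_budget
sysctl -w net.core.netdev_budget=50000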

@Thumpxr
Thumpxr commented Dec 14, 2016

I did as you commented here, but I still get alarms, although fewer.
[screenshot: the alarms after tuning]

@ktsaou
Member
ktsaou commented Dec 14, 2016

yes, I know. This is why they are silent alarms.

If you have raised the budget enough (e.g. 50000 or 100000) and it still happens, I have concluded that it is an indication the machine is under-powered for what it does: the machine tries to dequeue packets from an ethernet device and hits this limit. So, it receives packets at a rate faster than it can process them.

If you have any suggestions, of course I am open to discuss them...

@LukasMa
Contributor
LukasMa commented Dec 14, 2016

@Thumpxr I only noticed this problem on a specific KVM machine and I guess it relates to virtualization. Netdata works without a problem on my RasPi 2 and a DigitalOcean VPS. So maybe you have the same problem.
