Sockets Used Rises, Network Dies, Fixed by running tcpdump #2572

Open
privatwolke opened this Issue Mar 26, 2019 · 2 comments

@privatwolke

commented Mar 26, 2019

Issue Report

Bug

I'm creating this issue based on a ServerFault question I posted (linked below). Short summary:

We run a small Elasticsearch cluster based on CoreOS/Azure. This setup has been working almost entirely maintenance-free for over 1.5 years. Since March 11th (which is when 2023.5.0 was released to the stable channel), we have been seeing nodes enter a strange state after running for a few days.

"Strange state" means that network communication from the affected node to the other two nodes (which are deployed in the same subnet) breaks down. The machine is still reable from different subnets/peered virtual networks and has (very spotty) Internet connection with most requests timing out.

From monitoring the node we could see that the "sockets used" metric reported by /proc/net/sockstat rises from about 300 on a healthy node to over 4k. We were unable to lower this number by terminating processes (we shut down the Elasticsearch container and even the entire Docker runtime, as well as several other processes).
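
For reference, this is roughly how we watch the counter (a sketch; it assumes the usual /proc/net/sockstat layout where the line reads "sockets: used <N>"):

$ grep '^sockets:' /proc/net/sockstat                 # whole line, e.g. "sockets: used <N>"
$ awk '/^sockets:/ {print $3}' /proc/net/sockstat     # just the number, handy for graphing
$ watch -n 10 'grep "^sockets:" /proc/net/sockstat'   # keep an eye on it while the node degrades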

When we tried to run tcpdump on the machine, the metric immediately dropped back to a low level and network connectivity was restored. We then ran tcpdump on another affected machine through gdb with breakpoints set on all syscalls. It turned out that what actually "fixes" the issue is tcpdump, via libpcap, creating a packet capture socket and turning on promiscuous mode.
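
For anyone who wants to retrace this, the gdb session looked roughly like the sketch below (eth0 is just an example interface; "catch syscall" with no argument stops on every syscall, so it takes quite a few continues to reach the interesting socket/setsockopt calls):

$ gdb --args tcpdump -i eth0 -c 1
(gdb) catch syscall     # break on every syscall tcpdump/libpcap makes
(gdb) run
(gdb) continue          # repeat while watching /proc/net/sockstat in another shell;
                        # the counter dropped once libpcap had opened the capture
                        # socket and enabled promiscuous mode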

[Screenshot: "sockets used" metric from our monitoring]

We have been in contact with Azure about this, and they are as confused as we are. I have noticed that CoreOS 2023.5.0 included a kernel upgrade from 4.14.96 to 4.19.25, which seems to include lots of changes to hv_netvsc (the driver used by the Azure/Hyper-V provided network interface).
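
If anyone wants to review those changes: the driver lives under drivers/net/hyperv/ in the kernel tree, so against a linux-stable checkout that has both tags something like this should list them (a sketch; the range also picks up the 4.19.x stable backports):

$ git log --oneline v4.14.96..v4.19.25 -- drivers/net/hyperv/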

For full details please check out: https://serverfault.com/questions/959833
I have also captured system stats: https://gist.github.com/privatwolke/e7e2e7eb0272787765f5d3726f37107c

Container Linux Version

$ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=2023.5.0
VERSION_ID=2023.5.0
BUILD_ID=2019-03-09-0138
PRETTY_NAME="Container Linux by CoreOS 2023.5.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

What hardware/cloud provider/hypervisor is being used to run Container Linux?

Azure (seen on both Standard_L4s and Standard_DS2_v2 machines)

Expected Behavior

Network keeps running smoothly and without interruptions.

Actual Behavior

"sockets used" metric rises and network becomes unusable

@josephtsalisbury

commented Apr 16, 2019

Have you found a way to consistently reproduce this issue, or does it only happen sporadically?

@dimaslv

commented Apr 17, 2019

We have a similar problem with a "sockets: used" leak in /proc/net/sockstat after switching the default_qdisc from fq to mq on kernel 4.19.30 + CentOS 7.6.
But we don't have connectivity problems, and the counters keep growing even after running tcpdump.
I assume it could be related to the in-kernel TCP pacing feature, since that takes over when fq is switched off.
And in our case it is reproducible.

Haven't found any other similar reports, so just FYI.
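
In case it helps anyone compare setups, this is roughly how we inspect and switch the qdisc (a sketch; eth0 is just an example interface, and changing net.core.default_qdisc only affects qdiscs created afterwards):

$ sysctl net.core.default_qdisc          # "mq" for us now, "fq" before the change
$ tc qdisc show dev eth0                 # what the interface is actually using
$ sysctl -w net.core.default_qdisc=fq    # revert the default for new qdiscs
$ tc qdisc replace dev eth0 root fq      # apply fq to an existing interface right away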
