Sockets Used Rises, Network Dies, Fixed by running tcpdump #2572
I'm creating this issue based on this ServerFault question. Short summary:
We run a small Elasticsearch cluster based on CoreOS/Azure. This setup has been working almost entirely maintenance-free for over 1.5 years. Since March 11th (which is when 2023.5.0 was released to stable), we've been seeing nodes enter a strange state after running for a few days.
"Strange state" means that network communication from the affected node to the other two nodes (which are deployed in the same subnet) breaks down. The machine is still reachable from different subnets/peered virtual networks and has a (very spotty) Internet connection, with most requests timing out.
From monitoring the node we could see that the "sockets used" metric reported by
When we tried to run tcpdump on the machine, the metric immediately dropped to a low level and network connectivity was restored. We then ran tcpdump on another affected machine through gdb with breakpoints set on all syscalls. It turned out that what actually "fixes" the issue is tcpdump, through libpcap, creating a packet capture socket and turning on promiscuous mode.
We've been in contact with Azure about this, and they are as confused as we are. I have noticed that CoreOS 2023.5.0 included a kernel upgrade from 4.14.96 to 4.19.25, which seems to include lots of changes to hv_netvsc (the driver used by the network interface that Azure/Hyper-V provides).
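For anyone who wants to try the same thing without tcpdump, here is a minimal Python sketch of the two steps described above: opening a packet capture socket and enabling promiscuous mode on it. This is an approximation of what libpcap does on Linux, not a copy of its code. The function names are made up for illustration, the numeric constants are taken from `<linux/if_packet.h>` and `<linux/if_ether.h>`, and it has not been tested on an affected machine.

```python
import socket
import struct

# Constants from <linux/if_packet.h> and <linux/if_ether.h>.
SOL_PACKET = 263
PACKET_ADD_MEMBERSHIP = 1
PACKET_MR_PROMISC = 1
ETH_P_ALL = 0x0003


def build_packet_mreq(ifindex: int) -> bytes:
    """Pack a struct packet_mreq that requests promiscuous mode.

    struct packet_mreq {
        int            mr_ifindex;
        unsigned short mr_type;
        unsigned short mr_alen;
        unsigned char  mr_address[8];
    };
    """
    return struct.pack("iHH8s", ifindex, PACKET_MR_PROMISC, 0, b"")


def open_promiscuous_socket(ifname: str) -> socket.socket:
    """Open a packet socket on ifname and enable promiscuous mode.

    Linux only, and needs root (or CAP_NET_RAW). Promiscuous mode is
    dropped again when the returned socket is closed.
    """
    sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW,
                         socket.htons(ETH_P_ALL))
    sock.bind((ifname, 0))
    mreq = build_packet_mreq(socket.if_nametoindex(ifname))
    sock.setsockopt(SOL_PACKET, PACKET_ADD_MEMBERSHIP, mreq)
    return sock
```

Calling `open_promiscuous_socket("eth0")` as root and then closing the returned socket should roughly mimic the side effect of a short tcpdump run. The interface name `eth0` is an assumption, so substitute whatever the hv_netvsc interface is called on your machine.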
For full details please check out: https://serverfault.com/questions/959833
Container Linux Version
Environment
Azure (seen on both Standard_L4s and Standard_DS2_v2 machines)
Expected Behavior
Network keeps running smoothly and without interruptions.
Actual Behavior
The "sockets used" metric rises and the network becomes unusable.
We have a similar problem: a "sockets: used" leak in /proc/net/sockstat after switching the default qdisc (default_qdisc) from tc fq to tc mq on 4.19.30 + CentOS 7.6.
Haven't found any other similar reports, so just FYI.
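In case it helps others check whether they are hitting the same leak, here is a small Python sketch that reads the "sockets: used" counter from /proc/net/sockstat. The function names are just for illustration, and it assumes the usual layout of that file: one section per line, followed by alternating key/value pairs.

```python
def parse_sockstat(text: str) -> dict:
    """Parse /proc/net/sockstat content into {section: {key: int}}."""
    stats = {}
    for line in text.splitlines():
        section, _, rest = line.partition(":")
        fields = rest.split()
        if not section.strip() or not fields:
            continue  # skip blank or malformed lines
        # Fields alternate between names and integer values.
        stats[section.strip()] = {
            key: int(value) for key, value in zip(fields[::2], fields[1::2])
        }
    return stats


def sockets_used(path: str = "/proc/net/sockstat") -> int:
    """Return the current "sockets: used" value (Linux only)."""
    with open(path) as fh:
        return parse_sockstat(fh.read())["sockets"]["used"]
```

Logging `sockets_used()` every minute or so should make it obvious whether the counter climbs steadily while the TCP/UDP "inuse" numbers stay flat.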