antrea-agent kept crashing on iptables error after restarting a Node with many Pods running on it #1499
Labels: kind/bug
Describe the bug
In large-scale clusters, the xtables lock may be held by kubelet, kube-proxy, or portmap for a long time, especially when there are many Service rules in the nat table, so antrea-agent may not be able to acquire the lock within a short time. If the agent blocks on the lock or exits, the CNI server won't be running, causing all CNI requests to fail.
If the Pods' restart policy is Always and there are dead Pods, the container runtime keeps retrying CNI calls, during which portmap is called first, adding even more competitors for the xtables lock.
For example, after rebooting a Node that had run many Pods, antrea-agent kept crashing.
Many iptables commands are running:
They are spawned by portmap processes:
The parent process 613 is containerd:
To Reproduce
Expected behavior
antrea-agent shouldn't keep crashing.
Actual behavior
antrea-agent kept crashing, no CNI request succeeded, and kubelet kept retrying StopPodSandbox.
Versions:
Please provide the following information: