
If proxy port is taken on node, cilium will crashloop #22465

Closed
2 tasks done
anfernee opened this issue Dec 1, 2022 · 2 comments · Fixed by #25466
Assignees
Labels
  • kind/bug: This is a bug in the Cilium logic.
  • kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
  • needs/triage: This issue requires triaging to establish severity and next steps.
  • stale: The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.

Comments

anfernee (Contributor) commented Dec 1, 2022

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

  1. It happens very rarely in our test grid. When it does, we observe a crashlooping cilium pod:
Name:                 cilium-fsqd5
Namespace:            kube-system

Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulled   25m (x65 over 5h26m)    kubelet  Container image "gcr.io/anthos-edgecontainer-release/cilium/cilium:v1.11.7-anthos1.14-gke3.0.3" already present on machine
  Warning  BackOff  58s (x1495 over 5h25m)  kubelet  Back-off restarting failed container
  2. The cilium log says:
2022-11-16T01:35:04.841353158Z stderr F level=info msg="Using autogenerated IPv4 allocation range" subsys=node v4Prefix=10.4.0.0/16
2022-11-16T01:35:04.841362441Z stderr F level=info msg="Initializing daemon" subsys=daemon
2022-11-16T01:35:04.882162787Z stderr F level=info msg="Establishing connection to apiserver" host="https://35.202.63.6:6443" subsys=k8s
2022-11-16T01:35:04.934070731Z stderr F level=info msg="Connected to apiserver" subsys=k8s
2022-11-16T01:35:04.948171533Z stderr F level=info msg="Trying to auto-enable \"enable-node-port\", \"enable-external-ips\", \"enable-host-reachable-services\", \"enable-host-port\", \"enable-session-affinity\" features" subsys=daemon
2022-11-16T01:35:05.187120354Z stderr F level=info msg="Restored services from maps" failed=0 restored=0 subsys=service
2022-11-16T01:35:05.187311329Z stderr F level=info msg="Reading old endpoints..." subsys=daemon
2022-11-16T01:35:05.187345526Z stderr F level=info msg="No old endpoints found." subsys=daemon
2022-11-16T01:35:05.187423583Z stderr F level=info msg="Envoy: Starting xDS gRPC server listening on /var/run/cilium/xds.sock" subsys=envoy-manager
2022-11-16T01:35:05.188911845Z stderr F level=warning msg="Attempt to bind DNS Proxy failed, retrying in 4s" error="listen udp 0.0.0.0:40963: bind: address already in use" subsys=fqdn/dnsproxy
2022-11-16T01:35:09.189686261Z stderr F level=warning msg="Attempt to bind DNS Proxy failed, retrying in 4s" error="listen udp 0.0.0.0:40963: bind: address already in use" subsys=fqdn/dnsproxy
2022-11-16T01:35:13.190614861Z stderr F level=warning msg="Attempt to bind DNS Proxy failed, retrying in 4s" error="listen udp 0.0.0.0:40963: bind: address already in use" subsys=fqdn/dnsproxy
2022-11-16T01:35:17.191445821Z stderr F level=warning msg="Attempt to bind DNS Proxy failed, retrying in 4s" error="listen udp 0.0.0.0:40963: bind: address already in use" subsys=fqdn/dnsproxy
2022-11-16T01:35:21.192432893Z stderr F level=warning msg="Attempt to bind DNS Proxy failed, retrying in 4s" error="listen udp 0.0.0.0:40963: bind: address already in use" subsys=fqdn/dnsproxy
2022-11-16T01:35:25.192647526Z stderr F level=fatal msg="Error while creating daemon" error="listen udp 0.0.0.0:40963: bind: address already in use" subsys=daemon
  3. Running sudo lsof -i -P -n on the node (den1401) to get port usage shows that port 40963 is held by the command envelope_:
COMMAND      PID            USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
envelope_   2483         prodbin   46u  IPv6   76910      0t0  UDP *:40963
  4. Looking into the code, cilium does have port allocation logic to find a free port to use. However, it also tries to recover the previously used port from existing iptables rules, and that recovery takes precedence over (shadows) the port allocation logic; a minimal availability-check sketch follows the snippet below.

func (m *IptablesManager) doGetProxyPort(prog iptablesInterface, name string) uint16 {
	rules, err := prog.runProgOutput([]string{"-t", "mangle", "-n", "-L", ciliumPreMangleChain})
	if err != nil {
		return 0
	}
	re := regexp.MustCompile(name + ".*TPROXY redirect (0.0.0.0|::):([1-9][0-9]*) mark")
	strs := re.FindAllString(rules, -1)
	if len(strs) == 0 {
		return 0
	}
	// Pick the port number from the last match, as rules are appended to the end (-A)
	portStr := re.ReplaceAllString(strs[len(strs)-1], "$2")
	portUInt64, err := strconv.ParseUint(portStr, 10, 16)
	if err != nil {
		log.WithError(err).Debugf("Port number cannot be parsed: %s", portStr)
		return 0
	}
	return uint16(portUInt64)
}
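Because doGetProxyPort only parses a port number back out of the iptables rule, nothing guarantees that the port is still free on the host when the agent restarts. Below is a minimal sketch of the kind of check the eventual fix adds; it is not the actual Cilium implementation, and the helper name udpPortIsFree and the fall-back-to-zero behavior are assumptions for illustration. The idea is to probe the recovered port with a throwaway UDP bind and discard it if the bind fails, so the normal allocation path picks a fresh port instead of crashlooping on the bind error shown in the log above.

package main

import (
	"fmt"
	"net"
)

// udpPortIsFree reports whether a UDP listener can currently be opened
// on the given port on all interfaces. (Illustrative helper, not Cilium code.)
func udpPortIsFree(port uint16) bool {
	l, err := net.ListenUDP("udp", &net.UDPAddr{Port: int(port)})
	if err != nil {
		return false
	}
	l.Close()
	return true
}

func main() {
	recovered := uint16(40963) // e.g. the value recovered from iptables rules
	if recovered != 0 && !udpPortIsFree(recovered) {
		// The port is held by another process (the envelope_ case above),
		// so drop the recovered value and let normal allocation pick a new one.
		recovered = 0
	}
	fmt.Println("proxy port to use:", recovered)
}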

Cilium Version

Discovered in v1.11. Suspect it also exists on master.

Kernel Version

Not kernel specific.

Kubernetes Version

Not k8s specific.

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
anfernee added the kind/bug, needs/triage, and kind/community-report labels on Dec 1, 2022
github-actions bot commented:

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

github-actions bot added the stale label on Jan 31, 2023
github-actions bot commented:

This issue has not seen any activity since it was marked stale.
Closing.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Feb 14, 2023
anfernee reopened this on Feb 14, 2023
anfernee self-assigned this on Feb 14, 2023
anfernee added a commit to anfernee/cilium that referenced this issue May 15, 2023
When cilium-agent starts, it allocates a free port for the proxy to
use if users don't specify one in the config. It also tries to recover
the previous allocation from iptables rules, but the recovery doesn't
check whether the port is already open by another process on the host.
This change checks that the recovered port is free before assigning it
to the DNS proxy.

Fix cilium#22465

Signed-off-by: Yongkun Gui <ygui@google.com>
julianwiedmann pushed a commit that referenced this issue May 31, 2023
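For the allocation half of the commit message above ("allocate a free port for proxy to use"), the sketch below shows how an unused UDP port can be obtained from the kernel by binding port 0. This is an illustrative assumption rather than Cilium's actual allocation code, and the helper name allocateFreeUDPPort is made up for the example.

package main

import (
	"fmt"
	"net"
)

// allocateFreeUDPPort binds UDP port 0 so the kernel picks an unused port,
// then returns the chosen port number. (Illustrative helper, not Cilium code.)
func allocateFreeUDPPort() (uint16, error) {
	l, err := net.ListenUDP("udp", &net.UDPAddr{Port: 0})
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return uint16(l.LocalAddr().(*net.UDPAddr).Port), nil
}

func main() {
	port, err := allocateFreeUDPPort()
	if err != nil {
		fmt.Println("could not allocate a proxy port:", err)
		return
	}
	fmt.Println("allocated proxy port:", port)
}

Note that releasing the probe socket before the DNS proxy rebinds the port leaves a small race window in which another process could grab it; binding the proxy's own socket directly avoids that, which is why probe-and-release is only a sketch.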