failed to get of_port of OVS port xxxxxxxx-yyyyyy: timed out: "wait" timed out after 5002 ms #1022

Closed
alex-vmw opened this issue Jul 31, 2020 · 3 comments · Fixed by #1052
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@alex-vmw

Describe the bug
After the timeout for getting the of_port was increased to 5 seconds (#830), the thousands of timeout errors stopped, until last night. A single node then began failing pod creation continuously until we cordoned it off (more than 122,000 errors in less than 2 days). The node produced thousands of errors like the one below:

E0731 09:53:14.111650    1139 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "gulel-cat-zx2m8_gulel-cat(3dee1435-8a5a-4f6c-b899-cfd3bbf8fdd1)" failed: rpc error: code = Unknown desc = failed to set up sandbox container "aef6250bfa5cc7d4b411750dea9829883e3885cde4be692375af07fa44234a21" network for pod "gulel-cat-zx2m8": NetworkPlugin cni failed to set up pod "gulel-cat-zx2m8_gulel-cat" network: failed to connect to ovs for container aef6250bfa5cc7d4b411750dea9829883e3885cde4be692375af07fa44234a21: failed to get of_port of OVS port gulel-ca-1d1224: timed out: "wait" timed out after 5002 ms
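
To gauge how many of these errors a node has produced, the log can be scanned for the message text. The following is a minimal sketch, not part of Antrea: the regular expression is derived only from the error quoted above, the function name `count_of_port_timeouts` is hypothetical, and the sample lines are an abbreviated copy of that error plus one made-up unrelated line.

```python
import re
from collections import Counter

# Pattern derived from the error text quoted in this issue (an assumption
# that other occurrences follow the same wording).
OF_PORT_TIMEOUT = re.compile(
    r'failed to get of_port of OVS port (?P<port>[^\s:]+): timed out: '
    r'"wait" timed out after (?P<ms>\d+) ms'
)


def count_of_port_timeouts(lines):
    """Return a Counter mapping OVS port name -> number of timeout errors."""
    counts = Counter()
    for line in lines:
        match = OF_PORT_TIMEOUT.search(line)
        if match:
            counts[match.group("port")] += 1
    return counts


# Abbreviated sample: the first line is shortened from the error above,
# the second is a hypothetical unrelated log line.
sample = [
    'E0731 09:53:14.111650 ... failed to get of_port of OVS port '
    'gulel-ca-1d1224: timed out: "wait" timed out after 5002 ms',
    'I0731 09:53:15.000000 ... unrelated message',
]

counts = count_of_port_timeouts(sample)
print(sum(counts.values()), "timeout error(s):", dict(counts))
```

To run it against a real log, pass an open file object instead of `sample`, since the function only needs an iterable of lines.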

To Reproduce
Do not know how to reproduce.

Expected
Pod creation should not fail because a port cannot be obtained from OVS.

Actual behavior
OVS was deadlocked on something, so Antrea could not get a port from OVS, causing new pod creation to fail on the node.

Versions:
Please provide the following information:

  • Antrea version: v0.8.2
  • Kubernetes version: 1.15.4
  • Container runtime: Docker 18.6.3
  • Linux kernel version on the Kubernetes Nodes (uname -r): 4.19.43-coreos

Additional context
antrea-agent.sc-prd-decc-001-md-dy-minion043.root.log.INFO.20200711-020555.1.zip
antrea-agent.sc-prd-decc-001-md-dy-minion043.root.log.INFO.20200716-023200.1.zip
antrea-agent.sc-prd-decc-001-md-dy-minion043.root.log.INFO.20200730-141639.1.zip
antrea-agent.sc-prd-decc-001-md-dy-minion043.root.log.INFO.20200731-061931.1.zip
sc-prd-decc-001-md-dy-minion043-ERROR-logs.zip
sc-prd-decc-001-md-dy-minion043-ovs-logs.zip
sc-prd-decc-001-md-dy-minion043-WARNING-logs.zip
sc-prd-decc-001-md-dy-minion043-journalctl.zip

@alex-vmw alex-vmw added the kind/bug Categorizes issue or PR as related to a bug. label Jul 31, 2020
@antoninbas antoninbas self-assigned this Aug 4, 2020
@antoninbas antoninbas added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Aug 4, 2020
@antoninbas
Contributor

I looked at this issue with @alex-vmw. At this point we are pretty convinced that this is the same issue as this:

This leads us to this glibc bug: https://sourceware.org/bugzilla/show_bug.cgi?id=23861. We confirmed that the glibc version shipped in the Antrea Docker image (the glibc version for Ubuntu 18.04) is affected by this bug. We also looked at some ovs-vswitchd tracebacks in GDB and they matched the ones from the GitHub issue (although we were missing some info because of inlining / missing symbols).

Since backporting a more recent version of glibc to Ubuntu 18.04 may not be practical and may carry its own risk, I suggest updating the distribution for the Antrea Docker image to Ubuntu 20.04, which is also an LTS version and comes with a more recent version of glibc in which the bug has been fixed. Obviously this update also carries some risk, as all the packages we depend on (e.g. iptables) will be affected.
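
For anyone wanting to check a given image, here is a rough sketch of a glibc version check, to be run inside the container. It rests on assumptions not stated in this thread: that the fix for bug 23861 first shipped in glibc 2.30, and that Ubuntu 18.04 and 20.04 ship glibc 2.27 and 2.31 respectively. The helper names are hypothetical, and the check looks only at the upper bound, not at when the bug was introduced.

```python
import platform
import re

# Assumption: the fix for glibc bug 23861 first shipped in glibc 2.30.
ASSUMED_FIXED_IN = (2, 30)


def parse_glibc_version(text):
    """Extract (major, minor) from a string such as '2.27' or 'glibc 2.31'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def may_be_affected(version, fixed_in=ASSUMED_FIXED_IN):
    """True if the version predates the assumed fix release."""
    return version < fixed_in


# platform.libc_ver() reports the C library the Python binary is linked with;
# it returns empty strings when that cannot be determined.
libc, version_text = platform.libc_ver()
if libc == "glibc" and version_text:
    version = parse_glibc_version(version_text)
    status = "may be affected" if may_be_affected(version) else "should include the fix"
    print(f"glibc {version_text}: {status}")
else:
    print("could not determine the glibc version on this system")
```

Under these assumptions, 2.27 (Ubuntu 18.04) is flagged and 2.31 (Ubuntu 20.04) is not, which matches the reasoning for the distribution update.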

@jianjuns @tnqn let me know what you think and if you are opposed to switching to Ubuntu 20.04.

@antoninbas antoninbas added this to the Antrea v0.9.0 release milestone Aug 5, 2020
@tnqn
Member

tnqn commented Aug 6, 2020

@antoninbas @alex-vmw great finding!
I'm open to switching to Ubuntu 20.04, but I don't really know the potential risks, so I would like to see @jianjuns's input.

@jianjuns
Contributor

jianjuns commented Aug 6, 2020

I think it is fine to switch to 20.04. It has been released for a while.

antoninbas added a commit to antoninbas/antrea that referenced this issue Aug 6, 2020
The main reason for this update is picking up a more recent version of
glibc, as the one that ships with Ubuntu 18.04 can cause OVS to deadlock
(See antrea-io#1022).

In this PR, we only update the distribution for the "main" Antrea Docker
image; other images, such as the ones we use for testing or for
deploying the Antrea Octant plugin, can be updated later if needed.

This is also a good opportunity to upgrade OVS daemons from 2.13.0 to
2.13.1, since the Docker build had to be updated anyway. For the sake of
simplicity, from now on we will only support building the base
openvswitch Docker image for OVS >= 2.13.0.

Fixes antrea-io#1022
antoninbas added a commit that referenced this issue Aug 7, 2020
GraysonWu pushed a commit to GraysonWu/antrea that referenced this issue Sep 22, 2020
4 participants