failed to get of_port of OVS port xxxxxxxx-yyyyyy: timed out: "wait" timed out after 5002 ms #1022
Comments
I looked at this issue with @alex-vmw. At this point we are pretty convinced that this is the same issue as this:
This leads us to this glibc bug: https://sourceware.org/bugzilla/show_bug.cgi?id=23861. We confirmed that the glibc version shipped in the Antrea Docker image (the glibc version for Ubuntu 18.04) is affected by this bug. We also looked at some ovs-vswitchd tracebacks in GDB and they matched the ones from the GitHub issue (although we were missing some info because of inlining / missing symbols). Since backporting a more recent version of glibc to Ubuntu 18.04 may not be practical and may carry its own risk, I suggest updating the base distribution of the Antrea Docker image to Ubuntu 20.04, which is also an LTS release and comes with a more recent version of glibc in which the bug has been fixed. Obviously this update also carries some risk, as all the packages we depend on (e.g. iptables) will be affected. @jianjuns @tnqn let me know what you think and whether you are opposed to switching to Ubuntu 20.04.
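To check whether a given image still carries the older glibc, a small version comparison is enough. The sketch below is only an illustration: the helper names are hypothetical, and the 2.31 threshold is an assumption based on the version that ships with Ubuntu 20.04 (Ubuntu 18.04 ships 2.27), not a statement of the exact release in which the bug was fixed.

```python
import platform


def parse_version(text):
    """Turn a dotted version such as '2.27' into a tuple of ints, e.g. (2, 27)."""
    return tuple(int(part) for part in text.split("."))


def glibc_at_least(current, minimum):
    """Return True if version string `current` is at least `minimum`."""
    return parse_version(current) >= parse_version(minimum)


def running_glibc_version():
    """Return the glibc version of the running interpreter, or None if unknown."""
    lib, version = platform.libc_ver()
    return version if lib == "glibc" and version else None


# Assumed threshold: the glibc version shipped with Ubuntu 20.04.
ASSUMED_MIN_GLIBC = "2.31"

if __name__ == "__main__":
    version = running_glibc_version()
    if version is None:
        print("could not determine the glibc version")
    elif glibc_at_least(version, ASSUMED_MIN_GLIBC):
        print(f"glibc {version}: at or above {ASSUMED_MIN_GLIBC}")
    else:
        print(f"glibc {version}: older than {ASSUMED_MIN_GLIBC}")
```

Running it inside the container (rather than on the host) is what matters here, since ovs-vswitchd links against the glibc in the image.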
@antoninbas @alex-vmw great finding!
I think it is fine to switch to 20.04. It has been released for a while.
The main reason for this update is picking up a more recent version of glibc, as the one that ships with Ubuntu 18.04 can cause OVS to deadlock (See antrea-io#1022). In this PR, we only update the distribution for the "main" Antrea Docker image; other images, such as the ones we use for testing or for deploying the Antrea Octant plugin, can be updated later if needed. This is also a good opportunity to upgrade OVS daemons from 2.13.0 to 2.13.1, since the Docker build had to be updated anyway. For the sake of simplicity, from now on we will only support building the base openvswitch Docker image for OVS >= 2.13.0. Fixes antrea-io#1022
Describe the bug
After increasing the timeout to 5 seconds for getting the of_port (#830), we stopped getting thousands of the timeout errors, until last night. We had a single node that started to continuously fail pod creation until we cordoned it off (more than 122,000 errors in less than 2 days). The node produced thousands of errors like below:
To Reproduce
Do not know how to reproduce.
Expected
Pod creation should not fail because the agent is unable to get a port from OVS.
Actual behavior
OVS was deadlocked on something, so Antrea could not get a port from OVS, causing new pod creation to fail on the node.
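For context on the error text, the quoted "wait" matches the name of the OVSDB "wait" operation defined in RFC 7047, which blocks until a condition on some rows holds or a timeout expires. The sketch below builds such a request for the ofport column of the Interface table. It is an assumption about the general shape of the transaction, not Antrea's actual code, and the function name and defaults are hypothetical.

```python
import json


def build_ofport_wait_request(interface_name, timeout_ms=5000, request_id=1):
    """Build an OVSDB 'transact' request that waits for ofport to be assigned.

    The 'wait' operation succeeds once the ofport column is no longer the
    empty set; the 'select' operation then reads the assigned value.
    """
    where = [["name", "==", interface_name]]
    wait_op = {
        "op": "wait",
        "timeout": timeout_ms,  # milliseconds
        "table": "Interface",
        "where": where,
        "columns": ["ofport"],
        "until": "!=",
        "rows": [{"ofport": ["set", []]}],  # empty set means "not assigned yet"
    }
    select_op = {
        "op": "select",
        "table": "Interface",
        "where": where,
        "columns": ["ofport"],
    }
    return {
        "method": "transact",
        "params": ["Open_vSwitch", wait_op, select_op],
        "id": request_id,
    }


if __name__ == "__main__":
    # "example-port" is a placeholder interface name.
    print(json.dumps(build_ofport_wait_request("example-port"), indent=2))
```

If ovs-vswitchd is deadlocked, it never assigns the ofport, so a wait of this kind would expire regardless of how large the timeout is, which is consistent with raising the timeout in #830 not helping on this node.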
Versions:
Please provide the following information:
Kernel version (uname -r): 4.19.43-coreos
Additional context
antrea-agent.sc-prd-decc-001-md-dy-minion043.root.log.INFO.20200711-020555.1.zip
antrea-agent.sc-prd-decc-001-md-dy-minion043.root.log.INFO.20200716-023200.1.zip
antrea-agent.sc-prd-decc-001-md-dy-minion043.root.log.INFO.20200730-141639.1.zip
antrea-agent.sc-prd-decc-001-md-dy-minion043.root.log.INFO.20200731-061931.1.zip
sc-prd-decc-001-md-dy-minion043-ERROR-logs.zip
sc-prd-decc-001-md-dy-minion043-ovs-logs.zip
sc-prd-decc-001-md-dy-minion043-WARNING-logs.zip
sc-prd-decc-001-md-dy-minion043-journalctl.zip