vmware: persistent network interface name changed on upgrade #2437
Comments
Got things working again by putting a udev rule to set it back based on the PCI address.
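The rule itself did not survive in this thread. As a rough sketch of the idea (not the commenter's actual rule), the Python below derives a PCI address from a path-based name such as `enp11s0`, assuming systemd's `[P<domain>]p<bus>s<slot>[f<function>]` scheme with decimal numbers, and prints a udev rule that pins the old name to that address. The function names and the choice of `KERNELS` as the match key are illustrative assumptions; confirm the real address with `udevadm info` before relying on it.

```python
import re


def pci_address_from_path_name(name: str) -> str:
    """Turn a path-based name like 'enp11s0' into a PCI address like '0000:0b:00.0'.

    Assumes the [P<domain>]p<bus>s<slot>[f<function>] layout with decimal
    fields; names carrying other suffixes are rejected.
    """
    match = re.fullmatch(r"en(?:P(\d+))?p(\d+)s(\d+)(?:f(\d+))?", name)
    if match is None:
        raise ValueError(f"not a plain path-based interface name: {name!r}")
    # Missing optional fields (domain, function) default to 0.
    domain, bus, slot, function = (int(g) if g else 0 for g in match.groups())
    return f"{domain:04x}:{bus:02x}:{slot:02x}.{function:x}"


def udev_rename_rule(pci_address: str, wanted_name: str) -> str:
    """Build a udev rule that names the NIC at pci_address as wanted_name."""
    return (
        f'SUBSYSTEM=="net", ACTION=="add", '
        f'KERNELS=="{pci_address}", NAME="{wanted_name}"'
    )


if __name__ == "__main__":
    # 'enp11s0' and 'ens192' are the names reported in this issue.
    address = pci_address_from_path_name("enp11s0")
    print(udev_rename_rule(address, "ens192"))
```

The printed line would go in a rules file under `/etc/udev/rules.d/` that sorts ahead of the default naming rules; the exact file name is left to the reader.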
|
Question for the other folks in here (@zeeZ, @richardmoe, @nrk-msa, @gcyre): is this only happening on vmware instances or elsewhere too? |
For us it's just VMware instances. |
@philosifer thanks for the report. Do you still have some nodes on the previous stable? If so can you please attach the output of |
@lucab here's the output from one of our nodes on previous stable
|
and new version
|
gcyre beat me to it, but I've stopped update-engine for now on one of my etcd nodes, so if you want my output as well, let me know. |
The obvious difference between the dumps is that |
Looks like the issue is probably systemd/systemd#8446, fixed by systemd/systemd#8458. |
@lucab Yes vmware, after upgrade from an existing vmware_raw iso installation. Since we know all MAC addresses beforehand, my workaround is putting a copy of |
Confirmed that systemd/systemd#8458 fixes it. |
@bgilbert Thanks for the update. Mine is an upgraded vmware raw iso also so that's a match for me. I've now switched one of my clusters from stable to beta so I should pick this sort of thing up before it hits production next time and I can remove my workaround easily once this fix gets into the releases. |
This should be fixed in alpha 1786.1.0 and stable 1745.4.0, due shortly. The next beta release will also include the fix. Thanks for the report. |
@philosifer And thanks for switching one of your clusters to beta. |
@philosifer I hate to ask this here, but how did you get into the VM to change the network? |
I already had SSH access and the name change resulted in a DHCP address
which I could see on the console so I was able to SSH to the new IP.
|
Thanks for reaching out. We don’t have DHCP on that network, guess I’ll have to set something up.
|
This should be fixed in beta 1772.2.0, due shortly. |
Confirmed working fine with my workaround removed in 1772.2.0 |
Maybe this sounds crazy, but could the team stop rolling out the version that is causing problems? |
FYI - if you don't have DHCP there is a way to get into the console and roll back to the previous OS version |
Yes, but that's not the case;)
WBR,
Strukov Anton
|
@Savemech We did stop rolling stable until it was fixed. We didn't stop beta, since the previous beta also had the problem. As of today, all three channels include the fix. |
Well, if you ask me, my non-prod clusters on VMware with auto-update
enabled went away. Good Lord, we insist on disabling auto-updates for
production: sane mind vs. that "latest-greatest auto-updates" tagline,
huh. So here are my 5 cents on that outage.
What we started to observe was that some boxes went away and didn't come
back. For each box the fix was: sign in, try to gracefully power off the
instance that had turned into a pumpkin, update the interface in that tiny
vSphere GUI box to enp10s0, boot, log in, update the box, then go back to
the nicest vSphere GUI and change the interface name back to ens192. For
~500 nodes (on different vSpheres), all manual work, that's something (I'm
doing my best not to use any strong words). Maybe we missed something, but
AFAIK there is no way to change all of this with tools like vSphere
PowerCLI or Terraform; please correct me if I'm wrong. (Yes, I know that
with access to vSphere storage I could use some find + sed magic, but we
don't have such access.)
So, to stop ranting: I'd like to have some option to auto-detect the
interface and configure it accordingly, or to take those params from
vSphere and configure the first interface found, etc. In short, the team
and I don't really want to go through this again. After all, when a
shitstorm like this happens, the team starts thinking about changing
distro as well.
WBR,
Strukov Anton
|
@Savemech I'm sorry this bug caused so much additional work for you. By default, Container Linux will DHCP on any network interface it finds, regardless of the interface name, so typical network configurations should not have been affected. In what way do your nodes depend on the interface name? Do you have static IP address bindings, firewall rules, something else?
You can configure locksmith to allow a limited number of machines in a cluster to reboot at a time. That way, if this happened again, you'd lose only a small number of nodes and would have time to track down the problem. Unfortunately that doesn't address more subtle issues that still allow the machine to boot, but at least in that case you'd likely have shell access to fix the problem.
One way you can help avoid similar issues in the future is to run some of your nodes on the alpha or beta channels. This problem wasn't caught by our internal testing, and we didn't receive any reports about it while the change was in alpha or beta, so unfortunately we weren't aware of the problem until the bug was partially rolled out to the stable channel. If you run some nodes on alpha or beta, you can help us catch these sorts of problems early. |
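To make the point about limiting simultaneous reboots concrete, here is a toy model in Python. It is not locksmith code and uses no locksmith API; the function name and the sequential rollout are simplifying assumptions. It only illustrates that when a node must hold one of a fixed number of reboot locks, and a broken node never returns to release its lock, the number of lost nodes is capped at the number of locks.

```python
def simulate_rollout(node_count: int, max_holders: int, update_is_broken: bool):
    """Toy rollout: each node must take a reboot lock before updating.

    A healthy node releases its lock after rebooting. A node broken by the
    update never comes back, so its lock stays held and blocks the rest.
    Returns (updated_nodes, lost_nodes) as lists of node indices.
    """
    holders = set()
    updated, lost = [], []
    for node in range(node_count):
        if len(holders) >= max_holders:
            break  # no free lock slot: remaining nodes stay on the old version
        holders.add(node)
        if update_is_broken:
            lost.append(node)  # lock is never released
        else:
            holders.discard(node)  # successful reboot frees the slot
            updated.append(node)
    return updated, lost


if __name__ == "__main__":
    # ~500 nodes is the fleet size mentioned earlier in the thread.
    updated, lost = simulate_rollout(500, max_holders=1, update_is_broken=True)
    print(f"updated={len(updated)} lost={len(lost)}")
```

With a single lock and a broken update, the model loses one node and leaves the rest untouched, which is the behaviour described above.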
@bgilbert I only have 10 nodes on vSphere, but we use static IPs. The instructions for changing some of those settings could be better. I followed a KB for disabling reboots so it wouldn't continue to affect my remaining nodes when we saw it (I had two nodes down), but they just kept rebooting anyway. One of the docs mentions upcoming support for maintenance windows with Container Linux. Any chance that will be happening soon? |
@scott1138 Whoops, the maintenance window documentation wasn't updated when we added support in Container Linux Configs; thanks for pointing that out! Fixed in coreos/docs#1244. Which instructions did you follow for disabling reboots? |
@bgilbert I was sent this link by support - https://support.coreos.com/hc/en-us/articles/115001299613. |
@scott1138 To confirm, you're running Tectonic, or at least the Container Linux Update Operator? |
Yes, I am using Tectonic.
|
Issue Report
Auto upgrade from 1688.5.3 to 1745.3.1 has changed the network interface name from ens192 to enp11s0 on all my vmware based systems. As a consequence the IP addresses went to dhcp and changed and flannel also then failed to find the interface.
Bug
Container Linux Version
Environment
Bare-metal kubernetes and etcd running on vmware. VMware is version 6.5 deployed using vmware cloud config.
Expected Behavior
Stable upgrades shouldn't change the adapter name.
Actual Behavior
Interface name changed
Reproduction Steps
Not reproduced, but the same thing happened on 7 VMs before I managed to stop the update service.
Other Information
Is there any way to downgrade again to the old version?