[3] How to know if a machine is supported? #32

Closed
lentzi90 opened this issue Jan 28, 2021 · 11 comments
Labels: kind/documentation (Improvements or additions to documentation), kind/investigation (Investigating something new)

@lentzi90 (Contributor)

I basically spent a full day trying to get compliantkubernetes-kubespray working without success, only to realize the next morning that if I changed the VM template/image, everything started working.
To be clear, Kubespray ran without error multiple times, but the network was broken within the cluster. This was most easily noticed by the node-local-dns pods, which constantly crashed.

I'm not sure what the difference is between the two images, just that one is named ubuntu 20.04 cloud image (20201111) and the other ubuntu-20.40-server-cloudimg-amd64-20200423. I naturally went for the newer one first, but that is the one I never got working.

What I want with this issue is an answer to how we can avoid wasting time on broken images. Should we upload known good images instead of using whatever the cloud provider happens to have? What do we do if the cloud provider doesn't have any suitable image? Should we test the images in some specific way or just "try it and see"?

Definition of done: We have a documented answer on how to avoid wasting time on broken images. It would be very nice to point out a known good image (perhaps the official Ubuntu images?).

Note: I learned that Kubespray can deploy a netchecker for you, which can come in handy to quickly determine whether the network is working or not.
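
For illustration, once the netchecker is deployed, a quick way to see whether its components came up is the following; this assumes the pods and services carry "netchecker" in their names, which I have not verified for every Kubespray version:

kubectl get pods --all-namespaces -o wide | grep netchecker
kubectl get svc --all-namespaces | grep netchecker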

@lentzi90 added the kind/documentation (Improvements or additions to documentation) and kind/investigation (Investigating something new) labels on Jan 28, 2021
@Xartos (Contributor) commented Jan 28, 2021

Isn't this an issue with Kubespray? I guess we could try to find out the difference and upstream a check that makes sure the network is supported.

@cristiklein (Contributor)

Sorry to hear about your bad day @lentzi90.

We moved away from having "known good images" due to maintainability and agility. Previously, we maintained our own images for each cloud provider and each region of each cloud provider. That proved expensive, incomplete and unrewarding. Regarding agility, we had cases where we needed to switch on short notice from one base image to another (e.g., Ubuntu 18.04 to 20.04, or even RHEL). Hence, starting from "almost whatever the cloud provider offers" makes the most sense to me.

I second @Xartos' suggestion. Kubespray includes tons of pre-flight checks, including my least favorite, ping_access_ip. Long term, we should contribute the checks we discover upstream.

As a short-term solution, I think we need to get better at performing checks at the end of each step, as opposed to "just" executing each step. For example, we could make it a documented practice to SSH into each node after the Terraform step, i.e., before Kubespray-ing them, and check "basic VM hygiene" (a rough sketch of such checks follows the list):

  1. Does Internet work on each node? Can I ping 8.8.4.4?
  2. Does DNS resolution / HTTPS work? Can I curl https://www.google.com?
  3. Are nodes able to communicate with each other?
  4. Do I get a full CPU or is my "steal time" larger than zero?
  5. Does the memory pass basic checks?
  6. Do I get the expected network bandwidth?
  7. Do I get the expected disk bandwidth?
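
For illustration, a per-node script along these lines might look like the sketch below. The commands and thresholds are my own assumptions rather than a vetted procedure, and checks 3 and 6 need a second node (e.g. ping or iperf3 against its IP), so they are only hinted at here:

#!/usr/bin/env bash
# Rough per-node "VM hygiene" sketch; run on each node after the Terraform step.
set -u

echo "== 1. Internet reachability =="
ping -c 3 8.8.4.4 || echo "WARNING: cannot reach 8.8.4.4"

echo "== 2. DNS resolution / HTTPS =="
curl -sSf -o /dev/null https://www.google.com || echo "WARNING: HTTPS to www.google.com failed"

echo "== 3. Node-to-node connectivity: ping/iperf3 the other nodes' IPs (not automated here) =="

echo "== 4. CPU steal time (the 'st' value should stay close to zero) =="
top -bn1 | grep '%Cpu'

echo "== 5. Memory =="
free -h

echo "== 6. Network bandwidth: run 'iperf3 -s' on a peer node, then 'iperf3 -c <peer>' here =="

echo "== 7. Disk write bandwidth (rough, 256 MiB direct write) =="
dd if=/dev/zero of=/tmp/ddtest bs=1M count=256 oflag=direct 2>&1 | tail -n1
rm -f /tmp/ddtest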

I don't know exactly what such a list should look like, but I have heard that "not getting a good VM" is a recurring problem with some cloud providers. Opting for a "known good" image might not help with that.

@lentzi90 (Contributor Author)

Good points @cristiklein. I agree that it would be good to have some "smoke tests" for the VMs (and Kubernetes).

As for known good images, I wasn't thinking about us building our own. We did that before because we needed the Kubernetes packages pre-installed, but that's not the case with Kubespray. I'm just thinking that instead of relying on the cloud provider to provide the image, we go to the source: grab one of the official Ubuntu images and upload it, unless the customer has some specific requirements, in which case I guess it is up to them to make sure the image works with Kubespray.

Wouldn't this also be better from a compliance perspective? Fewer hands involved and we "know what we are running".
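
For illustration, "going to the source" on an OpenStack-based provider could look roughly like this; the release URL, file name and target image name are assumptions on my part:

# Download the upstream Ubuntu 20.04 cloud image and upload it as a private image.
wget https://cloud-images.ubuntu.com/focal/current/focal-server-cloudimg-amd64.img
openstack image create \
  --disk-format qcow2 --container-format bare --private \
  --file focal-server-cloudimg-amd64.img \
  ubuntu-20.04-upstream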

@cristiklein (Contributor)

@lentzi90 I am honestly a bit surprised that cloud providers would "spoil the source" for no good reason. I would expect cloud providers to do just that: copy the image from "the source" and make it available to everyone. We should be able to verify this by checking the checksum of the image. Unfortunately, I couldn't establish that the checksums match.
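
For illustration, such a comparison could look like the following; the release path (I am guessing the 20200423 serial from the image name) and the local file name are assumptions:

# Fetch the upstream checksums for the release the provider image claims to be.
curl -sO https://cloud-images.ubuntu.com/releases/focal/release-20200423/SHA256SUMS
# Checksum the image as downloaded from the cloud provider and look for a match.
sha256sum provider-ubuntu-20.04.img
grep server-cloudimg-amd64 SHA256SUMS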

I would very much like to clarify with our partners technically what they do to their images. We rely on the underlying cloud provider to help us be compliant. Working around flaws in their compliance process is unsustainable.

So yes, I would "go to the source", but I would expect cloud providers to do so for us.

@cristiklein (Contributor)

My mom always said life was like a box of chocolates. You never know what you're gonna get.
(Forrest Gump)

We decided to time-box this task and document a few commands that one can run to check that one truly got the VMs that one requested.

If, indeed, there is an issue with a specific SafeSpring Ubuntu image, then this should be reported to SafeSpring.

@tordsson changed the title from "How to know if a machine is supported?" to "[3] How to know if a machine is supported?" on Feb 4, 2021
@cristiklein (Contributor)

Ideal acceptance criteria: Come up with a test that succeeds with the good VM image and fails with the bad VM image.

@lentzi90 self-assigned this on Feb 18, 2021
@lentzi90 (Contributor Author)

Alright, I think I have it now!

First, some speculation: I think the "bad image" was created from an Ubuntu minimal template, whereas the "good image" is based on a normal Ubuntu server template. It is not completely obvious, but the naming and description suggest this, and the "bad image" also tells you that it has been "minimized" when you log in.

The most obvious error on the "bad image" is that node-local-dns is crashing with the following log:

2021/02/19 06:44:00 [ERROR] Failed to add non-existent interface nodelocaldns: operation not supported
2021/02/19 06:44:00 [INFO] Added interface - nodelocaldns
2021/02/19 06:44:00 [ERROR] Error checking dummy device nodelocaldns - operation not supported
listen tcp 169.254.20.10:8080: bind: cannot assign requested address

I found this issue which seemed related, and tried a few debug commands mentioned there. What I found was that it is not possible to create a dummy device on a VM based on the "bad image":

# ip link add dummy0 type dummy
Error: Unknown device type.
# modprobe dummy
modprobe: FATAL: Module dummy not found in directory /lib/modules/5.4.0-1026-kvm

These commands work fine using the "good image".

I don't know enough about modprobe, networking, and the Linux kernel to tell what would be needed to fix this, but I guess a simple test would be to run modprobe dummy. This could be done as part of the Kubespray preinstall tasks, for example. Do you think this would make sense?
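
As a minimal sketch, the check could be as simple as the following shell commands run on each node; the interface name is just an example, and wiring this into an actual Kubespray preinstall task would be a separate step:

# Fails on images that lack the dummy kernel module, which node-local-dns needs.
set -e
modprobe dummy
ip link add preflight-dummy0 type dummy
ip link delete preflight-dummy0
echo "dummy interface support: OK"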

@cristiklein (Contributor)

Nice catch @lentzi90! I would argue that this is a nice test to have upstream.

@lentzi90 (Contributor Author)

Created an issue to start the discussion upstream. Let's see how it goes. 🙂
kubernetes-sigs/kubespray#7307

@tordsson commented Mar 4, 2021

Please implement it and open a PR upstream instead of only discussing it upstream.

@lentzi90 (Contributor Author)

This was fixed in kubernetes-sigs/kubespray#7348
