[3] How to know if a machine is supported? #32
Comments
Is this not an issue of Kubespray itself? I guess we could try to find out the difference and upstream a check that makes sure that the network is supported.
Sorry to hear about your bad day @lentzi90. We moved away from having "known good images" due to maintainability and agility. Previously we were maintaining our images for each cloud provider and each region of each cloud provider. That proved expensive, incomplete and unrewarding. Regarding agility, we had cases where we needed to switch on short notice from one base image to another (e.g., Ubuntu 18.04 to 20.04, or even RHEL). Hence, starting from "almost whatever the cloud provider offers" makes the most sense to me.

I second @Xartos's suggestion. Kubespray includes tons of pre-flight checks, including my least favorite one. As a short-term solution, I think we need to get better at performing checks at the end of each step, as opposed to "just" executing each step. For example, we could make it a documented practice to SSH into each node after the Terraform step, i.e., before Kubespray-ing them, and check "basic VM hygiene":
I don't know what such a list would look like, but I heard that "not getting a good VM" is a recurring problem with some cloud providers. Opting for "a known good" image might not help.
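To make the "basic VM hygiene" idea concrete, such a checklist might look like the following sketch. The specific checks, thresholds, and module names here are illustrative assumptions on my part, not an agreed list:

```shell
#!/bin/sh
# Hypothetical "VM hygiene" checklist -- a sketch, not an official Kubespray
# pre-flight. Run on each node after the Terraform step, before Kubespray.

fail=0
check() {  # check <description> <command...>
  desc="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "ok:   $desc"
  else
    echo "FAIL: $desc"
    fail=1
  fi
}

check "outbound DNS resolution"        getent hosts github.com
check "dummy kernel module available"  modprobe -n -v dummy
check "br_netfilter module available"  modprobe -n -v br_netfilter
check "at least ~2 GiB RAM"            sh -c '[ "$(awk "/MemTotal/ {print \$2}" /proc/meminfo)" -ge 2000000 ]'
check "swap disabled"                  sh -c '[ "$(swapon --show | wc -l)" -eq 0 ]'

if [ "$fail" -eq 0 ]; then echo "all hygiene checks passed"; else echo "some hygiene checks FAILED"; fi
```

Each `check` line is independent, so the list can grow or shrink per cloud provider without touching the harness.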
Good points @cristiklein. I agree that it would be good to have some "smoke tests" for the VMs (and Kubernetes). As for known good images, I wasn't thinking about us building our own. We did that since we needed to have the Kubernetes packages pre-installed, but that's not the case with Kubespray. I'm just thinking that instead of relying on the cloud provider to provide the image, we go to the source. Grab one of the official Ubuntu images and upload it, unless the customer has some specific requirements, in which case I guess it is up to them to make sure the image works with Kubespray. Wouldn't this also be better from a compliance perspective? Fewer hands involved, and we "know what we are running".
@lentzi90 I am honestly a bit surprised that cloud providers would "spoil the source" for no good reason. I would expect cloud providers to do just that: copy the image from "the source" and make it available to everyone. We should be able to verify this by checking the checksum of the image. Unfortunately, I couldn't establish equality. I would very much like to clarify with our partners technically what they do to their images. We rely on the underlying cloud provider to help us be compliant. Working around flaws in their compliance process is unsustainable. So yes, I would "go to the source", but I would expect cloud providers to do so for us.
We decided to time-box this task and document a few commands that one can run to check that one truly got the VMs that one requested. If, indeed, there is an issue with a specific SafeSpring Ubuntu image, then this should be reported to SafeSpring.
Ideal acceptance criteria: come up with a test that succeeds with the good VM image and fails with the bad VM image.
Alright, I think I have it now! First some speculation: I think the "bad image" was created from an Ubuntu minimal template, whereas the "good image" is a normal Ubuntu server template. It is not completely obvious, but the naming and description suggest this, and the "bad image" also tells you that it has been "minimized" when you log in. The most obvious error on the "bad image" is that node-local-dns is crashing.
I found this issue which seemed related, and tried a few debug commands mentioned there. What I found was that it is not possible to create a dummy interface:

```
# ip link add dummy0 type dummy
Error: Unknown device type.
# modprobe dummy
modprobe: FATAL: Module dummy not found in directory /lib/modules/5.4.0-1026-kvm
```

These commands work fine using the "good image". I don't know enough about modprobe, networking and the Linux kernel to tell what would be needed to fix this, but I guess a simple test would be to run these commands.
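That "simple test" can be sketched as a small probe around `modprobe`. This is just a sketch: `modprobe -n` does a dry run, so it only resolves the module and loads nothing:

```shell
#!/bin/sh
# Sketch: probe whether the "dummy" kernel module exists on this image,
# without actually loading it. On the "bad image" described above, the
# dry run fails; on the "good image" it succeeds.

if modprobe -n dummy 2>/dev/null; then
  status="available"
  echo "dummy module available: node-local-dns should be able to start"
else
  status="missing"
  echo "dummy module missing: this looks like the broken image"
fi
```

Running this right after provisioning would catch the problem hours before Kubespray finishes.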
Nice catch @lentzi90! I would argue that this is a nice test to have upstream.
Created an issue to start the discussion upstream. Let's see how it goes. 🙂
Please implement + PR upstream instead of only discussing upstream.
This was fixed in kubernetes-sigs/kubespray#7348
I basically spent a full day trying to get compliantkubernetes-kubespray working without success, only to realize the next morning that if I changed the VM template/image everything started working.
To be clear, Kubespray ran without error multiple times, but the network was broken within the cluster. This was most easily noticed by the node-local-dns pods, which crashed constantly.
I'm not sure what the difference is between the two images, just that one is named `ubuntu 20.04 cloud image (20201111)` and the other `ubuntu-20.40-server-cloudimg-amd64-20200423`. I naturally went for the newer one first, but that is the one I never got working.

What I want with this issue is an answer to how we can avoid wasting time on broken images. Should we upload known good images instead of using whatever the cloud provider happens to have? What do we do if the cloud provider doesn't have any suitable image? Should we test the images in some specific way, or just "try it and see"?
Definition of done: We have a documented answer on how to avoid wasting time on broken images. It would be very nice to point out some known good image (perhaps the official Ubuntu images?).
Note: I learned that Kubespray can deploy a netchecker for you, which can come in handy to quickly determine whether the network is working or not.
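Besides the netchecker, the crash-looping-pods symptom itself is easy to scan for. Below is a hedged sketch; the `flag_crashing` helper name is mine, and the `kubectl` invocation in the comment assumes a default Kubespray setup with node-local-dns in `kube-system`:

```shell
#!/bin/sh
# Sketch (not an official tool): scan "kubectl get pods" output for pods
# that look crash-looped, e.g. the node-local-dns symptom described above.
# Typical use (assumes kubectl access to the cluster):
#   kubectl -n kube-system get pods --no-headers | flag_crashing

flag_crashing() {
  # Expected columns: NAME READY STATUS RESTARTS AGE
  awk '$3 == "CrashLoopBackOff" || $4 + 0 > 5 { print "suspect pod: " $1 }'
}
```

Anything it prints is a candidate for a closer look with `kubectl logs`.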