Add installer tests for GKE and k3s #10365

Merged · merged 1 commit into main from nvn/infra-tf on Jun 7, 2022
Conversation

@nandajavarma (Contributor) commented May 30, 2022

Description

This PR adds the first draft of automated installer tests for GKE and k3s. As of this PR, the testing pipeline consists of the following pieces:

  • An installer-tests.ts script with phases defined (each pointing to a target in a Makefile that performs an operation like creating the cluster, installing Gitpod, etc.). A test configuration is basically a combination of multiple phases. This allows easy reuse of code and testing against multiple configurations.
  • The Makefile with the targets mentioned above.
  • YAML files that define cron jobs, each of which initiates one of the test configurations. Two are added with this PR: 1) GKE and 2) k3s.

The rest consists of configuration needed by the make targets. This includes:

  • Terraform modules for infrastructure creation. With this PR, we have 1) gke and 2) k3s modules.
  • Manifests necessary for the Gitpod configuration, such as cluster-issuer and kots-config, all in the manifests folder.
  • A small script to run the existing tests, to be cleaned up in a future iteration.

The Terraform modules can be used individually to set up a cluster; usage is documented in the corresponding READMEs.

Related Issue(s)

Fixes #7316

How to test

To create a GKE cluster, set up Gitpod, and run the existing tests on it, run:

werft run github -f -s .werft/installer-tests.ts -j .werft/gke-installer-tests.yaml -a debug=true

Similarly, to run on a k3s cluster:

werft run github -f -s .werft/installer-tests.ts -j .werft/k3s-installer-tests.yaml -a debug=true

Make sure to run this from a workspace that has access to the core-dev cluster, since it uses secrets already created there for authentication.

Release Notes

NONE

Documentation

@corneliusludmann (Contributor) left a comment

As far as I can tell, it looks good so far. However, I would like to get a second pair of eyes for this from the platform team. @meysholdt 👀

name: gitpod-issuer
spec:
acme:
email: nandaja.varma@gmail.com
Contributor:

As far as I know, you can simply remove the email line. This is only needed to notify when a cert expires (which is not relevant for test setups).

Contributor (author):

Whoops! You are right, I totally missed cleaning that up.

@meysholdt (Member) commented Jun 2, 2022

If you need an email address for internal use as a sink for all emails from LetsEncrypt - we already use letsencrypt-throwaway@gitpod.io for that. With that being said - if it's possible to not specify an email at all then that is the better solution.

Contributor (author):

turns out, I could just leave out the email key!

# We can't create a cluster with no node pool defined, but we want to only use
# separately managed node pools. So we create the smallest possible default
# node pool and immediately delete it.
remove_default_node_pool = true
Member:

Deleting the node pool and creating another node pool can make "terraform apply" ~10min slower.

Proposed solution: instead of deleting the default node pool, you could configure it to be the "service" node pool. Will this work?

Contributor:

My understanding is that the default node pool uses the default GCP service account, which grants broad permissions. For testing purposes, reusing this node pool is reasonable since it'll be extremely hard to exploit this, so I think that running services on the default node pool is reasonable. For production clusters it may be worth taking the additional cost of spinning up specialized node pools and wiping the default pool.

Contributor (author):

This is a great point. Reusing the default node pool as the service node pool is something I thought of too, but I couldn't see an option, at least in the container cluster Terraform resource, to set labels only on the default node pool (the labels being meta and ide). If I set the label field, it would apply the same labels to every node pool. For testing purposes, I imagine this is probably okay; not sure if this is the best course of action.

Contributor (author):

Turns out I couldn't have been more wrong: that option applies the labels only to the default node pool. Making this change! Thanks for the suggestion, Mo!
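A minimal sketch of that change, assuming the module's google_container_cluster resource (label keys follow Gitpod's node-labelling convention; the values shown are illustrative, not copied from the PR):

resource "google_container_cluster" "gitpod-cluster" {
  name = var.name

  # Keep the default pool and use it as the services pool instead of
  # deleting it and creating a separate one (avoids the ~10 min delete/re-create).
  remove_default_node_pool = false
  initial_node_count       = 1

  node_config {
    # node_config on the cluster resource applies to the default node pool only.
    labels = {
      "gitpod.io/workload_meta" = true
      "gitpod.io/workload_ide"  = true
    }
  }
}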


preemptible = var.pre-emptible
image_type = "UBUNTU_CONTAINERD"
disk_type = "pd-ssd"
Member:

A standard HDD is cheaper and should be sufficient.

}

preemptible = var.pre-emptible
image_type = "UBUNTU_CONTAINERD"
Member:

The COS image is optimised for performance and security. Since we don't run workspaces (which are not compatible with COS) on this node pool, I think we should use COS nodes for the service pool.
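Taken together with the disk_type comment above, the services pool defaults might end up looking like this sketch (variable names and defaults are illustrative, not taken from the module):

variable "services_image_type" {
  type        = string
  description = "Node image for the services node pool."
  default     = "COS_CONTAINERD" # COS is fine here; workspaces still need UBUNTU_CONTAINERD
}

variable "services_disk_type" {
  type        = string
  description = "Boot disk type for the services node pool."
  default     = "pd-standard" # standard persistent disk is cheaper than pd-ssd
}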


management {
auto_repair = true
auto_upgrade = false
Member:

Your intention to enable preemptible nodes shows me that you're confident the services will robustly handle restarts.
If that's true, it should be fine to enable auto_upgrade as well.

@nandajavarma (author) commented Jun 6, 2022

@meysholdt I have to be honest, I am not super confident about how fault-tolerant the Gitpod setup is. I was essentially translating the current gitpod-gke-guide configuration almost verbatim to Terraform, hence this disparity. I will make preemptible non-configurable for now and iterate on it once Gitpod is fault-tolerant enough, considering preemptible VMs are cheaper.

Contributor:

I am pretty sure prebuilds are not fault tolerant yet. If they were, we'd be running them on spot VMs now.

variable "zone" {
type = string
description = "The zone to create the cluster."
default = "b"
Member:

Can you document how to make the cluster regional (high-availability k8s control plane) vs. zonal?
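For context, this usually comes down to the cluster's location argument in the google provider: a region gives a regional (replicated) control plane, while a zone gives a zonal one. A minimal sketch (names and values are illustrative, not the module's actual code):

resource "google_container_cluster" "zonal_example" {
  name               = "gitpod-test-zonal"
  location           = "europe-west1-b" # single-zone control plane (cheaper, fine for tests)
  initial_node_count = 1
}

resource "google_container_cluster" "regional_example" {
  name               = "gitpod-test-regional"
  location           = "europe-west1" # control plane replicated across the region's zones
  initial_node_count = 1
}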

default = "1.21.11-gke.1100"
}

variable "name" {
Member:

Why make this configurable? Do you want to support more than one cluster per GCP project?

Contributor:

I see the possibility of parallel CI jobs creating parallel clusters; parameterizing the cluster name is cheap to do now but can be a bit painful to retrofit down the road. I'm in favor of this approach.

(Side note: if we have multiple parallel CI jobs then the Terraform state bucket will need a prefix per run, but this is a yak that we can shave another day.)

Contributor (author):

Like @adrienthebo suggested, the main reason for making the name parameter configurable is parallel CI runs (highly likely, since we are planning to test multiple configurations). I was hoping that having a default name reduces the hassle a bit.
Regarding the state, @adrienthebo, this is a great point. The way I have worked around it currently is by using Terraform workspaces in the Makefile, since I couldn't set the backend prefix using variables. Each job runs in its own workspace, which means the state file is named <workspace-name>.tfstate.
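Roughly what that looks like, assuming a GCS backend (the bucket name and prefix below are illustrative, not taken from the PR):

terraform {
  backend "gcs" {
    # Backend arguments cannot reference variables, hence the workspace approach.
    bucket = "gitpod-installer-tests-tfstate" # illustrative bucket name
    prefix = "gke"
  }
}

With this backend, running "terraform workspace new <job-id>" before "terraform apply" stores the state object as gke/<job-id>.tfstate, so parallel jobs keep separate state.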

}

variable "min_count" {
type = number
Member:

Why make this configurable?

default = 50
}

variable "disk_size_gb" {
Member:

Why make this configurable? I think the disk size can be decided statically based on what our services need.

Contributor (author):

The assumption was that if users decide to rely more on in-cluster storage, they would need larger node disks, so keeping it configurable made sense at the time.

default = 100
}

variable "initial_node_count" {
Member:

Why make this configurable?


project = "sh-automated-tests"
region = "europe-west1"
zone = "b"
Member:

Is this file helpful? I'm asking because variables.tf already has defaults, and for the sake of simplicity it would be nice to keep the number of files low.

Contributor (author):

Good point. Removed.


gcp_project = "sh-automated-tests"
gcp_region = "europe-west1"
gcp_zone = "europe-west1-b"
Member:

Is this file helpful? I'm asking because variables.tf already has defaults, and for the sake of simplicity it would be nice to keep the number of files low.

@kylos101 (Contributor) commented Jun 3, 2022

👋 this is awesome to see! 🙏 Thank you so much.

@corneliusludmann What is the plan to help teams own exclusive testing for their domains of expertise? For example, I do not see workspace sub-folders, or updates to the codeowners file for this PR. Is that something that we'd layer on later, in a follow-on PR?

For example, I see some workspace things now, and if we are to own them later, I would prefer that the workspace team is looped in as an explicit approver so we can have related dialog.

}
}

resource "google_container_cluster" "gitpod-cluster" {
@adrienthebo (Contributor) commented Jun 3, 2022

Noting this for information sharing, no action needed.

This module handles a number of cases that are great for a production cluster (and thus would be great for a reference architecture module), but may be omitted during testing. Features like cluster autoscaling or regional clusters will provide better performance and resilience for production traffic but for everyday system testing we shouldn't need a lot of nodes, autoscaling, or resilience in the case of a failed GCP zone.

Having a single, unified module for system tests (automated and manual) as well as reference architectures will be fantastic, and this will be great for that, so I don't think we need to pull out any features. However, keeping an eye on the overall cost of test clusters gives us an advantage: if we can drive down the cost of a single test run, then we can run tests more cheaply and more frequently.

variable "zone" {
type = string
description = "The zone to create the cluster."
default = "b"
Contributor:

I'm not sure this default will behave as expected; my understanding (rooted in the google provider circa 2019) of the google provider block is that the zone field is the fully qualified zone name, so europe-west1-b.

My understanding may be out of date though, so we'll need to check if this works as expected.

Contributor (author):

This is true, @adrienthebo. As you can see here, I am appending this value (e.g. b) to var.region (e.g. europe-west1). In retrospect, I understand this is particularly hacky. Will fix. 👍
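To illustrate the approach described above (a sketch of the concatenation, not the PR's exact code):

variable "region" {
  type    = string
  default = "europe-west1"
}

variable "zone" {
  type        = string
  description = "Zone suffix appended to the region, e.g. \"b\"."
  default     = "b"
}

locals {
  # Derive the fully qualified zone from region + single-letter suffix.
  zone = "${var.region}-${var.zone}" # => "europe-west1-b"
}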

variable "max_count" {
type = number
description = "Maximum number of nodes in the NodePool. Must be >= min_node_count."
default = 50
Contributor:

Small comment: a default max_count of 50 for Gitpod services seems pretty high, especially for tests. If we have a reasonable estimate of the number of nodes spun up during an average test run then we may be able to constrain this lower.

In addition, it may be worthwhile to have separate max_count values for the services and workspaces node pools.
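A sketch of that split (variable names and defaults are illustrative, not taken from the PR):

variable "services_max_count" {
  type        = number
  description = "Maximum number of nodes in the services node pool."
  default     = 4 # illustrative; a test run needs far fewer nodes than production
}

variable "workspaces_max_count" {
  type        = number
  description = "Maximum number of nodes in the workspaces node pool."
  default     = 10 # illustrative
}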

Contributor (author):

Our currently suggested max_count value in the guides is 100; I wanted to bring the default down a bit while keeping it configurable. We can configure this value in the main.tf used for the tests. I believe it is safe to assume that install/infra/terraform/ will hold the more production-ready code while install/tests holds test code.

@adrienthebo (Contributor) left a comment

Hello!

I'm dropping in to review this to get a feel for how y'all perform code reviews.

I'd like to start by noting that I don't yet know the culture or review practices for Gitpod; sincere apologies if I'm stepping on toes, dropping too many comments, or focusing on topics that might be out of scope for this PR. In addition I'm approaching this PR from the perspective of my role as a GCP consultant, which is an outside perspective driven by GCP best practices but isn't yet informed about Gitpod's standard GCP operations. I hope to share my experience in this domain and offer useful insights in constructive and effective ways, and getting this down will take a few review cycles. Thanks for bearing with me!

My overall perspective is that this is a rock solid foundation for system testing the Gitpod installer for self hosting purposes. I've been talking with @mrzarquon regarding reference architectures and automation for self hosted customers and this is the type of module that could be the basis for that reference, which is really awesome.

In terms of correctness, I think I saw one place (the zone variable) that needed a tweak to behave as expected but it might be using a feature (the full zone name being derived based on region + zone) that I'm not familiar with. 👍

Summarizing some of my inline comments, there's some functionality in this module that's well suited for a reference architecture but may benefit from adjustments for running in a CI system. If our installer systems tests are cheap and fast to run then we can run them more frequently, and lightweight test environments will allow upstream tests to validate changes impacting the installation process earlier.

All in all, I think this is really good and look forward to taking this for a spin!


default = 1
}

variable "pre-emptible" {
Contributor:

Tiny style concern - I think that hyphens are allowed in variable names but underscores are preferred (or omitting the separator entirely in this case).
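For example (a sketch; the type and default are assumptions, not copied from the module):

variable "preemptible" {
  type        = bool
  description = "Whether node pool VMs are preemptible."
  default     = false
}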

@nandajavarma (author) commented:
👋 this is awesome to see! 🙏 Thank you so much.

@corneliusludmann What is the plan to help teams own exclusive testing for their domains of expertise? For example, I do not see workspace sub-folders, or updates to the codeowners file for this PR. Is that something that we'd layer on later, in a follow-on PR?

For example, I see some workspace things now, and if we are to own them later, I would prefer that the workspace team is looped in as an explicit approver so we can have related dialog.

@kylos101 Yes, we need more iterations to add component-specific tests. In general, I have tried to move the current guides we have for the self-hosted setup on GKE and k3s to IaC. As for the tests, I am currently running the existing integration tests (the ones inside the tests/ directory). Since the coverage is low, we will need more iterations to add component-specific tests. The hope is that this will be easily pluggable with the current structure, and we will empower the individual teams to add tests for their domains of expertise.

@kylos101 (Contributor) commented Jun 3, 2022

@nandajavarma that is super helpful to know, thank you! So this PR is more about provisioning infrastructure and installing Gitpod, and then running our existing tests?

Is it safe to say that we'll see future iterations of this, that use different reference architectures to provision infrastructure and install Gitpod?

@nandajavarma (author) commented:

@nandajavarma that is super helpful to know, thank you! So this PR is more about provisioning infrastructure and installing Gitpod, and then running our existing tests?

Is it safe to say that we'll see future iterations of this, that use different reference architectures to provision infrastructure and install Gitpod?

@kylos101 Yes, correct! We will add automated setup of different reference architectures (e.g. the WIP AKS one here), including all available configurations (e.g. self-signed certs or otherwise) to test against, in following iterations.

@nandajavarma force-pushed the nvn/infra-tf branch 2 times, most recently from f042e5c to 5d02eed on June 6, 2022 at 18:41.
@corneliusludmann (Contributor) left a comment

Let's ship this as a first iteration and see how well it works. Good job!

@roboquat merged commit 6e583d5 into main on Jun 7, 2022.
@roboquat deleted the nvn/infra-tf branch on June 7, 2022 at 13:49.
Successfully merging this pull request may close these issues: Epic: Automated Self-Hosted Testing.
6 participants