Add installer tests for GKE and k3s #10365

Merged · merged 1 commit into main from nvn/infra-tf on Jun 7, 2022
Conversation

@nandajavarma (Contributor) commented May 30, 2022

Description

This PR adds the first draft of automated installer tests for GKE and k3s. As of this PR, the testing pipeline consists of the following pieces:

  • An installer-tests.ts script with phases defined (each pointing to a target in a Makefile that performs an operation like creating the cluster, installing Gitpod, etc.). A test configuration is basically a combination of multiple phases. This allows easy reuse of code and testing against multiple configurations.
  • The Makefile with the targets mentioned above.
  • YAML files that define cron jobs, each of which initiates one of the test configurations. Two are added with this PR: 1) GKE and 2) k3s.

The rest consists of configuration needed by the make targets. This includes:

  • Terraform modules for infrastructure creation. With this PR, we have 1) gke and 2) k3s modules.
  • Manifests necessary for the Gitpod configuration, such as cluster-issuer and kots-config, all in the manifests folder.
  • A small script to run the existing tests, to be cleaned up in a future iteration.

The Terraform modules can be used individually to set up a cluster; usage is documented in the corresponding READMEs.

Related Issue(s)

Fixes #7316

How to test

To create a GKE cluster, set up Gitpod, and run the existing tests on it, run:

werft run github -f -s .werft/installer-tests.ts -j .werft/gke-installer-tests.yaml -a debug=true

Similarly, to run on a k3s cluster:

werft run github -f -s .werft/installer-tests.ts -j .werft/k3s-installer-tests.yaml -a debug=true

Make sure to run this from a workspace that has access to the core-dev cluster, since it uses secrets already created there for authentication.

Release Notes

NONE

Documentation

@corneliusludmann (Contributor) left a comment

As far as I can tell, it looks good so far. However, I would like to get a second pair of eyes for this from the platform team. @meysholdt 👀

name: gitpod-issuer
spec:
acme:
email: nandaja.varma@gmail.com
Contributor:

As far as I know, you can simply remove the email line. This is only needed to notify when a cert expires (which is not relevant for test setups).

Contributor (author):

Whoops! You are right, I totally missed cleaning that up.

@meysholdt (Member) commented Jun 2, 2022

If you need an email address for internal use as a sink for all emails from LetsEncrypt - we already use letsencrypt-throwaway@gitpod.io for that. With that being said - if it's possible to not specify an email at all then that is the better solution.

Contributor (author):

turns out, I could just leave out the email key!

# We can't create a cluster with no node pool defined, but we want to only use
# separately managed node pools. So we create the smallest possible default
# node pool and immediately delete it.
remove_default_node_pool = true
Member:

Deleting the node pool and creating another node pool can make "terraform apply" ~10min slower.

Proposed solution: instead of deleting the default node pool, you could configure it to be the "service" node pool. Will this work?

Contributor:

My understanding is that the default node pool uses the default GCP service account, which grants broad permissions. For testing purposes, reusing this node pool is reasonable since it'll be extremely hard to exploit this, so I think that running services on the default node pool is reasonable. For production clusters it may be worth taking the additional cost of spinning up specialized node pools and wiping the default pool.

Contributor (author):

This is a great point. Reusing the default node pool as the service node pool is something I thought of too, but I couldn't see an option, at least in the container cluster Terraform resource, to set labels only on the default node pool (the labels being meta and ide). If I set the label field, it would apply the same labels to every node pool. For testing purposes, I imagine this is probably okay; not sure if this is the best course of action.

Contributor (author):

Turns out I couldn't have been more wrong: that option applies the labels only to the default node pool. Making this change! Thanks for the suggestion, Mo!
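A minimal sketch of that change, assuming the module's google_container_cluster resource (label keys follow Gitpod's node-labelling convention; the values shown are illustrative, not copied from the PR):

resource "google_container_cluster" "gitpod-cluster" {
  name = var.name

  # Keep the default pool and use it as the services pool instead of
  # deleting it and creating a separate one (avoids the ~10 min delete/re-create).
  remove_default_node_pool = false
  initial_node_count       = 1

  node_config {
    # node_config on the cluster resource applies to the default node pool only.
    labels = {
      "gitpod.io/workload_meta" = true
      "gitpod.io/workload_ide"  = true
    }
  }
}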


preemptible = var.pre-emptible
image_type = "UBUNTU_CONTAINERD"
disk_type = "pd-ssd"
Member:

A standard HDD is cheaper and should be sufficient.

}

preemptible = var.pre-emptible
image_type = "UBUNTU_CONTAINERD"
Member:

The COS image is optimised for performance and security. Since we don't run workspaces (which are not compatible with COS) on this node pool, I think we should use COS nodes for the service pool.
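Taken together with the disk_type comment above, the services pool defaults might end up looking like this sketch (variable names and defaults are illustrative, not taken from the module):

variable "services_image_type" {
  type        = string
  description = "Node image for the services node pool."
  default     = "COS_CONTAINERD" # COS is fine here; workspaces still need UBUNTU_CONTAINERD
}

variable "services_disk_type" {
  type        = string
  description = "Boot disk type for the services node pool."
  default     = "pd-standard" # standard persistent disk is cheaper than pd-ssd
}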


management {
auto_repair = true
auto_upgrade = false
Member:

Your intention to enable preemptible nodes shows me that you're confident the services will robustly handle restarts.
If that's true, it should be fine to enable auto_upgrade as well.

@nandajavarma (author) commented Jun 6, 2022

@meysholdt I have to be honest, I am not super confident about how fault-tolerant the Gitpod setup is. I was essentially translating the current gitpod-gke-guide configuration almost verbatim to Terraform, hence this disparity. I will make preemptible non-configurable for now and iterate on it once Gitpod is fault-tolerant enough, considering preemptible VMs are cheaper.

Contributor:

I am pretty sure prebuilds are not fault tolerant yet. If they were, we'd be running them on spot VMs now.

variable "zone" {
type = string
description = "The zone to create the cluster."
default = "b"
Member:

Can you document how to make the cluster regional (high-availability k8s control plane) vs. zonal?
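For context, this usually comes down to the cluster's location argument in the google provider: a region gives a regional (replicated) control plane, while a zone gives a zonal one. A minimal sketch (names and values are illustrative, not the module's actual code):

resource "google_container_cluster" "zonal_example" {
  name               = "gitpod-test-zonal"
  location           = "europe-west1-b" # single-zone control plane (cheaper, fine for tests)
  initial_node_count = 1
}

resource "google_container_cluster" "regional_example" {
  name               = "gitpod-test-regional"
  location           = "europe-west1" # control plane replicated across the region's zones
  initial_node_count = 1
}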

default = "1.21.11-gke.1100"
}

variable "name" {
Member:

Why make this configurable? Do you want to support more than one cluster per GCP project?

Contributor:

I see the possibility of parallel CI jobs creating parallel clusters; parameterizing the cluster name is cheap to do now but can be a bit painful to retrofit down the road. I'm in favor of this approach.

(Side note: if we have multiple parallel CI jobs then the Terraform state bucket will need a prefix per run, but this is a yak that we can shave another day.)

Contributor (author):

Like @adrienthebo suggested, the main reason for making the name parameter configurable is parallel CI runs (highly likely, since we are planning to test multiple configurations). I was hoping that having a default name reduces the hassle a bit.
Regarding the state, @adrienthebo, this is a great point. The way I have worked around it currently is by using Terraform workspaces in the Makefile, since I couldn't set the backend prefix using variables. Each job runs in its own workspace, which means the state file is named <workspace-name>.tfstate.
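Roughly what that looks like, assuming a GCS backend (the bucket name and prefix below are illustrative, not taken from the PR):

terraform {
  backend "gcs" {
    # Backend arguments cannot reference variables, hence the workspace approach.
    bucket = "gitpod-installer-tests-tfstate" # illustrative bucket name
    prefix = "gke"
  }
}

With this backend, running "terraform workspace new <job-id>" before "terraform apply" stores the state object as gke/<job-id>.tfstate, so parallel jobs keep separate state.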

}

variable "min_count" {
type = number
Member:

Why make this configurable?

default = 50
}

variable "disk_size_gb" {
Member:

Why make this configurable? I think the disk size can be decided statically based on what our services need.

Contributor (author):

The assumption was that if users decide to rely more on in-cluster storage, they would need larger node disks, so keeping it configurable made sense at the time.

default = 100
}

variable "initial_node_count" {
Member:

Why make this configurable?


project = "sh-automated-tests"
region = "europe-west1"
zone = "b"
Member:

Is this file helpful? I'm asking because variables.tf already has defaults, and for the sake of simplicity it would be nice to keep the number of files low.

Contributor (author):

Good point. Removed.


gcp_project = "sh-automated-tests"
gcp_region = "europe-west1"
gcp_zone = "europe-west1-b"
Member:

Is this file helpful? I'm asking because variables.tf already has defaults, and for the sake of simplicity it would be nice to keep the number of files low.

@kylos101 (Contributor) commented Jun 3, 2022

👋 this is awesome to see! 🙏 Thank you so much.

@corneliusludmann What is the plan to help teams own exclusive testing for their domains of expertise? For example, I do not see workspace sub-folders, or updates to the codeowners file for this PR. Is that something that we'd layer on later, in a follow-on PR?

For example, I see some workspace things now, and if we are to own them later, I would prefer that the workspace team is looped in as an explicit approver so we can have related dialog.

}
}

resource "google_container_cluster" "gitpod-cluster" {
@adrienthebo (Contributor) commented Jun 3, 2022

Noting this for information sharing, no action needed.

This module handles a number of cases that are great for a production cluster (and thus would be great for a reference architecture module), but may be omitted during testing. Features like cluster autoscaling or regional clusters will provide better performance and resilience for production traffic but for everyday system testing we shouldn't need a lot of nodes, autoscaling, or resilience in the case of a failed GCP zone.

Having a single, unified module for system tests (automated and manual) as well as reference architectures will be fantastic, and this will be great for that, so I don't think we need to pull out any features. However, keeping an eye on the overall cost of test clusters gives us an advantage: if we can drive down the cost of a single test run, then we can run tests more cheaply and more frequently.

variable "zone" {
type = string
description = "The zone to create the cluster."
default = "b"
Contributor:

I'm not sure this default will behave as expected; my understanding (rooted in the google provider circa 2019) of the google provider block is that the zone field is the fully qualified zone name, so europe-west1-b.

My understanding may be out of date though, so we'll need to check if this works as expected.

Contributor (author):

This is true, @adrienthebo. As you can see here, I am appending this value (e.g. b) to var.region (e.g. europe-west1). In retrospect, I understand this is particularly hacky. Will fix. 👍
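To illustrate the approach described above (a sketch of the concatenation, not the PR's exact code):

variable "region" {
  type    = string
  default = "europe-west1"
}

variable "zone" {
  type        = string
  description = "Zone suffix appended to the region, e.g. \"b\"."
  default     = "b"
}

locals {
  # Derive the fully qualified zone from region + single-letter suffix.
  zone = "${var.region}-${var.zone}" # => "europe-west1-b"
}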

variable "max_count" {
type = number
description = "Maximum number of nodes in the NodePool. Must be >= min_node_count."
default = 50
Contributor:

Small comment: a default max_count of 50 for Gitpod services seems pretty high, especially for tests. If we have a reasonable estimate of the number of nodes spun up during an average test run then we may be able to constrain this lower.

In addition, it may be worthwhile to have separate max_count values for the services and workspaces node pools.
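A sketch of that split (variable names and defaults are illustrative, not taken from the PR):

variable "services_max_count" {
  type        = number
  description = "Maximum number of nodes in the services node pool."
  default     = 4 # illustrative; a test run needs far fewer nodes than production
}

variable "workspaces_max_count" {
  type        = number
  description = "Maximum number of nodes in the workspaces node pool."
  default     = 10 # illustrative
}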

Contributor (author):

Our currently suggested max_count value in the guides is 100; I wanted to bring the default down a bit while keeping it configurable. We can configure this value in the main.tf used for the tests. I believe it is safe to assume that install/infra/terraform/ will hold the more production-ready code while install/tests holds test code.

@adrienthebo (Contributor) left a comment

Hello!

I'm dropping in to review this to get a feel for how y'all perform code reviews.

I'd like to start by noting that I don't yet know the culture or review practices for Gitpod; sincere apologies if I'm stepping on toes, dropping too many comments, or focusing on topics that might be out of scope for this PR. In addition I'm approaching this PR from the perspective of my role as a GCP consultant, which is an outside perspective driven by GCP best practices but isn't yet informed about Gitpod's standard GCP operations. I hope to share my experience in this domain and offer useful insights in constructive and effective ways, and getting this down will take a few review cycles. Thanks for bearing with me!

My overall perspective is that this is a rock solid foundation for system testing the Gitpod installer for self hosting purposes. I've been talking with @mrzarquon regarding reference architectures and automation for self hosted customers and this is the type of module that could be the basis for that reference, which is really awesome.

In terms of correctness, I think I saw one place (the zone variable) that needed a tweak to behave as expected but it might be using a feature (the full zone name being derived based on region + zone) that I'm not familiar with. 👍

Summarizing some of my inline comments, there's some functionality in this module that's well suited for a reference architecture but may benefit from adjustments for running in a CI system. If our installer systems tests are cheap and fast to run then we can run them more frequently, and lightweight test environments will allow upstream tests to validate changes impacting the installation process earlier.

All in all, I think this is really good and look forward to taking this for a spin!


default = 1
}

variable "pre-emptible" {
Contributor:

Tiny style concern - I think that hyphens are allowed in variable names but underscores are preferred (or omitting the separator entirely in this case).
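For example (a sketch; the type and default are assumptions, not copied from the module):

variable "preemptible" {
  type        = bool
  description = "Whether node pool VMs are preemptible."
  default     = false
}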

@nandajavarma (author) commented:
👋 this is awesome to see! 🙏 Thank you so much.

@corneliusludmann What is the plan to help teams own exclusive testing for their domains of expertise? For example, I do not see workspace sub-folders, or updates to the codeowners file for this PR. Is that something that we'd layer on later, in a follow-on PR?

For example, I see some workspace things now, and if we are to own them later, I would prefer that the workspace team is looped in as an explicit approver so we can have related dialog.

@kylos101 Yes, we need more iterations to add component-specific tests. In general, I have tried to move the current guides we have for the self-hosted setup on GKE and k3s to IaC. As for the tests, I am currently running the existing integration tests (the ones inside the tests/ directory). Since the coverage is low, we will need more iterations to add component-specific tests. The hope is that this will be easily pluggable with the current structure, and we will empower the individual teams to add tests for their domains of expertise.

@kylos101 (Contributor) commented Jun 3, 2022

@nandajavarma that is super helpful to know, thank you! So this PR is more about provisioning infrastructure and installing Gitpod, and then running our existing tests?

Is it safe to say that we'll see future iterations of this, that use different reference architectures to provision infrastructure and install Gitpod?

@nandajavarma (author) commented:

@nandajavarma that is super helpful to know, thank you! So this PR is more about provisioning infrastructure and installing Gitpod, and then running our existing tests?

Is it safe to say that we'll see future iterations of this, that use different reference architectures to provision infrastructure and install Gitpod?

@kylos101 Yes, correct! We will add automated setup of different reference architectures (e.g. the WIP AKS one here), including all available configurations (e.g. self-signed certs or otherwise) to test against, in following iterations.

@nandajavarma force-pushed the nvn/infra-tf branch 2 times, most recently from f042e5c to 5d02eed on June 6, 2022 at 18:41.
@corneliusludmann (Contributor) left a comment

Let's ship this as a first iteration and see how well it works. Good job!

@roboquat merged commit 6e583d5 into main on Jun 7, 2022.
@roboquat deleted the nvn/infra-tf branch on June 7, 2022 at 13:49.
Successfully merging this pull request may close these issues: Epic: Automated Self-Hosted Testing.
6 participants