
Kubernetes e2e tests #4984

Open · pkoutsovasilis opened this issue Jun 21, 2024 · 10 comments · May be fixed by #5013
Labels: bug (Something isn't working), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

@pkoutsovasilis (Contributor) commented Jun 21, 2024

After this PR, the need for container-based e2e tests became apparent. The first thing that comes to mind is the existing k8s-tests defined here. After syncing with @cmacknz and @blakerouse in the context of the aforementioned PR, I took a deeper look at what these tests actually cover. Here are my findings:

  • The kind cluster is provisioned with crashing pods (tested with kind v0.20.0 and k8s v1.29.0):

    NAMESPACE     NAME                                         READY   STATUS             RESTARTS        AGE
    kube-system   etcd-kind-control-plane                      1/1     Running            0               9m6s
    kube-system   kube-apiserver-kind-control-plane            1/1     Running            0               9m6s
    kube-system   kube-controller-manager-kind-control-plane   0/1     CrashLoopBackOff   7 (3m18s ago)   9m6s
    kube-system   kube-scheduler-kind-control-plane            0/1     CrashLoopBackOff   7 (3m7s ago)    9m6s
    

    The reason is Error: unknown flag: --port, defined here. As a result, no new pods will be scheduled.

  • The tests apply the k8s manifests generated from the kustomize templates with kubectl create .... This is essentially YAML spec validation, and it does not fail because kube-apiserver-kind-control-plane is Running.

  • Even if the pods were being scheduled, the agent container image version "injected" into the k8s manifest is not the one built from the sources of a commit; it derives from here. (A sketch of how a commit-built image could be loaded into kind follows this list.)
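
For reference, once an image is built from the commit sources, it could be pushed onto the kind nodes without a registry via kind load docker-image. A minimal sketch, assuming the kind CLI is installed (the image tag and cluster name are illustrative):

```go
package kubernetes

import (
	"fmt"
	"os/exec"
)

// loadAgentImage copies a locally built agent image onto the kind cluster's
// nodes via `kind load docker-image`, so the k8s manifest can reference the
// commit-built image instead of a released one. Names are illustrative.
func loadAgentImage(image, cluster string) error {
	cmd := exec.Command("kind", "load", "docker-image", image, "--name", cluster)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("kind load docker-image: %w: %s", err, out)
	}
	return nil
}
```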

@pkoutsovasilis pkoutsovasilis added the bug Something isn't working label Jun 21, 2024
@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Jun 21, 2024
@elasticmachine (Collaborator)

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@blakerouse (Contributor)

I would really like to see the integration tests for Kubernetes work similarly to how the integration testing framework runs on different hosts. An example test would look something like:

```go
func TestExampleKubernetesIntegration(t *testing.T) {
	info := define.Require(t, define.Requirements{
		Type:  define.Kubernetes,
		Cloud: define.GKE,
	})

	client := info.KubeClient()

	// perform kubernetes work using the client
	// create a pod, push image, etc...
}
```
Then running mage integration:container would set up docker and kubernetes in GKE, AKS, etc., all based on the defined tests, and then run the tests. Exactly how it's done for the integration testing framework, but this is really even simpler because SSH is not needed to push the tests to the VM nodes.

@cmacknz (Member) commented Jun 24, 2024

I like driving this with the integration test framework for consistency.

I don't think we need to start with a remote cluster in GKE. We could, but we could also start with a kind cluster running on the local machine if that is simpler to implement. It will certainly be faster to iterate on locally, and we'll be building that anyway as the analogue of the multipass runner.

@blakerouse (Contributor)

@cmacknz Agree that starting local is easier and should be the first step. Extending to external k8s would be great for better coverage, which I think we would want to do once we have the local one working.

@pkoutsovasilis (Contributor, Author) commented Jun 25, 2024

@blakerouse @cmacknz thanks for the proposals. After looking at the mage-related code, I have the following points for further discussion:

  • integRunnerOnce -> createTestRunner seems to be the core of the integration tests at the moment, but it doesn't look ideal for k8s-based tests: we don't want to provision any runners/VMs, from either ogc or multipass, since we are going to use the VMs given by buildkite. Along the same lines, these two are tightly coupled to SSH execution, matrices, etc. So we should stay away from these two, right?
  • Which stack provisioner should we use for the needed stack: stateful, serverless, or should we support both?

Some visual aid of what I have in mind:

```mermaid
flowchart TB
    dockerPackage[build agent docker package]
    provisionCluster["provision k8s cluster [initially only kind, requires K8S_VERSION env var]"]
    exportEnvVarsK8S[export env vars for kubernetes]
    provisionStack[provision stack]
    exportEnvVars[export env vars for stack]
    invokeTest["invoke go kubernetes integration tests [separate package testing/kubernetes]"]
    innerTest["define.Require(...)"]
    subgraph mage["mage integration:kubernetes"]
        dockerPackage --> provisionCluster --> exportEnvVarsK8S --> provisionStack --> exportEnvVars --> invokeTest
    end
    subgraph test["Individual Kubernetes Test"]
        invokeTest -.-> innerTest --> kube["client := info.KubeClient()"] --> dots["..."]
    end
    buildkite[buildkite VM] --> dockerPackage
```
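
In code, the mage target could look roughly like this; everything here (the namespace type, commands, and paths) is a placeholder sketch of the diagram above, not an actual implementation:

```go
package mage

import (
	"context"
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// Integration mirrors the mage namespace; in a real magefile it would embed
// mg.Namespace. This whole target is a hypothetical sketch of the flow in
// the diagram above.
type Integration struct{}

func (Integration) Kubernetes(ctx context.Context) error {
	k8sVersion := os.Getenv("K8S_VERSION")
	if k8sVersion == "" {
		return errors.New("K8S_VERSION must be set, e.g. v1.29.0")
	}

	// each step shells out; the commands are placeholders for the real
	// packaging, provisioning, and test-invocation logic
	steps := [][]string{
		{"mage", "package"},                                                    // build agent docker package
		{"kind", "create", "cluster", "--image", "kindest/node:" + k8sVersion}, // provision k8s cluster
		{"go", "test", "-v", "./testing/kubernetes/..."},                       // invoke kubernetes integration tests
	}
	for _, step := range steps {
		cmd := exec.CommandContext(ctx, step[0], step[1:]...)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			return fmt.Errorf("step %v failed: %w", step, err)
		}
	}
	// stack provisioning and env-var export for the stack would slot in
	// between cluster creation and the go test invocation
	return nil
}
```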

@blakerouse (Contributor) commented Jun 25, 2024

> @blakerouse @cmacknz thanks for the proposals. After looking at the mage-related code, I have the following points for further discussion:
>
> • integRunnerOnce -> createTestRunner seems to be the core of the integration tests at the moment, but it doesn't look ideal for k8s-based tests: we don't want to provision any runners/VMs, from either ogc or multipass, since we are going to use the VMs given by buildkite. Along the same lines, these two are tightly coupled to SSH execution, matrices, etc. So we should stay away from these two, right?

How are developers going to run the tests locally from their developer machines if it only works with buildkite? That does not provide a way for developers to easily inspect, interact, and debug an issue if it can only be done in CI.

It is totally possible to have a buildkite provisioner in the integration testing framework that relies on information from buildkite or performs actions against buildkite. But the system must be designed in a way that ensures developers can do the above. A kind provisioner for developers would be great, and then a buildkite provisioner for CI.

> • Which stack provisioner should we use for the needed stack: stateful, serverless, or should we support both?

Would want to support both.
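
For illustration, supporting both could be as simple as a switch on an environment variable on the mage side; a minimal sketch, where the STACK_PROVISIONER variable and the returned names are assumptions, not an existing API:

```go
package mage

import (
	"fmt"
	"os"
)

// chooseStackProvisioner picks the stack provisioner from the environment,
// defaulting to stateful. Variable and provisioner names are illustrative.
func chooseStackProvisioner() (string, error) {
	switch p := os.Getenv("STACK_PROVISIONER"); p {
	case "", "stateful":
		return "stateful", nil
	case "serverless":
		return "serverless", nil
	default:
		return "", fmt.Errorf("unknown stack provisioner %q", p)
	}
}
```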

@pkoutsovasilis (Contributor, Author)

> How are developers going to run the tests locally from their developer machines if it only works with buildkite? That does not provide a way for developers to easily inspect, interact, and debug an issue if it can only be done in CI.

Probably I am missing something, but sorry, I don't follow. As far as my understanding goes, a dev who wants to test this locally would have docker and kind installed for their operating system, and they would be able to run, as an example, something like this: K8S_VERSION=v1.29.0 mage integration:kubernetes. Why is the above more demanding than requiring the dev to have multipass installed?

@cmacknz (Member) commented Jun 25, 2024

Having kind installed as a prerequisite makes sense to me, and this does not require any special handling for CI or introduce a dependency on Buildkite, as we already have the ability to set up kind on the Buildkite agents:

```yaml
- group: "K8s tests"
  key: "k8s-tests"
  steps:
    - label: "K8s tests: {{matrix.k8s_version}}"
      env:
        K8S_VERSION: "v{{matrix.k8s_version}}"
        KIND_VERSION: "v0.20.0"
      command: ".buildkite/scripts/steps/k8s-tests.sh"
      agents:
        provider: "gcp"
        image: "family/core-ubuntu-2204"
      matrix:
        setup:
          k8s_version:
            - "1.29.0"
            - "1.28.0"
            - "1.27.3"
            - "1.26.6"
      retry:
        manual:
          allowed: true
```

```sh
export PATH=$HOME/bin:${PATH}
source .buildkite/scripts/install-kubectl.sh
source .buildkite/scripts/install-kind.sh
kind create cluster --image "kindest/node:${K8S_VERSION}" --config - <<EOF
```

Given you have kind available, the "provisioning a VM" step just becomes "ensure a kind cluster exists on the local machine".

The next piece is provisioning the stack for the agent to interact with. This could be left completely unchanged, and the agent in the kind cluster could be enrolled with a real stateful or serverless deployment, just as everything else does.

Having a kind cluster available also gives us the option to put the entire stack in the kind cluster as well. This would be quite convenient for local testing as it would have no remote dependencies at all, but it may be faster to start with the existing stack provisioners and add this as a follow-up.
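
For illustration, the "ensure a kind cluster exists on the local machine" step could be a small wrapper around the kind CLI; a minimal sketch, assuming kind is on PATH (function and cluster names are illustrative):

```go
package kubernetes

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
)

// ensureKindCluster creates the named kind cluster if it does not already
// exist. It shells out to the kind CLI, which is assumed to be installed.
func ensureKindCluster(ctx context.Context, name, k8sVersion string) error {
	out, err := exec.CommandContext(ctx, "kind", "get", "clusters").Output()
	if err != nil {
		return fmt.Errorf("listing kind clusters: %w", err)
	}
	for _, cluster := range strings.Fields(string(out)) {
		if cluster == name {
			return nil // cluster already exists, nothing to do
		}
	}
	cmd := exec.CommandContext(ctx, "kind", "create", "cluster",
		"--name", name,
		"--image", "kindest/node:"+k8sVersion)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("creating kind cluster: %w: %s", err, out)
	}
	return nil
}
```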

@cmacknz (Member) commented Jun 25, 2024

Essentially, this would lead us to replacing everything from this point onward in the existing k8s-tests.sh script with just mage integration:kubernetes, which would also work locally as long as you have kind on your machine.

```sh
kind create cluster --image "kindest/node:${K8S_VERSION}" --config - <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    scheduler:
      extraArgs:
        bind-address: "0.0.0.0"
        # this --port extraArg is the source of the "unknown flag: --port"
        # CrashLoopBackOff noted above; the flag was removed from
        # kube-scheduler in newer Kubernetes releases
        port: "10251"
        secure-port: "10259"
    controllerManager:
      extraArgs:
        bind-address: "0.0.0.0"
        # likewise removed from kube-controller-manager
        port: "10252"
        secure-port: "10257"
EOF
kubectl cluster-info
make -C deploy/kubernetes test
```

@blakerouse (Contributor)

Needing kind installed locally works fine for me. I just believe the integration testing framework should set up a new cluster to perform the work against. Look at the code here: https://github.com/elastic/elastic-agent/blob/main/pkg/testing/multipass/provisioner.go; it provides an interface for setting up VMs to perform work. We should use the same type of interface for setting up a kind cluster, as this should expand in the future to set up AKS, GKE, etc. (See the sketch below.)
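
For illustration, a kind-backed provisioner with that kind of interface could look like the following minimal sketch; the interface and all names here are hypothetical, modeled loosely on the shape of the multipass provisioner rather than its actual API:

```go
package kind

import (
	"context"
	"fmt"
	"os/exec"
	"path/filepath"
)

// ClusterProvisioner is an assumed interface in the spirit of the multipass
// VM provisioner; the real interface in pkg/testing would likely differ.
type ClusterProvisioner interface {
	Provision(ctx context.Context, k8sVersion string) (kubeconfig string, err error)
	Clean(ctx context.Context) error
}

type provisioner struct {
	name string // kind cluster name
}

// NewProvisioner returns a kind-backed ClusterProvisioner.
func NewProvisioner(name string) ClusterProvisioner {
	return &provisioner{name: name}
}

// Provision creates a kind cluster for the requested Kubernetes version and
// returns the path of a kubeconfig tests can use to build a client.
func (p *provisioner) Provision(ctx context.Context, k8sVersion string) (string, error) {
	kubeconfig := p.kubeconfigPath()
	cmd := exec.CommandContext(ctx, "kind", "create", "cluster",
		"--name", p.name,
		"--image", "kindest/node:"+k8sVersion,
		"--kubeconfig", kubeconfig)
	if out, err := cmd.CombinedOutput(); err != nil {
		return "", fmt.Errorf("kind create cluster: %w: %s", err, out)
	}
	return kubeconfig, nil
}

// Clean deletes the kind cluster created by Provision.
func (p *provisioner) Clean(ctx context.Context) error {
	return exec.CommandContext(ctx, "kind", "delete", "cluster", "--name", p.name).Run()
}

func (p *provisioner) kubeconfigPath() string {
	return filepath.Join("/tmp", "kind-"+p.name+"-kubeconfig")
}
```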

@pkoutsovasilis linked pull request #5013 on Jun 27, 2024 that will close this issue.