
[Action] Bootstrap Kubernetes cluster with IaC tooling #1

Closed
Tracked by #182
nikimanoledaki opened this issue Oct 4, 2023 · 29 comments · Fixed by #28

Comments

@nikimanoledaki
Contributor

nikimanoledaki commented Oct 4, 2023

Cluster API

We may want to use the Equinix Metal Cluster API Provider (CAPEM) for our cluster bootstrapping on the community cluster. Alternatives such as Ansible or Firecracker microVMs are being considered, to work with Falco's setup: cncf/tag-env-sustainability#182

Requirements

The cluster requirements are listed in the design doc.

Equinix infrastructure access

This issue will help us determine what kind of access individual contributors will need to the infra. Please see this for some of the available options, and follow up in that thread with any questions/issues.

Documentation

We should document this process as we go.

Development environment

Dev environment setup tracked in this issue: #3

@nikimanoledaki nikimanoledaki changed the title Bootstrap cluster with CAPEM Bootstrap cluster with Cluster API Oct 4, 2023
@rossf7
Contributor

rossf7 commented Oct 5, 2023

@nikimanoledaki I've been investigating using CAPI / CAPEM and I think us not having access to a permanent management cluster is a problem.

In the Equinix docs they have an alternative approach intended for management clusters using K3s managed by Pulumi.
https://deploy.equinix.com/developers/guides/k3s-management-plane/

I've added more detail in the design doc. PTAL

Do you think this is a good direction?

@nikimanoledaki
Contributor Author

nikimanoledaki commented Oct 11, 2023

Update: @rossf7 and I discussed this and wrote a summary here. We will have more information after today's WG Green Reviews meeting with the Falco maintainers.

@rossf7
Contributor

rossf7 commented Oct 13, 2023

Update following the WG meeting and the discussions I've had since with @nikimanoledaki:

We think an important factor is how we isolate between test runs to have accurate results.

  • Least isolation needed would be to run each test on a separate node
  • Most isolation would be to use CAPEM and run each test in a separate workload cluster

We think we’ll need an IaC tool to manage the management cluster if we use CAPEM or the whole cluster if we don’t use CAPEM. At the moment we’re leaning toward Ansible for that. https://github.com/equinix/ansible-collection-metal

However, we both think it's important to continue work on the pipeline design. We can then design the cluster topology to support the pipeline rather than the other way round.

Lastly I added some notes to the design doc for #1 (comment) on K3s / Pulumi that are now outdated and would have been better here. 🤦‍♂️

I've removed them and updated the CAPI / CAPEM section. Sorry about that!

@nikimanoledaki nikimanoledaki changed the title Bootstrap cluster with Cluster API Bootstrap Kubernetes cluster Oct 18, 2023
@AntonioDiTuri
Contributor

WG meeting recap:

We had thoughtful discussions and agreed on actionable steps toward the main objective of developing an end-to-end proof of concept for the WG Green Reviews.
The core focus is on manually measuring Falco using Kepler (no automation for the moment), an initiative tracked under this milestone.

In the design doc you can find a first draft of the workflow:

  • Deployment of Kepler:
  • Deployment of Falco: Initial step to set the groundwork for the process.
  • Deployment of Demo Workload: Utilizing Google Cloud Platform’s microservices demo to create a workload for Falco to monitor.
  • Execution of Benchmark Tests: Employing k6 tests to call demo workload endpoints, ensuring the robustness and reliability of our setup.

We also thought of testing Falco in two ways:

  • Test 0 (Idle State): A foundational test to measure Falco in an idle state, providing insights into the baseline performance metrics (Falco maintainers have pointed out that this will be minimal for Falco if there are no kernel events happening for Falco to react to.)
  • Test 1 (Load Tests): A more intensive examination, suggested by Federico Di Pierro, to understand Falco’s CPU usage and event capturing capacity over a stipulated time.
    Some more details about the nature of the tests can be found in the docs under the section called “1.Falco”.

The roadmap ahead is well-defined with practical next steps, which will be documented soon as issues under the designated milestone.
Niki will check how to give contributors access to the cluster.
After that, we will start testing and documenting the installation of every component of the workflow (Kepler, Falco, workload).
Once the end-to-end PoC is done, we will think about how to automate it.

Thanks @rossf7 for his invaluable documentation on manual Equinix cluster creation; his pull request is open for comments.
Thanks @nikimanoledaki for her exceptional coordination efforts driving this project forward.

The team can’t wait for the initial measurements, let’s continue to collaborate and innovate as always!

@nikimanoledaki nikimanoledaki changed the title Bootstrap Kubernetes cluster Bootstrap Kubernetes cluster with IaC tooling Oct 27, 2023
@dipankardas011
Contributor

If any help is needed I am happy to contribute 👍🏼

@rossf7
Contributor

rossf7 commented Oct 28, 2023

I added some more detail in the design doc.

We want to use an IaC tool we can run in a GitHub action. Ansible and OpenTofu have both been discussed. I'd be fine with using Ansible (although full disclosure I don't have much experience with it).

We will need to provision the control plane and worker nodes as Equinix servers and they have integrations for Ansible and Terraform.

For each server we need to configure user_data that will bootstrap the node.

For provisioning Kubernetes we could use kubeadm, unless anyone can suggest a better approach?

@dipankardas011 help with this would be much appreciated. I think we first need to agree on the design. Would you like to work on that?

cc @nikimanoledaki @guidemetothemoon @leonardpahlke @AntonioDiTuri
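To make the user_data idea concrete, here is a rough, hypothetical sketch of what a kubeadm-based bootstrap script for the control plane node might look like (package repository setup is elided, and the pod CIDR is only a placeholder):

```shell
#!/usr/bin/env bash
set -euo pipefail

# ...install containerd plus kubeadm/kubelet/kubectl from the Kubernetes
# package repository here (elided for brevity)...

# Initialise the control plane; the CIDR below is a placeholder value.
kubeadm init --pod-network-cidr=10.244.0.0/16

# Print the join command that worker nodes would run in their own user_data.
kubeadm token create --print-join-command
```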

@dipankardas011
Contributor

dipankardas011 commented Oct 28, 2023

@rossf7 Sure, I will give it a try.
My follow-up question: is this design in any way different from what you expected?

@dipankardas011
Contributor

dipankardas011 commented Oct 28, 2023

Also, I'm not sure what is meant by the design. Is it deciding on the Infrastructure as Code part, or just the diagrams?

@rossf7
Contributor

rossf7 commented Oct 28, 2023

@dipankardas011 If you would like to investigate how you would do the Infrastructure as Code part that would be great.

But please don't spend too much time on it until we've heard from the rest of the team.

I'm happy to help with the Equinix Metal integration as I've worked with their infra quite a bit.

@dipankardas011
Contributor

Okay, I will create a basic diagram of the workflow.

@dipankardas011
Contributor

Should I create it on Excalidraw or draw.io?
Which one would be most comfortable for you all?

@dipankardas011
Contributor

dipankardas011 commented Oct 28, 2023

@AntonioDiTuri
Contributor

I cannot access it. It says I don't have permission.

@dipankardas011
Contributor

Fixed the link.

@rossf7
Contributor

rossf7 commented Oct 31, 2023

Hi @dipankardas011, thanks, the diagram is looking good!

In the diagram, OpenTofu (Terraform) is used to provision the Equinix servers and Ansible is used to provision Kubernetes with kubeadm. Do you think we could use a single tool for both? Or are there advantages to using separate tools?

For the GitOps part you have this described as "GitOps for CNCF projects". I think this should be "GitOps for pipeline components". Could you update that?

This is because we want to use Flux to manage the components that should always be running like Prometheus. The CNCF projects like Falco and any workload specific test workloads will be managed by the pipeline.

@dipankardas011
Contributor

In the diagram, OpenTofu (Terraform) is used to provision the Equinix servers and Ansible is used to provision Kubernetes with kubeadm. Do you think we could use a single tool for both? Or are there advantages to using separate tools?

In my experience, we can add the script in the user_data section when we provision the infra (IaC tools), and then configure it using Ansible.

I think this method involving 2 tools is good when the infra needs to be configured many times. But in our case, as it's mostly a one-time declaration, we can add it to the user_data section.

Also, another issue I have seen is that if an error occurs in the user_data section we don't get any signal that an error occurred; just wanted to point that out.

@dipankardas011
Contributor

dipankardas011 commented Oct 31, 2023

For the GitOps part you have this described as "GitOps for CNCF projects". I think this should be "GitOps for pipeline components". Could you update that?

Yes

@dipankardas011
Contributor

For the GitOps part you have this described as "GitOps for CNCF projects". I think this should be "GitOps for pipeline components". Could you update that?

Yes

Updated!

@rossf7
Contributor

rossf7 commented Oct 31, 2023

In my experience, we can add the script in the user_data section when we provision the infra (IaC tools), and then configure it using Ansible.

I think this method involving 2 tools is good when the infra needs to be configured many times.

Yes, exactly that: the script in the user_data can run the IaC tool. I agree using 2 tools makes sense, provided we can use the Equinix Terraform module with OpenTofu.

Also, another issue I have seen is that if an error occurs in the user_data section we don't get any signal that an error occurred; just wanted to point that out.

Good catch 👍 we will need to handle that. We have some contacts at Equinix. So we can try asking them for some guidance if needed.

@rossf7
Contributor

rossf7 commented Nov 2, 2023

As suggested by @nikimanoledaki we could use this directory structure with the IaC code under infrastructure and the Kubernetes manifests under clusters managed by Flux.

├── infrastructure
│   └── equinix-metal
├── clusters
│   └── production

See #5 (comment)
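With that layout, Flux could be bootstrapped against the clusters path, something like the following (the owner and repository values are placeholders, not confirmed names):

```shell
# Point Flux at the Kubernetes manifests under clusters/production;
# --owner and --repository below are hypothetical placeholders.
flux bootstrap github \
  --owner=example-org \
  --repository=example-repo \
  --path=clusters/production
```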

@rossf7
Contributor

rossf7 commented Nov 7, 2023

I did a spike to investigate this and I've created a WIP PR to get feedback #6

Dipankar, I think the original design you proposed, using OpenTofu / Terraform to manage the Equinix infra and Ansible to provision Kubernetes, is good. I don't see a benefit to using Ansible to manage both.

OpenTofu has a GitHub Action that works well and I think does everything we need: https://github.com/opentofu/setup-opentofu

I'm using an S3 bucket to store the state. It looks like we can request an S3 bucket and credentials via servicedesk?

@dipankardas011 Would you like to work on the Ansible playbook?

@nikimanoledaki @guidemetothemoon @leonardpahlke @AntonioDiTuri Please take a look at the PR when you have time.

Leo / Niki no worries if that is after KubeCon!
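Once the bucket exists, the state backend could be supplied at init time rather than hard-coded; a sketch, with placeholder bucket/key/region values:

```shell
# Initialise OpenTofu with an S3 state backend passed via -backend-config.
# Bucket, key, and region below are placeholders, not the real values.
tofu init \
  -backend-config="bucket=example-tfstate-bucket" \
  -backend-config="key=infrastructure/equinix-metal/terraform.tfstate" \
  -backend-config="region=us-east-1"
```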

@dipankardas011
Contributor

@dipankardas011 Would you like to work on the Ansible playbook?

Okay then we can use the user_data section 👍

@rossf7
Contributor

rossf7 commented Nov 8, 2023

Discussed with @wrkode and @dipankardas011 in the WG slack channel. We think there may be some advantages to using K3s instead of Kubeadm.

It makes it easier to provision the cluster and we could run the K3s steps in the user_data of the TF code so we wouldn't need Ansible. It is also a lighter distribution meaning the energy consumption of the cluster should be reduced.

The main challenge is that we need to get the K3S_TOKEN from the control plane node and pass it to the worker nodes. Dipankar has experience doing this from working on ksctl, which supports K3s: https://github.com/kubesimplify/ksctl
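One way the token handoff could work, assuming the token is pre-generated and injected by the IaC tool rather than read back from the control plane (the `${...}` variable names are Terraform-style placeholders):

```shell
# Control plane node user_data: install K3s in server mode with a
# pre-generated token injected by OpenTofu/Terraform.
curl -sfL https://get.k3s.io | K3S_TOKEN="${k3s_token}" sh -s - server

# Worker node user_data: join the cluster using the same token and the
# control plane's address (also templated in by the IaC tool).
curl -sfL https://get.k3s.io | \
  K3S_URL="https://${control_plane_ip}:6443" \
  K3S_TOKEN="${k3s_token}" sh -s - agent
```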

@wrkode

wrkode commented Nov 8, 2023

Discussed with @wrkode and @dipankardas011 in the WG slack channel. We think there may be some advantages to using K3s instead of Kubeadm.

It makes it easier to provision the cluster and we could run the K3s steps in the user_data of the TF code so we wouldn't need Ansible. It is also a lighter distribution meaning the energy consumption of the cluster should be reduced.

The main challenge is that we need to get the K3S_TOKEN from the control plane node and pass it to the worker nodes. Dipankar has experience doing this from working on ksctl, which supports K3s: https://github.com/kubesimplify/ksctl

We can also use the K3s shell script to bring up the cluster; this will pass the tokens and stand up the workers.

@leonardpahlke
Member

I can take a look at this early next week.

@nikimanoledaki
Contributor Author

nikimanoledaki commented Nov 22, 2023

This should be unblocked once we get AWS access to use an S3 bucket: #8

@rossf7
Contributor

rossf7 commented Nov 28, 2023

PR is updated with user_data to provision K8s with K3s, added by @dipankardas011. Next step is installing Cilium for CNI using its Helm chart.

Once we have the AWS credentials for S3 we can add the secrets to the repo. There is an extra secret needed for the K3S_AGENT_TOKEN.

@dipankardas011
Contributor

PR is updated with user_data to provision K8s with K3s, added by @dipankardas011. Next step is installing Cilium for CNI using its Helm chart.

Once we have the AWS credentials for S3 we can add the secrets to the repo. There is an extra secret needed for the K3S_AGENT_TOKEN.

Helm install instructions for Cilium:
https://docs.cilium.io/en/stable/installation/k8s-install-helm/
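Following those docs, the install would look roughly like this. Note that for Cilium to take over as CNI, K3s would need to be started with its default Flannel CNI disabled (e.g. --flannel-backend=none); the values below are a minimal sketch, not tuned settings.

```shell
# Add the Cilium Helm repository and install the chart into kube-system.
helm repo add cilium https://helm.cilium.io/
helm repo update
helm install cilium cilium/cilium \
  --namespace kube-system \
  --set operator.replicas=1   # one operator replica is plenty for a small cluster
```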

@nikimanoledaki nikimanoledaki changed the title Bootstrap Kubernetes cluster with IaC tooling [Action] Bootstrap Kubernetes cluster with IaC tooling Jan 9, 2024
@kvendingoldo

BTW, you can also integrate tenv, which supports Terraform as well as OpenTofu (and Terragrunt :) ) in one tool. It allows you to simplify version management.
