
add terraform for utility cluster. Add name override to gke #30847

Merged · 18 commits · Apr 26, 2024

Conversation

volatilemolotov (Contributor):

Adds Terraform for a utility cluster to be used for test infra.

  • Uses the existing GKE module from .test-infra/terraform
  • Adds a name override to the GKE module for when a predictable name is needed. This is mostly so we can have a stable name in all the workflows that need to access this cluster
  • Installs Kafka using Helm instead of versioned manifests
  • Removes the namespace from the Kafka kustomization to allow installing into a workflow's temporary namespace
  • Bumps the Kafka version to support the newer operator


@github-actions github-actions bot added the infra label Apr 4, 2024
```
under the License.
-->

# Overview
```

Collaborator:
Could you please add more details about what the intent is for using this cluster instead of "datastores"?

volatilemolotov (Contributor Author):
Done

andreydevyatkin (Collaborator) left a comment:
LGTM, thanks!

@volatilemolotov volatilemolotov marked this pull request as ready for review April 4, 2024 13:48
damccorm (Contributor) commented Apr 4, 2024:

@damondouglas would you mind taking a look at this one when you have a chance?

github-actions bot commented Apr 4, 2024:

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @shunping added as fallback since no labels match configuration

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

damondouglas (Contributor) left a comment:
Thank you for doing this.

```
value = google_container_cluster.default.endpoint
}

output cluster_ca_certificate {
```
damondouglas (Contributor):
Thank you for adding outputs :-). Could you tell me what this output is needed for?


damondouglas (Contributor):
I think the provisioning of the Kubernetes cluster and any workloads that depend on it should be in separate terraform modules. Then one would just follow the typical gcloud command to connect.
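For reference, connecting to the already-provisioned cluster with the typical gcloud command would look something like the following sketch; the cluster name, region, and project are taken from the values discussed elsewhere in this PR:

```
# Fetch kubeconfig credentials for the utility cluster (values from this PR's tfvars).
gcloud container clusters get-credentials beam-utility \
    --region us-central1 \
    --project apache-beam-testing
```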

Comment on lines 20 to 27
```
source                = "../google-kubernetes-engine"
project               = "apache-beam-testing"
network               = "default"
subnetwork            = "default-f91f013bcf8bd369"
region                = "us-central1"
cluster_name_prefix   = "beam-utility"
service_account_id    = "beam-github-actions@apache-beam-testing.iam.gserviceaccount.com"
cluster_name_override = "beam-utility"
```
damondouglas (Contributor):
Maybe one can just create a new tfvars file storing these values and have the workflow provision the Kubernetes cluster first, separate from the strimzi workload.
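A hedged sketch of what such a tfvars file might contain, reusing the values from the diff above (the file name matches the beam-utility.apache-beam-testing.tfvars file mentioned later in this thread):

```hcl
# beam-utility.apache-beam-testing.tfvars — values lifted from the module block above.
project               = "apache-beam-testing"
network               = "default"
subnetwork            = "default-f91f013bcf8bd369"
region                = "us-central1"
cluster_name_prefix   = "beam-utility"
cluster_name_override = "beam-utility"
service_account_id    = "beam-github-actions@apache-beam-testing.iam.gserviceaccount.com"
```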

Comment on lines 19 to 36
```
resource "helm_release" "strimzi-helm-release" {
  name             = "strimzi"
  namespace        = "strimzi"
  create_namespace = true
  repository       = "https://strimzi.io/charts/"
  chart            = "strimzi-kafka-operator"
  version          = "0.40.0"

  atomic  = "true"
  timeout = 500

  set {
    name  = "watchAnyNamespace"
    value = "true"
  }
  depends_on = [module.gke.google_container_cluster]
}
```

damondouglas (Contributor):
This could be in its own module separate from the GKE cluster provisioning.
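As a rough illustration of that separation (a sketch under assumptions, not the PR's code): a standalone workload module could look up the already-provisioned cluster by name via a data source and configure the helm provider from it, instead of referencing the GKE module's resources directly:

```hcl
# Hypothetical standalone strimzi module: reads the existing cluster
# rather than depending on the GKE provisioning module.
data "google_client_config" "default" {}

data "google_container_cluster" "utility" {
  name     = "beam-utility"
  location = "us-central1"
  project  = "apache-beam-testing"
}

provider "helm" {
  kubernetes {
    host  = "https://${data.google_container_cluster.utility.endpoint}"
    token = data.google_client_config.default.access_token
    cluster_ca_certificate = base64decode(
      data.google_container_cluster.utility.master_auth[0].cluster_ca_certificate
    )
  }
}
```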

volatilemolotov (Contributor Author):
Yes, it is possible to put it in its own module, but the idea behind the utility-cluster folder is to use the GKE module and install everything that is needed for that exact purpose via terraform, in one step. It does not make sense to me to separate out a module, as there is no intention to reuse this due to its specific purpose. Other clusters can create different folders for different purposes.
Let me know if this is fine; if not, I'll try to come up with a different structure.

damondouglas (Contributor):
In my experience, co-mingling GKE provisioning with Kubernetes workload provisioning in the same terraform module leads to problems in the future. I personally would like to see it in a separate module. I'm more than willing to defer to another Apache Beam committer's opinion, if they think the co-mingling design is ok and have a logical, well articulated reason. Otherwise, I'm not comfortable approving this PR with the current design.

In summary, my design preference is:

  1. separate GKE provisioning module - a version controlled tfvars file in the existing .test-infra/terraform/google-cloud-platform/google-kubernetes-engine folder could work
  2. separate folder responsible for provisioning the strimzi cluster

volatilemolotov (Contributor Author):

@damondouglas I have added a number of changes that implement most of what has been discussed. Please take a look when you have time. Thanks

damondouglas (Contributor) left a comment:
See akvelon#487. It was easier to create akvelon#487 instead of commenting throughout this PR.

damondouglas (Contributor) left a comment:
Thank you for making the changes. Additional questions/comments:

  1. Is .test-infra/kafka/strimzi/02-kafka-persistent/overlays/gke-internal-load-balanced/kustomization.yaml still needed?
  2. When I tested the strimzi helm chart, only the strimzi operator deployment started but nothing else related to kafka.
  3. Could you tell me the outcome of your testing these changes in a new GCP project, not apache-beam-testing.

```
region                = "us-central1"
router                = "default-us-central1-router"
router_nat            = "default-us-central1-router-nat"
cluster_name_override = "beam-utility"
```
damondouglas (Contributor):
Could we name this something more specific?

volatilemolotov (Contributor Author):
I think we should keep it as is, since we should add more to this cluster instead of creating multiple clusters.

damondouglas (Contributor):
Because Autopilot scales to the workload, we can have multiple clusters focused on a specific resource need. That's the reason for having this reusable GKE Autopilot provisioning solution. I'd argue that beam-utility will not make sense to someone trying to fix or add to the infrastructure later.

volatilemolotov (Contributor Author):
Would kafka-workflows be precise enough?

damondouglas (Contributor):
@volatilemolotov Thank you for listening. That would be great.

```
KafkaIO.write().withBootstrapServers("10.128.0.14:9094")
```
TODO: DEFINE HOW TO CONNECT TO CLUSTER; see .test-infra/kafka/bitnami/README.md
damondouglas (Contributor):
Will you be finishing this?

volatilemolotov (Contributor Author):
Yes, I have added lines to the README that explain how it's done.

```
*/

bucket = "b507e468-52e9-4e72-83e5-ecbf563eda12"
prefix = ".test-infra/terraform/google-cloud-platform/google-kubernetes-engine/beam-utility"
```
damondouglas (Contributor):
After changing the name of the cluster, could you also change this prefix to match?

Comment on lines +34 to +39
```
variable "cluster_name_override" {
  type        = string
  description = "Use this to override naming and omit the postfix. Leave empty to use prefix-suffix format"
  default     = ""
}
```

damondouglas (Contributor):
Could we remove this variable and just have the prefix to keep it simple?

volatilemolotov (Contributor Author):
We need a predictable name so we don't have to change every workflow that references the cluster each time we redeploy for any reason. I would like to keep it this way.

damondouglas (Contributor):
Why not just keep the kafka cluster running continually and delete the topics after the workflows execute?

damondouglas (Contributor):
We've had flaky tests in this repository due to waiting on spinning up new clusters.

volatilemolotov (Contributor Author):
This way we ensure it's fresh each time, which is easier than maintaining a Kafka instance and making sure it does not break between different tests. We could delete topics, but there could still be issues.
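For comparison, the topic-cleanup approach the reviewer suggests would presumably look something like the following with Strimzi's KafkaTopic CRD; this is an assumption, not part of the PR:

```
# Illustrative: delete all Strimzi-managed topics after a workflow run.
kubectl delete kafkatopics --all --namespace strimzi
```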

volatilemolotov (Contributor Author):

> • Is .test-infra/kafka/strimzi/02-kafka-persistent/overlays/gke-internal-load-balanced/kustomization.yaml still needed?
> • When I tested the strimzi helm chart, only the strimzi operator deployment started but nothing else related to kafka.

The kustomization is used in workflows that use these clusters to bring up Kafka for their testing.

volatilemolotov (Contributor Author) commented Apr 22, 2024:

> Could you tell me the outcome of your testing these changes in a new GCP project, not apache-beam-testing.

Tested it out in a project that only had APIs enabled and a default VPC. It works once I provided the subnet, router, and NAT.


damondouglas (Contributor) commented Apr 23, 2024:

> Could you tell me the outcome of your testing these changes in a new GCP project, not apache-beam-testing.

> Tested it out in a project that only had APIs enabled and a default VPC. It works once I provided the subnet, router, and NAT.

Could you explain your testing approach? The following in .test-infra/terraform/google-cloud-platform/google-kubernetes-engine/prerequisites.tf:

```
// Query the Service Account.
data "google_service_account" "default" {
  depends_on = [google_project_service.required]
  account_id = var.service_account_id
}
```

should have given you an error when you tested, because https://github.com/apache/beam/pull/30847/files#diff-e53f48e6ee35cb4d93d7b0750674c071edb78e05e90cbadda94492ef2be95cc1R27 in .test-infra/terraform/google-cloud-platform/google-kubernetes-engine/beam-utility.apache-beam-testing.tfvars (service_account_id = "beam-github-actions@apache-beam-testing.iam.gserviceaccount.com") is an email and not the service account ID.

volatilemolotov (Contributor Author):

> ... (service_account_id = "beam-github-actions@apache-beam-testing.iam.gserviceaccount.com") is an email and not the service account ID.

In the google_service_account data source an email is allowed: https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/service_account#argument-reference

damondouglas (Contributor) left a comment:
Almost there. Thank you so much for your patience.


```
router_nat            = "default-us-central1-router-nat"
cluster_name_override = "beam-utility"
cluster_name_prefix   = "beam-utility"
service_account_id    = "beam-github-actions@apache-beam-testing.iam.gserviceaccount.com"
```
damondouglas (Contributor):

> In the google_service_account data source an email is allowed: https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/service_account#argument-reference

Thank you for confirming and testing this. I recommend either changing the variable name to service_account_email and providing the email, or keeping service_account_id and changing the tfvars to be an ID only. Personally, I prefer an ID since it means less data in the configuration but still works in the same project.

volatilemolotov (Contributor Author):
What would the ID be? According to the data source argument spec:

> The following arguments are supported:
>
> [account_id](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/service_account#account_id) - (Required) The Google service account ID. This can be one of:
>
> • The name of the service account within the project (e.g. my-service)
> • The fully-qualified path to a service account resource (e.g. projects/my-project/serviceAccounts/...)
> • The email address of the service account (e.g. my-service@my-project.iam.gserviceaccount.com)

I would think that the fully-qualified path would be the ID, but that just gives out more info. I will default to just the name here, as it gives out the least info. Let me know if that is ok.
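A hedged sketch of what that choice looks like in the data source (when the project argument is omitted, the short name resolves within the provider's project):

```hcl
# Illustrative: the short service-account name resolves within the
# provider's project, so the email never appears in the tfvars.
data "google_service_account" "default" {
  account_id = "beam-github-actions"
}

# The full email remains available as an attribute when needed:
# data.google_service_account.default.email
```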


```
kubectl get svc beam-testing-cluster-kafka-external-bootstrap --namespace strimzi
DIR=.test-infra/kafka/strimzi
```
damondouglas (Contributor):
Could we:

  1. Move the terraform module into the 01-strimzi-operator folder?
  2. Keep .test-infra/kafka/strimzi/README.md where it is and change DIR=.test-infra/kafka/01-strimzi-operator

volatilemolotov (Contributor Author):
Moved.

volatilemolotov (Contributor Author):
Also left the README.md in the strimzi folder and updated the DIR instruction.


Simply deploy the cluster by using the kustomize plugin of kubectl:
```
kubectl apply -k .test-infra/kafka/strimzi/02-kafka-persistent
```
damondouglas (Contributor):
I have two points:

  1. When I tried this, I got the error:

   ```
   error: unable to find one of 'kustomization.yaml', 'kustomization.yml' or 'Kustomization' in directory '.test-infra/kafka/strimzi/02-kafka-persistent'
   ```

   This worked:

   ```
   kubectl apply -k .test-infra/kafka/strimzi/02-kafka-persistent/overlays/gke-internal-load-balanced
   ```

  2. The solution deployed into the default namespace. Was this intended? The original solution was in the default namespace. I don't mind either way. The following specifies the namespace:

   ```
   kubectl apply -k .test-infra/kafka/strimzi/02-kafka-persistent/overlays/gke-internal-load-balanced --namespace=strimzi
   ```

volatilemolotov (Contributor Author):
Fixed the path.
Yeah, it was supposed to be able to deploy to any namespace. I decided to put the strimzi namespace into the instructions for the sake of completeness.

and wait until the cluster is deployed:
```
kubectl wait kafka beam-testing-cluster --for=condition=Ready
```
damondouglas (Contributor):
I kept getting a timeout. I didn't have time to investigate this. Either investigate this or recommend using https://k9scli.io/

volatilemolotov (Contributor Author):
Added a timeout. A value of 1200 seems long, but there are cases when deployment takes longer due to how Autopilot scales.
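Presumably the adjusted command looks something like this; the exact flags in the README are not shown in this thread, so the namespace is an assumption:

```
# Illustrative: 1200s accommodates Autopilot scale-up delays.
kubectl wait kafka beam-testing-cluster --for=condition=Ready --timeout=1200s --namespace strimzi
```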

@volatilemolotov volatilemolotov mentioned this pull request Apr 26, 2024
damondouglas (Contributor) left a comment:
Thank you for all this work.

@damondouglas damondouglas merged commit 28a2682 into apache:master Apr 26, 2024
4 checks passed