Add DrainJobsMode #2569

Link- · 2023-05-08T14:42:52Z

Context

Fixes https://github.com/github/c2c-actions-runtime/issues/2417

This PR adds a Drain Jobs Mode flag (disabled by default) that prevents the Listener and the EphemeralRunnerSet from provisioning new runners when the gha-runner-scale-set is patched while jobs are still running. In this mode, the controller manager waits until all jobs have completed before applying the changes.

Contrary to what we discussed earlier, this mode is disabled by default and must be enabled explicitly.
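The behavior described above can be sketched as a single predicate. This is a minimal illustration that assumes the controller knows how many jobs are pending and running for the scale set; `jobCounts`, `canApplyUpgrade` and the `drainJobsMode` parameter are hypothetical names, not ARC's actual code.

```go
package main

import "fmt"

// jobCounts is a hypothetical summary of the jobs currently assigned
// to a runner scale set.
type jobCounts struct {
	Pending int
	Running int
}

// canApplyUpgrade reports whether the controller may recreate the
// listener and the EphemeralRunnerSet after the gha-runner-scale-set
// spec has been patched.
//
// With drainJobsMode disabled (the default) the change is applied right
// away. With it enabled, the controller waits until no jobs are pending
// or running, so new runners are not provisioned alongside the runners
// that are still serving jobs.
func canApplyUpgrade(drainJobsMode bool, jobs jobCounts) bool {
	if !drainJobsMode {
		return true
	}
	return jobs.Pending == 0 && jobs.Running == 0
}

func main() {
	busy := jobCounts{Pending: 1, Running: 3}
	idle := jobCounts{}

	fmt.Println(canApplyUpgrade(false, busy)) // true: apply immediately
	fmt.Println(canApplyUpgrade(true, busy))  // false: keep draining
	fmt.Println(canApplyUpgrade(true, idle))  // true: all jobs finished
}
```

A `true` result in drain mode only means the spec change may now be rolled out; how the listener and EphemeralRunnerSet are then recreated is outside this sketch.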

Side quests

~~This PR also fixes the validate-gha-chart.yaml workflow by using a released version of chart-tester 2.4.0 instead of master branch~~
~~Install kubebuilder-tools instead of a standalone kubebuilder in go.yaml~~
These have been addressed in Fix broken chart validation workflows #2589

Problem details

Reproduction steps:

1. Set up ARC with the runner image tag set to latest
2. Run N jobs
3. Change the runner image tag to a specific version
4. Run helm upgrade

TODO

This PR needs to be merged as well: Create arc-test-sleepy-matrix.yaml actions-runner-controller/arc_e2e_test_dummy#3
This PR needs to be merged: Fix runtime duration (extended) actions-runner-controller/arc_e2e_test_dummy#5

main.go

cmd/githubrunnerscalesetlistener/autoScalerService.go

nikola-jokic

LGTM

Link- · 2023-05-22T15:49:25Z

.github/actions/execute-assert-arc-e2e/action.yaml

+  wait-to-finish:
+    description: 'Wait for the workflow run to finish'
+    required: true
+    default: "true"
+  wait-to-running:
+    description: 'Wait for the workflow run to start running'
+    required: true
+    default: "false"


These have been added to make the action a bit more generic. Sometimes, we just want to trigger the workflow run and not necessarily wait until it finishes successfully. These inputs allow for that.

Link- · 2023-05-22T15:49:48Z

.github/actions/execute-assert-arc-e2e/action.yaml

+    - name: Wait for workflow to start running
+      if: inputs.wait-to-running == 'true' && inputs.wait-to-finish == 'false'
+      uses: actions/github-script@v6
+      with:
+        script: |
+          function sleep(ms) {
+            return new Promise(resolve => setTimeout(resolve, ms))
+          }
+          const owner = '${{inputs.repo-owner}}'
+          const repo = '${{inputs.repo-name}}'
+          const workflow_run_id = ${{steps.query_workflow.outputs.workflow_run}}
+          const workflow_job_id = ${{steps.query_workflow.outputs.workflow_job}}
+          let count = 0
+          while (count++<10) {
+            await sleep(30 * 1000);
+            let getRunResponse = await github.rest.actions.getWorkflowRun({
+              owner: owner,
+              repo: repo,
+              run_id: workflow_run_id
+            })
+            console.log(`${getRunResponse.data.html_url}: ${getRunResponse.data.status} (${getRunResponse.data.conclusion})`);
+            if (getRunResponse.data.status == 'in_progress') {
+              console.log(`Workflow run is in progress.`)
+              return
+            }
+          }
+          core.setFailed(`The triggered workflow run didn't start properly using ${{inputs.arc-name}}`)
+


This step waits until the workflow run is in_progress, rather than until it has finished.

nikola-jokic · 2023-05-23T08:12:56Z

main.go

@@ -131,6 +132,7 @@ func main() {
 	flag.StringVar(&logLevel, "log-level", logging.LogLevelDebug, `The verbosity of the logging. Valid values are "debug", "info", "warn", "error". Defaults to "debug".`)
 	flag.StringVar(&logFormat, "log-format", "text", `The log format. Valid options are "text" and "json". Defaults to "text"`)
 	flag.BoolVar(&autoScalingRunnerSetOnly, "auto-scaling-runner-set-only", false, "Make controller only reconcile AutoRunnerScaleSet object.")
+	flag.StringVar(&updateStrategy, "update-strategy", "immediate", "Immediately or eventually mutate resources on upgrade with running/pending jobs.")


Might be nice to provide more hints in flag docs ☺️

Suggested change

flag.StringVar(&updateStrategy, "update-strategy", "immediate", "Immediately or eventually mutate resources on upgrade with running/pending jobs.")

flag.StringVar(&updateStrategy, "update-strategy", "immediate", `Strategy for mutating resources on upgrade with running/pending jobs. (values: "eventual", "immediate"; default: "immediate")`)

Fixed, thank you
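A self-contained sketch of the revised flag definition, registered on its own `flag.FlagSet` so it can be exercised without `os.Args`. The FlagSet name `manager` and the helper `parseUpdateStrategy` are illustrative assumptions.

```go
package main

import (
	"flag"
	"fmt"
)

// parseUpdateStrategy registers --update-strategy on a dedicated FlagSet
// and returns the parsed value (or the default when the flag is absent).
func parseUpdateStrategy(args []string) (string, error) {
	fs := flag.NewFlagSet("manager", flag.ContinueOnError)
	var updateStrategy string
	fs.StringVar(&updateStrategy, "update-strategy", "immediate",
		`Strategy for mutating resources on upgrade with running/pending jobs. (values: "eventual", "immediate"; default: "immediate")`)
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return updateStrategy, nil
}

func main() {
	s, _ := parseUpdateStrategy(nil)
	fmt.Println(s) // immediate
	s, _ = parseUpdateStrategy([]string{"--update-strategy=eventual"})
	fmt.Println(s) // eventual
}
```

One design observation: for a string flag with a non-empty default, `flag.PrintDefaults` already appends `(default "immediate")` to the usage line, so the `default:` part of the hint appears twice in `-h` output.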

nikola-jokic · 2023-05-23T08:17:19Z

main.go

+		if len(updateStrategy) > 0 {
+			log.Info("update-strategy is set to: ", "updateStrategy", updateStrategy)
+		}


Suggested change

-		if len(updateStrategy) > 0 {
-			log.Info("update-strategy is set to: ", "updateStrategy", updateStrategy)
-		}
+		switch updateStrategy {
+		case "eventual", "immediate":
+		default:
+			log.Info(`Update strategy not recognized. Defaulting to "immediate"`, "updateStrategy", updateStrategy)
+			updateStrategy = "immediate"
+		}
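The suggested switch can be factored into a small helper so the fallback is unit-testable. This is a sketch; `normalizeUpdateStrategy` and its second return value are assumptions, not part of the PR.

```go
package main

import "fmt"

// normalizeUpdateStrategy mirrors the suggested switch: known values pass
// through unchanged, anything else falls back to "immediate". The second
// return value reports whether the input was recognized, so the caller
// can decide whether to log the fallback.
func normalizeUpdateStrategy(s string) (string, bool) {
	switch s {
	case "eventual", "immediate":
		return s, true
	default:
		return "immediate", false
	}
}

func main() {
	for _, in := range []string{"eventual", "immediate", "Eventual", ""} {
		out, ok := normalizeUpdateStrategy(in)
		fmt.Printf("%q -> %q (recognized: %t)\n", in, out, ok)
	}
}
```

The match is case-sensitive, so `Eventual` falls back to `immediate` as well.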

nikola-jokic · 2023-05-23T08:22:06Z

controllers/actions.github.com/autoscalingrunnerset_controller.go

@@ -57,6 +57,7 @@ type AutoscalingRunnerSetReconciler struct {
 	ControllerNamespace                           string
 	DefaultRunnerScaleSetListenerImage            string
 	DefaultRunnerScaleSetListenerImagePullSecrets []string
+	UpdateStrategy                                string


It would be awesome if we created a type for the update strategy.

type UpdateStrategy string

// Update strategies
const (
	// UpdateStrategyImmediate will apply changes right away, recreating
	// resources even while pending / running jobs exist. This can lead
	// to overprovisioning of runners.
	UpdateStrategyImmediate = UpdateStrategy("immediate")
	// UpdateStrategyEventual will not recreate resources until all
	// pending / running jobs have completed.
	// This can lead to a longer time to apply the change but it will ensure
	// that you don't have any overprovisioning of runners.
	UpdateStrategyEventual = UpdateStrategy("eventual")
)

And then the UpdateStrategy field is of type UpdateStrategy. We can document the strategies at the package level.
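With a named type, the flag string is converted once and the rest of the reconciler compares against typed constants. The trimmed `reconciler` struct and the `waitsForJobs` method below are hypothetical illustrations, not the merged ARC code.

```go
package main

import "fmt"

// UpdateStrategy controls how upgrades are applied while jobs are running.
type UpdateStrategy string

const (
	// UpdateStrategyImmediate applies changes right away.
	UpdateStrategyImmediate = UpdateStrategy("immediate")
	// UpdateStrategyEventual defers recreating resources until jobs complete.
	UpdateStrategyEventual = UpdateStrategy("eventual")
)

// reconciler is a trimmed-down, hypothetical stand-in for
// AutoscalingRunnerSetReconciler showing only the typed field.
type reconciler struct {
	UpdateStrategy UpdateStrategy
}

// waitsForJobs reports whether the strategy defers recreating resources
// until pending/running jobs have completed.
func (s UpdateStrategy) waitsForJobs() bool {
	return s == UpdateStrategyEventual
}

func main() {
	flagValue := "eventual" // e.g. the value parsed from --update-strategy
	r := reconciler{UpdateStrategy: UpdateStrategy(flagValue)}
	fmt.Println(r.UpdateStrategy.waitsForJobs()) // true
}
```

The conversion `UpdateStrategy(flagValue)` performs no validation, so it fits best after the fallback check in main.go.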

Link- · 2023-05-23T11:36:11Z

controllers/actions.github.com/autoscalingrunnerset_controller_test.go

+			autoscalingRunnerSetTestTimeout,
+			autoscalingRunnerSetTestInterval,
+		).Should(BeTrue())


@nikola-jokic I added these to this test you wrote. It was timing out intermittently: https://github.com/actions/actions-runner-controller/actions/runs/5047834860/jobs/9073153675

nikola-jokic

LGTM 🚀

Link- added 11 commits April 19, 2023 14:20

Add drain jobs mode

263ad4c

Propagate the drainjobsmode flag to the listener

bd660ad

Add tests to drain jobs mode

84ace84

Add tests to drain jobs mode

70899d7

Fix implementation

d2efe99

Update gitignore

7507be6

Add scaling down patch

5cf7254

Fix path

340b77b

Disable flag by default

58215fb

Reset gitignore

c4d31e4

Disable mode by default

a177887

Link- requested review from mumoshu, toast-gear, a team and nikola-jokic as code owners May 8, 2023 14:42

Link- added 4 commits May 8, 2023 16:43

Merge branch 'master' into Link-/fix-overprovisioning

cb79b81

Fix broken tests

6f22e6f

Remove manual build of chart-testing

f96e5aa

Upgrade chart tester to 2.4.0

a4fe73f

Link- added the gha-runner-scale-set label (Related to the gha-runner-scale-set mode) May 9, 2023

Link- added 6 commits May 9, 2023 10:10

Upgrade kubebuilder to 3.10.0

692e5b3

Fix kubebuilder installation step

6e290fe

Use kubebuilder tools instead

c4f6c58

Revert formatting changes

0dd38d9

Fix version

41be56d

Merge branch 'master' into Link-/fix-overprovisioning

05cef95

Link- self-assigned this May 9, 2023

Link- commented May 10, 2023

main.go (outdated, resolved)

Link- commented May 10, 2023

cmd/githubrunnerscalesetlistener/autoScalerService.go (outdated, resolved)

Remove changes from the listener (not needed)

2edc063

Link- added 6 commits May 22, 2023 12:14

Merge branch 'master' into Link-/fix-overprovisioning

9b4396e

Trigger upgrade via env variables instead of runner version

e38b9f1

Merge branch 'master' into Link-/fix-overprovisioning

5263e48

Add wait-to-finish flag

359c81e

Use execute-assert-arc-e2e to trigger workflows

6dba8f2

Add debug to helm upgrade

f9ec144

nikola-jokic previously approved these changes May 22, 2023

Get all pods for better visibility

ab217b2

Link- dismissed nikola-jokic’s stale review via ab217b2 May 22, 2023 12:01

Link- added 6 commits May 22, 2023 12:41

Add field selectors for specificity

155a476

Fix field selectors

2ec61c6

Add delay before the patch

5f4eec1

Extend delay

c14cb89

Better handling of waiting for runners

342d921

Fix status

d55f6bc

Link- commented May 22, 2023

Cleanup

655e153

nikola-jokic reviewed May 23, 2023

Link- added 3 commits May 23, 2023 11:08

Add timeout and test interval to flaky test

8e777af

Add UpdateStrategy type

784cd7f

Add UpdateStrategy type

8982d85

Link- commented May 23, 2023

nikola-jokic approved these changes May 23, 2023

Link- merged commit 8afef51 into master May 23, 2023
14 checks passed

Link- deleted the Link-/fix-overprovisioning branch May 23, 2023 11:42

Link- added this to the gha-runner-scale-set-0.5.0 milestone Jul 28, 2023

Link- mentioned this pull request Jul 28, 2023

Prepare 0.5.0 release #2783

Merged
