
Conversation

@mrodm (Contributor) commented Mar 28, 2023

Closes #1200

This is probably related: when the step tries to trigger the job, there is already an enqueued build waiting to start.

Example of the error:

Trigger Jenkins job for signing package package_storage_candidate-0.0.1.zip
go: downloading github.com/bndr/gojenkins v1.1.0
go: downloading golang.org/x/net v0.7.0
2023/03/30 13:38:01 Triggering job: elastic+unified-release+master+sign-artifacts-with-gpg
2023/03/30 13:38:02 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 1)
ERROR: 2023/03/30 13:38:02 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:02 Retrying in 5s..
2023/03/30 13:38:07 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 2)
ERROR: 2023/03/30 13:38:07 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:07 Retrying in 5s..
2023/03/30 13:38:12 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 3)
ERROR: 2023/03/30 13:38:12 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:12 Retrying in 5s..
2023/03/30 13:38:17 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 4)
ERROR: 2023/03/30 13:38:17 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:17 Retrying in 5s..
2023/03/30 13:38:22 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 5)
ERROR: 2023/03/30 13:38:22 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:22 Retrying in 5s..
2023/03/30 13:38:27 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 6)
ERROR: 2023/03/30 13:38:27 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:27 Retrying in 5s..
2023/03/30 13:38:32 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 7)
ERROR: 2023/03/30 13:38:32 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:32 Retrying in 5s..
2023/03/30 13:38:37 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 8)
ERROR: 2023/03/30 13:38:37 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:37 Retrying in 5s..
2023/03/30 13:38:42 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 9)
ERROR: 2023/03/30 13:38:42 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:42 Retrying in 5s..
2023/03/30 13:38:47 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 10)
ERROR: 2023/03/30 13:38:47 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:47 Retrying in 5s..
2023/03/30 13:38:52 Error: failed to build job elastic+unified-release+master+sign-artifacts-with-gpg: max attemps (10) reached
exit status 1

This happens when the job's inQueue value is true. On the internal-ci instance, that seems to be the case whenever a build is sitting in the queue waiting for a node:

  • Screenshot: status of the job queue
  • Screenshot: inQueue value in the Jenkins API response
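
For reference, a minimal sketch of reading this inQueue flag straight from the Jenkins JSON API with the Go standard library (the URL and the jobStatus struct are illustrative; the actual step uses github.com/bndr/gojenkins):

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// jobStatus keeps only the field we care about from <jenkins>/job/<name>/api/json.
type jobStatus struct {
	InQueue bool `json:"inQueue"`
}

func main() {
	// Illustrative URL; the real job lives on the internal-ci instance.
	url := "https://jenkins.example.com/job/elastic+unified-release+master+sign-artifacts-with-gpg/api/json"

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var status jobStatus
	if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
		panic(err)
	}

	// While inQueue is true, triggering the job fails with the
	// "is already running" error shown above.
	fmt.Println("inQueue:", status.InQueue)
}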

Example of output when retries are used (Buildkite link):

2023/04/03 14:35:55 Triggering job: elastic+unified-release+master+sign-artifacts-with-gpg
2023/04/03 14:35:56 Building job elastic+unified-release+master+sign-artifacts-with-gpg
ERROR: 2023/04/03 14:35:56 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/04/03 14:35:56 Function failed, retrying in 30s
ERROR: 2023/04/03 14:36:26 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/04/03 14:36:26 Function failed, retrying in 30s
ERROR: 2023/04/03 14:36:56 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/04/03 14:36:56 Function failed, retrying in 30s
ERROR: 2023/04/03 14:37:26 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/04/03 14:37:26 Function failed, retrying in 30s
ERROR: 2023/04/03 14:37:56 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/04/03 14:37:56 Function failed, retrying in 30s
ERROR: 2023/04/03 14:38:26 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/04/03 14:38:26 Function failed, retrying in 30s
2023/04/03 14:40:51 Job triggered elastic+unified-release+master+sign-artifacts-with-gpg/5898
2023/04/03 14:40:51 Waiting to be finished [REDACTED]job/elastic+unified-release+master+sign-artifacts-with-gpg/5898/
2023/04/03 14:40:51 Build still running, waiting for 10s...
2023/04/03 14:41:01 Build still running, waiting for 10s...
2023/04/03 14:41:11 Build still running, waiting for 10s...
2023/04/03 14:41:21 Build still running, waiting for 10s...
2023/04/03 14:41:32 Build still running, waiting for 10s...
2023/04/03 14:41:42 Build still running, waiting for 10s...
2023/04/03 14:41:52 Build still running, waiting for 10s...
2023/04/03 14:42:02 Build still running, waiting for 10s...
2023/04/03 14:42:13 Build still running, waiting for 10s...
2023/04/03 14:42:23 Build still running, waiting for 10s...
2023/04/03 14:42:33 Build still running, waiting for 10s...
2023/04/03 14:42:43 Build [REDACTED]job/elastic+unified-release+master+sign-artifacts-with-gpg/5898/ finished with result: SUCCESS

@mrodm mrodm self-assigned this Mar 28, 2023
@mrodm mrodm force-pushed the add_retries_queued_jenkins branch from f2264c1 to 576582f on March 28, 2023 17:34
@mrodm (Contributor, Author) commented Mar 28, 2023

/test

(21 similar /test comments followed between Mar 29 and Apr 3, 2023.)

@mrodm mrodm marked this pull request as ready for review April 3, 2023 15:47
@mrodm mrodm requested a review from a team April 3, 2023 15:48
@mrodm mrodm changed the title from "Add retries in case already queued" to "Add retries in case jenkins job has a build already enqueued" Apr 3, 2023
log.Printf("Function failed, retrying in %v", delay)

select {
case <-time.After(delay):
Contributor:

Should we have an option similar to exponential backoff? Maybe each time we retry we pass the attempt number (let's assume attempt int) and use After(attempt * delay)?
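
In Go terms, that suggestion amounts to scaling a fixed base delay linearly by the attempt number, roughly like this (a sketch only; the function name and the 30s base are illustrative, not code from this PR):

package main

import (
	"fmt"
	"time"
)

// retryDelay sketches the suggestion above: attempt * delay. The attempt
// number has to be converted to time.Duration for the multiplication to
// type-check.
func retryDelay(attempt int, delay time.Duration) time.Duration {
	return time.Duration(attempt) * delay
}

func main() {
	for attempt := 1; attempt <= 5; attempt++ {
		fmt.Println(retryDelay(attempt, 30*time.Second)) // 30s, 1m0s, 1m30s, 2m0s, 2m30s
	}
}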

@mrodm (Contributor, Author):

Sure!
I'll add two new parameters for this:

  • --growth-factor: factor used for the exponential backoff (default probably around 1.5)
  • --max-waiting-time: maximum waiting time between attempts, so the delay between iterations does not grow unbounded.

I would go for something like this:

  • growth factor: f
  • current attempt/retry: a
  • waiting time: w
  • maximum waiting time: W
  • delay to use per retry: d
$$d = \min(f^{a} \cdot w,\; W)$$

Using a growth factor of 1.5, a base waiting time of 2 seconds and a maximum waiting time of 1 hour:

2023/04/04 10:26:26 Function failed, retrying in 2s
2023/04/04 10:26:27 Function failed, retrying in 2s
2023/04/04 10:26:28 Function failed, retrying in 4s
2023/04/04 10:26:29 Function failed, retrying in 6s
2023/04/04 10:26:30 Function failed, retrying in 10s
2023/04/04 10:26:31 Function failed, retrying in 14s
2023/04/04 10:26:32 Function failed, retrying in 22s
2023/04/04 10:26:33 Function failed, retrying in 34s
2023/04/04 10:26:34 Function failed, retrying in 50s
2023/04/04 10:26:35 Function failed, retrying in 1m16s
2023/04/04 10:26:36 Function failed, retrying in 1m54s
2023/04/04 10:26:37 Function failed, retrying in 2m52s
2023/04/04 10:26:38 Function failed, retrying in 4m18s
2023/04/04 10:26:39 Function failed, retrying in 6m28s
2023/04/04 10:26:40 Function failed, retrying in 9m42s
2023/04/04 10:26:41 Function failed, retrying in 14m34s
2023/04/04 10:26:42 Function failed, retrying in 21m52s
2023/04/04 10:26:43 Function failed, retrying in 32m50s
2023/04/04 10:26:44 Function failed, retrying in 49m14s
2023/04/04 10:26:45 Function failed, retrying in 1h0m0s
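
A minimal sketch of this capped exponential backoff (the function name backoffDelay is illustrative, not the code in this PR; exact rounding of the printed durations may differ from the output above):

package main

import (
	"fmt"
	"math"
	"time"
)

// backoffDelay implements d = min(f^a * w, W): the base waiting time w grows
// by the factor f on every attempt a (0-based) and is capped at W.
func backoffDelay(attempt int, factor float64, wait, maxWait time.Duration) time.Duration {
	d := time.Duration(math.Pow(factor, float64(attempt)) * float64(wait))
	if d > maxWait {
		return maxWait
	}
	return d
}

func main() {
	// Same parameters as the sample above: f=1.5, w=2s, W=1h.
	for attempt := 0; attempt < 20; attempt++ {
		fmt.Printf("retry %2d: %v\n", attempt, backoffDelay(attempt, 1.5, 2*time.Second, time.Hour))
	}
}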

Contributor:

Nice!

@elasticmachine (Collaborator):

💚 Build Succeeded

History

cc @mrodm

@mrodm mrodm requested a review from alexsapran April 4, 2023 10:14
Comment on lines +48 to +51
waitingTime := flag.Duration("waiting-time", 5*time.Second, fmt.Sprintf("Waiting period between each retry"))
growthFactor := flag.Float64("growth-factor", 1.25, fmt.Sprintf("Growth-Factor used for exponential backoff delays"))
retries := flag.Int("retries", 20, fmt.Sprintf("Number of retries to trigger the job"))
maxWaitingTime := flag.Duration("max-waiting-time", 60*time.Minute, fmt.Sprintf("Maximum waiting time per each retry"))
@mrodm (Contributor, Author):

With these default values, the waiting times would be:

2023/04/04 11:31:02 Function failed, retrying in 5s -> 5.00s
2023/04/04 11:31:03 Function failed, retrying in 6s -> 6.25s
2023/04/04 11:31:04 Function failed, retrying in 7s -> 7.81s
2023/04/04 11:31:05 Function failed, retrying in 9s -> 9.77s
2023/04/04 11:31:06 Function failed, retrying in 12s -> 12.21s
2023/04/04 11:31:07 Function failed, retrying in 15s -> 15.26s
2023/04/04 11:31:08 Function failed, retrying in 19s -> 19.07s
2023/04/04 11:31:09 Function failed, retrying in 23s -> 23.84s
2023/04/04 11:31:10 Function failed, retrying in 29s -> 29.80s
2023/04/04 11:31:11 Function failed, retrying in 37s -> 37.25s
2023/04/04 11:31:12 Function failed, retrying in 46s -> 46.57s
2023/04/04 11:31:13 Function failed, retrying in 58s -> 58.21s
2023/04/04 11:31:14 Function failed, retrying in 1m12s -> 72.76s
2023/04/04 11:31:15 Function failed, retrying in 1m30s -> 90.95s
2023/04/04 11:31:16 Function failed, retrying in 1m53s -> 113.69s
2023/04/04 11:31:17 Function failed, retrying in 2m22s -> 142.11s
2023/04/04 11:31:18 Function failed, retrying in 2m57s -> 177.64s
2023/04/04 11:31:19 Function failed, retrying in 3m42s -> 222.04s
2023/04/04 11:31:20 Function failed, retrying in 4m37s -> 277.56s
2023/04/04 11:31:21 Function failed, retrying in 5m46s -> 346.94s

depends_on:
- build-package
timeout_in_minutes: 30   # before
timeout_in_minutes: 90   # after
@mrodm (Contributor, Author):

Increased the timeout to 90 minutes so that the retries can complete.

@jlind23 (Contributor) left a comment:

Thanks, Mario. LGTM

@mrodm mrodm merged commit dd5c458 into elastic:main Apr 4, 2023
@mrodm mrodm deleted the add_retries_queued_jenkins branch April 4, 2023 13:56
Merging this pull request closes: Some elastic-package-package-storage-publish are waiting for jenkins build until timeout