
Conversation

@mrodm (Contributor) commented Mar 28, 2023

Closes #1200

This is probably related: when the step tries to trigger the job, there is already an enqueued build waiting to start.

Example of the error:

Trigger Jenkins job for signing package package_storage_candidate-0.0.1.zip
go: downloading github.com/bndr/gojenkins v1.1.0
go: downloading golang.org/x/net v0.7.0
2023/03/30 13:38:01 Triggering job: elastic+unified-release+master+sign-artifacts-with-gpg
2023/03/30 13:38:02 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 1)
ERROR: 2023/03/30 13:38:02 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:02 Retrying in 5s..
2023/03/30 13:38:07 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 2)
ERROR: 2023/03/30 13:38:07 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:07 Retrying in 5s..
2023/03/30 13:38:12 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 3)
ERROR: 2023/03/30 13:38:12 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:12 Retrying in 5s..
2023/03/30 13:38:17 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 4)
ERROR: 2023/03/30 13:38:17 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:17 Retrying in 5s..
2023/03/30 13:38:22 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 5)
ERROR: 2023/03/30 13:38:22 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:22 Retrying in 5s..
2023/03/30 13:38:27 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 6)
ERROR: 2023/03/30 13:38:27 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:27 Retrying in 5s..
2023/03/30 13:38:32 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 7)
ERROR: 2023/03/30 13:38:32 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:32 Retrying in 5s..
2023/03/30 13:38:37 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 8)
ERROR: 2023/03/30 13:38:37 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:37 Retrying in 5s..
2023/03/30 13:38:42 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 9)
ERROR: 2023/03/30 13:38:42 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:42 Retrying in 5s..
2023/03/30 13:38:47 Building job elastic+unified-release+master+sign-artifacts-with-gpg (Attempt 10)
ERROR: 2023/03/30 13:38:47 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/03/30 13:38:47 Retrying in 5s..
2023/03/30 13:38:52 Error: failed to build job elastic+unified-release+master+sign-artifacts-with-gpg: max attemps (10) reached
exit status 1

This happens when the job's inQueue value is true. On the internal-ci instance, that seems to be the case whenever a build is sitting in the queue waiting for a node:

  • Screenshot: status of the job queue
  • Screenshot: inQueue value in the Jenkins API response
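
For reference, a minimal sketch of reading this inQueue flag straight from the Jenkins JSON API with the Go standard library (the URL and the jobStatus struct are illustrative; the actual step uses github.com/bndr/gojenkins):

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// jobStatus keeps only the field we care about from <jenkins>/job/<name>/api/json.
type jobStatus struct {
	InQueue bool `json:"inQueue"`
}

func main() {
	// Illustrative URL; the real job lives on the internal-ci instance.
	url := "https://jenkins.example.com/job/elastic+unified-release+master+sign-artifacts-with-gpg/api/json"

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var status jobStatus
	if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
		panic(err)
	}

	// While inQueue is true, triggering the job fails with the
	// "is already running" error shown above.
	fmt.Println("inQueue:", status.InQueue)
}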

Example of output when retries are used (Buildkite link):

2023/04/03 14:35:55 Triggering job: elastic+unified-release+master+sign-artifacts-with-gpg
2023/04/03 14:35:56 Building job elastic+unified-release+master+sign-artifacts-with-gpg
ERROR: 2023/04/03 14:35:56 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/04/03 14:35:56 Function failed, retrying in 30s
ERROR: 2023/04/03 14:36:26 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/04/03 14:36:26 Function failed, retrying in 30s
ERROR: 2023/04/03 14:36:56 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/04/03 14:36:56 Function failed, retrying in 30s
ERROR: 2023/04/03 14:37:26 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/04/03 14:37:26 Function failed, retrying in 30s
ERROR: 2023/04/03 14:37:56 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/04/03 14:37:56 Function failed, retrying in 30s
ERROR: 2023/04/03 14:38:26 job.go:428: elastic+unified-release+master+sign-artifacts-with-gpg is already running
2023/04/03 14:38:26 Function failed, retrying in 30s
2023/04/03 14:40:51 Job triggered elastic+unified-release+master+sign-artifacts-with-gpg/5898
2023/04/03 14:40:51 Waiting to be finished [REDACTED]job/elastic+unified-release+master+sign-artifacts-with-gpg/5898/
2023/04/03 14:40:51 Build still running, waiting for 10s...
2023/04/03 14:41:01 Build still running, waiting for 10s...
2023/04/03 14:41:11 Build still running, waiting for 10s...
2023/04/03 14:41:21 Build still running, waiting for 10s...
2023/04/03 14:41:32 Build still running, waiting for 10s...
2023/04/03 14:41:42 Build still running, waiting for 10s...
2023/04/03 14:41:52 Build still running, waiting for 10s...
2023/04/03 14:42:02 Build still running, waiting for 10s...
2023/04/03 14:42:13 Build still running, waiting for 10s...
2023/04/03 14:42:23 Build still running, waiting for 10s...
2023/04/03 14:42:33 Build still running, waiting for 10s...
2023/04/03 14:42:43 Build [REDACTED]job/elastic+unified-release+master+sign-artifacts-with-gpg/5898/ finished with result: SUCCESS

@mrodm mrodm self-assigned this Mar 28, 2023
@mrodm mrodm force-pushed the add_retries_queued_jenkins branch from f2264c1 to 576582f on March 28, 2023 17:34
@mrodm (Contributor, Author) commented Mar 28, 2023

/test

(21 similar /test comments followed between Mar 29 and Apr 3, 2023.)

@mrodm mrodm marked this pull request as ready for review April 3, 2023 15:47
@mrodm mrodm requested a review from a team April 3, 2023 15:48
@mrodm mrodm changed the title from "Add retries in case already queued" to "Add retries in case jenkins job has a build already enqueued" Apr 3, 2023
log.Printf("Function failed, retrying in %v", delay)

select {
case <-time.After(delay):
Contributor:

Should we have an option similar to exponential backoff? Maybe each time we retry we pass the attempt number (let's assume attempt int) and use After(attempt * delay)?
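
In Go terms, that suggestion amounts to scaling a fixed base delay linearly by the attempt number, roughly like this (a sketch only; the function name and the 30s base are illustrative, not code from this PR):

package main

import (
	"fmt"
	"time"
)

// retryDelay sketches the suggestion above: attempt * delay. The attempt
// number has to be converted to time.Duration for the multiplication to
// type-check.
func retryDelay(attempt int, delay time.Duration) time.Duration {
	return time.Duration(attempt) * delay
}

func main() {
	for attempt := 1; attempt <= 5; attempt++ {
		fmt.Println(retryDelay(attempt, 30*time.Second)) // 30s, 1m0s, 1m30s, 2m0s, 2m30s
	}
}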

@mrodm (Contributor, Author):

Sure!
I'll add two new parameters for this:

  • --growth-factor: factor used for the exponential backoff (default probably around 1.5)
  • --max-waiting-time: maximum waiting time between attempts, so the delay between iterations does not grow unbounded.

I would go for something like this:

  • growth factor: f
  • current attempt/retry: a
  • waiting time: w
  • maximum waiting time: W
  • delay to use per retry: d
$$d = \min(f^{a} \cdot w,\; W)$$

Using a growth factor of 1.5, a base waiting time of 2 seconds and a maximum waiting time of 1 hour:

2023/04/04 10:26:26 Function failed, retrying in 2s
2023/04/04 10:26:27 Function failed, retrying in 2s
2023/04/04 10:26:28 Function failed, retrying in 4s
2023/04/04 10:26:29 Function failed, retrying in 6s
2023/04/04 10:26:30 Function failed, retrying in 10s
2023/04/04 10:26:31 Function failed, retrying in 14s
2023/04/04 10:26:32 Function failed, retrying in 22s
2023/04/04 10:26:33 Function failed, retrying in 34s
2023/04/04 10:26:34 Function failed, retrying in 50s
2023/04/04 10:26:35 Function failed, retrying in 1m16s
2023/04/04 10:26:36 Function failed, retrying in 1m54s
2023/04/04 10:26:37 Function failed, retrying in 2m52s
2023/04/04 10:26:38 Function failed, retrying in 4m18s
2023/04/04 10:26:39 Function failed, retrying in 6m28s
2023/04/04 10:26:40 Function failed, retrying in 9m42s
2023/04/04 10:26:41 Function failed, retrying in 14m34s
2023/04/04 10:26:42 Function failed, retrying in 21m52s
2023/04/04 10:26:43 Function failed, retrying in 32m50s
2023/04/04 10:26:44 Function failed, retrying in 49m14s
2023/04/04 10:26:45 Function failed, retrying in 1h0m0s
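
A minimal sketch of this capped exponential backoff (the function name backoffDelay is illustrative, not the code in this PR; exact rounding of the printed durations may differ from the output above):

package main

import (
	"fmt"
	"math"
	"time"
)

// backoffDelay implements d = min(f^a * w, W): the base waiting time w grows
// by the factor f on every attempt a (0-based) and is capped at W.
func backoffDelay(attempt int, factor float64, wait, maxWait time.Duration) time.Duration {
	d := time.Duration(math.Pow(factor, float64(attempt)) * float64(wait))
	if d > maxWait {
		return maxWait
	}
	return d
}

func main() {
	// Same parameters as the sample above: f=1.5, w=2s, W=1h.
	for attempt := 0; attempt < 20; attempt++ {
		fmt.Printf("retry %2d: %v\n", attempt, backoffDelay(attempt, 1.5, 2*time.Second, time.Hour))
	}
}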

Contributor:

Nice!

@elasticmachine (Collaborator):

💚 Build Succeeded

History

cc @mrodm

@mrodm mrodm requested a review from alexsapran April 4, 2023 10:14
Comment on lines +48 to +51
waitingTime := flag.Duration("waiting-time", 5*time.Second, fmt.Sprintf("Waiting period between each retry"))
growthFactor := flag.Float64("growth-factor", 1.25, fmt.Sprintf("Growth-Factor used for exponential backoff delays"))
retries := flag.Int("retries", 20, fmt.Sprintf("Number of retries to trigger the job"))
maxWaitingTime := flag.Duration("max-waiting-time", 60*time.Minute, fmt.Sprintf("Maximum waiting time per each retry"))
@mrodm (Contributor, Author):

With these default values, the waiting times would be:

2023/04/04 11:31:02 Function failed, retrying in 5s -> 5.00s
2023/04/04 11:31:03 Function failed, retrying in 6s -> 6.25s
2023/04/04 11:31:04 Function failed, retrying in 7s -> 7.81s
2023/04/04 11:31:05 Function failed, retrying in 9s -> 9.77s
2023/04/04 11:31:06 Function failed, retrying in 12s -> 12.21s
2023/04/04 11:31:07 Function failed, retrying in 15s -> 15.26s
2023/04/04 11:31:08 Function failed, retrying in 19s -> 19.07s
2023/04/04 11:31:09 Function failed, retrying in 23s -> 23.84s
2023/04/04 11:31:10 Function failed, retrying in 29s -> 29.80s
2023/04/04 11:31:11 Function failed, retrying in 37s -> 37.25s
2023/04/04 11:31:12 Function failed, retrying in 46s -> 46.57s
2023/04/04 11:31:13 Function failed, retrying in 58s -> 58.21s
2023/04/04 11:31:14 Function failed, retrying in 1m12s -> 72.76s
2023/04/04 11:31:15 Function failed, retrying in 1m30s -> 90.95s
2023/04/04 11:31:16 Function failed, retrying in 1m53s -> 113.69s
2023/04/04 11:31:17 Function failed, retrying in 2m22s -> 142.11s
2023/04/04 11:31:18 Function failed, retrying in 2m57s -> 177.64s
2023/04/04 11:31:19 Function failed, retrying in 3m42s -> 222.04s
2023/04/04 11:31:20 Function failed, retrying in 4m37s -> 277.56s
2023/04/04 11:31:21 Function failed, retrying in 5m46s -> 346.94s

depends_on:
- build-package
timeout_in_minutes: 30   # before
timeout_in_minutes: 90   # after
@mrodm (Contributor, Author):

Increased the timeout to 90 minutes so that the retries can complete.

@jlind23 (Contributor) left a comment:

Thanks, Mario. LGTM

@mrodm mrodm merged commit dd5c458 into elastic:main Apr 4, 2023
@mrodm mrodm deleted the add_retries_queued_jenkins branch April 4, 2023 13:56
Merging this pull request closes: Some elastic-package-package-storage-publish are waiting for jenkins build until timeout