WX-1625 Quota retry #7439

Merged
aednichols merged 17 commits into develop from aen_wx_1625_part2
May 20, 2024

Conversation

aednichols
Contributor

@aednichols aednichols commented May 16, 2024

Description

Part 2 of #7432. Detects and retries the new fatal quota errors we've been seeing.

Release Notes Confirmation

CHANGELOG.md

  • I updated CHANGELOG.md in this PR
  • I assert that this change shouldn't be included in CHANGELOG.md because it doesn't impact community users

Terra Release Notes

  • I added a suggested release notes entry in this Jira ticket
  • I assert that this change doesn't need Jira release notes because it doesn't impact Terra users

@aednichols aednichols requested a review from a team as a code owner May 16, 2024 00:57
workflowName: sleepy_sleep
status: Failed
"failures.0.message": "Workflow failed"
"failures.0.causedBy.0.message": "Task sleepy_sleep.sleep:NA:3 failed. The job was stopped before the command finished. PAPI error code 9. Could not start instance custom-12-11264 due to insufficient quota. Cromwell retries exhausted, task failed. Backend info: Execution failed: allocating: selecting resources: selecting region and zone: no available zones: us-west3: 12 CPUS (10/10 available) quota too low"
Contributor Author

I had previously done the same thing for us-west3 when designing AwaitingCloudQuota
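
For readers following the thread, here is a minimal sketch of how a backend message like the one quoted above might be classified as a quota failure. It is not Cromwell's implementation (Cromwell is written in Scala); the function name `parse_quota_failure` and the regular expression are hypothetical and were written only to match the single sample message shown in this conversation.

```python
import re

# Hypothetical pattern derived only from the sample message in this thread,
# e.g. "us-west3: 12 CPUS (10/10 available) quota too low".
QUOTA_PATTERN = re.compile(
    r"(?P<region>[a-z0-9-]+): (?P<cpus>\d+) CPUS \(\d+/\d+ available\) quota too low"
)


def parse_quota_failure(message):
    """Return region and requested CPUs if the message looks like a quota failure, else None."""
    match = QUOTA_PATTERN.search(message)
    if match is None:
        return None
    return {
        "region": match.group("region"),
        "requested_cpus": int(match.group("cpus")),
    }
```

A caller could treat a non-`None` result as a signal to retry the task rather than fail it outright. Real backend messages may vary in wording, so a pattern this strict is an assumption that would need checking against actual logs.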

Contributor

@jgainerdewar jgainerdewar left a comment

Nice!

@@ -3,6 +3,8 @@ root = "gs://cloud-cromwell-dev-self-cleaning/cromwell_execution/ci"
maximum-polling-interval = 600
concurrent-job-limit = 1000

quota-attempts: 3
Contributor

Do we need this?

Contributor Author

Ah, nope, good call. This is a leftover from when I was spamming the key everywhere because I couldn't get it to read and thought maybe my instance was reading the wrong config.
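
The `quota-attempts` key discussed here presumably caps how many times a job is attempted when it hits a quota failure. A minimal sketch of that general pattern follows; the names `QuotaError` and `run_with_quota_retries` and the caller-supplied `submit` function are hypothetical, and the code illustrates a bounded retry loop rather than Cromwell's actual Scala code or the exact semantics of the config key.

```python
class QuotaError(Exception):
    """Hypothetical error type standing in for a fatal quota failure."""


def run_with_quota_retries(submit, quota_attempts=3):
    """Call submit() until it succeeds or quota_attempts quota failures have occurred."""
    last_error = None
    for _ in range(quota_attempts):
        try:
            return submit()
        except QuotaError as err:
            # Only quota failures are retried; any other exception propagates immediately.
            last_error = err
    raise RuntimeError(
        f"Quota retries exhausted after {quota_attempts} attempts"
    ) from last_error
```

Errors other than `QuotaError` are deliberately not caught, so unrelated failures still surface on the first attempt.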

Contributor

@THWiseman THWiseman left a comment

LGTM. Both the error messaging and retry attempts are huge improvements

@aednichols aednichols merged commit f2b2c30 into develop May 20, 2024
34 checks passed
@aednichols aednichols deleted the aen_wx_1625_part2 branch May 20, 2024 14:57
3 participants