Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Suggestion: Add "Cancel Helix Job" Build task #7113

Closed
2 tasks
MattGal opened this issue Mar 16, 2021 · 12 comments
Closed
2 tasks

Feature Suggestion: Add "Cancel Helix Job" Build task #7113

MattGal opened this issue Mar 16, 2021 · 12 comments
Assignees
Labels
Detected By - Ad-Hoc Testing Issue was detected by ad-hoc testing preformed by someone on the engineering team Ops - First Responder

Comments

@MattGal
Copy link
Member

MattGal commented Mar 16, 2021

  • This issue is blocking
  • This issue is causing unreasonable pain

As noted here: https://developercommunity.visualstudio.com/t/ADO-pipeline-timeouts-dont-cancel-the-s/1371617?space=21&entry=myfeedback&viewtype=all, it seems that the automated cancellation feature I built doesn't run as often as we'd expect because it's far more common for a build to time out than be cancelled via push. Perhaps ADO will fix this.

In the mean time, we could make it simpler to cancel jobs in this scenario by:

  • Introduce a build task that, given a cancellation token (could use API Access token too but not necessary since all jobs produce unique cancellation tokens), calls the Helix API for job cancellation.
  • Add ADO logging commands to emit all started jobs and their cancellation tokens for consumption by later steps
@akoeplinger
Copy link
Member

This would be really helpful for the runtime-staging pipeline where we're battling with capacity issues on the Android devices Helix queue which means we run into timeouts quite often when a lot of work gets scheduled at the same time. This in turn exacerbates the issue because work items don't get canceled.

@alexperovich
Copy link
Member

This functionality already exists in the WaitForHelixJobCompletion task, If the task gets canceled the job is canceled.

@MattGal
Copy link
Member Author

MattGal commented Mar 16, 2021

This functionality already exists in the WaitForHelixJobCompletion task, If the task gets canceled the job is canceled.

I know, I wrote it; this is for later one when they just want to explicitly cancel without cancelling the task. Long story, happy to regale you with it any time after 3 PM PST.

@akoeplinger
Copy link
Member

Yeah the problem is when a build runs into the job timeout AzDO will just kill the process, so the cancellation code in WaitForHelixJobCompletion doesn't run.

Until AzDO fixes this we'd like to run a subsequent task that uses condition: canceled() to cancel the pending workitems even on job timeouts.

@markwilkie
Copy link
Member

We are definitely seeing higher pressure on machine queues - and this would likely help. Thoughts from others?

How expensive do you think it would be to do @MattGal ?

@MattGal
Copy link
Member Author

MattGal commented Mar 31, 2021

We are definitely seeing higher pressure on machine queues - and this would likely help. Thoughts from others?

How expensive do you think it would be to do @MattGal ?

SWAG: 1-2 days. All the pieces are already there in the SDK, this would just be stitching them together.

@MattGal
Copy link
Member Author

MattGal commented Apr 1, 2021

[Async Triage] the last poster has a good point.

@riarenas
Copy link
Member

riarenas commented Apr 1, 2021

If this is causing pain and it's not expensive as per Matt's point, this could probably be done as part of FR?

@MattGal
Copy link
Member Author

MattGal commented Apr 1, 2021

If this is causing pain

Yes, if used by the Helix Android runs to work around AzDO's limitation, it would drastically reduce pain in PRs that need Android testing because it would prevent hundreds of hours of usage of limited hardware.

@ilyas1974 ilyas1974 added Detected By - Ad-Hoc Testing Issue was detected by ad-hoc testing preformed by someone on the engineering team Ops - First Responder labels Apr 1, 2021
@MattGal MattGal self-assigned this Apr 2, 2021
@MattGal
Copy link
Member Author

MattGal commented Apr 2, 2021

Grabbing this one as it should be pretty quick.

@MattGal
Copy link
Member Author

MattGal commented Apr 2, 2021

#7185

@MattGal
Copy link
Member Author

MattGal commented Apr 2, 2021

As this is an Arcade change that's now in master, I am closing the issue; natural dependency flow should get it to the runtime team next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Detected By - Ad-Hoc Testing Issue was detected by ad-hoc testing preformed by someone on the engineering team Ops - First Responder
Projects
None yet
Development

No branches or pull requests

6 participants