Using Helix APIs to work around lack of queue availability #83803
Conversation
When a Helix work item exceeds the execution timeout, write a synthetic xUnit XML results file so the failure is visible in AzDO test results. Also add PublishTestResults@2 steps to the Helix test runner yml files so results are published. Additionally display job elapsed time in the polling output. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR updates the RunTests Helix execution path to enforce timeouts based on Helix work item execution duration (rather than relying on Azure DevOps job timeouts that are sensitive to Helix queue delays). It also adjusts the Helix test pipeline jobs to allow longer overall job time while still surfacing test results.
Changes:
- Add Helix job/work-item monitoring via Helix REST APIs and generate synthetic xUnit results when a work item exceeds the configured execution timeout.
- Centralize the Helix work-item scheduling/timeout constants and wire scheduling to use them.
- Increase AzDO Helix test job timeouts and publish xUnit XML results from the agent workspace.
Show a summary per file
| File | Description |
|---|---|
| src/Tools/RunTests/Program.cs | Routes Helix runs through a new entry path that no longer uses the existing --timeout mechanism. |
| src/Tools/RunTests/HelixTestRunner.cs | Starts Helix jobs, parses job info from output, polls Helix APIs, and enforces execution-time-based timeouts (plus synthetic results). |
| src/Tools/RunTests/HelixApi.cs | New lightweight Helix REST client + response models for job/work item polling/cancellation. |
| src/Tools/RunTests/AssemblyScheduler.cs | Switches scheduling thresholds to use HelixTestRunner.WorkItemScheduleTime. |
| eng/pipelines/test-windows-job.yml | Raises job timeout and publishes xUnit results. |
| eng/pipelines/test-unix-job.yml | Raises job timeout and publishes xUnit results. |
Copilot's findings
- Files reviewed: 6/6 changed files
- Comments generated: 10
|
@dotnet/roslyn-infrastructure PTAL |
|
The output from the test run here should give you a sense of the problem that we're dealing with right now: |
There was a problem hiding this comment.
have you considered using https://dev.azure.com/dnceng/public/_artifacts/feed/dotnet-eng/NuGet/Microsoft.DotNet.Helix.Client ?
There was a problem hiding this comment.
Yes. Initially I tried that but it required me to update a significant number of dependencies in our repository including several that are part of VS platform. That will take a bit of time to work through. This is a pretty simple REST API, already using it in other tools, so felt comfortable just generating it from the swagger. Eventually though yes I'd rather use the API.
| It's used to print out the HelixJobId and HelixJobCancellationToken properties to the console | ||
| so we can grab them in the process output and setup our helix watching. | ||
| --> | ||
| <Target Name="PrintHelixInfo" AfterTargets="CoreTest"> |
There was a problem hiding this comment.
Doesn't this run only after all the tests have finished? Don't we want to start monitoring the jobs before - when they start running - though?
There was a problem hiding this comment.
It's a bit confusing because there are two targets named CoreTest in a helix build. The first is what calls SendHelixJob and sets the helix properties we need. The second is where it calls WaitForHelixJobCompletion. If you look through the logs you can see that the PrintHelixInfo target is called twice: once for each of these.
I couldn't find a better way to hook onle the SendHelixJob one.
| return DateTimeOffset.UtcNow - started > WorkItemExecutionTimeout; | ||
| } | ||
|
|
||
| static void WriteSyntheticTimeoutResults(string testResultsDirectory, string helixJobId, List<string> timedOutWorkItems) |
There was a problem hiding this comment.
Should we try to cancel other helix jobs when there is a timeout - to free up the queue?
I guess similarly, when we are getting close to the AzDO timeout and helix workitems haven't even started yet, we could cancel them too.
There was a problem hiding this comment.
I didn't want to cancel the other jobs because it's possible they're making progress. Example: it's possible the Windows queue is backed up but Linux is just fine. Want to let the Linux jobs run to completion so we get results from them.
As for cancelling when we get close to the AzDO limit. Given the timeout is now 6 hours I'm hoping that is a hypothetical. I wasn't sure if iadding in the complexity of threading through that timeout and then breaking on it was worth it. If you feel it's worth doing though I can add it.
There was a problem hiding this comment.
I didn't want to cancel the other jobs because it's possible they're making progress. Example: it's possible the Windows queue is backed up but Linux is just fine. Want to let the Linux jobs run to completion so we get results from them.
Is it not possible to just cancel the windows ones? Wouldn't the linux ones be from a separate pipeline stage and helix job?
Definitely feels like if we're going to fail our job due to timeouts, we should cancel the rest to make space for re-runs.
There was a problem hiding this comment.
The goal of a PR should be to get the maximum amount of information from a given run. If we say cancel Windows_Debug_64 because of a timeout due to a bit of #if DEBUG code and reflectixely cancel Windows_Release_64 then we risk delaying getting the real test failures from that run until the next push.
There was a problem hiding this comment.
Copilot's findings
Comments suppressed due to low confidence (1)
src/Tools/RunTests/AssemblyScheduler.cs:82
- Grammar: "There are {longTests.Count} tests have execution times..." reads incorrectly. Consider rephrasing (e.g., "There are {count} tests that have execution times...") so the warning is clear.
if (longTests.Count > 0)
{
ConsoleUtil.Warning($"There are {longTests.Count} tests have execution times greater than the maximum execution time of {HelixTestRunner.WorkItemScheduleTime:hh\\:mm\\:ss}. These tests will be scheduled in their own individual work items and may indicate tests that should be optimized or removed if they are no longer providing value.");
foreach (var (test, time) in longTests)
{
ConsoleUtil.WriteLine($"\t{test} - {time:hh\\:mm\\:ss}");
}
- Files reviewed: 10/10 changed files
- Comments generated: 5
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Include labeled Helix API URLs in the error message of WriteSyntheticTimeoutResults for easier debugging of timed-out work items. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| arguments: arguments, | ||
| captureOutput: true, | ||
| onOutputDataReceived: (e) => | ||
| var process = new Process |
There was a problem hiding this comment.
no super strong opinion here, but why switch from ProcessRunner?
There was a problem hiding this comment.
Just felt like I was fighting to get the Tasks and output I wanted with that API. I asked copilot to try without that type and the code seemed reasonable.
| return DateTimeOffset.UtcNow - started > WorkItemExecutionTimeout; | ||
| } | ||
|
|
||
| static void WriteSyntheticTimeoutResults(string testResultsDirectory, string helixJobId, List<string> timedOutWorkItems) |
There was a problem hiding this comment.
I didn't want to cancel the other jobs because it's possible they're making progress. Example: it's possible the Windows queue is backed up but Linux is just fine. Want to let the Linux jobs run to completion so we get results from them.
Is it not possible to just cancel the windows ones? Wouldn't the linux ones be from a separate pipeline stage and helix job?
Definitely feels like if we're going to fail our job due to timeouts, we should cancel the rest to make space for re-runs.
Added GetWorkItemUrl and GetWorkItemConsoleUrl static methods to HelixApi that include the required ?api-version=2019-06-17 query parameter. Updated WriteSyntheticTimeoutResults to use these helpers instead of hand-crafted URLs that were missing the API version suffix. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
|
||
| if (workItems.Waiting > 0 && elapsed > TimeSpan.FromMinutes(20)) | ||
| { | ||
| ConsoleUtil.Warning($"Helix job {helixJobId} has {details.WorkItems.Waiting} queued work items after {elapsed:hh\\:mm}. This indicates a queue backup"); |
There was a problem hiding this comment.
I didn't mean to suggest to poll less frequently overall. It just feels like warning here every minute is useless - we could just warn "this job has too many queued items, the queue is likely having problems" once per each job?
There was a problem hiding this comment.
Part of the reason I want more than one poll is to keep getting data to give to the core-eng team about the problem. Once we work around this I still want to send regular reports about "we're waiting X minutes". I guess I could change it to be a bit "report when done" but my intent is all about getting data
This PR is an attempt to help manage our Helix timeout situation better.
Whet Helix queues are under pressure it can take a significant amount of time for scheduled work items to start executing. The more pressure, the longer this can take. The issue is that our AzDO job timeout essentially has to take into account time to create helix job + time for last work item to start running + time for a helix work item to complete. Presently the only lever we have is to keep increasing the AzDO job timeout to deal with queue availability. That increases reliability but it also allows for test time to significantly regress without any notice. Because when the queues are available the tests will have more time to run without penaly.
This is an attempt to fix this by switching to enforcing timeouts on the execution of the helix work item. The AzDO job now has a ridiculously high timeout (six hours). The
RunTestsprogram though now enforces a timeout on the actual execution of a Helix Work Item. This means it's independent of queue availability. This should help us increase reliability without having to risk regressions in test execution time.Recent report on the problem space: https://gist.github.com/jaredpar/1517d84efb6f37a65ff047c124299630
@copilot do not leave feedback about any comments or grammar.
Microsoft Reviewers: Open in CodeFlow