Using Helix APIs to work around lack of queue availability by jaredpar · Pull Request #83803 · dotnet/roslyn

jaredpar · 2026-05-20T14:53:55Z

This PR is an attempt to help manage our Helix timeout situation better.

When Helix queues are under pressure it can take a significant amount of time for scheduled work items to start executing. The more pressure, the longer this can take. The issue is that our AzDO job timeout essentially has to account for the time to create the Helix job + the time for the last work item to start running + the time for a Helix work item to complete. Presently the only lever we have is to keep increasing the AzDO job timeout to deal with queue availability. That increases reliability, but it also allows test time to regress significantly without any notice, because when the queues are available the tests have more time to run without penalty.

This is an attempt to fix this by switching to enforcing timeouts on the execution of the helix work item. The AzDO job now has a ridiculously high timeout (six hours). The RunTests program though now enforces a timeout on the actual execution of a Helix Work Item. This means it's independent of queue availability. This should help us increase reliability without having to risk regressions in test execution time.
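A minimal sketch of the idea, not the PR's actual code: the clock for a work item starts only when Helix reports it as started, so time spent waiting in the queue never counts against the timeout. The `WorkItemTiming` type, its method name, and the 30-minute value are assumptions for illustration; the PR refers to a `WorkItemExecutionTimeout` constant whose value is not shown here.

```csharp
using System;

// Hypothetical helper for illustration; names and the timeout value are assumptions.
public static class WorkItemTiming
{
    // Placeholder value, not the PR's configured timeout.
    public static readonly TimeSpan ExecutionTimeout = TimeSpan.FromMinutes(30);

    // 'started' is null while the work item is still waiting in the queue.
    public static bool HasExceededExecutionTimeout(DateTimeOffset? started, DateTimeOffset now)
    {
        if (started == null)
        {
            // Queue time is not execution time, so a queued item never times out here.
            return false;
        }

        return now - started.Value > ExecutionTimeout;
    }
}
```

Passing `now` as a parameter, rather than reading `DateTimeOffset.UtcNow` inside the method, keeps the check easy to exercise with fixed timestamps.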

Recent report on the problem space: https://gist.github.com/jaredpar/1517d84efb6f37a65ff047c124299630

@copilot do not leave feedback about any comments or grammar.

… work item details

When a Helix work item exceeds the execution timeout, write a synthetic xUnit XML results file so the failure is visible in AzDO test results. Also add PublishTestResults@2 steps to the Helix test runner yml files so results are published. Additionally display job elapsed time in the polling output. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR updates the RunTests Helix execution path to enforce timeouts based on Helix work item execution duration (rather than relying on Azure DevOps job timeouts that are sensitive to Helix queue delays). It also adjusts the Helix test pipeline jobs to allow longer overall job time while still surfacing test results.

Changes:

Add Helix job/work-item monitoring via Helix REST APIs and generate synthetic xUnit results when a work item exceeds the configured execution timeout.
Centralize the Helix work-item scheduling/timeout constants and wire scheduling to use them.
Increase AzDO Helix test job timeouts and publish xUnit XML results from the agent workspace.
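As a rough illustration of the synthetic-results idea in the list above, the sketch below builds an xUnit v2 style XML document with one failed test per timed-out work item. It is an assumption-laden sketch rather than the PR's `WriteSyntheticTimeoutResults`: the `SyntheticResults` type, the test and assembly naming, and the exact set of attributes emitted are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not the PR's implementation.
public static class SyntheticResults
{
    public static XDocument BuildTimeoutResults(string helixJobId, IReadOnlyList<string> timedOutWorkItems, TimeSpan timeout)
    {
        // One failed <test> element per work item that exceeded the timeout.
        var tests = timedOutWorkItems.Select(workItem =>
            new XElement("test",
                new XAttribute("name", workItem + ".ExecutionTimeout"),
                new XAttribute("type", workItem),
                new XAttribute("method", "ExecutionTimeout"),
                new XAttribute("time", "0"),
                new XAttribute("result", "Fail"),
                new XElement("failure",
                    new XAttribute("exception-type", "System.TimeoutException"),
                    new XElement("message",
                        new XCData($"Work item {workItem} in Helix job {helixJobId} exceeded {timeout}.")))));

        var count = timedOutWorkItems.Count;
        return new XDocument(
            new XElement("assemblies",
                new XElement("assembly",
                    new XAttribute("name", "helix-timeout-" + helixJobId),
                    new XAttribute("total", count),
                    new XAttribute("passed", 0),
                    new XAttribute("failed", count),
                    new XAttribute("skipped", 0),
                    new XElement("collection",
                        new XAttribute("name", "Helix work item timeouts"),
                        new XAttribute("total", count),
                        new XAttribute("passed", 0),
                        new XAttribute("failed", count),
                        new XAttribute("skipped", 0),
                        tests))));
    }
}
```

Calling `BuildTimeoutResults(...).Save(path)` with a path under the test results directory would give a publishing step such as PublishTestResults@2 a file to pick up, assuming that step is configured for xUnit-format results.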

Summary per file:

| File | Description |
| --- | --- |
| src/Tools/RunTests/Program.cs | Routes Helix runs through a new entry path that no longer uses the existing `--timeout` mechanism. |
| src/Tools/RunTests/HelixTestRunner.cs | Starts Helix jobs, parses job info from output, polls Helix APIs, and enforces execution-time-based timeouts (plus synthetic results). |
| src/Tools/RunTests/HelixApi.cs | New lightweight Helix REST client + response models for job/work item polling/cancellation. |
| src/Tools/RunTests/AssemblyScheduler.cs | Switches scheduling thresholds to use `HelixTestRunner.WorkItemScheduleTime`. |
| eng/pipelines/test-windows-job.yml | Raises job timeout and publishes xUnit results. |
| eng/pipelines/test-unix-job.yml | Raises job timeout and publishes xUnit results. |

Copilot's findings

Files reviewed: 6/6 changed files
Comments generated: 10

jaredpar · 2026-05-20T22:00:15Z

@dotnet/roslyn-infrastructure PTAL

Copilot

Copilot's findings

Files reviewed: 6/6 changed files
Comments generated: 3

jaredpar · 2026-05-21T00:03:53Z

The output from the test run here should give you a sense of the problem that we're dealing with right now:

Job Time: 00:21 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:21. This indicates a queue backup
Job Time: 00:22 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:22. This indicates a queue backup
Job Time: 00:23 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:23. This indicates a queue backup
Job Time: 00:24 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:24. This indicates a queue backup
Job Time: 00:25 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:25. This indicates a queue backup
Job Time: 00:26 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:26. This indicates a queue backup
Job Time: 00:27 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:27. This indicates a queue backup
Job Time: 00:28 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:28. This indicates a queue backup
Job Time: 00:29 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:29. This indicates a queue backup
Job Time: 00:30 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:30. This indicates a queue backup
Job Time: 00:31 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:31. This indicates a queue backup
Job Time: 00:32 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:32. This indicates a queue backup
Job Time: 00:33 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:33. This indicates a queue backup
Job Time: 00:34 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:34. This indicates a queue backup
Job Time: 00:35 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:35. This indicates a queue backup
Job Time: 00:36 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:36. This indicates a queue backup
Job Time: 00:37 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:37. This indicates a queue backup
Job Time: 00:38 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:38. This indicates a queue backup

jjonescz · 2026-05-21T12:02:07Z

have you considered using https://dev.azure.com/dnceng/public/_artifacts/feed/dotnet-eng/NuGet/Microsoft.DotNet.Helix.Client ?

Yes. Initially I tried that but it required me to update a significant number of dependencies in our repository including several that are part of VS platform. That will take a bit of time to work through. This is a pretty simple REST API, already using it in other tools, so felt comfortable just generating it from the swagger. Eventually though yes I'd rather use the API.

jjonescz · 2026-05-21T12:13:45Z

+                It's used to print out the HelixJobId and HelixJobCancellationToken properties to the console
+                so we can grab them in the process output and setup our helix watching.
+              -->
+              <Target Name="PrintHelixInfo" AfterTargets="CoreTest">


Doesn't this run only after all the tests have finished? Don't we want to start monitoring the jobs before - when they start running - though?

It's a bit confusing because there are two targets named CoreTest in a helix build. The first is what calls SendHelixJob and sets the helix properties we need. The second is where it calls WaitForHelixJobCompletion. If you look through the logs you can see that the PrintHelixInfo target is called twice: once for each of these.

I couldn't find a better way to hook only the SendHelixJob one.

jjonescz · 2026-05-21T12:17:31Z

+            return DateTimeOffset.UtcNow - started > WorkItemExecutionTimeout;
+        }
+
+        static void WriteSyntheticTimeoutResults(string testResultsDirectory, string helixJobId, List<string> timedOutWorkItems)


Should we try to cancel other helix jobs when there is a timeout - to free up the queue?

I guess similarly, when we are getting close to the AzDO timeout and helix workitems haven't even started yet, we could cancel them too.

I didn't want to cancel the other jobs because it's possible they're making progress. Example: it's possible the Windows queue is backed up but Linux is just fine. Want to let the Linux jobs run to completion so we get results from them.

As for cancelling when we get close to the AzDO limit: given the timeout is now 6 hours, I'm hoping that is a hypothetical. I wasn't sure if adding the complexity of threading that timeout through and then breaking on it was worth it. If you feel it's worth doing though I can add it.

I didn't want to cancel the other jobs because it's possible they're making progress. Example: it's possible the Windows queue is backed up but Linux is just fine. Want to let the Linux jobs run to completion so we get results from them.

Is it not possible to just cancel the windows ones? Wouldn't the linux ones be from a separate pipeline stage and helix job?

Definitely feels like if we're going to fail our job due to timeouts, we should cancel the rest to make space for re-runs.

The goal of a PR should be to get the maximum amount of information from a given run. If we, say, cancel Windows_Debug_64 because of a timeout due to a bit of #if DEBUG code and reflexively cancel Windows_Release_64, then we risk delaying getting the real test failures from that run until the next push.

Copilot

Copilot's findings

Comments suppressed due to low confidence (1)

src/Tools/RunTests/AssemblyScheduler.cs:82

Grammar: "There are {longTests.Count} tests have execution times..." reads incorrectly. Consider rephrasing (e.g., "There are {count} tests that have execution times...") so the warning is clear.

            if (longTests.Count > 0)
            {
                ConsoleUtil.Warning($"There are {longTests.Count} tests have execution times greater than the maximum execution time of {HelixTestRunner.WorkItemScheduleTime:hh\\:mm\\:ss}.  These tests will be scheduled in their own individual work items and may indicate tests that should be optimized or removed if they are no longer providing value.");
                foreach (var (test, time) in longTests)
                {
                    ConsoleUtil.WriteLine($"\t{test} - {time:hh\\:mm\\:ss}");
                }

Files reviewed: 10/10 changed files
Comments generated: 5

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot

Copilot's findings

Files reviewed: 10/10 changed files
Comments generated: 5

Include labeled Helix API URLs in the error message of WriteSyntheticTimeoutResults for easier debugging of timed-out work items. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Copilot's findings

Files reviewed: 10/10 changed files
Comments generated: 8

dibarbet · 2026-05-21T21:47:41Z

-            arguments: arguments,
-            captureOutput: true,
-            onOutputDataReceived: (e) =>
+        var process = new Process


no super strong opinion here, but why switch from ProcessRunner?

Just felt like I was fighting to get the Tasks and output I wanted with that API. I asked copilot to try without that type and the code seemed reasonable.

dibarbet · 2026-05-21T21:49:18Z

+            return DateTimeOffset.UtcNow - started > WorkItemExecutionTimeout;
+        }
+
+        static void WriteSyntheticTimeoutResults(string testResultsDirectory, string helixJobId, List<string> timedOutWorkItems)


I didn't want to cancel the other jobs because it's possible they're making progress. Example: it's possible the Windows queue is backed up but Linux is just fine. Want to let the Linux jobs run to completion so we get results from them.

Is it not possible to just cancel the windows ones? Wouldn't the linux ones be from a separate pipeline stage and helix job?

Definitely feels like if we're going to fail our job due to timeouts, we should cancel the rest to make space for re-runs.

Added GetWorkItemUrl and GetWorkItemConsoleUrl static methods to HelixApi that include the required ?api-version=2019-06-17 query parameter. Updated WriteSyntheticTimeoutResults to use these helpers instead of hand-crafted URLs that were missing the API version suffix. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
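The sketch below shows the shape such helpers might take. Only the method names and the `api-version=2019-06-17` query parameter come from the commit message above; the `HelixUrls` type name, the base address, and the path layout are assumptions and may not match the real `HelixApi` class.

```csharp
using System;

// Hypothetical stand-in for the helpers described above.
public static class HelixUrls
{
    // Assumed base address and path layout; only the api-version value comes from the PR text.
    const string BaseUrl = "https://helix.dot.net/api";
    const string ApiVersion = "2019-06-17";

    public static string GetWorkItemUrl(string jobId, string workItemName) =>
        $"{BaseUrl}/jobs/{Uri.EscapeDataString(jobId)}/workitems/{Uri.EscapeDataString(workItemName)}?api-version={ApiVersion}";

    public static string GetWorkItemConsoleUrl(string jobId, string workItemName) =>
        $"{BaseUrl}/jobs/{Uri.EscapeDataString(jobId)}/workitems/{Uri.EscapeDataString(workItemName)}/console?api-version={ApiVersion}";
}
```

Building both URLs in one place means the version suffix cannot be forgotten at individual call sites, which is the problem the commit describes with the hand-crafted URLs.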

jjonescz · 2026-05-22T09:37:23Z


                    if (workItems.Waiting > 0 && elapsed > TimeSpan.FromMinutes(20))
                    {
                        ConsoleUtil.Warning($"Helix job {helixJobId} has {details.WorkItems.Waiting} queued work items after {elapsed:hh\\:mm}. This indicates a queue backup");


I didn't mean to suggest polling less frequently overall. It just feels like warning here every minute is useless - we could just warn "this job has too many queued items, the queue is likely having problems" once per job?

Part of the reason I want more than one poll is to keep getting data to give to the core-eng team about the problem. Once we work around this I still want to send regular reports about "we're waiting X minutes". I guess I could change it to be more of a "report when done" approach, but my intent is all about getting data.
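One possible middle ground between the two positions, sketched under assumptions: keep recording every poll so the data is still available for reporting, but surface the warning only once per job. `QueueBackupTracker` is a hypothetical type and is not part of the PR; the 20-minute threshold mirrors the condition quoted above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type for illustration; not part of the PR.
public sealed class QueueBackupTracker
{
    static readonly TimeSpan WarnAfter = TimeSpan.FromMinutes(20);
    readonly List<(TimeSpan Elapsed, int Waiting)> _samples = new List<(TimeSpan Elapsed, int Waiting)>();
    bool _warned;

    // Records one poll result. Returns true only the first time the job is
    // seen with waiting work items past the threshold.
    public bool Record(TimeSpan elapsed, int waiting)
    {
        _samples.Add((elapsed, waiting));
        if (_warned || waiting == 0 || elapsed <= WarnAfter)
        {
            return false;
        }

        _warned = true;
        return true;
    }

    // Latest elapsed time at which any work item was still waiting;
    // TimeSpan.Zero if nothing was ever observed waiting.
    public TimeSpan LongestObservedWait() =>
        _samples.Where(s => s.Waiting > 0)
                .Select(s => s.Elapsed)
                .DefaultIfEmpty(TimeSpan.Zero)
                .Max();
}
```

The caller would emit the warning when `Record` returns true and print `LongestObservedWait()` once the job finishes, which still yields a "we waited X minutes" figure for each run.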

jaredpar added 2 commits May 19, 2026 23:09

Refactor the code a bit for the runner to make it easier to add helix…

28fdd2a

… work item details

Using the helix apis to manage our work

ecb77f5

github-actions Bot added the Area-Infrastructure label May 20, 2026

jaredpar and others added 4 commits May 20, 2026 17:50

better scheduling.

2f6bf37

more

76c4a68

more

7b969c1

jaredpar marked this pull request as ready for review May 20, 2026 21:10

jaredpar requested a review from a team as a code owner May 20, 2026 21:10

Copilot AI review requested due to automatic review settings May 20, 2026 21:10

Copilot started reviewing on behalf of jaredpar May 20, 2026 21:10

Copilot AI reviewed May 20, 2026

pr feedback

2914e48

fix

5630725

Copilot AI review requested due to automatic review settings May 20, 2026 23:00

Copilot started reviewing on behalf of jaredpar May 20, 2026 23:01

Copilot AI reviewed May 20, 2026

jjonescz reviewed May 21, 2026

jaredpar added 2 commits May 21, 2026 16:40

PR feedback

c1085c0

PR feedback

3b9a503

Copilot AI review requested due to automatic review settings May 21, 2026 16:44

Copilot started reviewing on behalf of jaredpar May 21, 2026 16:44

jaredpar mentioned this pull request May 21, 2026

Changing timeouts to test the new helix runner code. #83821

Closed

Copilot AI reviewed May 21, 2026

Apply suggestions from code review

0171341

Copilot AI review requested due to automatic review settings May 21, 2026 18:13

Copilot started reviewing on behalf of jaredpar May 21, 2026 18:13

Copilot AI reviewed May 21, 2026

jaredpar and others added 2 commits May 21, 2026 18:41

stupid typo

50b4c01

Add Work Item and Console URLs to synthetic timeout results

0bf1957

Copilot AI review requested due to automatic review settings May 21, 2026 20:35

Copilot started reviewing on behalf of jaredpar May 21, 2026 20:35

Copilot AI reviewed May 21, 2026

build-analysis Bot mentioned this pull request May 21, 2026

[Known Build Error] Pool leak detected #83749

Closed

dibarbet reviewed May 21, 2026

jaredpar and others added 3 commits May 21, 2026 23:37

timeout cleanup

0ba847d

pr feedback

7a3be93

jjonescz approved these changes May 22, 2026

jaredpar merged commit 4eda991 into dotnet:main May 22, 2026
28 checks passed

dotnet-policy-service Bot added this to the Next milestone May 22, 2026

This was referenced May 22, 2026

[release/10.0.4xx] Source code updates from dotnet/roslyn dotnet/dotnet#6728

Merged

[main] Source code updates from dotnet/roslyn dotnet/dotnet#6789

Merged

davidwengier mentioned this pull request May 25, 2026

Update roslyn to 5.8.0-1.26273.3 dotnet/vscode-csharp#9348

Open
