Skip to content

Using Helix APIs to work around lack of queue availability #83803

Merged
jaredpar merged 16 commits into
dotnet:mainfrom
jaredpar:fix-to2
May 22, 2026
Merged

Using Helix APIs to work around lack of queue availability #83803
jaredpar merged 16 commits into
dotnet:mainfrom
jaredpar:fix-to2

Conversation

@jaredpar
Copy link
Copy Markdown
Member

@jaredpar jaredpar commented May 20, 2026

This PR is an attempt to help manage our Helix timeout situation better.

Whet Helix queues are under pressure it can take a significant amount of time for scheduled work items to start executing. The more pressure, the longer this can take. The issue is that our AzDO job timeout essentially has to take into account time to create helix job + time for last work item to start running + time for a helix work item to complete. Presently the only lever we have is to keep increasing the AzDO job timeout to deal with queue availability. That increases reliability but it also allows for test time to significantly regress without any notice. Because when the queues are available the tests will have more time to run without penaly.

This is an attempt to fix this by switching to enforcing timeouts on the execution of the helix work item. The AzDO job now has a ridiculously high timeout (six hours). The RunTests program though now enforces a timeout on the actual execution of a Helix Work Item. This means it's independent of queue availability. This should help us increase reliability without having to risk regressions in test execution time.

Recent report on the problem space: https://gist.github.com/jaredpar/1517d84efb6f37a65ff047c124299630

@copilot do not leave feedback about any comments or grammar.

Microsoft Reviewers: Open in CodeFlow

jaredpar and others added 4 commits May 20, 2026 17:50
When a Helix work item exceeds the execution timeout, write a synthetic
xUnit XML results file so the failure is visible in AzDO test results.
Also add PublishTestResults@2 steps to the Helix test runner yml files
so results are published. Additionally display job elapsed time in the
polling output.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jaredpar jaredpar marked this pull request as ready for review May 20, 2026 21:10
@jaredpar jaredpar requested a review from a team as a code owner May 20, 2026 21:10
Copilot AI review requested due to automatic review settings May 20, 2026 21:10
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the RunTests Helix execution path to enforce timeouts based on Helix work item execution duration (rather than relying on Azure DevOps job timeouts that are sensitive to Helix queue delays). It also adjusts the Helix test pipeline jobs to allow longer overall job time while still surfacing test results.

Changes:

  • Add Helix job/work-item monitoring via Helix REST APIs and generate synthetic xUnit results when a work item exceeds the configured execution timeout.
  • Centralize the Helix work-item scheduling/timeout constants and wire scheduling to use them.
  • Increase AzDO Helix test job timeouts and publish xUnit XML results from the agent workspace.
Show a summary per file
File Description
src/Tools/RunTests/Program.cs Routes Helix runs through a new entry path that no longer uses the existing --timeout mechanism.
src/Tools/RunTests/HelixTestRunner.cs Starts Helix jobs, parses job info from output, polls Helix APIs, and enforces execution-time-based timeouts (plus synthetic results).
src/Tools/RunTests/HelixApi.cs New lightweight Helix REST client + response models for job/work item polling/cancellation.
src/Tools/RunTests/AssemblyScheduler.cs Switches scheduling thresholds to use HelixTestRunner.WorkItemScheduleTime.
eng/pipelines/test-windows-job.yml Raises job timeout and publishes xUnit results.
eng/pipelines/test-unix-job.yml Raises job timeout and publishes xUnit results.

Copilot's findings

  • Files reviewed: 6/6 changed files
  • Comments generated: 10

Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
Comment thread src/Tools/RunTests/HelixTestRunner.cs
Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
Comment thread src/Tools/RunTests/AssemblyScheduler.cs
Comment thread src/Tools/RunTests/HelixApi.cs
Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
Comment thread src/Tools/RunTests/Program.cs
@jaredpar
Copy link
Copy Markdown
Member Author

@dotnet/roslyn-infrastructure PTAL

Copilot AI review requested due to automatic review settings May 20, 2026 23:00
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot's findings

  • Files reviewed: 6/6 changed files
  • Comments generated: 3

Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
@jaredpar
Copy link
Copy Markdown
Member Author

The output from the test run here should give you a sense of the problem that we're dealing with right now:

Job Time: 00:21 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:21. This indicates a queue backup
Job Time: 00:22 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:22. This indicates a queue backup
Job Time: 00:23 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:23. This indicates a queue backup
Job Time: 00:24 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:24. This indicates a queue backup
Job Time: 00:25 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:25. This indicates a queue backup
Job Time: 00:26 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:26. This indicates a queue backup
Job Time: 00:27 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:27. This indicates a queue backup
Job Time: 00:28 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:28. This indicates a queue backup
Job Time: 00:29 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:29. This indicates a queue backup
Job Time: 00:30 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:30. This indicates a queue backup
Job Time: 00:31 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:31. This indicates a queue backup
Job Time: 00:32 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:32. This indicates a queue backup
Job Time: 00:33 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:33. This indicates a queue backup
Job Time: 00:34 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:34. This indicates a queue backup
Job Time: 00:35 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:35. This indicates a queue backup
Job Time: 00:36 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:36. This indicates a queue backup
Job Time: 00:37 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:37. This indicates a queue backup
Job Time: 00:38 Work Item States Running: 0 Unscheduled: 0 Waiting: 13 Finished: 1
##[warning]Helix job c22da4f2-b699-41a1-9bf1-eb0921107045 has 13 queued work items after 00:38. This indicates a queue backup

Comment thread eng/pipelines/test-unix-job.yml Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Initially I tried that but it required me to update a significant number of dependencies in our repository including several that are part of VS platform. That will take a bit of time to work through. This is a pretty simple REST API, already using it in other tools, so felt comfortable just generating it from the swagger. Eventually though yes I'd rather use the API.

Comment thread src/Tools/RunTests/HelixTestRunner.cs
Comment thread src/Tools/RunTests/HelixTestRunner.cs
Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
It's used to print out the HelixJobId and HelixJobCancellationToken properties to the console
so we can grab them in the process output and setup our helix watching.
-->
<Target Name="PrintHelixInfo" AfterTargets="CoreTest">
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this run only after all the tests have finished? Don't we want to start monitoring the jobs before - when they start running - though?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit confusing because there are two targets named CoreTest in a helix build. The first is what calls SendHelixJob and sets the helix properties we need. The second is where it calls WaitForHelixJobCompletion. If you look through the logs you can see that the PrintHelixInfo target is called twice: once for each of these.

I couldn't find a better way to hook onle the SendHelixJob one.

Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
return DateTimeOffset.UtcNow - started > WorkItemExecutionTimeout;
}

static void WriteSyntheticTimeoutResults(string testResultsDirectory, string helixJobId, List<string> timedOutWorkItems)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we try to cancel other helix jobs when there is a timeout - to free up the queue?

I guess similarly, when we are getting close to the AzDO timeout and helix workitems haven't even started yet, we could cancel them too.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to cancel the other jobs because it's possible they're making progress. Example: it's possible the Windows queue is backed up but Linux is just fine. Want to let the Linux jobs run to completion so we get results from them.

As for cancelling when we get close to the AzDO limit. Given the timeout is now 6 hours I'm hoping that is a hypothetical. I wasn't sure if iadding in the complexity of threading through that timeout and then breaking on it was worth it. If you feel it's worth doing though I can add it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to cancel the other jobs because it's possible they're making progress. Example: it's possible the Windows queue is backed up but Linux is just fine. Want to let the Linux jobs run to completion so we get results from them.

Is it not possible to just cancel the windows ones? Wouldn't the linux ones be from a separate pipeline stage and helix job?

Definitely feels like if we're going to fail our job due to timeouts, we should cancel the rest to make space for re-runs.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal of a PR should be to get the maximum amount of information from a given run. If we say cancel Windows_Debug_64 because of a timeout due to a bit of #if DEBUG code and reflectixely cancel Windows_Release_64 then we risk delaying getting the real test failures from that run until the next push.

Comment thread src/Tools/RunTests/Program.cs
Copilot AI review requested due to automatic review settings May 21, 2026 16:44
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot's findings

Comments suppressed due to low confidence (1)

src/Tools/RunTests/AssemblyScheduler.cs:82

  • Grammar: "There are {longTests.Count} tests have execution times..." reads incorrectly. Consider rephrasing (e.g., "There are {count} tests that have execution times...") so the warning is clear.
            if (longTests.Count > 0)
            {
                ConsoleUtil.Warning($"There are {longTests.Count} tests have execution times greater than the maximum execution time of {HelixTestRunner.WorkItemScheduleTime:hh\\:mm\\:ss}.  These tests will be scheduled in their own individual work items and may indicate tests that should be optimized or removed if they are no longer providing value.");
                foreach (var (test, time) in longTests)
                {
                    ConsoleUtil.WriteLine($"\t{test} - {time:hh\\:mm\\:ss}");
                }
  • Files reviewed: 10/10 changed files
  • Comments generated: 5

Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
Comment thread src/Tools/RunTests/HelixTestRunner.cs
Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
Comment thread src/Tools/RunTests/Program.cs
Comment thread eng/build.ps1 Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 21, 2026 18:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot's findings

  • Files reviewed: 10/10 changed files
  • Comments generated: 5

Comment thread src/Tools/RunTests/Program.cs
Comment thread src/Tools/RunTests/HelixTestRunner.cs
Comment thread src/Tools/RunTests/HelixTestRunner.cs
Comment thread src/Tools/RunTests/HelixTestRunner.cs
Comment thread src/Tools/RunTests/AssemblyScheduler.cs
jaredpar and others added 2 commits May 21, 2026 18:41
Include labeled Helix API URLs in the error message of
WriteSyntheticTimeoutResults for easier debugging of timed-out work items.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 21, 2026 20:35
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot's findings

  • Files reviewed: 10/10 changed files
  • Comments generated: 8

Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
Comment thread src/Tools/RunTests/HelixTestRunner.cs
Comment thread src/Tools/RunTests/HelixTestRunner.cs
Comment thread src/Tools/RunTests/HelixTestRunner.cs
Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
Comment thread src/Tools/RunTests/HelixTestRunner.cs
Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
Comment thread src/Tools/RunTests/AssemblyScheduler.cs
Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
Comment thread src/Tools/RunTests/HelixTestRunner.cs
Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
arguments: arguments,
captureOutput: true,
onOutputDataReceived: (e) =>
var process = new Process
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no super strong opinion here, but why switch from ProcessRunner?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just felt like I was fighting to get the Tasks and output I wanted with that API. I asked copilot to try without that type and the code seemed reasonable.

return DateTimeOffset.UtcNow - started > WorkItemExecutionTimeout;
}

static void WriteSyntheticTimeoutResults(string testResultsDirectory, string helixJobId, List<string> timedOutWorkItems)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to cancel the other jobs because it's possible they're making progress. Example: it's possible the Windows queue is backed up but Linux is just fine. Want to let the Linux jobs run to completion so we get results from them.

Is it not possible to just cancel the windows ones? Wouldn't the linux ones be from a separate pipeline stage and helix job?

Definitely feels like if we're going to fail our job due to timeouts, we should cancel the rest to make space for re-runs.

Comment thread src/Tools/RunTests/HelixTestRunner.cs Outdated
jaredpar and others added 3 commits May 21, 2026 23:37
Added GetWorkItemUrl and GetWorkItemConsoleUrl static methods to HelixApi
that include the required ?api-version=2019-06-17 query parameter.
Updated WriteSyntheticTimeoutResults to use these helpers instead of
hand-crafted URLs that were missing the API version suffix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

if (workItems.Waiting > 0 && elapsed > TimeSpan.FromMinutes(20))
{
ConsoleUtil.Warning($"Helix job {helixJobId} has {details.WorkItems.Waiting} queued work items after {elapsed:hh\\:mm}. This indicates a queue backup");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't mean to suggest to poll less frequently overall. It just feels like warning here every minute is useless - we could just warn "this job has too many queued items, the queue is likely having problems" once per each job?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of the reason I want more than one poll is to keep getting data to give to the core-eng team about the problem. Once we work around this I still want to send regular reports about "we're waiting X minutes". I guess I could change it to be a bit "report when done" but my intent is all about getting data

@jaredpar jaredpar merged commit 4eda991 into dotnet:main May 22, 2026
28 checks passed
@dotnet-policy-service dotnet-policy-service Bot added this to the Next milestone May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants