
Agents very slow #8380

Closed
alminveh opened this issue Sep 25, 2023 · 20 comments
Labels: bug report, OS: Windows, question (Further information is requested)

Comments

@alminveh

Description

Since September 21, agents are very slow. Even simple tasks that usually take 1-2 seconds now take 25-30 seconds. A comparison of some tasks from two runs can be seen below:

[screenshot: task duration comparison between two runs]

On the left side is a run from before September 21 (simple tasks taking 1-2 seconds to complete); on the right side, the same tasks running on September 25 (now taking 25-30 seconds to complete).

Platforms affected

  • Azure DevOps
  • GitHub Actions - Standard Runners
  • GitHub Actions - Larger Runners

Runner images affected

  • Ubuntu 20.04
  • Ubuntu 22.04
  • macOS 11
  • macOS 12
  • macOS 13
  • Windows Server 2019
  • Windows Server 2022

Image version and build link

Issue started from version 20230918.1.0

Is it a regression?

Yes, the previous image version 20230912.1.0 was not affected.

Expected behavior

Tasks are fast to execute

Actual behavior

Task execution is very slow

Repro steps

@ilia-shipitsin (Contributor)

Can you please provide repro steps? What is "fetch step-template commit id"?

@mikhailkoliada (Member)

Hey, from what can be seen this looks most likely like a networking issue rather than an image issue. Could you provide repro steps?

@alminveh (Author) commented Sep 25, 2023

@ilia-shipitsin
Taking "Create temp-dir folder and set the current year" as example, it went from 1s to 28s, and the task is this:
- powershell: |
    md $(_tempFolder)\versionFile; $year=(Get-Date).ToString('yyyy')
    Write-Host "##vso[task.setvariable variable=currentYear;]$year"
  displayName: 'Create temp-dir folder and set the current year'

@mikhailkoliada
It does not seem like a networking issue. We have a run from last Wednesday (Sept 20) with six agents executing the same jobs in parallel: five of them on image version 20230912.1.0 and one on 20230918.1.0. The five agents with 20230912.1.0 have similar execution times for all tasks; the one with 20230918.1.0 takes far more time for almost all tasks.

@ilia-shipitsin (Contributor)

@alminveh, I added a step like the one you described to my pipeline.

GUI shows ~35s


@alminveh (Author)

@ilia-shipitsin is that with the 20230918.1.0 image? In all the runs I have with 20230912.1.0 it took only one or two seconds.

@ilia-shipitsin (Contributor)

Yes, it is the mentioned image. I do not believe the image is related, but we can reproduce and investigate.


@ilia-shipitsin (Contributor)

I collected logs and I see that "windows update" takes 15 sec.


That is not supposed to happen, because we disable Windows Update:

https://github.com/actions/runner-images/blob/main/images/win/scripts/Installers/Finalize-VM.ps1#L94

The interesting thing is the "Windows Update Medic Service", which seems to re-enable Windows Update.

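For reference, a minimal sketch of the usual way services get disabled on the image (the real logic lives in the Finalize-VM.ps1 script linked above; the service name here, wuauserv, is just the standard Windows Update service used for illustration):

    # Sketch only: stop the service and set its startup type to Disabled.
    # WaasMedicSvc is intentionally not handled here; as noted below, it
    # resists this standard approach.
    Stop-Service -Name wuauserv -Force -ErrorAction SilentlyContinue
    Set-Service -Name wuauserv -StartupType Disabled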

@iglendd commented Sep 26, 2023

I second this (alminveh's filing).

Previously (on the 20230910 runner) it would take 10-20 sec to execute, 8 times, a PowerShell script that changes the ACL of a file; since Sep 21 the new 20230918 runner takes 3 min (and times out) running the same tests.

Thank you

@ilia-shipitsin (Contributor)

@iglendd, sorry, I cannot say whether your issue is similar or not. Feel free to open a separate issue with full repro steps; we'll investigate and note whether the issues are related.

@ilia-shipitsin (Contributor)

I planned to disable WaasMedicSvc the way we usually disable services, but it seems it will require more attention.

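A possible angle (an assumption on my side, not something verified in this thread): protected services like WaasMedicSvc can sometimes only be disabled through their registry Start value, and even that may be rejected unless the key's permissions are taken over first:

    # Hedged sketch: set the service start type to 4 (Disabled) in the registry.
    # The WaasMedicSvc key is ACL-protected, so this may still fail with access denied.
    Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\WaasMedicSvc' -Name 'Start' -Value 4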

@QiuyongWang

Hi team, is there an ETA for a fix for this issue? Thanks.

@ilia-shipitsin (Contributor)

Let me summarize the findings.

  1. WaasMedicSvc is not expected here, but it is not the root cause; we'll address it later.

What was found:

Something has changed in the behaviour of powershell.exe (Windows PowerShell 5.1). It "analyzes" all available modules during startup and builds a cache, and this has started to take ~20 sec (we did not change the modules much, maybe a few minor changes if any at all).

https://learn.microsoft.com/en-us/powershell/scripting/windows-powershell/wmf/whats-new/release-notes?view=powershell-7.3#module-analysis-cache - this module analysis cache is what is eating the ~20 sec.

If you can switch to "pwsh" (PowerShell 7), that is a workaround (in short: if you use GHA, you can; in ADO the tasks are built on top of "powershell.exe", so switching is not possible).

We'll try to escalate to the PowerShell team. No idea (yet) how to fix it.
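Separately from switching shells, the release notes linked above describe environment variables that control the module analysis cache; a hedged sketch of using them follows (the cache path is an arbitrary example, not something from this thread):

    # Redirect the Windows PowerShell 5.1 module analysis cache and skip its
    # cleanup at shutdown, per the "Module Analysis Cache" section of the notes.
    $env:PSModuleAnalysisCachePath = 'C:\temp\PSModuleAnalysisCache'
    $env:PSDisableModuleAnalysisCacheCleanup = 1
    powershell.exe -NoProfile -Command 'Get-Date'   # child process inherits the variables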

@ilia-shipitsin (Contributor)

I compared the current image against the previous one; I haven't found anything that looks suspicious:

win22/20230910.1...win22/20230918.1

@alminveh (Author)

What we see in our runs is not that something adds 20-30 seconds (or minutes) to execution time; the overall execution is just much slower. Our test-run tasks (WebDriverIO tests) that used to take between 2.5 and 3 hours now consistently run for 6+ hours (and fail with an agent timeout every time), and shorter test runs have doubled in duration (~10 min -> ~20 min, ~1 hour -> 2+ hours, etc.).

@ilia-shipitsin (Contributor)

@alminveh, initially you reported that "even a simple PowerShell task takes 20-30 sec instead of 1 sec"; I confirm that. If your tests invoke powershell.exe many times under the hood, I suspect it is the same issue. If not, please provide repro steps (preferably in a separate issue).
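A quick way to check whether this per-invocation startup cost is what multiplies a test run (a rough sketch; the invocation count is arbitrary):

    # Time a few bare powershell.exe (5.1) startups; if each takes ~20-30 s
    # instead of ~1 s, the module-analysis slowdown is likely the culprit.
    1..5 | ForEach-Object {
        (Measure-Command { powershell.exe -NoProfile -Command 'exit' }).TotalSeconds
    }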

@ilia-shipitsin (Contributor)

Good news: 20230924.1.0 (which is currently being deployed) seems not to be affected.

@alminveh (Author)

I confirm that everything is back to normal with 20230924.1.0. Thank you @ilia-shipitsin.

@jessehouwing (Contributor) commented Sep 28, 2023

Current image version: '20230924.1.0'

Gives me random slowdowns of tasks. I've seen:

  • Checkout taking 2+ minutes longer on a very small repo (normally 8 s)
  • az cli taking 6+ minutes to reindex its command index (normally 45 s)

Explicitly stopping StorSVC prior to checkout normalizes things:

steps:
  - script: net stop storsvc
  - checkout: self

If StorSvc wants to stop, it usually does, but I'm seeing a few instances of:

The Storage Service service is stopping........
The Storage Service service could not be stopped.

I understand from @ilia-shipitsin that stopping Windows Update will take a bit more force.

See: https://jessehouwing.net/az-cli-performance-azure-pipelines-and-github-runner/
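One way to make the stop more forceful when net stop reports that the service could not be stopped (a sketch, not from the thread; the retry count and sleep interval are arbitrary):

    # Retry stopping StorSvc with -Force, giving it a few seconds between tries.
    for ($i = 0; $i -lt 3; $i++) {
        Stop-Service -Name StorSvc -Force -ErrorAction SilentlyContinue
        if ((Get-Service -Name StorSvc).Status -eq 'Stopped') { break }
        Start-Sleep -Seconds 5
    }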

@ilia-shipitsin (Contributor)

Feel free to experiment with disabling StorSvc for measurements.

We plan to release an image with StorSvc disabled next week: #8388
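For those measurements, a rough before/after timing sketch (the timed operation is only an example; substitute whichever step is slow for you):

    # Time an operation with StorSvc running, then again after stopping it.
    $before = (Measure-Command { git clone --depth 1 https://github.com/actions/runner-images tmp1 }).TotalSeconds
    Stop-Service -Name StorSvc -Force
    $after = (Measure-Command { git clone --depth 1 https://github.com/actions/runner-images tmp2 }).TotalSeconds
    Write-Host "with StorSvc: $before s, without StorSvc: $after s"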

@mikhailkoliada (Member)

The underlying problem should now be solved, so I am closing this. Feel free to open separate tickets for anything else.
