Wasm-opt can take way too long #100424

Open
sbomer opened this issue Mar 28, 2024 · 16 comments
Labels
arch-wasm WebAssembly architecture area-Build-mono os-browser Browser variant of arch-wasm
Milestone

Comments

@sbomer
Member

sbomer commented Mar 28, 2024

I'm seeing this job time out during the "Build product" step, in a way that doesn't get reported to GitHub. On GitHub it looks like the job is still running forever.

Hit in multiple PRs, for example:

Build Information

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=622511&view=logs&jobId=63c2d0c8-fec2-5788-81c8-f3ac95e8841f
Build error leg or test failing: browser-wasm linux Release LibraryTests Build product

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "Agent was purged, cancelling the pipeline",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}
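For context, an entry like the one above is meant to be matched against the build output. A minimal sketch of that kind of check, assuming plain case-sensitive substring matching per log line (the real Build Analysis rules may differ; `matches_known_issue` and the sample log are hypothetical, not part of any dotnet tooling):

```python
import json


def matches_known_issue(known_issue_json: str, log_text: str) -> bool:
    """Return True if the issue's ErrorMessage appears in any line of the log."""
    issue = json.loads(known_issue_json)
    message = issue["ErrorMessage"]
    return any(message in line for line in log_text.splitlines())


known_issue = """
{
  "ErrorMessage": "Agent was purged, cancelling the pipeline",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}
"""

# Hypothetical log lines, for illustration only.
sample_log = (
    "##[warning]Agent was purged, cancelling the pipeline\n"
    "Finishing: Build product"
)

print(matches_known_issue(known_issue, sample_log))
```

If the message only ever shows up somewhere the matcher does not scan, a check like this would never fire, no matter how exact the string is.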

Known issue validation

Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=622511
Error message validated: [Agent was purged, cancelling the pipeline]
Result validation: ❌ Known issue did not match with the provided build.
Validation performed at: 4/1/2024 10:15:39 PM UTC

Report

Summary

24-Hour Hit Count | 7-Day Hit Count | 1-Month Count
0 | 0 | 0
@sbomer added the "Known Build Error" label (Use this to report build issues in the .NET Helix tab) on Mar 28, 2024
@dotnet-policy-service bot added the "untriaged" label (New issue has not been triaged by the area owner) on Mar 28, 2024
@agocke
Member

agocke commented Mar 28, 2024

@sbomer Looks like this didn't work: ❌ Known issue did not match with the provided build.

I can't find the string in the test log.

@vcsjones
Member

vcsjones commented Mar 28, 2024

Actually I think this is a duplicate of #99888. Build analysis correctly picked that one up for my PR.

@sbomer
Member Author

sbomer commented Mar 28, 2024

This is not about the test failure in "browser-wasm linux Release LibraryTests_Threading" (build analysis identified that as #99888), but about the product build failure in "browser-wasm linux Release LibraryTests":
[Screenshot 2024-03-28 at 15 10 32]

@lewing
Member

lewing commented Mar 29, 2024

Given the way it is failing, this is probably an infrastructure issue.

@riarenas
Member

riarenas commented Apr 1, 2024

https://helix.dot.net/BuildAnalysis/SearchTimeline?error=Agent%20was%20purged,%20cancelling%20the%20pipeline&dateType=Day(s)&dateValue=7&pageNumber=3

Seems to show that this is a runtime specific issue. It seems like instances are consistently happening in the wasm legs (along with catching a few other stray disconnects)

(Maybe you can try changing the error message to "Agent was purged, cancelling the pipeline", but I'm not sure whether the known issues infra will pick it up, as it's only being surfaced as a warning.)

@sbomer
Member Author

sbomer commented Apr 1, 2024

Thanks for the suggestion, didn't seem to work. :(

@kg
Contributor

kg commented Apr 1, 2024

I'll note that when I watched the stdout of the build step for one of my PRs, it was building, even after 2 hours. So it seems like the builds for this lane are extremely slow. For comparison, the windows equivalent lane seems to build in around 40 minutes.

@pavelsavara
Member

> Actually I think this is a duplicate of #99888. Build analysis correctly picked that one up for my PR.

#99888 happens during the test run on Helix and is not related to this.

This is a build/agent issue, probably infra.
AzDo kills the agent and deletes the log file!

Like this:
[image]

This is before the agent was killed; something is terribly slow:
[image]

The running step is CPU+memory heavy, maybe the agent is also swapping to disk.
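One way to check the swapping theory on a Linux agent would be to sample `/proc/meminfo` while the step runs. A rough sketch, assuming the standard `MemAvailable`, `SwapTotal`, and `SwapFree` fields (values in kB); `parse_meminfo` and `swap_used_kb` are hypothetical helpers, and the sample text is made up rather than taken from a real agent:

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_kb(fields: dict[str, int]) -> int:
    """Swap in use, computed as SwapTotal minus SwapFree."""
    return fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)


# Made-up sample; on a real agent, read the text from /proc/meminfo instead.
sample = """MemTotal:       16384000 kB
MemAvailable:     204800 kB
SwapTotal:       4194304 kB
SwapFree:        1048576 kB
"""

info = parse_meminfo(sample)
print("MemAvailable kB:", info["MemAvailable"])
print("Swap used kB:", swap_used_kb(info))
```

Low `MemAvailable` together with growing swap usage during the step would support the theory; it would not by itself explain why the agent gets purged.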

This problem is blocking me

@pavelsavara
Member

What would it look like if the agent failed with OOM?
We had similar issues on Helix last year, but Ankit got valid log files about it. See #51961.

@lewing
Member

lewing commented Apr 2, 2024

The cancellations are very strange. Steve and I watched it happen in a recent run.

cc @steveisok

@rzikm
Member

rzikm commented Apr 2, 2024

I have seen the same issue on other CI legs as well

[image]

@riarenas
Member

riarenas commented Apr 2, 2024

From the engineering services side, we don't have a lot more details on what is going on. While I try to track some information on what is happening to these agents and why these workloads would break them in such a way, I've reverted the version of the build.ubuntu.2204.amd64.open image to the previous version.

Whether that helps or not will be a useful data point.

@kg
Contributor

kg commented Apr 2, 2024

Possibly related:
https://dev.azure.com/dnceng-public/public/_build/results?buildId=627899&view=logs&jobId=f4616eca-4cb2-53ea-b86a-d7a1235a32a0
https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-pull-100386-merge-2eeaae49b1db47e284/LibraryImportGenerator.Tests/1/console.ef4e1575.log?helixlogtype=result

/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: LibraryImportGenerator.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  LibraryImportGenerator.Tests (found 124 of 128 test cases)
  Starting:    LibraryImportGenerator.Tests (parallel test collections = on [2 threads], stop on fail = off)
/root/helix/work/workitem/e
./RunTests.sh: line 179:    25 Killed                  "$RUNTIME_PATH/dotnet" exec --runtimeconfig LibraryImportGenerator.Tests.runtimeconfig.json --depsfile LibraryImportGenerator.Tests.deps.json xunit.console.dll LibraryImportGenerator.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
----- end Tue Apr 2 23:28:13 UTC 2024 ----- exit code 137 ----------------------------------------------------------
exit code 137 means SIGKILL Killed either due to out of memory/resources (see /var/log/messages) or by explicit kill.

Does this indicate OOM?
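For reference, shells report a process killed by a signal as 128 plus the signal number, so 137 is 128 + 9 (SIGKILL). A small sketch of that decoding, assuming a Linux host; `describe_exit_code` is a hypothetical helper, not something from the Helix scripts:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Decode a shell-style exit status, where 128 + N means 'killed by signal N'."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

SIGKILL by itself does not prove OOM, since an explicit kill looks the same from the exit code. The kernel log on the machine (for example an oom-killer entry) would be needed to confirm it.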

@pavelsavara
Member

I merged #100517 to reduce/avoid the problem

@lewing changed the title from "browser-wasm linux Release LibraryTests timeout" to "Wasm-opt can take way too long" on Apr 24, 2024
@lewing removed the "Known Build Error" label on Apr 24, 2024
@lewing added this to the Future milestone on Apr 24, 2024
@lewing added the "arch-wasm" label and removed the "area-Infrastructure-libraries" and "untriaged" labels on Apr 24, 2024
@lewing added the "area-Build-mono" and "os-browser" labels on Apr 24, 2024