Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

iOS - APP_LAUNCH_TIMEOUT / WORKLOAD TIMED OUT #10820

Closed
karelz opened this issue Sep 13, 2022 · 22 comments
Closed

iOS - APP_LAUNCH_TIMEOUT / WORKLOAD TIMED OUT #10820

karelz opened this issue Sep 13, 2022 · 22 comments
Assignees

Comments

@karelz
Copy link
Member

karelz commented Sep 13, 2022

Starting on 9/9 Rolling run 11121 there are 2x-4x failure in each run:

  • Platform: net7.0-iOS-Release-arm64-Mono_Release-OSX.1200.Amd64.Iphone.Open
Day Run Details
9/12 12571 2x APP_LAUNCH_TIMEOUT & 1x WORKLOAD TIMED OUT
9/11 12413 (7.0) 2x APP_LAUNCH_TIMEOUT & 1x WORKLOAD TIMED OUT
9/11 12028 1x APP_LAUNCH_TIMEOUT & 2x WORKLOAD TIMED OUT
9/11 11973 (7.0) 2x APP_LAUNCH_TIMEOUT & 1x WORKLOAD TIMED OUT
9/11 11877 1x APP_LAUNCH_TIMEOUT & 2x WORKLOAD TIMED OUT
9/10 11727 3x APP_LAUNCH_TIMEOUT
9/10 11508 2x APP_LAUNCH_TIMEOUT & 1x WORKLOAD TIMED OUT
9/9 11121 4x APP_LAUNCH_TIMEOUT
9/9 11454 (7.0) 4x APP_LAUNCH_TIMEOUT
9/9 10781 (7.0) 4x APP_LAUNCH_TIMEOUT

Example of logs

APP_LAUNCH_TIMEOUT

Console

[03:16:43] dbug: Xamarin.Hosting: Connected to DNCENGOSX117 (00008030-000804482E23402E) in 00:00:00.0018453
[03:16:47] dbug: 95976 Execution timed out after 60 seconds and the process was killed.
[03:16:47] dbug: Process mlaunch exited with 0
[03:16:48] fail: Application launch timed out before the test execution has started
[03:16:48] info: Uninstalling the application 'net.dot.System.Runtime.Numerics.Tests' from 'DNCENGOSX117'
[03:16:48] dbug: 
[03:16:48] dbug: Running /private/tmp/helix/working/AC9B09AB/p/microsoft.dotnet.xharness.cli/1.0.0-prerelease.22456.1/tools/net7.0/any/../../../runtimes/any/native/mlaunch/bin/mlaunch --sdkroot /Applications/Xcode_13.4.app --uninstalldevbundleid net.dot.System.Runtime.Numerics.Tests --devname 00008030-000804482E23402E -v -v -v -v -v
[03:16:55] dbug: TCP tunnel still has not connected
[03:17:27] dbug: TCP connection hasn't started in time (00:02:00). Stopped listening.
[03:17:27] dbug: Failed to read TCP data: System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (61): Connection refused [::1]:54809
                    at Microsoft.DotNet.XHarness.iOS.Shared.Listeners.SimpleTcpListener.StartTcpTunnel() in /_/src/Microsoft.DotNet.XHarness.iOS.Shared/Listeners/SimpleTcpListener.cs:line 164
[03:17:27] dbug: Tests have finished executing
[03:19:49] dbug: Process 95987 didn't exit within 00:03:00 and will be killed
[03:19:49] dbug: Killing process tree of 95987...
[03:19:49] dbug: Pids to kill: 95987
[03:19:49] dbug: Running lldb diagnostics for pid 95987
[03:19:49] dbug: Printing backtrace for pid=95987
[03:19:49] dbug: (lldb) command source -s 0 '/tmp/helix/working/AC9B09AB/t/tmpvvopx2.tmp'
[03:19:49] dbug: Executing commands in '/tmp/helix/working/AC9B09AB/t/tmpvvopx2.tmp'.
[03:19:49] dbug: (lldb) process attach --pid 95987
[03:20:09] dbug: 95987 Execution timed out after 180 seconds and the process was killed.
[03:20:10] dbug: Failed to get ExitCode: System.InvalidOperationException: Process must exit before requested information can be determined.
                    at System.Diagnostics.Process.EnsureState(State state)
                    at Microsoft.DotNet.XHarness.Common.Execution.ProcessManager.RunAsyncInternal(Process process, ILog log, ILog stdout, ILog stderr, Action`2 kill, Func`3 getChildProcessIds, Nullable`1 timeout, Dictionary`2 environmentVariables, Nullable`1 cancellationToken, Nullable`1 diagnostics) in /_/src/Microsoft.DotNet.XHarness.Common/Execution/ProcessManager.cs:line 357
[03:20:10] fail: Failed to uninstall the app bundle! Check logs for more details!
[03:20:10] dbug: Saving diagnostics data to '/tmp/helix/working/AC9B09AB/w/B92F09FA/e/diagnostics.json'
XHarness exit code: 90 (APP_LAUNCH_TIMEOUT)

WORKLOAD TIMED OUT

Console

--------------------------------------------------------------------------------------
MSBuild version 17.4.0-preview-22368-02+c8492483a for .NET
/private/tmp/helix/working/AC9B09AB/p/dotnet-cli/sdk/7.0.100-preview.7.22377.5/MSBuild.dll -maxcpucount -verbosity:m /bl:/private/tmp/helix/working/AC9B09AB/w/A6F108BA/e/xharness-output/AOTBuild.binlog /p:XHARNESS_EXECUTION_DIR=/private/tmp/helix/working/AC9B09AB/w/A6F108BA/e /p:RunAOTCompilation=true /p:TargetOS=iOS /p:TargetArchitecture=arm64 /p:MonoForceInterpreter= /p:MonoEnableLLVM=true /p:DevTeamProvisioning=- /p:UsePortableRuntimePack=true /p:Configuration=Release publish/ProxyProjectForAOTOnHelix.proj
  ** Building a proxy for the original test project, to AOT on helix. In order to do that, this recreates the original inputs for the *iOS/tvOS* part of the build. See /private/tmp/helix/working/AC9B09AB/w/A6F108BA/e/publish/ProxyProjectForAOTOnHelix.proj, and /private/tmp/helix/working/AC9B09AB/w/A6F108BA/e/publish/ProxyProjectForAOTOnHelix.props. **
ERROR: WORKLOAD TIMED OUT - Killing user command..
./xharness-runner.apple.sh: line 173:  5367 Terminated: 15          source command.sh
Removing empty log files:
ls: /tmp/helix/working/AC9B09AB/w/A6F108BA/uploads/xunit-*.xml: No such file or directory
+ exit_code=143
+ '[' -f /tmp/helix/working/AC9B09AB/w/A6F108BA/e/.retry ']'
+ '[' -f /tmp/helix/working/AC9B09AB/w/A6F108BA/e/.reboot ']'
+ exit 143
+ export _commandExitCode=143
+ _commandExitCode=143
+ /usr/local/bin/python3 -u /tmp/helix/working/AC9B09AB/w/A6F108BA/u/xharness-event-processor.py
WARNING: Diagnostics file not found at `/tmp/helix/working/AC9B09AB/w/A6F108BA/e/diagnostics.json`
+ /usr/local/bin/python3 /tmp/helix/working/AC9B09AB/p/reporter/run.py https://dev.azure.com/dnceng-public/ public 247120 eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6Im9PdmN6NU1fN3AtSGpJS2xGWHo5M3VfVjBabyJ9.eyJuYW1laWQiOiJjNzczZjJjMi01MTIwLTQyMDctYWZlMi1hZmFmMzVhOGJjMGEiLCJzY3AiOiJhcHBfdG9rZW4iLCJhdWkiOiIyMGQ4N2I4MC1lMGVjLTRkYzgtOTFmMi00M2I0MmI4ODU0NTkiLCJzaWQiOiIxZjY5ZTIwZS1iMmRlLTQ2NGQtOGUzNi0yOWQ3ZjY4NzI2MzAiLCJCdWlsZElkIjoiY2JiMTgyNjEtYzQ4Zi00YWJiLTg2NTEtOGNkY2I1NDc0NjQ5OzEyNTcxIiwicHBpZCI6InZzdGZzOi8vL0J1aWxkL0J1aWxkLzEyNTcxIiwib3JjaGlkIjoiMzMwMWIyZGUtOTc1NS00MmZmLTg1NzEtM2M3NGJlNzAwNzFkLmJ1aWxkX2lvc19hcm02NF9yZWxlYXNlX2FsbHN1YnNldHNfbW9uby5fX2RlZmF1bHQiLCJyZXBvSWRzIjoiIiwiaXNzIjoiYXBwLnZzdG9rZW4udmlzdWFsc3R1ZGlvLmNvbSIsImF1ZCI6ImFwcC52c3Rva2VuLnZpc3VhbHN0dWRpby5jb218dnNvOjZmY2M5MmU1LTczYTctNGY4OC04ZDEzLWQ5MDQ1YjQ1ZmIyNyIsIm5iZiI6MTY2Mjk3MjcwNSwiZXhwIjoxNjYyOTg0NzA1fQ.Qp3s3PvwKjcJXb6Rad_3Zfh3ofZX7yHtqiefyYK_10hDxLzOELvqNHnNY906FwNIuT1XhyYj06nzfVKlrnvcxhw5wMh5wB87rXIQwj2bRY75oLwTiUqDD440lJI-J52tGYiDQuoVQfA-GwCshaRfCTTv6Q6SV0nv_X5IBDSpesxICqywV2JASrEIqB96XfH1s542NXYezR4RPwwhIDR2oHgb4KBcqHnlJ1KBbplkZjOGlIgiG-OQi37qnn5RLiC1uKgTV3LFaKy9TzDLdQS9b_HohYwA5XFEzJL4Ey7Lm_LBjO1g_r_meI2exhIKvBFJ89hHvf_5SxZi5jac4G3PCg
2022-09-12T10:24:50.253Z	INFO   	run.py	run(48)	main	Beginning reading of test results.
2022-09-12T10:24:50.255Z	INFO   	run.py	__init__(42)	read_results	Searching '/private/tmp/helix/working/AC9B09AB/w/A6F108BA/e' for test results files
2022-09-12T10:24:50.262Z	INFO   	run.py	__init__(42)	read_results	Searching '/tmp/helix/working/AC9B09AB/w/A6F108BA/uploads' for test results files
2022-09-12T10:24:50.262Z	WARNING	run.py	__init__(55)	read_results	No results file found in any of the following formats: xunit, junit, trx
2022-09-12T10:24:50.263Z	INFO   	run.py	packing_test_reporter(30)	report_results	Packing 0 test reports to '/tmp/helix/working/AC9B09AB/w/A6F108BA/e/__test_report.json'
2022-09-12T10:24:50.263Z	INFO   	run.py	packing_test_reporter(33)	report_results	Packed 1414 bytes
+ exit 143
['System.Runtime.Tests' END OF WORK ITEM LOG: Command exited with 143]
@premun
Copy link
Member

premun commented Sep 13, 2022

Last week, we started seeing an uptick in problems with TCP connections between the device and XHarness. There were no infrastructure changes so we're weighing a possibility of some networking regression in runtime.

cc @steveisok

@steveisok
Copy link
Member

@premun Before we fish in the runtime, can we fully reboot the devices (turn them off and back on) to see if that fixes the problem?

@premun
Copy link
Member

premun commented Sep 13, 2022

@dotnet/dnceng we see issues with the 12.00 iPhone queue. Can we:

  1. Reboot the machines (ideally via Helix work items that call this)
  2. Ask DDFUN to reboot the phones themselves?
  3. Re-run the failing stage in this build to verify it got better: https://dev.azure.com/dnceng-public/public/_build/results?buildId=12516&view=results

@premun
Copy link
Member

premun commented Sep 13, 2022

This is probably related to this: #10827

@steveisok
Copy link
Member

@dotnet/dnceng we see issues with the 12.00 iPhone queue. Can we:

  1. Reboot the machines (ideally via Helix work items that call this)
  2. Ask DDFUN to reboot the phones themselves?
  3. Re-run the failing stage in this build to verify it got better: https://dev.azure.com/dnceng-public/public/_build/results?buildId=12516&view=results

I've seen it on all the iOS and tvOS queues (OSX.1015 and OSX.1200). Can we address them all?

@premun
Copy link
Member

premun commented Sep 13, 2022

Let's try on the 50 iPhones as I have a 100% repro in the build now it seems and then do the other devices. It's a substantial amount of devices. Also it's much faster to reboot iPhones for DDFUN than AppleTVs, I think?

@oleksandr-didyk
Copy link
Contributor

As per @premun 's instructions, I restarted all the machines in the osx.1200.amd64.iphone.open queue through a Helix job and re-ran the failing stage in the linked build (https://dev.azure.com/dnceng-public/public/_build/results?buildId=12516&view=results). The stage is now green, so the restart seemed to have helped

@premun
Copy link
Member

premun commented Sep 14, 2022

@oleksandr-didyk can you do the same for the appletv queues too?

@premun
Copy link
Member

premun commented Sep 14, 2022

@steveisok btw we could start running the full BCL on the iPhone 12.00 queue as we have ~50 devices there

@oleksandr-didyk
Copy link
Contributor

Of course, will do and update the status here. Is there some pipeline I could check if the issue is fixed?

@premun
Copy link
Member

premun commented Sep 14, 2022

@oleksandr-didyk you can use any dotnet/xharness build and re-run the AppleTV E2E test stage. E.g. this one:
https://dev.azure.com/dnceng-public/public/_build/results?buildId=12703&view=results

However, we don't have a 100% repro there

@oleksandr-didyk
Copy link
Contributor

@premun restarted the machines in 1100 and 1015 queues, re-ran the pipeline stage you recommended on main -> tests passed

@steveisok
Copy link
Member

@premun restarted the machines in 1100 and 1015 queues, re-ran the pipeline stage you recommended on main -> tests passed

Were the devices rebooted as well?

@oleksandr-didyk
Copy link
Contributor

@steveisok No, did not create any IcM for that, just the machines

@carlossanlop
Copy link
Member

Hit in a release/7.0 runtime PR in the runtime-extra-platforms pipeline: dotnet/runtime#76052

You can expand the Tests -> Artifacts tabs for any of the two tvOS runs:
https://dev.azure.com/dnceng-public/public/_build/results?buildId=29620&view=ms.vss-test-web.build-test-results-tab
Then look for the log files that contain the name of the test class.
You won't find any [FAIL] results, yet the process seems to have troubles at the end, and that's probably why the whole run is reported as failed.

@premun
Copy link
Member

premun commented Sep 29, 2022

@dotnet/dnceng can you please open a ticket to reboot all Apple TV devices in the osx.1015.amd64.appletv.open and osx.1100.amd64.appletv.open queue?

@andriipatsula
Copy link
Member

created a ticket to reboot all devices in osx.1100.amd64.appletv.open queue.
DDFun ticket: https://portal.microsofticm.com/imp/v3/incidents/details/338271514/home

cc @premun, @steveisok

@andriipatsula
Copy link
Member

from the ticket:

The MLS team has rebooted the Apple TVs attached to these machines as you requested. Could you please check the TCP connection issue on your end and confirm the status please.

@steveisok , could you please check if the issue is solved for osx.1100.amd64.appletv.open queue

@premun premun self-assigned this Oct 3, 2022
@steveisok
Copy link
Member

from the ticket:

The MLS team has rebooted the Apple TVs attached to these machines as you requested. Could you please check the TCP connection issue on your end and confirm the status please.

@steveisok , could you please check if the issue is solved for osx.1100.amd64.appletv.open queue

No, it still exists, unfortunately.

@ilyas1974
Copy link
Contributor

Per conversation with @akoeplinger, I am closing 11210 and 11224 in favor of this issue.
Issue #11224 - machine DNCENGMAC144.local was taken offline to address this issue.
Issue #11210 is also related to this issue.

@ilyas1974
Copy link
Contributor

@steveisok any progress on this? Should we keep this issue open or should we just open a new one whenever you are ready to engage with us again?

@steveisok
Copy link
Member

Closing in favor of #11683. We are enabling retries as a potential way to mitigate this. We'll see.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants