Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The random access performance of macOS runners is incredibly slow #2707

Closed
smorimoto opened this issue Feb 14, 2021 · 15 comments
Closed

The random access performance of macOS runners is incredibly slow #2707

smorimoto opened this issue Feb 14, 2021 · 15 comments
Assignees
Labels
investigate Collect additional information, like space on disk, other tool incompatibilities etc. OS: macOS

Comments

@smorimoto
Copy link
Contributor

smorimoto commented Feb 14, 2021

As the title says. However, this does not always happen, and it seems to be a random problem with some runners. This makes everything incredibly slow if you build on the problem runner.

According to the GitHub documentation, it says "GitHub hosts macOS runners in GitHub's own macOS Cloud" but somehow it seems that there are runners managed on 54112 and 36459, are slow on either of them? (Sorry, I don't have time to spend investigating this and you probably have more information.)

Sequential Read Ubuntu macOS
File operations - reads/s 269254.15 312796.42
Throughput - read (MiB/s) 4207.10 4887.44
Latency - min (ms) 0.00 0.00
Latency - avg (ms) 0.00 0.00
Latency - max (ms) 6.64 9.22
Latency - 95th percentile (ms) 0.00 0.00
Latency - min (ms) 9388.44 9117.90
Sequential Write Ubuntu macOS
File operations - writes/s 4258.20 6345.27
Throughput - written (MiB/s) 66.53 99.14
Latency - min (ms) 0.00 0.00
Latency - avg (ms) 0.00 0.07
Latency - max (ms) 84.80 44.82
Latency - 95th percentile (ms) 0.00 0.01
Latency - min (ms) 10012.84 9948.61
Combined Random Read/Write Ubuntu macOS
File operations - reads/s 5092.69 375.47
File operations - writes/s 3395.13 250.31
File operations - fsyncs/s 10857.02 810.84
Throughput - read (MiB/s) 79.57 5.87
Throughput - written (MiB/s) 53.05 3.91
Latency - min (ms) 0.00 0.00
Latency - avg (ms) 0.05 0.70
Latency - max (ms) 29.27 25.96
Latency - 95th percentile (ms) 0.00 5.88
Latency - min (ms) 9902.59 9989.68
@LeonidLapshin LeonidLapshin added investigate Collect additional information, like space on disk, other tool incompatibilities etc. OS: macOS and removed needs triage labels Feb 15, 2021
@miketimofeev
Copy link
Contributor

Hi @smorimoto!
Indeed, the Mac cloud consists of many environments and probably the issue affected only some of them.
Could you please provide a few examples of "slow" and "normal" builds, it should help us find out particular environments that affected by the issue? And it would be even better if you can provide a benchmark to use.

@smorimoto
Copy link
Contributor Author

Here are the hardware and software informations for the slow machine. This is probably helpful for identifying one by serial number.

2021-02-15T17:14:20.9697180Z Hardware:
2021-02-15T17:14:20.9697400Z 
2021-02-15T17:14:20.9697710Z     Hardware Overview:
2021-02-15T17:14:20.9697960Z 
2021-02-15T17:14:20.9698290Z       Model Name: Apple device
2021-02-15T17:14:20.9698710Z       Model Identifier: VMware7,1
2021-02-15T17:14:20.9699150Z       Processor Speed: 3.33 GHz
2021-02-15T17:14:20.9699560Z       Number of Processors: 1
2021-02-15T17:14:20.9699950Z       Total Number of Cores: 3
2021-02-15T17:14:20.9700340Z       L2 Cache (per Core): 256 KB
2021-02-15T17:14:20.9700670Z       L3 Cache: 12 MB
2021-02-15T17:14:20.9700990Z       Memory: 14 GB
2021-02-15T17:14:20.9701410Z       Boot ROM Version: VMW71.00V.13989454.B64.1906190538
2021-02-15T17:14:20.9702180Z       Apple ROM Info: [MS_VM_CERT/SHA1/27d66596a61c48dd3dc7216fd715126e33f59ae7]Welcome to the Virtual Machine
2021-02-15T17:14:20.9702850Z       SMC Version (system): 2.8f0
2021-02-15T17:14:20.9703290Z       Serial Number (system): VMcwz4J2GHCN
2021-02-15T17:14:20.9704310Z       Hardware UUID: 4203018E-580F-C1B5-9525-B745CECA79EB
2021-02-15T17:14:26.6539330Z Software:
2021-02-15T17:14:26.6539540Z 
2021-02-15T17:14:26.6539860Z     System Software Overview:
2021-02-15T17:14:26.6540130Z 
2021-02-15T17:14:26.6540450Z       System Version: macOS 10.15.7 (19H512)
2021-02-15T17:14:26.6540860Z       Kernel Version: Darwin 19.6.0
2021-02-15T17:14:26.6541260Z       Boot Volume: Macintosh HD
2021-02-15T17:14:26.6541630Z       Boot Mode: Normal
2021-02-15T17:14:26.6542420Z       Computer Name: Mac-1613409099589.local
2021-02-15T17:14:26.6542880Z       User Name: runner (runner)
2021-02-15T17:14:26.6543290Z       Secure Virtual Memory: Enabled
2021-02-15T17:14:26.6543760Z       System Integrity Protection: Enabled
2021-02-15T17:14:26.6544230Z       Time since boot: 3 minutes
2021-02-15T17:14:26.6544740Z Storage:
2021-02-15T17:14:26.6544940Z 
2021-02-15T17:14:26.6545560Z     Macintosh HD - Data:
2021-02-15T17:14:26.6545820Z 
2021-02-15T17:14:26.6546120Z       Free: 81.78 GB (81,784,434,688 bytes)
2021-02-15T17:14:26.6546490Z       Capacity: 407.81 GB (407,812,136,960 bytes)
2021-02-15T17:14:26.6546910Z       Mount Point: /System/Volumes/Data
2021-02-15T17:14:26.6547320Z       File System: APFS
2021-02-15T17:14:26.6547650Z       Writable: Yes
2021-02-15T17:14:26.6548000Z       Ignore Ownership: No
2021-02-15T17:14:26.6548360Z       BSD Name: disk1s1
2021-02-15T17:14:26.6549210Z       Volume UUID: 2A30A6B7-DA7B-3585-83F7-DB9932BBA018
2021-02-15T17:14:26.6549720Z       Physical Drive:
2021-02-15T17:14:26.6550140Z           Device Name: VMware Virtual SATA Hard Drive
2021-02-15T17:14:26.6550770Z           Media Name: AppleAPFSMedia
2021-02-15T17:14:26.6551180Z           Medium Type: SSD
2021-02-15T17:14:26.6551710Z           Protocol: PCI
2021-02-15T17:14:26.6552050Z           Internal: Yes
2021-02-15T17:14:26.6552420Z           Partition Map Type: Unknown
2021-02-15T17:14:26.6552700Z 
2021-02-15T17:14:26.6552980Z     Macintosh HD:
2021-02-15T17:14:26.6553200Z 
2021-02-15T17:14:26.6553500Z       Free: 81.78 GB (81,784,434,688 bytes)
2021-02-15T17:14:26.6553870Z       Capacity: 407.81 GB (407,812,136,960 bytes)
2021-02-15T17:14:26.6554210Z       Mount Point: /
2021-02-15T17:14:26.6554560Z       File System: APFS
2021-02-15T17:14:26.6554890Z       Writable: No
2021-02-15T17:14:26.6555240Z       Ignore Ownership: No
2021-02-15T17:14:26.6555600Z       BSD Name: disk1s5
2021-02-15T17:14:26.6556460Z       Volume UUID: 7108BDC1-B808-49E6-9443-A70B1CFFC639
2021-02-15T17:14:26.6556930Z       Physical Drive:
2021-02-15T17:14:26.6557360Z           Device Name: VMware Virtual SATA Hard Drive
2021-02-15T17:14:26.6557850Z           Media Name: AppleAPFSMedia
2021-02-15T17:14:26.6558400Z           Medium Type: SSD
2021-02-15T17:14:26.6558770Z           Protocol: PCI
2021-02-15T17:14:26.6559110Z           Internal: Yes
2021-02-15T17:14:26.6559500Z           Partition Map Type: Unknown

I ran a simple benchmark with sysbench. If you're not familiar with how to do that, here's how:
https://github.com/smorimoto/actions-ve-repro-2707/blob/main/.github/workflows/workflow.yml

@smorimoto
Copy link
Contributor Author

In my experience, the slowdown on some machines seems to have started suddenly several months ago. MacStadium sometimes uses fairly old machines as hosts, so consider that possibility.

@miketimofeev
Copy link
Contributor

@smorimoto I'm afraid the serial number is the same across all the VMs, so it won't help us. We can identify a particular machine by a link to the run.

@smorimoto
Copy link
Contributor Author

Slow one:
https://github.com/smorimoto/actions-ve-repro-2707/runs/1905124520
Normal one:
https://github.com/smorimoto/actions-ve-repro-2707/runs/1905124540

@miketimofeev
Copy link
Contributor

@smorimoto thanks, I see some of them are about 1-2min and some 4+. We'll take a look more carefully.

@smorimoto
Copy link
Contributor Author

Are you referring to the system profiler? I'm not familiar with macOS, so I don't know, but the time it takes doesn't seem to have much to do with the storage performance of the machine.

@miketimofeev
Copy link
Contributor

Sorry, I should have checked carefully. By the way, you provided the same link both for the slow and normal ones.

@smorimoto
Copy link
Contributor Author

Hmm? I think I've provided the 8th and 10th jobs.

@miketimofeev miketimofeev self-assigned this Feb 18, 2021
@miketimofeev
Copy link
Contributor

@smorimoto I've checked your run and there are 2 slow examples — 7&8, both were run in the data center, which is currently being deprecated. It'll take 2-3 more months to finish infrastructure migration, so I suggest bringing up the issue again if it persists after this time.
I'm going to close the issue for now, but please feel free to contact us if you have any concerns.
Thank you!

@smorimoto
Copy link
Contributor Author

Got it, thanks for the investigation. I hope all slow instances won't be used as soon as possible.

@jpike88
Copy link

jpike88 commented Mar 8, 2021

@miketimofeev shouldn't this issue remain open until the infrastructure migration has completed?

@smorimoto
Copy link
Contributor Author

@miketimofeev Many VMs still seem to have the issue. Especially, the write performance is often terrible: https://github.com/smorimoto/actions-ve-repro-2707/actions/runs/752151426

@andreineculau
Copy link

@miketimofeev Don't know if you'd prefer a new issue, but now it's months after, and I'm seeing the same/similar. It's random, but often: 10-25% of the runs are slow.

A good run can "uninstall homebrew" (i.e. mostly disk i/o) in 184 seconds https://github.com/rokmoln/support-firecloud/runs/3295490478?check_suite_focus=true#step:3:190

A slow run can "uninstall homebrew" 2.5 times slower, in 483 seconds
https://github.com/rokmoln/support-firecloud/runs/3295490501?check_suite_focus=true#step:3:190

Similarly, reinstalling homebrew (disk, network and cpu bound) happens 1.8 times slower, in 593 seconds https://github.com/rokmoln/support-firecloud/runs/3295490501?check_suite_focus=true#step:3:447 instead of 333 seconds https://github.com/rokmoln/support-firecloud/runs/3295490478?check_suite_focus=true#step:3:456

Overall, my builds time out even with a 2x-than-normal timeout (normally 23minutes, timeout 45 minutes).

@miketimofeev
Copy link
Contributor

@andreineculau yes, create a new issue, please. It would be great if you can provide the minimal repro steps so we will be able to make a few runs and gather data that can be escalated to the engineering team.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
investigate Collect additional information, like space on disk, other tool incompatibilities etc. OS: macOS
Projects
None yet
Development

No branches or pull requests

5 participants