Set up workflows to run on Kubernetes / autoscale runners #1356
/run dummy-job

Dummy job for testing workflow succeeded ✅ 🆔 25109039407
Maybe let's wait for the scheduled profiling jobs to run tomorrow, but this should be working after #1358.
The scheduled profiling workflow failed unfortunately. Weirdly, it seems to have successfully completed running the profiling and saving the profiling output post-run, but the
Since the overall build stopped after almost exactly 12 hours, I wonder if there's a (hopefully configurable) 12-hour timeout somewhere in the ARC configuration.
Quick comment: I can confirm the virtual machine is down. I still have the feeling there are some settings somewhere taking down a VM after 12 hours, but I have no idea where this is coming from; I can't see relevant settings anywhere. Side note: looking at the timings, this seems to have taken much longer than previous jobs. Is this concerning?
Ok, I did some more investigation:

All in all, my understanding is that the failure we've seen is indeed due to an OOM, which isn't unlikely given the workload, according to both Matt and Will. Note that the new autoscaling runners are the aforementioned (dedicated) Standard_DS2_v2 machines, while previously they were running on Standard_F16s_v2 machines, which have 32 GiB of memory (although those are machines shared with other workflows, scheduled jobs on Saturday likely run at a quiet time). Side note: the Standard_DS2_v2 vs Standard_F16s_v2 difference should also explain why the job took longer with the new setup. I'm not really sure what we can do: reduce the profiling workload to make it fit in a Standard_DS2_v2 box, or get a pool of beefier machines (maybe one of the memory-optimised machines, CC @tamuri)?
Actually, this may not be too bad an option: I'm comparing different machines with the Azure Pricing Calculator and, unless I'm reading it wrong, Standard_E2_v4 (16 GiB of memory for 2 quite recent vCPUs) seems to be slightly cheaper than Standard_DS2_v2 (7 GiB of memory for 2 older-generation vCPUs) (see also the page about pricing of virtual machines).
We use the Standard D11 v2 for batch-submit |
But we should check we're using the machine with best value for money. |
Yeah, I noticed the machines I suggested don't have temporary storage after I posted the message; the VM pricing page is much clearer about this. But some E-series machines do have temporary storage, just not all of them. Here's a comparison of D2 v2 with some of the memory-optimised machines (pricing refers to the UK South region); all of them have 2 vCPUs:

D2 v2 seems to be a bit expensive overall, maybe because it's in the general-purpose category and in high demand? E2a v4 isn't bad if 50 GiB of local storage is enough, and we could even consider E2pds v5 if using ARM CPUs is an option.
For the record, I set up a new autoscaling Kubernetes cluster with Standard_E2_v4 machines, which gives us better CPUs than Standard_DS2_v2 and more memory, for less money, but also with less storage (which shouldn't be a problem for our use case though). I restarted the scheduled job that failed on Saturday, and this time it was successful in about 8 hours, a time close to previous runs. So I think this is a net improvement compared to the first setup I attempted. The only thing is that, for a reason I still don't understand, we can't use more than 12 GiB of memory even though the machine nominally has 16 GiB (more likely something between 15 and 16, but definitely larger than 12): when I try to use values larger than 12 GiB (e.g. 14) there, no GitHub Actions runner is ever started at all because of insufficient memory. But in any case 12 GiB should be plenty of memory for the profiling jobs and similar workloads: I had a look at past runs of profiling jobs, and maximum total memory usage on the node was about 4 GiB.

Edit: the ~12 GiB limit seems to come from AKS: the maximum allocatable memory on the node is about 12 GiB, even if the nodes we're requesting have 16 GiB:

```
% kubectl describe node aks-agentpool-...
[...]
Allocatable:
  cpu:                1900m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             12881168Ki
  pods:               110
```

This would also explain why I can't put more than 12 GiB in the
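As a quick illustrative check of how much memory AKS is withholding (the node capacity figure of 16375056 KiB used here comes from a `kubectl describe node` dump appearing later in the thread, not from this comment):

```python
# Capacity vs allocatable memory, in KiB, as reported by
# `kubectl describe node` for a Standard_E2_v4 node.
capacity_kib = 16375056     # node capacity (from a later kubectl dump)
allocatable_kib = 12881168  # maximum allocatable memory on the node

withheld = 1 - allocatable_kib / capacity_kib
print(f"AKS withholds {withheld:.1%} of the node's memory")
# → AKS withholds 21.3% of the node's memory
```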
Alright, I finally figured out the problem with the missing memory: AKS forcibly restricts the total amount of allocatable memory, to reserve some space on the node for the Kubernetes services. At the moment, the formula used to restrict the available memory depends on the version of Kubernetes used; in particular, the formula used for Kubernetes v1.28 and earlier reserves much more memory than the one used with Kubernetes v1.29. We're currently using v1.28 (the latest "stable" version in AKS), while v1.29 should become "stable" around August-September according to the AKS docs about supported Kubernetes versions. At the moment with Kubernetes v1.28 we have

```
% kubectl describe node aks-agentpool-16076257-vmss000000
[...]
Capacity:
  cpu:                2
  ephemeral-storage:  129886128Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16375056Ki
  pods:               110
Allocatable:
  cpu:                1900m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             12881168Ki
  pods:               110
```

Note the large difference between memory capacity and allocatable memory: over 20%! Let's do some maths:

```julia
julia> total_memory = 16375056
16375056

julia> hard_eviction = 750 * 2 ^ 10
768000

julia> kube_reserved = round(Int, 0.25 * 4 * 2 ^ 20 + 0.20 * 4 * 2 ^ 20 + 0.10 * (total_memory - 8 * 2 ^ 20))
2686082

julia> allocatable_memory = total_memory - (hard_eviction + kube_reserved)
12920974
```

Following the formula reported in the docs for Kubernetes v1.28 and earlier, I got an expected allocatable memory of 12920974 KiB; in reality it's 12881168 KiB, but that's close enough (the error is about 0.3%). With Kubernetes v1.29 and a maximum of 50 pods we expect to have:

```julia
julia> total_memory = 16375056
16375056

julia> k129_reserved(max_pods) = (100 * 2 ^ 10 + max_pods * 20 * 2 ^ 10)
k129_reserved (generic function with 1 method)

julia> total_memory - k129_reserved(50)
15248656
```

Also in this case the actual allocatable memory reported by Kubernetes is off by ~0.3% (sorry, I don't have the precise number to share!), but the number above is good enough for a ballpark estimate. I tried this setup and ran a test CI job where I allocated an array of about 13 GiB, which together with the baseline memory usage on the node brought total usage to well over 14 GiB.

To summarise, we now have an AKS deployment with the following properties:
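For reference, the two reservation formulas worked through above can be sketched in Python as well (a rough translation of the Julia snippets; the function names are mine, not part of any AKS API):

```python
# Sketch of the AKS allocatable-memory formulas discussed above.
# All quantities are in KiB, matching `kubectl describe node` output.

def allocatable_v128(total_kib):
    """Kubernetes v1.28 and earlier on AKS: subtract a hard eviction
    threshold (750 MiB) plus a progressive kube-reserved amount."""
    hard_eviction = 750 * 2**10
    kube_reserved = round(
        0.25 * 4 * 2**20                  # 25% of the first 4 GiB
        + 0.20 * 4 * 2**20                # 20% of the next 4 GiB
        + 0.10 * (total_kib - 8 * 2**20)  # 10% of the remainder
    )
    return total_kib - (hard_eviction + kube_reserved)

def allocatable_v129(total_kib, max_pods):
    """Kubernetes v1.29 on AKS: flat 100 MiB plus 20 MiB per pod slot."""
    return total_kib - (100 * 2**10 + max_pods * 20 * 2**10)

total = 16375056  # capacity of a Standard_E2_v4 node, in KiB
print(allocatable_v128(total))      # → 12920974
print(allocatable_v129(total, 50))  # → 15248656
```

Both results match the Julia calculations above, so upgrading to v1.29 should recover roughly 2.3 GiB of allocatable memory per node at 50 max pods.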
This is great, thanks for digging into it. Wonder whether we can move the Batch pool VMs over to this too. Everything is bundled into a Docker container, not sure how much storage we'll need.
To apply to