Set up workflows to run on Kubernetes / autoscale runners #1356

Open
matt-graham opened this issue May 17, 2024 · 18 comments

@matt-graham
Collaborator

To apply to:

  • Scheduled profiling workflows
  • All comment triggered workflows
@giordano
Member

/run dummy-job

@ucl-comment-bot

Dummy job for testing workflow succeeded ✅

🆔 25109039407
⏲️ 0.00 minutes
#️⃣ a95e916


@giordano
Member

Maybe let's wait for the scheduled profiling jobs to run tomorrow, but this should be working after #1358.

@matt-graham
Collaborator Author

The scheduled profiling workflow failed, unfortunately. Weirdly, it seems to have successfully completed the profiling and saved the profiling output post-run, but the Run profiling in dev environment step is still shown with an 'in-progress' yellow/amber spinner (though the overall job shows as failed), and the subsequent Save results as artifact step did not appear to start running 😕

[screenshot of the failed workflow run, showing the stuck step]

@giordano
Member

Since the overall build stopped after almost exactly 12 hours, I wonder if there's a (hopefully configurable) 12-hour timeout somewhere in the ARC configuration.

@giordano
Member

Quick comment: I can confirm the virtual machine is down. I still have the feeling there's some setting somewhere taking down the VM after 12 hours, but I have no idea where it would be coming from; I can't see any relevant settings anywhere.

Side note, looking at the timings

[00:22:16:INFO] Starting profiling runs
[12:03:13:INFO] Profiling runs complete

this seems to have taken much longer than previous jobs; is this concerning?
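
For reference, the elapsed time between those two log lines works out to roughly 11.7 hours; a quick throwaway check (assuming both timestamps are from the same day):

using Dates

# Elapsed time between "Starting profiling runs" and "Profiling runs complete"
start = Time(0, 22, 16)
stop  = Time(12, 3, 13)
Dates.value(stop - start) / 3.6e12  # nanoseconds → hours ≈ 11.68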

@giordano
Member

giordano commented May 21, 2024

Ok, I did some more investigation:

  • the 12-hour mark was probably just a coincidence and a red herring: I ran a 14-hour dummy job and it finished without problems
  • prompted by an offline discussion with Matt, I looked into the possibility of the machine being killed due to an OOM: I started a job which would allocate a total of 10 GiB of memory (we're currently using Standard_DS2_v2 machines, which have 7 GiB of memory), and it showed the same symptoms as the scheduled profiling job which died the other day:
    • the step is seemingly still running in the GitHub UI
    • there's no message whatsoever printed to screen about errors or anything
    • in the Kubernetes log I see similar error messages around the times when the jobs supposedly died:
      2024/05/18 12:03:18 http: TLS handshake error from 10.244.0.19:32960: EOF
      [...]
      2024/05/21 10:30:45 http: TLS handshake error from 10.244.0.19:49832: EOF
      
      I don't see similar messages elsewhere in the log. While this error message isn't exactly clear, the coincidence with the two events is remarkable. Maybe this means the Kubernetes service is trying to contact the machines but not getting an answer back?

All in all, my understanding is that the failure we've seen is indeed due to an OOM, which isn't unlikely given the workload, according to both Matt and Will. Note that the new autoscaling runners use the aforementioned (dedicated) Standard_DS2_v2 machines, while previously the jobs were running on Standard_F16s_v2 machines, which have 32 GiB of memory (although those machines are shared with other workflows, the scheduled jobs on Saturday likely run at a quiet time). Side note: the Standard_DS2_v2 vs Standard_F16s_v2 difference should also explain why the job took longer with the new setup. I'm not really sure what we can do: reduce the profiling workload to make it fit in a Standard_DS2_v2 box, or get a pool of beefier machines (maybe one of the memory-optimised machines, CC @tamuri)?
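
For context, a memory-stress job of this kind can be as simple as allocating and touching a large array. Here's a minimal sketch; this is not necessarily what the dummy job above actually did, and the 10 GiB target is just the figure mentioned in the bullet above:

# Minimal sketch of a memory-stress job: allocate ~10 GiB and touch every byte
# so that the memory is actually committed, not just reserved.
target_bytes = 10 * 2^30                 # 10 GiB, above the 7 GiB of a Standard_DS2_v2
a = Vector{UInt8}(undef, target_bytes)
fill!(a, 0x01)                           # write to force physical allocation
println("Allocated and touched $(length(a) ÷ 2^30) GiB")
sleep(300)                               # hold the memory for a while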

@giordano
Member

giordano commented May 21, 2024

or get a pool of beefier machines (maybe one of the memory-optimised machines, CC @tamuri)

Actually, this may not be too bad an option: I'm comparing different machines with the Azure Pricing Calculator and, unless I'm reading it wrong, Standard_E2_v4 (16 GiB of memory with 2 fairly recent vCPUs) seems to be slightly cheaper than Standard_DS2_v2 (7 GiB of memory with 2 older-generation vCPUs) (see also the page about pricing of virtual machines).

@tamuri
Collaborator

tamuri commented May 21, 2024

We use the Standard D11 v2 for the batch-submit --more-memory option. The E-series doesn't come with any disk storage.

@tamuri
Collaborator

tamuri commented May 21, 2024

But we should check we're using the machine with the best value for money.

@giordano
Member

giordano commented May 21, 2024

The E-series doesn't come with any disk storage.

Yeah, I noticed after posting the message that the machines I suggested don't have temporary storage (the VM pricing page is much clearer about this), but some E-series machines do have temporary storage, just not all of them.

Here's a comparison of D2 v2 with some of the memory-optimised machines (pricing refers to UK South region), all of them have 2 vCPUs:

| Name | CPU | Memory | Temporary storage | Cost ($/month) |
|------|-----|--------|-------------------|----------------|
| D2 v2 | Intel® Xeon® Platinum 8272CL (second-generation Intel® Xeon® Scalable), Intel® Xeon® 8171M 2.1 GHz (Skylake), Intel® Xeon® E5-2673 v4 2.3 GHz (Broadwell), or Intel® Xeon® E5-2673 v3 2.4 GHz (Haswell) | 7 GiB | 100 GiB | 128.4800 |
| E2ads v5 | AMD EPYC 7763v | 16 GiB | 75 GiB | 112.4200 |
| E2a v4 | 2.35 GHz AMD EPYC 7452 | 16 GiB | 50 GiB | 108.0400 |
| E2pds v5 | Ampere® Altra® Arm-based | 16 GiB | 75 GiB | 98.5500 |

D2 v2 seems a bit expensive overall, maybe because it's in the general-purpose category and in high demand? E2a v4 isn't bad if 50 GiB of local storage is enough, and we could even consider E2pds v5 if using ARM CPUs is an option.
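
As a very rough value-for-money check, using only the monthly prices and memory sizes from the table above (so ignoring CPU generation, storage, and everything else):

# Rough $/month per GiB of memory, from the table above (UK South prices).
machines = [("D2 v2", 128.48, 7), ("E2ads v5", 112.42, 16),
            ("E2a v4", 108.04, 16), ("E2pds v5", 98.55, 16)]
for (name, cost, mem) in machines
    println(rpad(name, 10), round(cost / mem; digits = 2), " \$ per GiB per month")
end
# D2 v2: ~18.35, E2ads v5: ~7.03, E2a v4: ~6.75, E2pds v5: ~6.16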

@giordano
Member

giordano commented May 23, 2024

For the record, I set up a new autoscaling Kubernetes cluster with Standard_E2_v4 machines, which gives us better CPUs and more memory than Standard_DS2_v2 for less money, although with less storage (which shouldn't be a problem for our use case). I restarted the scheduled job that failed on Saturday, and this time it succeeded in about 8 hours, close to previous runs. So I think this is a net improvement over the first setup I attempted.

The only thing is that, for a reason I still don't understand, we can't use more than 12 GiB of memory even though the machine nominally has 16 GiB (more likely something between 15 and 16, but definitely more than 12): when I try to request values larger than 12 GiB (e.g. 14), no GitHub Actions runner is ever started at all because of insufficient memory. But in any case 12 GiB should be plenty of memory for the profiling jobs and similar workloads: I had a look at past runs of the profiling jobs, and the maximum total memory usage on the node was about 4 GiB. Edit: the ~12 GiB limit seems to come from AKS: the maximum allocatable memory on the node is about 12 GiB, even though the nodes we're requesting have 16 GiB:

% kubectl describe node aks-agentpool-...
[...]
Allocatable:
  cpu:                1900m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             12881168Ki
  pods:               110

This would also explain why I can't put more than 12 GiB in the spec.resources.requests.memory property of the runner pods.

@giordano
Member

Alright, I finally figured out the problem with the missing memory: AKS forcibly restricts the total amount of allocatable memory to reserve some space on the node for the Kubernetes service itself. The formula used to restrict the available memory depends on the version of Kubernetes, and in particular the formula used for Kubernetes v1.28 and earlier reserves much more memory than the one used with Kubernetes v1.29. We're currently using v1.28 (the latest "stable" version in AKS), while v1.29 should become "stable" around August-September according to the AKS docs about supported Kubernetes versions.

At the moment with Kubernetes v1.28 we have

 % kubectl describe node aks-agentpool-16076257-vmss000000
[...]
Capacity:
  cpu:                2
  ephemeral-storage:  129886128Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16375056Ki
  pods:               110
Allocatable:
  cpu:                1900m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             12881168Ki
  pods:               110

Note the large difference between memory capacity and allocatable memory, over 20%! Let's do some maths:

julia> total_memory = 16375056
16375056

julia> hard_eviction = 750 * 2 ^ 10
768000

julia> kube_reserved = round(Int, 0.25 * 4 * 2 ^ 20 + 0.20 * 4 * 2 ^ 20 + 0.10 * (total_memory - 8 * 2 ^ 20))
2686082

julia> allocatable_memory = total_memory - (hard_eviction + kube_reserved)
12920974

Following the formula reported in the docs for Kubernetes v1.28 and earlier, I get an expected allocatable memory of 12920974 KiB; in reality it's 12881168 KiB, but that's close enough (the error is about 0.3%).
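
For reference, here's a sketch of that pre-1.29 "regressive rate" formula as a reusable function. This follows my reading of the AKS docs; the brackets above 16 GiB aren't exercised by our nodes, so treat those as unverified:

# Sketch of the AKS pre-1.29 memory reservation, all quantities in KiB
# to match the `kubectl describe node` output above.
const GiB = 2^20  # KiB per GiB

function kube_reserved_pre129(total)
    # Regressive rates per memory bracket, as described in the AKS docs:
    # 25% of the first 4 GiB, 20% of the next 4, 10% of the next 8,
    # 6% of the next 112, 2% of anything beyond that.
    brackets = [(4GiB, 0.25), (4GiB, 0.20), (8GiB, 0.10), (112GiB, 0.06), (Inf, 0.02)]
    reserved = 0.0
    remaining = total
    for (size, rate) in brackets
        chunk = min(remaining, size)
        reserved += rate * chunk
        remaining -= chunk
        remaining <= 0 && break
    end
    return round(Int, reserved)
end

hard_eviction = 750 * 2^10  # 750 MiB hard-eviction threshold

allocatable(total) = total - (kube_reserved_pre129(total) + hard_eviction)

allocatable(16375056)  # ≈ 12920974 KiB, matching the estimate above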

With Kubernetes v1.29 and 50 maximum pods we expect to have:

julia> total_memory = 16375056
16375056

julia> k129_reserved(max_pods) = (100 * 2 ^ 10 + max_pods * 20 * 2 ^ 10)
k129_reserved (generic function with 1 method)

julia> total_memory - k129_reserved(50)
15248656

Also in this case the actual allocatable memory reported by Kubernetes is off by ~0.3% (sorry, I don't have the precise number to share!), but the number above is good enough for a ballpark estimate. I tried this setup and ran a test CI job in which I allocated an array of about 13 GiB, which together with the baseline memory usage on the node brought total usage to well over 14 GiB.

This amount of memory usage would systematically OOM the machine in all my previous attempts with Kubernetes 1.28, so this is a significant improvement, especially in my understanding of Kubernetes and AKS 😅 In any case, at the moment I don't think we're concerned about running out of memory with 12 GiB of allocatable memory, so I think we can stay with Kubernetes 1.28.

To summarise, now we have an AKS deployment with the following properties:

  • Standard_E2_v4 machines: cheaper, faster, and with less storage than Standard_DS2_v2 (I think storage is a non-negligible fraction of the cost of this service)
  • automatic upgrade to the latest "stable" version of Kubernetes according to AKS policy. Around the end of the summer we should automatically get Kubernetes 1.29, which would also enable using more memory on a single node
  • maximum 50 pods (in normal conditions I haven't seen more than 16 pods on a node, so 50 should be plenty)
  • maximum 8 concurrent runners

@tamuri
Collaborator

tamuri commented May 26, 2024

This is great, thanks for digging into it. I wonder whether we can move the Batch pool VMs over to this too. Everything is bundled into a Docker container; I'm not sure how much storage we'll need.
