ARC in k8s mode not working with ResourceQuotas. Jobs fail instead of queuing. #3630
Open
Labels
bug (Something isn't working)
gha-runner-scale-set (Related to the gha-runner-scale-set mode)
needs triage (Requires review from the maintainers)
Checks
Controller Version
0.9.3
Deployment Method
Helm
To Reproduce
In k8s mode, using the container hook template, request more CPU, memory, storage, etc. than the resource quota allows. For example:
You can also reproduce this with a single job that exceeds the resource quota, but the scenario above is more likely.
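To make the failure mode concrete, here is a minimal sketch of the admission arithmetic. It is not the example from this report: the `Quota` class and the 3000m/2000m CPU numbers are made up for illustration. The only behavior it relies on is that Kubernetes checks a ResourceQuota when a pod is created and rejects the request outright rather than holding it until capacity frees up.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """Hypothetical namespace quota: a hard CPU limit in millicores."""
    hard_cpu_m: int
    used_cpu_m: int = 0

    def admit(self, request_cpu_m: int) -> bool:
        """Admit the pod only if its request fits in the remaining quota."""
        if self.used_cpu_m + request_cpu_m > self.hard_cpu_m:
            return False  # rejected at creation time, never queued
        self.used_cpu_m += request_cpu_m
        return True


quota = Quota(hard_cpu_m=3000)  # assumed quota: 3 CPUs
first = quota.admit(2000)       # job A's workflow pod requests 2 CPUs
second = quota.admit(2000)      # job B's workflow pod requests 2 CPUs
print(first, second)
```

Each request fits the quota on its own, but the second one arrives while the first is still counted, so it is refused even though it would fit a moment later.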
Describe the bug
In the example provided, one job runs successfully; the other fails when trying to create the workflow pod because the resource quota is temporarily exceeded.
In general, jobs fail whenever the quota is temporarily exceeded.
![image](https://private-user-images.githubusercontent.com/5762764/344589680-9e10bb04-efbb-4c2f-93eb-d580e60b9022.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjEwNDM0NzQsIm5iZiI6MTcyMTA0MzE3NCwicGF0aCI6Ii81NzYyNzY0LzM0NDU4OTY4MC05ZTEwYmIwNC1lZmJiLTRjMmYtOTNlYi1kNTgwZTYwYjkwMjIucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDcxNSUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA3MTVUMTEzMjU0WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9YTM5ZmMzYTkwNDNiOGE1ZDI1NzA3MzZjZjk1Y2ZiODI4YzhlYmU4MWNmYzZlYjg5MDIyOTlkMDE3N2EwZGJmMSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.f4CjyCQUvz6yqXUX_ZXziYeEwrkOQUcGLE-O0hEy-hk)
Describe the expected behavior
In the example provided, the jobs should queue properly and run one at a time, completing one after the other and leading to a successful build.
In general, when the quota is temporarily exceeded, we should try again after a while, preferably through a queue implementation. It is also a bit stupid that the runner starts even when there is no room for the workflow pod, but I am not sure what could be done about that without changing the architecture of ARC a lot.
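The "try again after a while" idea could look roughly like the sketch below. This is not ARC or container hook code: `QuotaExceeded`, `create_with_retry`, and all the timing defaults are hypothetical, and a real implementation would have to recognize the quota rejection from the API error returned when creating the workflow pod.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class QuotaExceeded(Exception):
    """Hypothetical error raised when pod creation is rejected by a quota."""


def create_with_retry(
    create: Callable[[], T],
    max_wait_s: float = 600.0,
    initial_delay_s: float = 5.0,
    max_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call create() until it succeeds or the wait budget is used up."""
    delay = initial_delay_s
    waited = 0.0
    while True:
        try:
            return create()
        except QuotaExceeded:
            if waited + delay > max_wait_s:
                raise  # give up: the quota did not free up in time
            sleep(delay)
            waited += delay
            # Exponential backoff, capped so long waits keep polling.
            delay = min(delay * 2, max_delay_s)
```

Plain backoff like this only avoids failing on the first rejection. It gives no ordering between waiting jobs, so a proper queue would still be the better fix.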
Additional Context
This can be considered a separate issue from #3629, as I believe the code needs to change in a different spot and the behavior is a little different. This is why I created two issues.
Controller Logs
Runner Pod Logs