[YUNIKORN-1559] Add a resource of pods=1 on all pods during scheduling #558
Codecov Report
@@ Coverage Diff @@
## master #558 +/- ##
==========================================
+ Coverage 69.71% 69.97% +0.26%
==========================================
Files 45 46 +1
Lines 7782 7881 +99
==========================================
+ Hits 5425 5515 +90
- Misses 2155 2163 +8
- Partials 202 203 +1
... and 7 files with indirect coverage changes
The e2e failures are due to YUNIKORN-1666. This needs a re-run once that change is done, before we commit.
Kicked off a recheck now that YUNIKORN-1666 is merged.
LGTM
waiting for the build to not fail e2e compilation
/recheck
What is this PR for?
In my cluster I am seeing occasional events like this on some of my pods, which cause them to get stuck.
54m Warning OutOfpods pod/tg-spark-executor-640b4349263cc74570ae3a1e-0 Node didn't have enough resource: pods, requested: 1, used: 12, capacity: 12
From what I can tell, all nodes have a pods resource value on them. Because this value already exists on nodes, my PR adds a pods=1 value to all pods inside the scheduler. YuniKorn then uses the same resource scheduling logic it uses for memory, CPU, etc. to limit the number of pods running on a node.
What type of PR is it?
Todos
What is the Jira issue?
https://issues.apache.org/jira/browse/YUNIKORN-1559
How should this be tested?
I am not 100% sure how best to test this. I fixed the existing tests, including the e2e tests. I am testing this on my Kubernetes clusters and it seems to have fixed the issue, but I am not sure whether there is anything else I should be worried about.
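For reference, here is a minimal, self-contained sketch of the behaviour I would expect a unit test to check. It is not the code in this PR: the `Resource` type, `withPodCount`, and `fits` are hypothetical stand-ins for the scheduler's own types, and the resource names and quantities are only illustrative. It shows that a pod which fits on memory and CPU is rejected once pods=1 is part of its request and the node is already at its pod capacity, as in the event above (used: 12, capacity: 12).

```go
package main

import "fmt"

// Resource is a hypothetical stand-in for the scheduler's resource type:
// a map from resource name to quantity.
type Resource map[string]int64

// withPodCount returns a copy of the request with pods=1 added, mirroring
// the idea of this PR (assumed helper, not the actual implementation).
func withPodCount(req Resource) Resource {
	out := make(Resource, len(req)+1)
	for name, quantity := range req {
		out[name] = quantity
	}
	out["pods"] = 1
	return out
}

// fits reports whether used+req stays within capacity for every resource
// named in the request. A resource missing from capacity counts as zero.
func fits(req, used, capacity Resource) bool {
	for name, want := range req {
		if used[name]+want > capacity[name] {
			return false
		}
	}
	return true
}

func main() {
	// Node from the event above: 12 pods used out of a capacity of 12.
	capacity := Resource{"memory": 16000, "vcore": 8000, "pods": 12}
	used := Resource{"memory": 4000, "vcore": 2000, "pods": 12}
	pod := Resource{"memory": 1000, "vcore": 500}

	// Without the pods resource the scheduler sees enough room.
	fmt.Println(fits(pod, used, capacity)) // true

	// With pods=1 added, the same fit check rejects the node.
	fmt.Println(fits(withPodCount(pod), used, capacity)) // false
}
```

The point of the sketch is that no special-case logic is needed: once pods is just another entry in the request, the generic per-resource comparison enforces the limit.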
Screenshots (if appropriate)
Questions: