
[YUNIKORN-1559] Add a resource of pods=1 on all pods during scheduling #558

Closed

Conversation

elihschiff
Contributor

What is this PR for?

In my cluster I am seeing occasional events like this on some of my pods, which cause them to get stuck:

54m Warning OutOfpods pod/tg-spark-executor-640b4349263cc74570ae3a1e-0 Node didn't have enough resource: pods, requested: 1, used: 12, capacity: 12

From what I can tell, all nodes already carry a pods resource value. Because this value already exists on nodes, my PR adds a pods=1 resource to every pod inside the scheduler. YuniKorn then uses the same resource scheduling logic it applies to memory, CPU, etc. to limit the number of pods running on a node.
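
The idea can be sketched in a few lines. The snippet below is a minimal illustration, not the actual diff: `Resource` and `podResource` are simplified stand-ins for the real types in pkg/common/resource.go, showing how counting each pod as one unit of a pods resource reuses the existing quantity-based fit logic.

```go
package main

import "fmt"

// Resource is a simplified stand-in for the scheduler's internal
// resource type (the real change lives in pkg/common/resource.go).
type Resource struct {
	Resources map[string]int64
}

// podResource builds the scheduling demand for a pod from its container
// requests and always adds pods=1, so the node's pod-count capacity is
// enforced by the same logic already used for memory and CPU.
func podResource(requests map[string]int64) *Resource {
	res := &Resource{Resources: map[string]int64{}}
	for name, qty := range requests {
		res.Resources[name] += qty
	}
	res.Resources["pods"] = 1 // every pod occupies exactly one pod slot
	return res
}

func main() {
	r := podResource(map[string]int64{"vcore": 1000, "memory": 512})
	fmt.Println(r.Resources) // map[memory:512 pods:1 vcore:1000]
}
```

The intent is that a node whose pods capacity is exhausted simply fails the fit check inside YuniKorn, instead of the kubelet rejecting an already-bound pod with an OutOfpods event.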

What type of PR is it?

  • Bug Fix
  • Improvement
  • Feature
  • Documentation
  • Hot Fix
  • Refactoring

Todos

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-1559

How should this be tested?

I am not 100% sure how best to test this. I fixed the existing tests, including the e2e tests. I am running this on my Kubernetes clusters and it seems to have fixed the issue, but I am not 100% sure whether there is anything else I should be worried about.
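
One way to lock in the invariant at the unit level would be a check along these lines (a hypothetical test reusing the simplified podResource helper from the sketch above, not the project's actual test code):

```go
package main

import "testing"

// Hypothetical test: whatever container requests a pod carries,
// the resource computed for scheduling must include pods=1.
func TestPodResourceIncludesPodCount(t *testing.T) {
	r := podResource(map[string]int64{"memory": 256})
	if got := r.Resources["pods"]; got != 1 {
		t.Fatalf("expected pods=1 on every pod resource, got %d", got)
	}
}
```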

Screenshots (if appropriate)

Questions:

  • The license files need an update.
  • There are breaking changes for older versions.
  • It needs documentation.

@elihschiff marked this pull request as ready for review on March 23, 2023 at 15:26
codecov bot commented Mar 23, 2023

Codecov Report

Merging #558 (7a744d4) into master (551f6b7) will increase coverage by 0.26%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master     #558      +/-   ##
==========================================
+ Coverage   69.71%   69.97%   +0.26%     
==========================================
  Files          45       46       +1     
  Lines        7782     7881      +99     
==========================================
+ Hits         5425     5515      +90     
- Misses       2155     2163       +8     
- Partials      202      203       +1     
Impacted Files            Coverage Δ
pkg/common/resource.go    81.77% <85.71%> (-0.53%) ⬇️

... and 7 files with indirect coverage changes


@wilfred-s self-requested a review on March 28, 2023 at 15:46
@wilfred-s
Contributor

e2e failures are due to YUNIKORN-1666; we need to re-run after that change is done before we commit.

@craigcondit
Contributor

Kicked off recheck as YUNIKORN-1666 is merged.

@wilfred-s (Contributor) left a comment


LGTM
waiting for the build to not fail e2e compilation

@craigcondit
Contributor

/recheck
