[YUNIKORN-1559] Add a resource of `pods=1` on all pods during scheduling #558

elihschiff · 2023-03-23T15:20:54Z

What is this PR for?

In my cluster I occasionally see events like the following on some of my pods, which cause the pods to get stuck.

54m Warning OutOfpods pod/tg-spark-executor-640b4349263cc74570ae3a1e-0 Node didn't have enough resource: pods, requested: 1, used: 12, capacity: 12

From what I can tell, all nodes have a `pods` resource value. Because this value already exists on nodes, my PR adds a `pods=1` value to all pods inside the scheduler. YuniKorn then uses the same resource scheduling logic it uses for memory, CPU, etc. to limit the number of pods running on a node.
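The idea can be sketched in a few lines of Go. This is only an illustration under stated assumptions, not the code in `pkg/common/resource.go`: the `Resources` type and the `withPodCount` and `fits` helpers are hypothetical names, and the quantities are made up apart from the `pods` capacity of 12 taken from the event above.

```go
package main

import (
	"fmt"
	"sort"
)

// Resources maps a resource name ("memory", "vcore", "pods") to a quantity.
// Hypothetical stand-in for illustration; not the YuniKorn resource type.
type Resources map[string]int64

// withPodCount returns a copy of the request with pods=1 set,
// so that every pod consumes one unit of the node's "pods" capacity.
func withPodCount(req Resources) Resources {
	out := make(Resources, len(req)+1)
	for name, qty := range req {
		out[name] = qty
	}
	out["pods"] = 1
	return out
}

// fits checks every resource named in the request the same way:
// used + requested must not exceed capacity. It returns the first
// resource (in name order) that is short, or "" if the request fits.
func fits(req, used, capacity Resources) (bool, string) {
	names := make([]string, 0, len(req))
	for name := range req {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		if used[name]+req[name] > capacity[name] {
			return false, name
		}
	}
	return true, ""
}

func main() {
	// Node state matching the event: 12 of 12 pod slots are in use.
	capacity := Resources{"memory": 64000, "vcore": 16000, "pods": 12}
	used := Resources{"memory": 24000, "vcore": 6000, "pods": 12}
	raw := Resources{"memory": 2000, "vcore": 500}

	ok, _ := fits(raw, used, capacity)
	fmt.Println("without pods resource, fits:", ok)

	ok, short := fits(withPodCount(raw), used, capacity)
	fmt.Println("with pods=1, fits:", ok, "short on:", short)
}
```

In this sketch the request without a `pods` entry appears to fit because memory and CPU are available, which is the situation that leads to the `OutOfpods` event once the pod reaches the node. With `pods=1` added, the same generic check rejects the node up front.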

What type of PR is it?

Todos

- This change only works once the core repo is updated by "[YUNIKORN-1559] Fix gang scheduling for missing resource from queue" (yunikorn-core#521), so that change needs to be merged first.

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-1559

How should this be tested?

I am not 100% sure how best to test this. I fixed the tests, including the e2e tests. I am testing this on my Kubernetes clusters and it seems to have fixed the issue, but I am not sure whether there is anything else I should be worried about.

Screenshots (if appropriate)

Questions:

- The license files need an update.
- There are breaking changes for older versions.
- It needs documentation.

codecov · 2023-03-23T20:26:26Z

Codecov Report

Merging #558 (7a744d4) into master (551f6b7) will increase coverage by 0.26%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master     #558      +/-   ##
==========================================
+ Coverage   69.71%   69.97%   +0.26%     
==========================================
  Files          45       46       +1     
  Lines        7782     7881      +99     
==========================================
+ Hits         5425     5515      +90     
- Misses       2155     2163       +8     
- Partials      202      203       +1

| Impacted Files | Coverage Δ | |
|---|---|---|
| pkg/common/resource.go | `81.77% <85.71%> (-0.53%)` | ⬇️ |

... and 7 files with indirect coverage changes

wilfred-s · 2023-03-28T17:10:06Z

The e2e failures are due to YUNIKORN-1666; we need to re-run after that change is done before we commit.

craigcondit · 2023-03-28T17:11:38Z

Kicked off recheck as YUNIKORN-1666 is merged.

wilfred-s

LGTM
Waiting for the build to stop failing e2e compilation.

craigcondit · 2023-03-28T18:45:14Z

/recheck

elihschiff and others added 11 commits March 14, 2023 09:26

Add pods=1 resource value to all pod resources

608f237

Add pod limit to mock nodes to fix tests

f5c7f62

Merge branch 'apache:master' into eli.schiff/add_pods_resource_count

68a07b8

Add "pods" resource to placeholders that dont have them

8d802a9

Fix test

7a42526

Bump to test core version

147fa3c

Try and fix replace statement

f96e6b6

Bump core version

d0439af

Bump core version

02a9209

bump core

b2dc88c

bump core

0f705cf

elihschiff mentioned this pull request Mar 23, 2023

[YUNIKORN-1559] Fix gang scheduling for missing resource from queue apache/yunikorn-core#521

Closed

10 tasks

elihschiff commented Mar 23, 2023

go.mod (outdated)

elihschiff marked this pull request as ready for review March 23, 2023 15:26

wilfred-s assigned elihschiff Mar 27, 2023

wilfred-s requested changes Mar 28, 2023

go.mod (outdated)

go.sum (outdated)

pkg/common/resource.go

Update cache.TestNewPlaceholder() code and remove yunikorn core update

af5ed6d

wilfred-s self-requested a review March 28, 2023 15:46

wilfred-s approved these changes Mar 28, 2023

Build trigger

7a744d4

craigcondit closed this in fd25e8f Mar 28, 2023
