
Conversation

@geropl (Member) commented Dec 16, 2020

This:

  1. enables asynchronous binding of pods to nodes
  2. moves the duty of deleting a ghost workspace on each regular (non-ghost) workspace start from ws-manager to ws-scheduler
  3. hides ghost workspaces from the scheduler strategies; this relies on the fact that we do 2. ☝️ (see the sketch below)

TBD:

  • add tests
  • load test
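
A minimal sketch of points 2 and 3, assuming ghosts are marked by a pod label (the label name and the isGhost helper are illustrative; makeGhostsInvisible matches a commit name further down, but this body is an assumption, not the actual ws-scheduler code):

```go
package scheduler

import (
	corev1 "k8s.io/api/core/v1"
)

// isGhost is an assumed helper: here a ghost is identified by a pod label.
func isGhost(p *corev1.Pod) bool {
	return p.Labels["workspaceType"] == "ghost"
}

// makeGhostsInvisible returns the pods a placement strategy gets to see.
// Ghosts are filtered out, so strategies place regular workspaces as if the
// capacity held by ghosts were free. That is only safe because the scheduler
// deletes one ghost for every regular workspace it binds (point 2).
func makeGhostsInvisible(pods []*corev1.Pod) []*corev1.Pod {
	visible := make([]*corev1.Pod, 0, len(pods))
	for _, p := range pods {
		if !isGhost(p) {
			visible = append(visible, p)
		}
	}
	return visible
}
```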

@geropl (Member, Author) commented Dec 16, 2020

/werft run

👍 started the job as gitpod-build-gpl-2513-scheduler-deletes-ghosts.3

@geropl (Member, Author) commented Dec 16, 2020

/werft run

👍 started the job as gitpod-build-gpl-2513-scheduler-deletes-ghosts.4

🤞

@geropl geropl force-pushed the gpl/2513-scheduler-deletes-ghosts branch from 7133da8 to 130b2e0 on December 16, 2020 12:09
@geropl (Member, Author) commented Dec 16, 2020

/werft run

👍 started the job as gitpod-build-gpl-2513-scheduler-deletes-ghosts.6

@geropl (Member, Author) commented Dec 16, 2020

I just performed a load test, which sadly triggered some OutOfMemory errors again. Will revise the code again, but not before tomorrow.

@geropl (Member, Author) commented Jan 6, 2021

Actually I think the current issue is an instance of what was discussed above: in rare cases 1 second is not enough, kubelet and ws-scheduler get out of sync, and that results in the OutOfMemory errors.

The next attempt would be to increase the (max) grace period to the maximum (30s, AFAIR).

On the bright side: during the load test, the delete took longer than 1s in only ~4% of cases! 🙃

@geropl (Member, Author) commented Jan 13, 2021

Set deletionGracePeriod to 30s. That worked well for a 100-workspace test on staging. 🎉
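
For reference, a sketch of what that looks like with the context-aware client-go delete API (clientset, namespace, and podName are placeholders; only the 30s GracePeriodSeconds is taken from the comment above):

```go
package scheduler

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteGhost deletes a ghost pod with a 30s grace period, giving kubelet
// enough time to actually free the resources before a regular workspace
// pod is bound in its place.
func deleteGhost(ctx context.Context, clientset kubernetes.Interface, namespace, podName string) error {
	gracePeriod := int64(30)
	return clientset.CoreV1().Pods(namespace).Delete(ctx, podName, metav1.DeleteOptions{
		GracePeriodSeconds: &gracePeriod,
	})
}
```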

@geropl geropl marked this pull request as ready for review January 14, 2021 07:40
@geropl geropl changed the title [ws-scheduler] Make ghost workspaces more effective by integrating them with scheduler [wip][ws-scheduler] Make ghost workspaces more effective by integrating them with scheduler Jan 14, 2021
@geropl geropl changed the title [wip][ws-scheduler] Make ghost workspaces more effective by integrating them with scheduler [ws-scheduler] Make ghost workspaces more effective by integrating them with scheduler Jan 14, 2021
@geropl geropl requested a review from csweichel January 14, 2021 07:42
@geropl (Member, Author) commented Jan 14, 2021

Just coming back to this: I think maybe it's better to pass a ctx.WithTimeout to ghost.Delete (a new API available since the k8s API bump) to make sure we never run over the gracePeriod. This would have the effect that we skip this round of scheduling, but at most for the scanning interval (currently 2s). I think that would be more robust and also more responsive. Assuming pod.Delete(ctx, ...) handles cancellation gracefully, that is.

@csweichel
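
A sketch of that idea (Ghost and its Delete method are stand-ins for the actual code; the 5s timeout comes from one of the commits listed below, and a later commit raises it above the grace period):

```go
package scheduler

import (
	"context"
	"errors"
	"time"
)

// Ghost is a stand-in for however the scheduler wraps a ghost pod.
type Ghost struct{}

// Delete is assumed to forward to the context-aware pod delete.
func (g *Ghost) Delete(ctx context.Context) error { return nil }

// tryDeleteGhost bounds the delete with a context timeout so a slow delete
// can never stall the scheduling loop: on timeout we skip this round, and
// the next scan (every ~2s) simply retries.
func tryDeleteGhost(ghost *Ghost) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	if err := ghost.Delete(ctx); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			return nil // skip this scheduling round; retried on the next scan
		}
		return err
	}
	return nil
}
```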

@geropl geropl force-pushed the gpl/2513-scheduler-deletes-ghosts branch from cf0daa9 to 5a04770 on January 20, 2021 09:43
@geropl (Member, Author) commented Jan 20, 2021

@csweichel ping

@geropl geropl force-pushed the gpl/2513-scheduler-deletes-ghosts branch from 5a04770 to 7d162a1 on January 26, 2021 09:10
@geropl geropl force-pushed the gpl/2513-scheduler-deletes-ghosts branch 2 times, most recently from 35f1119 to f13f335, on January 27, 2021 13:35
@geropl (Member, Author) commented Jan 27, 2021

Another load test with 200 workspaces on beec427 and 4297176 went well.

```diff
 for _, n := range nodes {
 	nds[n.Name] = &Node{
 		Node:     n,
 		Services: make(map[string]struct{}),
 	}
-	nodeToPod[n.Name] = make(map[string]struct{})
+	nodeToPod[n.Name] = &ntp{
```
Contributor:

why not make this part of nds?

Member Author:

Good question - I left it as I found it. 🙃
But the reason seems to be that the bookkeeping is done while adding the pods from pds to each node in nds here, so it makes sense to keep them separate.
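
To illustrate the separation being described (pds, nds, and ntp are from the snippet above; the field names and loop body are assumptions about its shape, not the actual code):

```go
package scheduler

import corev1 "k8s.io/api/core/v1"

// Node and ntp mirror the snippet above; the fields here are assumptions.
type Node struct {
	Node     corev1.Node
	Services map[string]struct{}
	Pods     []*corev1.Pod
}

type ntp struct {
	pods map[string]struct{}
}

// fillState shows why nodeToPod stays separate from nds: nds is built from
// the node list first, and the pod bookkeeping only happens afterwards,
// while the pods from pds are attached to their nodes.
func fillState(nds map[string]*Node, nodeToPod map[string]*ntp, pds []*corev1.Pod) {
	for _, p := range pds {
		node, ok := nds[p.Spec.NodeName]
		if !ok {
			continue // pod not (yet) bound to a node we track
		}
		node.Pods = append(node.Pods, p)
		nodeToPod[p.Spec.NodeName].pods[p.Name] = struct{}{}
	}
}
```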

@csweichel (Contributor):

Didn't dive into the logic of things, hence the vanity comments

@csweichel (Contributor) left a comment

LGTM pending the changes discussed offline and the squash

@geropl geropl force-pushed the gpl/2513-scheduler-deletes-ghosts branch from f13f335 to 5553fc1 on January 27, 2021 15:53
@geropl geropl merged commit 54b314b into master Jan 27, 2021
@geropl geropl deleted the gpl/2513-scheduler-deletes-ghosts branch January 27, 2021 16:07
pavan-tri pushed a commit to trilogy-group/gitpod that referenced this pull request Apr 28, 2021
[ws-scheduler] Make ghost workspaces more effective by integrating them with scheduler (gitpod-io#2552)

* [ws-manager] Do not delete ghost workspace on start

* [ws-scheduler] Enable asynchronous binding of pods

* [ws-scheduler] Introduce ghosts

 - remove a ghost before binding a regular workspace
 - make ghosts "invisible" to strategy

* [scheduler] Wait longer on ghost deletion to prevent OOM errors

* [scheduler] Make isRegularWorkspace -> makeGhostsInvisible explicit

* [scheduler] cancel ghost.Delete if it takes too long (5s)

* [ws-scheduler] Add tests for ghost-specific state computation

* [scheduler] Make sure ghosts are only selected for deletion once

* [scheduler] delete ghosts: ctxDeleteTimeout > gracePeriod

* [scheduler] Don't bind terminated pods

* [scheduler] Make all non-ghost workspaces replace ghosts

* [scheduler] review comments