multithreaded event processor #317

Merged: 1 commit merged into aws:main on Dec 28, 2020

Conversation

@universam1 (Contributor) commented Dec 11, 2020

Issue #310

Description of changes:

This replaces the single-process execution of events with parallel processing, solving the issue that occurs when NTH is busy or blocked (retrying an eviction) and eventually misses events for other nodes going down at the same time.

Example:
3 nodes roll at the same time because of batchSize or a spot interruption.
A deployment has a PDB with maxUnavailable of 1; that will block NTH in an eviction retry loop, and it will miss the third node's eviction.

The number of workers is capped to prevent a memory runaway.
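
A minimal sketch of the idea, assuming a semaphore-style worker cap (the event type, processEvents, and maxWorkers names are illustrative, not the actual NTH identifiers):

```go
package main

import "fmt"

// event stands in for an interruption event; the real type lives in NTH's
// interruptioneventstore package.
type event struct{ nodeName string }

// processEvents handles each event in its own goroutine, capped at maxWorkers
// so a burst of events cannot cause runaway memory use.
func processEvents(events <-chan event, maxWorkers int, handle func(event)) {
	sem := make(chan struct{}, maxWorkers) // counting semaphore
	for ev := range events {
		sem <- struct{}{} // blocks once maxWorkers handlers are in flight
		go func(ev event) {
			defer func() { <-sem }()
			handle(ev) // e.g. cordon + drain, including its own eviction retries
		}(ev)
	}
	// Re-acquiring every slot waits for the remaining handlers to finish.
	for i := 0; i < maxWorkers; i++ {
		sem <- struct{}{}
	}
}

func main() {
	events := make(chan event, 3)
	for _, name := range []string{"node-a", "node-b", "node-c"} {
		events <- event{nodeName: name}
	}
	close(events)
	// A slow, retrying drain of node-a no longer delays node-b and node-c.
	processEvents(events, 2, func(ev event) { fmt.Println("draining", ev.nodeName) })
}
```

The cap bounds memory and goroutine count even if a large batch of interruption events arrives at once.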

@universam1 (Contributor, Author)

should solve #310

What do you think, @haugenj?

@haugenj (Contributor) left a comment

I have mixed feelings about this

Certainly for IMDS mode, this is unnecessary. The event queue is restricted to each node and doesn't change often.

For Queue Processor mode, I see the benefit mostly for medium-sized clusters.

For small clusters I wouldn't think there are so many events coming in that this is very helpful. I'm not sure this scales well for large clusters either; to me, the better solution would be to run multiple replicas of the pod.

I wasn't as involved with Queue Processor mode, though, so let me sync up with @bwagner5 next week to get more details. In the meantime, if you can give me some more details about the setup you're running (size of your cluster, time seen for events to be processed, etc.), maybe I'll be more convinced.

cmd/node-termination-handler.go: review comment (outdated, resolved)
pkg/interruptioneventstore/interruption-event-store.go: review comment (outdated, resolved)

@universam1 (Contributor, Author)

> I have mixed feelings about this
>
> Certainly for IMDS mode, this is unnecessary. The event queue is restricted to each node and doesn't change often.
>
> For Queue Processor mode, I see the benefit mostly for medium-sized clusters.
>
> For small clusters I wouldn't think there are so many events coming in that this is very helpful. I'm not sure this scales well for large clusters either; to me, the better solution would be to run multiple replicas of the pod.
>
> I wasn't as involved with Queue Processor mode, though, so let me sync up with @bwagner5 next week to get more details. In the meantime, if you can give me some more details about the setup you're running (size of your cluster, time seen for events to be processed, etc.), maybe I'll be more convinced.

Thank you for your reply.
Apparently the problem is not yet clear, but it happens on every cluster we run, whether it has only a couple of nodes or, of course, hundreds of nodes.

I think one fact that was not taken into account when moving to the queue processor is that a DaemonSet was effectively multithreaded execution with n workers, where n == node count. The queue processor, however, is strictly single-threaded and cannot keep up when its loop is busy. There are several reasons the loop can be busy or blocked with retries (evicting against a limiting PDB, etc.), and it eventually ignores other events.

I will post an example to reproduce later. Simply rolling more than one node at a time (batchSize) with one unevictable pod reproduces this issue; see the toy sketch below.

Maybe AWS doesn’t operate this itself yet?
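
To make the starvation concrete, here is a toy sequential loop (all names hypothetical, not NTH code): while the drain of the first node keeps retrying against a blocking PDB, the events for the remaining nodes sit unprocessed, which can easily outlast a two-minute spot interruption notice.

```go
package main

import (
	"fmt"
	"time"
)

// drainSequentially models the single-threaded queue processor: every retry
// for a PDB-blocked node delays the drain of every later node.
func drainSequentially(nodes []string, tryDrain func(node string) bool, retryDelay time.Duration) {
	for _, node := range nodes {
		for attempt := 1; attempt <= 5 && !tryDrain(node); attempt++ {
			time.Sleep(retryDelay) // retries for node-a stall node-b and node-c
		}
	}
}

func main() {
	blocked := map[string]bool{"node-a": true} // a PDB allows no eviction on node-a
	drainSequentially([]string{"node-a", "node-b", "node-c"}, func(node string) bool {
		fmt.Println("trying to evict pods on", node)
		return !blocked[node]
	}, 10*time.Millisecond)
}
```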

@universam1 universam1 marked this pull request as draft December 14, 2020 08:32
@universam1 universam1 changed the title from "draft: multithreaded event processor" to "multithreaded event processor" Dec 14, 2020
@universam1 universam1 marked this pull request as ready for review December 21, 2020 10:09
@universam1 (Contributor, Author)

@haugenj I simplified the approach by calling the goroutine directly from main; it requires only small changes.

Hopefully the underlying single-threaded problem is clearer now.
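
A rough sketch of what calling the goroutine directly from main might look like (the channel, the event type, and the drainOrCordonIfNecessary signature are simplified here, not the real NTH code):

```go
package main

import "fmt"

type drainEvent struct{ nodeName string }

// Placeholder for the real handler, which cordons and drains the node and
// performs its own eviction retries.
func drainOrCordonIfNecessary(ev drainEvent) { fmt.Println("handling", ev.nodeName) }

func main() {
	const workers = 10 // worker cap
	drainChan := make(chan drainEvent, 3)
	for _, name := range []string{"node-a", "node-b", "node-c"} {
		drainChan <- drainEvent{nodeName: name}
	}
	close(drainChan)

	sem := make(chan struct{}, workers)
	for ev := range drainChan {
		sem <- struct{}{}
		go func(ev drainEvent) {
			defer func() { <-sem }()
			drainOrCordonIfNecessary(ev) // eviction retries no longer block the main loop
		}(ev)
	}
	// Wait for in-flight drains before exiting; the review below suggests a
	// sync.WaitGroup to track these goroutines instead.
	for i := 0; i < workers; i++ {
		sem <- struct{}{}
	}
}
```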

@haugenj (Contributor) commented Dec 22, 2020

@universam1 thank you for your patience, this looks good to me. Can you also add the config variable to the Readme and Helm charts? Example: https://github.com/aws/aws-node-termination-handler/pull/312/files

@universam1 (Contributor, Author)

> @universam1 thank you for your patience, this looks good to me. Can you also add the config variable to the Readme and Helm charts? Example: https://github.com/aws/aws-node-termination-handler/pull/312/files

@haugenj Thank you - certainly, I updated the Helm chart and Readme - please let me know if anything is missing!
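
For illustration only, a hedged sketch of how such a worker-count setting is typically read; the WORKERS environment variable name and the default of 10 are assumptions of this sketch, not confirmed by this thread:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// workerCount reads the worker cap from the environment with a fallback
// default; both the variable name and the default are illustrative.
func workerCount() int {
	if v, err := strconv.Atoi(os.Getenv("WORKERS")); err == nil && v > 0 {
		return v
	}
	return 10
}

func main() {
	fmt.Println("worker cap:", workerCount())
}
```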

@bwagner5 (Contributor) left a comment

I'd suggest adding a WaitGroup to track the drainOrCordonIfNecessary go-routines.

Everything else looks great! Thanks for taking this on and contributing! 🚀

cmd/node-termination-handler.go: review comment (outdated, resolved)
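
A hedged sketch of that suggestion, using the same simplified event type as above rather than the real NTH signatures: a sync.WaitGroup tracks every spawned drainOrCordonIfNecessary goroutine so the process can wait for them before exiting.

```go
package main

import (
	"fmt"
	"sync"
)

type drainEvent struct{ nodeName string }

// Placeholder for the real handler.
func drainOrCordonIfNecessary(ev drainEvent) { fmt.Println("handling", ev.nodeName) }

func main() {
	drainChan := make(chan drainEvent, 2)
	drainChan <- drainEvent{nodeName: "node-a"}
	drainChan <- drainEvent{nodeName: "node-b"}
	close(drainChan)

	var wg sync.WaitGroup
	for ev := range drainChan {
		wg.Add(1)
		go func(ev drainEvent) {
			defer wg.Done()
			drainOrCordonIfNecessary(ev)
		}(ev)
	}
	wg.Wait() // no drain goroutine is abandoned when the process shuts down
}
```
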
@bwagner5 (Contributor) left a comment

LGTM! Thanks!

@bwagner5 bwagner5 merged commit d8460d2 into aws:main Dec 28, 2020
@universam1 universam1 deleted the workers branch December 28, 2020 16:07
haugenj pushed a commit to haugenj/aws-node-termination-handler that referenced this pull request Feb 19, 2021