-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for draining workers #12178
Support for draining workers #12178
Conversation
622d733
to
5eea4cb
Compare
/pulsarbot run-failure-checks |
5eea4cb
to
52557aa
Compare
/pulsarbot run-failure-checks |
pulsar-functions/worker/src/main/java/org/apache/pulsar/functions/worker/SchedulerManager.java
Show resolved
Hide resolved
...worker/src/main/java/org/apache/pulsar/functions/worker/rest/api/v2/WorkerApiV2Resource.java
Outdated
Show resolved
Hide resolved
...worker/src/main/java/org/apache/pulsar/functions/worker/rest/api/v2/WorkerApiV2Resource.java
Outdated
Show resolved
Hide resolved
...worker/src/main/java/org/apache/pulsar/functions/worker/rest/api/v2/WorkerApiV2Resource.java
Outdated
Show resolved
Hide resolved
pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/v2/Worker.java
Outdated
Show resolved
Hide resolved
pulsar-functions/worker/src/main/java/org/apache/pulsar/functions/worker/SchedulerManager.java
Outdated
Show resolved
Hide resolved
long startTime = System.nanoTime(); | ||
int numRemovedWorkerIds = 0; | ||
|
||
if (drainOpStatusMap.size() > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know drainOpStatusMap is a concurrent map but we are reading and writing from different different threads. This makes me nervous. Is there a reason we should synchronize this with drain operation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't quite understand the question: "Is there a reason we should synchronize this with drain operation?"
updateWorkerDrainMap() does a periodic cleanup of stale information about drained workers. We need to do the operation some time after the drain has finished, when the drained worker is removed from the cluster. Since there is currently no hook (that I know of) into the SchedulerManager when a worker is added to, or removed from the cluster, the cleanup is done through a periodic poll (updateWorkerDrainMap).
The drain operation adds a record into the concurrent map [drainOpStatusMap] when a worker is drained. An implicit assumption in the system is that the drained worker will be removed from the system, soon, by an external orchestrator. When the drained worker is seen to be removed from the system, the drainOpStatusMap is cleaned up of stale information in the updateWorkerDrainMap() code.
PLMK if I misunderstood the qs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kaushik-develop there are two independent threads read and updating drainOpStatusMap. One is a schedule periodic task:
The other is when a drain operation is triggered. Two independent actors can be read and modifying the map concurrently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is by design. The external orchestrator is expected to trigger the draining of a worker, and check for drain status. It is also expected to remove the worker after a drain operation completes, and not re-add a worker with the same name as the drained/removed worker for one period (configurable, nominally 60 seconds) after the worker-removal. This will ensure that the two actors work on different entries of the concurrent map. But PLMK if you foresee problems in any specific scenario.
pulsar-functions/worker/src/main/java/org/apache/pulsar/functions/worker/SchedulerManager.java
Outdated
Show resolved
Hide resolved
pulsar-functions/worker/src/main/java/org/apache/pulsar/functions/worker/SchedulerManager.java
Show resolved
Hide resolved
e38f76c
to
2b46004
Compare
/pulsarbot run-failure-checks |
pulsar-functions/worker/src/main/java/org/apache/pulsar/functions/worker/SchedulerManager.java
Outdated
Show resolved
Hide resolved
pulsar-functions/worker/src/main/java/org/apache/pulsar/functions/worker/SchedulerManager.java
Outdated
Show resolved
Hide resolved
pulsar-functions/worker/src/main/java/org/apache/pulsar/functions/worker/SchedulerManager.java
Outdated
Show resolved
Hide resolved
pulsar-functions/worker/src/main/java/org/apache/pulsar/functions/worker/SchedulerManager.java
Outdated
Show resolved
Hide resolved
2b46004
to
9321c3d
Compare
/pulsarbot run-failure-checks |
Co-authored-by: Kaushik Ghosh <kaushikg@splunk.com>
Co-authored-by: Kaushik Ghosh <kaushikg@splunk.com>
(If this PR fixes a github issue, please add
Fixes #<xyz>
.)Fixes #
(or if this PR is one task of a github issue, please add
Master Issue: #<xyz>
to link to the master issue.)Master Issue: #
Motivation
Support for draining function workers, to scale in a cluster.
Explain here the context, and why you're making that change. What is the problem you're trying to solve.
Modifications
Describe the modifications you've done.
Verifying this change
(Please pick either of the following options)
This change is a trivial rework / code cleanup without any test coverage.
(or)
This change is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Does this pull request potentially affect one of the following parts:
If
yes
was chosen, please highlight the changes[new endpoints were added under "admin/v2/worker/", related to draining nodes, and getting status of a drain op]
Documentation
Check the box below and label this PR (if you have committer privilege).
Need to update docs?
doc-required
(If you need help on updating docs, create a doc issue)
no-need-doc
(Please explain why)
doc
(If this PR contains doc changes)