Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-46888][CORE] Fix Master to reject /workers/kill/ requests if decommission is disabled #44915

Closed
wants to merge 3 commits into from

Conversation

dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Jan 28, 2024

What changes were proposed in this pull request?

This PR aims to fix Master to reject /workers/kill/ request if spark.decommission.enabled is false in order to fix the dangling worker issue.

Why are the changes needed?

Currently, spark.decommission.enabled is false by default. So, when a user asks to decommission, only Master marked it DECOMMISSIONED while the worker is alive.

$ curl -XPOST http://localhost:8080/workers/kill/\?host\=127.0.0.1

Master UI
Screenshot 2024-01-27 at 6 19 18 PM

Worker Log

24/01/27 18:18:06 WARN Worker: Receive decommission request, but decommission feature is disabled.

To be consistent with the existing Worker behavior which ignores the request.

private[deploy] def decommissionSelf(): Unit = {
if (conf.get(config.DECOMMISSION_ENABLED) && !decommissioned) {
decommissioned = true
logInfo(s"Decommission worker $workerId.")
} else if (decommissioned) {
logWarning(s"Worker $workerId already started decommissioning.")
} else {
logWarning(s"Receive decommission request, but decommission feature is disabled.")
}
}

Does this PR introduce any user-facing change?

No, this is a bug fix.

How was this patch tested?

Pass the CI with the newly added test case.

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-46888][CORE] Fix Master to reject worker kill request if decommission is disabled [SPARK-46888][CORE] Fix Master to reject /workers/kill/ requests if decommission is disabled Jan 28, 2024
@dongjoon-hyun
Copy link
Member Author

Also, could you review this dangled Workers bug @viirya ?

@dongjoon-hyun
Copy link
Member Author

Oh, wait. I'm revising the test case.

@dongjoon-hyun
Copy link
Member Author

Thank you so much again, @viirya .

dongjoon-hyun added a commit that referenced this pull request Jan 28, 2024
…if decommission is disabled

This PR aims to fix `Master` to reject `/workers/kill/` request if `spark.decommission.enabled` is `false` in order to fix the dangling worker issue.

Currently, `spark.decommission.enabled` is `false` by default. So, when a user asks to decommission, only Master marked it `DECOMMISSIONED` while the worker is alive.
```
$ curl -XPOST http://localhost:8080/workers/kill/\?host\=127.0.0.1
```

**Master UI**
![Screenshot 2024-01-27 at 6 19 18 PM](https://github.com/apache/spark/assets/9700541/443bfc32-b924-438a-8bf6-c64b9afbc4be)

**Worker Log**
```
24/01/27 18:18:06 WARN Worker: Receive decommission request, but decommission feature is disabled.
```

To be consistent with the existing `Worker` behavior which ignores the request.

https://github.com/apache/spark/blob/1787a5261e87e0214a3f803f6534c5e52a0138e6/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L859-L868

No, this is a bug fix.

Pass the CI with the newly added test case.

No.

Closes #44915 from dongjoon-hyun/SPARK-46888.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 20b5938)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun added a commit that referenced this pull request Jan 28, 2024
…if decommission is disabled

This PR aims to fix `Master` to reject `/workers/kill/` request if `spark.decommission.enabled` is `false` in order to fix the dangling worker issue.

Currently, `spark.decommission.enabled` is `false` by default. So, when a user asks to decommission, only Master marked it `DECOMMISSIONED` while the worker is alive.
```
$ curl -XPOST http://localhost:8080/workers/kill/\?host\=127.0.0.1
```

**Master UI**
![Screenshot 2024-01-27 at 6 19 18 PM](https://github.com/apache/spark/assets/9700541/443bfc32-b924-438a-8bf6-c64b9afbc4be)

**Worker Log**
```
24/01/27 18:18:06 WARN Worker: Receive decommission request, but decommission feature is disabled.
```

To be consistent with the existing `Worker` behavior which ignores the request.

https://github.com/apache/spark/blob/1787a5261e87e0214a3f803f6534c5e52a0138e6/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L859-L868

No, this is a bug fix.

Pass the CI with the newly added test case.

No.

Closes #44915 from dongjoon-hyun/SPARK-46888.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 20b5938)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Copy link
Member Author

Merged to master/3.5/3.4.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-46888 branch January 28, 2024 04:33
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
…if decommission is disabled

This PR aims to fix `Master` to reject `/workers/kill/` request if `spark.decommission.enabled` is `false` in order to fix the dangling worker issue.

Currently, `spark.decommission.enabled` is `false` by default. So, when a user asks to decommission, only Master marked it `DECOMMISSIONED` while the worker is alive.
```
$ curl -XPOST http://localhost:8080/workers/kill/\?host\=127.0.0.1
```

**Master UI**
![Screenshot 2024-01-27 at 6 19 18 PM](https://github.com/apache/spark/assets/9700541/443bfc32-b924-438a-8bf6-c64b9afbc4be)

**Worker Log**
```
24/01/27 18:18:06 WARN Worker: Receive decommission request, but decommission feature is disabled.
```

To be consistent with the existing `Worker` behavior which ignores the request.

https://github.com/apache/spark/blob/1787a5261e87e0214a3f803f6534c5e52a0138e6/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L859-L868

No, this is a bug fix.

Pass the CI with the newly added test case.

No.

Closes apache#44915 from dongjoon-hyun/SPARK-46888.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 20b5938)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants