[SPARK-45245][PYTHON][CONNECT] PythonWorkerFactory: Timeout if worker does not connect back. #43023

Closed
rangadi wants to merge 9 commits

Contributor

@rangadi rangadi commented Sep 21, 2023

What changes were proposed in this pull request?

The createSimpleWorker() method in PythonWorkerFactory waits forever if the worker fails to connect back to the server.

This is because it calls accept() without a timeout. If the worker does not connect back, accept() waits forever. There was supposed to be a 10-second timeout, but it was not implemented correctly.

This PR adds a 10-second timeout.
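
For readers unfamiliar with the technique, here is a minimal Java sketch of one common way to bound accept() on a ServerSocketChannel: put the channel in non-blocking mode and wait on a Selector with a timeout. It is an illustration only, not the code from this PR; the class and method names (TimedAccept, acceptWithTimeout, and the two demo helpers) are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical helper, not part of Spark.
class TimedAccept {

    // Waits up to timeoutMillis for an incoming connection.
    // Note: Selector.select(0) blocks indefinitely, so pass a positive timeout.
    static SocketChannel acceptWithTimeout(ServerSocketChannel server, long timeoutMillis)
            throws IOException {
        // A channel must be non-blocking before it can be registered with a selector.
        server.configureBlocking(false);
        try (Selector selector = Selector.open()) {
            server.register(selector, SelectionKey.OP_ACCEPT);
            // select() returns the number of ready keys; 0 means nothing arrived
            // in time (or the selector was woken up early), treated here as a timeout.
            if (selector.select(timeoutMillis) <= 0) {
                throw new SocketTimeoutException(
                    "No connection within " + timeoutMillis + " ms");
            }
            // Non-blocking accept; may still return null if the pending
            // connection disappeared in the meantime.
            return server.accept();
        }
    }

    // Demo: nobody connects, so the call should give up after about 200 ms.
    static boolean timesOutWithoutClient() {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            acceptWithTimeout(server, 200);
            return false;
        } catch (SocketTimeoutException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo: a client connects first, so the accept should succeed.
    static boolean acceptsWaitingClient() {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress());
                 SocketChannel accepted = acceptWithTimeout(server, 2000)) {
                return accepted != null && accepted.isConnected();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch the caller decides what a timeout means; a worker factory would presumably turn the SocketTimeoutException into its own error and clean up the worker process.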

Why are the changes needed?

Otherwise, the create method could be stuck forever.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Unit test
  • Manual

Was this patch authored or co-authored using generative AI tooling?

Generated-by: ChatGPT 4.0
Asked ChatGPT to generate sample code to do non-blocking accept() on a socket channel in Java.

Contributor Author

rangadi commented Sep 21, 2023

cc: @HyukjinKwon, @WweiL

@@ -184,10 +185,20 @@ private[spark] class PythonWorkerFactory(
redirectStreamsToStderr(workerProcess.getInputStream, workerProcess.getErrorStream)

// Wait for it to connect to our socket, and validate the auth secret.
serverSocketChannel.socket().setSoTimeout(10000)
Contributor Author

There was supposed to be a 10-second timeout, but this call does not seem to affect serverSocketChannel.accept().
The setting might only take effect if we called serverSocketChannel.socket().accept(), but that returns a socket, not a channel.
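
The distinction can be illustrated with a small Java sketch. It assumes the standard java.nio behaviour: SO_TIMEOUT set through socket() is honoured by the adaptor's accept(), while ServerSocketChannel.accept() in blocking mode waits indefinitely. The class and method names (AdaptorTimeout, adaptorAcceptTimesOut) are hypothetical and not taken from the PR.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.nio.channels.ServerSocketChannel;

// Hypothetical demo class, not part of Spark.
class AdaptorTimeout {

    // Returns true if socket().accept() gave up because SO_TIMEOUT expired.
    static boolean adaptorAcceptTimesOut(int timeoutMillis) {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            // The timeout is set on the ServerSocket view of the channel.
            server.socket().setSoTimeout(timeoutMillis);
            // Calling server.accept() here instead would block until a client
            // connects, regardless of the timeout set above.
            server.socket().accept();
            return false; // a client connected before the timeout
        } catch (SocketTimeoutException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling AdaptorTimeout.adaptorAcceptTimesOut(200) with no client connecting should return true after roughly 200 ms, which matches the observation that the timeout only applies to the socket-level accept().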

Contributor

@allisonwang-db allisonwang-db left a comment

cc @ueshin

// return (accept() used to be blocking), the test doesn't hang for a long time.
val createFuture = Future {
val ex = intercept[SparkException] {
workerFactory.createSimpleWorker(blockingMode = true)
Contributor

May I know if this blockingMode=true would affect serverSocketChannel.configureBlocking(false) above?

Contributor Author

It doesn't. It applies to the client socket. Shall I remove it?

Contributor

WweiL commented Sep 21, 2023

It seems the tests are failing because of formatting. Maybe try:

./build/mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=false -Dscalafmt.validateOnly=false -Dscalafmt.changedOnly=false -pl connector/connect/common -pl connector/connect/server -pl connector/connect/client/jvm

@HyukjinKwon HyukjinKwon changed the title from "[SPARK-45245] PythonWorkerFactory: Timeout if worker does not connect back." to "[SPARK-45245][PYTHON][CONNECT] PythonWorkerFactory: Timeout if worker does not connect back." on Sep 22, 2023

Contributor Author

rangadi commented Oct 30, 2023

@HyukjinKwon the test failures seem unrelated. I tried multiple times, and different types of tests are failing. Do you think we can merge this?

@HyukjinKwon
Member

Yeah I am merging it to master but I think we should switch this to use the daemonized worker instead of simple workers soon.

Merged to master.

Contributor Author

rangadi commented Oct 30, 2023

> I think we should switch this to use the daemonized worker instead of simple workers soon.

I see. I have been looking for reasons to do so. This seems to be one of them.

import org.apache.spark.util.ThreadUtils

// Tests for PythonWorkerFactory.
class PythonWorkerFactorySuite extends SparkFunSuite with Matchers with SharedSparkContext {
Contributor

Does this suite have to extend Matchers? @HyukjinKwon can you double check?

Member

Thanks. Made a followup PR: #45459

dongjoon-hyun pushed a commit that referenced this pull request Mar 11, 2024
…it in the test

### What changes were proposed in this pull request?

This PR is a followup of #43023 that addresses a post-review comment.

### Why are the changes needed?

Extending Matchers is unnecessary. It also affects Scala compatibility, so it is better to remove it if unused.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45459 from HyukjinKwon/SPARK-45245-folllowup.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
jpcorreia99 pushed a commit to jpcorreia99/spark that referenced this pull request Mar 12, 2024
sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024