[SPARK-34726][SQL][2.4] Fix collectToPython timeouts #31818
Conversation
cc @HyukjinKwon
cc @attilapiros
Hmmm, this is exactly one reason why we've added this: #30389
Since I've seen a thread about 2.4.8 it would be good to solve this there, right @HyukjinKwon @dongjoon-hyun ?
Okay, the fix makes sense to me.
cc @viirya FYI
I would like to torture this a bit but I agree with the main intention too.
LGTM. Validated manually and it passed; additionally, I don't see high risk in it.
Jenkins test this please
The code looks good. But I think you can come up with a good unit test where a longer-running `onSuccess` method is registered with the listener (longer than the socket timeout). WDYT?
You mean measuring time as the assert criterion?
No. Just having a
If this turns out to be too complex I can let it go. But let's check it first.
Thanks, added in 6b18cc7
override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = {}

override def onSuccess(funcName: String, qe: QueryExecution, duration: Long): Unit = {
  // Longer than 15s in `PythonServer.setupOneConnectionServer`
Let's extract the 15000 into a `private[spark]` var as a member:
serverSocket.setSoTimeout(15000)
And add a `// visible for testing` comment above it.
Then we can speed up this single test.
authHelper.authToServer(socket)
Source.fromInputStream(socket.getInputStream)

spark.listenerManager.unregister(listener)
I would put this into a `finally` block to be on the safe side.
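The "unregister in a `finally`" pattern being suggested can be sketched with a toy listener manager in Python (these names are illustrative; the actual code uses Scala's `spark.listenerManager`):

```python
# Hedged sketch: registering a listener for the duration of an action and
# guaranteeing cleanup even when the action throws.
class ListenerManager:
    def __init__(self):
        self.listeners = []

    def register(self, listener):
        self.listeners.append(listener)

    def unregister(self, listener):
        self.listeners.remove(listener)

def run_with_listener(manager, listener, action):
    """Register the listener, run the action, and always unregister."""
    manager.register(listener)
    try:
        return action()
    finally:
        # Without the finally, a failing action would leak the listener
        # into every subsequent query in the test suite.
        manager.unregister(listener)
```

The design point is the same as in the review comment: the cleanup runs on both the success and failure paths.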
All right, added in 8f6b811
override def onSuccess(funcName: String, qe: QueryExecution, duration: Long): Unit = {
  // Longer than 15s in `PythonServer.setupOneConnectionServer`
  Thread.sleep(20 * 1000)
Just for my own understanding, does this mean the test waits 15 seconds to pass?
Yes, we need to wait a bit longer than the timeout. Without the fix in `Dataset.collectToPython` this UT fails.
Let's see jenkins...
Thank you for pinging me, @gaborgsomogyi . Yes, it's good timing to discuss this, and the decision is up to the release manager, @viirya .
BTW, we should be more careful because 2.4.8 is the last release.
Test build #136006 has finished for PR 31818 at commit
LGTM
Test build #136010 has finished for PR 31818 at commit
Hmm, I'm not sure about this. It sounds like if any people rely
I think it is very rare that a
Oh, I see. If only the socket server creation is not caught by the listener, it is still a behavior change but seems less risky. Okay, it makes sense. I'm not against this but would like to have more eyes on this. cc @cloud-fan @maropu
@viirya you see it well and you're right, it's a tradeoff which should be discussed in depth to come up with a balanced decision.
Con:
A couple of heavy users complained about this, but that said
Moreover
True, so a backport would be needed in such a case. I guess it wouldn't be brain surgery though.
@cloud-fan, @maropu could you please review this PR?
LGTM. I'll merge in a few days if there are no more comments.
I'm fine with this fix if no one raises objection.
Thanks all. Merging to 2.4.
### What changes were proposed in this pull request?

One of our customers frequently encounters `"serve-DataFrame" java.net.SocketTimeoutException: Accept timed` errors in PySpark because `DataSet.collectToPython()` in Spark 2.4 does the following:

1. Collects the results
2. Opens up a socket server that is then listening for the connection from the Python side
3. Runs the event listeners as part of `withAction` on the same thread, as SPARK-25680 is not available in Spark 2.4
4. Returns the address of the socket server to Python
5. The Python side connects to the socket server and fetches the data

As the customer has a custom, long-running event listener, the time between 2. and 5. is frequently longer than the default connection timeout, and increasing the connect timeout is not a good solution as we don't know how long the listeners can take.

### Why are the changes needed?

This PR simply moves the socket server creation (2.) after running the listeners (3.). This approach has a minor side effect that errors in socket server creation are not reported as `onFailure` events, but currently errors happening during opening the connection from the Python side or data transfer from JVM to Python are also not reported as events, so IMO this is not a big change.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a new UT + manual test.

Closes #31818 from peter-toth/SPARK-34726-fix-collectToPython-timeouts-2.4.

Lead-authored-by: Peter Toth <ptoth@cloudera.com>
Co-authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
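The ordering bug described in the PR can be reproduced outside Spark with plain sockets. The following is a minimal, self-contained Python sketch with toy timeouts instead of the real 15s; none of these names come from Spark. When the slow listener work happens between opening the server and the client connecting, `accept()` times out; doing the slow work before opening the server, which is what this PR does on the Scala side, avoids that:

```python
import socket
import threading
import time

def serve_once(accept_timeout, delay_before_client_connects):
    """Open a one-shot server, then have a client connect after a delay.

    Returns True if accept() succeeded, False if it timed out. The delay
    stands in for long-running event listeners executing between server
    creation and the Python side connecting."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    server.settimeout(accept_timeout)
    port = server.getsockname()[1]

    def client():
        time.sleep(delay_before_client_connects)  # "listener" work
        try:
            socket.create_connection(("127.0.0.1", port)).close()
        except OSError:
            pass  # the server may already have given up

    t = threading.Thread(target=client)
    t.start()
    try:
        conn, _ = server.accept()
        conn.close()
        return True
    except socket.timeout:
        return False
    finally:
        t.join()
        server.close()
```

Here `serve_once(0.2, 0.6)` models the pre-fix ordering (listeners run while the server is already counting down its accept timeout) and returns `False`, while `serve_once(2.0, 0.0)` models the post-fix ordering, where the slow work has already finished before the server opens, and returns `True`.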
Thanks all for the review.
BTW, the JIRA description is empty. I just copied the description here to the JIRA. |