-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-6955] Perform port retries at NettyBlockTransferService level #5575
Conversation
Test build #30526 has started for PR 5575 at commit |
Test build #30526 has finished for PR 5575 at commit
|
Test FAILed. |
cc @SaintBacchus @andrewor13 @vanzin This patch aims to resolve SPARK-6955 while keeping the fix for SPARK-5444. |
err, cc @andrewor14, not @andrewor13. |
Test build #30527 has started for PR 5575 at commit |
Test build #30527 timed out for PR 5575 at commit |
Test FAILed. |
Jenkins, retest this please. |
Test build #30548 has started for PR 5575 at commit |
Test build #30548 timed out for PR 5575 at commit |
Test FAILed. |
Jenkins, retest this please. |
Test build #30568 has started for PR 5575 at commit |
Test build #30568 timed out for PR 5575 at commit |
Test FAILed. |
Test build #30569 has started for PR 5575 at commit |
Test build #30569 has finished for PR 5575 at commit
|
Test PASSed. |
I seems to be a better way. Thanks for taking time to solve it. |
val port = 17634 | ||
service0 = createService(port) | ||
service0.port should be >= port | ||
service0.port should be <= (port + 10) // avoid testing equality in case of simultaneous tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems flaky. Isn't the previous check enough? Otherwise, instead of 10
I'd use 1024
which is the actual limit in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can increase the number no problem, though I guess we should use Utils.portMaxRetries
instead. However, note that even 10 should be very unlikely to be hit, that would mean there's 5-10 concurrently running tests at this exact part of the build, or else we're very, very unlucky.
LGTM aside from the potentially flaky test. |
@@ -43,6 +43,9 @@ class ExternalShuffleServiceSuite extends ShuffleSuite with BeforeAndAfterAll { | |||
conf.set("spark.shuffle.manager", "sort") | |||
conf.set("spark.shuffle.service.enabled", "true") | |||
conf.set("spark.shuffle.service.port", server.getPort.toString) | |||
|
|||
// local-cluster mode starts a Worker which would start its own shuffle service without this: | |||
conf.set("spark.worker.shouldHostShuffleServiceIfEnabled", "false") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually ever want to start an external shuffle service in a local cluster? If not I think it makes more sense to just set spark.shuffle.service.enabled
to false in LocalSparkCluster
(we already do this for the REST submission server for Master)
Test build #31002 has started for PR 5575 at commit |
Merged build triggered. |
Test build #31568 timed out for PR 5575 at commit |
Merged build finished. Test FAILed. |
Test FAILed. |
Merged build triggered. |
Merged build started. |
Test build #31584 has started for PR 5575 at commit |
Test build #31584 has finished for PR 5575 at commit
|
Merged build finished. Test PASSed. |
Test PASSed. |
Currently we're doing port retries in the TransportServer level, but this is not specified by the TransportContext API and it has other further-reaching impacts like causing undesirable behvior for the Yarn and Standalone shuffle services.
cc @andrewor14 @pwendell We should consider merging this into the 1.4 branch in case there is another RC. It has been an outstanding issue for a while. |
Merged build triggered. |
Merged build started. |
Test build #32254 has started for PR 5575 at commit |
Test build #32254 has finished for PR 5575 at commit
|
Merged build finished. Test PASSed. |
Test PASSed. |
Nice, I merge. |
master 1.4 |
Currently we're doing port retries in the TransportServer level, but this is not specified by the TransportContext API and it has other further-reaching impacts like causing undesirable behavior for the Yarn and Standalone shuffle services. Author: Aaron Davidson <aaron@databricks.com> Closes #5575 from aarondav/port-bind and squashes the following commits: 3c2d6ed [Aaron Davidson] Oops, never do it. a5d9432 [Aaron Davidson] Remove shouldHostShuffleServiceIfEnabled e901eb2 [Aaron Davidson] fix local-cluster mode for ExternalShuffleServiceSuite 59e5e38 [Aaron Davidson] [SPARK-6955] Perform port retries at NettyBlockTransferService level (cherry picked from commit ffdc40c) Signed-off-by: Andrew Or <andrew@databricks.com>
Currently we're doing port retries in the TransportServer level, but this is not specified by the TransportContext API and it has other further-reaching impacts like causing undesirable behavior for the Yarn and Standalone shuffle services. Author: Aaron Davidson <aaron@databricks.com> Closes apache#5575 from aarondav/port-bind and squashes the following commits: 3c2d6ed [Aaron Davidson] Oops, never do it. a5d9432 [Aaron Davidson] Remove shouldHostShuffleServiceIfEnabled e901eb2 [Aaron Davidson] fix local-cluster mode for ExternalShuffleServiceSuite 59e5e38 [Aaron Davidson] [SPARK-6955] Perform port retries at NettyBlockTransferService level
Currently we're doing port retries in the TransportServer level, but this is not specified by the TransportContext API and it has other further-reaching impacts like causing undesirable behavior for the Yarn and Standalone shuffle services. Author: Aaron Davidson <aaron@databricks.com> Closes apache#5575 from aarondav/port-bind and squashes the following commits: 3c2d6ed [Aaron Davidson] Oops, never do it. a5d9432 [Aaron Davidson] Remove shouldHostShuffleServiceIfEnabled e901eb2 [Aaron Davidson] fix local-cluster mode for ExternalShuffleServiceSuite 59e5e38 [Aaron Davidson] [SPARK-6955] Perform port retries at NettyBlockTransferService level
Currently we're doing port retries in the TransportServer level, but this is not specified by the TransportContext API and it has other further-reaching impacts like causing undesirable behavior for the Yarn and Standalone shuffle services. Author: Aaron Davidson <aaron@databricks.com> Closes apache#5575 from aarondav/port-bind and squashes the following commits: 3c2d6ed [Aaron Davidson] Oops, never do it. a5d9432 [Aaron Davidson] Remove shouldHostShuffleServiceIfEnabled e901eb2 [Aaron Davidson] fix local-cluster mode for ExternalShuffleServiceSuite 59e5e38 [Aaron Davidson] [SPARK-6955] Perform port retries at NettyBlockTransferService level
Currently we're doing port retries in the TransportServer level, but this is not specified by the TransportContext API and it has other further-reaching impacts like causing undesirable behavior for the Yarn and Standalone shuffle services.