
[SPARK-6955][NETWORK]Do not let Yarn Shuffle Server retry its server port. #5537

Closed
wants to merge 2 commits

Conversation

SaintBacchus
Contributor

It's better to let the NodeManager go down than to retry on another port when spark.shuffle.service.port is already taken while starting the Spark Yarn Shuffle Server: the retry mechanism makes the actual shuffle port inconsistent with the configured one, so clients fail to find the service.
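The fix being discussed amounts to forcing a zero retry count for the shuffle server's port. A minimal sketch of that idea in Scala, assuming the retry count is driven by the spark.port.maxRetries property; the helper below is hypothetical, not code from this patch:

import org.apache.spark.SparkConf

// Hypothetical helper: disable port retries on a copy of the configuration so
// the shuffle server either binds to the configured port or fails fast, while
// the caller's SparkConf stays untouched.
def shuffleServerConf(conf: SparkConf): SparkConf = {
  val serverConf = conf.clone
  // with zero retries a bind conflict surfaces immediately instead of silently
  // moving the service to a neighbouring port that clients don't know about
  serverConf.set("spark.port.maxRetries", "0")
  serverConf
}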

@SaintBacchus SaintBacchus changed the title Do not let Yarn Shuffle Server retry its server port. [SPARK-6955][NETWORK]Do not let Yarn Shuffle Server retry its server port. Apr 16, 2015
@SparkQA

SparkQA commented Apr 16, 2015

Test build #30391 has finished for PR 5537 at commit 7126547.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@srowen
Member

srowen commented Apr 16, 2015

CC @andrewor14 That makes some sense. I hope there aren't side effects to setting this globally in this process.

@SaintBacchus
Contributor Author

@srowen This code is only used in YarnShuffleService, so no Spark application can see this property.

@andrewor14
Contributor

@SaintBacchus It's fine to set it to 0, but maybe it makes more sense to just remove the logic that retries the port in the first place, since the shuffle service is the only place that binds to ports in the whole subproject. More specifically I'm referring to TransportServer#bindRightPort and TransportConf#portMaxRetries. I believe this is also a problem for standalone mode, so a general fix seems more correct.

@vanzin
Contributor

vanzin commented Apr 16, 2015

Given that the network-common library (where the retry logic is) is supposed to be generic, and not just for shuffle, I'd rather keep the retry logic there, so this change looks better to me. I would just avoid setting the value in the passed configuration, preferring to clone it instead (or some other way that does not modify the input conf).

Perhaps a similar change in the non-YARN service (does that exist?) would be needed too.

That being said, the javadoc for TransportServer.bindRightPort seems wrong; it says it will try to bind to the given port multiple times, but it actually tries different ports.
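To make that javadoc point concrete, the behaviour in question is a loop that walks to a different port on each failed bind, roughly like the sketch below (bindWithRetries and bindTo are illustrative names, not the actual TransportServer API):

import java.net.BindException

// Illustrative port-retry loop: each failed bind moves to the *next* port,
// which is why a fixed, agreed-upon port can silently drift.
def bindWithRetries(startPort: Int, maxRetries: Int)(bindTo: Int => Unit): Int = {
  var attempt = 0
  while (true) {
    val candidate = startPort + attempt
    try {
      bindTo(candidate)
      return candidate                      // bound successfully on this port
    } catch {
      case e: BindException =>
        if (attempt >= maxRetries) throw e  // out of retries: give up
        attempt += 1                        // otherwise try the next port
    }
  }
  startPort // unreachable; satisfies the declared return type
}

With maxRetries = 0 the first BindException propagates immediately, which is the no-retry behaviour this PR wants for the shuffle service.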

@SaintBacchus
Contributor Author

@andrewor14 TransportServer#bindRightPort is used by the Netty network layer, and in that case a retry mechanism is the better approach.
@vanzin I have cloned the configuration, fixed the javadoc, and also disabled the retry in StandaloneWorkerShuffleService. Please take a look.

@SparkQA

SparkQA commented Apr 17, 2015

Test build #30473 has finished for PR 5537 at commit 962770c.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@@ -43,8 +43,12 @@ object SparkTransportConf {
* @param numUsableCores if nonzero, this will restrict the server and client threads to only
* use the given number of cores, rather than all of the machine's cores.
* This restriction will only occur if these properties are not already set.
* @param disablePortRetry if true, server will not retry its port. It's better for the long-run
* server to disable it since the server and client had the agreement of
Contributor

super nit: alignment is off by 1

@aarondav
Contributor

The changes as they currently stand LGTM, let's just fix up the style as @vanzin pointed out.

@andrewor14
Contributor

I see, it makes sense to remove it for services set up independently of the Spark application, but not for Netty because the executors will still be able to find the port. This workaround seems appropriate then.

@@ -43,8 +43,12 @@ object SparkTransportConf {
* @param numUsableCores if nonzero, this will restrict the server and client threads to only
* use the given number of cores, rather than all of the machine's cores.
* This restriction will only occur if these properties are not already set.
* @param disablePortRetry if true, server will not retry its port. It's better for the long-run
* server to disable it since the server and client had the agreement of
* the specific port.
Contributor

if true, the server will not retry to bind to a different port than the one configured.
This is better for long-running servers since they may have agreed upon specific
ports with the clients.

Contributor

actually, since the only place we ever set this to true is in StandaloneWorkerShuffleService, does it make sense to just move the conf setting there along with this comment? (similar to what you did in YARN)
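A rough sketch of that suggestion, assuming the Spark 1.x signature SparkTransportConf.fromSparkConf(conf, numUsableCores); the method name and comments are illustrative, not the merged code:

import org.apache.spark.SparkConf
import org.apache.spark.network.netty.SparkTransportConf

// Inside StandaloneWorkerShuffleService the no-retry setting could sit next to
// the transport conf it affects instead of being threaded through as a flag.
def buildShuffleServiceTransportConf(sparkConf: SparkConf) = {
  val confCopy = sparkConf.clone
  // the shuffle service and its clients agree on a fixed port up front,
  // so the server must never fall back to a different one
  confCopy.set("spark.port.maxRetries", "0")
  SparkTransportConf.fromSparkConf(confCopy, numUsableCores = 0)
}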

@aarondav
Contributor

Ah, I realize now that the original change that brought this bug up (#4240) was perhaps not the right solution. The issue that #4240 solved was actually brought up beforehand (#3688 (comment)). I think my recommended fix then would still be good, as it would not affect the Yarn or Standalone Worker shuffle services.

I created a patch which aims to resolve both #4240 and this PR's issues: #5575.

@SaintBacchus SaintBacchus deleted the YarnShufflePortRetry branch December 26, 2015 06:40