-
Notifications
You must be signed in to change notification settings - Fork 28.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-21494][network] Use correct app id when authenticating to external service. #18706
Conversation
…rnal service. There was some code based on the old SASL handler in the new auth client that was incorrectly using the SASL user as the user to authenticate against the external shuffle service. This caused the external service to not be able to find the correct secret to authenticate the connection, failing the connection. In the course of debugging, I found that some log messages from the YARN shuffle service were a little noisy, so I silenced some of them, and also added a couple of new ones that helped find this issue. On top of that, I found that a check in the code that recording app secrets was wrong, causing more log spam and also using an O(n) operation instead of an O(1) call. Also added a new integration suite for the YARN shuffle service with auth on, and verified it failed before, and passes now.
Test build #79846 has finished for PR 18706 at commit
|
@@ -243,7 +243,6 @@ public void initializeApplication(ApplicationInitializationContext context) { | |||
String appId = context.getApplicationId().toString(); | |||
try { | |||
ByteBuffer shuffleSecret = context.getApplicationDataForService(); | |||
logger.info("Initializing application {}", appId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is to verbose personally. We do log when initializeContainer so I guess its ok to remove this. But if you remove this I think we should remove the one in stopApplication as well since they go hand in hand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rebuilt my cluster this morning so I don't have the logs anymore... but what I noticed is that this callback gets called multiple times (once per container belonging to the app), while I only remember the stop callback being called when the application actually stops. So it didn't seem like the behavior was as symmetric as one would expect.
But the secret manager also prints an info message on unregistration, so I guess the important things are still logged even if these two messages are removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, here's an edited portion of the log (without this patch, so notice the duplicate registration messages too):
2017-07-24 14:38:11,839 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1500931912895_0001_01_000002
2017-07-24 14:38:11,839 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing application application_1500931912895_0001
2017-07-24 14:38:11,864 INFO org.apache.spark.network.sasl.ShuffleSecretManager: Registered shuffle secret for application application_1500931912895_0001
2017-07-24 14:38:13,506 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1500931912895_0001_01_000008
2017-07-24 14:38:13,506 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing application application_1500931912895_0001
2017-07-24 14:38:13,507 INFO org.apache.spark.network.sasl.ShuffleSecretManager: Registered shuffle secret for application application_1500931912895_0001
2017-07-24 14:38:13,542 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1500931912895_0001_01_000007
2017-07-24 14:38:13,542 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing application application_1500931912895_0001
2017-07-24 14:38:13,542 INFO org.apache.spark.network.sasl.ShuffleSecretManager: Registered shuffle secret for application application_1500931912895_0001
2017-07-24 14:38:14,393 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1500931912895_0001_01_000010
2017-07-24 14:38:14,394 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing application application_1500931912895_0001
2017-07-24 14:38:14,394 INFO org.apache.spark.network.sasl.ShuffleSecretManager: Registered shuffle secret for application application_1500931912895_0001
2017-07-24 14:38:14,394 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1500931912895_0001_01_000011
2017-07-24 14:38:14,394 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing application application_1500931912895_0001
2017-07-24 14:38:14,395 INFO org.apache.spark.network.sasl.ShuffleSecretManager: Registered shuffle secret for application application_1500931912895_0001
2017-07-24 14:39:13,454 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1500931912895_0001_01_000007
2017-07-24 14:39:13,512 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1500931912895_0001_01_000011
2017-07-24 14:39:13,614 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1500931912895_0001_01_000008
2017-07-24 14:39:13,614 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1500931912895_0001_01_000010
2017-07-24 14:39:13,616 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1500931912895_0001_01_000002
2017-07-24 14:39:13,627 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping application application_1500931912895_0001
@@ -67,7 +67,7 @@ public void registerApp(String appId, ByteBuffer shuffleSecret) { | |||
* This is called when the application terminates. | |||
*/ | |||
public void unregisterApp(String appId) { | |||
if (shuffleSecretMap.contains(appId)) { | |||
if (shuffleSecretMap.containsKey(appId)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hah, this is actually fixing a memory leak...
+1 pending jenkins. |
Test build #79917 has finished for PR 18706 at commit
|
Merging to master / 2.2. |
…rnal service. There was some code based on the old SASL handler in the new auth client that was incorrectly using the SASL user as the user to authenticate against the external shuffle service. This caused the external service to not be able to find the correct secret to authenticate the connection, failing the connection. In the course of debugging, I found that some log messages from the YARN shuffle service were a little noisy, so I silenced some of them, and also added a couple of new ones that helped find this issue. On top of that, I found that a check in the code that records app secrets was wrong, causing more log spam and also using an O(n) operation instead of an O(1) call. Also added a new integration suite for the YARN shuffle service with auth on, and verified it failed before, and passes now. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18706 from vanzin/SPARK-21494. (cherry picked from commit 300807c) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
…rnal service. There was some code based on the old SASL handler in the new auth client that was incorrectly using the SASL user as the user to authenticate against the external shuffle service. This caused the external service to not be able to find the correct secret to authenticate the connection, failing the connection. In the course of debugging, I found that some log messages from the YARN shuffle service were a little noisy, so I silenced some of them, and also added a couple of new ones that helped find this issue. On top of that, I found that a check in the code that records app secrets was wrong, causing more log spam and also using an O(n) operation instead of an O(1) call. Also added a new integration suite for the YARN shuffle service with auth on, and verified it failed before, and passes now. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#18706 from vanzin/SPARK-21494. (cherry picked from commit 300807c) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
There was some code based on the old SASL handler in the new auth client that
was incorrectly using the SASL user as the user to authenticate against the
external shuffle service. This caused the external service to not be able to
find the correct secret to authenticate the connection, failing the connection.
In the course of debugging, I found that some log messages from the YARN shuffle
service were a little noisy, so I silenced some of them, and also added a couple
of new ones that helped find this issue. On top of that, I found that a check
in the code that records app secrets was wrong, causing more log spam and also
using an O(n) operation instead of an O(1) call.
Also added a new integration suite for the YARN shuffle service with auth on,
and verified it failed before, and passes now.