Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-21494][network] Use correct app id when authenticating to external service. #18706

Closed
wants to merge 2 commits into from

Conversation

vanzin
Copy link
Contributor

@vanzin vanzin commented Jul 21, 2017

There was some code based on the old SASL handler in the new auth client that
was incorrectly using the SASL user as the user to authenticate against the
external shuffle service. This caused the external service to not be able to
find the correct secret to authenticate the connection, failing the connection.

In the course of debugging, I found that some log messages from the YARN shuffle
service were a little noisy, so I silenced some of them, and also added a couple
of new ones that helped find this issue. On top of that, I found that a check
in the code that records app secrets was wrong, causing more log spam and also
using an O(n) operation instead of an O(1) call.

Also added a new integration suite for the YARN shuffle service with auth on,
and verified it failed before, and passes now.

…rnal service.

There was some code based on the old SASL handler in the new auth client that
was incorrectly using the SASL user as the user to authenticate against the
external shuffle service. This caused the external service to not be able to
find the correct secret to authenticate the connection, failing the connection.

In the course of debugging, I found that some log messages from the YARN shuffle
service were a little noisy, so I silenced some of them, and also added a couple
of new ones that helped find this issue. On top of that, I found that a check
in the code that recording app secrets was wrong, causing more log spam and also
using an O(n) operation instead of an O(1) call.

Also added a new integration suite for the YARN shuffle service with auth on,
and verified it failed before, and passes now.
@SparkQA
Copy link

SparkQA commented Jul 21, 2017

Test build #79846 has finished for PR 18706 at commit 4e6cc53.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor Author

vanzin commented Jul 24, 2017

@zsxwing @tgravescs

@@ -243,7 +243,6 @@ public void initializeApplication(ApplicationInitializationContext context) {
String appId = context.getApplicationId().toString();
try {
ByteBuffer shuffleSecret = context.getApplicationDataForService();
logger.info("Initializing application {}", appId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is to verbose personally. We do log when initializeContainer so I guess its ok to remove this. But if you remove this I think we should remove the one in stopApplication as well since they go hand in hand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rebuilt my cluster this morning so I don't have the logs anymore... but what I noticed is that this callback gets called multiple times (once per container belonging to the app), while I only remember the stop callback being called when the application actually stops. So it didn't seem like the behavior was as symmetric as one would expect.

But the secret manager also prints an info message on unregistration, so I guess the important things are still logged even if these two messages are removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, here's an edited portion of the log (without this patch, so notice the duplicate registration messages too):

2017-07-24 14:38:11,839 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1500931912895_0001_01_000002
2017-07-24 14:38:11,839 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing application application_1500931912895_0001
2017-07-24 14:38:11,864 INFO org.apache.spark.network.sasl.ShuffleSecretManager: Registered shuffle secret for application application_1500931912895_0001
2017-07-24 14:38:13,506 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1500931912895_0001_01_000008
2017-07-24 14:38:13,506 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing application application_1500931912895_0001
2017-07-24 14:38:13,507 INFO org.apache.spark.network.sasl.ShuffleSecretManager: Registered shuffle secret for application application_1500931912895_0001
2017-07-24 14:38:13,542 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1500931912895_0001_01_000007
2017-07-24 14:38:13,542 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing application application_1500931912895_0001
2017-07-24 14:38:13,542 INFO org.apache.spark.network.sasl.ShuffleSecretManager: Registered shuffle secret for application application_1500931912895_0001
2017-07-24 14:38:14,393 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1500931912895_0001_01_000010
2017-07-24 14:38:14,394 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing application application_1500931912895_0001
2017-07-24 14:38:14,394 INFO org.apache.spark.network.sasl.ShuffleSecretManager: Registered shuffle secret for application application_1500931912895_0001
2017-07-24 14:38:14,394 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1500931912895_0001_01_000011
2017-07-24 14:38:14,394 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing application application_1500931912895_0001
2017-07-24 14:38:14,395 INFO org.apache.spark.network.sasl.ShuffleSecretManager: Registered shuffle secret for application application_1500931912895_0001



2017-07-24 14:39:13,454 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1500931912895_0001_01_000007
2017-07-24 14:39:13,512 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1500931912895_0001_01_000011
2017-07-24 14:39:13,614 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1500931912895_0001_01_000008
2017-07-24 14:39:13,614 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1500931912895_0001_01_000010
2017-07-24 14:39:13,616 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1500931912895_0001_01_000002
2017-07-24 14:39:13,627 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping application application_1500931912895_0001

@@ -67,7 +67,7 @@ public void registerApp(String appId, ByteBuffer shuffleSecret) {
* This is called when the application terminates.
*/
public void unregisterApp(String appId) {
if (shuffleSecretMap.contains(appId)) {
if (shuffleSecretMap.containsKey(appId)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah, this is actually fixing a memory leak...

@tgravescs
Copy link
Contributor

+1 pending jenkins.

@SparkQA
Copy link

SparkQA commented Jul 25, 2017

Test build #79917 has finished for PR 18706 at commit 8b3c91d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor Author

vanzin commented Jul 26, 2017

Merging to master / 2.2.

asfgit pushed a commit that referenced this pull request Jul 26, 2017
…rnal service.

There was some code based on the old SASL handler in the new auth client that
was incorrectly using the SASL user as the user to authenticate against the
external shuffle service. This caused the external service to not be able to
find the correct secret to authenticate the connection, failing the connection.

In the course of debugging, I found that some log messages from the YARN shuffle
service were a little noisy, so I silenced some of them, and also added a couple
of new ones that helped find this issue. On top of that, I found that a check
in the code that records app secrets was wrong, causing more log spam and also
using an O(n) operation instead of an O(1) call.

Also added a new integration suite for the YARN shuffle service with auth on,
and verified it failed before, and passes now.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #18706 from vanzin/SPARK-21494.

(cherry picked from commit 300807c)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
@asfgit asfgit closed this in 300807c Jul 26, 2017
@vanzin vanzin deleted the SPARK-21494 branch August 7, 2017 20:09
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…rnal service.

There was some code based on the old SASL handler in the new auth client that
was incorrectly using the SASL user as the user to authenticate against the
external shuffle service. This caused the external service to not be able to
find the correct secret to authenticate the connection, failing the connection.

In the course of debugging, I found that some log messages from the YARN shuffle
service were a little noisy, so I silenced some of them, and also added a couple
of new ones that helped find this issue. On top of that, I found that a check
in the code that records app secrets was wrong, causing more log spam and also
using an O(n) operation instead of an O(1) call.

Also added a new integration suite for the YARN shuffle service with auth on,
and verified it failed before, and passes now.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#18706 from vanzin/SPARK-21494.

(cherry picked from commit 300807c)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants