[SPARK-19139][core] New auth mechanism for transport library. #16521

Closed
wants to merge 7 commits into apache:master from vanzin:SPARK-19139

Conversation

vanzin (Contributor) commented Jan 9, 2017

This change introduces a new auth mechanism to the transport library,
to be used when users enable strong encryption. This auth mechanism
has better security than the currently used DIGEST-MD5.

The new protocol uses symmetric key encryption to mutually authenticate
the endpoints, and is very loosely based on ISO/IEC 9798.

The new protocol falls back to SASL when it thinks the remote end is old.
SASL does not support asking the server for multiple auth protocols, which
would have allowed re-using the existing SASL code by just adding a new SASL
provider; so the new protocol is implemented outside of the SASL API, avoiding
the boilerplate of adding a new provider.

Details of the auth protocol are discussed in the included README.md
file.

This change partly undoes the changes added in SPARK-13331; AES encryption
is now decoupled from SASL authentication. The encryption code itself,
though, has been re-used as part of this change.
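
To make the idea concrete, here is a minimal, hypothetical sketch of a symmetric-key challenge/response exchange of this general shape, using only standard javax.crypto APIs. It is not the wire protocol or the classes added by this PR (see the README.md for the real message formats); the key material, IV handling, and hash below are stand-ins chosen for brevity.

```java
// Hypothetical sketch of mutual authentication with a symmetric key and a
// challenge/response exchange. Key material, IV handling and the hash below are
// simplifications for illustration; they are not the format used by this PR.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class ChallengeResponseSketch {

  public static void main(String[] args) throws Exception {
    SecureRandom rng = new SecureRandom();

    // Both endpoints hold the same key. In the real protocol it is derived from the
    // application's shared secret and per-session nonces; here it is hard-coded.
    SecretKeySpec key = new SecretKeySpec(
        Arrays.copyOf("illustrative-shared-secret".getBytes(StandardCharsets.UTF_8), 16), "AES");

    // Client -> server: ENC(challenge). Only a holder of the key can decrypt it.
    byte[] challenge = new byte[16];
    rng.nextBytes(challenge);
    byte[] clientIv = newIv(rng);
    byte[] clientMsg = crypt(Cipher.ENCRYPT_MODE, key, clientIv, challenge);

    // Server: decrypt the challenge and reply with ENC(H(challenge)), proving it also
    // knows the key. The client performs the symmetric check for mutual authentication.
    byte[] recovered = crypt(Cipher.DECRYPT_MODE, key, clientIv, clientMsg);
    byte[] serverIv = newIv(rng);
    byte[] serverMsg = crypt(Cipher.ENCRYPT_MODE, key, serverIv,
        MessageDigest.getInstance("SHA-256").digest(recovered));

    // Client: check the response against its own expectation of H(challenge).
    byte[] expected = MessageDigest.getInstance("SHA-256").digest(challenge);
    boolean serverAuthenticated =
        Arrays.equals(expected, crypt(Cipher.DECRYPT_MODE, key, serverIv, serverMsg));
    System.out.println("server authenticated: " + serverAuthenticated);
  }

  private static byte[] newIv(SecureRandom rng) {
    byte[] iv = new byte[16];
    rng.nextBytes(iv);
    return iv;
  }

  private static byte[] crypt(int mode, SecretKeySpec key, byte[] iv, byte[] data)
      throws Exception {
    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(mode, key, new IvParameterSpec(iv));
    return cipher.doFinal(data);
  }
}
```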

How was this patch tested?

  • Unit tests
  • Tested Spark 2.2 against Spark 1.6 shuffle service with SASL enabled
  • Tested Spark 2.2 against Spark 2.2 shuffle service with SASL fallback disabled

SparkQA commented Jan 10, 2017

Test build #71095 has finished for PR 16521 at commit e219c8e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

vanzin (Contributor, Author) commented Jan 10, 2017

Hmm, my final cleanup broke some tests, let me fix those...

SparkQA commented Jan 10, 2017

Test build #71104 has finished for PR 16521 at commit a2b3ff6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

vanzin (Contributor, Author) commented Jan 10, 2017

/cc @zsxwing @squito @andrewor14 (in case he's still looking at things)

yhuai (Contributor) commented Jan 10, 2017

@vanzin I have not reviewed this PR yet. I just have two high-level questions. Is there any change to existing behaviors and settings (compared with Spark 2.1)? Also, does our doc have enough content to explain how to set those confs and how they work? Thanks!

vanzin (Contributor, Author) commented Jan 10, 2017

Is there any change to existing behaviors and settings (compared with Spark 2.1)?

No. I added some new config names that replace old ones (to have more generic names), but the old names still work.

Also, does our doc have enough contents to explain how to set those confs and how those work?

I think so. I added docs for the new configs; the important old ones were already documented.
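
For illustration only, the properties discussed in this thread might be combined in spark-defaults.conf along these lines (the values are an example, not recommended defaults):

```
# Shared-secret authentication; required before any encryption setting takes effect.
spark.authenticate                       true
# New, more generic name: enables the new AES-based encryption for RPC and the
# block transfer service.
spark.network.crypto.enabled             true
# Older SASL-based encryption flag; still honored, e.g. when talking to an old
# shuffle service.
spark.authenticate.enableSaslEncryption  false
```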

SparkQA commented Jan 11, 2017

Test build #71167 has finished for PR 16521 at commit 3894a02.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

squito (Contributor) left a comment

I'm not able to sign off on this completely, but this looks great to me. The README.md is really good at explaining what is going on here.

I just have really minor comments.

- ANONCE: the nonce used as the salt when generating the auth key.
- ENC(): an encryption function that uses the cipher and the generated key. This function
will also be used in the definition of other messages below.
- CCHALLENGE: a byte sequence used as a challenge to the server.
Contributor

typo: CHALLENGE


Where:

- CRESPONSE: the server's response to the client challenge.
Contributor

typo: RESPONSE

The default KDF is "PBKDF2WithHmacSHA1". Users should be able to select any algorithm
from those supported by the `javax.crypto.SecretKeyFactory` class, as long as they support
PBEKeySpec when generating keys. The default number of iterations is calculated to take a
resonable amount of time on modern CPUs. See the documentation in TransportConf for more
Contributor

reasonable
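
For reference, a minimal sketch of the key derivation described in the quoted README text, using the JDK's `SecretKeyFactory` with a `PBEKeySpec`; the salt, iteration count, and key length below are illustrative, not Spark's defaults.

```java
// Hypothetical sketch: derive key bytes with PBKDF2WithHmacSHA1 and wrap them as an
// AES key. Parameter values are illustrative only.
import java.security.SecureRandom;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;

public class KdfSketch {
  public static void main(String[] args) throws Exception {
    char[] sharedSecret = "app-shared-secret".toCharArray();
    byte[] salt = new byte[16];                 // in the protocol this would be a nonce
    new SecureRandom().nextBytes(salt);

    SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1");
    PBEKeySpec spec = new PBEKeySpec(sharedSecret, salt, 10000, 128);
    byte[] keyBytes = factory.generateSecret(spec).getEncoded();

    // Wrap the derived bytes as an AES key usable with javax.crypto.Cipher.
    SecretKeySpec authKey = new SecretKeySpec(keyBytes, "AES");
    System.out.println("derived key length (bits): " + authKey.getEncoded().length * 8);
  }
}
```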

enabled. This is supported by the block transfer service and the
RPC endpoints.
Enable encryption using the commons-crypto library for RPC and block transfer service.
Requires <code>spark.authenticate</code> to be enabled.
Contributor

If spark.authenticate=false, what happens if this is true? It looks like it is just ignored; I think failing fast would be ideal.

Contributor Author

I added some checks for validating that in SparkConf; right now the config keys are scattered all over the code, so I'll file a separate bug for cleaning those up.

SparkQA commented Jan 14, 2017

Test build #71347 has finished for PR 16521 at commit ee9d232.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

vanzin (Contributor, Author) commented Jan 14, 2017

retest this please

SparkQA commented Jan 14, 2017

Test build #71355 has finished for PR 16521 at commit ee9d232.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 14, 2017

Test build #71356 has finished for PR 16521 at commit ee9d232.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

vanzin (Contributor, Author) commented Jan 17, 2017

Will leave this open for a couple more days, but would appreciate more eyes.

jsoltren left a comment

This seems reasonable to me.

- Encrypting AES session keys with 3DES doesn't solve the issue, since the weakest link
in the negotiation would still be MD5 and 3DES.

The protocol assumes that the shared secret is generated and distributed in a secure manner.


It might be helpful to mention the current shared secret generation and distribution mechanisms, to drive home the point that these are, hopefully, stronger than DIGEST-MD5 or possibly even AES.

Contributor Author

That is discussed in SecurityManager.scala.

zsxwing (Member) commented Jan 20, 2017

Sorry for the delay. Looking at it now.

zsxwing (Member) left a comment

Need to leave now. Left some nits. The core changes look good to me. Nice PR. Will finish my review at the weekend.


val encryptionEnabled = get(NETWORK_ENCRYPTION_ENABLED) || get(SASL_ENCRYPTION_ENABLED)
require(!encryptionEnabled || get(NETWORK_AUTH_ENABLED),
s"${NETWORK_AUTH_ENABLED.key} must be enabled when enabling encryption.")
Member

It's unclear in the doc what will be used when both spark.network.crypto.enabled and spark.authenticate.enableSaslEncryption are true. It would be better to just disable this case.

Contributor Author

Disable what case?

You need to be able to configure them separately, and if for some weird reason you want RPC encryption but don't want shuffle encryption when talking to an old shuffle service, these settings allow that.

Member

Makes sense.

// OK to switch back to SASL (because the server doesn't speak the new protocol). So
// try it anyway, and in the worst case things will fail again.
if (conf.saslFallback()) {
LOG.debug("New auth protocol failed, trying SASL.", e);
Member

nit: sometimes it's just because the server config is wrong, and a warning is better to help the user find that out.

LOG.debug("Received new auth challenge for client {}.", channel.remoteAddress());
} catch (RuntimeException e) {
if (conf.saslFallback()) {
LOG.debug("Failed to parse new auth challenge, reverting to SASL for client {}.",
Member

nit: debug -> warn

ByteBuf buf = Unpooled.wrappedBuffer(buffer);

if (buf.readByte() != TAG_BYTE) {
throw new IllegalArgumentException("Expected ServerChallenge, received something else.");
Member

nit: ServerChallenge -> ServerResponse


public final String appId;
public final String kdf;
public int iterations;
Member

nit: missing final

malicious "proxy" between endpoints, the attacker won't be able to read any of the data exchanged
between client and server, nor insert arbitrary commands for the server to execute.

* Replay attacks: the use of nonces when generating keys prevents an attacker from being able to
Member

The server doesn't verify whether a nonce was used or not, so it doesn't prevent replay attacks. Right?

Contributor Author

This is explained in the paragraph after the bullet list. The server always generates new nonces for sessions, so replaying the challenge will not allow an attacker to establish a session.

Member

Yeah. I didn't read the code correctly.

public int maxSaslEncryptedBlockSize() {
return Ints.checkedCast(JavaUtils.byteStringAsBytes(
conf.get("spark.network.sasl.maxEncryptedBlockSize", "64k")));
public boolean aesEncryptionEnabled() {
Member

nit: rename this method to a more general name?


TransportClient client;
TransportServer server;
Channel serverChannel;
Member

nit: volatile

TransportClient client;
TransportServer server;
Channel serverChannel;
AuthRpcHandler authRpcHandler;
Member

nit: volatile


ByteBuffer reply = ctx.client.sendRpcSync(JavaUtils.stringToBytes("Ping"), 5000);
assertEquals("Pong", JavaUtils.bytesToString(reply));
assertTrue(ctx.authRpcHandler.doDelegate);
Member

nit: please also check the delegate type to ensure it doesn't use SASL

zsxwing (Member) commented Jan 23, 2017

Made one pass. Looks good overall. Just some nits.

SparkQA commented Jan 23, 2017

Test build #71869 has finished for PR 16521 at commit 718247e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

vanzin (Contributor, Author) commented Jan 23, 2017

Weird error, the code is there... retest this please

SparkQA commented Jan 23, 2017

Test build #71870 has finished for PR 16521 at commit 718247e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

vanzin (Contributor, Author) commented Jan 23, 2017

need to rebase to current master...

zsxwing (Member) commented Jan 23, 2017

LGTM pending tests

SparkQA commented Jan 23, 2017

Test build #71871 has finished for PR 16521 at commit 39df4b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

zsxwing (Member) commented Jan 24, 2017

Thanks. Merging to master.

asfgit closed this in 8f3f73a Jan 24, 2017
vanzin deleted the SPARK-19139 branch January 27, 2017 01:16
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017