Skip to content

[CELEBORN-1351] Introduce SSLFactory and enable TLS support#2438

Closed
mridulm wants to merge 7 commits intoapache:mainfrom
mridulm:add-sslfactory-and-related-changes
Closed

[CELEBORN-1351] Introduce SSLFactory and enable TLS support#2438
mridulm wants to merge 7 commits intoapache:mainfrom
mridulm:add-sslfactory-and-related-changes

Conversation

@mridulm
Copy link
Contributor

@mridulm mridulm commented Apr 2, 2024

What changes were proposed in this pull request?

Add SSLFactory, and wire up TLS support with rest of Celeborn to enable secure over the wire communication.

Why are the changes needed?

Add support for TLS to secure wire communication.
This is the last PR to add basic support for TLS.
There will be a follow up for CELEBORN-1356 and documentation ofcourse !

Does this PR introduce any user-facing change?

Yes, completes basic support for TLS in Celeborn.

How was this patch tested?

Existing tests, augmented with additional unit tests.

@mridulm mridulm changed the title [CELEBORN-1351] Add support for SSLFactory, and enable TLS support [CELEBORN-1351] Add SSLFactory, and enable TLS support Apr 2, 2024
@mridulm
Copy link
Contributor Author

mridulm commented Apr 2, 2024

The test failure does not look related to this PR (and the same tests passed here):

- celeborn spark integration test - pushdata timeout will add to pushExcludedWorkers *** FAILED ***

@mridulm
Copy link
Contributor Author

mridulm commented Apr 2, 2024

@SteNicholas
Copy link
Member

SteNicholas commented Apr 2, 2024

@mridulm, there are unnecessary changes of document in this pull request. Please check the changes of document.

@mridulm
Copy link
Contributor Author

mridulm commented Apr 2, 2024

Oh no, this is some weird artifact of pulling from my other branch - let me revert those, sorry about that !


if (channelsLimiter != null) {
pipeline.addLast("limiter", channelsLimiter);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rate limiter is something which does not exist in spark ... this looks fine to me, but please do take a look at this reviewers !


@Override
public X509Certificate[] getAcceptedIssuers() {
return EMPTY_CERT_ARRAY;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This fixes a bug in spark, and I will backport it there.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this surfaced as part of the usage of wantClientAuth? What was the bug?

Copy link
Contributor Author

@mridulm mridulm Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The api expects an empty array and not null to be returned.
As part of testing various combinations, this surfaced, yes: though I dont recall if it was specifically related to want auth - it probably was

We need something similar to ssl connectivity test in Spark as well, which can help drive the matrix of combinations for testing (there are a lot more combinations in Spark)

public SSLEngine createSSLEngine(boolean isClient, ByteBufAllocator allocator) {
SSLEngine engine = createEngine(isClient, allocator);
engine.setUseClientMode(isClient);
engine.setWantClientAuth(true);
Copy link
Contributor Author

@mridulm mridulm Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike in spark, where clients do not have ssl cert, here we allow for ssl client to present their certificate - when available.
This is mainly relevant when master and workers communicate, and not so much between applications and master/worker (not common for cert to be present for spark/flink/mr apps) - providing for master/worker to validate each others identity. In future, this can be the basis for SASL EXTERNAL between s2s components !

See SslConnectivitySuiteJ.testUntrustedClientCertFails where this validation failure is tested.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike in spark, where clients do not have ssl cert, here we allow for ssl client to present their certificate - when available.

Would be great to do this on Spark too. I had a PR to do this that I never got around to submitting as part of other feature tests, but I wasn't sure if it would have value in isolation beyond the other changes I had in mind (those proved infeasible).

Copy link
Contributor Author

@mridulm mridulm Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not sure if there is a lot of value in it for spark - in Celeborn this is mainly useful for s2s auth between master and workers - it is not that common for client apps to have signed certs

clientFactory.close();
context.close();
testData.cleanup();
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing test suites, like RpcIntegrationSuite, TransportClientFactorySuite, etc are modified such that a SSL subclass can test all the cases - but over SSL.

Copy link
Member

@SteNicholas SteNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Left minor comments. BTW, I found that the description of SSL configs does not explain what the module user could configure. Could you improve the config description?

pipeline.addLast("loggingHandler", nettyLogger.getLoggingHandler());
}
if (sslEncryptionEnabled()) {
if (!isClient && !sslFactory.hasKeyManagers()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the check !sslFactory.hasKeyManagers move into createSslFactory? Meanwhile, does this need to throw exception when not a client connection? What about not add SslHandler for non-client conection?

Copy link
Contributor Author

@mridulm mridulm Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understood your query - for non client connections, we need a private key in order for SSL encryption.
A client connection can also optionally provide a certificate, but it is not required.
In other words, at client side, keys are not required (will use server cert provided as part of the handshake) - but for server, keys are mandatory and is used for encryption/decryption - until a session key is negotiated.

Given this, we cannot pull this into createSslFactory as it is common for both client and servers. Please let me know if I am missing something here.

@SteNicholas SteNicholas changed the title [CELEBORN-1351] Add SSLFactory, and enable TLS support [CELEBORN-1351] Introduce SSLFactory and enable TLS support Apr 2, 2024
@mridulm
Copy link
Contributor Author

mridulm commented Apr 2, 2024

To answer query about which modules are relevant, all modules can be secured using TLS.
At a minimum, given shared secret is negotiated when LifecycleManager registers with Master, it is advisible to secure rpc calls - towards that end, we will split the rpc calls within the application (driver/executors) and from client to server's (see CELEBORN-1356), so that users can independently configure those.

Will add more details on this in the docs PR (wip)

@codecov-commenter
Copy link

codecov-commenter commented Apr 4, 2024

Codecov Report

Attention: Patch coverage is 72.98851% with 47 lines in your changes are missing coverage. Please review.

Project coverage is 40.17%. Comparing base (6153ba4) to head (ead0789).
Report is 3 commits behind head on main.

❗ Current head ead0789 differs from pull request most recent head e5155e5. Consider uploading reports for the commit e5155e5 to get more accurate results

Files Patch % Lines
...apache/celeborn/common/network/ssl/SSLFactory.java 73.28% 17 Missing and 14 partials ⚠️
...ache/celeborn/common/network/TransportContext.java 72.23% 8 Missing and 2 partials ⚠️
.../common/network/client/TransportClientFactory.java 77.78% 3 Missing and 1 partial ⚠️
...apache/celeborn/common/rpc/netty/NettyRpcEnv.scala 33.34% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2438      +/-   ##
==========================================
+ Coverage   39.24%   40.17%   +0.94%     
==========================================
  Files         217      218       +1     
  Lines       13432    13600     +168     
  Branches     1173     1195      +22     
==========================================
+ Hits         5270     5463     +193     
+ Misses       7863     7823      -40     
- Partials      299      314      +15     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mridulm
Copy link
Contributor Author

mridulm commented Apr 5, 2024

@SteNicholas, please do let me know if I can help clarify better ... hopefully the details above help.

Copy link
Member

@SteNicholas SteNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Thank for @mridulm explain and updates.

Copy link
Contributor

@otterc otterc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some nits. Looks good to me.

@otterc
Copy link
Contributor

otterc commented Apr 7, 2024

I will merge it tomorrow if it is still open.

@SteNicholas
Copy link
Member

LGTM. Merging to main(v0.5.0).

@mridulm
Copy link
Contributor Author

mridulm commented Apr 8, 2024

Thanks for the reviews @SteNicholas, @otterc !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants