Contributor

@NicoK NicoK commented Feb 12, 2019

What is the purpose of the change

Netty has the ability to run with different SSLEngine implementations but with our current setup, we are fixed to the JDK implementation, although one based on OpenSSL is expected to be faster [1].
We should make this configurable and ideally also provide everything needed to run with OpenSSL in the future (the last part is not part of this PR).

[1] https://netty.io/wiki/requirements-for-4.x.html#benefits-of-using-openssl

This PR subsumes #6328.

Brief change log

  • netty-fy SSL configuration in SSLUtils by using Netty's SslContextBuilder (only a few places do not use netty SSL setups - provide a workaround there)
  • allow selecting the SSL engine provider via security.ssl.provider
  • add openSSL-based SSL tests (if available) - some may currently fail due to different behaviour (this may need to be fixed once the second part is done)
  • use OpenSslX509KeyManagerFactory for openSSL back-end

Verifying this change

This change can be verified as follows:

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? docs, JavaDocs

Collaborator

flinkbot commented Feb 12, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

Contributor

@pnowojski pnowojski left a comment

Thanks @NicoK for the change. Couple of rather minor comments. The only bigger one is about running tests "if openSSL is available".

  final SSLHandlerFactory serverSSLHandlerFactory = SSLUtils.createInternalServerSSLEngineFactory(serverConfig);
- final SslHandler sslHandler = serverSSLHandlerFactory.createNettySSLHandler();
+ // note: a 'null' allocator seems to work here (probably only because we do not use the ssl engine!)
+ final SslHandler sslHandler = serverSSLHandlerFactory.createNettySSLHandler(null);
Contributor

io.netty.buffer.UnpooledByteBufAllocator#DEFAULT?

Contributor Author

actually, a later commit in the PR already fixed that :)

  // SSL should be the first handler in the pipeline
  if (sslFactory != null) {
-     ch.pipeline().addLast("ssl", sslFactory.createNettySSLHandler());
+     ch.pipeline().addLast("ssl",
Contributor

I do not understand this [FLINK-9816][network] netty-fy SSL configuration commit. What does it do? There are no new tests, no changes in the existing tests, no documentation and nothing in the commit message :(

Is it a refactoring? If so, please explain in the commit message what you are refactoring and why.

.list(
TextElement.text("%s: default Java-based SSL engine", TextElement.code("JDK")),
TextElement.text("%s: openSSL-based SSL engine using system libraries"
+ " (falls back to JDK if not available)", TextElement.code("OPENSSL"))
Contributor

It would be safer to fail instead of falling back if the user specifies openSSL explicitly.

Contributor Author

Do you think it would then be a good idea to also have an AUTO mode that chooses openSSL if available and Java SSL otherwise?

Contributor Author

It may be dangerous though if this would result in a heterogeneous cluster where some nodes had to fall back to JavaSSL while others are using openSSL. Maybe let's leave it with the user having to decide manually.

Contributor

That is a good question. I think an AUTO mode (even as the default value) would make sense. But I leave it up to you to decide whether this is worth the extra effort or not.

Problems with heterogeneous setups shouldn't be that frequent. I was mainly concerned with the situation where somebody explicitly sets the value to OpenSSL and it is silently not respected.

Contributor Author

I'll leave it without AUTO since, especially with standalone clusters, it may actually be quite simple to accidentally get into a heterogeneous SSL setup by missing some dependencies on some nodes. I fear that debugging such scenarios would be painful.
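
The outcome of this thread (fail when openSSL is requested but missing, and no AUTO mode) could look roughly like the sketch below. It is not the PR's code: `StrictSslProviderCheck`, `Provider`, and `resolve` are made-up names, and the `openSslAvailable` flag is only a stand-in for a runtime check such as Netty's `OpenSsl.isAvailable()`.

```java
/** Hypothetical sketch: honour an explicit provider choice or fail, never fall back silently. */
final class StrictSslProviderCheck {

    /** The two providers discussed in this thread; there is deliberately no AUTO. */
    enum Provider { JDK, OPENSSL }

    /**
     * Returns the requested provider, or throws if it cannot be honoured on this node.
     *
     * @param openSslAvailable stand-in for a runtime check such as Netty's OpenSsl.isAvailable()
     */
    static Provider resolve(Provider requested, boolean openSslAvailable) {
        if (requested == Provider.OPENSSL && !openSslAvailable) {
            // Failing loudly avoids a cluster where some nodes quietly run the JDK engine.
            throw new IllegalStateException(
                    "openSSL was requested explicitly but is not available on this node");
        }
        return requested;
    }

    public static void main(String[] args) {
        System.out.println(resolve(Provider.JDK, false));
        System.out.println(resolve(Provider.OPENSSL, true));
        try {
            resolve(Provider.OPENSSL, false);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With this shape, a node that lacks the native library refuses to start with the requested setting instead of joining the cluster with a different engine.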

} else if (providerString.equalsIgnoreCase("JDK")) {
return JDK;
} else {
throw new IllegalArgumentException("Unknown SSL provider: " + providerString);
Contributor

IllegalArgumentException is a runtime exception that doesn't use our exception hierarchy. Did you mean IllegalConfigurationException?

Contributor Author

sure, why not
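
For illustration, the change agreed on here might look something like the following. This is a sketch under assumptions, not the merged code: `SslProviderParser` and its nested `Provider` enum are hypothetical, and the nested `IllegalConfigurationException` is a local stand-in so the example compiles on its own, not Flink's actual class.

```java
import java.util.Locale;

/** Hypothetical sketch of parsing a provider string such as the security.ssl.provider value. */
final class SslProviderParser {

    enum Provider { JDK, OPENSSL }

    /** Local stand-in for a project-specific configuration exception (not Flink's class). */
    static final class IllegalConfigurationException extends RuntimeException {
        IllegalConfigurationException(String message) {
            super(message);
        }
    }

    /** Parses the provider name case-insensitively and rejects anything unknown. */
    static Provider fromString(String providerString) {
        if (providerString == null) {
            throw new IllegalConfigurationException("SSL provider must not be null");
        }
        try {
            // Locale.ROOT keeps the upper-casing independent of the JVM's default locale.
            return Provider.valueOf(providerString.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalConfigurationException("Unknown SSL provider: " + providerString);
        }
    }

    public static void main(String[] args) {
        System.out.println(fromString("jdk"));
        System.out.println(fromString("OpenSSL"));
    }
}
```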

public static final List<String> AVAILABLE_SSL_PROVIDERS;
static {
if (OpenSsl.isAvailable()) {
AVAILABLE_SSL_PROVIDERS = Arrays.asList("JDK", "OPENSSL");
Contributor

Do we test for this somehow on Travis? Manually? Are there instructions for making OpenSsl available during testing?

Contributor Author

For now, we'd need a custom flink-shaded build. There are licensing issues that prevent us from going the easy way, and how to provide openSSL safely for the Flink distribution is not set in stone yet. I'll create a follow-up PR with proper documentation once this has been defined.

Contributor Author

I enriched this PR with a commit that enables the custom flink-shaded builds with static openSSL builds for Travis, just to see whether the tests run through. We would have to drop that commit during the merge though.

// session context is only available after a session was set up -> this should be true after data was sent
SSLSessionContext sessionContext = sslHandler.engine().getSession().getSessionContext();
assertNotNull("bug in unit test setup: session context not available", sessionContext);
// TODO: sessionContext is from the client side which may not have a session cache at all (with openSSL it behaves that way)
Contributor

?

Contributor Author

done - by a little extra work here and there :(

@NicoK NicoK force-pushed the f9816-master branch 2 times, most recently from 2255050 to 7c4cfd0 Compare May 7, 2019 13:56
Contributor Author

NicoK commented May 7, 2019

Thanks for the review, I addressed all issues you found and rebased onto the latest master so we don't have conflicts and non-related commits anymore (except for one testing enrichment marked with DO-NOT-MERGE)

@NicoK NicoK force-pushed the f9816-master branch 5 times, most recently from 7b7bcb6 to 486102a Compare May 8, 2019 12:51
Contributor

@pnowojski pnowojski left a comment

LGTM. However, to avoid having dead/untested code in master, I think it would be better to merge this only once we resolve the flink-shaded-netty issue and are able to test this on Travis.

@NicoK NicoK force-pushed the f9816-master branch 3 times, most recently from 8232b85 to 8922340 Compare May 23, 2019 07:32
Contributor Author

NicoK commented May 23, 2019

hi @pnowojski I did some updates to this PR:

  • rebased onto latest master
  • changed unit tests to work with the latest flink-shaded openSSL jars
  • set nightly e2e tests to use both the dynamically and the statically linked openSSL variants for the test_streaming_file_sink
  • updated docs
  • a few hotfixes
    -> everything starting from b21a1d0 is a new commit

Contributor

@pnowojski pnowojski left a comment

LGTM % one comment

source "$(dirname "$0")"/common_s3.sh

# randomly set up openSSL with dynamically/statically linked libraries
OPENSSL_LINKAGE=$(if (( RANDOM % 2 )) ; then echo "dynamic"; else echo "static"; fi)
Contributor

Add an echo message directly after the RANDOM line stating what was chosen to be tested.

Contributor Author

@NicoK NicoK May 23, 2019

don't you think it is enough to print this one (during setup, in common_ssl.sh)?
echo "Setting up SSL with: ${type} ${provider} ${provider_lib}"
(I added this line there)

Contributor

@pnowojski pnowojski May 23, 2019

Currently yes, but it would be safer to tie those two things together (best hidden behind some function like:

String x = new Random().choice("dynamic", "static");
System.out.println(format("Executing test with following value of x = [%s] that was randomly selected", x));

), so that there is obvious connection between randomness and printing the random result.

Otherwise:

  1. someone might remove the logging message, without realising the implication of that fact
  2. there is a higher chance of someone executing the test and not realising why it is not deterministic.
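
The pseudocode above (`java.util.Random` has no `choice` method) could be written as a runnable helper along these lines. The test script under review is shell, so this only illustrates the idea in the pseudocode's own language; `LoggedRandomChoice` and `chooseAndLog` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Hypothetical helper that ties a random pick to the log line announcing it. */
final class LoggedRandomChoice {

    /** Picks one option at random and prints the pick in the same place. */
    static String chooseAndLog(String name, List<String> options, Random random) {
        String chosen = options.get(random.nextInt(options.size()));
        System.out.println(String.format(
                "Executing test with %s = [%s], randomly selected from %s",
                name, chosen, options));
        return chosen;
    }

    public static void main(String[] args) {
        String linkage = chooseAndLog(
                "OPENSSL_LINKAGE", Arrays.asList("dynamic", "static"), new Random());
        // The rest of the test setup would continue with 'linkage' here.
    }
}
```

Because the selection and the message live in one function, removing the log line means touching the code that makes the test non-deterministic.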

Contributor Author

I'm not entirely convinced but also don't object and therefore added a fixup
-> I'm now waiting for the flink-shaded 7.0 release before merging

@NicoK NicoK force-pushed the f9816-master branch 3 times, most recently from 549cb87 to 71323e8 Compare June 13, 2019 16:25
NicoK added 6 commits June 15, 2019 07:52
This allows line breaks in text block elements and may be useful, for example,
in starting a new line inside a list description element.
Refactor the SSL configuration done for Netty to have it more like the way
Netty intends it to be: using its SslContextBuilder. This will make it much
easier to set a different Netty SSL engine provider.

[hotfix][network] extract key and trust manager factory creation
[FLINK-9816][network][tests] add openSSL-based SSL tests if available

[FLINK-9816][network] use OpenSslX509KeyManagerFactory for openSSL back-end

According to https://netty.io/news/2018/07/10/4-1-26-Final.html, this will
vastly reduce handshake latency and CPU use.

[FLINK-9816][network][tests] allow forcing openSSL tests to run

For forcing openSSL tests to run (or fail if not available), specify the
following system property: '-D flink.tests.force-openssl'
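
A minimal sketch of how a test could evaluate such a switch, assuming that the mere presence of the property (with or without a value) counts as "forced". `ForceOpenSslTests` and `shouldRun` are hypothetical names, and `openSslAvailable` stands in for a check such as Netty's `OpenSsl.isAvailable()`; the actual tests may do this differently.

```java
/** Hypothetical sketch of evaluating the 'flink.tests.force-openssl' system property. */
final class ForceOpenSslTests {

    static final String FORCE_PROPERTY = "flink.tests.force-openssl";

    /**
     * Decides whether openSSL-based tests should run.
     *
     * @return true to run the tests, false to skip them
     * @throws IllegalStateException if the tests are forced but openSSL is missing
     */
    static boolean shouldRun(boolean openSslAvailable) {
        // Assumption: a property passed without a value is still present (as an empty string).
        boolean forced = System.getProperty(FORCE_PROPERTY) != null;
        if (forced && !openSslAvailable) {
            throw new IllegalStateException(
                    "openSSL tests were forced via '" + FORCE_PROPERTY + "' but openSSL is not available");
        }
        return openSslAvailable;
    }

    public static void main(String[] args) {
        // Without the property, a missing openSSL simply means the tests are skipped.
        System.out.println("run openSSL tests: " + shouldRun(false));
    }
}
```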
…opt/

Please note that there is also a static version of netty-tcnative but we
currently do not distribute it due to licensing issues. Once openSSL completes
its switch to Apache License v2, we can provide this as well and maybe even
make that one default (by putting it into lib/). Since there are too many things
which may go wrong with the dynamically-linked library (based on the system you
run on), we provide this only in opt/.
…shaded version

This uses the dynamically-linked openSSL for the unit tests since this is the
artifact that we distribute.

TODO: e2e tests for dynamically and statically linked openSSL
NicoK added 3 commits June 15, 2019 07:52
This test (run nightly) will use a dynamically or statically linked openSSL
library at random during runtime, in order to eventually verify both.
Mention all steps necessary to get openSSL-based SSL running, based on
flink-shaded 7.0.
@NicoK NicoK merged commit 11a7b3f into apache:master Jun 15, 2019