Skip to content

Conversation

@hlteoh37
Copy link
Contributor

@hlteoh37 hlteoh37 commented Nov 28, 2024

Purpose of the change

Add example for using Flink filecache

Verifying this change

Run the example on MSF:

  1. Set up S3 file with cacerts copied directly from JVM cacerts
  2. Run up the app and validate that the logs show the trust cert configured correctly.
  3. Checked the TM instances and verified that they have the file copied to /tmp/cacerts.

Verified that the S3 download happens successfully on the JM
SCR-20241128-owfp

Verified that the reconfiguring of the trust store occurs on TM successfully.
SCR-20241128-owcz

Verified that on the TM host, the cacerts file is added.

image

Significant changes

(Please check any boxes [x] if the answer is "yes". You can first publish the PR and check them afterward, for convenience.)

  • Completely new example
  • Updated an existing example to newer Flink version or dependencies versions
  • Improved an existing example
  • Modified the runtime configuration of an existing example (i.e. added/removed/modified any runtime properties)
  • Modified the expected input or output of an existing example (e.g. modified the source or sink, modified the record schema)

Copy link
Contributor

@nicusX nicusX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example has a problem: it shows a hack without explaining when it's useful and how to use it. This is going to be very confusing for a user.
This was a problem of the old example repos: showing hacks without real explanations.

We set a truststore but we do not show how it is used.

Also, is it guaranteed that javax.net.ssl.trustStore points to the downloaded truststore when another operator is going to use it?

* Flink connectors: DataGeneratorSource, DiscardingSink


This example demonstrate how to use Flink's distributed cache to copy files over from JobManager and distribute them to the TaskManager worker nodes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to explain clearly when this is useful, because this example doesn't show using the certificate. It just copies it.
In particular, we are showing downloading a cacert and not just random file. We need to explain when is useful


In this example, we download a `cacerts` file for custom trust store from S3, and configure the TaskManager JVM to use this new trust store location.

We could also download directly from S3 to the taskmanager locations. However, using the filecache is a more Flink-native way to do this, and will ensure that the file is available even on job restarts.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example actually downloads the file from S3, in the main().
Why is this different from downloading it from S3 in the open() of the RichMapFunction, downloading into each TM?

// Perform any custom operations on the copied file here. For example, reconfiguring the trust store of the JVM
LOG.info("Trust store location before modification : " + System.getProperty("javax.net.ssl.trustStore"));
if (!TASKMANAGER_LOCAL_FILE_LOCATION.equals(System.getProperty("javax.net.ssl.trustStore"))) {
System.setProperty("javax.net.ssl.trustStore", TASKMANAGER_LOCAL_FILE_LOCATION);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this truststore is supposed to be used to initialize another operator, like a source or a sink.
How does this guarantee that the open() of this operator happens before the open() of operator that expects the certificate?

Comment on lines +29 to +37
// Skip the copy if the file already exists locally
if (!localFileDir.exists()) {
localFileDir.mkdirs();
File cachedFile = getRuntimeContext().getDistributedCache().getFile(DISTRIBUTED_CACHE_RESOURCE_NAME);
Files.copy(cachedFile.toPath(), localFileDir.toPath(), StandardCopyOption.REPLACE_EXISTING);
LOG.info("Copied over resource file {} to taskmanager location {}", DISTRIBUTED_CACHE_RESOURCE_NAME, TASKMANAGER_LOCAL_FILE_LOCATION);
} else {
LOG.info("Skipping resource copy for resource {} on taskmanager as the file already exists: {}", DISTRIBUTED_CACHE_RESOURCE_NAME, TASKMANAGER_LOCAL_FILE_LOCATION);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we come up with a utility class that encapsulates downloading of the necessary certificates? This way the initialisation logic could be a one liner that can be included by any operator that requires the files.

Ideally that should be an idempotent process that could be run multiple times without sideeffects.

@hlteoh37
Copy link
Contributor Author

hlteoh37 commented Feb 4, 2025

@nicusX The purpose for this PR so far is just an example of how customers can do dynamic file loading onto the taskmanagers as a better way than "loading in the main function", which we have seen customers frequently do and end up with apps in a broken state.

I'm not too concerned whether this "fits" into the example repo, as the PR itself is useful to share with customers as a better way of loading files.

Do let me know if you prefer this example to be structured differently, as I am sharing this PR with customers who want a good example of setting up file caching.

@nicusX
Copy link
Contributor

nicusX commented Feb 4, 2025

@hlteoh37 this example has multiple dangers

  1. replacing the JVM truststore opens dangerous failure scenarios, if the replacement truststore does not also contain the certificates normally available with the JVM
  2. you cannot use the truststore in an operator different from the one where you copy it into the local file system, because the order of initialization of operators is unknown

This limit the use of this pattern a lot, at least for fetching truststore/keystore.

A more robust example would be showing how to use the downloaded truststore to initialize and https client in a RichAsyncFunction. This is useful when the https endpoint uses a self-signed server-side certificate.
The new truststore should be passed directly to the http-client (TBD whether this is possible, depending on the client) rather than replacing the JVM truststore. Otherwise, you may fall in the failure scenario (1).

Then of course clearly explaining what are the limitations and risks of this pattern (essentially the point 1 and 2 above)

@nicusX
Copy link
Contributor

nicusX commented Apr 29, 2025

Archiving this PR for the time being, not finding a way to show this pattern using a robust example

@nicusX nicusX closed this Apr 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants