-
Notifications
You must be signed in to change notification settings - Fork 46
Add example for using Flink File cache #80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
nicusX
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example has a problem: it shows a hack without explaining when it's useful and how to use it. This is going to be very confusing for a user.
This was a problem of the old example repos: showing hacks without real explanations.
We set a truststore but we do not show how it is used.
Also, is it guaranteed that javax.net.ssl.trustStore points to the downloaded truststore when another operator is going to use it?
| * Flink connectors: DataGeneratorSource, DiscardingSink | ||
|
|
||
|
|
||
| This example demonstrate how to use Flink's distributed cache to copy files over from JobManager and distribute them to the TaskManager worker nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to explain clearly when this is useful, because this example doesn't show using the certificate. It just copies it.
In particular, we are showing downloading a cacert and not just random file. We need to explain when is useful
|
|
||
| In this example, we download a `cacerts` file for custom trust store from S3, and configure the TaskManager JVM to use this new trust store location. | ||
|
|
||
| We could also download directly from S3 to the taskmanager locations. However, using the filecache is a more Flink-native way to do this, and will ensure that the file is available even on job restarts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The example actually downloads the file from S3, in the main().
Why is this different from downloading it from S3 in the open() of the RichMapFunction, downloading into each TM?
| // Perform any custom operations on the copied file here. For example, reconfiguring the trust store of the JVM | ||
| LOG.info("Trust store location before modification : " + System.getProperty("javax.net.ssl.trustStore")); | ||
| if (!TASKMANAGER_LOCAL_FILE_LOCATION.equals(System.getProperty("javax.net.ssl.trustStore"))) { | ||
| System.setProperty("javax.net.ssl.trustStore", TASKMANAGER_LOCAL_FILE_LOCATION); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this truststore is supposed to be used to initialize another operator, like a source or a sink.
How does this guarantee that the open() of this operator happens before the open() of operator that expects the certificate?
| // Skip the copy if the file already exists locally | ||
| if (!localFileDir.exists()) { | ||
| localFileDir.mkdirs(); | ||
| File cachedFile = getRuntimeContext().getDistributedCache().getFile(DISTRIBUTED_CACHE_RESOURCE_NAME); | ||
| Files.copy(cachedFile.toPath(), localFileDir.toPath(), StandardCopyOption.REPLACE_EXISTING); | ||
| LOG.info("Copied over resource file {} to taskmanager location {}", DISTRIBUTED_CACHE_RESOURCE_NAME, TASKMANAGER_LOCAL_FILE_LOCATION); | ||
| } else { | ||
| LOG.info("Skipping resource copy for resource {} on taskmanager as the file already exists: {}", DISTRIBUTED_CACHE_RESOURCE_NAME, TASKMANAGER_LOCAL_FILE_LOCATION); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we come up with a utility class that encapsulates downloading of the necessary certificates? This way the initialisation logic could be a one liner that can be included by any operator that requires the files.
Ideally that should be an idempotent process that could be run multiple times without sideeffects.
|
@nicusX The purpose for this PR so far is just an example of how customers can do dynamic file loading onto the taskmanagers as a better way than "loading in the main function", which we have seen customers frequently do and end up with apps in a broken state. I'm not too concerned whether this "fits" into the example repo, as the PR itself is useful to share with customers as a better way of loading files. Do let me know if you prefer this example to be structured differently, as I am sharing this PR with customers who want a good example of setting up file caching. |
|
@hlteoh37 this example has multiple dangers
This limit the use of this pattern a lot, at least for fetching truststore/keystore. A more robust example would be showing how to use the downloaded truststore to initialize and https client in a RichAsyncFunction. This is useful when the https endpoint uses a self-signed server-side certificate. Then of course clearly explaining what are the limitations and risks of this pattern (essentially the point 1 and 2 above) |
|
Archiving this PR for the time being, not finding a way to show this pattern using a robust example |
Purpose of the change
Add example for using Flink filecache
Verifying this change
Run the example on MSF:
cacertscopied directly from JVM cacerts/tmp/cacerts.Verified that the S3 download happens successfully on the JM

Verified that the reconfiguring of the trust store occurs on TM successfully.

Verified that on the TM host, the cacerts file is added.
Significant changes
(Please check any boxes [x] if the answer is "yes". You can first publish the PR and check them afterward, for convenience.)