[FLINK-31656][runtime][security] Obtain delegation tokens early to support external file system usage in HA services #22298

Merged
1 commit merged on Apr 3, 2023

Conversation

gaborgsomogyi (Contributor) commented Mar 29, 2023

What is the purpose of the change

At the moment, no delegation tokens are available when the HA services start. If the HA services use an external file system whose authentication is based on delegation tokens (typically S3), they throw an exception because there are no credentials.

In this PR I've moved the delegation token manager initialization before the HA services and added a one-time manual token obtain that propagates the tokens to the receivers in the local JVM. Additionally, base directory creation in FileSystemBlobStore and FileSystemJobResultStore is now deferred.
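The startup order described above can be illustrated with a minimal sketch. This is not Flink's code: `RecordingTokenManager`, `HaServicesSketch`, and `EntrypointSketch` are hypothetical stand-ins, and only the method name `obtainDelegationTokens()` is taken from the diff in this PR. The sketch assumes that HA services need credentials at construction time and fail without them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the delegation token manager. Only the method name
// obtainDelegationTokens() comes from the PR diff; everything else is assumed.
class RecordingTokenManager {
    private final List<String> events;

    RecordingTokenManager(List<String> events) {
        this.events = events;
    }

    void obtainDelegationTokens() {
        // The real manager would obtain tokens and hand them to the receivers
        // in the local JVM. This sketch only records that the step happened.
        events.add("tokens-obtained");
    }
}

// Hypothetical HA services that need file system credentials when constructed.
class HaServicesSketch {
    HaServicesSketch(List<String> events) {
        if (!events.contains("tokens-obtained")) {
            throw new IllegalStateException("no credentials available for HA storage");
        }
        events.add("ha-services-started");
    }
}

class EntrypointSketch {
    // Order described in the PR: token manager first, one-time obtain, then HA services.
    static List<String> initializeServices() {
        List<String> events = new ArrayList<>();
        RecordingTokenManager delegationTokenManager = new RecordingTokenManager(events);
        delegationTokenManager.obtainDelegationTokens();
        new HaServicesSketch(events);
        return events;
    }

    public static void main(String[] args) {
        System.out.println(initializeServices());
    }
}
```

Swapping the last two steps in `initializeServices` makes the hypothetical `HaServicesSketch` constructor throw, which mirrors the missing-credentials failure this PR addresses.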

Brief change log

  • Moved the delegation token manager initialization before the HA services and triggered a one-time manual token obtain with propagation to the local JVM receivers. This is the solution for the job manager side.
  • Deferred base directory creation in FileSystemBlobStore and FileSystemJobResultStore. This is the solution for the task manager side.
  • Changed the DelegationTokenManager API documentation for better clarity
  • Changed a log message for better clarity
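Deferring base directory creation, as in the second item above, might look roughly like the following sketch. It is an illustration under assumptions, not the actual change: `LazyDirStoreSketch` and `LazyDirDemo` are hypothetical names, and the code uses `java.nio.file` on the local disk instead of Flink's file system abstraction.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical store, loosely modelled on the idea of deferring directory creation.
class LazyDirStoreSketch {
    private final Path basePath;
    private boolean baseDirCreated;

    LazyDirStoreSketch(Path basePath) {
        // No file system access here, so construction cannot fail on missing credentials.
        this.basePath = basePath;
    }

    // Creates the base directory on first use only.
    private synchronized void ensureBaseDir() throws IOException {
        if (!baseDirCreated) {
            Files.createDirectories(basePath);
            baseDirCreated = true;
        }
    }

    Path put(String key, String value) throws IOException {
        ensureBaseDir();
        Path target = basePath.resolve(key);
        Files.write(target, value.getBytes(StandardCharsets.UTF_8));
        return target;
    }
}

class LazyDirDemo {
    // Returns {base directory exists after construction, exists after first write}.
    static boolean[] run() {
        try {
            Path tmp = Files.createTempDirectory("lazy-store");
            Path base = tmp.resolve("blob-base");
            LazyDirStoreSketch store = new LazyDirStoreSketch(base);
            boolean before = Files.exists(base);
            store.put("job-1", "payload");
            boolean after = Files.exists(base);
            return new boolean[] {before, after};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = run();
        System.out.println("exists after construction: " + result[0]
                + ", after first write: " + result[1]);
    }
}
```

The design point is that the constructor performs no I/O, so a component can be created before credentials exist and only touches the file system once it is actually used.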

Verifying this change

Manually on cluster.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): yes
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

flinkbot (Collaborator) commented Mar 29, 2023

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@@ -213,6 +213,20 @@ public void obtainDelegationTokens(DelegationTokenContainer container) throws Ex
        LOG.info("Delegation tokens obtained successfully");
    }

    @Override
    public void obtainDelegationTokens() throws Exception {
Contributor

We need to immediately propagate the existing tokens (fetched here) to the task managers once the start function is called. Otherwise, I see a "No credential" error in the task manager. I guess it is due to the latency of the additional token fetch in the startTokenUpdate function.

Contributor Author

At the one-time token obtain stage there are no task managers yet, but I see that there is an issue, so I'll take a look...

Contributor Author

I've just added further changes to solve the TM issue. We should test it in depth on a cluster because it modifies Flink's critical path.

Contributor Author

@HuangZhenQiu please confirm that the cluster tests pass on your side, just to double-check.

HuangZhenQiu (Contributor) commented Apr 6, 2023

@gaborgsomogyi Sorry for the late reply. The feature is fully tested end to end.

gaborgsomogyi force-pushed the FLINK-31656 branch 2 times, most recently from f3d83f1 to 9daabcd, on March 31, 2023 09:04
gaborgsomogyi changed the title from "[FLINK-31656][runtime][security] Obtain delegation tokens early to support external file system usage in blob server" to "[FLINK-31656][runtime][security] Obtain delegation tokens early to support external file system usage in HA services" on Mar 31, 2023
gaborgsomogyi (Contributor Author)

I've just fixed the unit tests + changed the title and description. The cluster tests are green.

gaborgsomogyi (Contributor Author)

cc @gyfora @mbalassi

// Obtaining delegation tokens and propagating them to the local JVM receivers in a
// one-time fashion is required because BlobServer may connect to external file
// systems
delegationTokenManager.obtainDelegationTokens();
Contributor

I can't seem to find any test for this new behaviour. It would be good to add something to guard against accidental regressions in the future.

Contributor Author

I added a check in ClusterEntrypointTest that tokens are obtained, but it would be overkill to verify that the token obtain happens before HA services start.
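A regression check of the kind discussed in this thread could be sketched with a hand-rolled fake, as below. This does not reproduce ClusterEntrypointTest: `CountingTokenManager`, `StartupSketch`, and `StartupSketchCheck` are hypothetical names, and the sketch assumes a simplified entrypoint that receives the token manager as a dependency. It only verifies that tokens are obtained once during startup, not the ordering relative to HA services.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical fake; only the method name obtainDelegationTokens() comes from the PR.
class CountingTokenManager {
    final AtomicInteger obtainCalls = new AtomicInteger();

    void obtainDelegationTokens() {
        obtainCalls.incrementAndGet();
    }
}

// Hypothetical, heavily simplified entrypoint that takes the manager as a dependency.
class StartupSketch {
    private final CountingTokenManager tokenManager;

    StartupSketch(CountingTokenManager tokenManager) {
        this.tokenManager = tokenManager;
    }

    void startCluster() {
        // One-time obtain during startup, as described in the PR.
        tokenManager.obtainDelegationTokens();
    }
}

class StartupSketchCheck {
    // Runs the simplified startup and reports how often tokens were obtained.
    static int obtainCallsAfterStartup() {
        CountingTokenManager fake = new CountingTokenManager();
        new StartupSketch(fake).startCluster();
        return fake.obtainCalls.get();
    }

    public static void main(String[] args) {
        if (obtainCallsAfterStartup() != 1) {
            throw new AssertionError("expected exactly one token obtain during startup");
        }
        System.out.println("tokens obtained once during startup");
    }
}
```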

…pport external file system usage in blob server
4 participants