
Conversation

@krishan1390

Description of PR

JIRA - https://issues.apache.org/jira/browse/YARN-11448

Currently, the Router secret manager requires Routers to be stateful, with clients using sticky sessions.

Otherwise, there are several issues, listed below, which prevent the delegation token functionality from working across Router instances.

E.g.:

  • allKeys needs to be consistently updated across all router instances
  • DB update exceptions are swallowed & returned as a success if just in memory variables are updated
  • Purging Delegation Token / Master key on expiry assumes all tokens are available in memory
  • APIs like get all tokens return only in memory data which is incorrect

A more scalable and maintainable framework for the Router would be to design it as a stateless service. Given that database KV lookups are on the order of < 10 ms, this adds no meaningful latency overhead and makes the Router easier to maintain. Furthermore, a stateless Router setup, with no assumptions of stickiness, makes the Router framework more generic.

Additionally, some of the functionality around master key ids and delegation token sequence numbers is implemented with globally auto-incremented ids, which isn't feasible across all datastores. The actual requirement is just to generate unique keys for master key ids / delegation tokens, which is a much simpler and more generic solution. Also, certain APIs like get/set sequence number aren't applicable to the Router, and we can avoid providing them to simplify things further.
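The unique-id requirement above can be illustrated with a short sketch (hypothetical names, not the patch's actual code): since only uniqueness matters, a datastore-agnostic approach is to draw a random candidate id and let the store's atomic insert-if-absent act as the collision check. An in-memory set stands in for the state store here.

```java
import java.security.SecureRandom;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: generate unique master key ids without a global
// auto-increment counter. Only uniqueness is required, so a random id is
// drawn and retried on collision; Set.add() plays the role of the atomic
// insert-if-absent that a real datastore would provide.
public class UniqueKeyIdSketch {
  private static final SecureRandom RANDOM = new SecureRandom();
  private final Set<Integer> storedKeyIds = ConcurrentHashMap.newKeySet();

  public int generateNewKeyId() {
    while (true) {
      int candidate = RANDOM.nextInt(Integer.MAX_VALUE);
      if (storedKeyIds.add(candidate)) { // false on collision, so retry
        return candidate;
      }
    }
  }

  public static void main(String[] args) {
    UniqueKeyIdSketch gen = new UniqueKeyIdSketch();
    int a = gen.generateNewKeyId();
    int b = gen.generateNewKeyId();
    if (a == b) {
      throw new AssertionError("key ids must be unique");
    }
    System.out.println("generated distinct key ids");
  }
}
```

With this shape, each Router instance can mint ids independently; nothing like SELECT ... FOR UPDATE or a ZK shared counter is needed, at the cost of losing monotonic ordering (which, per the description above, the Router does not rely on).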

This patch addresses these functional concerns while working within the interfaces of AbstractDelegationTokenSecretManager.

As a later patch, we can create better delegation token interfaces to support both stateful & stateless secret managers.

How was this patch tested?

Unit tests: currently one test is implemented (testNewTokenVerification); more test cases are being added.

@krishan1390 krishan1390 changed the title Stateless Router Secret Manager YARN-11448 [Federation] Stateless Router Secret Manager Mar 1, 2023
@krishan1390
Author

@goiri @slfan1989 can you please help review this PR ?

@slfan1989
Contributor

slfan1989 commented Mar 1, 2023

@krishan1390 Thank you very much for your contribution; I will take time to look at this PR.

I took a quick look at your description.

allKeys needs to be consistently updated across all router instances

Multiple Routers share and store the delegation tokens, so there is no need to update them across all Router instances.

DB update exceptions are swallowed & returned as a success if just in memory variables are updated

MemoryFederationStateStore is only used for verification and should not be used in production. SQLServerFederationStateStore will not swallow exceptions, and the client cannot complete verification with an old token.

Purging Delegation Token / Master key on expiry assumes all tokens are available in memory

We only cache tokens in MemoryFederationStateStore, but MemoryFederationStateStore is not distributed storage and can only be used for test verification. It is recommended to use ZKFederationStateStore or SQLServerFederationStateStore.

APIs like get all tokens return only in memory data which is incorrect.

getAllToken is only used for test verification.

Sorry, I have been busy recently. I will add a design document for DB-backed delegation token storage; I referred to the design of Hive's delegation token storage, which is a stable capability.

@slfan1989 slfan1989 self-requested a review March 1, 2023 14:13
public abstract
class AbstractDelegationTokenSecretManager<TokenIdent
extends AbstractDelegationTokenIdentifier>
public abstract
Contributor

I am a little worried that changes in this class may affect many subclasses.

Member

In general, this is a fairly fundamental class in Hadoop.
I would propose a separate JIRA to clean this class up.

Author

I don't think any of these changes will affect other functionality in any subclass; I verified this.

And the changes are very generic: encapsulation, exception handling, concurrency handling, etc. Let me know if this still warrants a separate JIRA.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 53s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 9 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 19m 2s Maven dependency ordering for branch
+1 💚 mvninstall 28m 27s trunk passed
+1 💚 compile 25m 20s trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 compile 21m 51s trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 checkstyle 4m 2s trunk passed
+1 💚 mvnsite 4m 22s trunk passed
+1 💚 javadoc 3m 23s trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 javadoc 2m 47s trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 spotbugs 7m 20s trunk passed
+1 💚 shadedclient 23m 41s branch has no errors when building and testing our client artifacts.
-0 ⚠️ patch 24m 3s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 23s Maven dependency ordering for patch
-1 ❌ mvninstall 0m 38s /patch-mvninstall-hadoop-common-project_hadoop-common.txt hadoop-common in the patch failed.
-1 ❌ mvninstall 0m 17s /patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-router.txt hadoop-yarn-server-router in the patch failed.
-1 ❌ compile 1m 6s /patch-compile-root-jdkUbuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04.txt root in the patch failed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04.
-1 ❌ javac 1m 6s /patch-compile-root-jdkUbuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04.txt root in the patch failed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04.
-1 ❌ compile 0m 58s /patch-compile-root-jdkPrivateBuild-1.8.0_352-8u352-ga-1~20.04-b08.txt root in the patch failed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08.
-1 ❌ javac 0m 58s /patch-compile-root-jdkPrivateBuild-1.8.0_352-8u352-ga-1~20.04-b08.txt root in the patch failed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08.
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 3m 40s /results-checkstyle-root.txt root: The patch generated 55 new + 47 unchanged - 8 fixed = 102 total (was 55)
-1 ❌ mvnsite 0m 41s /patch-mvnsite-hadoop-common-project_hadoop-common.txt hadoop-common in the patch failed.
-1 ❌ mvnsite 0m 19s /patch-mvnsite-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-router.txt hadoop-yarn-server-router in the patch failed.
-1 ❌ javadoc 0m 46s /patch-javadoc-hadoop-common-project_hadoop-common-jdkUbuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04.txt hadoop-common in the patch failed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04.
-1 ❌ javadoc 0m 21s /patch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-router-jdkUbuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04.txt hadoop-yarn-server-router in the patch failed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04.
-1 ❌ javadoc 0m 19s /patch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-router-jdkPrivateBuild-1.8.0_352-8u352-ga-1~20.04-b08.txt hadoop-yarn-server-router in the patch failed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08.
-1 ❌ spotbugs 0m 38s /patch-spotbugs-hadoop-common-project_hadoop-common.txt hadoop-common in the patch failed.
-1 ❌ spotbugs 0m 18s /patch-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-router.txt hadoop-yarn-server-router in the patch failed.
-1 ❌ shadedclient 3m 38s patch has errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 0m 38s /patch-unit-hadoop-common-project_hadoop-common.txt hadoop-common in the patch failed.
-1 ❌ unit 3m 2s /patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-common.txt hadoop-yarn-server-common in the patch passed.
+1 💚 unit 102m 11s hadoop-yarn-server-resourcemanager in the patch passed.
-1 ❌ unit 0m 18s /patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-router.txt hadoop-yarn-server-router in the patch failed.
+1 💚 asflicense 0m 29s The patch does not generate ASF License warnings.
272m 54s
Reason Tests
Failed junit tests hadoop.yarn.server.federation.store.impl.TestMemoryFederationStateStore
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5443/1/artifact/out/Dockerfile
GITHUB PR #5443
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 8334126eaa5c 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / eb8856d
Default Java Private Build-1.8.0_352-8u352-ga-1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5443/1/testReport/
Max. process+thread count 940 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5443/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

int newCurrentId;
synchronized (this) {
newCurrentId = incrementCurrentKeyId();
newCurrentId = generateNewKeyId();
Contributor

Why should we change the method name?

Author

The actual requirement is to generate new keys, not to increment them; this greatly simplifies the implementation for various DB systems.

Contributor

@slfan1989 slfan1989 Mar 2, 2023

We use an incremental method to generate new keys. For the database, we use SELECT ... FOR UPDATE to ensure the increment is atomic. For ZK, we use a global generator (SharedCount).

Author

Yes, but there is no requirement to increment keys one by one; we only need to generate new unique keys. Hence the name change, to clearly reflect that requirement.

Contributor

Yes, but there is no requirement to increment keys one by one; we only need to generate new unique keys. Hence the name change, to clearly reflect that requirement.

The performance of the database (MySQL) should be fine; incrementing one by one should be OK.

For ZK, we request multiple ids at a time.

Users can configure the parameter zk-dt-secret-manager.token.seqnum.batch.size to request multiple sequence numbers at a time.

ZookeeperFederationStateStore#incrSharedCount

private int incrSharedCount(SharedCount sharedCount, int batchSize)
      throws Exception {
    while (true) {
      // Loop until we successfully increment the counter
      VersionedValue<Integer> versionedValue = sharedCount.getVersionedValue();
      if (sharedCount.trySetCount(versionedValue, versionedValue.getValue() + batchSize)) {
        return versionedValue.getValue();
      }
    }
  }


@Override
public synchronized byte[] retrievePassword(TokenIdent identifier)
public byte[] retrievePassword(TokenIdent identifier)
Contributor

Why remove synchronized?

Author

It is not required because currentTokens is a ConcurrentHashMap, which satisfies the happens-before memory visibility requirements for this method.
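A minimal sketch of that visibility argument (simplified types; the field name currentTokens mirrors the secret manager, the rest is hypothetical): a ConcurrentHashMap.put() happens-before any subsequent get() of the same key, so reader threads observe a fully published entry without holding the object lock.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of why retrievePassword() no longer needs
// `synchronized`: ConcurrentHashMap gives happens-before ordering
// between a put() and a later get() of the same key.
public class TokenMapVisibility {
  private final Map<String, byte[]> currentTokens = new ConcurrentHashMap<>();

  public void storeToken(String ident, byte[] password) {
    currentTokens.put(ident, password); // safe publication
  }

  public byte[] retrievePassword(String ident) { // no lock needed for reads
    return currentTokens.get(ident);
  }

  public static void main(String[] args) throws InterruptedException {
    TokenMapVisibility mgr = new TokenMapVisibility();
    Thread writer = new Thread(() -> mgr.storeToken("t1", new byte[]{42}));
    writer.start();
    writer.join(); // join orders memory for the check below
    if (mgr.retrievePassword("t1")[0] != 42) {
      throw new AssertionError("stored password not visible");
    }
    System.out.println("password visible to reader");
  }
}
```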

@krishan1390
Author

krishan1390 commented Mar 2, 2023

Thanks for your feedback, @slfan1989. Please find my responses below.

allKeys needs to be consistently updated across all router instances

Multiple Routers share and store the delegation tokens, so there is no need to update them across all Router instances.

That is not the actual behaviour currently. Each Router instance has its own set of master keys (allKeys & currentKey; these are set up on service startup through startThreads() and updated in rollMasterKey()). Even though they are stored in the database, a master key isn't looked up from the database but just returned from the in-memory variables (allKeys & currentKey). So a Router instance can't renew tokens generated by another Router instance.

And even delegation tokens are not consistently updated across Router instances. If a delegation token is present in the currentTokens variable of multiple Router instances but updated in only one of them (on token renewal), the other Router instances will use their own in-memory currentTokens rather than look up the database, and thus can incorrectly report the token as expired.

DB update exceptions are swallowed & returned as a success if just in memory variables are updated

MemoryFederationStateStore is only used for verification and should not be used in production. SQLServerFederationStateStore will not swallow exceptions, and the client cannot complete verification with an old token.

Yes. By in-memory variables I meant the class instance variables like currentTokens, currentKey, allKeys, etc. Even though SQLFederationStateStore throws an exception, all the current methods in RouterDelegationTokenSecretManager catch these exceptions and either return or terminate the app, rather than propagating them to be handled appropriately.

Purging Delegation Token / Master key on expiry assumes all tokens are available in memory

We only cache tokens in MemoryFederationStateStore, but MemoryFederationStateStore is not distributed storage and can only be used for test verification. It is recommended to use ZKFederationStateStore or SQLServerFederationStateStore.

Yes, by in-memory variables I meant the class instance variables like currentTokens, currentKey, allKeys. Currently, purging doesn't look at the data in the database at all.

APIs like get all tokens return only in memory data which is incorrect.

getAllToken is only used for test verification.

Yes, currently that's the case, but since it's a public API, it's an incorrect design to return the in-memory instance variables in the Router's case, since anyone can rely on this.

Sorry, I have been busy recently. I will add a design document for DB-backed delegation token storage; I referred to the design of Hive's delegation token storage, which is a stable capability.

No problem. I will provide test cases that demonstrate these issues more clearly.

My main thought, at a high level, is that by design a stateless Router should make no assumptions about the underlying state management (the AbstractDelegationTokenSecretManager functionality relies heavily on being the single source of truth rather than being distributed).

@slfan1989
Contributor

Yes. By in-memory variables I meant the class instance variables like currentTokens, currentKey, allKeys, etc. Even though SQLFederationStateStore throws an exception, all the current methods in RouterDelegationTokenSecretManager catch these exceptions and either return or terminate the app, rather than propagating them to be handled appropriately.

If storing the key fails, the RM terminates, and the client is blocked either way. From the client's perspective, the Router is the RM, so when implementing this functionality we should stay close to the RM's implementation. I think the Router's implementation is reasonable.

RMDelegationTokenSecretManager#storeNewMasterKey

protected void storeNewMasterKey(DelegationKey newKey) {
    try {
      LOG.info("storing master key with keyID " + newKey.getKeyId());
      rm.getRMContext().getStateStore().storeRMDTMasterKey(newKey);
    } catch (Exception e) {
      if (!shouldIgnoreException(e)) {
        LOG.error(
            "Error in storing master key with KeyID: " + newKey.getKeyId());
        ExitUtil.terminate(1, e);
      }
    }
  }

RouterDelegationTokenSecretManager#storeNewMasterKey

  public void storeNewMasterKey(DelegationKey newKey) {
    try {
      federationFacade.storeNewMasterKey(newKey);
    } catch (Exception e) {
      if (!shouldIgnoreException(e)) {
        LOG.error("Error in storing master key with KeyID: {}.", newKey.getKeyId());
        ExitUtil.terminate(1, e);
      }
    }
  }

@slfan1989
Contributor

slfan1989 commented Mar 2, 2023

Purging Delegation Token / Master key on expiry assumes all tokens are available in memory

We only cache tokens in MemoryFederationStateStore, but MemoryFederationStateStore is not distributed storage and can only be used for test verification. It is recommended to use ZKFederationStateStore or SQLServerFederationStateStore.

Yes, by in-memory variables I meant the class instance variables like currentTokens, currentKey, allKeys. Currently, purging doesn't look at the data in the database at all.

  • currentKey
    We implement getDelegationTokenSeqNum and getCurrentKeyId of AbstractDelegationTokenSecretManager in RouterDelegationTokenSecretManager.
    This information is not returned from local variables; it is returned after querying the shared storage.

  • currentTokens
    We store the RMDelegationTokenIdentifier in the local currentTokens. AbstractDelegationTokenSecretManager has a token remover thread (generally executed once every 5 seconds) which deletes expired tokens. If the client's token has expired, the query returns null.

RouterDelegationTokenSecretManager#getTokenInfo

// First check if I have this..
    DelegationTokenInformation tokenInfo = currentTokens.get(ident);
    if (tokenInfo == null) {
      try {
        RouterRMTokenResponse response = federationFacade.getTokenByRouterStoreToken(ident);
        RouterStoreToken routerStoreToken = response.getRouterStoreToken();
        String tokenStr = routerStoreToken.getTokenInfo();
        byte[] tokenBytes = Base64.getUrlDecoder().decode(tokenStr);
        tokenInfo = RouterDelegationTokenSupport.decodeDelegationTokenInformation(tokenBytes);
      } catch (Exception e) {
        LOG.error("Error retrieving tokenInfo [" + ident.getSequenceNumber()
            + "] from StateStore.", e);
        throw new YarnRuntimeException(e);
      }
    }
    return tokenInfo;

AbstractDelegationTokenSecretManager#removeExpiredToken

private void removeExpiredToken() throws IOException {
    long now = Time.now();
    Set<TokenIdent> expiredTokens = new HashSet<>();
    synchronized (this) {
      Iterator<Map.Entry<TokenIdent, DelegationTokenInformation>> i =
          currentTokens.entrySet().iterator();
      while (i.hasNext()) {
        Map.Entry<TokenIdent, DelegationTokenInformation> entry = i.next();
        long renewDate = entry.getValue().getRenewDate();
        if (renewDate < now) {
          expiredTokens.add(entry.getKey());
          removeTokenForOwnerStats(entry.getKey());
          i.remove();
        }
      }
    }
    // don't hold lock on 'this' to avoid edit log updates blocking token ops
    logExpireTokens(expiredTokens);
  }
  • allKeys
    We don't need to use allKeys except for testing, so there is no need to guarantee that it is accurate. I also don't recommend implementing this method, because if our cluster is large it will waste a lot of memory without much value.

@slfan1989
Contributor

slfan1989 commented Mar 2, 2023

allKeys needs to be consistently updated across all router instances

Multiple Routers share and store the delegation tokens, so there is no need to update them across all Router instances.

That is not the actual behaviour currently. Each Router instance has its own set of master keys (allKeys & currentKey; these are set up on service startup through startThreads() and updated in rollMasterKey()). Even though they are stored in the database, a master key isn't looked up from the database but just returned from the in-memory variables (allKeys & currentKey). So a Router instance can't renew tokens generated by another Router instance.

And even delegation tokens are not consistently updated across Router instances. If a delegation token is present in the currentTokens variable of multiple Router instances but updated in only one of them (on token renewal), the other Router instances will use their own in-memory currentTokens rather than look up the database, and thus can incorrectly report the token as expired.

In a previous comment I already explained that we get data from shared storage, so the situation described should not happen.

Example:

We have 3 routers, namely routerA, routerB, and routerC, and we have 1 client client1

  • Client1 applies for a token from RouterA; RouterA stores the token in memory and at the same time writes it into ZK or the DB, then returns this token to the client. We call it TokenA(user=Client1,expireDate=2023-03-02 16:02:00...).
  • Client1 asks RouterB for the token. At this point RouterB does not have the token in memory, so RouterB queries ZK or the DB and then stores TokenA in memory.
  • Client1 sends renewToken to RouterC. After RouterC renews the token, TokenA is updated; we call the result TokenB(user=Client1,expireDate=2023-03-02 17:02:00...) (the expiration time differs from TokenA). RouterC then stores TokenB in ZK or the DB.
  • Once TokenA has expired, the original TokenA is removed by the cleanup threads of RouterA and RouterB.
  • As long as TokenB does not expire, Client1 can query any Router (A/B/C) and get an accurate response.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 17m 59s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 9 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 14m 51s Maven dependency ordering for branch
+1 💚 mvninstall 28m 15s trunk passed
+1 💚 compile 25m 25s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 compile 21m 55s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 checkstyle 4m 4s trunk passed
+1 💚 mvnsite 4m 18s trunk passed
+1 💚 javadoc 3m 21s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 2m 46s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 spotbugs 7m 17s trunk passed
+1 💚 shadedclient 23m 17s branch has no errors when building and testing our client artifacts.
-0 ⚠️ patch 23m 38s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 24s Maven dependency ordering for patch
+1 💚 mvninstall 2m 54s the patch passed
+1 💚 compile 24m 31s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javac 24m 31s the patch passed
+1 💚 compile 21m 50s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 javac 21m 50s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 3m 51s /results-checkstyle-root.txt root: The patch generated 89 new + 47 unchanged - 8 fixed = 136 total (was 55)
+1 💚 mvnsite 4m 14s the patch passed
-1 ❌ javadoc 1m 0s /patch-javadoc-hadoop-common-project_hadoop-common-jdkUbuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1.txt hadoop-common in the patch failed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1.
-1 ❌ javadoc 0m 36s /patch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-router-jdkUbuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1.txt hadoop-yarn-server-router in the patch failed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1.
-1 ❌ javadoc 0m 33s /patch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-router-jdkPrivateBuild-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09.txt hadoop-yarn-server-router in the patch failed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09.
+1 💚 spotbugs 7m 52s the patch passed
+1 💚 shadedclient 23m 45s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 18m 5s /patch-unit-hadoop-common-project_hadoop-common.txt hadoop-common in the patch passed.
-1 ❌ unit 3m 17s /patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-common.txt hadoop-yarn-server-common in the patch passed.
+1 💚 unit 101m 43s hadoop-yarn-server-resourcemanager in the patch passed.
+1 💚 unit 0m 43s hadoop-yarn-server-router in the patch passed.
+1 💚 asflicense 0m 54s The patch does not generate ASF License warnings.
377m 33s
Reason Tests
Failed junit tests hadoop.security.token.delegation.TestDelegationToken
hadoop.yarn.server.federation.store.impl.TestMemoryFederationStateStore
hadoop.yarn.server.federation.utils.TestFederationStateStoreFacade
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5443/2/artifact/out/Dockerfile
GITHUB PR #5443
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux ca19a0485fad 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 0e8f950
Default Java Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5443/2/testReport/
Max. process+thread count 1244 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5443/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

private long tokenMaxLifetime;
private long tokenRemoverScanInterval;
private long tokenRenewInterval;
private volatile DelegationKey currentKey;
Author

I have set it to volatile to make sure currentKey changes are visible across threads.
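The concern can be sketched with a simplified stand-in class (hypothetical names, not the patch itself): without volatile, a key roll performed on the roller thread may never become visible to RPC handler threads; a volatile write/read pair establishes the needed happens-before edge. The field here is written by a single roller thread, so the non-atomic update is safe.

```java
// Hypothetical stand-in for the secret manager's currentKey handling.
public class CurrentKeyVisibility {
  // volatile: the roller thread's write happens-before any reader's read
  private volatile int currentKeyId = 1;

  // Called only by the single key-roller thread, so ++ is safe here
  // despite not being atomic.
  void rollMasterKey() {
    currentKeyId++;
  }

  // Called concurrently by handler threads; sees the latest roll.
  int getCurrentKeyId() {
    return currentKeyId;
  }

  public static void main(String[] args) throws InterruptedException {
    CurrentKeyVisibility mgr = new CurrentKeyVisibility();
    Thread roller = new Thread(mgr::rollMasterKey);
    roller.start();
    roller.join(); // join also orders memory for the check below
    if (mgr.getCurrentKeyId() != 2) {
      throw new AssertionError("key roll not visible");
    }
    System.out.println("key roll visible");
  }
}
```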


Set<DelegationKey> rmDTMasterKeyState = routerRMSecretManagerState.getMasterKeyState();
if (rmDTMasterKeyState.contains(delegationKey)) {
Map<Integer, DelegationKey> rmDTMasterKeyState = routerRMSecretManagerState.getMasterKeyState();
Author

I have replaced the Set with a Map, as we need to look up objects by their unique key ids rather than by the object itself.
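The Set-to-Map change can be sketched as follows (simplified stand-in types, hypothetical names): with a Set, contains() depends on object equality of a DelegationKey that callers may re-create per request, whereas keying a Map by the integer keyId makes the lookup independent of which object instance the caller holds.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of looking up master keys by keyId instead of by
// the DelegationKey object itself.
public class MasterKeyLookup {
  static final class DelegationKey { // simplified stand-in
    final int keyId;
    final byte[] keyBytes;
    DelegationKey(int keyId, byte[] keyBytes) {
      this.keyId = keyId;
      this.keyBytes = keyBytes;
    }
  }

  private final Map<Integer, DelegationKey> masterKeyState = new HashMap<>();

  void storeKey(DelegationKey key) {
    masterKeyState.put(key.keyId, key);
  }

  // A freshly built request object only needs to carry the keyId;
  // the stored key is still found.
  DelegationKey lookup(int keyId) {
    return masterKeyState.get(keyId);
  }

  public static void main(String[] args) {
    MasterKeyLookup store = new MasterKeyLookup();
    store.storeKey(new DelegationKey(7, new byte[]{1, 2, 3}));
    if (store.lookup(7) == null) {
      throw new AssertionError("lookup by keyId failed");
    }
    System.out.println("found key by id");
  }
}
```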

RMDelegationTokenIdentifier tokenIdentifier =
(RMDelegationTokenIdentifier) storeToken.getTokenIdentifier();
Map<RMDelegationTokenIdentifier, RouterStoreToken> rmDTState =
Map<Integer, RouterStoreToken> rmDTState =
Author

I have replaced the map key with the id (sequence number) of the delegation token rather than the token identifier object itself, because token identifier objects can be created repeatedly (as part of different requests) for the same sequence number.

*
* @return CurrentKeyId.
*/
int getCurrentKeyId();
Author

I have removed these methods because they aren't required for a stateless secret manager; they are only needed to support recovery in RM/NN-like systems where only one node generates tokens.

With a stateless secret manager, there is no explicit recovery mechanism because, by design, the secret manager stores all data in the database, so these methods aren't required.

@@ -1,201 +0,0 @@
/**
* Licensed to the Apache Software Foundation (ASF) under one
Author

I have moved this class to the router.security package rather than router.secure, and refactored the tests to exercise the public methods of the Router secret manager rather than testing it internally.

These test cases validate the database storing/retrieval logic through the public API, which makes them independent of the database implementations.

@krishan1390
Author

@slfan1989 @goiri let me try to summarise my changes better

  1. I am refactoring the Router secret manager into a completely stateless setup which now provides read-after-write consistency. As part of the change, I am doing away with instance variables and relying only on database reads/writes. This is to better handle edge cases like:
  • cancel token is called on one Router instance and the token is cancelled there, but it isn't cancelled on another Router instance because it's already cached in instance variables.
  • token cleanup on expiry requires all tokens to be in instance variables across Routers. This isn't true in a stateless setup where nodes are autoscaled up/down on demand.
  • instance variables are updated but the database update fails. Subsequent requests then get one answer from the Router instance where the instance variables were updated and a different answer from the other Router instances (which don't see the database update).

This also makes the design more extensible for future use cases where the delegation token object contains more mutable data. I have added a bunch of test cases to showcase these edge cases better. It would be useful to comment on the individual test cases if there are any concerns.

  2. I have explicitly handled methods which don't apply to stateless Routers, like setSequenceNo and getSequenceNo. These methods were primarily required in the RM / NN to support recovery, but in a stateless setup they aren't required because, by design, multiple Router instances can serve tokens.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 53s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 11 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 15m 23s Maven dependency ordering for branch
+1 💚 mvninstall 28m 30s trunk passed
+1 💚 compile 25m 8s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 compile 21m 44s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 checkstyle 4m 4s trunk passed
+1 💚 mvnsite 4m 19s trunk passed
+1 💚 javadoc 3m 22s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 2m 47s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 spotbugs 7m 20s trunk passed
+1 💚 shadedclient 23m 15s branch has no errors when building and testing our client artifacts.
-0 ⚠️ patch 23m 36s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 23s Maven dependency ordering for patch
+1 💚 mvninstall 2m 54s the patch passed
+1 💚 compile 24m 31s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javac 24m 31s the patch passed
+1 💚 compile 21m 49s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 javac 21m 49s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 3m 56s /results-checkstyle-root.txt root: The patch generated 87 new + 82 unchanged - 8 fixed = 169 total (was 90)
+1 💚 mvnsite 4m 16s the patch passed
-1 ❌ javadoc 0m 34s /patch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-router-jdkUbuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1.txt hadoop-yarn-server-router in the patch failed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1.
-1 ❌ javadoc 0m 33s /patch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-router-jdkPrivateBuild-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09.txt hadoop-yarn-server-router in the patch failed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09.
+1 💚 spotbugs 7m 51s the patch passed
+1 💚 shadedclient 20m 23s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 18m 37s hadoop-common in the patch passed.
-1 ❌ unit 3m 22s /patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-common.txt hadoop-yarn-server-common in the patch passed.
+1 💚 unit 102m 55s hadoop-yarn-server-resourcemanager in the patch passed.
+1 💚 unit 0m 44s hadoop-yarn-server-router in the patch passed.
+1 💚 asflicense 0m 54s The patch does not generate ASF License warnings.
359m 30s
Reason Tests
Failed junit tests hadoop.yarn.server.federation.store.impl.TestMemoryFederationStateStore
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5443/3/artifact/out/Dockerfile
GITHUB PR #5443
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 26951b800d79 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 7a73e38
Default Java Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5443/3/testReport/
Max. process+thread count 1234 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5443/3/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 54s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 11 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 14m 49s Maven dependency ordering for branch
+1 💚 mvninstall 28m 41s trunk passed
+1 💚 compile 25m 12s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 compile 21m 48s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 checkstyle 4m 4s trunk passed
+1 💚 mvnsite 4m 19s trunk passed
+1 💚 javadoc 3m 19s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 2m 45s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 spotbugs 7m 16s trunk passed
+1 💚 shadedclient 23m 14s branch has no errors when building and testing our client artifacts.
-0 ⚠️ patch 23m 35s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 23s Maven dependency ordering for patch
+1 💚 mvninstall 2m 49s the patch passed
+1 💚 compile 24m 32s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javac 24m 32s the patch passed
+1 💚 compile 23m 33s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 javac 23m 33s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 4m 21s /results-checkstyle-root.txt root: The patch generated 86 new + 82 unchanged - 8 fixed = 168 total (was 90)
+1 💚 mvnsite 4m 45s the patch passed
+1 💚 javadoc 3m 11s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 2m 46s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 spotbugs 8m 4s the patch passed
+1 💚 shadedclient 23m 25s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 18m 18s hadoop-common in the patch passed.
+1 💚 unit 3m 18s hadoop-yarn-server-common in the patch passed.
+1 💚 unit 102m 19s hadoop-yarn-server-resourcemanager in the patch passed.
+1 💚 unit 0m 41s hadoop-yarn-server-router in the patch passed.
+1 💚 asflicense 0m 52s The patch does not generate ASF License warnings.
363m 41s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5443/4/artifact/out/Dockerfile
GITHUB PR #5443
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 2cf92e293a7b 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 6ff0d7d
Default Java Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5443/4/testReport/
Max. process+thread count 3134 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5443/4/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@slfan1989
Contributor

@krishan1390 Thank you very much for your contribution, I will take some time to read the code.

//Log must be invoked outside the lock on 'this'
logUpdateMasterKey(newKey);
synchronized (this) {
storeDelegationKey(newKey);


While I understand that we shouldn't update currentKey while some other thread is using it, does storeDelegationKey(newKey) also need to be inside the synchronized block?

Author


Actually, even updating currentKey needn't be in the synchronized block, because currentKey is volatile, so the update will be visible to other threads.

I had originally removed synchronized here, but added it back to keep that change out of this PR and make it separately.

I wanted to keep the PR focused specifically on what's required for a reliable stateless setup; in this particular case, storing in the DB first before updating currentKey.


Won't updating currentKey still need to be synchronized? If some other thread is inside createPassword() and currentKey changes between identifier.setMasterKeyId(currentKey.getKeyId()) and createPassword(identifier.getBytes(), currentKey.getKey()), won't that result in inconsistencies?
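One lock-free way to avoid that inconsistency is to read the volatile field once into a local variable, so both uses operate on the same key even if a roll happens concurrently. A minimal sketch under that assumption (the class and method names are illustrative, not the actual AbstractDelegationTokenSecretManager code):

```java
// Hypothetical sketch: snapshot the volatile currentKey into a local so a
// concurrent rollMasterKey() cannot change the key between its two uses.
// Names are invented for illustration, not the real Hadoop API.
public class KeySnapshotSketch {
    static class DelegationKey {
        final int keyId;
        DelegationKey(int keyId) { this.keyId = keyId; }
    }

    private volatile DelegationKey currentKey = new DelegationKey(1);

    // Unsafe pattern: two separate reads of the volatile field; a roll in
    // between yields a mismatched (identifier keyId, signing keyId) pair.
    int[] createPasswordUnsafe() {
        int idForIdentifier = currentKey.keyId;
        // ... a concurrent rollMasterKey() could run here ...
        int idForSigning = currentKey.keyId;
        return new int[]{idForIdentifier, idForSigning};
    }

    // Safe without synchronization: one volatile read, then only the local
    // copy is used, so both values always come from the same key.
    int[] createPasswordSafe() {
        DelegationKey key = currentKey;  // single volatile read
        return new int[]{key.keyId, key.keyId};
    }

    void rollMasterKey() {
        currentKey = new DelegationKey(currentKey.keyId + 1);
    }

    public static void main(String[] args) {
        KeySnapshotSketch s = new KeySnapshotSketch();
        int[] pair = s.createPasswordSafe();
        System.out.println(pair[0] == pair[1]);  // always a consistent pair
        s.rollMasterKey();
        System.out.println(s.createPasswordSafe()[0]);
    }
}
```

With the snapshot, a concurrent roll only affects which key the whole password is built from, never leaves the identifier and signature on different keys.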

* Current master key will be stored in memory on each instance and will be used to generate new tokens.
* Master key will be looked up from the state store for Validation / renewal, etc of tokens.
*
* 2) Token Expiry - It doesn't take care of token removal on expiry.


Can we also mention master key expiry?

public void storeNewToken(RMDelegationTokenIdentifier identifier,
long renewDate) throws IOException {
try {
federationFacade.storeNewToken(identifier, renewDate);


In SQLFederationStateStore I see we are making two queries: one to add the new token, and a second to read the token back. Is the second query needed? (I wasn't able to add a review comment there, so I'm adding it here.)

Author


Yeah, that's not required. I was planning to raise a separate PR for the SQLFederationStateStore changes; I'll keep this PR focused on the changes required for stateless delegation token management.

public void storeNewMasterKey(DelegationKey newKey) {
protected void storeNewMasterKey(DelegationKey newKey) throws IOException {
try {
federationFacade.storeNewMasterKey(newKey);


Is federationFacade being initialized with the actual conf (one that points to the SQL state store, say)? FederationStateStoreFacade.getInstance() will always return an instance initialized with the default conf.

Author


Yeah, that's fair. Let me correct it.

* @throws IOException raised on errors performing I/O.
*/
protected DelegationTokenInformation getTokenInfo(TokenIdent ident) {
protected DelegationTokenInformation getTokenInfo(TokenIdent ident) throws IOException {
Author


This method is not just a KV lookup; it actually compares all attributes of TokenIdent (maxDate, masterKeyId, owner, etc.). This is important because with a bare KV lookup, any user could create a TokenIdent object with a guessed key (sequence number) and get authenticated, since the RM only checks for the presence of the token.

The corresponding change needs to be made in the stateless secret manager.
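The full-attribute comparison can be sketched as follows. The TokenIdent fields and the in-memory store here are simplified stand-ins for the real classes, invented for illustration:

```java
// Hypothetical sketch of why token verification must compare the full
// identifier rather than only checking presence under the sequence-number
// key. Class and field names are illustrative, not the actual Hadoop API.
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class TokenLookupSketch {
    static class TokenIdent {
        final int sequenceNumber;   // the lookup key
        final int masterKeyId;
        final String owner;
        final long maxDate;
        TokenIdent(int seq, int keyId, String owner, long maxDate) {
            this.sequenceNumber = seq; this.masterKeyId = keyId;
            this.owner = owner; this.maxDate = maxDate;
        }
    }

    private final Map<Integer, TokenIdent> store = new HashMap<>();

    void storeToken(TokenIdent ident) {
        store.put(ident.sequenceNumber, ident);
    }

    // Accepts the token only if every attribute matches what was stored;
    // a bare store.containsKey(seq) check would accept forged identifiers.
    boolean verifyToken(TokenIdent presented) {
        TokenIdent stored = store.get(presented.sequenceNumber);
        return stored != null
            && stored.masterKeyId == presented.masterKeyId
            && Objects.equals(stored.owner, presented.owner)
            && stored.maxDate == presented.maxDate;
    }

    public static void main(String[] args) {
        TokenLookupSketch sm = new TokenLookupSketch();
        sm.storeToken(new TokenIdent(42, 7, "alice", 1000L));
        // A forged identifier reusing a known sequence number is rejected.
        System.out.println(sm.verifyToken(new TokenIdent(42, 7, "mallory", 9999L)));
        // The genuine identifier matches on every attribute.
        System.out.println(sm.verifyToken(new TokenIdent(42, 7, "alice", 1000L)));
    }
}
```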

@krishan1390
Copy link
Author

@slfan1989 @goiri do you have any concerns about the PR? We plan to merge it soon unless there are objections.

@github-actions
Contributor

We're closing this stale PR because it has been open for 100 days with no activity. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you feel like this was a mistake, or you would like to continue working on it, please feel free to re-open it and ask for a committer to remove the stale tag and review again.
Thanks all for your contribution.

@github-actions github-actions bot added the Stale label Oct 25, 2025
@github-actions github-actions bot closed this Oct 26, 2025