
Conversation

@Kehrlann
Contributor

Description

Fix for #1146. Tested with the "load test" application below, using logging.level.cloudfoundry-client.token=DEBUG to observe which tokens are being requested.

Details in the commit message.

Commit message

  • The "add a UAA token to HTTP request headers" flow works like this:
    • User makes a request using AbstractReactorOperations -> Operator
    • This adds an "Authorization" header, using Provider#getToken mono ; if a value is cached it uses that, otherwise it uses whatever flow is available to request a token.
    • If the response is unauthorized, it means the access token is expired, and the Operator calls Provider#invalidate ; and then retries the request, which will trigger another #getToken call.
  • There was a race condition, when an access_token was cached and multiple request used it concurrently, they would all call AbstractUaaTokenProvider#invalidate, and all reuse the same refresh token. This is an issue when the UAA is configured with non-reusable refresh tokens (revocable + rotating or unique), only the first refresh token request succeeds, and all other refresh token requests fail.
  • This PR addresses this by ensuring that the cached refresh token is removed from the cache right before being used. Any other call to #invalidate will be a no-op.
  • This is NOT a perfect fix, and there are some smaller scale race conditions happening. For example, #invalidate calls refreshTokens.remove and accessTokens.put sequentially. It is possible that a concurrent request calls invalidate, finds the refreshTokens cache empty, and then will populate accessTokens through #getToken ; in that case there could be a race condition and two tokens fetched.
  • Re-architecting the whole token logic is too big of a lift for the project, so we accept that this solution is not perfect - as long as the issues are recoverable.
  • Fixes Parallel Requests with Expired Access Tokens triggering Refresh Token Flow leads to Broken State (no further requests possible) #1146
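
For illustration, below is a minimal, standalone sketch of the "remove the refresh token from the cache right before using it" idea. All names (SingleUseRefreshTokenCache, primaryGrant, refreshGrant, connectionKey) are hypothetical and do not mirror the actual AbstractUaaTokenProvider internals; the point is only that an atomic ConcurrentHashMap.remove lets exactly one of N concurrent callers win the refresh token, while the others fall back to the primary grant.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

import reactor.core.publisher.Mono;

// Hypothetical sketch, not the actual cf-java-client implementation.
class SingleUseRefreshTokenCache {

    private final ConcurrentMap<String, String> refreshTokens = new ConcurrentHashMap<>();

    void store(String connectionKey, String refreshToken) {
        refreshTokens.put(connectionKey, refreshToken);
    }

    // Called after a 401: the refresh token is atomically removed *before* being used,
    // so among N concurrent callers only the first one gets it. The others see an empty
    // cache and fall back to the primary grant (e.g. password or client_credentials)
    // instead of replaying a refresh token that a rotating/unique UAA already revoked.
    Mono<String> refreshAccessToken(String connectionKey,
                                    Mono<String> primaryGrant,
                                    Function<String, Mono<String>> refreshGrant) {
        return Mono.defer(() -> {
            String refreshToken = refreshTokens.remove(connectionKey); // atomic take
            if (refreshToken == null) {
                // Effective no-op for the refresh token: another caller is already
                // refreshing (or nothing was cached), so request a brand-new token.
                return primaryGrant;
            }
            return refreshGrant.apply(refreshToken);
        });
    }
}

// Note: as the bullets above say, this is not airtight. The window between removing the
// refresh token and caching the new access token still allows two concurrent callers to
// each fetch a token; that is accepted as long as it is recoverable.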

Appendix: load test app

This is not optimal and does too many things, but it showcases the bug. It does the following:

  1. Display uaa identity zone config
  2. Delete orgs starting with test-load-
  3. Create 20 orgs test-load-*, and a space in each
  4. Wait for 5 minutes so that the token expires (in my own UAA setup)
  5. List the spaces of all 20 orgs concurrently, making 20 parallel requests with an expired token.

Run with logging.level.cloudfoundry-client.token=DEBUG to observe when tokens are requested.
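
For reference, a hypothetical application.properties for running the app; the property names are the ones injected via @Value in the code below, and all values are placeholders:

# Placeholder values; property names match the @Value injections in the app below.
test.apiHost=api.example.test
test.admin.username=admin
test.admin.password=change-me
test.admin.clientId=admin
test.admin.clientSecret=change-me
test.skipSslValidation=true

# Optional proxy settings
#test.proxy.host=
#test.proxy.port=8080
#test.proxy.username=
#test.proxy.password=

# Log which tokens are requested
logging.level.cloudfoundry-client.token=DEBUG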

@SpringBootApplication(proxyBeanMethods = true)
public class LoadtestApplication {

    private static final String ORG_PREFIX = "test-load-";

    public static void main(String[] args) {
        SpringApplication.run(LoadtestApplication.class, args);
    }

    private final Logger logger = LoggerFactory.getLogger("load-test");

    @Bean
    DefaultConnectionContext connectionContext(
            @Value("${test.apiHost}") String apiHost,
            @Value("${test.proxy.host:}") String proxyHost,
            @Value("${test.proxy.password:}") String proxyPassword,
            @Value("${test.proxy.port:8080}") Integer proxyPort,
            @Value("${test.proxy.username:}") String proxyUsername,
            @Value("${test.skipSslValidation:false}") Boolean skipSslValidation) {

        DefaultConnectionContext.Builder connectionContext =
                DefaultConnectionContext.builder()
                        .apiHost(apiHost)
                        .skipSslValidation(skipSslValidation)
                        .sslHandshakeTimeout(Duration.ofSeconds(30));

        if (StringUtils.hasText(proxyHost)) {
            ProxyConfiguration.Builder proxyConfiguration =
                    ProxyConfiguration.builder().host(proxyHost).port(proxyPort);

            if (StringUtils.hasText(proxyUsername)) {
                proxyConfiguration.password(proxyPassword).username(proxyUsername);
            }

            connectionContext.proxyConfiguration(proxyConfiguration.build());
        }

        return connectionContext.build();
    }


    @Bean
    ReactorCloudFoundryClient cfClient(
            ConnectionContext connectionContext,
            @Value("${test.admin.password}") String password,
            @Value("${test.admin.username}") String username) {
        return ReactorCloudFoundryClient.builder()
                .connectionContext(connectionContext)
                .tokenProvider(
                        PasswordGrantTokenProvider.builder()
                                .password(password)
                                .username(username)
                                .build())
                .build();
    }

    @Bean
    CloudFoundryOperations cfOps(CloudFoundryClient cfClient) {
        return DefaultCloudFoundryOperations.builder()
                .cloudFoundryClient(cfClient)
                .build();
    }

    @Bean
    @Qualifier("admin")
    ReactorUaaClient adminUaaClient(
            ConnectionContext connectionContext,
            @Value("${test.admin.clientId}") String clientId,
            @Value("${test.admin.clientSecret}") String clientSecret) {
        return ReactorUaaClient.builder()
                .connectionContext(connectionContext)
                .tokenProvider(
                        ClientCredentialsGrantTokenProvider.builder()
                                .clientId(clientId)
                                .clientSecret(clientSecret)
                                .build())
                .build();
    }

    @Bean
    ApplicationRunner appRunner(CloudFoundryClient client, CloudFoundryOperations ops, ReactorUaaClient uaaClient) {
        return args -> {
            // Ensure that refresh tokens are revocable and do rotate (eyeballing it)
            var identityZones = uaaClient.identityZones().list(ListIdentityZonesRequest.builder().build()).block().getIdentityZones();
            System.out.println(identityZones);

            System.out.println("👋 hiiiii\n\n");
            System.out.println(client.info().get(GetInfoRequest.builder().build()).block());
            System.out.println("\n\n👋 hooo\n\n");
            var deletedOrgs = cleanOrgs(client).count().block();
            System.out.printf("Deleted %s orgs.%n%n", deletedOrgs);
            // Delete is not fully synchronous / consistent, so we wait for a bit
            for (int i = 0; i < 5; i++) {
                Thread.sleep(1000);

                Long remainingOrgs = countMatchingOrgs(ops).block();
                System.out.printf("Waiting for org deletion... %s remaining.%n", remainingOrgs);
                if (remainingOrgs == 0) break;
            }
            var createdOrgs = seedOrgs(client).collectList().block();
            System.out.println(createdOrgs);

            var startTime = Instant.now();
            var waitDuration = Duration.ofMinutes(5).plusSeconds(10);
            var cutoffTime = startTime.plus(waitDuration);

            while (Instant.now().isBefore(cutoffTime)) {
                System.out.printf("Waiting for token to expire, %s left%n", Duration.between(Instant.now(), cutoffTime));
                Thread.sleep(10_000);
            }

            var spaces = listSpaces(client, createdOrgs).collectList().block();
            System.out.println(spaces);
            System.out.printf("Deleted %s orgs.%n%n", deletedOrgs);
        };
    }

    static Flux<String> listSpaces(CloudFoundryClient client, List<String> orgIds) {
        return Flux.fromIterable(orgIds)
                .flatMap(id -> client.spacesV3().list(ListSpacesRequest.builder()
                        .organizationIds(id)
                        .build())
                )
                .flatMapIterable(ListSpacesResponse::getResources)
                .map(SpaceResource::getName);
    }

    static Flux<String> cleanOrgs(CloudFoundryClient client) {
        return PaginationUtils.requestClientV3Resources(page -> client.organizationsV3()
                        .list(
                                ListOrganizationsRequest.builder()
                                        .page(page)
                                        .build()
                        ))
                .filter(org -> org.getName().startsWith(ORG_PREFIX))
                .doOnNext(r -> System.out.printf("Deleting ... %s%n", r.getName()))
                .flatMap(o ->
                        client.organizationsV3().delete(
                                        DeleteOrganizationRequest.builder()
                                                .organizationId(o.getId())
                                                .build()
                                )
                                .doOnNext(r -> System.out.printf("Deleted org %s%n", o.getName()))
                )
                .switchIfEmpty(Mono.just("No orgs found, no deletion required"));
    }

    static Mono<Long> countMatchingOrgs(CloudFoundryOperations ops) {
        return ops.organizations()
                .list()
                .filter(o -> o.getName().startsWith(ORG_PREFIX))
                .count();
    }


    static Flux<String> seedOrgs(CloudFoundryClient client) {
        return Flux.fromStream(IntStream.range(1, 21).boxed())
                .flatMap(i -> client.organizationsV3().create(
                        CreateOrganizationRequest.builder()
                                .name(ORG_PREFIX + i)
                                .build()
                ))
                .doOnNext(o -> System.out.printf("Created org: %s%n", o.getName()))
                // then create a space in each newly created org
                .flatMap(o -> client.spacesV3().create(
                                        CreateSpaceRequest.builder()
                                                .relationships(
                                                        SpaceRelationships.builder()
                                                                .organization(ToOneRelationship.builder()
                                                                        .data(Relationship.builder().id(o.getId()).build())
                                                                        .build()
                                                                )
                                                                .build()
                                                )
                                                .name("space-" + o.getName())
                                                .build()
                                )
                                .doOnNext(s -> System.out.printf("      Created space %s%n", s.getName()))
                                .then(Mono.just(o))
                )
                .map(CreateOrganizationResponse::getId);
    }

}

- The "add a UAA token to HTTP request headers" flow works like this:
  - User makes a request using AbstractReactorOperations -> Operator
  - This adds an "Authorization" header, using Provider#getToken mono ;
    if a value is cached it uses that, otherwise it uses whatever flow
    is available to request a token.
  - If the response is unauthorized, it means the access token is
    expired, and the Operator calls Provider#invalidate ; and then
    retries the request, which will trigger another #getToken call.
- There was a race condition, when an access_token was cached and
  multiple request used it concurrently, they would all call
  AbstractUaaTokenProvider#invalidate, and all reuse the same refresh
  token. This is an issue when the UAA is configured with non-reusable
  refresh tokens (revocable + rotating or unique), only the first
  refresh token request succeeds, and all other refresh token requests
  fail.
- This PR addresses this by ensuring that the cached refresh token is
  removed from the cache right before being used. Any other call to
  #invalidate will be a no-op.
- This is NOT a perfect fix, and there are some smaller scale race
  conditions happening. For example, #invalidate calls
  refreshTokens.remove and accessTokens.put sequentially. It is possible
  that a concurrent request calls invalidate, finds the refreshTokens
  cache empty, and then will populate accessTokens through #getToken ;
  in that case there could be a race condition and two tokens fetched.
- Re-architecting the whole token logic is too big of a lift for the
  project, so we accept that this solution is not perfect - as long as
  the issues are recoverable.
- Fixes #1146

Signed-off-by: Daniel Garnier-Moiroux <daniel.garnier-moiroux@broadcom.com>
@Kehrlann
Contributor Author

@eaglerainbow here's a different take on #1300, that leverages the refreshTokens concurrent hashmap as a signal for "currently refreshing a token".

If you could test this code with your own scenarios, that'd be great. I've tested it against your AbstractUaaTokenProviderConcurrencyTest tests for #1300, and it passes; unfortunately, that test also passes on my machine with the unchanged code in 5.14, so it does not provide extra confidence.

I've spent some time discussing #1146 with a Project Reactor contributor. Note that this is NOT perfect, but achieving perfection here would require a significant rework of the entire Operator architecture.

@eaglerainbow

eaglerainbow commented Oct 20, 2025

Affirm.
I will try to do so. Please note that I cannot begin before the weekend, as too many other things are currently on my plate.
I will report back as soon as possible.

@Kehrlann
Contributor Author

@eaglerainbow please take your time, it took me over a month to get to this issue 😅

@eaglerainbow

@eaglerainbow please take your time, it took me over a month to get to this issue 😅

I just built a first new version of the main project in which cf-java-client is used.
It still contains both fixes: your approach here and mine. It successfully passed my integration test suite (which does not have an explicit test for #1146, but otherwise verifies that the normal business flow did not suffer). A first brief test in a live environment of mine also indicates that we haven't broken anything vital. What may be noteworthy, though, is that I believe I am seeing roughly 3-5 times higher CPU load than I am used to (we are still in manageable ranges in my setup, though). So perhaps it makes sense to take a second look in this direction (note: I bumped directly from 5.12.0 to 7fa43c8, so the increased CPU consumption could also be caused by something else!). Please let me know if I should try to triangulate more details on this.

Having built a second version that no longer contains my workaround, I observe even slightly higher CPU load (all other environmental factors are exactly the same).

#1146 will not surface right away in my environment. I'll keep the installed second version running for now. It will take at least a week of runtime before we can be confident that we have caught the bug.

@Kehrlann I'll keep you posted on the results.

@Kehrlann
Contributor Author

If I understand correctly:

  1. Without the fix, you get ~1x (baseline) CPU usage
  2. With both our fixes, you get 3-5x CPU usage
  3. With just my fix, you get 5x+ CPU usage?

Could you please take a look at the CPU consumption with a build using 5.14, but not our fixes?

@eaglerainbow

That's correct, at least that was my initial impression.
After 48+ hours of runtime, I see that the CPU usage peaks have come down a little and we are at roughly a factor of 2. Perhaps we are also observing an initial effect; e.g. I have some logic in there which should ensure that load gets evenly distributed across an epoch, and it may take a bit of time until this really works out.

For clarity: item 1 of your list was with cf-java-client 5.12, with my fix, and without your fix here.
(This shouldn't imply that I think my change would have been better! It's just for the purpose of clear communication and documentation.)

Could you please take a look at the CPU consumption with a build using 5.14, but not our fixes?

Will do, but only towards the end of this week. The background is that I don't want to interrupt the ongoing experiment by stopping the application (and for comparable values, I would have to).

Intermediate report: No major outage yet.
What is very prominent in the logs, though, is the following exception:

2025-10-27T12:11:46.060Z ERROR 9 --- [-client-epoll-3] cloudfoundry-client.token                : Refresh token grant error.
 
org.cloudfoundry.uaa.UaaException: invalid_token: The token expired, was revoked, or the token ID is incorrect.
        at org.cloudfoundry.reactor.util.ErrorPayloadMappers.lambda$uaa$8(ErrorPayloadMappers.java:104) ~[cloudfoundry-client-reactor-5.15.0.BUILD-fix-1146.jar!/:na]
        at org.cloudfoundry.reactor.util.ErrorPayloadMappers.lambda$mapToError$11(ErrorPayloadMappers.java:140) ~[cloudfoundry-client-reactor-5.15.0.BUILD-fix-1146.jar!/:na]
        at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:132) ~[reactor-core-3.6.15.jar!/:3.6.15]
        at reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:74) ~[reactor-core-3.6.15.jar!/:3.6.15]
        at reactor.core.publisher.FluxHandle$HandleSubscriber.onNext(FluxHandle.java:129) ~[reactor-core-3.6.15.jar!/:3.6.15]
        at reactor.core.publisher.FluxMap$MapConditionalSubscriber.onNext(FluxMap.java:224) ~[reactor-core-3.6.15.jar!/:3.6.15]
        at reactor.core.publisher.FluxDoFinally$DoFinallySubscriber.onNext(FluxDoFinally.java:113) ~[reactor-core-3.6.15.jar!/:3.6.15]
        at reactor.core.publisher.FluxHandleFuseable$HandleFuseableSubscriber.onNext(FluxHandleFuseable.java:194) ~[reactor-core-3.6.15.jar!/:3.6.15]
        at reactor.core.publisher.FluxContextWrite$ContextWriteSubscriber.onNext(FluxContextWrite.java:107) ~[reactor-core-3.6.15.jar!/:3.6.15]
        at reactor.core.publisher.Operators$BaseFluxToMonoOperator.completePossiblyEmpty(Operators.java:2097) ~[reactor-core-3.6.15.jar!/:3.6.15]
        at reactor.core.publisher.MonoCollectList$MonoCollectListSubscriber.onComplete(MonoCollectList.java:118) ~[reactor-core-3.6.15.jar!/:3.6.15]
        at reactor.core.publisher.FluxPeek$PeekSubscriber.onComplete(FluxPeek.java:260) ~[reactor-core-3.6.15.jar!/:3.6.15]
        at reactor.core.publisher.FluxMap$MapSubscriber.onComplete(FluxMap.java:144) ~[reactor-core-3.6.15.jar!/:3.6.15]
        at reactor.core.publisher.FluxDoFinally$DoFinallySubscriber.onComplete(FluxDoFinally.java:128) ~[reactor-core-3.6.15.jar!/:3.6.15]
        at reactor.core.publisher.FluxMap$MapSubscriber.onComplete(FluxMap.java:144) ~[reactor-core-3.6.15.jar!/:3.6.15]
        at reactor.netty.channel.FluxReceive.onInboundComplete(FluxReceive.java:413) ~[reactor-netty-core-1.1.28.jar!/:1.1.28]
        at reactor.netty.channel.ChannelOperations.onInboundComplete(ChannelOperations.java:455) ~[reactor-netty-core-1.1.28.jar!/:1.1.28]
        at reactor.netty.channel.ChannelOperations.terminate(ChannelOperations.java:509) ~[reactor-netty-core-1.1.28.jar!/:1.1.28]
        at reactor.netty.http.client.HttpClientOperations.onInboundNext(HttpClientOperations.java:815) ~[reactor-netty-http-1.1.28.jar!/:1.1.28]
        at reactor.netty.channel.ChannelOperationsHandler.channelRead(ChannelOperationsHandler.java:115) ~[reactor-netty-core-1.1.28.jar!/:1.1.28]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:107) ~[netty-codec-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[netty-codec-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) ~[netty-codec-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) ~[netty-handler-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1519) ~[netty-handler-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1377) ~[netty-handler-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1428) ~[netty-handler-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:530) ~[netty-codec-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:469) ~[netty-codec-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) ~[netty-codec-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1357) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:868) ~[netty-transport-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:799) ~[netty-transport-classes-epoll-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:501) ~[netty-transport-classes-epoll-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:399) ~[netty-transport-classes-epoll-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) ~[netty-common-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.119.Final.jar!/:4.1.119.Final]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.119.Final.jar!/:4.1.119.Final]
        at java.base/java.lang.Thread.run(Thread.java:1583) ~[na:na]

It first appeared a little over 24 hours after the start, and the next instance came another 24 hours later (which is the one I have copied here).
Besides the long log entry, though, I do not see any observable negative impact on the application at that point in time. Let's keep our fingers crossed.

@Kehrlann
Contributor Author

thanks for the detailed report!

I’m surprised at the CPU usage, as the only change I’ve added triggers when we invalidate a token, so, worst case scenario, it should only impact the app every X hours on token refresh. I’ll think about it.

This log error is probably because the refresh token does not rotate in the UAA, and expires after 1 day.

@eaglerainbow

eaglerainbow commented Nov 1, 2025

@Kehrlann I'll keep you posted on the results.

It's been a whole week since the installation. I did not have a single hang of the test promregator during that time frame. That makes me confident that this PR really caught the issue.
Moreover, I screened the logs of the entire test period and could not spot any unexpected behavior. This makes me confident that this PR is a "Good Idea"™️.

I’m surprised at the CPU usage, as the only change I’ve added triggers when we invalidate a token, so, worst case scenario, it should only impact the app every X hours on token refresh. I’ll think about it.

If you've started looking into this, don't continue.
Luckily I remembered this morning that my application exposes some metrics from the underlying Java virtual machine. Amongst those metrics there is also one that measures the CPU time the application itself consumes (promregator_process_cpu_seconds_total, fwiw). Analyzing that a little, I came up with the following graph:

[image: grafik]

The yellow bar splits the values at the point in time when I installed the test version (i.e. to the left it's the "old version", to the right it's the version under test).
I read the diagram as showing that the overall average CPU consumption stays roughly the same. Yes, there are some spikes, and yes, we have some new lows and highs, but we are not talking about "factors more load than before".

What I might have seen before is within the blue box in the diagram above. Zooming in there,

[image: image002]

you can observe quite a series of unusual peaks (note that the way the diagram is calculated implies a certain "moving average" effect, so individual readings may be even significantly higher). In my scenario these peaks could have all kinds of different root causes. I would not bet a dollar to a doughnut that this even originates from cf-java-client; I can imagine dozens of other effects that are more likely to produce this (in particular as it was "right after" starting the application).
Moreover, I also checked the absolute value of the metric (i.e. "seconds of CPU compute time consumed since last restart"), and within a 48-hour interval I could not spot any significant difference.
Essentially, with all these objective facts (and comparing them to my subjective perception of spot measurements before), I don't think it's worth digging much deeper here.

This log error is probably because the refresh token does not rotate in the UAA, and expires after 1 day.

Yep, sounds like it. Since the client can auto-recover from there by using the Password Grant again, the error message in the log is a little ugly but essentially harmless.

Given all this, I suggest merging this PR. I am looking forward to a new release so that I can officially bring this into promregator and all its users can benefit from it.

Thanks for your support!

@eaglerainbow

eaglerainbow commented Nov 1, 2025

If merged, this also closes #1300.

@Kehrlann
Contributor Author

Kehrlann commented Nov 3, 2025

Thank you very much for running a real, live experiment, and for thoroughly reporting the results and observations 🙇

Fixing some pipeline issues on my side, and then I'll merge this. I have a couple of PRs to get merged as well before I cut a release, but it shouldn't take too long.

@eaglerainbow

before I cut a release, but it shouldn't take too long.

Take your time. The issue has already celebrated its third anniversary, so a couple of days or weeks more or less doesn't make a big difference anymore 😝
