
KAFKA-16029: Fix "Unable to find FetchSessionHandler for node X" bug #15186

Conversation

@kirktrue kirktrue commented Jan 13, 2024

Change AbstractFetcher/Fetcher to not clear the sessionHandlers cache during prepareCloseFetchSessionRequests().

During close(), Fetcher calls maybeCloseFetchSessions() which, in turn, calls prepareCloseFetchSessionRequests() and then calls NetworkClient.poll() to complete the requests. Since prepareCloseFetchSessionRequests() (erroneously) clears the sessionHandlers cache, when the response is processed, the sessions are missing, and the warning is logged.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

This change restores the previous behavior for the Fetcher to keep its internal sessionHandlers cache even after close() is invoked.

During the close() of a Consumer, the Fetcher attempts to send out FetchRequests to signal to the relevant brokers that the consumer is closing. This allows the broker to clean up relevant resources on its end.

The Fetcher was recently changed (in 3.5) so that it cleared its sessionHandlers cache immediately after it finished creating the final FetchRequests. The reasoning was that, since the fetcher was being closed, those session handlers were no longer needed, so the cache could safely be cleared.

However, it is evidently possible for the fetcher's response handler to be invoked if a response for an in-flight fetch is received shortly after the cache was cleared. (There doesn't appear to be anything in the Fetcher that blocks response handling after it is closed.) In that case, the response handler logs the "Unable to find FetchSessionHandler for node X" warning when it cannot find the entry in the sessionHandlers cache.
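The race described above can be modeled with a minimal sketch. The class and method names below are illustrative stand-ins, not Kafka's actual internals (the real logic lives in AbstractFetch/Fetcher in the clients module):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal, hypothetical model of the close/response race.
public class CloseRaceSketch {
    // nodeId -> session handler (modeled here as a plain String)
    private final Map<Integer, String> sessionHandlers = new HashMap<>();

    public void addSession(int nodeId, String handler) {
        sessionHandlers.put(nodeId, handler);
    }

    // Builds the final "close" fetch requests. The fix: the cache is NOT
    // cleared here, because a response to an in-flight fetch may still
    // arrive and must be able to find its handler.
    public void prepareCloseFetchSessionRequests() {
        sessionHandlers.forEach((nodeId, handler) -> {
            // the real code calls sessionHandler.notifyClose() here
        });
        // The buggy version called sessionHandlers.clear() at this point.
    }

    // Models the response handler: returns false in exactly the case that
    // produced the "Unable to find FetchSessionHandler for node X" warning.
    public boolean handleResponse(int nodeId) {
        return sessionHandlers.containsKey(nodeId);
    }
}
```

With the clear() left in place, a late handleResponse() call would fail its lookup after close, which is the failure behind the logged warning.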
@kirktrue kirktrue closed this Jan 13, 2024
@kirktrue kirktrue reopened this Jan 13, 2024
@kirktrue kirktrue closed this Jan 13, 2024
@kirktrue kirktrue reopened this Jan 13, 2024
@kirktrue kirktrue marked this pull request as ready for review January 14, 2024 07:02
@kirktrue (Collaborator, Author):

Test failures from most recent build are known flaky tests.

sessionHandlers.forEach((fetchTargetNodeId, sessionHandler) -> {
    // set the session handler to notify close. This will set the next metadata request to send close message.
    sessionHandler.notifyClose();
Contributor:

Seems like you took out the sessionHandlers.clear(); in the finally block. Is there any reason why we don't need it?

@kirktrue (Collaborator, Author):

Yes, if we clear the sessions, we potentially end up with inflight response handlers that fail to find their corresponding session handler. This lookup failure results in the error log line reported in the Jira.

@kirktrue (Collaborator, Author):

There are some potential alternatives to solve this problem, but the path I took was to simply revert to the previous behavior which did not clear the cache.

Contributor:

Thanks, Kirk, for the explanation. It seems like there are cases where we would want to clear the cache; one I can think of is a topology change, but that is probably an unnoticeable optimization. I guess the handler lookup map never grows so large that it becomes a problem.

@kirktrue (Collaborator, Author):

I think that's worth exploring, because I don't see that we ever remove any of the FetchSessionHandler entries once they're created 😬

Contributor:

Don't we remove the map and that eventually leads to their GC, or does something else hold a reference?

@kirktrue (Collaborator, Author):

We don't remove the sessionHandlers map explicitly, no. It's garbage collected when the AbstractFetch/Fetcher object itself is garbage collected.

@kirktrue (Collaborator, Author):

@philipnee: it doesn't look like there's actually an issue with respect to stale partition data in the FetchSessionHandler. It's reset on every call to prepareFetchRequests().

The FetchSessionHandler uses an inner Builder class to keep track of the partitions the user is requesting. For each distinct fetch cycle a new Builder is created. Each partition that is requested is added to the newly-created Builder instance. When all the fetchable partitions have been added, the Builder.build() method is called which creates the FetchRequestData.

So the data kept in the FetchSessionHandler is ephemeral. Any partition leadership changes would happen somewhere upstream and get updated in the SubscriptionState, well before we get to the fetch request generation.
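The per-cycle Builder lifecycle described above can be sketched roughly like this. This is a toy model for illustration only, not the real FetchSessionHandler API (which tracks TopicPartition to PartitionData mappings):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the builder-per-fetch-cycle pattern.
public class SessionBuilderSketch {
    public static class Builder {
        private final List<String> partitions = new ArrayList<>();

        // Each requested partition is added to this cycle's builder.
        public Builder add(String partition) {
            partitions.add(partition);
            return this;
        }

        // build() produces the request data; the builder is then discarded.
        public List<String> build() {
            return new ArrayList<>(partitions);
        }
    }

    // A fresh Builder is created for each fetch cycle, so no partition
    // data lingers in the handler between cycles.
    public Builder newBuilder() {
        return new Builder();
    }
}
```

Because each cycle starts from an empty builder, stale partition data from a previous cycle cannot leak into the next request.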

Contributor:

There are some potential alternatives to solve this problem, but the path I took was to simply revert to the previous behavior which did not clear the cache.

I think that the cache was actually cleared in Fetcher.close() before the refactoring. I suppose that we could bring it back too even if it does not bring much in the end.


ijuma commented Jan 15, 2024

Are there any network resources held by the sessionHandlers that would no longer be closed due to this? If not, then this seems like a reasonable fix.

@kirktrue (Collaborator, Author):

@ijuma

Are there any network resources held by the sessionHandlers that would no longer be closed due to this?

No. The FetchSessionHandler has a session ID, epoch, node ID, and set of PartitionData (itself offsets and epochs).

If not, then this seems like a reasonable fix.

Thanks!
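The handler state enumerated above (session ID, epoch, node ID, partition data) can be sketched as a plain value type, underlining that it is in-memory metadata with no network resources attached. The type and field names here are illustrative, not Kafka's actual classes:

```java
import java.util.Map;

// Hypothetical mirror of the fields a fetch session handler holds:
// pure data, nothing that needs closing.
public record FetchSessionStateSketch(
        int sessionId,                    // broker-assigned fetch session ID
        int epoch,                        // incremented on each incremental fetch
        int nodeId,                       // the broker this session talks to
        Map<String, Long> partitionData   // partition -> fetch offset (simplified)
) {}
```

Since nothing here owns a socket or buffer, leaving these entries in the cache after close() only costs a small amount of heap until the fetcher itself is garbage collected.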

@kirktrue (Collaborator, Author):

@ijuma / @stanislavkozlovski are either of you able to review? Thanks!

@AndrewJSchofield AndrewJSchofield left a comment


lgtm

@dajac dajac left a comment


@kirktrue Thanks for the patch. I left a few comments/questions.

        fetchable.put(fetchTarget, sessionHandler.newBuilder());
    });
} finally {
    sessionHandlers.clear();
Contributor:

For reference, sessionHandlers.clear() is indeed incorrect here. The issue is that the caller of prepareCloseFetchSessionRequests calls client.poll() to complete the requests created here. When they do complete, the sessions are not there anymore, so the warning is logged.

@@ -376,25 +376,21 @@ protected Map<Node, FetchSessionHandler.FetchRequestData> prepareCloseFetchSessi
final Cluster cluster = metadata.fetch();
Contributor:

On a slightly different topic, should prepareCloseFetchSessionRequests be synchronized too? We used to have a synchronized block in the close method, but it was removed during the refactoring.

@kirktrue (Collaborator, Author):

I made the closeInternal() method synchronized as that more closely emulates the synchronization in <= 3.5. It covers prepareCloseFetchSessionRequests() too.
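The approach described above can be sketched as follows. The method names follow the discussion, but the bodies are illustrative stand-ins for the real fetcher logic:

```java
// Sketch: making closeInternal() synchronized means the close path,
// including prepareCloseFetchSessionRequests(), cannot interleave with
// other synchronized fetcher operations on the same instance.
public class SyncCloseSketch {
    private boolean closed = false;

    protected synchronized void prepareCloseFetchSessionRequests() {
        // build the final "close" fetch requests while holding the monitor
    }

    protected synchronized void closeInternal() {
        if (closed) {
            return; // close is idempotent
        }
        prepareCloseFetchSessionRequests();
        closed = true;
    }

    public synchronized boolean isClosed() {
        return closed;
    }
}
```

Because synchronized methods on one object share a single monitor, a synchronized method calling another synchronized method of the same object (closeInternal() calling prepareCloseFetchSessionRequests()) is safe: Java monitors are reentrant.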

@dajac dajac left a comment


LGTM. Thanks @kirktrue!


dajac commented Jan 23, 2024

@kirktrue I think that the description is not quite right. First, I think that the session map was actually cleared previously. Second, the root cause is more along the line of what I said here. Could you please update the description of the PR?

@kirktrue (Collaborator, Author):

@kirktrue I think that the description is not quite right. First, I think that the session map was actually cleared previously. Second, the root cause is more along the line of what I said here. Could you please update the description of the PR?

I will update the description today. Thanks @dajac!

@kirktrue (Collaborator, Author):

@dajac: the PR description was updated.

Let me know if there's anything left before we're ready to merge.

Thanks!

@dajac dajac merged commit bf49fb2 into apache:trunk Jan 24, 2024
1 check failed
dajac pushed a commit that referenced this pull request Jan 24, 2024
…15186)

Change `AbstractFetcher`/`Fetcher` to _not_ clear the `sessionHandlers` cache during `prepareCloseFetchSessionRequests()`.

During `close()`, `Fetcher` calls `maybeCloseFetchSessions()` which, in turn, calls `prepareCloseFetchSessionRequests()` and then calls `NetworkClient.poll()` to complete the requests. Since `prepareCloseFetchSessionRequests()` (erroneously) clears the `sessionHandlers` cache, when the response is processed, the sessions are missing, and the warning is logged.

Reviewers: Andrew Schofield <aschofield@confluent.io>, David Jacot <djacot@confluent.io>

dajac commented Jan 24, 2024

Merged it to trunk and 3.7. cc @stanislavkozlovski

@kirktrue (Collaborator, Author):

Thanks @dajac!

yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request Feb 15, 2024
@kirktrue kirktrue deleted the KAFKA-16029-fix-unable-to-find-fetchsessionhandler-bug branch February 16, 2024 17:53
clolov pushed a commit to clolov/kafka that referenced this pull request Apr 5, 2024
Phuc-Hong-Tran pushed a commit to Phuc-Hong-Tran/kafka that referenced this pull request Jun 6, 2024