
KAFKA-16029: Fix "Unable to find FetchSessionHandler for node X" bug #15186

Conversation

@kirktrue kirktrue commented Jan 13, 2024

Change AbstractFetcher/Fetcher to not clear the sessionHandlers cache during prepareCloseFetchSessionRequests().

During close(), Fetcher calls maybeCloseFetchSessions() which, in turn, calls prepareCloseFetchSessionRequests() and then calls NetworkClient.poll() to complete the requests. Since prepareCloseFetchSessionRequests() (erroneously) clears the sessionHandlers cache, when the response is processed, the sessions are missing, and the warning is logged.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

This change restores the previous behavior for the Fetcher to keep its internal sessionHandlers cache even after close() is invoked.

During the close() of a Consumer, the Fetcher attempts to send out FetchRequests to signal to the relevant brokers that the consumer is closing. This allows the broker to clean up relevant resources on its end.

The Fetcher was recently changed (in 3.5) so that it cleared its sessionHandlers cache immediately after it finished creating the final FetchRequests. The reasoning was that, since the fetcher was being closed, those session handlers were no longer needed, so the cache could safely be cleared.

However, it is evidently possible for the fetcher's response handler to be invoked if a response for an in-flight fetch is received shortly after the cache was cleared. (There doesn't appear to be anything in the Fetcher that blocks response handling after it is closed.) In that case, the response handler logs the "Unable to find FetchSessionHandler for node X" warning when it cannot find the entry in the sessionHandlers cache.
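The race described above can be modeled with a minimal sketch. The class and method names below are illustrative stand-ins, not Kafka's actual internals (the real logic lives in AbstractFetch/Fetcher in the clients module):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal, hypothetical model of the close/response race.
public class CloseRaceSketch {
    // nodeId -> session handler (modeled here as a plain String)
    private final Map<Integer, String> sessionHandlers = new HashMap<>();

    public void addSession(int nodeId, String handler) {
        sessionHandlers.put(nodeId, handler);
    }

    // Builds the final "close" fetch requests. The fix: the cache is NOT
    // cleared here, because a response to an in-flight fetch may still
    // arrive and must be able to find its handler.
    public void prepareCloseFetchSessionRequests() {
        sessionHandlers.forEach((nodeId, handler) -> {
            // the real code calls sessionHandler.notifyClose() here
        });
        // The buggy version called sessionHandlers.clear() at this point.
    }

    // Models the response handler: returns false in exactly the case that
    // produced the "Unable to find FetchSessionHandler for node X" warning.
    public boolean handleResponse(int nodeId) {
        return sessionHandlers.containsKey(nodeId);
    }
}
```

With the clear() left in place, a late handleResponse() call would fail its lookup after close, which is the failure behind the logged warning.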
@kirktrue kirktrue closed this Jan 13, 2024
@kirktrue kirktrue reopened this Jan 13, 2024
@kirktrue kirktrue closed this Jan 13, 2024
@kirktrue kirktrue reopened this Jan 13, 2024
@kirktrue kirktrue marked this pull request as ready for review January 14, 2024 07:02
@kirktrue (Collaborator, Author):

Test failures from most recent build are known flaky tests.

sessionHandlers.forEach((fetchTargetNodeId, sessionHandler) -> {
    // set the session handler to notify close. This will set the next metadata request to send close message.
    sessionHandler.notifyClose();
Contributor:

Seems like you took out the sessionHandlers.clear(); in the finally block. Is there any reason why we don't need it?

@kirktrue (Collaborator, Author):

Yes, if we clear the sessions, we potentially end up with inflight response handlers that fail to find their corresponding session handler. This lookup failure results in the error log line reported in the Jira.

@kirktrue (Collaborator, Author):

There are some potential alternatives to solve this problem, but the path I took was to simply revert to the previous behavior which did not clear the cache.

Contributor:

Thanks, Kirk, for the explanation. It seems like there are cases where we would want to clear the cache; one I can think of is a topology change, but that is probably an unnoticeable optimization. I guess the handler lookup map never grows so large that it becomes a problem.

@kirktrue (Collaborator, Author):

I think that's worth exploring, because I don't see that we ever remove any of the FetchSessionHandler entries once they're created 😬

Contributor:

Don't we remove the map and that eventually leads to their GC, or does something else hold a reference?

@kirktrue (Collaborator, Author):

We don't remove the sessionHandlers map explicitly, no. It's garbage collected when the AbstractFetch/Fetcher object itself is garbage collected.

@kirktrue (Collaborator, Author):

@philipnee: it doesn't look like there's actually an issue with respect to stale partition data in the FetchSessionHandler. It's reset on every call to prepareFetchRequests().

The FetchSessionHandler uses an inner Builder class to keep track of the partitions the user is requesting. For each distinct fetch cycle a new Builder is created. Each partition that is requested is added to the newly-created Builder instance. When all the fetchable partitions have been added, the Builder.build() method is called which creates the FetchRequestData.

So the data kept in the FetchSessionHandler is ephemeral. Any partition leadership changes would happen somewhere upstream and get updated in the SubscriptionState, well before we get to the fetch request generation.
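The per-cycle Builder lifecycle described above can be sketched roughly like this. This is a toy model for illustration only, not the real FetchSessionHandler API (which tracks TopicPartition to PartitionData mappings):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the builder-per-fetch-cycle pattern.
public class SessionBuilderSketch {
    public static class Builder {
        private final List<String> partitions = new ArrayList<>();

        // Each requested partition is added to this cycle's builder.
        public Builder add(String partition) {
            partitions.add(partition);
            return this;
        }

        // build() produces the request data; the builder is then discarded.
        public List<String> build() {
            return new ArrayList<>(partitions);
        }
    }

    // A fresh Builder is created for each fetch cycle, so no partition
    // data lingers in the handler between cycles.
    public Builder newBuilder() {
        return new Builder();
    }
}
```

Because each cycle starts from an empty builder, stale partition data from a previous cycle cannot leak into the next request.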

Contributor:

There are some potential alternatives to solve this problem, but the path I took was to simply revert to the previous behavior which did not clear the cache.

I think that the cache was actually cleared in Fetcher.close() before the refactoring. I suppose that we could bring it back too even if it does not bring much in the end.


ijuma commented Jan 15, 2024

Are there any network resources held by the sessionHandlers that would no longer be closed due to this? If not, then this seems like a reasonable fix.

@kirktrue (Collaborator, Author):

@ijuma

Are there any network resources held by the sessionHandlers that would no longer be closed due to this?

No. The FetchSessionHandler has a session ID, epoch, node ID, and set of PartitionData (itself offsets and epochs).

If not, then this seems like a reasonable fix.

Thanks!
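The handler state enumerated above (session ID, epoch, node ID, partition data) can be sketched as a plain value type, underlining that it is in-memory metadata with no network resources attached. The type and field names here are illustrative, not Kafka's actual classes:

```java
import java.util.Map;

// Hypothetical mirror of the fields a fetch session handler holds:
// pure data, nothing that needs closing.
public record FetchSessionStateSketch(
        int sessionId,                    // broker-assigned fetch session ID
        int epoch,                        // incremented on each incremental fetch
        int nodeId,                       // the broker this session talks to
        Map<String, Long> partitionData   // partition -> fetch offset (simplified)
) {}
```

Since nothing here owns a socket or buffer, leaving these entries in the cache after close() only costs a small amount of heap until the fetcher itself is garbage collected.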

@kirktrue (Collaborator, Author):

@ijuma / @stanislavkozlovski are either of you able to review? Thanks!

@AndrewJSchofield AndrewJSchofield left a comment


lgtm

@dajac dajac left a comment


@kirktrue Thanks for the patch. I left a few comments/questions.

        fetchable.put(fetchTarget, sessionHandler.newBuilder());
    });
} finally {
    sessionHandlers.clear();
Contributor:

For reference, sessionHandlers.clear() is indeed incorrect here. The issue is that the caller of prepareCloseFetchSessionRequests calls client.poll() to complete the requests created here. When they do complete, the sessions are not there anymore, so the warning is logged.

@@ -376,25 +376,21 @@ protected Map<Node, FetchSessionHandler.FetchRequestData> prepareCloseFetchSessi
final Cluster cluster = metadata.fetch();
Contributor:

On a slightly different topic, should prepareCloseFetchSessionRequests be synchronized too? We used to have a synchronized block in the close method, but it was removed during the refactoring.

@kirktrue (Collaborator, Author):

I made the closeInternal() method synchronized as that more closely emulates the synchronization in <= 3.5. It covers prepareCloseFetchSessionRequests() too.
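The approach described above can be sketched as follows. The method names follow the discussion, but the bodies are illustrative stand-ins for the real fetcher logic:

```java
// Sketch: making closeInternal() synchronized means the close path,
// including prepareCloseFetchSessionRequests(), cannot interleave with
// other synchronized fetcher operations on the same instance.
public class SyncCloseSketch {
    private boolean closed = false;

    protected synchronized void prepareCloseFetchSessionRequests() {
        // build the final "close" fetch requests while holding the monitor
    }

    protected synchronized void closeInternal() {
        if (closed) {
            return; // close is idempotent
        }
        prepareCloseFetchSessionRequests();
        closed = true;
    }

    public synchronized boolean isClosed() {
        return closed;
    }
}
```

Because synchronized methods on one object share a single monitor, a synchronized method calling another synchronized method of the same object (closeInternal() calling prepareCloseFetchSessionRequests()) is safe: Java monitors are reentrant.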

@dajac dajac left a comment


LGTM. Thanks @kirktrue!


dajac commented Jan 23, 2024

@kirktrue I think that the description is not quite right. First, I think that the session map was actually cleared previously. Second, the root cause is more along the line of what I said here. Could you please update the description of the PR?

@kirktrue (Collaborator, Author):

@kirktrue I think that the description is not quite right. First, I think that the session map was actually cleared previously. Second, the root cause is more along the line of what I said here. Could you please update the description of the PR?

I will update the description today. Thanks @dajac!

@kirktrue (Collaborator, Author):

@dajac: the PR description was updated.

Let me know if there's anything left before we're ready to merge.

Thanks!

@dajac dajac merged commit bf49fb2 into apache:trunk Jan 24, 2024
1 check failed
dajac pushed a commit that referenced this pull request Jan 24, 2024
…15186)

Change `AbstractFetcher`/`Fetcher` to _not_ clear the `sessionHandlers` cache during `prepareCloseFetchSessionRequests()`.

During `close()`, `Fetcher` calls `maybeCloseFetchSessions()` which, in turn, calls `prepareCloseFetchSessionRequests()` and then calls `NetworkClient.poll()` to complete the requests. Since `prepareCloseFetchSessionRequests()` (erroneously) clears the `sessionHandlers` cache, when the response is processed, the sessions are missing, and the warning is logged.

Reviewers: Andrew Schofield <aschofield@confluent.io>, David Jacot <djacot@confluent.io>

dajac commented Jan 24, 2024

Merged it to trunk and 3.7. cc @stanislavkozlovski

@kirktrue (Collaborator, Author):

Thanks @dajac!

yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request Feb 15, 2024
@kirktrue kirktrue deleted the KAFKA-16029-fix-unable-to-find-fetchsessionhandler-bug branch February 16, 2024 17:53
clolov pushed a commit to clolov/kafka that referenced this pull request Apr 5, 2024
Phuc-Hong-Tran pushed a commit to Phuc-Hong-Tran/kafka that referenced this pull request Jun 6, 2024