
Conversation

Member
@carsonip carsonip commented Nov 3, 2025

Handle DeadlineExceeded properly without stopping consumer, as these can happen when network is unstable.

github-actions bot commented Nov 3, 2025

⚠️ Warning

System-tests will not be executed for this PR because it comes from a forked repository, but a successful run-system-tests status check will still be created.

@carsonip carsonip requested a review from lahsivjar November 3, 2025 17:54
Comment on lines 357 to 360
exponentialBackoff := ExponentialBackoff{
	base: 1 * time.Second,
	max:  1 * time.Minute,
}
Contributor
Run's main ctx is ignored. If that is canceled, we should exit.

c.mu.Unlock()
var attempt int
for {
	exp := exponentialBackoff{
Contributor

Do we really need exponential backoff? Doesn't polling already have a built-in delay, since it accumulates fetches until maxPollRecords records are available?

Member Author

@carsonip carsonip Nov 3, 2025

Do you imply we can retry immediately? Do you expect the next fetch call to block?

Contributor

> do you imply we can retry immediately?

Yes. PollRecords doesn't seem to actually issue requests, so there is no risk of overloading Kafka. Per the godocs:

// PollRecords waits for fetches to be available, returning as soon as any
// broker returns a fetch. If the context is nil, this function will return
// immediately with any currently buffered fetches.

This to me means there is no point in doing a backoff.

Member Author
done

Member Author
But I wonder what happens if there's a network disruption that causes a client connection error. Will it keep trying to establish a new connection?

Contributor
> Will it keep trying to establish new connection?

Connection to what? PollRecords doesn't seem to issue any request; it just waits for the fetches from Kafka to accumulate the required number of records.

if errors.Is(clientCtx.Err(), context.Canceled) {
	return nil // Return no error if client context is canceled.
}
backoff := exp.Backoff(attempt)
Contributor
All errors will be retried. I don't think this is a good idea, given that there can be unrecoverable errors too, for example failing to commit.

Member Author
What do we want to do in that case? Propagate it upwards to crash?

Member Author
Maybe only retry DeadlineExceeded?

Member Author
I wonder whether it is possible for the fetch context to be canceled while the client context (clientCtx) isn't, and how we'd want to handle that case.

Member Author
Retrying both context canceled and deadline exceeded, as long as the client context is not canceled.

@carsonip carsonip requested a review from lahsivjar November 3, 2025 18:21
@carsonip carsonip changed the title Fix stuck kafka consumer in incident 2191 Fix indefinitely stuck consumer on DeadlineExceeded Nov 3, 2025
lahsivjar
lahsivjar previously approved these changes Nov 3, 2025
Contributor

@lahsivjar lahsivjar left a comment

LGTM! Let's wait for @marclop's review too.

@carsonip carsonip changed the title Fix indefinitely stuck consumer on DeadlineExceeded Fix indefinitely stuck consumer on fetch DeadlineExceeded Nov 3, 2025
Contributor

@lahsivjar lahsivjar left a comment

LGTM!

@carsonip carsonip merged commit 13c6572 into elastic:main Nov 3, 2025
4 checks passed
if fetchErr := fetches.Err0(); errors.Is(fetchErr, kgo.ErrClientClosed) ||
	errors.Is(fetchErr, context.Canceled) ||
	errors.Is(fetchErr, context.DeadlineExceeded) {
	return fetchErr
Contributor
Thanks for this fix; we were masking the underlying error here.

continue
}

return fmt.Errorf("consumer fetch error: %w", err)
Contributor

Can we please add some follow-up comments on this code block? I find it pretty confusing to read.

marclop pushed a commit that referenced this pull request Nov 5, 2025