
KAFKA-14401: Fail the worker if any uncaught exception comes in WorkThread #13361

Open
wants to merge 19 commits into trunk
Conversation

@rohits64 rohits64 commented Mar 8, 2023

Currently the WorkThread dies if it encounters an unexpected exception, and the connectors/tasks are not aware of this: they keep submitting callbacks to KafkaBasedLog with nobody processing them. I have added an uncaught exception handler so that such an exception is detected and the Worker fails.

EDIT: Instead of starting a new thread, I have added retries with exponential backoff and a limited number of attempts (currently 10).
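
For illustration, a minimal standalone sketch of the retry-with-backoff idea; `pollOnce()` and the constants are hypothetical placeholders, not the actual KafkaBasedLog internals:

```java
// Sketch only: retry the work-loop step with exponential backoff a bounded
// number of times, then rethrow so the failure becomes visible to the owner.
public class BackoffWorkLoop implements Runnable {
    private static final int MAX_RETRIES = 10;          // assumption: matches "currently 10"
    private static final long INITIAL_BACKOFF_MS = 100; // hypothetical starting backoff
    private static final long MAX_BACKOFF_MS = 10_000;  // hypothetical cap

    @Override
    public void run() {
        int failures = 0;
        while (!Thread.currentThread().isInterrupted()) {
            try {
                pollOnce();
                failures = 0; // a successful iteration resets the retry budget
            } catch (RuntimeException e) {
                failures++;
                if (failures > MAX_RETRIES)
                    throw e; // give up; the surrounding component should surface this
                long backoffMs = Math.min(MAX_BACKOFF_MS,
                        INITIAL_BACKOFF_MS * (1L << (failures - 1)));
                try {
                    Thread.sleep(backoffMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }

    // placeholder for the real consume/produce + callback processing step
    private void pollOnce() {
    }
}
```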

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@rohits64 rohits64 marked this pull request as draft March 8, 2023 07:39
@rohits64 rohits64 changed the title KAFKA-14401: Resume WorkThread if Connector/Tasks reading offsets get stuck when underneath WorkThread dies KAFKA-14401: Resume WorkThread if Connector/Tasks reading offsets get stuck when underneath WorkThread dies [WIP] Mar 8, 2023
Contributor

@mukkachaitanya mukkachaitanya left a comment

Hi @rohits64! Thanks for picking this up. It can be particularly painful when the WorkThread dies and there is no feedback to the Herder/operator. It seems clear that failures, especially in the offset storage, can prevent offset commits, and failures in the status storage can lead to stale statuses.

First, let's make sure you are signed up on Apache Jira; if you could assign this ticket to yourself, that would be great (just send a mail to dev@kafka.apache.org and ask for a Jira account with your preferred username).

It would also be nice if we could discuss what is expected when the WorkThread is killed.

Currently, we already retry forever for both TimeoutException and RetriableException. When the exception is persistent, for example persistent timeouts, the work thread can get into an infinite retry loop (we have seen cases of this on production systems). This is dangerous because (1) users have no visibility into these retries other than the logs, and (2) over time the callbacks might fill the queue. That is an issue we need to address separately, but it sets useful context here.

Since KafkaBasedLog mainly consumes from and produces to Kafka topics, we are already retrying the exceptions mentioned above. However, any other runtime exception here should IMO be treated as fatal and handled accordingly. Two options seem obvious:

  1. We can possibly retry the exception some static number of times with exponential back-off and then fail the WorkThread. The exception must be propagated back to the Herder and fail the Worker.

  2. We can also fail the worker as soon as we see the exception. Again we propagate the error back to the Herder and fail it.

In either case, since worker failures trigger rebalances, the connectors/tasks come back healthy rather than staying stuck in a weird state.

WDYT of this @rohits64, @vamossagar12? Are there any concerns I am missing here?
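
For reference, a minimal sketch of what option 2 could look like in isolation: record the uncaught exception and let later checks fail fast. The class, field, and method names here are illustrative, not the actual Connect runtime API.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch only: fail fast by remembering the work thread's uncaught exception
// so the owning component can rethrow it instead of silently waiting forever.
public class FailFastWorkThreadOwner {
    private final AtomicReference<Throwable> fatalError = new AtomicReference<>();

    public Thread start(Runnable work) {
        Thread t = new Thread(work, "work-thread-sketch");
        t.setUncaughtExceptionHandler((thread, error) -> {
            // keep only the first fatal error; later checks surface it
            fatalError.compareAndSet(null, error);
        });
        t.start();
        return t;
    }

    // callers (e.g. before enqueueing new callbacks) would invoke this check
    public void throwIfFailed() {
        Throwable error = fatalError.get();
        if (error != null)
            throw new IllegalStateException("Work thread died with an uncaught exception", error);
    }
}
```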

Comment on lines 564 to 566
// resume the thread in case of any unexpected exception encountered
thread = new WorkThread();
thread.start();
Contributor

Restarting forever might not be an acceptable solution. However, if we were to retry, we could keep using the same thread rather than spawning a new one. We have precedent for retrying on RetriableException and TimeoutException, since the thread itself is stateless.

@rohits64
Author

Hi @mukkachaitanya, thanks for the input. I think we can go with exponential backoff and a fixed number of retries. I have made the changes; the number of retries is currently set to 10, which should be sufficient.

@rohits64 rohits64 changed the title KAFKA-14401: Resume WorkThread if Connector/Tasks reading offsets get stuck when underneath WorkThread dies [WIP] KAFKA-14401: Resume WorkThread if Connector/Tasks reading offsets get stuck when underneath WorkThread dies Mar 21, 2023
@rohits64 rohits64 marked this pull request as ready for review March 21, 2023 06:07
@gharris1727
Contributor

Since the KafkaBasedLog mainly consumes/produces the Kafka topics, we are already retrying the exceptions mentioned above. However, any other runtime exception here IMO should be treated as fatal and should be treated accordingly. Two options seem obvious:

We can possibly retry the exception some static number of times with exponential back-off and then fail the WorkThread. The exception must be propagated back to the Herder and fail the Worker.

I don't think that we should add retries when we already know that the exceptions that would be caught here are non-retriable. Additionally, it may be unsafe or incorrect to retry on an arbitrary exception, and may produce unexpected behavior in the consumer or in the worker.

@rohits64 I also think that the PR as-is does not address the latter point that @mukkachaitanya made: we need to propagate these failures to the asynchronous callers, not just let this thread die (with or without retries). As written, the PR does not address the problem in the title.

Thanks for looking into this, it's certainly not good for these failures to silently stall the worker indefinitely!
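
To make the "propagate to the asynchronous callers" point concrete, here is a simplified, self-contained sketch (not the actual Connect runtime types): on a fatal error, drain the queued callbacks and complete each one exceptionally rather than leaving the callers blocked.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch only: pending callbacks are completed with the fatal error instead of
// being abandoned when the processing thread dies.
public class PendingCallbackFailer {
    // simplified stand-in for the Connect runtime's Callback interface
    interface Callback<V> {
        void onCompletion(Throwable error, V result);
    }

    private final Queue<Callback<Void>> pendingCallbacks = new ConcurrentLinkedQueue<>();

    void submit(Callback<Void> callback) {
        pendingCallbacks.add(callback);
    }

    // invoked once the work thread has hit an unrecoverable exception
    void failAllPending(Throwable fatalError) {
        Callback<Void> callback;
        while ((callback = pendingCallbacks.poll()) != null) {
            callback.onCompletion(fatalError, null);
        }
    }
}
```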

@mukkachaitanya
Contributor

mukkachaitanya commented Mar 23, 2023

I don't think that we should add retries when we already know that the exceptions that would be caught here are non-retriable. Additionally, it may be unsafe or incorrect to retry on an arbitrary exception, and may produce unexpected behavior in the consumer or in the worker.

Fair point @gharris1727! I agree. I think the retries came into the picture because we were retrying on TimeoutException for example. However, they wouldn't make sense in cases where we know the exception is something non-retriable.

@vamossagar12
Collaborator

Thanks for the PR @rohits64. I agree with @gharris1727 and @mukkachaitanya that we shouldn't add the overhead of retries here. Also, as pointed out, the original problem of notifying the producers to the queue that the WorkThread is dead and can't handle any more requests isn't addressed.
I would also factor in failing any in-flight read requests for this topic, because they might otherwise stay stuck forever.

@rohits64
Author

Thanks @mukkachaitanya, @gharris1727 and @vamossagar12 for looking into this. As pointed out, it is fair to fail instead of retrying, since retrying on unhandled exceptions might cause unexpected behaviour. I will make further changes so that the worker fails rather than the thread just dying silently.

@vamossagar12
Collaborator

I don't think that we should add retries when we already know that the exceptions that would be caught here are non-retriable. Additionally, it may be unsafe or incorrect to retry on an arbitrary exception, and may produce unexpected behavior in the consumer or in the worker.

Fair point @gharris1727! I agree. I think the retries came into the picture because we were retrying on TimeoutException for example. However, they wouldn't make sense in cases where we know the exception is something non-retriable.

Yeah, TimeoutException is a tricky one because it is generally thrown after having already waited for some amount of time (like default.api.timeout.ms when fetching a position from a Consumer or listing topics). So in a sense the waiting has already happened, at which point we can ascertain that something is broken, and yet it extends RetriableException, which suggests it is transient. In this case, I would still lean towards not retrying and failing everything.

@rohits64 rohits64 changed the title KAFKA-14401: Resume WorkThread if Connector/Tasks reading offsets get stuck when underneath WorkThread dies KAFKA-14401: Fail the worker if any uncaught exception comes in WorkThread Mar 27, 2023
Comment on lines 319 to 321
if (offsetLog.unexpectedExceptionCaught.get()) {
throw new ConnectException("Unexpected Exception caught", offsetLog.exception);
}


QQ: what if the ConnectException is thrown inside the readToEnd method on the next line instead? Then we may not need to access variables from the KafkaBasedLog class explicitly.

Author

If the ConnectException is thrown inside the readToEnd method, it gets handled at every place where readToEnd is called, instead of just in KafkaOffsetBackingStore. That would be the better approach, as it would cover all cases of this bug.
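
A rough, purely illustrative sketch of that direction (the field and method names are made up and the real KafkaBasedLog API differs): readToEnd() itself returns an already-failed future once the work thread has died, so every caller observes the error without reaching into KafkaBasedLog's internals.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicReference;

// Sketch only: surface the stored work-thread failure from readToEnd() itself.
public class ReadToEndSketch {
    private final AtomicReference<Throwable> workThreadFailure = new AtomicReference<>();

    // the work thread's uncaught-exception handling would record the failure here
    void recordFailure(Throwable error) {
        workThreadFailure.compareAndSet(null, error);
    }

    public Future<Void> readToEnd() {
        Throwable failure = workThreadFailure.get();
        if (failure != null) {
            // fail fast for every caller instead of enqueueing a callback nobody will process
            CompletableFuture<Void> failed = new CompletableFuture<>();
            failed.completeExceptionally(
                    new IllegalStateException("Work thread has died", failure));
            return failed;
        }
        return CompletableFuture.completedFuture(null); // placeholder for the real read
    }
}
```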

@github-actions

github-actions bot commented Aug 4, 2023

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or appropriate release branch)

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

@github-actions github-actions bot added the stale Stale PRs label Aug 4, 2023