
KAFKA-14401: Fail the worker if any uncaught exception comes in WorkThread #13361

Open
wants to merge 19 commits into trunk
Conversation

@rohits64 rohits64 commented Mar 8, 2023

Currently the WorkThread dies if it encounters an unexpected exception, and the connectors/tasks are not aware of this: they keep submitting callbacks to KafkaBasedLog with nobody processing them. I have added an uncaught exception handler so that such an exception is detected and the Worker fails.

EDIT: Instead of starting a new thread, I have added retries with exponential backoff and a limited number of attempts (currently 10).
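
For illustration, a minimal standalone sketch of the retry-with-backoff idea; `pollOnce()` and the constants are hypothetical placeholders, not the actual KafkaBasedLog internals:

```java
// Sketch only: retry the work-loop step with exponential backoff a bounded
// number of times, then rethrow so the failure becomes visible to the owner.
public class BackoffWorkLoop implements Runnable {
    private static final int MAX_RETRIES = 10;          // assumption: matches "currently 10"
    private static final long INITIAL_BACKOFF_MS = 100; // hypothetical starting backoff
    private static final long MAX_BACKOFF_MS = 10_000;  // hypothetical cap

    @Override
    public void run() {
        int failures = 0;
        while (!Thread.currentThread().isInterrupted()) {
            try {
                pollOnce();
                failures = 0; // a successful iteration resets the retry budget
            } catch (RuntimeException e) {
                failures++;
                if (failures > MAX_RETRIES)
                    throw e; // give up; the surrounding component should surface this
                long backoffMs = Math.min(MAX_BACKOFF_MS,
                        INITIAL_BACKOFF_MS * (1L << (failures - 1)));
                try {
                    Thread.sleep(backoffMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }

    // placeholder for the real consume/produce + callback processing step
    private void pollOnce() {
    }
}
```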

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@rohits64 rohits64 marked this pull request as draft March 8, 2023 07:39
@rohits64 rohits64 changed the title KAFKA-14401: Resume WorkThread if Connector/Tasks reading offsets get stuck when underneath WorkThread dies KAFKA-14401: Resume WorkThread if Connector/Tasks reading offsets get stuck when underneath WorkThread dies [WIP] Mar 8, 2023
Contributor

@mukkachaitanya mukkachaitanya left a comment

Hi @rohits64! Thanks for picking this up. It can be particularly painful when the WorkThread dies and there is no feedback to the Herder/operator. It seems clear that failures, especially in the offset storage, can prevent offset commits, and failures in the status storage can lead to stale statuses.

First, let's make sure you are signed up on Apache Jira; if you could assign this ticket to yourself, that would be great (just send a mail to dev@kafka.apache.org and ask for a Jira account with your preferred username).

It would also be nice if we could discuss what is expected when the WorkThread is killed.

Currently, we already retry forever for both TimeoutException and RetriableException. When the exception is persistent, for example persistent timeouts, the work thread can get into an infinite retry loop (we have seen cases of this on production systems). This is dangerous because (1) users have no visibility into these retries other than the logs, and (2) over time the callbacks might fill the queue. That is an issue we need to address separately, but it sets useful context here.

Since KafkaBasedLog mainly consumes from and produces to Kafka topics, we are already retrying the exceptions mentioned above. However, any other runtime exception here should IMO be treated as fatal and handled accordingly. Two options seem obvious:

  1. We can possibly retry the exception some static number of times with exponential back-off and then fail the WorkThread. The exception must be propagated back to the Herder and fail the Worker.

  2. We can also fail the worker as soon as we see the exception. Again we propagate the error back to the Herder and fail it.

In either case, since worker failures trigger rebalances, the connectors/tasks come back healthy rather than staying stuck in a weird state.

WDYT of this @rohits64, @vamossagar12? Are there any concerns I am missing here?
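
For reference, a minimal sketch of what option 2 could look like in isolation: record the uncaught exception and let later checks fail fast. The class, field, and method names here are illustrative, not the actual Connect runtime API.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch only: fail fast by remembering the work thread's uncaught exception
// so the owning component can rethrow it instead of silently waiting forever.
public class FailFastWorkThreadOwner {
    private final AtomicReference<Throwable> fatalError = new AtomicReference<>();

    public Thread start(Runnable work) {
        Thread t = new Thread(work, "work-thread-sketch");
        t.setUncaughtExceptionHandler((thread, error) -> {
            // keep only the first fatal error; later checks surface it
            fatalError.compareAndSet(null, error);
        });
        t.start();
        return t;
    }

    // callers (e.g. before enqueueing new callbacks) would invoke this check
    public void throwIfFailed() {
        Throwable error = fatalError.get();
        if (error != null)
            throw new IllegalStateException("Work thread died with an uncaught exception", error);
    }
}
```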

Comment on lines 564 to 566
// resume the thread in case of any unexpected exception encountered
thread = new WorkThread();
thread.start();
Contributor

Restarting forever might not be an acceptable solution. However, if we were to retry, we could keep using the same thread rather than spawning a new one. We have precedent for retrying on RetriableException and TimeoutException, since the thread itself is stateless.

@rohits64
Author

Hi @mukkachaitanya, thanks for the input. I think we can go with exponential backoff and a fixed number of retries. I have made the changes; the number of retries is currently set to 10, which should be sufficient.

@rohits64 rohits64 changed the title KAFKA-14401: Resume WorkThread if Connector/Tasks reading offsets get stuck when underneath WorkThread dies [WIP] KAFKA-14401: Resume WorkThread if Connector/Tasks reading offsets get stuck when underneath WorkThread dies Mar 21, 2023
@rohits64 rohits64 marked this pull request as ready for review March 21, 2023 06:07
@gharris1727
Contributor

Since the KafkaBasedLog mainly consumes/produces the Kafka topics, we are already retrying the exceptions mentioned above. However, any other runtime exception here IMO should be treated as fatal and should be treated accordingly. Two options seem obvious:

We can possibly retry the exception some static number of times with exponential back-off and then fail the WorkThread. The exception must be propagated back to the Herder and fail the Worker.

I don't think that we should add retries when we already know that the exceptions that would be caught here are non-retriable. Additionally, it may be unsafe or incorrect to retry on an arbitrary exception, and may produce unexpected behavior in the consumer or in the worker.

@rohits64 I also think that the PR as-is does not address the latter point that @mukkachaitanya made: we need to propagate these failures to the asynchronous callers, not just let this thread die (with or without retries). As written, the PR does not address the problem in the title.

Thanks for looking into this, it's certainly not good for these failures to silently stall the worker indefinitely!
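
To make the "propagate to the asynchronous callers" point concrete, here is a simplified, self-contained sketch (not the actual Connect runtime types): on a fatal error, drain the queued callbacks and complete each one exceptionally rather than leaving the callers blocked.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch only: pending callbacks are completed with the fatal error instead of
// being abandoned when the processing thread dies.
public class PendingCallbackFailer {
    // simplified stand-in for the Connect runtime's Callback interface
    interface Callback<V> {
        void onCompletion(Throwable error, V result);
    }

    private final Queue<Callback<Void>> pendingCallbacks = new ConcurrentLinkedQueue<>();

    void submit(Callback<Void> callback) {
        pendingCallbacks.add(callback);
    }

    // invoked once the work thread has hit an unrecoverable exception
    void failAllPending(Throwable fatalError) {
        Callback<Void> callback;
        while ((callback = pendingCallbacks.poll()) != null) {
            callback.onCompletion(fatalError, null);
        }
    }
}
```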

@mukkachaitanya
Contributor

mukkachaitanya commented Mar 23, 2023

I don't think that we should add retries when we already know that the exceptions that would be caught here are non-retriable. Additionally, it may be unsafe or incorrect to retry on an arbitrary exception, and may produce unexpected behavior in the consumer or in the worker.

Fair point @gharris1727! I agree. I think the retries came into the picture because we were retrying on TimeoutException for example. However, they wouldn't make sense in cases where we know the exception is something non-retriable.

@vamossagar12
Collaborator

Thanks for the PR @rohits64. I agree with @gharris1727 and @mukkachaitanya that we shouldn't add the overhead of retries here. Also, as pointed out, the original problem of notifying the producers to the queue that the WorkThread is dead and can't handle any more requests isn't addressed.
I would also factor in failing any in-flight read requests for this topic, because they might otherwise stay stuck forever.

@rohits64
Author

Thanks @mukkachaitanya, @gharris1727 and @vamossagar12 for looking into this. As pointed out, it is fair to fail instead of retrying, since retrying on unhandled exceptions might cause unexpected behaviour. I will make further changes so that the worker fails rather than the thread just dying silently.

@vamossagar12
Collaborator

I don't think that we should add retries when we already know that the exceptions that would be caught here are non-retriable. Additionally, it may be unsafe or incorrect to retry on an arbitrary exception, and may produce unexpected behavior in the consumer or in the worker.

Fair point @gharris1727! I agree. I think the retries came into the picture because we were retrying on TimeoutException for example. However, they wouldn't make sense in cases where we know the exception is something non-retriable.

Yeah, TimeoutException is a tricky one because it is generally thrown after having already waited for some amount of time (like default.api.timeout.ms when fetching a position from a Consumer or listing topics). So in a sense the waiting has already happened, at which point we can ascertain that something is broken, and yet it extends RetriableException, which suggests it is transient. In this case, I would still lean towards not retrying and failing everything.

@rohits64 rohits64 changed the title KAFKA-14401: Resume WorkThread if Connector/Tasks reading offsets get stuck when underneath WorkThread dies KAFKA-14401: Fail the worker if any uncaught exception comes in WorkThread Mar 27, 2023
Comment on lines 319 to 321
if (offsetLog.unexpectedExceptionCaught.get()) {
throw new ConnectException("Unexpected Exception caught", offsetLog.exception);
}


QQ: what if the ConnectException is thrown inside the readToEnd method on the next line instead? Then we may not need to access variables from the KafkaBasedLog class explicitly.

Author

If the ConnectException is thrown inside the readToEnd method, it gets handled at every place where readToEnd is called, instead of just in KafkaOffsetBackingStore. That would be the better approach, as it would cover all cases of this bug.
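
A rough, purely illustrative sketch of that direction (the field and method names are made up and the real KafkaBasedLog API differs): readToEnd() itself returns an already-failed future once the work thread has died, so every caller observes the error without reaching into KafkaBasedLog's internals.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicReference;

// Sketch only: surface the stored work-thread failure from readToEnd() itself.
public class ReadToEndSketch {
    private final AtomicReference<Throwable> workThreadFailure = new AtomicReference<>();

    // the work thread's uncaught-exception handling would record the failure here
    void recordFailure(Throwable error) {
        workThreadFailure.compareAndSet(null, error);
    }

    public Future<Void> readToEnd() {
        Throwable failure = workThreadFailure.get();
        if (failure != null) {
            // fail fast for every caller instead of enqueueing a callback nobody will process
            CompletableFuture<Void> failed = new CompletableFuture<>();
            failed.completeExceptionally(
                    new IllegalStateException("Work thread has died", failure));
            return failed;
        }
        return CompletableFuture.completedFuture(null); // placeholder for the real read
    }
}
```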

@github-actions

github-actions bot commented Aug 4, 2023

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or appropriate release branch)

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

@github-actions github-actions bot added the stale Stale PRs label Aug 4, 2023