Affected Version
Druid 25.0.0
Description
We are facing a bug in SeekableStreamSupervisor.java.
We are running a Kafka ingestion supervisor with task replication 2. Task group [3], consuming from partitions [P1, P2], contains two tasks [A1, A2], each consuming from the same partitions (replica tasks). When we submit the supervisor config, [A1, A2] should be added to the pending completion task groups as { groupId: 3, [ TaskGroup[A1, A2] ] }. But we observed logs like:
1. Creating new pending completion task group [3] for discovered task [A1] overlord-0
2. Creating new pending completion task group [3] for discovered task [A2] overlord-0
3. Added discovered task [A2] to existing pending task group [3]
These leave the pending completion task groups in the state { groupId: 3, [ TaskGroup[A1, A2], TaskGroup[A2] ] }. We have seen that the logs in lines 1 and 2 come from two different threads, and probably because the check-then-add on the CopyOnWriteArrayList is not atomic, we get two task groups with duplicate taskIds instead of a single task group.
We have further root-caused this issue to the forEach loop here, which, instead of calling addDiscoveredTaskToPendingCompletionTaskGroups once each for A1 and A2, permutes the tasks with their partitions and calls it for A1_P1, A2_P1, A1_P2, A2_P2, causing these multiple task groups to be created in a lot of cases (almost 17% of supervisor submissions in our observation).
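To make the suspected race concrete, here is a minimal, self-contained sketch. It is not Druid's actual code: the class, field, and method names are ours, a TaskGroup is simplified to a set of taskIds, and the offset-matching logic is reduced to "reuse the first group". It only shows how a non-atomic check-then-add on a CopyOnWriteArrayList, driven once per (task, partition) pair, can produce two groups for one groupId:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CopyOnWriteArraySet;

public class PendingCompletionRaceSketch
{
  // groupId -> pending completion task groups; a "group" here is just a set
  // of taskIds (hypothetical structure mirroring the report, not Druid's).
  static final ConcurrentHashMap<Integer, CopyOnWriteArrayList<Set<String>>> pending =
      new ConcurrentHashMap<>();

  static void addDiscoveredTask(int groupId, String taskId)
  {
    List<Set<String>> groups = pending.computeIfAbsent(groupId, k -> new CopyOnWriteArrayList<>());

    // Step 1: reuse an existing group (the real code matches replica groups
    // by starting offsets; with a single replica set, the first group will do).
    for (Set<String> group : groups) {
      group.add(taskId);
      System.out.println("Added discovered task [" + taskId + "] to existing pending task group [" + groupId + "]");
      return;
    }

    // Step 2: no group found, so create one. Each list operation is
    // thread-safe, but the step-1 + step-2 compound is not atomic: two
    // threads can both see an empty list and both create a group.
    Set<String> fresh = new CopyOnWriteArraySet<>();
    fresh.add(taskId);
    groups.add(fresh);
    System.out.println("Creating new pending completion task group [" + groupId + "] for discovered task [" + taskId + "]");
  }

  public static void main(String[] args) throws InterruptedException
  {
    // The permutation described above: one call per (task, partition) pair,
    // i.e. A1_P1, A1_P2, A2_P1, A2_P2, multiplying the chances to race.
    Map<String, List<String>> taskPartitions =
        Map.of("A1", List.of("P1", "P2"), "A2", List.of("P1", "P2"));

    List<Thread> threads = new ArrayList<>();
    taskPartitions.forEach((taskId, partitions) ->
        partitions.forEach(p -> threads.add(new Thread(() -> addDiscoveredTask(3, taskId)))));
    for (Thread t : threads) {
      t.start();
    }
    for (Thread t : threads) {
      t.join();
    }

    // Expected: a single group [A1, A2]. Under the race, some runs print
    // two "Creating new..." lines and end with e.g. [[A1, A2], [A2]].
    System.out.println("pending groups for [3]: " + pending.get(3));
  }
}
```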
The problem that arises from this: since we have [ TaskGroup[A1, A2], TaskGroup[A2] ] in the pending completion groups for one groupId, we see rare cases where, even though A1 succeeds and only A2 fails, the supervisor can mark entireTaskGroupFailed = true and not only kill [A1, A2] but also kill all the actively running tasks, causing ingestion lags.
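To illustrate why the stray group is dangerous, here is a sketch with the same simplification as above; the per-group check below (a replica group passes if at least one of its replicas succeeded) is a hypothetical stand-in for, not a copy of, the logic in checkPendingCompletionTasks(). The duplicate TaskGroup[A2] fails as a whole when A2 fails, even though its replica A1 succeeded in the other group:

```java
import java.util.List;
import java.util.Set;

public class PendingCompletionFailureSketch
{
  public static void main(String[] args)
  {
    // The duplicated state from the report: two groups for groupId 3.
    List<Set<String>> groupsForId3 = List.of(Set.of("A1", "A2"), Set.of("A2"));
    Set<String> succeeded = Set.of("A1"); // A1 passed, A2 failed

    // Hypothetical per-group check: a replica group succeeds if any one of
    // its replicas published successfully.
    for (Set<String> group : groupsForId3) {
      if (group.stream().anyMatch(succeeded::contains)) {
        System.out.println("group " + group + " completed successfully");
      } else {
        // The stray TaskGroup[A2] fails outright; in the supervisor this is
        // where entireTaskGroupFailed can flip and trigger killing tasks.
        System.out.println("group " + group + " marked entirely failed");
      }
    }
  }
}
```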
We wanted to have a discussion on a potential fix for this issue, as this bug can cause ingestion-lag incidents when we submit the supervisor. We would also like clarity on whether having [ TaskGroup[A1, A2], TaskGroup[A2] ] in the pending completion groups is expected behaviour. Should the taskId not be permuted with partitions? Should we add synchronisation to prevent two replica tasks from landing in two different groups for a single groupId (a sketch of this option follows)? Or is this a case that checkPendingCompletionTasks() should handle to avoid killing actively running tasks?
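For the synchronisation option, one possible direction (again a sketch with the hypothetical names from above, not a patch against Druid) is to make the whole find-or-create step atomic per groupId, so concurrent discovery of replicas can never fan out into two groups; deduplicating the calls to one per taskId would independently shrink the race window:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CopyOnWriteArraySet;

public class PendingCompletionSyncSketch
{
  static final ConcurrentHashMap<Integer, CopyOnWriteArrayList<Set<String>>> pending =
      new ConcurrentHashMap<>();

  static void addDiscoveredTaskAtomically(int groupId, String taskId)
  {
    CopyOnWriteArrayList<Set<String>> groups =
        pending.computeIfAbsent(groupId, k -> new CopyOnWriteArrayList<>());
    // Serialize the check-then-add per groupId; readers of the
    // CopyOnWriteArrayList elsewhere remain lock-free.
    synchronized (groups) {
      for (Set<String> group : groups) {
        group.add(taskId); // reuse the existing replica group
        return;
      }
      Set<String> fresh = new CopyOnWriteArraySet<>();
      fresh.add(taskId);
      groups.add(fresh); // only the first discovering thread creates a group
    }
  }

  public static void main(String[] args) throws InterruptedException
  {
    Thread t1 = new Thread(() -> addDiscoveredTaskAtomically(3, "A1"));
    Thread t2 = new Thread(() -> addDiscoveredTaskAtomically(3, "A2"));
    t1.start();
    t2.start();
    t1.join();
    t2.join();
    // Always a single group containing both A1 and A2.
    System.out.println(pending.get(3));
  }
}
```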
Any help on understanding it better would be appreciated.
Please include as much detailed information about the problem as possible.
I have included everything I can in the issue description; if any more logs or debugging are required, I can add them in comments on ad hoc requests.
- Cluster size
- Configurations in use
- Steps to reproduce the problem
- The error message or stack traces encountered. Providing more context, such as nearby log messages or even entire logs, can be helpful.
- Any debugging that you have already done