Affected Version
Druid 25.0.0
Description
We are facing a bug in SeekableStreamSupervisor.java.
We are running a Kafka ingestion supervisor with task replication 2. Task group [3], consuming from partitions [P1, P2], contains two tasks [A1, A2], each consuming from the same partitions (replica tasks). When we submit the supervisor config, [A1, A2] should be added to the pending completion task groups as { groupId: 3, [ TaskGroup[A1, A2] ] }. But we observed logs like:
1. Creating new pending completion task group [3] for discovered task [A1] overlord-0
2. Creating new pending completion task group [3] for discovered task [A2] overlord-0
3. Added discovered task [A2] to existing pending task group [3]
These leave the pending completion task groups in the state { groupId: 3, [ TaskGroup[A1, A2], TaskGroup[A2] ] }. We have seen that the logs in lines 1 and 2 come from two different threads, and probably because the check-then-add on the CopyOnWriteArrayList is not atomic, we get two task groups with duplicate taskIds instead of a single task group.
We have further root-caused this issue to the forEach loop here, which, instead of calling addDiscoveredTaskToPendingCompletionTaskGroups once each for A1 and A2, permutes the tasks with their partitions and calls it for A1_P1, A2_P1, A1_P2, A2_P2, causing these multiple task groups to be created in a lot of cases (almost 17% of supervisor submissions in our observation).
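To make the suspected race concrete, here is a minimal, self-contained sketch. It is not Druid's actual code: the class, field, and method names are ours, a TaskGroup is simplified to a set of taskIds, and the offset-matching logic is reduced to "reuse the first group". It only shows how a non-atomic check-then-add on a CopyOnWriteArrayList, driven once per (task, partition) pair, can produce two groups for one groupId:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CopyOnWriteArraySet;

public class PendingCompletionRaceSketch
{
  // groupId -> pending completion task groups; a "group" here is just a set
  // of taskIds (hypothetical structure mirroring the report, not Druid's).
  static final ConcurrentHashMap<Integer, CopyOnWriteArrayList<Set<String>>> pending =
      new ConcurrentHashMap<>();

  static void addDiscoveredTask(int groupId, String taskId)
  {
    List<Set<String>> groups = pending.computeIfAbsent(groupId, k -> new CopyOnWriteArrayList<>());

    // Step 1: reuse an existing group (the real code matches replica groups
    // by starting offsets; with a single replica set, the first group will do).
    for (Set<String> group : groups) {
      group.add(taskId);
      System.out.println("Added discovered task [" + taskId + "] to existing pending task group [" + groupId + "]");
      return;
    }

    // Step 2: no group found, so create one. Each list operation is
    // thread-safe, but the step-1 + step-2 compound is not atomic: two
    // threads can both see an empty list and both create a group.
    Set<String> fresh = new CopyOnWriteArraySet<>();
    fresh.add(taskId);
    groups.add(fresh);
    System.out.println("Creating new pending completion task group [" + groupId + "] for discovered task [" + taskId + "]");
  }

  public static void main(String[] args) throws InterruptedException
  {
    // The permutation described above: one call per (task, partition) pair,
    // i.e. A1_P1, A1_P2, A2_P1, A2_P2, multiplying the chances to race.
    Map<String, List<String>> taskPartitions =
        Map.of("A1", List.of("P1", "P2"), "A2", List.of("P1", "P2"));

    List<Thread> threads = new ArrayList<>();
    taskPartitions.forEach((taskId, partitions) ->
        partitions.forEach(p -> threads.add(new Thread(() -> addDiscoveredTask(3, taskId)))));
    for (Thread t : threads) {
      t.start();
    }
    for (Thread t : threads) {
      t.join();
    }

    // Expected: a single group [A1, A2]. Under the race, some runs print
    // two "Creating new..." lines and end with e.g. [[A1, A2], [A2]].
    System.out.println("pending groups for [3]: " + pending.get(3));
  }
}
```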
The problem that arises from this: since we have [ TaskGroup[A1, A2], TaskGroup[A2] ] in the pending completion groups for one groupId, we see rare cases where, even though A1 succeeds and only A2 fails, the supervisor can mark entireTaskGroupFailed = true and not only kill [A1, A2] but also kill all the actively running tasks, causing ingestion lags.
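To illustrate why the stray group is dangerous, here is a sketch with the same simplification as above; the per-group check below (a replica group passes if at least one of its replicas succeeded) is a hypothetical stand-in for, not a copy of, the logic in checkPendingCompletionTasks(). The duplicate TaskGroup[A2] fails as a whole when A2 fails, even though its replica A1 succeeded in the other group:

```java
import java.util.List;
import java.util.Set;

public class PendingCompletionFailureSketch
{
  public static void main(String[] args)
  {
    // The duplicated state from the report: two groups for groupId 3.
    List<Set<String>> groupsForId3 = List.of(Set.of("A1", "A2"), Set.of("A2"));
    Set<String> succeeded = Set.of("A1"); // A1 passed, A2 failed

    // Hypothetical per-group check: a replica group succeeds if any one of
    // its replicas published successfully.
    for (Set<String> group : groupsForId3) {
      if (group.stream().anyMatch(succeeded::contains)) {
        System.out.println("group " + group + " completed successfully");
      } else {
        // The stray TaskGroup[A2] fails outright; in the supervisor this is
        // where entireTaskGroupFailed can flip and trigger killing tasks.
        System.out.println("group " + group + " marked entirely failed");
      }
    }
  }
}
```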
We wanted to have a discussion on a potential fix for this issue, as this bug can cause ingestion-lag incidents when we submit the supervisor. We would also like clarity on whether having [ TaskGroup[A1, A2], TaskGroup[A2] ] in the pending completion groups is expected behaviour. Should the taskId not be permuted with partitions? Should we add synchronisation to prevent two replica tasks from landing in two different groups for a single groupId (a sketch of this option follows)? Or is this a case that checkPendingCompletionTasks() should handle to avoid killing actively running tasks?
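For the synchronisation option, one possible direction (again a sketch with the hypothetical names from above, not a patch against Druid) is to make the whole find-or-create step atomic per groupId, so concurrent discovery of replicas can never fan out into two groups; deduplicating the calls to one per taskId would independently shrink the race window:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CopyOnWriteArraySet;

public class PendingCompletionSyncSketch
{
  static final ConcurrentHashMap<Integer, CopyOnWriteArrayList<Set<String>>> pending =
      new ConcurrentHashMap<>();

  static void addDiscoveredTaskAtomically(int groupId, String taskId)
  {
    CopyOnWriteArrayList<Set<String>> groups =
        pending.computeIfAbsent(groupId, k -> new CopyOnWriteArrayList<>());
    // Serialize the check-then-add per groupId; readers of the
    // CopyOnWriteArrayList elsewhere remain lock-free.
    synchronized (groups) {
      for (Set<String> group : groups) {
        group.add(taskId); // reuse the existing replica group
        return;
      }
      Set<String> fresh = new CopyOnWriteArraySet<>();
      fresh.add(taskId);
      groups.add(fresh); // only the first discovering thread creates a group
    }
  }

  public static void main(String[] args) throws InterruptedException
  {
    Thread t1 = new Thread(() -> addDiscoveredTaskAtomically(3, "A1"));
    Thread t2 = new Thread(() -> addDiscoveredTaskAtomically(3, "A2"));
    t1.start();
    t2.start();
    t1.join();
    t2.join();
    // Always a single group containing both A1 and A2.
    System.out.println(pending.get(3));
  }
}
```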
Any help on understanding it better would be appreciated.
Please include as much detailed information about the problem as possible.
I have included everything I can in the issue description; if any more logs or debugging are required, I can add them in comments on ad hoc requests.
- Cluster size
- Configurations in use
- Steps to reproduce the problem
- The error message or stack traces encountered. Providing more context, such as nearby log messages or even entire logs, can be helpful.
- Any debugging that you have already done