
Fix race condition in KubernetesTaskRunner when task is added to the map#14643

Merged
kfaraz merged 6 commits into apache:master from YongGang:fix-race-task
Jul 27, 2023

Conversation

@YongGang
Contributor

YongGang commented Jul 24, 2023

Description

There seems to be a multi-threading issue introduced into KubernetesTaskRunner by #14435.
The following exception was thrown under high load:

org.apache.druid.java.util.common.ISE: Task [partial_dimension_cardinality_xxx] disappeared
	at org.apache.druid.k8s.overlord.KubernetesTaskRunner.doTask(KubernetesTaskRunner.java:167) ~[?:?]
	at org.apache.druid.k8s.overlord.KubernetesTaskRunner.runTask(KubernetesTaskRunner.java:151) ~[?:?]
	at org.apache.druid.k8s.overlord.KubernetesTaskRunner.lambda$null$0(KubernetesTaskRunner.java:138) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]

The task is added to the tasks map from the main thread, while doTask (called by runTask) checks for the task's existence from a pool thread, causing a race condition, as shown in the following code:

return tasks.computeIfAbsent(
    task.getId(), k -> new KubernetesWorkItem(task, exec.submit(() -> runTask(task)))
).getResult();
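The bad interleaving can be forced deterministically. The sketch below is a hypothetical stand-alone reproduction (String stands in for the work item; a latch forces the ordering), not the actual runner code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

class RaceDemo {
    // Returns what the worker thread saw when it looked up the task.
    static String workerView() {
        ConcurrentHashMap<String, String> tasks = new ConcurrentHashMap<>();
        ExecutorService exec = Executors.newSingleThreadExecutor();
        CountDownLatch workerDone = new CountDownLatch(1);
        AtomicReference<String> seen = new AtomicReference<>();

        tasks.computeIfAbsent("task-1", k -> {
            // The mapping function submits the worker, as the runner did.
            exec.submit(() -> {
                seen.set(tasks.get("task-1")); // the doTask()-side lookup
                workerDone.countDown();
            });
            // Force the bad interleaving: let the worker finish its lookup
            // before computeIfAbsent publishes the entry. (Blocking inside a
            // mapping function is a bad idea in real code; this is only to
            // make the race deterministic.)
            try {
                workerDone.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "work-item";
        });
        exec.shutdown();
        return seen.get(); // null: entry not yet visible to the worker
    }

    public static void main(String[] args) {
        System.out.println("worker saw: " + workerView());
    }
}
```

The worker's get returns null because ConcurrentHashMap only publishes the entry after the mapping function returns, which is exactly the window the runner hit under load.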

In this PR, we change the KubernetesWorkItem constructor so that the TaskStatus future can be set via a method after the instance has been initialized.

Release note

  • Fix a race condition in the Kubernetes task runner.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@kfaraz
Contributor

kfaraz commented Jul 24, 2023

I feel that re-introducing locks is a step backwards. If the changes in #14435 seem to be causing trouble, we should just revert that commit rather than introduce a new kind of locking. This would also make sense given the upcoming release of Druid 27.


Alternatively,

IIUC, the problem here is that while one thread is in the middle of adding the work item using tasks.computeIfAbsent, the executor has already picked it up for running.

There are two easy ways to avoid that (unless I am missing something):

Option 1: Set result in KubernetesWorkItem only after work item has been added to map:

@Override
public ListenableFuture<TaskStatus> run(Task task)
{
  final KubernetesWorkItem workItem = tasks.computeIfAbsent(
      task.getId(), k -> new KubernetesWorkItem(task)
  );
  workItem.setResultIfRequired(
      exec.submit(() -> runTask(task))
  );
  return workItem.getResult();
}

This requires creating a new method synchronized void setResultIfRequired() inside KubernetesWorkItem.
Something similar would need to be done for joinAsync method too.
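A rough sketch of what that could look like (names follow the suggestion above; a plain Future<String> stands in for Guava's ListenableFuture<TaskStatus>, so this is an illustration, not the Druid class):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

// Hypothetical sketch of the proposed setResultIfRequired idea.
class WorkItemSketch {
    private Future<String> result;

    synchronized void setResultIfRequired(Future<String> future) {
        if (result == null) {
            result = future; // first caller wins; later calls are ignored
        }
    }

    synchronized Future<String> getResult() {
        if (result == null) {
            throw new IllegalStateException("Result future has not been set");
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        WorkItemSketch item = new WorkItemSketch();
        item.setResultIfRequired(CompletableFuture.completedFuture("SUCCESS"));
        System.out.println(item.getResult().get());
    }
}
```

Synchronizing both methods on the work item (not on the whole map) keeps the critical section tiny: the worker either sees the future already set or blocks only for the duration of the assignment.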

Option 2: Start doTask only after computeIfAbsent has finished.

private TaskStatus runTask(Task task)
  {
    final AtomicReference<TaskStatus> taskStatus = new AtomicReference<>();
    tasks.compute(
        task.getId(), (taskId, workItem) -> {
          taskStatus.set(doTask(task, workItem, true));
          return workItem;
        }
    );
    return taskStatus.get();
  }

Something similar would have to be done for joinTask method too.
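The property option 2 leans on — ConcurrentHashMap.compute on a key cannot run its remapping function while another thread's computeIfAbsent on the same key is still in flight — can be checked with a small sketch (String stand-ins; the latch and sleep only stage the interleaving):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

class ComputeDemo {
    // What the worker's remapping function observed for the work item.
    static String workerView() {
        ConcurrentHashMap<String, String> tasks = new ConcurrentHashMap<>();
        CountDownLatch insertStarted = new CountDownLatch(1);
        AtomicReference<String> seen = new AtomicReference<>();

        Thread worker = new Thread(() -> {
            try {
                insertStarted.await(); // start only once insertion is underway
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // Blocks until the other thread's computeIfAbsent has completed,
            // so workItem is guaranteed to be the inserted value, never null.
            tasks.compute("task-1", (taskId, workItem) -> {
                seen.set(workItem);
                return workItem;
            });
        });
        worker.start();

        tasks.computeIfAbsent("task-1", k -> {
            insertStarted.countDown();
            try {
                Thread.sleep(100); // widen the race window; demo only
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "work-item";
        });

        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen.get();
    }

    public static void main(String[] args) {
        System.out.println("worker saw: " + workerView());
    }
}
```

The trade-off, as noted, is that the entire doTask would run inside the remapping function, blocking any other access to that key for its whole duration.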


I personally prefer option 1 as it doesn't unnecessarily block an executor thread (even if only for a little while) and is logically simpler to follow.
@YongGang , @georgew5656 , what do you think?

@YongGang
Contributor Author

The problem with option 1 is that the constructor of KubernetesWorkItem calls the superclass constructor as well. And even if we opt to change both the KubernetesWorkItem and TaskRunnerWorkItem classes to add a setResultIfRequired method, it is still not ideal and error-prone, as developers need to remember to call setResultIfRequired right after the instance is created.

public class KubernetesWorkItem extends TaskRunnerWorkItem
{
  public KubernetesWorkItem(Task task, ListenableFuture<TaskStatus> statusFuture)
  {
    super(task.getId(), statusFuture);
    this.task = task;
  }

In this case I prefer to revert the last commit. Though synchronizing on tasks in all the operations seems like overkill, as it's already a ConcurrentHashMap, it makes the class more thread-safe (in theory).

@kfaraz
Contributor

kfaraz commented Jul 24, 2023

I agree, we should not update the super class. The result field is not serializable and is there in the super class just for convenience. You can pass a null to it.

Since we have a specific use case, it is okay to override the default behaviour. You would need to maintain the result field at the KubernetesWorkItem level and override the getResult method to return this instead of the one in the super class.

If KubernetesWorkItem does not take a future in its constructor, then users would naturally know they have to set it. You can add Javadocs to that effect and throw an exception in getResult if it is not set. So I am not sure the error-prone concern is really valid.

@kfaraz
Contributor

kfaraz commented Jul 24, 2023

Though synchronizing on tasks in all the operations seems like overkill, as it's already a ConcurrentHashMap, it makes the class more thread-safe (in theory).

Yeah, it is certainly thread-safe (even in practice) to have everything be synchronized under the same lock but it also limits performance. A single bad call can potentially block everything.

We are trying to revisit all usages of locks in the Druid code base, and the original PR #14435 was in this same vein of improving performance.

@kfaraz kfaraz modified the milestone: 27.0 Jul 24, 2023
@georgew5656
Contributor

georgew5656 commented Jul 24, 2023

@kfaraz @YongGang I think it makes sense to revert removing the locks, since this is a valid synchronization use case (the main thread inserting into the map has to happen before the worker thread reads from the map).

I think we can keep the non-synchronized getRunningTasks/getKnownTasks/getPendingTasks from the other PR, though.

Edit: hmm, just saw your other comment, let me look at that solution first.

I think solution 1 makes sense. There is still concurrent access where the main thread is trying to set result while the worker thread is trying to get/set kubernetesPeonLifecycle, but AFAIK that should not cause any issues.

I am wondering if there will be an issue with having a work item with no result future attached to it, though? That might be the reason the constructor is the way it is. This could be tested out, I think.

@georgew5656
Contributor

It seems to me like there are some methods in TaskQueue that assume that statusFuture will not be null. If we want to go with option 1, I think we need to filter getKnownTasks and getPendingTasks to check that result is not null.

@georgew5656
Contributor

georgew5656 commented Jul 24, 2023

After thinking about this some more, I feel like it's simpler to just leave the synchronized block in for run/joinAsync (run only happens once when tasks are run, and joinAsync only happens once when the overlord is restarted), and in doTask (which only happens once in worker threads). We can leave getPendingTasks, getKnownTasks, and getRunningTasks unsynchronized, and I think that will prevent most of the lock contention.

I think this is a safer solution for Druid 27, and we can maybe investigate removing synchronization entirely after doing some more scale testing to see how much it helps.

@kfaraz what do you think?
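As a runnable analog of this approach (String stands in for Task/KubernetesWorkItem, so this is only a sketch of the idea, not the Druid code), both the inserting side and the worker's lookup take the tasks monitor, which rules out the missing-entry window:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicReference;

class SyncFixDemo {
    // What the worker thread saw when it looked up the task: always the
    // inserted work item, because both sides take the same monitor.
    static String workerView() {
        ConcurrentHashMap<String, String> tasks = new ConcurrentHashMap<>();
        ExecutorService exec = Executors.newSingleThreadExecutor();
        AtomicReference<Future<String>> worker = new AtomicReference<>();

        synchronized (tasks) {
            // run()-side: insertion and submission happen under the monitor.
            tasks.computeIfAbsent("task-1", k -> {
                worker.set(exec.submit(() -> {
                    // doTask()-side lookup: blocks here until run() releases
                    // the monitor, by which time the entry is in the map.
                    synchronized (tasks) {
                        return tasks.get("task-1");
                    }
                }));
                return "work-item";
            });
        }

        try {
            return worker.get().get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            exec.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("worker saw: " + workerView());
    }
}
```

The read-only listing methods can stay unsynchronized, as discussed above, since they tolerate a slightly stale view of the map.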

@kfaraz
Contributor

kfaraz commented Jul 25, 2023

I feel like it's simpler to just leave the synchronized block in for run/joinAsync

@georgew5656 , @YongGang, if you prefer having the synchronized for now, then we can proceed with it. 👍🏻

i think this is a safer solution for druid 27 and we can maybe investigate this change to remove synchronization entirely after doing some more scale testing to see how much it helps.

This is not a blocker for Druid 27 as the bug is in a contrib extension. But yes, it would be better to have it addressed.

we can maybe investigate this change to remove synchronization entirely after doing some more scale testing to see how much it helps.

Removing this synchronization might not help performance much, as the critical section is very small and threads would not be blocked for long. The primary reason I did not prefer it was to have homogeneity in the code: it gets confusing to have synchronization in some places and not in others. And if we have synchronization in all places, then it does start affecting performance.

@YongGang
Contributor Author

I feel like it's simpler to just leave the synchronized block in for run/joinAsync

@georgew5656 , @YongGang, if you prefer having the synchronized for now, then we can proceed with it. 👍🏻

If we agree on this, then I will discard the changes in this PR (the option 1 solution) and work on partially reverting #14435.

@YongGang
Contributor Author

Done. Please have a look @georgew5656 @kfaraz

Comment on lines 143 to 144
tasks.computeIfAbsent(task.getId(), k -> new KubernetesWorkItem(task, exec.submit(() -> doTask(task, run))));
return tasks.get(task.getId()).getResult();
Contributor

this can be a single chained statement. Also it would look more readable if the arguments to computeIfAbsent were on a different line as in the original method.

Contributor Author

Updated.

throw new ISE("Task [%s] disappeared", task.getId());
}
if (workItem == null) {
throw new ISE("Task [%s] disappeared", task.getId());
Contributor

Should we maybe just return a failed TaskStatus here instead of throwing an exception? The exception thrown by this method may or may not be handled by the calling code, but there's no point depending on that if we already know the reason for the task failure.

We should do the same thing in the catch block too.

But this doesn't need to be done as a part of this PR, just wanted to call it out.

Contributor Author

I looked at what other TaskRunners do so we can have consistent behavior (throw an error or return a task failure).
In ThreadingTaskRunner, the code seems similar to what we have here.

Contributor

Makes sense, we can revisit this later.

try {
Task task = adapter.toTask(job);
tasks.add(Pair.of(task, joinAsync(task)));
restoredTasks.add(Pair.of(task, runOrJoinTask(task, false)));
Contributor

I preferred the original separation between joinTask and runTask. Passing a boolean is cryptic and makes the code less readable.

Contributor Author

Updated. My thinking was that since we added a synchronized block (a lock in the previous commit), it's better to have a single place/method holding this complexity. But I agree this may make the code less readable.

Contributor

Yeah, that is why I didn't give this feedback originally. But upon reading the code again, it felt cleaner to have them separate.

}

protected ListenableFuture<TaskStatus> joinAsync(Task task)
protected ListenableFuture<TaskStatus> runOrJoinTask(Task task, boolean run)
Contributor

It was better to have this as two separate methods, seemed more readable and easy to understand.

Contributor

georgew5656 commented Jul 26, 2023

Yeah, agree. We can just put the synchronized blocks back in (revert parts of that previous PR I made to remove them) without changing any function names.

Contributor Author

Done. My comment from above:

My thinking was that since we added a synchronized block (a lock in the previous commit), it's better to have a single place/method holding this complexity. But I agree this may make the code less readable.

Contributor

kfaraz left a comment

Thanks a lot for your patience with the back and forth on this PR, @YongGang ! 🙂

@kfaraz
Contributor

kfaraz commented Jul 27, 2023

Some checks were skipped, re-triggering them.

@kfaraz kfaraz closed this Jul 27, 2023
@kfaraz kfaraz reopened this Jul 27, 2023
@kfaraz kfaraz merged commit 9b88b78 into apache:master Jul 27, 2023
@YongGang YongGang deleted the fix-race-task branch August 2, 2023 17:30
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
FrankChen021 pushed a commit that referenced this pull request Feb 3, 2025
…map (#14643)

Changes:
- Fix race condition in KubernetesTaskRunner introduced by #14435 
- Perform addition and removal from map inside a synchronized block
- Update tests
GabrielCWT pushed a commit to GabrielCWT/druid that referenced this pull request Sep 9, 2025
…map (apache#14643)

Changes:
- Fix race condition in KubernetesTaskRunner introduced by apache#14435 
- Perform addition and removal from map inside a synchronized block
- Update tests