The reason this happens is that we get the information about the "running" and "done" tasks from two different sources. We first consult the disco_server for the "running" task counts, and then we consult the job event handlers for the "done" task counts. If a task finishes in the small window between these two queries, it is counted both as a running task and as a done task, which produces the inconsistency.
One way to avoid this problem is to query the "done" tasks first and the running tasks second. In that case, a task that finishes between the two queries is counted in neither, so the inconsistency shows up as an extra "waiting" task, which is more acceptable.
Your explanation of the cause makes sense, but it seems to suggest this is purely a UI issue. Do you have any idea why, whenever we see this, the job always seems to hang indefinitely with a negative waiting count? No further progress is made, and nothing actually runs on the job once the count goes negative in the UI.
There was a bug in 0.5.2 with the same symptoms that caused the job to hang. Please upgrade to 0.5.3. If you still have this issue in 0.5.3, then it is a different issue and should be tracked and fixed separately.
This happens when we have a large number of workers (100 in this case) for each slave node.