All workflow tasks should be monitored until they reach a final state #1861
Comments
@abaumann is the issue that Cromwell doesn't update the state of the "Running" task, which has failed (or stopped), or that it keeps running that task, which should have failed (or stopped)?
@cjllanwarne do you know if this has been fixed?
This hasn't been fixed.
Sorry @katevoss, I didn't answer this before. The issue is that the workflow fails because we are in fail-fast mode, but the calls inside that workflow don't get updated. So you can get a workflow that says "Failed" with a bunch of tasks that say "Running". This is confusing to users, and they often ask whether those tasks are still running. (The answer is that they are until they reach a final state, but no subsequent parts of the workflow continue after that point.)
This bug is blocking Cromwell's ability to call cache, because Cromwell won't pull "Running" calls as "Succeeded" ones. It would be very helpful to fix this soon.
@danbills if you get a chance to look at this during your bug rotation, that would be great; it sounds like a high priority for FC.
@katevoss Roger, will start today.
@knoblett is this still blocking Cromwell's call caching because calls aren't updated as succeeded? I haven't heard this come up in a while.
Hi guys, we're getting more heat on this issue in FireCloud. See: @katevoss FYI
@bradtaylor is this blocking people from running workflows? How often is it occurring? Is there a workaround? @geoffjentry or @kshakir (our current bug rotator), do you have an idea of the effort required to fix this?
Just want to make it clear that there are more issues related to that forum post than this specific one. This one is about making sure task statuses reach a final state when the workflow ends in a terminal state, which is different from aborts not working. Both are problems, but different ones.
I don't think these issues block anybody; they just lead to constant questions and give a bad impression of our reliability. People are often worried they are still spending money because it looks that way. I could run a query to find out how often aborts don't work, if that helps. There isn't a workaround for either issue, other than telling users it's OK after we dig in and confirm that it is, and they just deal with the inconsistency.
Even though it's not technically the same, we talked about
Right now that's not the default in FC, nor do we expose it in the UI. People have used it and it does help in some circumstances where you need it, but it seems like overkill when all you want is reliable statuses. It also won't help with the aborting issue, which is what the GATK post was about.
@bradtaylor can you clarify which issue is more concerning? Is it the aborting issue or the status updating?
May also be the cause of / related to #2526.
There are two tables in the Cromwell database that are out of sync. Unofficially, if one knows what they're looking for, one can edit the values directly in the database. Often the case is the
A "reconciler" could write a final row into
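To make the "reconciler" idea above concrete, here is a minimal sketch. It does not use Cromwell's real schema or code: the two out-of-sync tables are modeled as a single workflow status plus a hypothetical dict of call statuses, and `backend_status` is a hypothetical lookup standing in for asking the execution backend what actually happened. The reconciler only writes a final row when the workflow is already terminal and the backend reports a terminal state for a call that is still recorded as non-terminal.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class Status(Enum):
    SUBMITTED = "Submitted"
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    ABORTED = "Aborted"


TERMINAL = {Status.SUCCEEDED, Status.FAILED, Status.ABORTED}


def reconcile(
    workflow_status: Status,
    call_statuses: Dict[str, Status],
    backend_status: Callable[[str], Optional[Status]],
) -> Dict[str, Status]:
    """Return the final-status rows a hypothetical reconciler would write.

    workflow_status: recorded workflow status (hypothetical "table 1").
    call_statuses:   recorded call_id -> Status (hypothetical "table 2").
    backend_status:  hypothetical lookup of the status the backend reports now.
    """
    if workflow_status not in TERMINAL:
        # Workflow still active: normal monitoring owns its calls.
        return {}
    updates: Dict[str, Status] = {}
    for call_id, recorded in call_statuses.items():
        if recorded in TERMINAL:
            continue  # already consistent
        actual = backend_status(call_id)
        if actual in TERMINAL:
            updates[call_id] = actual  # stale row: write the real final state
    return updates
```

For example, with `calls = {"align": Status.FAILED, "sort": Status.RUNNING, "index": Status.RUNNING}` and a backend that reports `sort` as Succeeded but `index` as still Running, `reconcile(Status.FAILED, calls, backend.get)` returns only the `sort` update; `index` is left for a later pass rather than guessed at.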
This was reported by another FireCloud user.
Yes, we can update the database manually, but I hesitate to do that unless it's really serious. Since this specific ticket is not about aborts, should this be moved to a ticket about aborts?
Sure. What's the ticket number? The issue this user posted is about both submitted workflows and aborted workflows getting stuck. I asked them to abort the submitted ones so that the workspace would stop showing as Running, but that didn't work. Can we confirm that they aren't being charged for machines that they requested to abort but that haven't aborted?
I don't know the ticket for aborts, but yes, we can confirm individual submissions/workflows. However, it's a tedious process and you need an admin to do it. Almost every time I've checked, it's just a matter of the statuses being incorrect, not the machine still running.
Fixed by #2808.
Right now, if a workflow fails while Cromwell is in fail-fast mode, the workflow is marked as Failed and monitoring stops. This means that if one task fails while another is "Running", the other task stays in the "Running" state forever. This leads to many user questions (e.g. "Is this still running and costing me money?").
Every task should be monitored until it reaches a final state, not just the workflow.
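A minimal sketch of the requested behavior, assuming a simple polling model rather than Cromwell's actual actor-based implementation. `poll` is a hypothetical callable returning a task's current status. Fail-fast is modeled only as a flag: the first failure marks the workflow as failed, but polling continues for every task that has not yet reached a terminal state, so no task is left stuck in "Running".

```python
import time
from enum import Enum
from typing import Callable, Dict, Iterable, Tuple


class Status(Enum):
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    ABORTED = "Aborted"


TERMINAL = {Status.SUCCEEDED, Status.FAILED, Status.ABORTED}


def monitor(
    task_ids: Iterable[str],
    poll: Callable[[str], Status],
    interval: float = 0.0,
) -> Tuple[bool, Dict[str, Status]]:
    """Poll every task until all are terminal; return (workflow_failed, statuses)."""
    statuses = {t: Status.RUNNING for t in task_ids}
    workflow_failed = False
    while any(s not in TERMINAL for s in statuses.values()):
        for task_id, current in statuses.items():
            if current in TERMINAL:
                continue  # final state already recorded; stop polling this task
            statuses[task_id] = poll(task_id)
            if statuses[task_id] is Status.FAILED:
                # Fail fast: the workflow is now Failed, but we keep
                # monitoring the other tasks instead of returning here.
                workflow_failed = True
        if interval:
            time.sleep(interval)
    return workflow_failed, statuses
```

With a scripted backend where task `a` fails immediately and task `b` reports Running twice before succeeding, `monitor` returns `(True, {"a": Status.FAILED, "b": Status.SUCCEEDED})`: the workflow is reported as failed, yet `b` still ends in its true final state. Note that this loop has no timeout; a real implementation would also need to handle tasks whose status can no longer be retrieved.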