All workflow tasks should be monitored until they reach a final state #1861

Closed
abaumann opened this issue Jan 17, 2017 · 23 comments
Comments

@abaumann
Contributor

Right now, if a workflow fails and Cromwell is in fail-fast mode, the workflow gets marked as failed and monitoring stops. This means that if one task fails while another is "Running", that other task will stay in the "Running" state forever. This leads to many user questions (e.g. "Is this still running and costing me money?").

Every task should be monitored to its final state rather than only workflows.
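For illustration, the symptom looks roughly like this when querying Cromwell's metadata endpoint. The workflow ID and task names below are made up and the response is heavily abbreviated, but the shape is the point: the workflow-level status is terminal while one call is stuck at "Running" and never gets a final update.

```
$ curl http://localhost:8000/api/workflows/v1/<workflow-id>/metadata
{
  "status": "Failed",
  "calls": {
    "wf.task_a": [ { "executionStatus": "Failed"  } ],
    "wf.task_b": [ { "executionStatus": "Running" } ]
  }
}
```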

@katevoss

@abaumann is the issue that Cromwell doesn't update the state of the "Running" task, which has failed (or stopped), or that it keeps running that task, which should have failed (or stopped)?

@katevoss

@cjllanwarne do you know if this has been fixed?

@Horneth
Contributor

Horneth commented Mar 31, 2017

This hasn't been fixed

@knoblett

Another user encountered this issue here. @katevoss Is there any plan to prioritize this bug fix soon? I'm not fully aware of what bugs you have to tackle at the moment.

@abaumann
Contributor Author

Sorry @katevoss, I didn't answer this before. It's the issue that the workflow fails because we are in fail-fast mode, but the calls inside that workflow don't get updated. So you can get a workflow that says "Failed" with a bunch of tasks that say "Running", which is confusing to users, and they often ask whether those tasks are still running (the answer is that they are until they reach a final state, but no subsequent parts of the workflow continue after that point).

@knoblett

This bug is blocking Cromwell's ability to call cache, because Cromwell won't pull "Running" calls as "Succeeded" ones. It would be very helpful to fix this soon.

@katevoss katevoss added High and removed Medium labels Jun 15, 2017
@katevoss

katevoss commented Jun 29, 2017

@danbills if you get a chance to look at this in your bug rotation, that would be great; it sounds like a high priority for FC.

@danbills
Contributor

@katevoss Roger, will start today.

@katevoss

katevoss commented Sep 7, 2017

@knoblett is this still blocking Cromwell's call caching because calls aren't updated as succeeded? I haven't heard this come up in a while.

@katevoss katevoss removed the High label Sep 7, 2017
@bradtaylor

Hi guys, we're getting more heat on this issue in FireCloud. See:
https://gatkforums.broadinstitute.org/firecloud/discussion/comment/42587#Comment_42587

@katevoss FYI

@katevoss

katevoss commented Oct 5, 2017

@bradtaylor is this blocking people from running workflows? How often is it occurring? Is there a workaround?

@geoffjentry or @kshakir (our current bug rotator), do you have an idea of the effort to fix this?

@abaumann
Contributor Author

abaumann commented Oct 5, 2017

Just to be clear, there are more issues related to that forum post than this specific one. This issue is about making sure task statuses reach a final state when the workflow ends in a terminal state. That's different from aborts not working; both are issues, but they're different things.

@abaumann
Contributor Author

abaumann commented Oct 5, 2017

I don't think these issues block anybody; they just lead to constant questions and give a bad impression of our reliability. People often worry they are still spending money because it looks that way.

I could probably run a query to find out how often aborts don't work, if that helps.

There isn't a workaround for either issue, other than that we tell users it's OK after we dig in and confirm that it is, and they just live with the inconsistency.

@Horneth
Contributor

Horneth commented Oct 5, 2017

Even though it's not technically the same, we talked about ContinueWhilePossible as a possible temporary workaround for this. Has it been rejected since then?
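For reference, a minimal sketch of how ContinueWhilePossible can be selected per workflow, assuming the standard workflow_failure_mode key in the workflow options JSON (the default mode is NoNewCalls):

```json
{
  "workflow_failure_mode": "ContinueWhilePossible"
}
```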

@abaumann
Contributor Author

abaumann commented Oct 5, 2017

Right now that's not the default in FC, nor do we expose it in the UI. People have used it and it does help in some circumstances where you need it, but it seems like overkill when all you want is reliable statuses. It also won't help with the aborting issue, which is what the GATK post was about.

@katevoss

katevoss commented Oct 5, 2017

@bradtaylor can you clarify which issue is more concerning? Is it the aborting issue or the status updating?

@kshakir
Contributor

kshakir commented Oct 5, 2017

May also be the cause of / related to #2526

@kshakir
Contributor

kshakir commented Oct 5, 2017

There are two tables in the Cromwell database that are out of sync. Unofficially, if one knows what to look for, one can edit the values directly in the database. Often the case is that the WorkflowStoreEntry doesn't have a record for the workflow, so it cannot be aborted or resumed on restart, yet the last row in MetadataEntry says the workflow is still Running.

A "reconciler" could write a final row into MetadataEntry with Aborted/Success/Fail. This could be called:

  • Manually by a user/service
  • Automatically whenever a user requests status of a workflow
  • Automatically by a MetadataEntry sweeper looking for workflows w/o a finalization and no row in WorkflowStoreEntry
  • Other?
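A minimal sketch of the sweeper query, assuming a MySQL-style schema. The table names (MetadataEntry, WorkflowStoreEntry) come from the comment above; the column names and the 'status' metadata key are assumptions and may differ between Cromwell versions, so treat this as illustrative only.

```sql
-- Workflows that never recorded a terminal status in METADATA_ENTRY
-- and no longer have a row in WORKFLOW_STORE_ENTRY (column names assumed).
SELECT DISTINCT m.WORKFLOW_EXECUTION_UUID
FROM METADATA_ENTRY m
WHERE m.METADATA_KEY = 'status'
  AND NOT EXISTS (
    SELECT 1 FROM METADATA_ENTRY t
    WHERE t.WORKFLOW_EXECUTION_UUID = m.WORKFLOW_EXECUTION_UUID
      AND t.METADATA_KEY = 'status'
      AND t.METADATA_VALUE IN ('Succeeded', 'Failed', 'Aborted')
  )
  AND NOT EXISTS (
    SELECT 1 FROM WORKFLOW_STORE_ENTRY w
    WHERE w.WORKFLOW_EXECUTION_UUID = m.WORKFLOW_EXECUTION_UUID
  );
```

The reconciler would then insert one final 'status' row (e.g. Aborted) for each workflow this returns.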

@tmm211

tmm211 commented Oct 6, 2017

This was reported by another FireCloud user. Is there a way that we can force these jobs into Aborted so that the workspace stops appearing as "Running"?

@abaumann
Contributor Author

abaumann commented Oct 6, 2017

Yes, we can update the database manually, but I hesitate to do that unless it's really serious. Since this specific ticket is not about aborts, should this discussion be moved to a ticket about aborts?

@tmm211

tmm211 commented Oct 10, 2017

Sure. What's the ticket number? The issue this user posted is about both submitted and aborted workflows getting stuck. I asked them to abort the submitted ones so that the workspace would stop showing as Running, but that didn't work. Can we confirm that they aren't being charged for machines they requested to abort but that didn't actually abort?

@abaumann
Contributor Author

I don't know the ticket for aborts, but yes, we can confirm individual submissions/workflows; however, it's a tedious process and you need an admin to do it. Almost every time I've checked, it's just a matter of statuses being incorrect, not that the machine is still running.

@Horneth
Contributor

Horneth commented Nov 7, 2017

Fixed by #2808

@Horneth Horneth closed this as completed Nov 7, 2017