All workflow tasks should be monitored until they reach a final state #1861

Closed
abaumann opened this issue Jan 17, 2017 · 23 comments
Comments

@abaumann
Contributor

Right now, if a workflow fails and Cromwell is in fail-fast mode, the workflow gets marked as failed and monitoring stops. This means that if one task fails while another is "Running", that other task will stay in the "Running" state forever. This leads to many user questions (e.g. "Is this still running and costing me money?").

Every task should be monitored to its final state rather than only workflows.
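For illustration, the symptom looks roughly like this when querying Cromwell's metadata endpoint. The workflow ID and task names below are made up and the response is heavily abbreviated, but the shape is the point: the workflow-level status is terminal while one call is stuck at "Running" and never gets a final update.

```
$ curl http://localhost:8000/api/workflows/v1/<workflow-id>/metadata
{
  "status": "Failed",
  "calls": {
    "wf.task_a": [ { "executionStatus": "Failed"  } ],
    "wf.task_b": [ { "executionStatus": "Running" } ]
  }
}
```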

@katevoss

@abaumann is the issue that Cromwell doesn't update the state of the "Running" task, which has failed (or stopped), or that it keeps running that task, which should have failed (or stopped)?

@katevoss

@cjllanwarne do you know if this has been fixed?

@Horneth
Contributor

Horneth commented Mar 31, 2017

This hasn't been fixed

@knoblett

Another user encountered this issue here. @katevoss Is there any plan to prioritize this bug fix soon? I'm not fully aware of what bugs you have to tackle at the moment.

@abaumann
Contributor Author

Sorry @katevoss, I didn't answer this before. It's the issue that the workflow fails because we are in fail-fast mode, but the calls inside that workflow don't get updated. So you can get a workflow that says "Failed" with a bunch of tasks that say "Running", which is confusing to users, and they often ask whether those tasks are still running (the answer is that they are until they reach a final state, but no subsequent parts of the workflow continue after that point).

@knoblett

This bug is blocking Cromwell's ability to call cache, because Cromwell won't pull "Running" calls as "Succeeded" ones. It would be very helpful to fix this soon.

@katevoss katevoss added High and removed Medium labels Jun 15, 2017
@katevoss

katevoss commented Jun 29, 2017

@danbills if you get a chance to look at this in your bug rotation, that would be great; it sounds like a high priority for FC.

@danbills
Contributor

@katevoss Roger, will start today.

@katevoss

katevoss commented Sep 7, 2017

@knoblett is this still blocking Cromwell's call caching because calls aren't updated as succeeded? I haven't heard this come up in a while.

@katevoss katevoss removed the High label Sep 7, 2017
@bradtaylor

Hi guys, we're getting more heat on this issue in FireCloud. See:
https://gatkforums.broadinstitute.org/firecloud/discussion/comment/42587#Comment_42587

@katevoss FYI

@katevoss

katevoss commented Oct 5, 2017

@bradtaylor is this blocking people from running workflows? How often is it occurring? Is there a workaround?

@geoffjentry or @kshakir (our current bug rotator), do you have an idea of the effort to fix this?

@abaumann
Contributor Author

abaumann commented Oct 5, 2017

Just to be clear, there are more issues related to that forum post than this specific one. This issue is about making sure task statuses reach a final state when the workflow ends in a terminal state. That's different from aborts not working; both are issues, but they're different things.

@abaumann
Contributor Author

abaumann commented Oct 5, 2017

I don't think these issues block anybody; they just lead to constant questions and give a bad impression of our reliability. People often worry they are still spending money because it looks that way.

I could probably run a query to find out how often aborts don't work, if that helps.

There isn't a workaround for either issue, other than that we tell users it's OK after we dig in and confirm that it is, and they just live with the inconsistency.

@Horneth
Contributor

Horneth commented Oct 5, 2017

Even though it's not technically the same, we talked about ContinueWhilePossible as a possible temporary workaround for this. Has it been rejected since then?
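For reference, a minimal sketch of how ContinueWhilePossible can be selected per workflow, assuming the standard workflow_failure_mode key in the workflow options JSON (the default mode is NoNewCalls):

```json
{
  "workflow_failure_mode": "ContinueWhilePossible"
}
```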

@abaumann
Contributor Author

abaumann commented Oct 5, 2017

Right now that's not the default in FC, nor do we expose it in the UI. People have used it and it does help in some circumstances where you need it, but it seems like overkill when all you want is reliable statuses. It also won't help with the aborting issue, which is what the GATK post was about.

@katevoss

katevoss commented Oct 5, 2017

@bradtaylor can you clarify which issue is more concerning? Is it the aborting issue or the status updating?

@kshakir
Contributor

kshakir commented Oct 5, 2017

May also be the cause of / related to #2526

@kshakir
Contributor

kshakir commented Oct 5, 2017

There are two tables in the Cromwell database that are out of sync. Unofficially, if one knows what to look for, one can edit the values directly in the database. Often the case is that the WorkflowStoreEntry doesn't have a record for the workflow, so it cannot be aborted or resumed on restart, yet the last row in MetadataEntry says the workflow is still Running.

A "reconciler" could write a final row into MetadataEntry with Aborted/Success/Fail. This could be called:

  • Manually by a user/service
  • Automatically whenever a user requests status of a workflow
  • Automatically by a MetadataEntry sweeper looking for workflows w/o a finalization and no row in WorkflowStoreEntry
  • Other?
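A minimal sketch of the sweeper query, assuming a MySQL-style schema. The table names (MetadataEntry, WorkflowStoreEntry) come from the comment above; the column names and the 'status' metadata key are assumptions and may differ between Cromwell versions, so treat this as illustrative only.

```sql
-- Workflows that never recorded a terminal status in METADATA_ENTRY
-- and no longer have a row in WORKFLOW_STORE_ENTRY (column names assumed).
SELECT DISTINCT m.WORKFLOW_EXECUTION_UUID
FROM METADATA_ENTRY m
WHERE m.METADATA_KEY = 'status'
  AND NOT EXISTS (
    SELECT 1 FROM METADATA_ENTRY t
    WHERE t.WORKFLOW_EXECUTION_UUID = m.WORKFLOW_EXECUTION_UUID
      AND t.METADATA_KEY = 'status'
      AND t.METADATA_VALUE IN ('Succeeded', 'Failed', 'Aborted')
  )
  AND NOT EXISTS (
    SELECT 1 FROM WORKFLOW_STORE_ENTRY w
    WHERE w.WORKFLOW_EXECUTION_UUID = m.WORKFLOW_EXECUTION_UUID
  );
```

The reconciler would then insert one final 'status' row (e.g. Aborted) for each workflow this returns.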

@tmm211

tmm211 commented Oct 6, 2017

This was reported by another FireCloud user. Is there a way that we can force these jobs into Aborted so that the workspace stops appearing as "Running"?

@abaumann
Contributor Author

abaumann commented Oct 6, 2017

Yes, we can update the database manually, but I hesitate to do that unless it's really serious. Since this specific ticket is not about aborts, should this discussion be moved to a ticket about aborts?

@tmm211

tmm211 commented Oct 10, 2017

Sure. What's the ticket number? The issue this user posted is about both submitted and aborted workflows getting stuck. I asked them to abort the submitted ones so that the workspace would stop showing as Running, but that didn't work. Can we confirm that they aren't being charged for machines they requested to abort but that didn't actually abort?

@abaumann
Contributor Author

I don't know the ticket for aborts, but yes, we can confirm individual submissions/workflows; however, it's a tedious process and you need an admin to do it. Almost every time I've checked, it's just a matter of statuses being incorrect, not that the machine is still running.

@Horneth
Contributor

Horneth commented Nov 7, 2017

Fixed by #2808

@Horneth Horneth closed this as completed Nov 7, 2017